Reading PDF Files in Sahi

During automation we sometimes need to verify contents of static PDF files and editable PDF files. A good open source library for reading PDF content is Apache PDFbox. Below sample codes uses Apache PDFbox version 3.0.xx.

Prerequisites

Reading static pdf file

For reading static PDF files, extracting the data, and validating it, Sahi Pro utilizes Apache PDFBox. This tool streamlines the process of extracting text content from PDF files via the command line.

To run the below code: Code

We are using pdfbox-app-x.x.x.jar to convert pdf to text using below code.
The showPDFText function reads pdf file and shows the text in the current web page, by replacing the html body.
The getPDFText function reads pdf file and returns the contents of the pdf as a string.

/**
Reads PDF file and shows the text in the current web page.
Useful for adding assertions
@param $pdf- PDF file path relative to the current script directory
*/
function showPDFText($pdf) {
  var $pdfboxAppJarPath = _userDataPath("extlib\\pdfbox-app-3.x.x.jar");
  var $pdf = _resolvePath($pdf);
  var $txt =  _resolvePath("PDFAstext.txt");
  var $data = _execute("java -jar " + $pdfboxAppJarPath + " export:text -html -i=" + $pdf + " -o=" +$txt, true);
  var $text = _readFile($txt);
  _call(document.body.innerHTML = $text);
}

/**
Reads PDF file and returns the contents as a string
@param $pdf- PDF file path relative to the current script directory
@returns string text contents of the PDF file
*/
function getPDFText($pdf,$txt) {
  var $pdfboxAppJarPath = _userDataPath("extlib\\pdfbox-app-3.x.x.jar");
  var $pdf = _resolvePath($pdf);
  var $txt =  _resolvePath("PDFAstext.txt");
  var $data =  _execute("java -jar " + $pdfboxAppJarPath + " export:text -i=" + $pdf + " -o=" +$txt, true);
  var $text = _readFile($txt);
  return $text;
}

var $pdf = "Gandhiji.pdf"; // Gandhiji.pdf needs to be present in the current script directory
showPDFText($pdf);
_assertContainsText("2 October 1869", _paragraph("/Born/"));
_assertContainsText("30 January 1948", _paragraph("/Died/"));
var $data = getPDFText($pdf);
_assertTrue($data.includes("2 October 1869"));


Reading values of editable fields from pdf file

To read editable PDF files, extract the data entered by users, validate the data entered into editable PDF files, Sahi Pro typically utilize libraries from Apache PDFbox that support the manipulation of editable PDFs.

To run the below code: Code

The below code uses library functions like load(), printAll(), getValue(), isChecked(), which are defined in pdf_fields_lib.sah file.

_include("pdf_fields_lib.sah")
var $pdf = new SahiFrameWork_PDFFields();
var $file = _resolvePath("editablePDF.pdf");
$pdf.load($file);
$pdf.printAll();
_assertEqual("Sahi Pro",  $pdf.getValue("Name"));
_assertTrue($pdf.isChecked("Option 1"));