Reading PDF Files in Sahi
During automation, we sometimes need to verify the contents of static and editable PDF files. A good open source library for reading PDF content is Apache PDFBox. The sample code below uses Apache PDFBox version 3.0.xx.
Prerequisites
- Download pdfbox-app-x.x.x.jar file from http://pdfbox.apache.org/download.cgi
- Copy it to the SahiPro/userdata/extlib folder (create the folder if required).
Note: The jar file has to be copied into the SahiPro/userdata/extlib folder, not the SahiPro/extlib folder.
Reading static PDF file
To read static PDF files, extract their data, and validate it, Sahi Pro uses Apache PDFBox. Its command-line application extracts the text content of a PDF file into a text or HTML file.
To run the code below:
- Copy the code below to the SahiPro/userdata/scripts/pdf_pdfboxapp.sah file.
- Copy Gandhiji.pdf and save it in the SahiPro/userdata/scripts folder.
The code below uses pdfbox-app-x.x.x.jar to convert the PDF to text.
The showPDFText function reads the PDF file and shows its text in the current web page by replacing the HTML body.
The getPDFText function reads the PDF file and returns its contents as a string.
/**
 * Reads a PDF file and shows its text in the current web page.
 * Useful for adding assertions.
 * @param $pdf - PDF file path relative to the current script directory
 */
function showPDFText($pdf) {
    var $pdfboxAppJarPath = _userDataPath("extlib\\pdfbox-app-3.x.x.jar");
    var $pdfPath = _resolvePath($pdf);
    var $txt = _resolvePath("PDFAstext.txt");
    // export:text with -html writes the PDF contents as HTML into $txt
    _execute("java -jar " + $pdfboxAppJarPath + " export:text -html -i=" + $pdfPath + " -o=" + $txt, true);
    var $text = _readFile($txt);
    // Replace the body of the current page with the extracted HTML
    _call(document.body.innerHTML = $text);
}
/**
 * Reads a PDF file and returns its contents as a string.
 * @param $pdf - PDF file path relative to the current script directory
 * @returns string text contents of the PDF file
 */
function getPDFText($pdf) {
    var $pdfboxAppJarPath = _userDataPath("extlib\\pdfbox-app-3.x.x.jar");
    var $pdfPath = _resolvePath($pdf);
    var $txt = _resolvePath("PDFAstext.txt");
    // export:text without -html writes the PDF contents as plain text into $txt
    _execute("java -jar " + $pdfboxAppJarPath + " export:text -i=" + $pdfPath + " -o=" + $txt, true);
    return _readFile($txt);
}
var $pdf = "Gandhiji.pdf"; // Gandhiji.pdf needs to be present in the current script directory
showPDFText($pdf);
_assertContainsText("2 October 1869", _paragraph("/Born/"));
_assertContainsText("30 January 1948", _paragraph("/Died/"));
var $data = getPDFText($pdf);
_assertTrue($data.includes("2 October 1869"));
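Plain-text extraction can place line breaks or extra spaces inside a phrase, depending on how the PDF is laid out, so an exact substring check on the string returned by getPDFText may fail even when the text is present. The following is a minimal sketch of one way to make such checks more tolerant. The helpers normalizePdfText and pdfTextContains are hypothetical names, not part of Sahi Pro or PDFBox, and the sample string is made up for illustration.

```javascript
// Hypothetical helper (not part of Sahi Pro or PDFBox): collapses every run of
// whitespace, including line breaks, into one space and trims both ends.
function normalizePdfText($text) {
    return String($text).replace(/\s+/g, " ").replace(/^ | $/g, "");
}

// Hypothetical helper: returns true when $expected occurs in $text after both
// strings have been whitespace-normalized.
function pdfTextContains($text, $expected) {
    return normalizePdfText($text).indexOf(normalizePdfText($expected)) !== -1;
}

// Made-up sample standing in for the string returned by getPDFText().
var $sample = "Born:  2 October\r\n1869\n\nDied: 30 January 1948\n";
var $found = pdfTextContains($sample, "2 October 1869");
```

In a Sahi script, the check could then be written as _assertTrue(pdfTextContains($data, "2 October 1869")); where $data is the value returned by getPDFText.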
Reading values of editable fields from a PDF file
To read editable PDF files, extract the data entered by users, and validate it, Sahi Pro typically uses Apache PDFBox libraries that support the manipulation of editable PDFs.
To run the code below:
- Copy the code below to the SahiPro/userdata/scripts/read_pdf_fields.sah file.
- Copy editablePDF.pdf and save it in the SahiPro/userdata/scripts folder.
- Copy pdf_fields_lib.sah and save it in the SahiPro/userdata/scripts folder.
The code below uses library functions such as load(), printAll(), getValue(), and isChecked(), which are defined in the pdf_fields_lib.sah file.
_include("pdf_fields_lib.sah");
var $pdf = new SahiFrameWork_PDFFields();
var $file = _resolvePath("editablePDF.pdf");
$pdf.load($file);
$pdf.printAll();
_assertEqual("Sahi Pro", $pdf.getValue("Name"));
_assertTrue($pdf.isChecked("Option 1"));
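When a form has many fields, checking them one assertion at a time becomes repetitive. The following is a minimal sketch of a table-driven check. It assumes only that the reader object exposes getValue(name) and returns a string, as the $pdf object above appears to. The helper collectFieldMismatches, the stub reader, and the "City" field are hypothetical and exist only to make the sketch runnable outside Sahi Pro; this is not the implementation of pdf_fields_lib.sah.

```javascript
// Hypothetical helper: compares expected field values against a reader object
// that exposes getValue(name), and returns one message per mismatching field.
function collectFieldMismatches($reader, $expected) {
    var $mismatches = [];
    for (var $name in $expected) {
        if (!Object.prototype.hasOwnProperty.call($expected, $name)) continue;
        var $actual = String($reader.getValue($name));
        if ($actual !== String($expected[$name])) {
            $mismatches.push($name + ": expected '" + $expected[$name] +
                "' but found '" + $actual + "'");
        }
    }
    return $mismatches;
}

// Stand-in for the loaded PDF fields object; the values are made up.
var $stubReader = {
    $values: { "Name": "Sahi Pro", "City": "Bangalore" },
    getValue: function ($name) { return this.$values[$name]; }
};

var $noMismatch = collectFieldMismatches($stubReader, { "Name": "Sahi Pro" });
var $oneMismatch = collectFieldMismatches($stubReader, { "Name": "X" });
```

In a Sahi script, the stub would be replaced by the loaded $pdf object, and the result could be asserted with _assertEqual(0, $mismatches.length, $mismatches.join("; ")); so that a failure lists every field that differs.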