Reading PDF Files in Sahi
During automation, we sometimes need to verify the contents of static and editable PDF files. A good open-source library for reading PDF content is Apache PDFBox. The sample code below uses Apache PDFBox version 3.0.xx.
Prerequisites
- Download the pdfbox-app-x.x.x.jar file from http://pdfbox.apache.org/download.cgi
- Copy it to the SahiPro/userdata/extlib folder (create the folder if required).
Note: The jar file has to be copied into the SahiPro/userdata/extlib folder, not the SahiPro/extlib folder.
Reading a static PDF file
To read a static PDF file, extract its data, and validate it, Sahi Pro uses Apache PDFBox, whose command-line application extracts the text content of a PDF file.
To run the code below:
- Copy the code below to the SahiPro/userdata/scripts/pdf_pdfboxapp.sah file.
- Copy Gandhiji.pdf and save it in the SahiPro/userdata/scripts folder.
The code below uses pdfbox-app-x.x.x.jar to convert the PDF to text.
The showPDFText function reads a PDF file and shows its text in the current web page by replacing the HTML body.
The getPDFText function reads a PDF file and returns its contents as a string.
/**
 Reads a PDF file and shows its text in the current web page.
 Useful for adding assertions.
 @param $pdf PDF file path relative to the current script directory
*/
function showPDFText($pdf) {
    // Replace 3.x.x with the version of the jar copied into userdata/extlib
    var $pdfboxAppJarPath = _userDataPath("extlib\\pdfbox-app-3.x.x.jar");
    var $pdf = _resolvePath($pdf);
    // Intermediate file that receives the exported HTML
    var $txt = _resolvePath("PDFAstext.txt");
    // Run PDFBox synchronously so the output file exists before it is read
    var $data = _execute("java -jar " + $pdfboxAppJarPath + " export:text -html -i=" + $pdf + " -o=" + $txt, true);
    var $text = _readFile($txt);
    // Replace the body of the current page with the exported HTML
    _call(document.body.innerHTML = $text);
}
/**
 Reads a PDF file and returns its contents as a string.
 @param $pdf PDF file path relative to the current script directory
 @returns string text contents of the PDF file
*/
function getPDFText($pdf) {
    // Replace 3.x.x with the version of the jar copied into userdata/extlib
    var $pdfboxAppJarPath = _userDataPath("extlib\\pdfbox-app-3.x.x.jar");
    var $pdf = _resolvePath($pdf);
    // Intermediate file that receives the exported plain text
    var $txt = _resolvePath("PDFAstext.txt");
    // Run PDFBox synchronously so the output file exists before it is read
    var $data = _execute("java -jar " + $pdfboxAppJarPath + " export:text -i=" + $pdf + " -o=" + $txt, true);
    var $text = _readFile($txt);
    return $text;
}
var $pdf = "Gandhiji.pdf"; // Gandhiji.pdf needs to be present in the current script directory
showPDFText($pdf);
_assertContainsText("2 October 1869", _paragraph("/Born/"));
_assertContainsText("30 January 1948", _paragraph("/Died/"));
var $data = getPDFText($pdf);
_assertTrue($data.includes("2 October 1869"));
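Text extracted from a PDF often contains line breaks and runs of spaces that follow the page layout, so an exact substring check can fail even when the wording is present. The following is a minimal sketch of one way to make such checks less brittle. The helper names normalizeWhitespace and containsNormalized are hypothetical and are not part of Sahi Pro or PDFBox. The sketch assumes that the value returned by getPDFText can be converted with String().

```javascript
// Hypothetical helper (not a Sahi or PDFBox API): collapses every run of
// whitespace (spaces, tabs, line breaks) into a single space and trims the ends.
function normalizeWhitespace($text) {
    // String() is used on the assumption that the input may not be a native JavaScript string
    return String($text).replace(/\s+/g, " ").replace(/^ | $/g, "");
}

// Hypothetical helper: returns true if $needle occurs in $haystack
// once both have been whitespace-normalized.
function containsNormalized($haystack, $needle) {
    return normalizeWhitespace($haystack).indexOf(normalizeWhitespace($needle)) !== -1;
}
```

With these helpers in the script, the last assertion above could be written as _assertTrue(containsNormalized($data, "2 October 1869")); so that a line break inside the date in the extracted text would not cause a failure.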
Reading values of editable fields from a PDF file
To read an editable PDF file and validate the data that users entered into its fields, Sahi Pro uses the Apache PDFBox libraries that support editable PDFs.
To run the code below:
- Copy the code below to the SahiPro/userdata/scripts/read_pdf_fields.sah file.
- Copy editablePDF.pdf and save it in the SahiPro/userdata/scripts folder.
- Copy pdf_fields_lib.sah and save it in the SahiPro/userdata/scripts folder.
The code below uses library functions such as load(), printAll(), getValue(), and isChecked(), which are defined in the pdf_fields_lib.sah file.
_include("pdf_fields_lib.sah");
var $pdf = new SahiFrameWork_PDFFields();
var $file = _resolvePath("editablePDF.pdf");
$pdf.load($file);
$pdf.printAll();
_assertEqual("Sahi Pro", $pdf.getValue("Name"));
_assertTrue($pdf.isChecked("Option 1"));
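When a form has many fields, checking them one assertion at a time stops at the first failure. The sketch below shows one possible way to compare several expected values in a single pass and report every mismatch. The helper collectFieldMismatches is hypothetical and is not part of pdf_fields_lib.sah. It takes the value lookup as a function argument, so it makes no assumption about how getValue is implemented, only that the returned value can be converted with String().

```javascript
// Hypothetical helper (not part of pdf_fields_lib.sah).
// $expected: object mapping field name -> expected value
// $getValue: function that takes a field name and returns the field's current value
// Returns an array with one message per field whose value differs.
function collectFieldMismatches($expected, $getValue) {
    var $mismatches = [];
    for (var $name in $expected) {
        // Skip properties inherited from the prototype chain
        if (!Object.prototype.hasOwnProperty.call($expected, $name)) continue;
        // Compare as strings so the check does not depend on the returned type
        var $actual = String($getValue($name));
        var $wanted = String($expected[$name]);
        if ($actual !== $wanted) {
            $mismatches.push($name + ": expected '" + $wanted + "' but was '" + $actual + "'");
        }
    }
    return $mismatches;
}
```

In the script above it might be used as var $mismatches = collectFieldMismatches({"Name": "Sahi Pro"}, function ($n) { return $pdf.getValue($n); }); followed by _assertEqual(0, $mismatches.length, $mismatches.join("; ")); so that the failure message lists every field that differs.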