by Cameron Laird and Kathryn Soraiz
Many of the questions correspondents send us have to do with Portable Document Format (PDF), the main subject of last summer's "Yes You Can" column. This winter's publication of Perl Graphics Programming makes for an apt time to explain more of the possibilities.
PDF basics
Here's a bit of review: PDF is the proprietary document format Adobe Systems Incorporated first published in 1992. PDF instances can work on (nearly) all platforms, including Unix, Windows, and Macintosh, and they "look good" when targeted to a variety of display devices, including common printers and graphics monitors. Most computer users think of PDF as just another suffix, like HTML or JPG, common among browsable archives. PDF is commonly selected as a "non-native" output format of standard office-automation applications, including word processors and spreadsheets.
You don't have to handle your PDF that way, though. Even if most PDF is done "by hand", there are plenty of opportunities for industrious programmers to automate PDF management. In the past, we've discussed the PJ, ReportLab, PDFlib, and JavaPDF libraries. All these, along with several other proprietary products, are valuable, and we're currently using ReportLab daily in our own work. The release of Perl Graphics Programming makes it timely to demonstrate how powerful Perl can be in PDF work.
This installment of "Regular Expressions" focuses on Chapter 12 of Perl Graphics Programming, "Creating PDF Documents with Perl", along with related opportunities provided by the PDF application programming interface (API) archived on the Comprehensive Perl Archive Network (CPAN).
The modern way to use Perl with PDF is through the PDF::API2 module that network engineer Alfred Reibenschuh maintains, and on which he holds copyright. Reibenschuh emphasizes his debt to the earlier Text::PDF work done by Martin Hosken, on which PDF::API2 depends. PDF::API2 has an object-oriented API; Reibenschuh finds this makes for easier maintenance than his earlier procedural Text::PDF::API attempt.
One aspect of the PDF::API2 module that Perl Graphics Programming fails to cover because it's so new, and that's nearly absent in other common libraries, is introspection. PDF is largely a write-only format. It's quite difficult, for example, to automate retrieval of text formatted in PDF. This is currently the domain of proprietary products such as CZ-Pdf2Txt. PDF::API2 can report on several document attributes, though:
use PDF::API2;

# Open an existing document and query a few of its attributes.
$filename = "some_document.pdf";
$my_PDF_object = PDF::API2->open($filename);
$page_count = $my_PDF_object->pages();
%infohash = $my_PDF_object->info();
$author = $infohash{'Author'};
print "$filename has $page_count pages.\n";
print "$author wrote $filename.\n";
Reibenschuh speculates that other introspections, such as one on the entries of the PDF "outline" or "bookmark" structure, might appear during 2003. He credits the internal Text::PDF with sufficient flexibility that, in his words, "any functionality could be added to it given enough time to write the code."
PDF::API2's facility for generation of PDF instances appears to be the most fine-grained of any of the well-supported freely available PDF libraries. It's easy, for example, to extract a single page from an existing document and insert it into a new one:
...
# Make a copy of page 3 from the old document, and ...
$source_page = 3;
# Paste it in as page 5 of the new one.
$target_page = 5;
$old_document = PDF::API2->open("my_old_document.pdf");
# $new_document is assumed to have been created in the code elided above.
$new_document->importpage($old_document, $source_page, $target_page);
PDF::API2 also operates at a considerably lower level, with an abundance of methods to manage fonts, colors, bounding boxes, stroke styles, and the images and texts made from these.
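To give a feel for that lower level, here is a minimal sketch that builds a one-page document from scratch. It assumes an installed PDF::API2 with its built-in core fonts; the output file name, the coordinates, and the heading text are our own illustrative choices, not anything taken from the book.

```perl
use strict;
use warnings;
use PDF::API2;

# Start an empty document and add one US Letter page (sizes in points).
my $pdf  = PDF::API2->new();
my $page = $pdf->page();
$page->mediabox(0, 0, 612, 792);

# Text: pick a built-in font, set a fill color, position, and write.
my $font = $pdf->corefont('Helvetica-Bold');
my $text = $page->text();
$text->font($font, 18);
$text->fillcolor('blue');
$text->translate(72, 720);
$text->text('Hello from PDF::API2');

# Graphics: stroke a red rule beneath the heading.
my $gfx = $page->gfx();
$gfx->strokecolor('red');
$gfx->linewidth(2);
$gfx->move(72, 710);
$gfx->line(540, 710);
$gfx->stroke();

# Hypothetical output name, chosen only for this example.
$pdf->saveas('lowlevel_demo.pdf');
```

Text and graphics live in separate content objects on the page, which is why the font and fill color are set on one object and the stroke style on the other.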
Use case
Suppose you're responsible for a weekly status report, to be published as PDF. A conventional solution might be to update a word-processing template each week with new information, export it as PDF, and email the result to a circulation list. With PDF::API2 or a comparable library, though, you can construct "boilerplate PDF" into which you pour the latest financial quantities. A bit more work automating your data sources, and a modest script can yield "lights-out" operation, which generates and mails out the report on schedule each week, without human intervention.
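One way such a script might look is sketched below. Everything specific in it is an assumption made for illustration: the helper name fill_report, the file names, the coordinates of the blank area, and the figures themselves. The stand-in template is generated in the script only so that the example runs on its own; in practice the boilerplate would come from elsewhere, and the mailing and scheduling steps are left out.

```perl
use strict;
use warnings;
use PDF::API2;

# Hypothetical helper: write this week's figures onto page 1 of a
# boilerplate PDF and save the result under a new name.
sub fill_report {
    my ($template, $output, $figures) = @_;

    my $pdf  = PDF::API2->open($template);
    my $page = $pdf->openpage(1);
    my $text = $page->text();
    $text->font($pdf->corefont('Helvetica'), 12);

    my $y = 650;    # assumed top of the blank area in the template
    for my $label (sort keys %$figures) {
        $text->translate(72, $y);
        $text->text("$label: $figures->{$label}");
        $y -= 18;   # move down one line
    }

    $pdf->saveas($output);
    return $output;
}

# Stand-in boilerplate, created here only to keep the sketch self-contained.
my $boiler = PDF::API2->new();
my $first  = $boiler->page();
$first->mediabox(0, 0, 612, 792);
my $head = $first->text();
$head->font($boiler->corefont('Helvetica-Bold'), 20);
$head->translate(72, 720);
$head->text('Weekly Status Report');
$boiler->saveas('report_template.pdf');

# Made-up figures; a real script would pull these from its data sources.
fill_report('report_template.pdf', 'report_this_week.pdf',
            { Revenue => '12,400', Expenses => '9,150' });
```

Run from a scheduler such as cron, a script along these lines would need only a mail step added to approach the "lights-out" operation described above.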
Our larger point, one we repeat most months, is that you should be on the lookout for ways to make your computers work for you. "Scripting languages" have interfaces to most computer operations humans can do, and generally provide straightforward ways to program common tasks. That's true, and valuable, even for domains like PDF that are mostly done "manually".