PDF’s, Scanning, and File Sizes

I’ve been playing around with PDF’s for the past few weeks and have noticed a very interesting thing: A PDF is *not* a PDF is *not* a PDF is *not* a PDF, ad nauseum, and it would seem, ad infinitum. At least, so it would seem. Part of me almost wonders if the only distinguishing feature of a PDF is the .pdf extension at the end of the file. In “researching” this post I have learned what I knew already; PDF boils down to being simply a container format.

Lately I have been scanning some annual reports from years past for an organization I belong to, and due to the ways xsane 0.997 that comes with Fedora 12 scans pages — which I will concede straight out of the gate I have only explored enough to get it to do what I want and to learn how it does things “its way” — the PDF file sizes are “fairly” large.

In order to find this out, I first found out about one of the quirks in xsane 0.997: Something about the settings with xsane doesn’t have it stop between pages for me to change pages; at least, I haven’t gotten around to finding where the settings in xsane are to have it pause between pages. This is important because my scanner doesn’t have an automatic page feeder. The first page of results of a google search indicate several comments about this problem, but not a solution. At first glance the second page of results is of no help.

So I end up scanning pages one at a time, and then use GhostScript to join them all up at the end to make a single PDF.

Without having added up file sizes, it was obvious that the total size of all the scanned pages at 75 dpi and in black and white was sufficiently larger than the single PDF with all the pages joined. This did not bother me since, again without having added things up, the difference didn’t seem *too* great, and I assumed that the savings were principally due to adminstrative redundancies being eliminated by having one “container” as opposed to having 25 to 30 “containers” for each individual page.

Then this week a curious thing occurred: I scanned a six page magazine article, and then separately, another two page magazine article, at 100 dpi and colour, and whaddya know, the combined PDF of each set is smaller than any of the original source files. Significantly so. In fact, the largest page from the first set of six pages is double the size of the final integrated PDF, and in the case of the second set of two pages, each of the original pages are triple the size of the combined PDF. I’m blown away.

Discussing this with someone who knows the insides of computers way more than I, I learn something: It would appear that xsane probably creates PDF’s using the TIFF format (for image quality) as opposed to what I imagine Ghostscript does when joining files, which would seem to be to do what it can to reduce filesizes, and as such in this case I imagine convert the TIFF’s inside the PDF’s into JPEG’s. A bit of googling indeed appears to associate tiffs and PDF’s when it comes to xsane; indeed a check on the “multipage” settings shows three output file formats — PDF, PostScript and TIFF. And looking in Preferences/Setup/Filetype under the TIFF Zip Compression Rate, it’s set at 6 out of 9.

So I google PDF sizing, and one result led me to an explanation of the difference between using “Save” and “Save As …” options when editing a PDF: “Save” will typically append metadata on top of metadata (including *not* replacing the expired metadata in the “same” fields!); “Save As”, well, that’s what you really want to do to avoid a bloated file since all that should be will be replaced.

Another result begins describing (what is no doubt but a taste of) the various possible settings in a PDF file, and how using a given PDF editing application, you can go through a PDF, remove some setings, correct others, etc., and reduce the size of PDF’s by essentially eliminating redundant or situationally irrelevant — such as fields with null values — information whose presence would have the effect of bloating the file unecessarily.

I’ve known for a few years that PDF’s are a funny beast by nature when it comes to size: For me the best example by far used to be the use of “non-standard fonts” in the source file, oh say any open-source font that isn’t in the standard list of “don’t bother embedding the font since we all know that nine out of ten computers on the planet has it”. In and of itself this isn’t a problem; why not allow for file size savings when it is a reasonable presumption that many text PDF’s are based on a known set of fonts, and most people have said known set of fonts installed already on their system. However, when one uses a non-standard font or uses one of the tenth computers, when one constantly creates four to 6 page PDF text documents ten times the size of source documents, frustration sets in; having wondered if designating a font substitution along the lines of “use a Roman font such as Times New Roman” when such a font is used — such as in my case, Liberation Serif or occasionally Nimbus Roman No9 L — I asked my “person in the know”. Apparently, Fedora 12’s default GhostScript install, whose settings I have not modified, seems to do just that.

I guess what really gets me about this is how complicated the PDF standard must be, and how wildly variable the implementations are — at least, given that Adobe licences PDF creation for free provided that the implementations respect the complete standard — or more to the point, how wildly variable the assumptions and settings are in all sorts of software when creating a PDF. I bet that were I to take the same source and change one thing such as equipment or software that the results would be wildly different.

So, concurrent to the above scanning project, I happened to experiment with a portable scanner — a fun challenge in and of itself to make it work, but it did without “too much fuss”. And I found out something interesting, which I knew had nothing to do with PDF’s but (I presume) rather with scanners, drivers, and xsane. I tried scanning some pages of one of the said annual reports with the portable scanner on an identical Fedora 12 setup using xsane, and the PDF’s that were produced were far greater in size than those scanned with my desktop flatbed scanner. My flatbed scanner would scan the text and the page immediately surrounding the text, but correctly identified the “blank” part of the page as being blank, and did not scan in those areas, thereby significantly reducing the image scanned size. The other scanner, a portable model, did no such thing and created images from the whole page, blank spaces rendered, in this case, to a dull grey and all, thereby creating significantly larger PDF files than the scans of the same pages created on my flatbed scanner. However, as I mentioned, I assume that this is a function of the individual scanners and their drivers, and possibly how xsane interacts with them, and in my mind is not a function per se of how xsane creates PDF files.

Another interesting lesson.

Printing PDFs

I’ve just had an interesting object lesson in the differences between two different pieces of software that have more than essentially the same function.

Today I had an important PDF document to print out at home instead of at the office. For the purposes of practical convenience, it was far better to print it out at home and just deliver it to the office than spend the extra time at the office (5-10 minutes) turning on my office computer and printing it out there on a printer I knew would have no difficulty dealing with it, having printed out a few dozen identically-generated documents on it.

On my pretty much stock Fedora 10 box, I use the Evince Document Viewer 2.24.2 using poppler 0.8.7 (cairo) for the Gnome desktop to display and print PDF documents. So far, I’ve been satisfied.

The PDF’s layout had margins beyond my printer’s abilities. And of course the most important parts of the document, being right at the edges of the margins in this document, were being cut off in the process of printing out the document. A reduction in the print size was not useful since the vital information was on the end of the document being cut off in the margins. I suppose I could have tried rotating the document to try to see if the cut off part would not contain crucial information, which I didn’t think of at the time. Both these strategies, however, miss the point: If the original document has very narrow margins, something is going to get cut off no matter what; not exactly desireable.

I did try something that happened to involve a Windows box (ughh) mostly because it had a different printer, and you never know how things behave differently with different equipment.

Not surprisingly, the windows box happens to have an Adobe viewer installed (I avoid that box as much as possible; I don’t even maintain it, that’s my brother’s job. 🙂 ). I click to print the document and whaddya know, in the print dialog there’s an option to fit the document within the printable area. Document printed, convenience secured.

Now what I would like to know is how much of the print window in my desktop is governed by HPLIP, how much by Gnome, how much by CUPS, and how much by the application invoking it at the moment. So I did a little experiment: Always selecting my printer, I opened a print dialogue in Evince Document Viewer, OpenOffice.org (3.0.1), Firefox (3.0.7), The Gimp (2.6.5), Xpdf (3.02) which I intentionally installed for the purpose of this experiment, and gedit (2.24.3) (on which I’m composing this blog). Besides Xpdf, each appears to have the same base, and except for Evince Document Viewer, each also adds a function tab of its own. Xpdf, on the other hand, has its own stripped-down interface — either invoke the lpr command or print to a file.

Here’s a quick table listing the tabs listed in the print dialogs available in five, off-the-shelf standard installs of Fedora 10 software, with my printer selected, plus Xpdf, which was installed directly from the Fedora repositories without any modification of settings or whatever on my part:

OpenOffice.org*: General; Page Setup; Job; Advanced; Properties
Firefox: General: Page Setup; Job; Advanced; Options; Properties
Document Viewer: General; Page Setup; Job; Advanced
The Gimp**: General; Page Setup; Job; Advanced; Image Settings***
Xpdf: Xpdf has its own stripped-down interface
gedit : General; Page Setup; Job; Advanced; Text Editor

* There is an Options button in the “Page Setup” tab for OpenOffice.org.
** The Gimp treats my “special” PDF as an image much like any other, and automatically sizes it to the current settings, much like it would handle a .png or .jpg image
*** The Gimp has an option to ignore the margins; see above note

Not one, besides The Gimp, has an option to fit the document within the printable range, and The Gimp only indirectly, because of the way it seems to handle PDFs by default as an image to be manipulated. And of the others, to be fair, only Document Viewer and Xpdf deal with PDFs — even FireFox delegates PDFs to the Evince Document Viewer by default.

Then I did another little experiment: I installed Adobe Reader 9.1 (that license is interesting, pretty convoluted, and makes me wonder whether I may use the installation at all; in any case, I’ll be getting rid of it since I really only installed it for the purpose of this experiment, and decided a while ago that having 2 PDF viewers above and beyond that which is available in the basic distro installation is superfluous unless ther’s a particular reason for it.) And what do I see? A new print dialog that reminds me of the one I saw earlier on the windows box. Interestingly, it has “fit to printable area” and “shrink to printable area” options.

So my little experiment has led me to the following conclusions:

– many pieces of software, presumably not wanting to reinvent the wheel, rely either on the OS or I suspect, at least in this case, the desktop environment for its print dialogs;
– some software authors do want to reinvent the wheel, such as to “do it their own way”, or to be completely platform and environment independent, and therefore make their own dialogs;
– some software authors want to do extra things but don’t want to reinvent the wheel, so they have a wrapper for to add extra functionality to an existing base;
– in my documents, I shouldn’t try to stuff as much content as possible into each page too far, at least not by playing around with the margins.

Looks like something for the Evince authors to toss in. Assuming, of course, that — without fundamentally changing a document — resizing a PDF and/or its content to the local printer’s printing range is a really useful feature, such as to deal with awry margins, or PDFs sized for A4 instead of letter sized or vice-versa. 🙂 And that such non-conformities and/or their prevalence make it worth my using the Adobe Reader, licensing issues aside. Or that another PDF reader out there that has that functionality.