PDF’s, Scanning, and File Sizes

I’ve been playing around with PDF’s for the past few weeks and have noticed a very interesting thing: A PDF is *not* a PDF is *not* a PDF is *not* a PDF, ad nauseum, and it would seem, ad infinitum. At least, so it would seem. Part of me almost wonders if the only distinguishing feature of a PDF is the .pdf extension at the end of the file. In “researching” this post I have learned what I knew already; PDF boils down to being simply a container format.

Lately I have been scanning some annual reports from years past for an organization I belong to, and due to the ways xsane 0.997 that comes with Fedora 12 scans pages — which I will concede straight out of the gate I have only explored enough to get it to do what I want and to learn how it does things “its way” — the PDF file sizes are “fairly” large.

In order to find this out, I first found out about one of the quirks in xsane 0.997: Something about the settings with xsane doesn’t have it stop between pages for me to change pages; at least, I haven’t gotten around to finding where the settings in xsane are to have it pause between pages. This is important because my scanner doesn’t have an automatic page feeder. The first page of results of a google search indicate several comments about this problem, but not a solution. At first glance the second page of results is of no help.

So I end up scanning pages one at a time, and then use GhostScript to join them all up at the end to make a single PDF.

Without having added up file sizes, it was obvious that the total size of all the scanned pages at 75 dpi and in black and white was sufficiently larger than the single PDF with all the pages joined. This did not bother me since, again without having added things up, the difference didn’t seem *too* great, and I assumed that the savings were principally due to adminstrative redundancies being eliminated by having one “container” as opposed to having 25 to 30 “containers” for each individual page.

Then this week a curious thing occurred: I scanned a six page magazine article, and then separately, another two page magazine article, at 100 dpi and colour, and whaddya know, the combined PDF of each set is smaller than any of the original source files. Significantly so. In fact, the largest page from the first set of six pages is double the size of the final integrated PDF, and in the case of the second set of two pages, each of the original pages are triple the size of the combined PDF. I’m blown away.

Discussing this with someone who knows the insides of computers way more than I, I learn something: It would appear that xsane probably creates PDF’s using the TIFF format (for image quality) as opposed to what I imagine Ghostscript does when joining files, which would seem to be to do what it can to reduce filesizes, and as such in this case I imagine convert the TIFF’s inside the PDF’s into JPEG’s. A bit of googling indeed appears to associate tiffs and PDF’s when it comes to xsane; indeed a check on the “multipage” settings shows three output file formats — PDF, PostScript and TIFF. And looking in Preferences/Setup/Filetype under the TIFF Zip Compression Rate, it’s set at 6 out of 9.

So I google PDF sizing, and one result led me to an explanation of the difference between using “Save” and “Save As …” options when editing a PDF: “Save” will typically append metadata on top of metadata (including *not* replacing the expired metadata in the “same” fields!); “Save As”, well, that’s what you really want to do to avoid a bloated file since all that should be will be replaced.

Another result begins describing (what is no doubt but a taste of) the various possible settings in a PDF file, and how using a given PDF editing application, you can go through a PDF, remove some setings, correct others, etc., and reduce the size of PDF’s by essentially eliminating redundant or situationally irrelevant — such as fields with null values — information whose presence would have the effect of bloating the file unecessarily.

I’ve known for a few years that PDF’s are a funny beast by nature when it comes to size: For me the best example by far used to be the use of “non-standard fonts” in the source file, oh say any open-source font that isn’t in the standard list of “don’t bother embedding the font since we all know that nine out of ten computers on the planet has it”. In and of itself this isn’t a problem; why not allow for file size savings when it is a reasonable presumption that many text PDF’s are based on a known set of fonts, and most people have said known set of fonts installed already on their system. However, when one uses a non-standard font or uses one of the tenth computers, when one constantly creates four to 6 page PDF text documents ten times the size of source documents, frustration sets in; having wondered if designating a font substitution along the lines of “use a Roman font such as Times New Roman” when such a font is used — such as in my case, Liberation Serif or occasionally Nimbus Roman No9 L — I asked my “person in the know”. Apparently, Fedora 12’s default GhostScript install, whose settings I have not modified, seems to do just that.

I guess what really gets me about this is how complicated the PDF standard must be, and how wildly variable the implementations are — at least, given that Adobe licences PDF creation for free provided that the implementations respect the complete standard — or more to the point, how wildly variable the assumptions and settings are in all sorts of software when creating a PDF. I bet that were I to take the same source and change one thing such as equipment or software that the results would be wildly different.

So, concurrent to the above scanning project, I happened to experiment with a portable scanner — a fun challenge in and of itself to make it work, but it did without “too much fuss”. And I found out something interesting, which I knew had nothing to do with PDF’s but (I presume) rather with scanners, drivers, and xsane. I tried scanning some pages of one of the said annual reports with the portable scanner on an identical Fedora 12 setup using xsane, and the PDF’s that were produced were far greater in size than those scanned with my desktop flatbed scanner. My flatbed scanner would scan the text and the page immediately surrounding the text, but correctly identified the “blank” part of the page as being blank, and did not scan in those areas, thereby significantly reducing the image scanned size. The other scanner, a portable model, did no such thing and created images from the whole page, blank spaces rendered, in this case, to a dull grey and all, thereby creating significantly larger PDF files than the scans of the same pages created on my flatbed scanner. However, as I mentioned, I assume that this is a function of the individual scanners and their drivers, and possibly how xsane interacts with them, and in my mind is not a function per se of how xsane creates PDF files.

Another interesting lesson.