PDF – Don's House of Fine Patisseries

April 5, 2022 (updated April 7, 2022)

Overview of Open Source / Free Software for PDF files

This post is a translation and adaptation, with slight updates, of a presentation I gave in November 2021 at a meeting of my local Linux Meetup. It includes some limited mock-ups of demonstrations performed live during the presentation.

The presentation was put together using Fedora Workstation (a general purpose version of Linux, in this case specializing in being a desktop workstation), highlighting some software either installed by default, or available in the Fedora Linux and rpmfusion software repositories (“App Stores”). It is therefore not intended to be a complete exposé on all available open source / free software options for PDF, even under Fedora Linux, let alone GNU / Linux in general, or other systems.

It should be noted that the presentation’s original target audience was a French-speaking group of Linux enthusiasts, Linux professionals, and other IT enthusiasts and professionals familiar with Linux. Most of the listed software would typically be available in standard or easily accessible Linux software repositories (“App Stores”). Beyond the world of GNU / Linux, free software is generally available for use on other systems and, barring instances of a specific package offered with paid warranty support, is usually also free of charge to download, install, and use.

In the case of the software highlighted in this post, all are either free-of-charge, or represent the free-of-charge version.

The Value of a PDF File

Context / Situation:

Take the case of the exchange of a document between two computers — such as between one running Linux, and another running Windows (or vice-versa) — and each computer is endowed with a different office suite, such as LibreOffice (cross-platform) on one, and Microsoft Office (Windows / Mac) on the other. (Of course, other possibilities exist, such as Calligra Suite (cross-platform), Pages / Numbers / Keynote / etc. (Mac), Corel Wordperfect, Google Docs, etc.)

LibreOffice, and in days gone by, OpenOffice.org, have long been touted as being “compatible” with MS Office; this purported compatibility, however, is disappointingly nowhere near as good as I and many others would like to believe.

As such, each user will open the shared document, which will be displayed according to each suite’s interpretation of the file, and may find that the actual displayed content on their screen could be different — sometimes substantially so — from the intended original display of the document. Text lines may be cut off; fonts may not be available on one or more of the systems, causing font substitution; font sizes may be changed, or text size may be different while substituting a different font due to the lack of the specified font; certain symbols may not be available on some systems; table effects may not work, or objects inserted into tables may not function or be displayed as expected, such as the insertion of a spreadsheet.

Unfortunately, I would estimate that said disappointing lack of “complete and perfect” “drop-in replacement” compatibility is a very common experience in comparing many well-known pieces of proprietary software and their open-source counterparts — not just LibreOffice and MS Office. Personally, as a Linux user, I have experienced this lack of complete compatibility a number of times since beginning to use OpenOffice.org in 2005 and Linux in 2006. Since then, I have also seen the incompatibility in action on a number of occasions during varying presentations under completely unrelated circumstances in which the presentation files were produced in one suite, and attempts made to show them in another were met with varying degrees of disappointment, sometimes leading to complete failure.

Example PDF

The PDF at this link is a somewhat varied although basic document created for this presentation (you will need a PDF viewer); images of the PDF are shown below. It was developed in order to use throughout the presentation as an example PDF to demonstrate the various given points at hand. It should be noted that the PDF was written in French because the presentation’s original target audience was French-speaking.

The following four images are jpeg images of the pages of the PDF document linked to above, and which I created in LibreOffice Presentation. It should be noted that, for the sake of argument, the pages could have been created in another format, such as a word processor, a spreadsheet program, or a drawing program, for instance.

Page 1 — Song lyrics to be displayed for a Karaoke Night

Page 1, the lyrics to a French song, such as one might want to display during a karaoke event among friends

Page 2 — Expenses list for a Luncheon

Page 2, a fictitious list of expenses for a luncheon

Page 3 — TV Listings

Page 3, a fictitious TV listing for an evening, with some Linux in-jokes and some in-jokes specific to the original audience

Page 4 — Flea Market Poster

Page 4, a fictitious flyer for a local flea market

The above document — represented here in jpeg format directly produced from a PDF of the document — was originally prepared in LibreOffice Presentation, and therefore correctly represented the original document.

However, the following four images are jpeg images of the pages of the PDF document I created in Microsoft PowerPoint (you will need a PDF viewer) into which I imported the original LibreOffice Presentation, in order to demonstrate the relative lack of compatibility between, at least in this case, LibreOffice Presentation and Microsoft Powerpoint.

Page 1 — Song lyrics to be displayed for a Karaoke Night

Changes: Text fonts and font sizes, causing text to be cut off the page

Page 1, note the changes in fonts and font sizes

Page 2 — Expenses list for a Luncheon

Changes: Text fonts, and improper translation of symbols

Page 2, note the changes in fonts, font sizes, and improper translation of symbols

Page 3 — TV Listings

Changes: text fonts, font sizes, and lack of background colours in the various cells

Page 3, note the changes in fonts, font sizes, and lack of background colours in the various cells

Page 4 — Flea Market Poster

Changes: Text fonts, font sizes, corrupted translation of spreadsheet table in the centre of the flyer

Page 4, note the changes in fonts, font sizes, and the completely corrupted translation of the spreadsheet table in the centre of the flyer

The value of a PDF:

PDF files are generally well supported across platforms and software, and will usually be displayed in a virtually identical fashion on all systems; where discrepancies occur, they are usually inconsequential.

However:

There exists a certain perception that, short of having Adobe Acrobat Pro (a commercial, closed-source piece of software), PDF files are difficult to edit and modify, which feeds a certain view that PDF files are more secure. This is a case of “security by obscurity”, since editing and modification may be performed by many pieces of software besides Adobe Acrobat Pro.

PDF files may also benefit from a perception of being less susceptible to viruses and malware, such as through macros. Suspicious files, regardless of format, should always be checked when there is reasonable doubt, particularly under certain environments.

Warning:

Be careful when using PDF software downloaded from random websites on the internet, or websites which advertise PDF modification: They may add watermarks to the resulting file, which may be undesirable and embarrassing, particularly if the software, website, or their output aren’t vetted prior to distributing the resulting file.

PDF software which adds a watermark to edited documents when using an unregistered version

Further, websites providing PDF editing services may have very reasonable terms of service for editing your document, limiting their responsibilities toward you. When you submit a document to an external website, its operators may not be able to protect your personal privacy, nor guarantee that they will not divulge commercial or industrial secrets or confidential personal information contained in the submitted document: They may become the victim of a hacking, or become the target of legal proceedings, not to mention potential dubious or unscrupulous intentions operators might have to begin with. Or, they may simply be unwilling to formally engage in such responsibilities in the absence of a paid service contract.

Sample from a website listing their conditions of use

This article’s objectives therefore are:

Firstly, presenting the utility of PDF as a useful format for distributing documents to a wide audience, without having to concern oneself with what software individual audience members may or may not have access to, if at all, and regardless of reason(s);
Secondly, presenting safe, free software and open-source software options for using and editing PDF files;
Thirdly, beyond the general promotion of free and open-source software and PDF editing, this article is not about promoting nor deriding particular OSes or software packages, or strictly speaking their strengths or weaknesses.

As such, if a particular system or software package suits your needs and / or purposes, you should use it.

However, if a given preferred solution is costly software, perhaps your organization (or your family) may find it to be financially worthwhile to only purchase a minimum number of licences and only install it on a minimum number of designated computers, instead of needlessly on every computer in your organization (or family).

A simple cost / benefit analysis would be worthwhile: You should consider whether you wish to pay $5, $10, $15, or more, on a recurring basis (perhaps monthly), per computer on which such software would be installed. The costs, be they one-time costs or recurring, should be considered against how often the software may be used, perhaps in some cases only once or twice monthly across the whole organization, let alone on each individual computer, depending on your organization’s size, needs, and other considerations. Further, it should be considered what operations are typically executed, especially if they are simple operations such as joining multiple PDFs, or extracting a page or two, which can be easily performed by many, using any of a multitude of software packages you can get without cost, as opposed to perhaps more technical tasks which may justify costly specialized software.
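As a rough illustration of that arithmetic, here is a minimal sketch in Python. The $15 monthly fee comes from the range mentioned above; the seat counts and the "two uses per month" figure are hypothetical, and the function names are my own.

```python
def annual_licence_cost(monthly_fee: float, seats: int) -> float:
    """Yearly cost of a per-computer monthly subscription."""
    return monthly_fee * 12 * seats


def cost_per_use(monthly_fee: float, seats: int, uses_per_month: float) -> float:
    """Average cost of each use, counted across the whole organization."""
    if uses_per_month <= 0:
        raise ValueError("uses_per_month must be positive")
    return monthly_fee * seats / uses_per_month


# Hypothetical comparison: licences on every computer (10) versus
# only on designated computers (2), with the software used twice a month.
for seats in (10, 2):
    yearly = annual_licence_cost(15, seats)
    per_use = cost_per_use(15, seats, uses_per_month=2)
    print(f"{seats} seats: ${yearly:.2f} per year, ${per_use:.2f} per use")
```

With these hypothetical figures, ten seats cost $1,800 per year ($75 per use), while two designated seats cost $360 per year ($15 per use).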

Creating PDFs from an established document

To begin with, most software which creates documents will have an option in the File menu or elsewhere to Print, or Print to Document, or an Export function, which will offer PDF as a format:

PDF (creation) Options in the “Export as PDF” option in LibreOffice

At the risk of skipping ahead to the PDF splitting section below, note that it is a common option to be able to selectively output some, instead of all, pages to the resulting PDF, thereby avoiding the question of having to later split the PDF to get only the desired page(s).
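For batch work, the same export can be scripted rather than clicked through. The following is a minimal sketch, assuming LibreOffice's command-line launcher is on the PATH under the name `soffice` (the name may differ on some systems); the function names are hypothetical. It converts the whole document, so selecting particular pages is still left to the export dialog.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def export_pdf_command(document: str, outdir: str = ".") -> List[str]:
    """Build the LibreOffice command line that converts one document to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", outdir, document]


def export_pdf(document: str, outdir: str = ".") -> Optional[Path]:
    """Run the conversion; return the expected PDF path, or None if soffice is absent."""
    if shutil.which("soffice") is None:  # assumption: launcher is named "soffice"
        return None
    subprocess.run(export_pdf_command(document, outdir), check=True)
    return Path(outdir) / (Path(document).stem + ".pdf")
```

Calling `export_pdf("presentation.odp", "out")` (a hypothetical file name) would be expected to leave `out/presentation.pdf` behind when LibreOffice is installed.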

Overview of PDF Software

Perhaps (or perhaps not) to the surprise of many, there are many software packages and suites which will:

Display PDF files
Combine, divide, and export PDF files, as well as reorder pages within a PDF;
Edit PDF files, such as the overall files and the file metadata, as well as the PDF file content
Import and display PDF files according to particular strengths (The Gimp, Inkscape, e-readers)

Displaying PDF files:

Here are some examples of software which will display PDF files directly:

Evince Document Viewer (Gnome Project)
Okular (KDE Project)
Firefox and Chromium (Web Browsers)
PDFSam (limited free version; there is also a commercial version with more capabilities); a version for Debian derived Linux systems is available on their website

Here is a very short list of software which will open and display PDF files and allow editing, each according to their strengths, but whose primary function is not PDF display:

LibreOffice (Office Suite)
Calligra (Office Suite)
The Gimp (Image Manipulation)
Inkscape (Vector Graphics Editor)

Evince Document Viewer

Chromium (web browser)

Okular

Software to Combine PDF files

A relatively common activity is to combine multiple PDF files into one file — such as, separately scanned pieces of paper, or PDF files produced separately, perhaps by different people.

Here are some examples of software which will combine PDF files:

PDF Mix Tool
PDF Arranger
PDF Mod
PDF Jumbler
PDFedit
PDFTricks
PDFSam
LibreOffice
Calligra Suite
The Gimp

Combining PDF files in PDFArranger
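For those who prefer a script to a graphical tool, combining files can also be sketched in a few lines. This example assumes the `pdfunite` command from the Poppler utilities is installed; it is not one of the programs demonstrated in the presentation, and the function and file names are hypothetical.

```python
import shutil
import subprocess
from typing import List, Sequence


def merge_command(inputs: Sequence[str], output: str) -> List[str]:
    """Build a pdfunite command: input files in order, then the output file."""
    if len(inputs) < 2:
        raise ValueError("need at least two PDF files to combine")
    return ["pdfunite", *inputs, output]


def merge(inputs: Sequence[str], output: str) -> bool:
    """Combine the files; return False if pdfunite is not installed."""
    if shutil.which("pdfunite") is None:
        return False
    subprocess.run(merge_command(inputs, output), check=True)
    return True
```

The pages of the result follow the order of the input list, so `merge(["scan1.pdf", "scan2.pdf"], "combined.pdf")` would place the pages of the first scan before those of the second.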

Software to Divide PDF Files / Extract Pages

Another relatively common activity is to divide a PDF File, or extract one or more pages from a PDF file.

Note that if you are the creator of the document, as shown earlier, the software you used to create the document likely allows for you to selectively export individual or multiple pages to PDF in addition to exporting the entire document.

Here are some examples of software which will divide PDF files / extract pages:

PDF Mix Tool
PDF Mod
PDF Jumbler
PDFedit
PDFTricks
PDFSam
LibreOffice: allows printing and / or exporting one or more pages
Calligra Suite: allows printing and / or exporting one or more pages
The Gimp: allows printing and / or exporting one or more pages

Splitting a PDF File with PDFMod

Removing pages from a PDF file using PDFMod
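Extracting or removing pages can likewise be scripted. The sketch below assumes the `qpdf` command-line tool is installed (it was not part of the presentation); the helper names are hypothetical. Removing pages is expressed as "keep everything else", so a small helper turns the pages to drop into the page-range string that qpdf expects.

```python
from typing import Iterable, List


def keep_ranges(total_pages: int, drop: Iterable[int]) -> str:
    """Return a range string such as "1,3-4" listing every page not in drop."""
    dropped = set(drop)
    kept = [p for p in range(1, total_pages + 1) if p not in dropped]
    if not kept:
        raise ValueError("cannot remove every page")
    ranges = []
    start = prev = kept[0]
    for page in kept[1:]:
        if page == prev + 1:      # still inside a consecutive run
            prev = page
            continue
        ranges.append((start, prev))
        start = prev = page
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def extract_command(source: str, pages: str, output: str) -> List[str]:
    """Build a qpdf command copying the given page range into a new file."""
    return ["qpdf", "--empty", "--pages", source, pages, "--", output]
```

For the four-page example PDF, `keep_ranges(4, [2])` gives `"1,3-4"`, which can be passed to `extract_command` to produce a copy without the expenses page; the resulting list can then be run with `subprocess.run`.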

PDF Editing

Here are some examples of software which will edit PDF files to varying degrees:

LibreOffice permits the possibility of creating a hybrid PDF and .odt / .ods file (word processor or spreadsheet files), which will allow for the PDF to be more easily edited by any suite that is able to edit .odt and .ods files; create a document with LibreOffice, and in creating a PDF, choose Export — General — PDF Hybrid (incorporating .odt / .ods file)

Other software to edit existing PDF files:

LibreOffice Draw
The Gimp
Scribus
PDFedit (old, but good)
jPDF Tweak (old, but good)
PDF Mix Tool (Basic functions)
https://itsfoss.com/pdf-editors-linux/
https://alternativeto.net/software/pdf24-creator/?platform=linux
PDFFill (pdffill.com) (Windows)

In my personal experience, PDF editing — and ease of doing so — can vary wildly according to what one wishes to do, as well as wildly according to the nature of the source PDF. I have had excellent experiences editing a PDF created from a CAD software drawing (presumably created using commercial CAD software such as AutoCAD), and whose individual elements could be manipulated in LibreOffice Draw. I have also used LibreOffice Draw to insert text zones, arrows, and scanned signatures into PDFs. Conversely, documents composed primarily of scanned images — including text and forms — may require more image manipulation skills to edit, modify, and manipulate individual and specific elements of the document, depending on your objectives.

What you can do will also be dictated by which software package you choose and its strengths and weaknesses.

For instance, it should be noted that the phrase “Editing a PDF” can be a nebulous thing which can mean many different things to different people. Actually editing document text directly in the PDF may be what one understands and expects, while the strengths of a given piece of software may lie elsewhere.

LibreOffice has some PDF import functions, as well as imperfect document layout functions. Depending on the source PDF document, it can be quite effective at editing text directly.

Note from the closed-source world: I once had an excellent experience with a moderately-difficult-to-edit PDF using Microsoft Word, which included being able to edit the text — and presumably save in MS Word’s native file format.

Importing and editing a PDF in LibreOffice Draw (note the imperfect import):

In the case of my example PDF, LibreOffice Draw allows for some direct editing of the text (Notice the word “MODIFIÉ” with a brick-red text colour replacing some of the text):

Importing and editing a PDF in Scribus, a desktop publishing programme:

The Gimp can insert text zones into a PDF, and those text zones may themselves be edited within The Gimp; however, its strengths lie in dealing with a PDF as an image, and editing image characteristics, while editing the text as one might in a word processor might be more challenging.

Importing a PDF file into The Gimp, image manipulation software:

Adding a text zone to a PDF in The Gimp:

Note the insertion of a text zone under the first line, saying “TEST document”

Exporting Text, Cut & Paste, and .odt File Creating

Depending on the source PDF and its nature, “cut & paste” may work (as opposed to not working at all), and may even “work well”, although this may be wildly variable according to the source PDF document. However, even in the best case, this method will normally only copy the actual text, and some of the images, from your PDF document; it may not usually be particularly useful in actually replicating the PDF document formatting.

As for other document and content formats, such as drawings, pictures, and text rendered into images, other sections of this post should be consulted (i.e. using LibreOffice Draw or The Gimp for drawings; optical character recognition (OCR), including OCRFeeder, etc.).

In addition to the mention of LibreOffice above, OCRFeeder is software that acts as a front end to optical character recognition software, and is able to import PDF files, and then export in HTML, plain text, OpenDocument (.odt) format, and of course PDF. Again depending on the source file, results may be variable, although the results are usable.

OCRFeeder in action and ready to export a page of the example PDF to ODT format

… and here is an image of the exported .odt file (word processor file) of the page viewed and created in OCRFeeder, then opened in my word processor (LibreOffice):

Ironically, as this case shows, the changes (or lack of adequate recognition and / or translation of the original layout) can be as great as, or even greater than, those that can occur by simply sharing documents between not-fully-compatible-though-similar software suites. However, though far from perfect, it is arguably usable, depending, of course, on how much effort you are willing to devote to replicating the original document layout, then making your desired changes, and finally creating a new PDF document.

Exporting to other file formats:

As has been (indirectly) demonstrated several times throughout this post, PDF files can be imported into software that isn’t specifically dedicated to PDFs, and then allow for the resulting imported file to be exported into other formats. For example, The Gimp was used to create most of the working images for this post: In the case where PDF files were to be displayed, the PDF files were imported into The Gimp, and then exported in jpeg or png formats. This type of conversion — from PDF to another given format — can often be done by other pieces of software (to varying degrees) according to their strengths or weaknesses.
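The PDF-to-image conversion described above was done by hand in The Gimp. As an alternative for many files at once, here is a minimal sketch assuming the `pdftoppm` and `pdftotext` commands from the Poppler utilities are installed; these are different tools from the ones used for this post, and the function names and the 150 dpi default are my own choices.

```python
from typing import List


def images_command(pdf: str, prefix: str, fmt: str = "png", dpi: int = 150) -> List[str]:
    """Build a pdftoppm command writing one image per page (prefix-1.png, ...)."""
    if fmt not in ("png", "jpeg"):
        raise ValueError("fmt must be 'png' or 'jpeg'")
    return ["pdftoppm", f"-{fmt}", "-r", str(dpi), pdf, prefix]


def text_command(pdf: str, txt: str) -> List[str]:
    """Build a pdftotext command that tries to preserve the page layout."""
    return ["pdftotext", "-layout", pdf, txt]


# The lists can be handed to subprocess.run(...) on a system with Poppler.
print(images_command("example.pdf", "page", fmt="jpeg"))
print(text_command("example.pdf", "example.txt"))
```

As with cut and paste, the text export only recovers text that is actually stored as text in the PDF; scanned pages would still need OCR.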

Photo Editing with PDFs

The Gimp is fully functional image processing software, very similar to — but, unfortunately, not fully compatible with nor a perfect drop-in replacement of — Photoshop. Using The Gimp, you can import a PDF and edit the image(s) directly, or extract photos and other images through a variety of means, such as selecting the area of the photo, copying the selected area, and creating a new document from the clipboard.

Here is The Gimp having imported a PDF of a photo of myself on a cruise:

PDF of a photo of the author imported into The Gimp

During the live presentation, I gave the hypothetical example — for the sake of levity — of a barber who particularly likes sideburns, and seeing mine in a PDF, decided to clip out one of my sideburns from the photo …

Selecting a region of the photo and creating a new document therefrom

… and then noticed how I was starting to go grey at the time:

The beginnings of some greying in my sideburns

It is understood that use of The Gimp to manipulate the photo can be continued at this point (such as to show how my sideburns might look after a colouring, or to compare them side-by-side against other people’s sideburns), and the result then exported as a PDF.

PDFTricks allows for resizing of images in PDFs, principally compressing and reducing the file size to the order of “large”, “medium”, “small”, and “extra-small”, as well as image exporting to .jpg / .png / .txt formats, and file merging and splitting.

During the presentation, the PDF document above, composed of the photo of myself on a trip, was run through the software’s “extreme compression” option. The following is a clip from a screenshot from a file manager, showing the size difference between the original file and the newly created compressed file:

File size difference before and after processing file with PDFTricks
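A comparable size reduction can be sketched from a script. This example assumes Ghostscript (`gs`) is installed and uses its `pdfwrite` device with one of its built-in quality presets; it is a different route from the PDFTricks interface shown above, and the function names and byte counts are hypothetical.

```python
from typing import List

# Ghostscript's named presets, roughly from smallest output to largest.
PRESETS = ("screen", "ebook", "printer", "prepress")


def compress_command(source: str, output: str, preset: str = "ebook") -> List[str]:
    """Build a Ghostscript command that rewrites a PDF with downsampled images."""
    if preset not in PRESETS:
        raise ValueError(f"preset must be one of {PRESETS}")
    return ["gs", "-sDEVICE=pdfwrite", f"-dPDFSETTINGS=/{preset}",
            "-dNOPAUSE", "-dBATCH", "-dQUIET",
            f"-sOutputFile={output}", source]


def reduction_percent(original_bytes: int, compressed_bytes: int) -> float:
    """Percentage by which the file shrank (negative if it grew)."""
    return 100.0 * (original_bytes - compressed_bytes) / original_bytes


# Hypothetical sizes: a 2,000,000-byte original reduced to 500,000 bytes.
print(f"{reduction_percent(2_000_000, 500_000):.0f}% smaller")
```

The trade-off is image quality: the smaller presets downsample embedded photos more aggressively, so it is worth checking the result on screen before distributing it.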

LibreOffice Draw allows for some image manipulation.

LibreOffice Draw being used to manipulate an embedded image

In this particular situation, the night sky drawing in the karaoke page of the example PDF I created was selected, and the various options directly available were shown. However, as mentioned earlier, I have imported PDF documents of building plans and modified them to include notes showing where work was performed, or to add signatures to documents.

PDF Forms

PDF Form Creation

LibreOffice Writer and Calligra Suite are fully-featured for the creation of forms. Unfortunately, I am not particularly adept at creating forms.

Filling PDF Forms

Evince — if the PDF form was designed to be interactively filled
Okular — if the PDF form was designed to be interactively filled
The Gimp — allows for text areas to be inserted, as well as photos, drawings, and the like
LibreOffice Draw — allows for text areas to be inserted, as well as photos, drawings, and the like

Here is an example form found at https://www.aloaha.com/sample-fillable-pdf-forms/ — a sample tax form which I began filling out for Mickey and Minnie Mouse, using Evince:

Fillable form being filled with the names of Mickey and Minnie Mouse

Final Choices:

Viewing / displaying PDF files : User’s choice (usually a system’s default PDF viewer is adequate, or a web browser)
Combining and splitting PDF files : PDFMixTool
Editing PDF files : User’s choice (depends on objectives and source file; The Gimp and LibreOffice Draw are good contenders)
Adjusting PDF file size : PDFTricks
Form creation : User’s choice
Form filling : User’s choice (usually a system’s default PDF viewer is adequate, or a web browser)
Exporting PDF to other formats : OCRFeeder (for .odt); LibreOffice Draw (Photos and images); The Gimp (photos and images)

Note on Linux availability of the above software:

Here are some screen shots from my system’s installed repositories (Fedora Stable; Fedora Updates; rpmfusion.org — free and non-free)

PDF software easily accessible from my computer’s software repositories (“App Stores”):

Gnome Software list of available PDF software from various software repositories on Fedora Linux

As this list suggests, there is a lot of software available with varying PDF abilities, ranging from dedicated PDF software of various kinds to software with other principal functions whose PDF functions range from simple importing from and exporting to the format, to manipulating PDF files in some way within the limits of the software’s main functions.

Summary:

This presentation’s goals are to highlight:

how PDF files are well supported most of the time on most systems, while document production software from two different sources, typically a well-known closed-source project and an open-source counterpart, is not as compatible as we may want;
free software, which avoids the security risks inherent to using unknown and potentially dangerous websites, is easily available for routine tasks, and reduces costs;
the possibility of editing PDF files with various pieces of free software which are easily available in most Linux distributions’ repositories, and often easily available for other platforms as well, albeit occasionally with variable success.

Questions taken during the presentation:

A question asked midway through the presentation expressed a certain surprise that The Gimp can be used to edit PDFs. As mentioned earlier, The Gimp is able to import PDF files, and perform various functions on the file according to its strengths (image manipulation).

During the question period at the end, a participant asked for a recommendation for software to affix signatures to documents. I replied that I was not aware of any open source official signing software with digital traceability, simply because I had not done any research on that subject; however, an image of a scanned signature can usually be inserted in a document using The Gimp or LibreOffice Draw, or as a document is being created in a word processor.

A final comment recommended the use of LibreOffice Draw, based on the commenter’s frequent use of it to perform a number of the functions listed here, to which I added that I had asked my employer’s IT department to install LibreOffice on my work-issued Windows-based laptop computer in order to be able to perform some drawing-modification functions as part of my employment.

Enjoy sharing and editing PDF files!

UPDATE 20220407:

Signing PDFs can be performed with jPDF Tweak.

jPDF Tweak can also encrypt and add passwords to a PDF.

March 17, 2019 (updated January 17, 2021)

Document Formatting When Joining Texts From Various Sources

I have mounted, on a volunteer basis and in a lay capacity, the annual reports for a community group to which I belong, since about 2008.

Up to that point, the group’s annual reports were individual committee reports delivered to the secretary, individually printed out as and when received, and then stapled together with handwritten page numbers when they had to be distributed, with an added cover page, and an extra page listing the reports and their page numbers. This did have the charm of not requiring a herculean effort and time commitment, both in mounting the report and, on “printing day”, in printing literally a thousand pages or more, depending on the number of pages in the report and the number of copies to be drawn. Admittedly, it does not take into account possible collating, depending on how one might print out the reports (i.e. pages with colour drawings and photos vs black and white, etc.).

The year I took on mounting the annual report, I believed that the annual reports should have been in an electronic format such as PDF so that it could be placed on the group’s website. But that was barely the beginning of why I took on the job.

To fulfill the technical goal of making a PDF for download from the website was not too difficult. Two easy options would have been to either scan the report once produced the “old fashioned way” and produce a PDF from all the images, or, at least for those received in electronic format, create individual PDF documents plus scan for those received on paper, then use a PDF joiner to string the PDF files together into a single document. In fact, at the time, I gathered as many previous annual reports as I could and scanned them, making them available on the website.

However, going forward, I did not consider either option to be satisfactory.

The aesthetic appearance of the annual report irked me. It wasn’t the old school printing on paper — to this day, I still print lots of paper copies for distribution. Rather, I saw an opportunity to put to the test some angst stemming from a bit over a decade earlier when the community group’s recipe book to which I’d contributed led to my having had a few ideas on improvements to the text’s basic formatting and overall layout. (The actual recipes, variety, organization, editing, and recipe testing that I learned went on behind the scenes, and the like, were beyond the scope of my interest, although one common error, separate from my angst, was a mild nuisance.) I of course wisely kept my opinions to myself, both at the time of the recipe book in the mid 1990’s, as well as at the time of initially volunteering to mount the annual report.

As can be surmised from the above, each report came from almost as many different people as there were reports, depending on how many committee reports given individuals would take on. Each person would typically type their report on their computer and email it to the office, or perhaps print it out at home and drop it off at the group’s office personally. They used whichever word processor they had: Sometimes simple text editors, or MS Write, or MS Word, presumably ranging through Word 98, Word 2003, and Word 2006. Presumably some people had Macs with whichever word processor they might have had. I believe that the secretary, who was sometimes typing up the reports which were submitted handwritten, was using a version of Wordperfect. Finally, I was submitting my reports at that point using OpenOffice.org. Presumably, there may have been other text editors or word processors used. Each instance presented a random opportunity for default settings to be different, as well as for the user to change the settings to those that suited their own personal taste.

As a result, each report predictably had formatting unique to each author, sometimes unique to each individual report, if two or more reports were submitted by the same person.

The various differences in formatting in the reports received included the following, without being an exhaustive list:

– varying text fonts and font sizes, and occasionally, more than one of either or both in a given report;
– varying line spacing;
– varying paragraph indentation, including lack thereof;
– line jumps or lack thereof between paragraphs;
– varying page margin widths;
– varying text alignment, typically either left justified, or left and right justified;
– the occasional use of italics over the whole document, beyond that which would normally be used;
– the inclusion or lack of section titles, sometimes (or not) rendered bold and/or italicised and/or underlined and/or capitalized;
– tables listing figures in formats unique to each table and report, or simple lists with varying bullet styles;
– varying spelling conventions, e.g. American vs. British vs. Canadian spellings (such as neighbor vs. neighbour, or center vs. centre);
– varying naming conventions: Sometimes full names, sometimes initialized first names with full last names, sometimes full first names with initialized last names, or sometimes very informally with only first names;
– varying honorific format conventions: sometimes honorifics, titles, and/or ranks would not be used, with persons simply named, and sometimes referred to with variations of their title such as Reverend, Rev., The Reverend, The Rev., etc.;
– varying naming conventions for committee names, multi-word names, places, and the like, which were sometimes fully spelled out, and sometimes initialized, abbreviated, and / or contracted;
– etc.

As such, as alluded to in a previous post, minor changes and differences in formatting between the individual reports created subtle (or, depending on the changes, more obvious) visual changes in how each report appeared compared to each other, when joined and printed on paper or read on a computer screen. Multiple permutations and combinations of the above formatting issues often led to creating wildly varying end results which go beyond the subtle, creating a patchwork of formatting over the multiple reports joined together into a single document. This may be jarring to the eye of some readers, particularly when it is not a subtle, unified, overarching design choice, but rather the result of a decided lack of unified design choice.

This link shows a hypothetical example of how such a report could look (you’ll need a PDF reader) — with various individual reports each having unique blends of formatting as compared to each other. Note that I intentionally use the “Lorem Ipsum” text so as to highlight the formatting.

The obligatory “let’s tie it all together” part at the end:

When I collect the individual reports and create one document, I cut and paste all the electronic reports (and rarely, type up handwritten reports) into a single document, imposing a uniform text formatting throughout in the form of a standard font, font size, line spacing, (lack of) paragraph indentation, page margins, and standardized and / or uniform versions of the other items above. Pages are automatically numbered, and standard page headers and footers are automatically added throughout, with date codes to distinguish between earlier and later versions. Basic spelling and typing conventions are applied and made uniform. Note that I don’t dictate or edit writing style, so one report might have section headers, while another may not, nor do I edit for turns of phrase and the like.

This link shows the above hypothetical report changed (you’ll need a PDF reader) to show the same reports with some basic text formatting across the whole document made uniform, while allowing each author’s text flow (and implicitly, were each text to be unique, writing style as well) to remain relatively untouched.

Have I addressed my angst from the mid 1990’s? Yes.

Is the document formatting on the annual reports I produce every year a work in progress, with subtle improvements, changes, and the like every year? Of course.

November 12, 2010 / November 8, 2020

Free PDF splitters, and other crippleware

Yesterday I downloaded a PDF splitter to use on my MS computer at work. And I got bitten, hard. I wouldn’t exactly call it crippleware; most people expect even crippleware to be minimally useful. This piece was not.

I shall quote the message that I sent to their support email addy:

I am writing to let you know that your free trial download for the PDF splitter is not a useful piece of software at all, for the simple reason that it intentionally and flagrantly renders the split documents useless by inserting the “watermark” — a large message spanning the diagonal of the page, in cherry red characters, saying “in order to remove this message please visit our website” — across every page of the document.

Were it to put a far more discreet message along the top or bottom, this might be tolerable however ugly it would be; however, it is hardly of any value to anyone wishing to take advantage of the “15 free uses” or somesuch in order to evaluate the software before deciding to purchase it; in fact, I expect that most people downloading the evaluation copies are immediately turned off by this malfunction.

Obviously, I don’t expect a response from them, at least not a useful response. Obviously, I would never have bought the software to begin with, even had I had a good experience using it; I admit it, I’m cheap.

And sure, I should have thought things through a bit better and (as I mention below) installed Ghostscript to do the job. Sure, I was in a bind and embarrassed myself and my employer in front of the client.

So of course, the following reactions come to mind:

– What, the programmer(s) wanted to show off their skill at inserting “watermarks”, and ugly ones to boot?
– Or did the programmer or company put more thought into the dollar signs floating in front of their eyes than, oh, I don’t know, producing a piece of software that someone may actually wish to buy?
– Or did the Marketing Department convince the programmer’s supervisor that the watermark had to be put in?

And on a personal level:

– I should install Ghostscript and run the following, where m and n are the first and last pages to extract:
gswin32c -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dFirstPage=m -dLastPage=n -sOutputFile=out.pdf in.pdf
– I should stop trying to delude myself that there won’t be an ever increasing number of useless PDF tools out there that require you to buy the product before getting a true evaluation copy;
– When using my work computer, I should stop using a Windows mentality, and apply a thing or two that I know how to do under Linux.
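
The Ghostscript invocation in the list above can be wrapped in a small script so that the page range isn’t retyped by hand each time. The following is only a minimal Python sketch: the function name build_split_command and the default binary name gs (the usual name on Linux, whereas the Windows console binary quoted above is gswin32c) are assumptions made for illustration, while the flags are the ones from the command above.

```python
import subprocess


def build_split_command(src, dst, first_page, last_page, gs_binary="gs"):
    """Return the Ghostscript argument list that extracts first_page..last_page.

    The function name and the default binary name are illustrative
    assumptions; the flags mirror the command quoted in the post.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("page range must satisfy 1 <= first_page <= last_page")
    return [
        gs_binary,
        "-sDEVICE=pdfwrite",          # write the result as a PDF
        "-dNOPAUSE",                  # don't wait for a keypress between pages
        "-dQUIET",                    # suppress the startup banner
        "-dBATCH",                    # exit once the input has been processed
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        f"-sOutputFile={dst}",
        src,
    ]


if __name__ == "__main__":
    cmd = build_split_command("in.pdf", "out.pdf", 3, 5)
    print(" ".join(cmd))
    # Uncomment to actually run it (Ghostscript must be installed and on PATH):
    # subprocess.run(cmd, check=True)
```

On the Windows box from the story, passing gs_binary="gswin32c" would reproduce the command as quoted.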

Of course, in the short term, what I did was ask the secretary, who has Adobe Professional, very nicely to split the file, and she did.

My point should be clear: If you want to sell your software, go right ahead; I won’t be buying it anyway. And if you want to give away a trial period during which people can, well, try the software, go right ahead; I may try your product during the trial period. But why give a free trial period (in the case above, 15 operations) that reflects poorly on the company and actually annoys your potential customers?

February 28, 2010 / April 11, 2020

PDF’s, Scanning, and File Sizes

I’ve been playing around with PDF’s for the past few weeks and have noticed a very interesting thing: A PDF is *not* a PDF is *not* a PDF is *not* a PDF, ad nauseam, and it would seem, ad infinitum. Part of me almost wonders if the only distinguishing feature of a PDF is the .pdf extension at the end of the file. In “researching” this post I have learned what I knew already: PDF boils down to being simply a container format.

Lately I have been scanning some annual reports from years past for an organization I belong to, and due to the way the xsane 0.997 that comes with Fedora 12 scans pages (which, I will concede straight out of the gate, I have only explored enough to get it to do what I want and to learn how it does things “its way”), the PDF file sizes are “fairly” large.

In order to find this out, I first found out about one of the quirks in xsane 0.997: Something about the settings in xsane doesn’t have it stop between pages for me to change pages; at least, I haven’t gotten around to finding where the setting is to have it pause between pages. This is important because my scanner doesn’t have an automatic page feeder. The first page of results of a Google search shows several comments about this problem, but not a solution. At first glance the second page of results is of no help.

So I end up scanning pages one at a time, and then use Ghostscript to join them all up at the end to make a single PDF.
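
The joining step can be sketched roughly as follows. This is a hedged Python sketch, not a record of the exact command used: the helper names build_join_command and join_pdfs and the file names are hypothetical, and it assumes the gs binary with its pdfwrite device, with the single-page files passed in the order they should appear.

```python
import subprocess


def build_join_command(page_files, combined, gs_binary="gs"):
    """Return the Ghostscript argument list that concatenates page_files.

    page_files must already be in reading order; names here are
    illustrative assumptions, not taken from the original post.
    """
    if not page_files:
        raise ValueError("at least one input PDF is required")
    return [
        gs_binary,
        "-sDEVICE=pdfwrite",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={combined}",
        *page_files,                  # inputs are appended in the given order
    ]


def join_pdfs(page_files, combined, gs_binary="gs"):
    """Run Ghostscript to do the join; raises CalledProcessError on failure."""
    subprocess.run(build_join_command(page_files, combined, gs_binary), check=True)
```

Calling join_pdfs(["page01.pdf", "page02.pdf"], "report.pdf") would then write report.pdf, provided Ghostscript is installed.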

Without having added up file sizes, it was obvious that the total size of all the scanned pages at 75 dpi and in black and white was noticeably larger than the single PDF with all the pages joined. This did not bother me since, again without having added things up, the difference didn’t seem *too* great, and I assumed that the savings were principally due to administrative redundancies being eliminated by having one “container” as opposed to having 25 to 30 “containers”, one for each individual page.

Then this week a curious thing occurred: I scanned a six page magazine article, and then separately, another two page magazine article, at 100 dpi and in colour, and whaddya know, the combined PDF of each set is smaller than any of the original source files. Significantly so. In fact, the largest page from the first set of six pages is double the size of the final integrated PDF, and in the case of the second set of two pages, each of the original pages is triple the size of the combined PDF. I’m blown away.
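
Rather than eyeballing the sizes, the comparison can be done with a few lines of Python. This is a minimal sketch under stated assumptions: the function names compare_sizes and compare_files are hypothetical, sizes are in bytes, and the numbers in the usage note are made up for illustration rather than measured from the scans described here.

```python
import os


def compare_sizes(page_sizes, combined_size):
    """Summarise how a combined PDF's size relates to its source pages.

    All sizes are in bytes. Returns the summed page size, the largest
    single page, and both expressed as multiples of the combined file.
    """
    if not page_sizes or combined_size <= 0:
        raise ValueError("need at least one page size and a positive combined size")
    total = sum(page_sizes)
    largest = max(page_sizes)
    return {
        "total_pages": total,
        "largest_page": largest,
        "combined": combined_size,
        "total_to_combined": total / combined_size,
        "largest_to_combined": largest / combined_size,
    }


def compare_files(page_paths, combined_path):
    """Same comparison, reading the sizes from files on disk."""
    sizes = [os.path.getsize(p) for p in page_paths]
    return compare_sizes(sizes, os.path.getsize(combined_path))
```

With hypothetical sizes, compare_sizes([200, 150, 100], 100) reports a largest page twice the size of the combined file, which is the kind of ratio described above for the six page article.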

Discussing this with someone who knows the insides of computers way more than I, I learn something: It would appear that xsane probably creates PDF’s using the TIFF format (for image quality), as opposed to what I imagine Ghostscript does when joining files, which would seem to be to do what it can to reduce file sizes, and as such, in this case, I imagine it converts the TIFF’s inside the PDF’s into JPEG’s. A bit of googling indeed appears to associate TIFF’s and PDF’s when it comes to xsane; indeed, a check on the “multipage” settings shows three output file formats: PDF, PostScript and TIFF. And looking in Preferences/Setup/Filetype under the TIFF Zip Compression Rate, it’s set at 6 out of 9.

So I google PDF sizing, and one result led me to an explanation of the difference between using “Save” and “Save As …” options when editing a PDF: “Save” will typically append metadata on top of metadata (including *not* replacing the expired metadata in the “same” fields!); “Save As”, well, that’s what you really want to do to avoid a bloated file since all that should be will be replaced.

Another result begins describing (what is no doubt but a taste of) the various possible settings in a PDF file, and how, using a given PDF editing application, you can go through a PDF, remove some settings, correct others, etc., and reduce the size of PDF’s by essentially eliminating redundant or situationally irrelevant information (such as fields with null values) whose presence would have the effect of bloating the file unnecessarily.

I’ve known for a few years that PDF’s are a funny beast by nature when it comes to size: For me the best example by far used to be the use of “non-standard fonts” in the source file, oh say any open-source font that isn’t in the standard list of “don’t bother embedding the font since we all know that nine out of ten computers on the planet have it”. In and of itself this isn’t a problem; why not allow for file size savings when it is a reasonable presumption that many text PDF’s are based on a known set of fonts, and most people have said known set of fonts installed already on their system? However, when one uses a non-standard font or uses one of those tenth computers, and one constantly creates four- to six-page PDF text documents ten times the size of the source documents, frustration sets in. Having wondered whether one could designate a font substitution along the lines of “use a Roman font such as Times New Roman” when such a font is used (in my case, Liberation Serif or occasionally Nimbus Roman No9 L), I asked my “person in the know”. Apparently, Fedora 12’s default Ghostscript install, whose settings I have not modified, seems to do just that.

I guess what really gets me about this is how complicated the PDF standard must be, and how wildly variable the implementations are (at least, given that Adobe licenses PDF creation for free provided that the implementations respect the complete standard), or more to the point, how wildly variable the assumptions and settings are in all sorts of software when creating a PDF. I bet that were I to take the same source and change one thing, such as the equipment or software, the results would be wildly different.

So, concurrent to the above scanning project, I happened to experiment with a portable scanner; it was a fun challenge in and of itself to make it work, but it did without “too much fuss”. And I found out something interesting, which I knew had nothing to do with PDF’s but (I presume) rather with scanners, drivers, and xsane. I tried scanning some pages of one of the said annual reports with the portable scanner on an identical Fedora 12 setup using xsane, and the PDF’s that were produced were far greater in size than those scanned with my desktop flatbed scanner. My flatbed scanner would scan the text and the page immediately surrounding the text, but correctly identified the “blank” part of the page as being blank, and did not scan in those areas, thereby significantly reducing the scanned image size. The portable scanner did no such thing and created images from the whole page, blank spaces and all (rendered, in this case, as a dull grey), thereby creating significantly larger PDF files than the scans of the same pages created on my flatbed scanner. However, as I mentioned, I assume that this is a function of the individual scanners and their drivers, and possibly how xsane interacts with them, and in my mind is not a function per se of how xsane creates PDF files.

Another interesting lesson.

March 27, 2009 / April 11, 2020

Printing PDFs

I’ve just had an interesting object lesson in the differences between two different pieces of software that have more or less the same function.

Today I had an important PDF document to print out at home instead of at the office. For the purposes of practical convenience, it was far better to print it out at home and just deliver it to the office than spend the extra time at the office (5-10 minutes) turning on my office computer and printing it out there on a printer I knew would have no difficulty dealing with it, having printed out a few dozen identically-generated documents on it.

On my pretty much stock Fedora 10 box, I use the Evince Document Viewer 2.24.2 using poppler 0.8.7 (cairo) for the Gnome desktop to display and print PDF documents. So far, I’ve been satisfied.

The PDF’s layout had margins beyond my printer’s abilities. And of course the most important parts of the document, being right at the edges of the margins in this document, were being cut off in the process of printing out the document. A reduction in the print size was not useful since the vital information was on the end of the document being cut off in the margins. I suppose I could have tried rotating the document to try to see if the cut off part would not contain crucial information, which I didn’t think of at the time. Both these strategies, however, miss the point: If the original document has very narrow margins, something is going to get cut off no matter what; not exactly desirable.

I did try something that happened to involve a Windows box (ughh) mostly because it had a different printer, and you never know how things behave differently with different equipment.

Not surprisingly, the Windows box happens to have an Adobe viewer installed (I avoid that box as much as possible; I don’t even maintain it, that’s my brother’s job. 🙂 ). I click to print the document and whaddya know, in the print dialog there’s an option to fit the document within the printable area. Document printed, convenience secured.

Now what I would like to know is how much of the print window in my desktop is governed by HPLIP, how much by Gnome, how much by CUPS, and how much by the application invoking it at the moment. So I did a little experiment: Always selecting my printer, I opened a print dialogue in Evince Document Viewer, OpenOffice.org (3.0.1), Firefox (3.0.7), The Gimp (2.6.5), Xpdf (3.02) which I intentionally installed for the purpose of this experiment, and gedit (2.24.3) (on which I’m composing this blog). Besides Xpdf, each appears to have the same base, and except for Evince Document Viewer, each also adds a function tab of its own. Xpdf, on the other hand, has its own stripped-down interface — either invoke the lpr command or print to a file.

Here’s a quick table listing the tabs listed in the print dialogs available in five, off-the-shelf standard installs of Fedora 10 software, with my printer selected, plus Xpdf, which was installed directly from the Fedora repositories without any modification of settings or whatever on my part:

OpenOffice.org*: General; Page Setup; Job; Advanced; Properties
Firefox: General; Page Setup; Job; Advanced; Options; Properties
Document Viewer: General; Page Setup; Job; Advanced
The Gimp**: General; Page Setup; Job; Advanced; Image Settings***
Xpdf: Xpdf has its own stripped-down interface
gedit: General; Page Setup; Job; Advanced; Text Editor

* There is an Options button in the “Page Setup” tab for OpenOffice.org.
** The Gimp treats my “special” PDF as an image much like any other, and automatically sizes it to the current settings, much like it would handle a .png or .jpg image
*** The Gimp has an option to ignore the margins; see above note

Not one, besides The Gimp, has an option to fit the document within the printable range, and The Gimp only indirectly, because of the way it seems to handle PDFs by default as an image to be manipulated. And of the others, to be fair, only Document Viewer and Xpdf deal with PDFs; even Firefox delegates PDFs to the Evince Document Viewer by default.

Then I did another little experiment: I installed Adobe Reader 9.1 (that license is interesting, pretty convoluted, and makes me wonder whether I may use the installation at all; in any case, I’ll be getting rid of it since I really only installed it for the purpose of this experiment, and decided a while ago that having 2 PDF viewers above and beyond that which is available in the basic distro installation is superfluous unless there’s a particular reason for it). And what do I see? A new print dialog that reminds me of the one I saw earlier on the Windows box. Interestingly, it has “fit to printable area” and “shrink to printable area” options.
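
The arithmetic behind such options is simple enough to sketch. The following Python is a minimal illustration, not a description of how Adobe Reader implements it: the function name fit_scale, the allow_enlarge switch, and the 18-point unprintable margin are all assumptions, and the reading that “fit” may scale up or down while “shrink” only ever scales down is my interpretation of the option names.

```python
def fit_scale(page_w, page_h, printable_w, printable_h, allow_enlarge=True):
    """Uniform scale factor that makes a page fit inside the printable area.

    All four dimensions must be in the same unit (e.g. PostScript points).
    With allow_enlarge=False the page is only ever shrunk, never grown.
    """
    if min(page_w, page_h, printable_w, printable_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # The tighter of the two directions decides the scale, so that the
    # aspect ratio is kept and nothing spills past either edge.
    scale = min(printable_w / page_w, printable_h / page_h)
    if not allow_enlarge:
        scale = min(scale, 1.0)
    return scale


# US Letter is 612 x 792 points; the printable area below assumes a
# hypothetical printer with 18-point (quarter-inch) unprintable margins.
LETTER = (612.0, 792.0)
PRINTABLE = (612.0 - 2 * 18.0, 792.0 - 2 * 18.0)

if __name__ == "__main__":
    print(round(fit_scale(*LETTER, *PRINTABLE), 4))
```

For a document laid out right to the edge of a Letter page, the width is the limiting direction in this example, so the whole page would be reduced by the ratio 576/612.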

So my little experiment has led me to the following conclusions:

– many pieces of software, presumably not wanting to reinvent the wheel, rely either on the OS or, I suspect, at least in this case, the desktop environment for their print dialogs;
– some software authors do want to reinvent the wheel, such as to “do it their own way”, or to be completely platform and environment independent, and therefore make their own dialogs;
– some software authors want to do extra things but don’t want to reinvent the wheel, so they have a wrapper to add extra functionality to an existing base;
– in my documents, I shouldn’t go too far in trying to stuff as much content as possible into each page, at least not by playing around with the margins.

Looks like something for the Evince authors to toss in. Assuming, of course, that (without fundamentally changing a document) resizing a PDF and/or its content to the local printer’s printing range is a really useful feature, such as to deal with awry margins, or PDFs sized for A4 instead of letter sized or vice-versa. 🙂 And that such non-conformities and/or their prevalence make it worth my using the Adobe Reader, licensing issues aside. Or that there is another PDF reader out there that has that functionality.