Document Formatting When Joining Texts From Various Sources

Since about 2008, I have put together, on a volunteer basis and in a lay capacity, the annual reports for a community group to which I belong.

Up to that point, the group’s annual reports were individual committee reports delivered to the secretary, individually printed out as and when received, and then, when the report had to be distributed, stapled together with handwritten page numbers, an added cover page, and an extra page listing the reports and their page numbers. This approach did have the charm of not requiring a herculean effort and time commitment, both in putting together the report and, on “printing day”, in printing literally a thousand pages or more, depending on the number of pages in the report and the number of copies to be run off. Admittedly, that does not take into account any collating, depending on how one might print out the reports (e.g. pages with colour drawings and photos vs. black and white, etc.).

The year I took on putting together the annual report, I believed that the annual report should be in an electronic format such as PDF so that it could be placed on the group’s website. But that was barely the beginning of why I took on the job.

Fulfilling the technical goal of making a PDF for download from the website was not too difficult. Two easy options would have been either to scan the report once it had been produced the “old fashioned way” and make a PDF from all the images, or to create individual PDF documents from the reports received in electronic format (and scan those received on paper), then use a PDF joiner to string the PDF files together into a single document. In fact, at the time, I gathered as many previous annual reports as I could and scanned them, making them available on the website.
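
As a rough sketch of the first option (a sketch only, assuming ImageMagick is installed, and with hypothetical filenames rather than anything I actually used), the scanned page images can be strung into a single PDF with one command:

convert page-001.png page-002.png page-003.png annual-report-scanned.pdf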

However, going forward, I did not consider either option to be satisfactory.

The aesthetic appearance of the annual report irked me. It wasn’t the old school printing on paper — to this day, I still print lots of paper copies for distribution. Rather, I saw an opportunity to put to the test some angst stemming from a bit over a decade earlier, when contributing to the community group’s recipe book had left me with a few ideas on improvements to the text’s basic formatting and overall layout. (The actual recipes, their variety, the organization, the editing, and the recipe testing that I learned went on behind the scenes, and the like, were beyond the scope of my interest, although one common error, separate from my angst, was a mild nuisance.) I of course wisely kept my opinions to myself, both at the time of the recipe book in the mid-1990’s and at the time of initially volunteering to put together the annual report.

As can be surmised from the above, the reports came from almost as many different people as there were reports, depending on how many committee reports given individuals would take on. Each person would typically type their report on their computer and email it to the office, or perhaps print it out at home and drop it off at the group’s office personally. They used whichever word processor they had: sometimes simple text editors, or MS Write, or MS Word, presumably ranging from Word 97 through Word 2003 and Word 2007. Presumably some people had Macs with whichever word processor they might have had. I believe that the secretary, who sometimes typed up the reports that were submitted handwritten, was using a version of WordPerfect. Finally, I was at that point submitting my reports using OpenOffice.org. Presumably, there may have been other text editors or word processors used as well. Each instance presented a random opportunity for default settings to differ, as well as for the user to change the settings to suit their own personal taste.

As a result, each report predictably had formatting unique to each author, sometimes unique to each individual report, if two or more reports were submitted by the same person.

The various differences in formatting in the reports received included the following, without being an exhaustive list:

– varying text fonts and font sizes, and occasionally, more than one of either or both in a given report;
– varying line spacing;
– varying paragraph indentation, including lack thereof;
– line jumps or lack thereof between paragraphs;
– varying page margin widths;
– varying text alignment, typically either left justified, or left and right justified;
– the occasional use of italics over the whole document, beyond that which would normally be used;
– the inclusion or lack of section titles, sometimes (or not) rendered bold and/or italicised and/or underlined and/or capitalized;
– tables listing figures in formats unique to each table and report, or simple lists with varying bullet styles;
– varying spelling conventions, i.e. American vs. British vs. Canadian spellings (e.g. neighbor vs. neighbour, or center vs. centre);
– varying naming conventions: Sometimes full names, sometimes initialized first names with full last names, sometimes full first names with initialized last names, or sometimes very informally with only first names;
– varying honorific format conventions: sometimes honorifics, titles, and/or ranks would not be used, with persons simply named, and sometimes they would be referred to with variations of their title such as Reverend, Rev., The Reverend, The Rev., etc.;
– varying naming conventions for committee names, multi-word names, places, and the like, which were sometimes fully spelled out, and sometimes initialized, abbreviated, and / or contracted;
– etc.

As such, as alluded to in a previous post, minor differences in formatting between the individual reports created subtle (or, depending on the differences, more obvious) visual changes in how the reports appeared relative to one another when joined and printed on paper or read on a computer screen. Multiple permutations and combinations of the above formatting issues often produced wildly varying results that went well beyond the subtle, creating a patchwork of formatting across the multiple reports joined together into a single document. This can be jarring to the eye of some readers, particularly when it is not a subtle, unified, overarching design choice, but rather the result of a decided lack of one.

This link shows a hypothetical example of how such a report could look (you’ll need a PDF reader) — with the various individual reports each having a unique blend of formatting compared to the others. Note that I intentionally use “Lorem Ipsum” filler text so that the formatting, rather than the content, stands out.

The obligatory “let’s tie it all together” part at the end:

When I collect the individual reports and create one document, I cut and paste all the electronic reports (and rarely, type up handwritten reports) into a single document, imposing a uniform text formatting throughout in the form of a standard font, font size, line spacing, (lack of) paragraph indentation, page margins, and standardized and / or uniform versions of the other items above. Pages are automatically numbered, and standard page headers and footers are automatically added throughout, with date codes to distinguish between earlier and later versions. Basic spelling and typing conventions are applied and made uniform. Note that I don’t dictate or edit writing style, so one report might have section headers, while another may not, nor do I edit for turns of phrase and the like.

This link shows the above hypothetical report reworked (you’ll need a PDF reader): the same reports with some basic text formatting made uniform across the whole document, while each author’s text flow (and implicitly, were each text unique, their writing style as well) remains relatively untouched.

Have I addressed my angst from the mid 1990’s? Yes.

Is the document formatting on the annual reports I produce every year a work in progress, with subtle improvements, changes, and the like every year? Of course.

A text formatting riddle

I’d like to propose my version of a little visual puzzle I saw years ago. In the following table, the same text is repeated in each cell. In eight of the cells, one element of formatting has been changed from a basic set of formatting, while the ninth contains, in this case, the default settings of my word processor on my system. The riddle is to find which cell has not been modified as compared to the other eight. (View a slightly larger version of the table here.)

A hint of sorts: What the basic formatting settings are, or which word processor I used on which system or OS, all represent red herrings to solving which is “the original”, or “vanilla”, version.

a text formatting puzzle

Scroll down for the solution.
The solution is B2, the cell / square in the centre of the table.

All the other cells have one thing changed from B2’s qualities.

A1) The font was changed (from a Serif font to a Sans Serif font);
A2) The font size was changed to a slightly larger point size;
A3) The cell’s background colour was changed to a light grey;
B1) The text was italicized;
B2) Standard, unchanged text using my word processor’s standard settings;
B3) The text colour was lightened from a standard black to a grey;
C1) The text was capitalized;
C2) The text was made bold;
C3) The text’s line spacing was increased.

Besides being, at its core, what I perceive to be a fun riddle, it demonstrates how subtle differences from standard document formatting can be introduced in a variety of ways. It also alludes to the challenges presented by receiving documents from multiple sources for integration into a single document, such as a community group’s newsletter or annual report presenting content and/or reports from its various members, leaders, subgroups, committees, and the like. In a forthcoming post, I will further discuss basic issues of varying formatting, and the need for standard formatting in a text document, from the perspective of a layman editor of a community group’s annual report.

Updated recipes

I have been adding my personal recipes to malak.ca since the beginning of December, 2017.

It has been a sort of fresh start at creating my personal cookbook, a project I started, I think, long before 2011 — as early as 2007-ish, as I recall.  (I remember discussing the cookbook with someone somewhere around 2012, and said conversation could not have occurred before 2011.)

Several years ago, I’d put together a personal cookbook, but at a certain point during its construction, somehow the main file either got corrupted, or I had several copies which I didn’t manage properly (and presumably, in that scenario, began overwriting previously entered recipes with newer copies of other versions of the file).  However it all happened, I became disillusioned and lost interest, on a practical level, in reconstructing it all, let alone finishing it, despite a certain allure the project had.

Back in December, I decided to start from scratch, doing a rather 90’s thing — or perhaps even an 80’s, or 70’s, or 60’s thing — I used a basic text editor and started retyping each recipe, sometimes using what I did still have as a reference, and in at least four cases, simply reusing the recipe as I’d entered it a few years ago, from the remnants of the original cookbook file.

In the case of some of the recipes I’ve been typing in, I’ve actually been able to tune the text based on recent memory of having just made the items within the last couple of weeks (as in, as I was making the item in question, going over to my computer to make adjustments), or up to a couple of months ago.

I started posting PDF’s on my website.  And I’ve been using a “post early, post often” approach to each recipe: check a recipe, fine-tune it, repost the update, fine-tune it again, add sections like “equipment” as I start using them in other recipes, and so on.  I have even been recalling a lightning talk I rather liked at a Linux conference I attended in 2011 which, ironically, used baking and recipes as a way to demonstrate to developers the importance of clear, concise, and complete instructions and documentation in order to encourage others to join their software projects.

And, fun fun fun, today I took advantage of another day of holidays, er, waiting for the garage to call me back and say that my car, in for servicing, was ready:  I went through all my recently-typed recipes and did some basic editing.  Lists and sentences / semi-sentences were capitalized.  Lists received dash points.  Instructions which hadn’t already been fleshed out were fleshed out.  Sentences with multiple steps were broken up into discrete instruction lists.  A number received sections along the lines of “do this part, then while that cooks, do this part”, etc.  (And then I transferred the updates to my webserver, to my laptop as a backup, and to my backup server, which is also my webserver.)

Obviously, the likes of “cooking sausages” isn’t there, even though apparently, when I make them for a Santa’s Breakfast, they are highly rated, beyond the fact that I’m the only volunteer who actually relishes making 200+ sausages at home in advance.  And having the sausages pre-cooked, so that they only need to be reheated in the oven, is quite convenient when you’re serving 100+ people.

Eventually, I may update the eggplant, first meatball, cheese biscuit, and zucchini dish recipes in the style of the newly retyped recipes above, while converting the texts of the newly retyped recipes to that format (the original format for my “personal cookbook”), and take photos.

Finally!  My recipes are now documented, accessible, shared, sharable, and, if I ever get around to it, ready for transfer into a “cookbook”.

Free PDF splitters, and other crippleware

Yesterday I downloaded a PDF splitter to use on my Windows computer at work. And I got bitten, hard. I wouldn’t exactly call it crippleware; most people expect even crippleware to be minimally useful. This piece was not.

I shall quote the message that I sent to their support email addy:

I am writing to let you know that your free trial download for the PDF splitter is not a useful piece of software at all, for the simple reason that it intentionally and flagrantly renders the split documents useless by inserting the “watermark” — a large message spanning the diagonal of the page, in cherry red characters, saying “in order to remove this message please visit our website” — across every page of the document.


Were it to put a far more discreet message along the top or bottom, this might be tolerable, however ugly it would be; as it stands, it is hardly of any value to anyone wishing to take advantage of the “15 free uses” or somesuch in order to evaluate the software before deciding to purchase it; in fact, I expect that most people downloading the evaluation copies are immediately turned off by this malfunction.

Obviously, I don’t expect a response from them, at least not a useful response. Equally obviously, I would never have bought the software to begin with, even had I had a good experience using it — I admit it, I’m cheap.

And sure, I should have thought things through a bit better and (as I mention below) installed Ghostscript to do the job. Sure, I was in a bind and embarrassed myself and my employer in front of the client.

So of course, the following reactions come to mind:

– What, did the programmer(s) want to show off their skill at inserting “watermarks”, ones that are ugly to boot?
– Or did the programmer or company put more thought into the dollar signs floating in front of their eyes than, oh, I don’t know, producing a piece of software that someone may actually wish to buy?
– Or did the Marketing Department convince the programmer’s supervisor that the watermark had to be put in?

And on a personal level:

– I should install Ghostscript and run (a concrete example follows this list):
“gswin32c -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dFirstPage=m -dLastPage=n -sOutputFile=out.pdf in.pdf”
– I should stop trying to delude myself that there won’t be an ever increasing number of useless PDF tools out there that require you to buy the product before getting a true evaluation copy;
– When using my work computer, I should stop thinking with a Windows mentality, and apply a thing or two that I know how to do under Linux.
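
For instance, as a hypothetical example with made-up page numbers and filenames, extracting pages 3 through 7 of a report under Linux would look something like:

gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -dFirstPage=3 -dLastPage=7 -sOutputFile=pages-3-to-7.pdf report.pdf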

Of course, in the short term, what I did was speak very nicely with the secretary, who has Adobe Acrobat Professional, and ask her to split the file, which she did.

My point should be clear: If you want to sell your software, go right ahead; I won’t be buying it anyway. And if you want to give away a trial period during which people can, well, try the software, go right ahead; I may try your product during the trial period. But why give a free trial period (in the case above, 15 operations) that reflects poorly on the company and actually annoys your potential customers?

PDF’s, Scanning, and File Sizes

I’ve been playing around with PDF’s for the past few weeks and have noticed a very interesting thing: a PDF is *not* a PDF is *not* a PDF is *not* a PDF, ad nauseam and, it would seem, ad infinitum. Part of me almost wonders if the only distinguishing feature of a PDF is the .pdf extension at the end of the file. In “researching” this post I have learned what I knew already: PDF boils down to being simply a container format.

Lately I have been scanning some annual reports from years past for an organization I belong to, and due to the way xsane 0.997 (the version that comes with Fedora 12) scans pages — which, I will concede straight out of the gate, I have only explored enough to get it to do what I want and to learn how it does things “its way” — the PDF file sizes are “fairly” large.

Along the way, I ran into one of the quirks of xsane 0.997: something about the settings doesn’t have it stop between pages for me to change pages; at least, I haven’t gotten around to finding where the setting is to have it pause between pages. This is important because my scanner doesn’t have an automatic page feeder. The first page of results of a Google search turns up several comments about this problem, but no solution, and at first glance the second page of results is of no help either.

So I end up scanning pages one at a time, and then use Ghostscript to join them all up at the end to make a single PDF.
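
The joining step, assuming for the sake of illustration that the single-page scans were saved as page-01.pdf through page-25.pdf (hypothetical names), looks roughly like this:

gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -sOutputFile=annual-report.pdf page-*.pdf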

Without having added up the file sizes, it was obvious that the total size of all the scanned pages, at 75 dpi and in black and white, was noticeably larger than the single PDF with all the pages joined. This did not bother me since, again without having added things up, the difference didn’t seem *too* great, and I assumed that the savings were principally due to administrative redundancies being eliminated by having one “container” as opposed to 25 to 30 “containers”, one for each individual page.

Then this week a curious thing occurred: I scanned a six-page magazine article, and then, separately, another two-page magazine article, at 100 dpi and in colour, and whaddya know, the combined PDF of each set is smaller than any of the original source files. Significantly so. In fact, the largest page from the first set of six pages is double the size of the final integrated PDF, and in the case of the second set of two pages, each of the original pages is triple the size of the combined PDF. I’m blown away.

Discussing this with someone who knows the insides of computers way more than I do, I learned something: it would appear that xsane probably creates PDF’s using the TIFF format (for image quality), whereas Ghostscript, when joining files, seems to do what it can to reduce file sizes, in this case, I imagine, by converting the TIFF’s inside the PDF’s into JPEG’s. A bit of googling indeed appears to associate TIFF’s and PDF’s when it comes to xsane; indeed, a check of the “multipage” settings shows three output file formats — PDF, PostScript and TIFF. And looking in Preferences/Setup/Filetype, the TIFF Zip Compression Rate is set at 6 out of 9.
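
As an aside, Ghostscript can also be told explicitly to recompress and downsample the images while rewriting a PDF; a sketch, with a hypothetical filename, using one of pdfwrite’s stock presets:

gs -sDEVICE=pdfwrite -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=report-smaller.pdf report-scanned.pdf

The /ebook preset downsamples images to roughly 150 dpi; /screen and /printer are the lower- and higher-quality alternatives.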

So I googled PDF sizing, and one result led me to an explanation of the difference between the “Save” and “Save As …” options when editing a PDF: “Save” will typically append metadata on top of metadata (including *not* replacing the expired metadata in the “same” fields!), while “Save As” is what you really want to use to avoid a bloated file, since everything that should be replaced will be.

Another result begins describing (what is no doubt but a taste of) the various possible settings in a PDF file, and how, using a given PDF editing application, you can go through a PDF, remove some settings, correct others, etc., and reduce the size of PDF’s by essentially eliminating redundant or situationally irrelevant information — such as fields with null values — whose presence would have the effect of bloating the file unnecessarily.

I’ve known for a few years that PDF’s are a funny beast by nature when it comes to size. For me, the best example by far used to be the use of “non-standard fonts” in the source file, oh say, any open-source font that isn’t on the standard list of “don’t bother embedding this font, since we all know that nine out of ten computers on the planet have it”. In and of itself this isn’t a problem; why not allow for file size savings when it is reasonable to presume that many text PDF’s are based on a known set of fonts, and that most people already have that set installed on their systems? However, when one uses a non-standard font, or is using one of those tenth computers, and constantly creates four- to six-page PDF text documents ten times the size of the source documents, frustration sets in. Having wondered whether a font substitution could be designated along the lines of “use a Roman font such as Times New Roman” when such a font is used — such as in my case, Liberation Serif or occasionally Nimbus Roman No9 L — I asked my “person in the know”. Apparently, Fedora 12’s default Ghostscript install, whose settings I have not modified, seems to do just that.
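
Incidentally, a quick way to see what a given PDF actually embeds (assuming the poppler-utils package is installed, and with a hypothetical filename) is:

pdffonts annual-report.pdf

which lists each font the document uses, along with whether it is embedded and whether it is subsetted.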

I guess what really gets me about this is how complicated the PDF standard must be, and how wildly variable the implementations are — at least, given that Adobe licences PDF creation for free provided that implementations respect the complete standard — or, more to the point, how wildly variable the assumptions and settings are in all sorts of software when creating a PDF. I bet that were I to take the same source and change one thing, such as the equipment or the software, the results would be wildly different.

So, concurrent to the above scanning project, I happened to experiment with a portable scanner — a fun challenge in and of itself to make it work, but it did, without “too much fuss”. And I found out something interesting, which I knew had nothing to do with PDF’s but (I presume) rather with scanners, drivers, and xsane. I tried scanning some pages of one of the said annual reports with the portable scanner, on an identical Fedora 12 setup using xsane, and the PDF’s that were produced were far greater in size than those scanned with my desktop flatbed scanner. My flatbed scanner would scan the text and the area immediately surrounding it, but correctly identified the “blank” parts of the page as blank and did not scan those areas, thereby significantly reducing the size of the scanned image. The portable scanner did no such thing: it created images from the whole page, blank spaces rendered, in this case, to a dull grey and all, thereby creating significantly larger PDF files than the scans of the same pages made on my flatbed scanner. However, as I mentioned, I assume that this is a function of the individual scanners and their drivers, and possibly of how xsane interacts with them, and in my mind is not a function per se of how xsane creates PDF files.

Another interesting lesson.

Printing PDFs

I’ve just had an interesting object lesson in the differences between two different pieces of software that have more or less the same function.

Today I had an important PDF document to print out at home instead of at the office. For the purposes of practical convenience, it was far better to print it out at home and just deliver it to the office than spend the extra time at the office (5-10 minutes) turning on my office computer and printing it out there on a printer I knew would have no difficulty dealing with it, having printed out a few dozen identically-generated documents on it.

On my pretty much stock Fedora 10 box, I use Evince Document Viewer 2.24.2 (using poppler 0.8.7 (cairo)) for the Gnome desktop to display and print PDF documents. So far, I’ve been satisfied.

The PDF’s layout had margins beyond my printer’s abilities. And of course the most important parts of the document, sitting right at the edges of those margins, were being cut off when it printed. Reducing the print size was not useful, since the vital information was at the end of the document, in the part being cut off at the margins. I suppose I could have tried rotating the document to see whether the cut-off part would not contain crucial information, which I didn’t think of at the time. Both these strategies, however, miss the point: if the original document has very narrow margins, something is going to get cut off no matter what; not exactly desirable.
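
In hindsight, a command-line workaround might have been to have Ghostscript rescale the document to the paper size before printing; a sketch only, assuming Ghostscript is installed and using hypothetical filenames:

gs -sDEVICE=pdfwrite -sPAPERSIZE=letter -dFIXEDMEDIA -dPDFFitPage -dNOPAUSE -dQUIET -dBATCH -sOutputFile=fitted.pdf original.pdf

This writes out a new PDF with the pages scaled to fit within letter-sized media, which the printer should then handle without clipping.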

I did try something that happened to involve a Windows box (ughh) mostly because it had a different printer, and you never know how things behave differently with different equipment.

Not surprisingly, the Windows box happens to have an Adobe viewer installed (I avoid that box as much as possible; I don’t even maintain it, that’s my brother’s job. 🙂 ). I click to print the document and whaddya know, in the print dialog there’s an option to fit the document within the printable area. Document printed, convenience secured.

Now what I would like to know is how much of the print window in my desktop is governed by HPLIP, how much by Gnome, how much by CUPS, and how much by the application invoking it at the moment. So I did a little experiment: Always selecting my printer, I opened a print dialogue in Evince Document Viewer, OpenOffice.org (3.0.1), Firefox (3.0.7), The Gimp (2.6.5), Xpdf (3.02) which I intentionally installed for the purpose of this experiment, and gedit (2.24.3) (on which I’m composing this blog). Besides Xpdf, each appears to have the same base, and except for Evince Document Viewer, each also adds a function tab of its own. Xpdf, on the other hand, has its own stripped-down interface — either invoke the lpr command or print to a file.

Here’s a quick table listing the tabs listed in the print dialogs available in five, off-the-shelf standard installs of Fedora 10 software, with my printer selected, plus Xpdf, which was installed directly from the Fedora repositories without any modification of settings or whatever on my part:

OpenOffice.org*: General; Page Setup; Job; Advanced; Properties
Firefox: General; Page Setup; Job; Advanced; Options; Properties
Document Viewer: General; Page Setup; Job; Advanced
The Gimp**: General; Page Setup; Job; Advanced; Image Settings***
Xpdf: Xpdf has its own stripped-down interface
gedit: General; Page Setup; Job; Advanced; Text Editor

* There is an Options button in the “Page Setup” tab for OpenOffice.org.
** The Gimp treats my “special” PDF as an image much like any other, and automatically sizes it to the current settings, much like it would handle a .png or .jpg image
*** The Gimp has an option to ignore the margins; see above note

Not one, besides The Gimp, has an option to fit the document within the printable range, and The Gimp only indirectly, because of the way it seems to handle PDFs by default as images to be manipulated. And of the others, to be fair, only Document Viewer and Xpdf deal with PDFs directly — even Firefox delegates PDFs to the Evince Document Viewer by default.

Then I did another little experiment: I installed Adobe Reader 9.1. (That license is interesting, pretty convoluted, and makes me wonder whether I may use the installation at all; in any case, I’ll be getting rid of it, since I really only installed it for the purpose of this experiment, and I decided a while ago that having two PDF viewers above and beyond what is available in the basic distro installation is superfluous unless there’s a particular reason for it.) And what do I see? A new print dialog that reminds me of the one I saw earlier on the Windows box. Interestingly, it has both “fit to printable area” and “shrink to printable area” options.

So my little experiment has led me to the following conclusions:

– many pieces of software, presumably not wanting to reinvent the wheel, rely either on the OS or, I suspect, at least in this case, the desktop environment for their print dialogs;
– some software authors do want to reinvent the wheel, such as to “do it their own way”, or to be completely platform and environment independent, and therefore make their own dialogs;
– some software authors want to do extra things but don’t want to reinvent the wheel, so they use a wrapper to add extra functionality to an existing base;
– in my documents, I shouldn’t push stuffing as much content as possible into each page too far, at least not by playing around with the margins.

Looks like something for the Evince authors to toss in. Assuming, of course, that — without fundamentally changing a document — resizing a PDF and/or its content to the local printer’s printable range is a really useful feature, such as to deal with awry margins, or with PDFs sized for A4 instead of letter size or vice-versa. 🙂 And that such non-conformities and/or their prevalence make it worth my using the Adobe Reader, licensing issues aside. Or that there’s another PDF reader out there that has that functionality.