The Trouble with PDFs

@Graham Stewart: “… [.mht files] will only work for ONE page and the result will only be usable in Internet Explorer.”

Actually, Opera supports .mht files natively, and Firefox can read and write them as well, through plugins. AFAIK MHTML isn’t limited to encapsulating a single page. That’s just what current web browsers save.

It’s not such a big deal to .zip up a tree full of HTML and related resources anyway. Bruce Eckel has been distributing his books this way for years. It works fine. For reading and using a book from a computer screen, I think it works far better than a PDF would.
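
For what it’s worth, zipping up such a tree takes only a few lines. Here is a minimal sketch in Python, assuming the book lives in a single directory. The function name `zip_site` and the paths in the note below are made up for illustration, and this is not a claim about how Bruce Eckel actually builds his archives.

```python
import zipfile
from pathlib import Path

def zip_site(root, archive):
    """Pack every file under `root` into `archive`, keeping relative paths.

    Write the archive somewhere outside `root`, otherwise the half-written
    zip file would be picked up as one of its own entries.
    """
    root_path = Path(root)
    names = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root_path.rglob("*")):
            if path.is_file():
                # Store paths relative to the root so links between pages still work
                arcname = path.relative_to(root_path).as_posix()
                zf.write(path, arcname)
                names.append(arcname)
    return names
```

Something like `zip_site("book", "book.zip")` would give the reader one file to download; they unpack it and open the start page in any browser.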

PDFs rock when it comes to high-res professional quality printing, but they are useless as a web device. The packaging argument makes no sense. Why would you need to package and send a web page? That’s what HYPERLINKS are for. And hyperlinking is even better than packaging, because the information in the package is only a copy, and may become outdated, while the HTML source will always be the most current copy.

It’s a data black hole. That’s the biggest killer.

We have plenty of customers who want to send us, in PDF, POs, even asset lists they want us to track, and expect us to be able to use that data. Just try to extract a multi-page table out of a PDF! People are using this as universally as a container, when in fact what it’s containing is no longer data but a picture of the data.

This theoretical discussion about the needs of the printing industry versus the web is fine for us, but it’s wholly lost on the actual users out there, in a land where my wife is queen of her department’s computers because she actually knows how to zip files. To these folks, PDF is the de facto way that you extract content to transmit to others. If I had a dollar for every time I’ve been sent a PDF of a website, rather than just a URL…

Add to that the fact that they just don’t work on my PocketPC, where I do a lot of my reading.

Surely in some cases it really is important to preserve content, and it is useful there. It’s also useful to type into forms. But considering the ineptness with which some people are creating the content (e.g., columns too wide to scan with minimal eye movement), the freedom afforded by reflowing as browsers do is valuable.

I have to admit I vote for PDFs (at least for now) because I find them very convenient to download. Some HTML pages are just painful to download and maintain as a whole. A single PDF is easier to save and print. Many websites just fail to provide a printer-friendly page.

“…more defensible in 2001, when browser printing support was notoriously poor…”

Looking at stats from various websites, you see an alarming share of users still stuck with IE 6 (like 70%). IE 6 couldn’t even autofit a webpage to one sheet of paper; you get 2, or worse, 4.

I think that the PDF format would be a lot more useful if Adobe would strip out the 99% of the engine that you don’t need when loading it, or at least make it a load-on-demand operation.

Google “acrobat lite” or something similar; it’s already been done (it’s a small app written by a guy in the UK that unplugs the 99% of Acrobat that’s useless bloat).

Alternatively, just run Foxit.

Sorry Jeff, I don’t think you get why people use PDF. It’s not because of the layout. It’s because they expect someone to print the document and web browsers are impossibly lame when it comes to printing pages. The reason is obvious: pages are formatted to be viewed in browsers, not printed on paper, and it is the rare designer who comes even close to making a separate set of print settings in their CSS which look good (and even then you still are at the mercy of whatever default header/footer options the browser puts on, and even then you have no possible chance of real fidelity in the output).

As all designers know, HTML is meant to be different than print. It is dynamic, and it looks different from machine to machine. It is a different paradigm than the print world. That’s all well and good, but sometimes you want to design things for the print world. You want things to look the same to everyone. You don’t want to worry about shit getting out of alignment or some bit of text getting orphaned on the last page. Even modern browsers print most pages like shit (from a design point of view), something which is usually the fault of the designer but is something that’s going to be a problem in general with browser printing.

PDF is about the print world. It’s not meant for the web. The web simply becomes a distribution point for PDFs.

Why does Paul N. Edwards have that page as a PDF on his website? Because he expects people to print it. Because he’s a professor (a damned interesting one, too–The Closed World is a fantastically interesting book). Because he distributes it to his students. Because he wants to make sure when he says, “look at page 4,” that they all have the same page 4. Because students love loopholes (“Oh, well MY copy of that didn’t include that clause, so you can’t hold it against me!”). Because academics like fidelity–reliable reproducibility is the cornerstone of any academic work (or, as they would say in Edwards’ field–which happens to be mine as well–immutable mobiles are necessary for knowledge generation areas as geographically large as the modern world).

It’s true that some people might stupidly be using PDFs as a replacement for HTML, but I suspect they are in the minority here, and any attempt to understand why people use PDFs via this lens is going to be misleading.

I have to agree with your article Jeff. I think that HTML is great and while PDF may excel in some areas this is more a call to continue refining HTML rather than to jump ship to PDF.

By the way – MY biggest problem with PDFs is that they are very hard to implement in software when it comes to READING them. If you are not 100% happy with the free drop-in ActiveX component (and there’s a lot not to be happy about), then you’re stuck with either licensing a third-party reader (at ridiculous prices) or trying to re-code a viewer from scratch that can handle the entire format and all of its backwards-compatible variants.

I think Reader has improved a lot in the last version but it still doesn’t do a number of things that would be useful in my line of work. For example, it is incredibly hard to take notes in a separate file while viewing a PDF on the screen. As an academic, in a world where most academic work of the past is being scanned as PDFs on sites like JSTOR, this is something that I need to do practically every day. My fellow graduate students are CONSTANTLY needing to read PDFs and take notes in separate files (the “in-line” notes preferred by Acrobat are not at all compatible with how most academics use notes and they have a lousy interface).

If the format was a bit more flexible–if there were better code snippets out there that would allow it to be easily used–it would not be hard at all to whip up a custom PDF viewer that worked for my purposes. But it isn’t, and the only large codebase for viewing PDFs out there is for xpdf, which is massive and only in C++. I don’t have the programming chops to adapt such a thing easily. Imagine how the format could be opened up, though, if people made simple libraries for VB.NET, PHP, RB, etc., that allowed one to raster pages and extract text reliably! Well, I know that I’d put it to good use, anyway…

John S said:

  1. Embedded font

Again, how do embedded fonts help the reader? As I see it, a reader cannot choose a bigger font to ease reading (just zooming, which is quite nasty for multicolumn docs).

  1. Whole document including images is one file. 1 document = 1 file. A “document” that is a directory of html file(s) and images feels unwieldy.

Whole HTML document including images is one URL. 1 Document = 1 URL.

I, as a reader, never see the files and directories. Why should I?

  1. The text in the images is anti-aliased according to my preferences. If the diagrams in this document were inline html images, I would be at the mercy of whatever AA settings the author used when creating the image.

Exactly the same in HTML. If the OS has antialiasing, your text will be antialiased to your settings (damn IE not supporting SVG). If the author has put text on an image (and I’ve seen this on PDFs too), you are limited to the AA setting the author chose, exactly the same.

  1. Zooming retains layout.

Layout is not content. I don’t want to see the layout, I want to see the content. And as others have said, zooming a multicolumn layout is very uncomfortable for the reader.

  1. Annotation/comments (again all within the one file), side-by-side viewing, quick rotation of the page, other viewing-based features…

Blog comments (again all in one URL); you can open as many browser windows as you like to view pages side by side; quick rotation of the screen; and the viewing-based features are kept out of the content because they don’t belong with the content.


Graham Stewart said:

What if you want to package up multiple pages (e.g. a manual) and make them available to everyone that visits your site?

Well, you set the manual on a webpage. If the users are browsing your site, they sure have a web browser, and it costs the same to publish (on Internet) a PDF or a webpage.

OTOH, if you want your users to print out the manual (or to have a copy to pass on to offliners), you could convert those HTML files to PDF on the fly for your users. There are lots of tools to do that in whichever language you are using for your webpage, be it php, java, asp, python, ruby, etc… (and here you can be totally independent of the browser quirks, so you just have to find the way to transform the page to something that will print fine).


And regarding the comments about citations and page numbers: DAMN, this is what the ANCHOR 'a href… is for. You make the reference directly with a hyperlink (don’t forget what the ‘#’ means in a URL); there’s no need for page and paragraph numbers that burden the user by making him search by hand, when you have already done the search and can provide him with an easier way.
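
To make the ‘#’ point concrete, here is a small sketch using Python’s standard `urllib.parse`. The URL, the helper name `cite`, and the anchor id `clause-4` are all hypothetical; the page author would have to put a matching `id` in the HTML for the link to land anywhere.

```python
from urllib.parse import urldefrag

def cite(page_url, anchor_id):
    """Build a link that points at one anchored spot in a page."""
    # Drop any fragment already on the URL, then attach the one we want
    base = urldefrag(page_url).url
    return base + "#" + anchor_id

link = cite("http://example.com/syllabus.html", "clause-4")
print(link)  # http://example.com/syllabus.html#clause-4

# Going the other way: split a cited link back into page and anchor
page, fragment = urldefrag(link)
print(page)      # http://example.com/syllabus.html
print(fragment)  # clause-4
```

The part after the ‘#’ never goes to the server; the browser fetches the page and then scrolls to the element with that id, which is the “search” the citing author has already done for the reader.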

The only drawback to this is how volatile the web has become. This is something forced onto us by people who don’t understand the medium: ISPs provide such a lousy service with webpages that we usually need separate hosting, and the idea Tim Berners-Lee had about people publishing research papers online is not sustainable anymore (well, maybe for works in progress). Even so, free publishing platforms make permalinking a bit easier.

I use PDFs only in cases where I am unsure what software people have access to. Mostly in cases where people have to see the information as it has been designed.

Browsers vary. Desktop software varies. Fonts vary. PDF solves all that, and that’s the only time I think it makes sense. However, the biggest reason I don’t use PDF even when I should is filesize.

But it’s nice to know that even if someone doesn’t have a PDF viewer (which I imagine is rare), they can download one for FREE and see what I need. The only inconvenience I’m putting on people is the download/install of freeware. That I can accept. Locking people out by sending them a format for which they must purchase software or convert the data first is just mean.

PDF is a great cross-platform solution to communicate with the assurance your readers are seeing things exactly as you want. The trick is knowing when things have to be seen as-designed and when they are acceptable otherwise (which are usually faster and more convenient).

The real problem with PDFs in a web context or for local usability is search. You’ll find that often ligatures are not searchable in a PDF document. This makes the terms invisible to the world (i.e. Google), at least until someone fixes this problem.

Here’s a real example from a blog entry I did on the mess created for search by PDFs: http://acl.ldc.upenn.edu/P/P06/P06-2051.pdf

Now try searching for the word ‘fulfills’. It follows the phrase ‘which fully’ on page 397 (7/8 in the PDF page numbering).

The reason you can’t see it is that ‘fi’ is what’s known as a ligature, and in the interest of prettiness, PDF treats it as a unit, so it doesn’t match the sequence of two characters ‘fi’.

At least in January 2008, PDFs are nice for printing and layout, but lousy for search.
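
One possible workaround on the indexing side, sketched in Python: Unicode compatibility normalization (NFKC) expands the single ligature character U+FB01 into the two letters ‘f’ and ‘i’. This assumes the PDF text extractor emits the ligature as that Unicode character, which is not guaranteed, and the sample string is made up to mirror the phrase above.

```python
import unicodedata

def fold_ligatures(text):
    """Expand compatibility characters such as the 'fi' ligature (U+FB01)."""
    return unicodedata.normalize("NFKC", text)

# What an extractor might hand back: 'ful' + ligature + 'lls'
extracted = "which fully ful\ufb01lls"

print("fulfills" in extracted)                  # False
print("fulfills" in fold_ligatures(extracted))  # True
```

Note that NFKC folds other compatibility characters too (full-width letters, superscript digits and so on), which is usually what you want for a search index but not for faithful display.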

I’d say it is a question of control over flow, attention, ads display, and restrictions of copying, printing and viewing. PDF gives more control to the publisher and HTML gives more control to the reader. Those who want to control readers use PDF; those who want to cooperate use HTML.

Maybe an idea for an article: How does your website show up on paper? A lot of blog sites show up horribly (not this site, though).

About PDF: it’s horrible for programmers. You even have to do your own line-breaking (PDF doesn’t know anything about soft line breaks).
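
A rough sketch of what that means in practice: before placing any text, the generating program has to decide where each line ends. The greedy breaker below is one simple way to do it, not what any particular PDF library does. The `measure` parameter is a stand-in I am assuming here for real font metrics; by default it just counts characters.

```python
def break_lines(text, max_width, measure=len):
    """Greedy line breaking: put as many words on each line as will fit."""
    lines = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        # Always accept the first word on a line, even if it is too wide
        if not current or measure(candidate) <= max_width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

print(break_lines("the quick brown fox jumps", 10))
# ['the quick', 'brown fox', 'jumps']
```

Each resulting line then has to be positioned explicitly on the page. A browser does all of this for you every time the window is resized.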

As was mentioned earlier, use the right tool for the job. PDFs generated by marketing for brochures and such can just as easily be uploaded and linked to on a website, rather than duplicating the information in HTML and CSS.

I rarely print to a printer, but I use PDF Creator to print off information from Excel, Word, etc. and then share the PDF; they’ll see what I see. It’s especially helpful with people like my parents who don’t have Word or Excel.

Hi Jeff,

I read your blog daily and like it a lot. Thanks for your effort in keeping such a nice blog.

I usually agree with your points of view, but not this time.

I agree that a lot of people use PDF in the wrong way, and using it in a pure web context should be avoided. The Jakob Nielsen article has influenced my view ever since it was written, and I still agree with bits of it. But Adobe has come a long way with PDF, and I think it is a nice replacement for PostScript (its main purpose).

HTML and PDF address totally different objectives and should not be compared.

HTML is good for web publishing, with good searchability in web environments.

In my humble opinion PDF should be used mainly for distribution of material to be printed, or document digital archiving.

Most operating systems have good native PDF support. Even Windows is getting better, and it is easy to get a good PDF reader nowadays.

You should not mix things up. PDF is not HTML. Different objectives.

Of course webmasters should know these differences, and that is really the problem.

IE (at least IE6, dunno about IE7) does a lousy job in printing oversized web pages. It chops off content and can’t resize.

A workaround is to offer a pdf version for printing purposes.
Note: I have no time nor desire to design another version of the same page just so that it prints properly in IE.

Those four problems you cite just aren’t a problem on the Mac:

  1. In OS X, Safari handles the PDF itself, so no need to link to an “out-of-browser experience”.

  2. In OS X, everyone has Safari installed. If you don’t like viewing PDFs in Safari, Preview is even nicer. And everyone has Preview installed, too. See http://watchingapple.com/2007/05/tip-opening-safari-pdfs-in-preview/ for more details.

  3. The layout control PDF offers is far superior to HTML, and quite exact. While I will agree that some people resort too quickly to PDF, it’s a judgment call to say whether it’s “mind-blowingly” better and whether HTML+CSS requires “no aesthetic loss at all”, but PDF is certainly attractive and pleasant to use.
    On my Mac, I smile whenever someone offers a PDF; in my experience, most Windows users frown because of the platform limitations you’ve already cited.

  4. Your argument for “one version of the content” would just as easily favor PDF. Plenty of content originates in more traditional, non-Web authoring tools, after all.
    On the Mac, you can easily print any document to PDF with no loss in fidelity. Print, upload, link–you’re done. Converting documents of any complexity at all to HTML would be significantly more difficult.

Graphic designers largely piss me off.

There’s nothing like making something that’s “super elegant” but annoying and then throwing a bitch fit when someone says, “Can you just make this so that the text is without all this other … stuff?”

As most have said: it’s print media vs. screen media.

I run Vista and I like it. No problem loading software to view PDFs. But I don’t like to HAVE to save/download something to view it as a PDF when the information therein isn’t meant to be sent to a printer for mass duplication.

The problem stems mostly from Adobe which is on far too many machines. It is bloated and eventually causes other non-pdf problems. I think if you had a simple pdf viewer as the default, many of the problems discussed would just go away.