The Trouble with PDFs

loki_jf · January 3, 2008, 12:00am

For some reason I much prefer reading very big documents in PDF rather than HTML. I think that big documents without external links have a good home in PDF files; small ones should be fine in plain text.

MoJo · January 3, 2008, 12:00am

The real problem with HTML vs. PDF is the difficulty with which layout is defined, not how suited either format is to it. You can do pretty much anything you can do in a PDF in HTML, but the problem is HTML is not really designed for static layout for a fixed page size - it’s purposefully designed to be renderable in many different ways.

A well written HTML page will lay out well in everything from Lynx to IE7 to mobile browsers to FF2.x, both on the screen and in print form. That’s because HTML only gives hints as to layout, which should be chosen such that if the browser chooses to ignore them or change the page size or the available fonts or use a keypad instead of a mouse the information is still reasonably presented. PDFs take the opposite approach and define exactly every aspect of rendering.
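
To make the "hints, not fixed layout" point concrete, here is a minimal sketch in Python. It is not how any browser actually lays out a page; the `reflow` helper and the character-based widths are assumptions chosen only to show the same content fitted to two different window sizes without losing any text.

```python
import textwrap


def reflow(text, width):
    """Lay the same content out for a viewport `width` characters wide."""
    # Collapse existing whitespace first so only the content matters,
    # then let the "renderer" pick the line breaks for this width.
    return textwrap.wrap(" ".join(text.split()), width=width)


content = (
    "HTML only gives hints as to layout, so the browser is free "
    "to fit the text to whatever window it has."
)

narrow = reflow(content, 30)  # e.g. a phone-sized window
wide = reflow(content, 80)    # e.g. a desktop window

for line in narrow:
    print(line)
```

A fixed-page format would be the equivalent of storing `wide` as the document itself: anyone with a narrower window has to scroll or shrink it.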

That’s why HTML templates are so popular. You do the hard part, the layout, once. Test it in everything, make sure it prints okay etc, and then wedge the content into it. Wired.com still uses a FIXED WIDTH LAYOUT ffs. The reason I made my browser window 1280 pixels wide is that I want to see content presented over 1280 pixels. With HTML I am partly designing the layout.

PDFs are generally rubbish on screen anyway. Unless you have a nice 24" monitor you will either have to view them scaled down or only part of a page at a time, when they are obviously designed to be viewed in their entirety like a book.

WesternI · January 3, 2008, 12:00am

It’s funny that two of the arguments made in favor of PDF seem to be 1) PDFs are Portable and provide readers a precisely controlled experience on all platforms and 2) PDFs work waaaaay better on (insert favorite platform here).

MichaelG · January 3, 2008, 12:00am

And it’d be better as HTML: easier to hyperlink and search, more accessible to a wide audience, and it would certainly generate greater advertising revenue through the existing web ad ecosystem. […] But I’ll never understand how the founding editor of Wired could fall prey to such shallow PDF elitism-- completely missing the obvious and inherent power of the world’s HTML common denominator.

This text implies that the content is only available as PDF, but Kevin Kelly’s truefilms content has been, and continues to be, available as HTML:

http://www.kk.org/truefilms/index.php

It’s difficult to see why you’re calling it “elitism” when he’s simply making his content available in both HTML and PDF.

Ben · January 3, 2008, 12:00am

I have no trouble with PDFs, just the Adobe reader.

They’re lovely here, viewed with Evince Document Viewer (GNOME) on Linux.

I did a Google for “alternative PDF viewer downloads” and this looked good: http://www.foxitsoftware.com/pdf/rd_intro.php

I’ve trained myself just not to open PDFs when I’m on windows though - I just wait till I’m back with Linux!

joe11 · January 3, 2008, 12:00am

I prefer PDFs for computer science papers, which I generally print and read offline.

mahalie · January 3, 2008, 12:00am

Wow, this is the first time I’ve seen a post at CH and shook my head in disappointment. Tell me you’re just saying this to incite debate/drive traffic to your site?

I agree with Joe Chin and others. I don’t understand your assertion about “massive inconvenience”. It’s a lot easier to download a PDF that’s meant to be a book than it is to download a website, especially for normal (non-techie) users. Kelly made the perfect choice for the book. Is his blog in PDF format? No? Is his eBook? So what?!? This is just inane.

Tommy · January 3, 2008, 12:00am

The PDF format isn’t good for online reading, but beats HTML for content that is primarily meant for downloading (such as technical manuals/books).

Providing a printer friendly format that the reader can choose to print to PDF (which is a lot better than saving a bunch of separate html/css/gif/jpeg/png files to disk) works almost as well, but few sites I frequent have the sensibility to do this.

"Do we really want to maintain two different versions of the same content?"
Don’t modern content management systems generate both versions from the same source anyway?

Albin · January 3, 2008, 12:00am

The only problem with PDF as a representation is that browser integration isn’t seamless when moving from PDF to HTML. If PDF was treated more as a first-class citizen in a browser, there wouldn’t be this other-worldly experience.

PDFs can contain hyperlinks and can also be professionally typeset; arguably they are better than HTML. If you were serving TeX documents and could send different settings macros for screen, letter, A4, handheld - it would be better than what we have now. For those who have never written a TeX document, the basics are easy and the complex is possible.

If you had a PDF “web-browser” which dealt mostly with PDF content and then had to switch out to an HTML viewer to occasionally look at ugly HTML pages, your tune would be the same, only you’d be cursing HTML instead of PDF.

TeX is a professional typesetting language which can largely separate content from presentation, and existed long before HTML was invented. If we had used some of the ideas from TeX with a more uniform syntax such as s-expressions we may have avoided all the headaches of HTML/CSS/JavaScript which we have lived with. Yeah, that’s the ticket… TeX + Scheme = PDF

Of course, there are several lines being blurred now that browsers are being used more as a thick client for applications rather than just to view documents. Neither TeX nor HTML, for that matter, was really meant to be dynamically changed on the fly and re-rendered. However, if TeX syntax had been adjusted to sexps, it could be manipulated more easily than XML; Lisp has certainly proven that.

On the other hand, moving in the opposite direction -

Read more of Ted Nelson - http://ted.hyperland.com/buyin.txt

“Markup must not be embedded. Hierarchies and files must not be part of the mental structure of documents. Links must go both ways. All these fundamental errors of the Web must be repaired. But the geeks have tried to lock the door behind them to make nothing else possible.”

But, all this has been talked about before… extensively. Someone else will be cursing PDF a year from now and this discussion will happen again.

But these “representations” that we are talking about are distinct from the actual “resource” which is behind the scenes. I should be able to ask for either a PDF or HTML representation of a particular resource from a server. That’s what content negotiation is all about.
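
As a rough illustration of content negotiation, here is a sketch in Python of a server picking between an HTML and a PDF representation based on the client's `Accept` header. The function names and the `AVAILABLE` list are made up for this example, and the matching is deliberately simplified: it honours `q` values and wildcards but ignores the finer specificity rules of real HTTP.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q values keep their header order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(header, available):
    """Return the first available media type the client accepts, or None."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in available:
            return media_type
        if media_type == "*/*" and available:
            return available[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # "text/*" -> "text/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
    return None


# Hypothetical representations of one resource; the first is the default.
AVAILABLE = ["text/html", "application/pdf"]

print(negotiate("application/pdf, text/html;q=0.8", AVAILABLE))
```

The resource itself never appears here, which is the point: the stored form can be anything, as long as the server can produce each representation on request.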

That resource could be stored as a text file, TeX document, HTML, signed and encrypted, English or Chinese… but for future generations, it should be easy to transform into something else.

Binary PDFs are not easy to transform into a different format - and that is their single biggest failing as a resource format.

So, after all that, I’m saying that PDF is great as a representation, and terrible as a resource format. HTML is decent as both a representation and for storage as an underlying resource.

But in response to your article, modern web-browsers are tuned to work with HTML as a representation, and that is the only reason why PDFs are out of place. If browsers handled PDFs as easily as they do jpegs (we don’t need to launch a picture-viewer do we?), there would be no problem.

mvark · January 3, 2008, 12:00am

My 2 cents - http://mvark.blogspot.com/2008/01/how-to-mimic-google-searchgmails-view.html

AnonymousC52 · January 3, 2008, 12:00am

@rustyvz:
"PDF allows restrictions that you cannot get with HTML or PDF. You can:

lock a file with a password
prevent content from being copied
prevent printing the document"

No, you can’t do any of those except the first (password encryption), and the encryption is too weak to work. You obviously can’t “prevent content from being copied”; information doesn’t work that way.

Here’s a fun page explaining the similar “restrictions” put on fonts in the bad old days. Don’t miss the haiku at the end!
http://www.andrew.cmu.edu/user/twm/embed/dmca.html

Professor Tom’s talk of digitally signing PDFs put me in mind of this other fun page, although it’s not relevant to the security of PDFs, but rather to the security of MD5.
http://www.cits.rub.de/MD5Collisions/

Lerc · January 3, 2008, 12:00am

Foxit really does take the edge off PDF reading. It’s not ideal, but leaps and bounds better than Adobe’s …erm… thing.

While I don’t think PDFs are the entire answer, I really would like HTML to go away and never be seen again.

For all the flexibility, and the browsers that tout an Acid2 pass, the very existence of Acid2 points to something wrong to me.

For an explicitly formatted page, I’d even be happy with a format that had a list of [thing] along with an affine matrix.

[thing] could be any visual element the viewer knows about (individual glyphs, video, images, chickens)

It’d be inefficient on space but the solution for that could just be a compression layer that’s aware of the format. It could guess the next matrix for a series of glyphs quite easily. Compressing zero rotation would be obvious. Any browser using a method like this would have the compression/decompression layer as a black box they don’t care about.

The net result would be a browser that could render any page exactly with an almost trivial rendering engine.
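
Here is one way the "[thing] plus affine matrix" idea could be sketched in Python. None of this is an existing format: `Affine`, `Item`, `layout_glyphs` and `delta_encode` are names invented for the example, and the delta encoding only covers the translation part of the matrix, as a stand-in for the format-aware compression layer described above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Affine:
    """2D affine transform: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a: float = 1.0
    b: float = 0.0
    c: float = 0.0
    d: float = 1.0
    e: float = 0.0
    f: float = 0.0

    def apply(self, x, y):
        return (self.a * x + self.c * y + self.e,
                self.b * x + self.d * y + self.f)

    @staticmethod
    def translate(tx, ty):
        return Affine(e=tx, f=ty)

    @staticmethod
    def rotate(degrees):
        r = math.radians(degrees)
        return Affine(a=math.cos(r), b=math.sin(r),
                      c=-math.sin(r), d=math.cos(r))


@dataclass(frozen=True)
class Item:
    thing: str       # any visual element the viewer knows how to draw
    matrix: Affine   # where and how to draw it


def layout_glyphs(text, x, y, advance):
    """Place one glyph per character, each with its own translation matrix."""
    return [Item(ch, Affine.translate(x + i * advance, y))
            for i, ch in enumerate(text)]


def delta_encode(items):
    """Keep the first position, then only the change from the previous item.

    A run of evenly spaced glyphs turns into a run of identical deltas,
    which is the kind of redundancy a compression layer could exploit.
    """
    out = []
    prev = (0.0, 0.0)
    for item in items:
        cur = (item.matrix.e, item.matrix.f)
        out.append((item.thing, cur[0] - prev[0], cur[1] - prev[1]))
        prev = cur
    return out


page = layout_glyphs("PDF", 10.0, 20.0, 6.0)
print(delta_encode(page))
```

The renderer for such a page is just a loop: look up each `thing`, apply its matrix, draw. All the layout decisions have already been made by whoever produced the list.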

nexusprime · January 3, 2008, 12:00am

@Tom Clancy: In older versions of OS X, yes, but in Leopard, when using Safari, it opens inline in the browser window, with the context menu option of opening it separately in Preview if you want.

Roddy · January 3, 2008, 12:00am

For a pretty clear statement of why KK produced PDF ebooks, read here:

http://www.kk.org/cooltools/archives/002537.php

I have to say I agree with him on most parts. Although the concept of PDFs containing advertising truly sucks.

jpsa · January 3, 2008, 12:00am

The quotation at the head of this article says it all, really:

Because of the idiosyncratic way web browsers work,
designers do not have full control of what you as a
reader see on the web. The web page, including its
fonts, fonts sizes, and placement of material and
size of the window, partly depends on the viewer’s
preferences.

Well, I’ve got news for you, Kevin Kelly: /I’m/ the one trying to read it, not you. I know better than you do what size font is legible on my screen, and how big my screen is.

Yes, I agree with other commenters, PDF is a good delivery format for stuff that’s intended only to be printed. But it sucks big time for anything meant to be read on the screen.

WurdBendur · January 3, 2008, 12:00am

The experience of opening a book and having everything laid out clearly is something that can’t be surpassed. It provides all kinds of layout possibilities, you know, if that’s more important to you than the content.

But it also isn’t replicated by PDFs, at least not on-screen. Getting the same view of a document on screen means shrinking it to fit on my display, which is where resolution becomes a problem. After all, not all of us have enormous monitors. And who cares about layout when the text is illegible anyway? Apparently due to this limitation (particularly on handheld devices) it’s also possible to format PDFs so that they can reflow without shrinking the text to unreadable sizes. Of course this totally defeats the purpose of the format.

At least I know the HTML will (almost) always fit on my screen without sacrificing readability. I actually LIKE the fact that the page can be dynamically sized to suit my needs. In my opinion this feature of HTML, which is often cited as its greatest weakness when compared to PDFs, is actually its greatest strength.

Sure, PDFs are useful if you’re designing primarily for print. But onscreen, they’re really for two kinds of people:

Outmoded designers who only know how to design for print (are there any of these left?)
Vindictive designers who think their design is more important than the content they’re presenting and want to use PDFs as a medium to strike out at the web for cutting into their turf.

Admittedly, there is one place where PDFs are really useful. If the fonts are vitally important (perhaps the document uses fonts with characters that your readers are unlikely to have, say anything not in Unicode), or if the text and images are strongly integrated, then HTML simply may not suffice.

Tom_Dibble · January 3, 2008, 12:00am

@John S:

" Really? Then why is this “packaged” as a PDF?

http://www.si.umich.edu/~pne/PDF/howtoread.pdf

How does this “packaging” help me, the reader?

Embedded font
Whole document including images is one file. 1 document = 1 file. A “document” that is a directory of html file(s) and images feels unwieldy.
The text in the images is anti-aliased according to my preferences. If the diagrams in this document were inline html images, I would be at the mercy of whatever AA settings the author used when creating the image.
Zooming retains layout.
Annotation/comments (again all within the one file), side-by-side viewing, quick rotation of the page, other viewing-based features…

These are just things for this simple document you linked."

Embedded font: okay. And that’s important?

Document as a single file. In a modern OS, a folder can be treated as a single file (Package or Archive) as well. Windows hasn’t caught on there yet, but as long as the PDF proponents are saying the only problem with PDF is that IE/Adobe treat it like crap then I can claim the only problem with packages is that Windows treats them like crap. Also, you can very easily print an HTML page as a PDF document when/if you ever need to send it around as a single file. I routinely archive web purchase receipts and invoices by printing them to PDF straight from my browser; having them displayed as PDFs originally doesn’t save me any time or effort.

Anti-aliased: I’m not sure of the specific point here, other perhaps than that overlay text will anti-alias on images when that happens at render time instead of at generation time. You can do the same in HTML, except for IE’s poor support for such. Still, generally speaking, text superimposed over images is the least of my worries when reading content. How the primary content renders, which is generally given in paragraph form whether the delivery mechanism is PDF or HTML, is far more important. And, there, HTML does a damned fine job of allowing the user’s anti-aliasing preferences to be obeyed.

Zooming retains layout. Okay, zooming can retain layout in HTML as well, OR zooming can retain content. You have a choice (and the default is retaining content). Not sure how inflexibility is a win for the reader. For that matter, I’m not sure how it’s a win for the designer either.

Annotation/comments. If you’re talking about the user annotating the file, that’s a clear case for PDF. If I, as a user, want to be able to easily annotate a file without getting my hands dirty, I’ll print it to PDF (using my preferred font, anti-aliasing, and page sizing, thank you very much) and annotate from there. If you, as a designer, however, want to embed annotations in the document, it seems HTML’s facilities for dynamic content are far more advanced than PDF’s.

Side-by-side viewing: Huh? Requiring horizontal scrolling is a horrendous inconvenience of PDF. If you’re interested in how it looks on paper I can see side-by-side viewing as a benefit (again, though, with an HTML source document this is easily obtained by printing to PDF).

Quick rotation of the page: Again, huh? I don’t get your meaning here. I haven’t ever seen PDFs which can be rotated to landscape from portrait, whereas that is child’s play in HTML. Or, are you saying rotation of the page and content is a boon in PDF? Why the hell would you want to read text sideways? Again, if I as a reader want this, I can print my HTML doc to PDF and rotate it however I like.

The long and short of it is this: if you put a document out in HTML, it can easily be printed to PDF by anyone using a modern OS or a crappy OS but with a decent set of tools, when and if they need the “viewing-based features” of PDF. If you put it out in PDF, you lock in your particular preferences, your particular screen size and resolution, your particular paper size and orientation. These things can’t be “undone” by the reader.

Now, a few semi-points in PDF’s favor:

It does vector graphics better than HTML today (when will IE support VML or the like?) This means that HTML designers must “lock in” the display resolution of graphics often long before they are displayed, which is just as bad as PDF designers locking in all those things listed above.
Mathematical equations are clumsy in HTML (generally unsupported out of the box, and so get rendered as images and fall into the case above). PDF’s not really much better in this respect, but since it allows the rendered image to be vector-based instead of raster-based, and further to include actual characters where such make sense, it allows for resizing, selection, etc.
You can cross-reference by page number instead of by section (although way back when, before either PDF or HTML, I was taught that page number cross-referencing was bad form and one should always favor section number/name references in any semi-structured document).
It is easier to generate not-obviously-ugly-and-bloated PDF from Word than it is to generate not-obviously-ugly-and-bloated HTML from Word. This isn’t so much a feature of Word as it is a feature of the inability of most people to look “inside” a PDF document to see how atrociously it has been constructed by the PDF printer. Still, though, generate a 200k PDF from Word and people understand a lot better than if you generate a 150k HTML file which they can plainly see only has 2k of actual content.
PDFs allow for DRM (inasmuch as the particular reader obeys the DRM settings). If that is your business, then the choice is obvious. For the vast majority of PDFs I encounter on the web, though, DRM isn’t a consideration and isn’t employed at all.

@Bob Carpenter:
"Here’s a real example from a blog entry I did on the mess created for search by PDFs: http://acl.ldc.upenn.edu/P/P06/P06-2051.pdf

Now try searching for the word ‘fulfills’. It follows the phrase ‘which fully’ on page 397 (7/8 in the PDF page numbering).

The reason you can’t see it is that ‘fi’ is what’s known as a ligature, and in the interest of prettiness, PDF treats it as a unit, so it doesn’t match the sequence of two characters ‘fi’.

At least in January 2008, PDFs are nice for printing and layout, but lousy for search."

Safari found “fulfills” just fine. FYI. It seems someone has already fixed this problem.
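
For what it's worth, here is a small Python sketch of one way a viewer could get around the ligature problem: fold compatibility characters before matching. This assumes the PDF's text layer really contains the single 'fi' ligature character (U+FB01); the sample string is made up, and I'm not claiming this is what Safari actually does.

```python
import unicodedata


def fold_ligatures(text):
    """Expand compatibility characters such as the 'fi' ligature (U+FB01)."""
    # NFKC applies compatibility decomposition, which turns the single
    # ligature character back into the two letters 'f' and 'i'.
    return unicodedata.normalize("NFKC", text)


# Hypothetical text as it might come out of a PDF's text layer.
extracted = "which fully ful\ufb01lls the requirement"

naive = "fulfills" in extracted                   # plain substring search
folded = "fulfills" in fold_ligatures(extracted)  # search after folding

print(naive, folded)
```

The plain search misses the word and the folded one finds it, so the fix belongs in the search code rather than in the document.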

minglem · January 3, 2008, 12:00am

My experience is that a number of small website owners (especially non-profits) create their original documents in something like Word or even something obscure. They want to make the documents available on the Web but don’t have the resources to turn them into full-on HTML, so they resort to the next best choice: PDFs.

JD18 · January 3, 2008, 12:00am

Probably the new GNU pdf project will be so flexible that a firefox plugin could be readily cached into memory and display content in a quick, unjarring way. With Adobe it seems like there is a pretty dramatic resizing of an external window into the browser area, or it’s just external to the browser’s direct control altogether.

engtech · January 3, 2008, 12:00am

The one thing PDF has going for it, is guaranteed layout… something that HTML doesn’t achieve because of cross-browser issues.

Or at least, that’s what you’d think. That only holds true for Adobe reader; I’ve seen some PDFs not work properly with FoxIt.

But still, PDF is usually a better way to present paginated information you’d want to print than HTML.