The Trouble with PDFs

Dave_Pawson · January 4, 2008, 12:00am

“The web page, including its fonts, font sizes, and placement of material and size of the window, partly depends on the viewer’s preferences.”

Which IMHO is quite right, if the author wants the reader to read his or her content.

The simplest example is waiting until age or something else results in your eyesight deteriorating to the point where 8 point is no longer usable. Then user preference becomes important. If I can’t read it, I’ll move on, no matter how wonderful your typesetting.

Lecturer · January 4, 2008, 12:00am

As I have pointed out to many undergraduates … use the correct tool for any job.

PDF renderers don’t have the same set of problems that web browsers do, though they introduce other problems. But what are PDFs good for?

I mostly write documents in TeX and LaTeX. Why do I do that? Because I should not have to care how my text looks. I should say “my title is this” and something should render it to something that looks great.

I have used Word extensively in the past. I know that there is a nice, safe feeling with a WYSIWYG interface, but it comes at a cost. How many of us have fought that chart/picture/image/clipart into exactly the right place in the document, only to press space and have it move? TeX and LaTeX remove this hassle: you say “I want this picture with this caption” and it places it in the optimal position.

LaTeX renders best to PDF, and it’s there that you have the answer … PDFs were designed to be a printer and publisher standard - a more human-friendly version of PostScript.

As an author, I don’t necessarily want everyone to be able to copy-paste my work. I don’t have the option of preventing copy-paste in a plain text standard like [X]HTML.

PDF still has a place in this world - it just might be the case that it’s not the right tool for your job.

Sam · January 4, 2008, 12:00am

Usually I don’t give a rat’s ass about the designer’s layout so I prefer HTML. Browsers are the problem. It’s a pain to change styles - it should be one click and for god’s sake give me a matrix for changing colors. A built-in grep tool to search a subtree while in file:/// would be nice. As for packaging, what’s wrong with html.tar.gz? It would be nice if Firefox would open a tgz of HTML docs since the whole point of a browser is to reduce everything to a single mouse click. It’s a little hard to do one thing well when mousing is your one thing but now I’m getting philosophical. Cheers, fellow PDF-hater.

TravisJ · January 4, 2008, 12:00am

PDFs are all about printing. The site I engineer makes extensive use of charting and my customers demand high quality, low bandwidth, and printable reports. Printing using the browser is a joke. My site is a highly dynamic reporting site and the PDF also acts as an excellent archiving tool to snapshot a point in time and to snapshot the report criteria and customization the user spent so much time on. Saving a complicated web app as an HTML page is also a joke. My solution with all of those features is to create the charts in SVG format (XML) and dump your objects to XML, hit it with an XSLT to create XSL-FO and then use a 3rd party tool to create a PDF from that. Once you get your head wrapped around XSLT and XSL-FO it’s quite elegant. My tools of choice for doing this are ASP.Net, Dundas Ent Charting, AltSoft Xml2PDF and Altova Stylevision or Stylus XML Studio, and Telerik. Granted these tools add up to about $5000+ but hey if you want to play you gotta pay.
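For anyone who hasn't seen XSL-FO, here is a rough sketch of the "objects to XML to XSL-FO" step. It is not the toolchain listed above: plain Python and the standard library's ElementTree stand in for the XSLT stylesheet, and the report format (the `title` and `row` elements, `SAMPLE_REPORT`, and the function name `report_to_fo`) is made up for illustration. The resulting FO document would still have to go through a separate FO processor to become a PDF.

```python
import xml.etree.ElementTree as ET

# XSL-FO namespace; registering the prefix keeps the output readable (fo:root, fo:block, ...).
FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)

# Hypothetical report data dumped to XML (the element names are assumptions).
SAMPLE_REPORT = """
<report>
  <title>Quarterly Sales</title>
  <row label="Q1" value="120"/>
  <row label="Q2" value="135"/>
</report>
"""


def fo(tag):
    """Return the namespace-qualified name of an XSL-FO element."""
    return f"{{{FO_NS}}}{tag}"


def report_to_fo(report_xml):
    """Turn the hypothetical report XML into a minimal XSL-FO document."""
    report = ET.fromstring(report_xml)

    root = ET.Element(fo("root"))

    # Page geometry: one A4 page master with a body region.
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "A4",
        "page-height": "29.7cm",
        "page-width": "21cm",
    })
    ET.SubElement(page, fo("region-body"))

    # Content: a page sequence whose flow fills the body region.
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "A4"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})

    title = ET.SubElement(flow, fo("block"), {"font-size": "16pt", "font-weight": "bold"})
    title.text = report.findtext("title")

    for row in report.iter("row"):
        block = ET.SubElement(flow, fo("block"))
        block.text = f"{row.get('label')}: {row.get('value')}"

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(report_to_fo(SAMPLE_REPORT))
```

The point of the design is that the report data never knows about pages or fonts; all of that lives in the transformation, which is the role the XSLT plays in the real pipeline.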

Using the internet requires the user at minimum to have a modern browser, Flash, and PDF plugins. It’s just the way it is.

Riley · January 4, 2008, 12:00am

The central criterion for choosing an output format is the reader’s information transfer needs.

A secondary criterion, at least for some content, is the ability to either view or print the information. PDFs excel at the second: printing a PDF produces much nicer output than printing HTML.

But otherwise, PDFs are too often bloated. Adobe’s attempts to turn PDFs into another multimedia “experience” do nothing to address the problem.

Finally, PDFs would be much more palatable, I think, if one wasn’t so generally dependent on Adobe’s hideously bloated software to read them. Seriously – why should reading a PDF require a 25MB or greater download?

Dave · January 4, 2008, 12:00am

I agree that HTML is much much better for new, custom documents designed for the web.

However, I find that usually PDFs are used on the web because the website is displaying a document originally created in MS Word and then converted to PDF for easier viewing on operating systems other than Windows.

Although HTML would still be better, the output of saving a Word document as HTML sucks.

Will · January 4, 2008, 12:00am

When I was in university, lots of my Comp. Sci. courses (and a few non-comp sci courses) offered downloadable course notes, lecture slides, assignments and old tests in PDF format.

Many people have already pointed out the advantages:

PDFs are self-contained
They can be saved offline easily
The fixed layout is great for printing

Of course, the fixed layout sucks for online browsing. And to me, one of the worst “features” of Adobe Reader is “fast web” view. On by default, the feature causes the browser plugin to open PDF files immediately, even if they haven’t loaded completely. Unfortunately, if your PDF file is quite large (say 5-10 megs), it will take a few minutes to load completely, even on a high-speed connection. Woe to you if you happen to use the text search function. Your browser will lock up as the PDF plugin waits for the rest of the file to load, in order to search the entire document.

Even if they’ve fixed this problem in the latest version of Adobe Reader, I don’t care. This is an example of horrible design, and IIRC, it’s existed in at least two versions (5 and 6). IMO, any version later than 6 is horribly bloated anyway.

And I agree that PDFs are misused in a lot of contexts. I’ve seen sequences of pictures packed in a PDF. No formatting, just 1 picture per page. What if you want to extract the images and save them in their original format? Well, the latest version of Adobe Reader has removed the “image toolbar” functionality that used to let you do this.

PDFs are decent for (offline) manuals and e-books, IMO. I’ve seen sites which offer manuals in both PDF and HTML format. This is a good compromise.

CptBongue · January 4, 2008, 12:00am

I hate to point this out, but PDFs aren’t intended for layout and user interaction.

Case-in-point: Paperless offices and digitalization of document assets.

Nowadays, everyone is scanning in their documents, records, et al., and storing them using document management software in the form of multi-page TIFFs or PDFs.

When you see a technology you don’t understand, rather than just criticizing it like you’re doing here, why don’t you try accepting the fact that it has reached the point it has for a reason (even if you don’t know what it is)?

Dennis64 · January 4, 2008, 12:00am

I completely agree with you Jeff, more isn’t automatically better. Part of the problem is that the user/document creator doesn’t necessarily understand the alternative options available, at least I hope so. I do technical support – for programmers, not end users (keep this in mind).

It blows my mind how often I get an email that states something like “the attached Word document explains my problem”, then I open the attached Word document and it’s half a page of plain text. Now how did the person sending me this email come to the conclusion that it would be better to send me a Word document than simply type the same information into the body of the email?

The other thing I get all the time is a 4 MB Word doc that contains a screenshot that could have been sent as a 50 kB .jpg image file.

Again, this is from programmers, people who use computers professionally every day and should know what options are available to them.

Luke · January 4, 2008, 12:00am

Hey, for those of you looking for bookmarking in Adobe Reader, check this out: http://korayem.net/post/2007/12/Adobe-Reader-Tip-Open-a-PDFs-Last-Viewed-Page.aspx

Bill140 · January 4, 2008, 12:00am

Hey, for those of you looking for bookmarking in Adobe Reader, check this out: http://korayem.net/post/2007/12/Adobe-Reader-Tip-Open-a-PDFs-Last-Viewed-Page.aspx

Thanks Paul. I will look into this.

KashifS · January 4, 2008, 12:00am

All of you need to educate yourselves on Document Authoring. Ever wonder how some technical documents you read are available as HTML or PDF?

They probably used DocBook (industry standard amongst technical writers), LaTeX, or some other format to create documents that separate content from presentation (e.g. HTML is content, CSS is presentation).

So you write all your content expressed in some markup language, and let the presentation layer do all the formatting, layout, etc. to render to HTML or PDF.
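To make the idea concrete, here is a toy sketch in Python. It is not DocBook or LaTeX: the `DOCUMENT` structure and the two renderer functions are invented for illustration, and a plain-text renderer stands in for the PDF one. The content is written once, with no fonts or layout in it, and each output format gets its own presentation code.

```python
import textwrap
from html import escape

# Content only: (kind, text) pairs with no fonts, sizes, or layout decisions.
DOCUMENT = [
    ("title", "The Trouble with PDFs"),
    ("para", "Write the content once."),
    ("para", "Let the renderer decide fonts & layout."),
]

# Presentation for HTML: map each content kind to a tag.
HTML_TAGS = {"title": "h1", "para": "p"}


def render_html(doc):
    """Render the content as HTML, escaping special characters."""
    return "\n".join(
        f"<{HTML_TAGS[kind]}>{escape(text)}</{HTML_TAGS[kind]}>"
        for kind, text in doc
    )


def render_text(doc, width=60):
    """Render the same content as wrapped plain text with an underlined title."""
    lines = []
    for kind, text in doc:
        if kind == "title":
            lines += [text.upper(), "=" * len(text), ""]
        else:
            lines += [textwrap.fill(text, width), ""]
    return "\n".join(lines).rstrip()


if __name__ == "__main__":
    print(render_html(DOCUMENT))
    print()
    print(render_text(DOCUMENT))
```

Adding another output format means adding another renderer; the content itself does not change.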

Ray6 · January 4, 2008, 12:00am

There seem to be two types of people in this comment thread: those who are graphic or layout designers (or at least loyal to their cause), and those who are web programmers (or who fall into ranks behind them). The designers want to make sure the content they have painstakingly designed looks good, flows properly, and is digestible to the audience.

This is a Good Thing, at least often enough to be mentioned. Jeff, you have mentioned the importance of good design (especially visual) so many times on this blog I’m flabbergasted that you throw in unconditionally with the “I don’t give a rat’s ass about the designer’s layout” bunch. Surely you can understand the desire, and sometimes the need, to ensure that certain visual information is presented just so. If not, go talk to some print comics authors. Webcomics to this day consist generally of solid images, so the exacting layout can be immutable.

Next is the portability argument, which belies the web programmers’ intentions: talking of whether Google can read something, 1 URL = 1 document, .mht or html.tar.gz, OSX’s handling versus IE+Adobe’s, et cetera ad nauseam completely misses the point.

Portable. Document.

I work in an office where half the people need me to teach them how to attach something in an email. We don’t have a server to publish. We don’t have a domain name we’re willing to pay for eternally, nor a fancy content management system that stores everything in DocBook XML. Our index is a binder and a crappy internal site designed by monkeys with no proper search function. PDFs are a lifesaver in that environment. You can save them, print them straight off the intranet, toss them around in emails, back them up by shoving them on an external hard drive because god knows management doesn’t have a proper backup system in place. They’ll stay the same, and because PDF is now an ISO standard, I’m going to be able to read them like I can still read ASCII.

The HTML/print/CMS/CSS/anti-aliasing/whatever argument misses the point. Servers aren’t permanent. URLs aren’t permanent. A good chunk of things are needed offline, and a surprising number of things need to be as easy as possible to be punted around by the ignorant masses. PDF works better for that. It’s not the pretty, shiny web 2.x usually discussed here, but it’s the web a lot of people still live in.

Rob_Funk · January 4, 2008, 12:00am

@Shmork,
The two major open-source codebases out there for reading PDFs are xpdf and Ghostscript. Yes, xpdf is a bunch of C++, but if you don’t like that you can use derivatives of it as outside commands, rather than linking directly. That’s the way almost everyone uses Ghostscript, which can convert PDF (and PostScript) to any raster format (or back to PS/PDF).
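As a rough sketch of what "use it as an outside command" can look like, here is a small Python wrapper that shells out to Ghostscript to turn each page of a PDF into a PNG. It assumes the executable is called `gs` and is on your PATH (on Windows it usually has a different name); the function names, the default output pattern, and the resolution are just illustrative choices.

```python
import shutil
import subprocess


def ghostscript_command(pdf_path, out_pattern="page-%03d.png", dpi=150, device="png16m"):
    """Build the Ghostscript command line that rasterizes a PDF, one image per page."""
    return [
        "gs",                            # assumed executable name
        "-dNOPAUSE",                     # do not pause between pages
        "-dBATCH",                       # exit when the input is done
        "-dSAFER",                       # restrict file access from within the document
        f"-sDEVICE={device}",            # output device, here 24-bit PNG
        f"-r{dpi}",                      # resolution in dots per inch
        f"-sOutputFile={out_pattern}",   # %03d is replaced by the page number
        pdf_path,
    ]


def rasterize(pdf_path, **options):
    """Run Ghostscript as an external process; raises if it is missing or fails."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on PATH")
    subprocess.run(ghostscript_command(pdf_path, **options), check=True)


if __name__ == "__main__":
    # Only prints the command; call rasterize("notes.pdf") to actually run it.
    print(" ".join(ghostscript_command("notes.pdf")))
```

Nothing here links against any PDF code; your program only needs to know how to start a process and read the files it leaves behind.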

annoonn · January 4, 2008, 12:00am

Paul wins the thread.

the real problem is that PDF is commonly misused in completely inappropriate contexts.
Paul Coddington on January 3, 2008 02:53 AM

It’s not about saying PDF is “better” than HTML, or vice versa. It’s a question of picking the right medium for your message, which is really just a corollary to the first principle of effective communication: know your audience (and cater to their needs).

MattM · January 4, 2008, 12:00am

I think I can answer Mike Shaffer’s question (“Can Google’s spider crawl through PDFs?”) with the comment I wanted to make. Yes, it seems they can, because there’s that wonderful feature of Google that allows you to view (many, not all) PDFs as HTML instead. I think the fact that this feature was made says something about the massive inconvenience of PDFs.

BenjaminF · January 6, 2008, 12:00am

Why PDFs “suck”:

(1) Size. A single page of ‘text’ can inflate well beyond a quarter of a megabyte. A complex document can shoot into the tens of megabytes. An HTML page that looks virtually identical is often one tenth of the size or less.

(2) “Non-Web”. PDFs virtually always violate the utility of the web as a hyperlinked medium. They don’t cleanly integrate with the rest of the web because they are fundamentally standalone documents. Yes, there are extensions to let them do hypertext kind of things, but hardly anyone actually uses those features.

(3) Nearly unmodifiable. Once you have created a PDF it is close to ‘written in stone’.

Why PDFs “rock”:

(1) ANYONE can create one. A person makes whatever they want in their favorite program (Word, Photoshop, whatever) and “prints” it as a PDF. Upload, link, and they are done.

(2) They look “exactly” like what the person who created it wanted.

Fundamentally, PDFs do what non-technical people want: enable them to publish documents to the web while knowing little to nothing about the web. In the short term they care only that it lets them publish their information to the web with the least effort.

Long term, of course, PDFs make life hell for site maintainers.

AlasdairK · January 7, 2008, 12:00am

PDFs don’t look anything like what the person who created
them wanted to, because they’re displayed on my computer

Why PDFs Suck Alasdair King Page 1

Why PDFs Suck Alasdair King Page 2

screen, which isn’t multiple A4 pieces of paper. So
instead PDFs are slow, jerkily-scrolling mess that take

Why PDFs Suck Alasdair King Page 2

Why PDFs Suck Alasdair King Page 3

an age to load and break the back shortcut - and who would
tolerate a web page that takes thirty seconds to load text?

Why PDFs Suck Alasdair King Page 3

Why PDFs Suck Alasdair King Page 4

and can’t be searched easily and won’t zoom and the text
won’t wrap when you resize the font and they use a font you

Why PDFs Suck Alasdair King Page 4

Why PDFs Suck Alasdair King Page 5

can’t read very well because you have a print impairment
but most of all I hate the way they break the page into

Why PDFs Suck Alasdair King Page 5

Why PDFs Suck Alasdair King Page 6

sections that make no sense for anyone trying to read the
damn thing.

Why PDFs Suck Alasdair King Page 6

HenrikS · January 7, 2008, 12:00am

I’ll agree with you, Jeff, when IE6 is safe and sound in its well-deserved grave. As far as Kevin goes, I’ll think out loud: I don’t know much about the guy, but it seems like he comes from the print arena. If that is the case he can probably create the example you show above in about 5 min in Quark. Compare that with the horrors of getting the CSS to work in all browsers; it’s a science. No wonder Kevin prefers PDF at this point.

However, the future looks bright at the moment. IE7 is not too quirky and it will probably only get better with version 8.

John_Gruver · January 7, 2008, 12:00am

As noted many times above:

=====

"Here’s a real example from a blog entry I did on the mess created for search by PDFs: http://acl.ldc.upenn.edu/P/P06/P06-2051.pdf

Now try searching for the word ‘fulfills’. It follows the phrase ‘which fully’ on page 397 (7/8 in the PDF page numbering)."

=====

Preview.app finds this effortlessly on my Mac. That Adobe Reader 8 and other programs do not is an issue with Reader and those other programs, not with the PDF file format.