Cleaning Word's Nasty HTML

I recently wrote a Word 2003 document that I later turned into a blog post. The transition between Word doc and HTML presented some problems. Word offers two HTML options in its save dialog: "Save as HTML" and "Save as Filtered HTML". In practice, that means you get to choose between totally nasty HTML and slightly less nasty HTML.


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2006/01/cleaning-words-nasty-html.html

That’s what you get for trying to use Word as a blog editor. :wink:

The deliverable, in this case, WAS a Word doc. So it didn’t make sense to do it as HTML first. Or so I thought at the time, anyway…

Your macro does a thing that reminds me of a tangential question, which is why so many people don’t use styles in Word. For all the yacky-yack about CSS in HTML, you’d think people would be a lot hipper to the virtues of a style sheet in Word.

Given the way we work around here, a conversion from WordML/.doc format to HTML is no good unless it preserves (references to) our styles. As it happens, Word’s conversion chooses to apply local formatting to certain conversions (w/in tables, if I remember correctly), making our styles useless.

For all the yacky-yack about CSS in HTML, you’d think people would be a lot hipper to the virtues of a style sheet in Word.

But you can’t view the markup “tags” in Word. It’s all magical and hidden, which makes it far more difficult to deal with. At least with HTML you can do a view source and see what’s wrong; with Word you’re just plain screwed. This happens to me all the time in Word, too. I’m happily editing a paragraph and all of a sudden I delete some hidden magic tag and I’m hosed. Drives me absolutely bonkers.

Word’s conversion … mak[es] our styles useless.

Right. When you save a Word doc as HTML, every single paragraph has a style. Every. Single. Paragraph!

If you really really want HTML, you should start with HTML, because Word is absolutely not the way to build HTML.

When I have to paste something from Word into an HTML doc, I’ve found that FCKeditor (http://www.fckeditor.net/) has a nice utility that imports text from MS Word. It’s also really nice about creating XHTML.

Good little editor if you have people posting lots of stuff from Word onto a website.

Anybody got an idea why Word creates html that way?

Some form of “We need to sell more operating systems, so we need to make the applications work more” ? :slight_smile:

Documents written for paper and slides should not be treated the same as a web page. It is a great feature, or with Jeff’s tool it can be a great feature, but since the output is not 100% compliant HTML and does not display the same in every browser, why make it so dramatic? Why generate such complicated HTML when the viewer probably does not have the required font, browser, or operating system?

No, this is not an I-hate-MS thread. I just don’t like Word, and I still have nightmares from my studies, when Office 97 crashed because it could not handle the size of my paper and the number of equations. Yes I know things have changed, but if it creates html like that, how does it handle word documents? :frowning:

/P

Any reason you didn’t use HtmlTidy? It’s easy enough to run SWIG over it and use it from .NET.

I was just about to suggest HTMLTidy as well, although it doesn’t seem to do a complete cleanup of word. I’m looking to create an open-source Java project that whitelists standard HTML tags and reduces Word HTML to simple HTML as well. (There is a small blurb on the project’s intentions at http://www.critical-masses.com/projects.html under HTMLMin.) I’ve been busy and never really got that one off the ground, although I do need something to replace the combination of HTMLTidy and regex cleanup I’m using now. I just learned that Blogger has a Word add-in that pretty much does the job, though.
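For anyone curious what the whitelist idea could look like, here is a minimal sketch in Python rather than Java, using only the standard library's `html.parser`. It is not the HTMLMin project's code; the tag and attribute lists and the names `WhitelistCleaner` and `clean` are assumptions made up for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: adjust to whatever your blog actually allows.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "ul", "ol", "li", "br",
                "h1", "h2", "h3", "blockquote", "pre", "code"}
ALLOWED_ATTRS = {"a": {"href"}}           # attributes kept per tag
VOID_TAGS = {"br"}                        # tags that take no closing tag
SKIP_CONTENT = {"style", "script", "title"}  # drop these tags and their text


class WhitelistCleaner(HTMLParser):
    """Re-emit only whitelisted tags; everything else is dropped, text is kept."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # depth inside SKIP_CONTENT elements

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip += 1
            return
        if self._skip or tag not in ALLOWED_TAGS:
            return
        keep = ALLOWED_ATTRS.get(tag, set())
        attr_text = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs if name in keep
        )
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            if self._skip:
                self._skip -= 1
            return
        if not self._skip and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))


def clean(html_text):
    parser = WhitelistCleaner()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Because unknown tags are dropped but their text is kept, things like Word's `<o:p>` markers and `<span>` wrappers simply disappear, and comments (including conditional ones) are ignored by the parser's default handlers.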

Blogger’s add-in only works with Blogger. If you could write a Word add-in that works with just about any blog/CMS package, you might have something.

I know there are blog tools out there, but with so many folks using Word, why not make a Word plugin?

This is great, thanks. I normally compose my blog posts in Word, as I like to keep them as an off-line collection of Word documents as well (for my memoirs one day, you see).

I was just about to suggest HTMLTidy as well, although it doesn’t seem to do a complete cleanup of word.

That’s right, HTMLTidy doesn’t clean out all the craziness. To see the craziness yourself, just save a Word doc as HTML or Filtered HTML and view it in a text editor.

Blogger’s add-in only works with Blogger

Vertigo Software (i.e., we) wrote the Blogger add-in for Word:

http://help.blogger.com/bin/answer.py?answer=1180topic=14

The add-in is written in VB6 for compatibility reasons…

I have backported the code to .NET 1.1 and added a few features - remove class attributes when they contain Microsoft classes (i.e. name beginning with Mso), ignore spans and ignore divs (leave those tags intact, but still remove attributes). Is it fine if I post my modified version (it is your code after all)?
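To make those two features concrete, here is a rough Python sketch of the same ideas. It is not the backported .NET code: the function names and regular expressions are assumptions for illustration, and regexes like these can be fooled by a `>` inside an attribute value.

```python
import re

# Any opening tag, e.g. <p class=MsoNormal>. Closing tags are not matched.
TAG = re.compile(r'<[a-zA-Z][^>]*>')

# A class attribute whose value starts with "Mso", quoted or unquoted.
# Note: a multi-class value such as "MsoNormal foo" is removed as a whole.
MSO_CLASS = re.compile(
    r'''\s+(?i:class)\s*=\s*(?:"Mso[^"]*"|'Mso[^']*'|Mso[^\s>]*)'''
)

# Opening <span ...> or <div ...> tags, with whatever attributes they carry.
SPAN_DIV_OPEN = re.compile(r'<(span|div)\b[^>]*>', re.IGNORECASE)


def strip_mso_classes(html_text):
    """Remove class attributes naming Microsoft (Mso*) classes, inside tags only."""
    return TAG.sub(lambda m: MSO_CLASS.sub("", m.group(0)), html_text)


def strip_span_div_attrs(html_text):
    """Keep span and div tags intact but drop all of their attributes."""
    return SPAN_DIV_OPEN.sub(lambda m: f"<{m.group(1)}>", html_text)
```

Running both passes over `<p class=MsoNormal><span lang=EN-US>Hi</span></p>` leaves `<p><span>Hi</span></p>`, while a non-Microsoft class such as `class="keep"` on a paragraph is left alone.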

Very interesting thread.
I also knew about HTMLTidy, and I was not convinced: first because it’s not so easy to use, and second because it’s incomplete.
I’m used to the Dreamweaver “Clean Word 97/100” function. From what I remember, around 2/3 of the page size was deleted, but I still had to search manually for leftover “MSO” crap.
How do you rank this Dreamweaver MX tool, if you know it?

Heh… when I saw the post I thought that the miracle solution that would clean up Word HTML while preserving the formatting and look of the document was finally here. I wonder if there is one :slight_smile:

This is great. I’d love to be able to use this with a Drupal website I’m working on. Until I know how to create a Drupal input filter module in PHP, I’ll just use one of the two posted variations to clean up Word and Publisher files for posting. Or perhaps I’ll add a form so users can drag and drop a file and see the cleaned-up HTML, which they can copy and paste into Drupal.

"Yes I know things have changed, but if it creates html like that, how does it handle word documents? :-("
Considering the Word format has changed to an XML-based one, I would guess a lot has changed, and conversion to HTML will be easier and slightly cleaner.

Hey there, I have been on a fruitless search for years for something that might help me get docs straight into nice, sanitized HTML.

I have a VB6 app I wrote a while back (for compatibility reasons) that is based on htmltidy, although I did the rest with regular expressions. It works fine, but I just wanted something neater, and now that I can use .NET it is time for another go.

Any suggestions on going from a word doc to an online published clean html file?

I guess I will have to write an ActiveX control that takes in a Word doc and publishes the crap-free result (I think it will be better to have the processing done on the client, although updates could be an issue)… hmm, I have typed too much.

Any suggestions on going from a word doc to an online published clean html file?

That’s exactly why I wrote this post-- I needed to go from Word doc to published HTML file!

  1. Save the word doc as “filtered HTML”
  2. Run this utility on the saved HTML file

voila. :wink:
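If you want to script step 2 yourself, the shape of it is roughly the following Python sketch. It is not the utility from the post: `clean_markup` is a deliberately small stand-in, and the `windows-1252` default encoding is an assumption you should check against the charset declared in your saved file.

```python
import re
from pathlib import Path

# Stand-in cleanup rules (assumptions, not the post's actual rule set):
COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)       # comments, incl. conditional ones
OFFICE_TAG = re.compile(r'</?o:[^>]*>', re.IGNORECASE)  # <o:p> and friends
STYLE_ATTR = re.compile(r'''\s+style\s*=\s*(?:"[^"]*"|'[^']*')''', re.IGNORECASE)


def clean_markup(html_text):
    """Apply the stand-in rules in order and return the cleaned markup."""
    html_text = COMMENT.sub("", html_text)
    html_text = OFFICE_TAG.sub("", html_text)
    return STYLE_ATTR.sub("", html_text)


def clean_file(src, dst, encoding="windows-1252"):
    """Read the saved HTML file, clean it, and write the result to dst."""
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(clean_markup(text), encoding=encoding)
```

Something like `clean_file("post.htm", "post-clean.htm")` would then cover step 2, with the file names here being placeholders.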

Thanks dude!.. your code does a great job cleaning those word tags!.. I love it.