Cleaning Word's Nasty HTML

There is another way, you know. Instead of creating a file in Word and then going through the rigmarole of cleaning it, why not just start with the OpenOffice.org (http://openoffice.org) Writer and have compliant documents from the start? Since it’s free and you can even carry it on your USB drive (http://portableapps.com/apps/office/openoffice_portable), it seems like the simplest solution to me. It even opens all of your old Word documents, though WordPerfect support is sketchy. The only thing it doesn’t do as well as or better than Word, in my experience, is macros, and I rarely use them in documents intended for the Web.

If you’re interested in converting a BLOCK of MS Word content (from, say, a copy/paste operation), I just blogged about how to do this. You may be able to use the same technique for an entire Word HTML doc. Just put the DHTML control into Design Mode (see post below) and then save web.Document.InnerHTML to a file.

Copy Paste HTML From MS Word: IE’s DHTML Editing Control (in a .NET WinApp)
http://blogs.msdn.com/noahc/archive/2006/10/16/copy-paste-html-from-ms-word-ie-s-dhtml-editing-control-in-a-net-winapp.aspx

Thank you! This will come in very handy, as I’m converting an intranet site at work and a lot of the pages are in god-awful Word HTML format.

Thanks for your function. Works fine!
Here’s the VB (.NET 1.1) function:

    Imports System.Text.RegularExpressions

    Public Function CleanWordHtml(ByVal html As String) As String
        Dim sc(7) As String
        'Get rid of unnecessary tag spans (comments and title)
        sc(0) = "<!--(\w|\W)+?-->"
        sc(1) = "<title>(\w|\W)+?</title>"
        'Get rid of classes and styles
        sc(2) = "\s?class=\w+"
        sc(3) = "\s+style='[^']+'"
        'Get rid of unnecessary tags
        sc(4) = "<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>"
        'Get rid of empty paragraph tags
        sc(5) = "(<[^>]+>)+&nbsp;(</\w+>)+"
        'Remove bizarre v: element attached to <img> tag
        sc(6) = "\s+v:\w+=""[^""]+"""
        'Remove extra lines
        sc(7) = "(\n\r){2,}"
        'Apply each pattern in turn, deleting whatever it matches
        For Each s As String In sc
            html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase)
        Next
        Return html
    End Function
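
For anyone following along in C#, here is a minimal sketch of the same pattern list with a tiny sample input. The class name `WordHtmlCleaner` and the sample markup are assumptions made for illustration; the patterns are the ones discussed in this thread, with their angle brackets written out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordHtmlCleaner
{
    // Patterns are applied in order; everything they match is deleted.
    private static readonly string[] Patterns =
    {
        @"<!--(\w|\W)+?-->",             // comments
        @"<title>(\w|\W)+?</title>",     // title element
        @"\s?class=\w+",                 // class attributes
        @"\s+style='[^']+'",             // single-quoted style attributes
        @"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>",
        @"(<[^>]+>)+&nbsp;(</\w+>)+",    // empty paragraphs
        @"\s+v:\w+=""[^""]+""",          // v: attributes on <img>
        @"(\n\r){2,}"                    // extra lines
    };

    public static string CleanWordHtml(string html)
    {
        foreach (string pattern in Patterns)
        {
            html = Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
        }
        return html;
    }
}
```

With this sketch, `WordHtmlCleaner.CleanWordHtml("<p class=MsoNormal><span style='font-size:10.0pt'>Hello</span></p>")` returns `<p>Hello</p>`.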

Thank you so much!
This was exactly what I needed right now.
Gr,
Ben

Running wordhtmlcleaner /? at the command line causes an exception, as does trying it on an HTML file saved from OpenOffice…

Thanks… You have saved me time.

After struggling to find out why curly quotes and the dash in the input Word file got converted to garbage in the HTML, I found I needed to change the file encoding used by File.ReadAllText. This change works for me (I have no idea what encoding scheme is used when you don’t specify the .Default):

string html = File.ReadAllText(filepath, System.Text.Encoding.Default);
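
Some background that may explain this: as far as I know, File.ReadAllText without an encoding argument looks for a byte order mark and otherwise decodes the file as UTF-8, while Word typically saves HTML in a Windows code page, so curly quotes end up as invalid bytes. The sketch below passes the encoding explicitly and adds a rough check for mis-decoded text. The names `EncodingCheck`, `ReadHtml` and `LooksMisdecoded` are hypothetical, not part of the cleaner app.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingCheck
{
    // Reads the file with an encoding chosen by the caller instead of
    // relying on the BOM detection and UTF-8 fallback of File.ReadAllText.
    public static string ReadHtml(string path, Encoding encoding)
    {
        return File.ReadAllText(path, encoding);
    }

    // U+FFFD is the replacement character .NET decoders emit by default
    // for bytes that are invalid in the chosen encoding, so finding it
    // in the result hints that the wrong encoding was used.
    public static bool LooksMisdecoded(string text)
    {
        return text.IndexOf('\uFFFD') >= 0;
    }
}
```

Calling `EncodingCheck.ReadHtml(filepath, Encoding.Default)` matches the one-liner above. Note that Encoding.Default is the system ANSI code page on .NET Framework but UTF-8 on .NET Core and later, so the same call can behave differently there.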

Nice code!

But I have one enhancement:

Change

// Get rid of empty paragraph tags
sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");

to:

// Get rid of empty paragraph tags
sc.Add(@"(<[^>]+>){1}&nbsp;(</\w+>){1}");

Otherwise, it will remove unexpected data.
For example, following line will be fully removed:
nbsp;

My fix resolves the problem.
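
To make the difference concrete, here is a small sketch that runs both variants of the empty-paragraph pattern over the same input. The class name `EmptyParagraphPatterns` and the sample markup are assumptions for illustration; the sample is just one case where the `+` version takes more than the empty paragraph.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyParagraphPatterns
{
    // Original pattern: any run of tags, then &nbsp;, then any run of closing tags.
    public const string Greedy = @"(<[^>]+>)+&nbsp;(</\w+>)+";

    // Suggested fix: exactly one tag on each side of &nbsp;.
    public const string Single = @"(<[^>]+>){1}&nbsp;(</\w+>){1}";

    // Deletes every match of the given pattern from the HTML.
    public static string Strip(string html, string pattern)
    {
        return Regex.Replace(html, pattern, "", RegexOptions.IgnoreCase);
    }
}
```

For the input `<p>Keep</p><p>&nbsp;</p>`, `Strip(input, EmptyParagraphPatterns.Greedy)` returns `<p>Keep` because the run of tags also swallows the preceding `</p>`, while `Strip(input, EmptyParagraphPatterns.Single)` returns `<p>Keep</p>`. The trade-off is that the `{1}` version only unwraps one level, so a nested empty element such as `<p><b>&nbsp;</b></p>` is left as `<p></p>`.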

Thanks.

THANK YOU!!!

I have tens of word documents I need to convert into accessible html format. Your cleaner app saved me hours and hours of work and frustration!

THANK YOU!!!

I don’t consider myself a stupid person (much), but how does one use this application? I want in!!

This is superb!!! I have been converting PDF to HTML, which has been very messy, and used Word as the spell checker, as none of the available freeware could do the job properly. I am trying to create the simplest possible HTML for the world’s most basic mobile phones, as only the Symbian ones can view PDF at the moment. This will save me hours. Thanks

Laurie - Extract it to a directory on your drive. Bring up a command prompt and change to that directory. Type the program name followed by the full path and filename of the HTML doc. It will drop the fixed HTML in the same dir as the program.

Works well for me too - been searching for years for an elegant solution to converting simple Word documents which covers Word tables of content, semantic markup for headings for SEO and simple tables!

Nice clear instructions still at:
http://www.dickson.me.uk/2007/02/08/howto-blog-using-a-microsoft-word-2003-file/

Download of console app now:
http://www.codinghorror.com/blog/archives/000485.html

I found it threw an exception if the source .htm file was still open, so you need to close Word first.

That’s great. One nice addition would be to search for common style attributes and replace them with generated class attributes.

So if there is a load of

<span style='font-size:10.0pt;font-family:"Courier New";color:#A31515'>

tags, like Word generates, it would add

span.style1
{
 font-size:10.0pt;
 font-family:"Courier New";
 color:#A31515;
}

to the in-page <style> tag and replace the style attributes with class='style1'

A bit more work but it would be really useful. HtmlTidy doesn’t do this from what I can see.
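
A rough sketch of how that could work, assuming single-quoted style attributes as in the Word output above. The class name `StyleToClass` and the generated `styleN` names are made up for illustration, the rules use plain `.styleN` selectors rather than `span.styleN`, and elements that already carry a class attribute are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class StyleToClass
{
    // Replaces each distinct style='...' value with class='styleN' and
    // returns, via css, the rules that would go into the page's <style> tag.
    public static string Convert(string html, out string css)
    {
        var classNames = new Dictionary<string, string>();
        var rules = new StringBuilder();

        string result = Regex.Replace(html, @"style='([^']+)'", m =>
        {
            string style = m.Groups[1].Value;
            string name;
            if (!classNames.TryGetValue(style, out name))
            {
                // First time this exact style text is seen: generate a class.
                name = "style" + (classNames.Count + 1);
                classNames.Add(style, name);
                rules.Append("." + name + " { " + style + " }\n");
            }
            return "class='" + name + "'";
        });

        css = rules.ToString();
        return result;
    }
}
```

Identical style strings share one generated class, so a document full of the same Courier New span ends up with a single rule. This would have to run before any step that strips style attributes.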

This is a work in progress, but the Drupal modules

http://www.drupal.org/project/word2web and
http://www.drupal.org/project/xslt_book

clean up Word HTML with XSLT expressions, and do a quite nice job with it. And since it’s just XSL at the core, you can download them, rip out the stylesheets, and use them in whatever environment you like.

Thanks for sharing that very handy code! And thanks to everybody else for all the illuminating comments, too. What a great thread.

For the record, though, to call Word’s HTML output garbage isn’t really fair. To the garbage.

That’s an interesting point about using OpenOffice. Have it but had never thought of using it to get around this M$ issue.

thanks for the great!

A few small suggestions:

  1. Keep the original file extension.
    (At the moment a .html file is renamed to .htm.)
  2. Accept an optional target file name as a second argument.
    (ex: whc foo.htm target.htm)

great thanks again. ^^y

This worked great for Word 2007, except that a ’ comes out as ?T and I cannot fix it with a simple Replace. Left and right quotes, as well as a few other characters, turn into weird-looking things like that too. Any ideas how to avoid this?

hi

i use the following

// remove inner <?...> declaration
//
((\<\s*\?)(.*?)(\s*/\s*\>))

//remove o:p like constructs
(\<\w\:\w\>(.*?)(<\/\w\:\w\>))

before running Tidy, and it has worked for me so far. Hoping it continues to work.

kazim

THANK YOU VERY MUCH!!!