Cleaning Word's Nasty HTML

Shrini · August 15, 2007, 12:00am

Thanks… You have saved me time.

laskater · August 20, 2007, 12:00am

After struggling to work out why curly quotes and dashes in the input Word file were converted to garbage in the HTML, I found I needed to change the encoding used by File.ReadAllText. This change works for me (I have no idea what encoding is used when you don’t specify Encoding.Default):

string html = File.ReadAllText(filepath, System.Text.Encoding.Default);
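The garbage described above is what you get when bytes are decoded with the wrong code page. On a Western-locale Windows machine, Encoding.Default is typically Windows-1252, where curly quotes and dashes are single bytes in the 0x80-0x9F range; read as UTF-8, those lone bytes are invalid. Here is a minimal JavaScript (Node.js) sketch of the difference. The helper name decodeWindows1252 and the sample bytes are illustrative assumptions, not part of the original tool:

```javascript
// Windows-1252 differs from Latin-1 only in bytes 0x80-0x9F.
// Index i holds the code point for byte 0x80 + i; 0 marks an unassigned byte.
const CP1252_HIGH = [
  0x20AC, 0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
  0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0, 0x017D, 0,
  0, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
  0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0, 0x017E, 0x0178,
];

// Hypothetical helper: decode a Buffer of Windows-1252 bytes into a string.
function decodeWindows1252(bytes) {
  let out = '';
  for (const b of bytes) {
    if (b >= 0x80 && b <= 0x9F) {
      const cp = CP1252_HIGH[b - 0x80];
      out += String.fromCharCode(cp || 0xFFFD); // unassigned byte -> replacement char
    } else {
      out += String.fromCharCode(b); // ASCII and 0xA0-0xFF map straight to Unicode
    }
  }
  return out;
}

// Windows-1252 bytes for: left curly quote, "Hi", right curly quote, space, en dash, " ok"
const sample = Buffer.from([0x93, 0x48, 0x69, 0x94, 0x20, 0x96, 0x20, 0x6F, 0x6B]);

const wrong = sample.toString('utf8');   // each invalid byte becomes U+FFFD
const right = decodeWindows1252(sample); // curly quotes and en dash survive

console.log(JSON.stringify(wrong));
console.log(JSON.stringify(right));
```

The same reasoning applies in the other direction: a file that really is UTF-8 will come out garbled if it is decoded as Windows-1252, so the encoding passed to the reader has to match how Word saved the file.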

IgorS · September 7, 2007, 12:00am

Nice code!

But I have one enhancement:

Change

// Get rid of empty paragraph tags
sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");

to:

// Get rid of empty paragraph tags
sc.Add(@"(<[^>]+>){1}&nbsp;(</\w+>){1}");

Otherwise, it will remove unexpected data.
For example, the following line will be fully removed:
nbsp;

My fix resolves the problem.

Thanks.
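To see why the quantifiers matter, here is a small JavaScript sketch. It assumes the two patterns originally read (<[^>]+>)+&nbsp;(</\w+>)+ and (<[^>]+>){1}&nbsp;(</\w+>){1}, which is consistent with the JavaScript port further down this thread; the sample inputs are illustrative and are not the line from the comment above:

```javascript
// Assumed reconstruction of the two patterns, written as JavaScript regexes.
const original = /(<[^>]+>)+&nbsp;(<\/\w+>)+/gi;  // one or more tags on each side
const fixed = /(<[^>]+>){1}&nbsp;(<\/\w+>){1}/gi; // exactly one tag on each side

// Illustrative inputs: a real paragraph followed by an empty one,
// and an empty paragraph inside a table cell.
const paragraphs = '<p>Text</p><p>&nbsp;</p>';
const tableCell = '<td><p>&nbsp;</p></td>';

// The greedy "+" also swallows neighbouring tags that belong to other content.
const greedyParagraphs = paragraphs.replace(original, ''); // closing </p> of "Text" is lost
const greedyCell = tableCell.replace(original, '');        // the whole cell disappears

// With {1}, only the innermost wrapper around &nbsp; is removed.
const fixedParagraphs = paragraphs.replace(fixed, '');
const fixedCell = tableCell.replace(fixed, '');

console.log(greedyParagraphs, '|', fixedParagraphs);
console.log(JSON.stringify(greedyCell), '|', fixedCell);
```

Because [^>]+ matches closing tags as well as opening ones, the "+" version can start its match at the closing tag of the previous element, which is how unrelated markup ends up deleted.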

Paivi · January 3, 2008, 12:00am

THANK YOU!!!

I have tens of word documents I need to convert into accessible html format. Your cleaner app saved me hours and hours of work and frustration!

THANK YOU!!!

Laurie · February 26, 2008, 12:00am

I don’t consider myself a stupid person (much), but how does one use this application? I want in!!

Kris · March 7, 2008, 12:00am

This is superb!!! I have been converting PDF to HTML, which has been very messy, and used Word as the spell checker, as none of the available freeware could do the job properly. I am trying to create the simplest of HTML for the world's most basic mobile phones, as only the Symbian ones can view PDF at the moment. This will save me hours. Thanks

Laurie - Extract it to a directory on your drive. Bring up a command prompt and change to that directory. Type the program name followed by the full path and filename of the HTML doc. It will drop the fixed HTML in the same dir as the program.

DaveC · April 22, 2008, 12:00am

Works well for me too - been searching for years for an elegant solution to converting simple Word documents which covers Word tables of contents, semantic markup for headings for SEO and simple tables!

Nice clear instructions still at:
http://www.dickson.me.uk/2007/02/08/howto-blog-using-a-microsoft-word-2003-file/

Download of console app now:
http://www.codinghorror.com/blog/archives/000485.html

I found it threw an exception if you still have the source .htm file open - so you need to close Word first.

Chris_W · May 28, 2008, 12:00am

That’s great. One nice addition would be to search for common style attributes and replace them with generated class attributes.

So if there is a load of

<span style='font-size:10.0pt;font-family:"Courier New";color:#A31515'>

tags, like Word generates, it would add

span.style1
{
 font-size:10.0pt;
 font-family:"Courier New";
 color:#A31515;
}

to the in-page style tag and replace the style attributes with class='style1'

A bit more work but it would be really useful. HtmlTidy doesn’t do this from what I can see.
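A rough JavaScript sketch of that idea follows. It is not part of the original tool; the function name stylesToClasses, the generated class names, and the restriction to span tags whose first attribute is a single-quoted style are all assumptions made to keep the example short:

```javascript
// Hypothetical helper: replace repeated inline styles on <span> tags with
// generated classes and return the matching CSS rules.
// Assumes style is the first attribute and is single-quoted, as Word writes it,
// and that no declaration value contains a semicolon.
function stylesToClasses(html) {
  const classByStyle = new Map(); // style text -> generated class name

  const body = html.replace(/<span\s+style='([^']*)'/gi, (match, style) => {
    if (!classByStyle.has(style)) {
      classByStyle.set(style, 'style' + (classByStyle.size + 1));
    }
    return "<span class='" + classByStyle.get(style) + "'";
  });

  // Build one CSS rule per distinct style string.
  const rules = [];
  for (const [style, name] of classByStyle) {
    const declarations = style
      .split(';')
      .map((d) => d.trim())
      .filter(Boolean)
      .map((d) => '  ' + d + ';')
      .join('\n');
    rules.push('span.' + name + '\n{\n' + declarations + '\n}');
  }

  return { html: body, css: rules.join('\n') };
}

const wordStyle = 'font-size:10.0pt;font-family:"Courier New";color:#A31515';
const input = "<span style='" + wordStyle + "'>a</span><span style='" + wordStyle + "'>b</span>";
const result = stylesToClasses(input);

console.log(result.html);
console.log(result.css);
```

The returned css string would still need to be written into the page's style tag; identical style strings share one class, so the more repetitive Word's output is, the more this saves.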

tom18 · June 30, 2008, 12:00am

This is a work in progress, but the Drupal modules

http://www.drupal.org/project/word2web and
http://www.drupal.org/project/xslt_book

clean up Word HTML with XSLT expressions, and do quite a nice job with it. And since it’s just XSL at the core, you can download them, rip out the stylesheets, and use them in whatever environment you like.

svend · July 6, 2008, 12:00am

Thanks for sharing that very handy code! And thanks to everybody else for all the illuminating comments, too. What a great thread.

For the record, though, to call Word’s HTML output garbage isn’t really fair. To the garbage.

That’s an interesting point about using OpenOffice. Have it but had never thought of using it to get around this M$ issue.

vegalou · July 31, 2008, 12:00am

Thanks for the great tool!

A few small suggestions:

Keep the file extension.
(Currently, if .html is used, it gets changed to .htm.)
Accept a target name as an optional second argument.
(ex: whc foo.htm target.htm)

Many thanks again. ^^y
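Both suggestions come down to how the output file name is chosen. A minimal Node.js sketch of that logic might look like this; the helper name resolveOutputPath and the ".cleaned" marker are hypothetical and do not reflect how the actual tool names its output:

```javascript
const path = require('path');

// Hypothetical helper: choose where the cleaned HTML should be written.
// If the caller supplies a target name, use it as-is; otherwise derive one
// from the input name, keeping its extension (.htm stays .htm, .html stays .html).
function resolveOutputPath(inputPath, targetPath) {
  if (targetPath) {
    return targetPath;
  }
  const parsed = path.parse(inputPath);
  // ".cleaned" is an illustrative marker, not the suffix used by the real tool.
  return path.join(parsed.dir, parsed.name + '.cleaned' + parsed.ext);
}

console.log(resolveOutputPath('foo.html'));
console.log(resolveOutputPath('foo.htm', 'target.htm'));
```

In a command-line wrapper, the two arguments could be taken from process.argv.slice(2), so that the second one is simply undefined when no target is given.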

Jeremy · January 7, 2009, 12:00am

This worked great for Word 2007 except for a ’ turns out to be ?T but I cannot do a simple Replace. Left and right quotes as well as a few other characters turn into weird looking things like that as well. Any ideas how to avoid this?

kazim_mehdi · January 14, 2009, 12:00am

hi

i use the following

// remove inner ?... declaration
//
((\<\s*\?)(.*?)(\s*/\s*\>))

//remove o:p like constructs
(\<\w\:\w\>(.*?)(<\/\w\:\w\>))

before running Tidy, and it has worked for me so far. Hoping it continues to work.

kazim

Onur_Onal · February 26, 2009, 12:00am

THANK YOU VERY MUCH!!!

Jason_Kemp · February 6, 2010, 12:00am

Hey Jeff,

Love the tool.

Of course, since I’m such a klutz at the keyboard, I found a bug when I typed the file name wrong. There should be a return after line 16: Console.WriteLine("File doesn't exist"); to prevent the user seeing the exception get thrown.

Luckily, I’m feeling frisky

Brandon · February 6, 2010, 12:00am

I know Windows is not big on drag-and-drop, but is there a drag-and-drop version of this in the works so we can just drag a folder full of mucked up Word HTML docs onto the program and have it clean them all in batch?

Jason · February 6, 2010, 12:00am

Excellent. Thanks a lot for the time savings!

SerkanY · April 22, 2010, 12:00am

Jeff, I have ported your code to JavaScript.

Thank you, it helped so much.

function cleanWord(str){
    // get rid of unnecessary tag spans (comments and title)
    str = str.replace(/<!--(\w|\W)+?-->/gim, '');
    str = str.replace(/<title>(\w|\W)+?<\/title>/gim, '');
    // Get rid of classes and styles
    str = str.replace(/\s?class=\w+/gim, '');
    str = str.replace(/\s+style='[^']+'/gim, '');
    // Get rid of unnecessary tags
    str = str.replace(/<(meta|link|\/?o:|\/?style|\/?div|\/?st\d|\/?head|\/?html|body|\/?body|\/?span|!\[)[^>]*?>/gim, '');
    // Get rid of empty paragraph tags
    str = str.replace(/(<[^>]+>)+&nbsp;(<\/\w+>)/gim, '');
    // remove bizarre v: element attached to <img> tag
    // (doubled quotes are a C# verbatim-string escape; JavaScript needs single ones)
    str = str.replace(/\s+v:\w+="[^"]+"/gim, '');
    // remove extra lines: collapse runs of blank lines into one line break
    str = str.replace(/(\r?\n){2,}/g, '\n');

    // Fix entities (a regex with the g flag replaces every occurrence, not just the first)
    str = str.replace(/&ldquo;/g, '"');
    str = str.replace(/&rdquo;/g, '"');
    str = str.replace(/&mdash;/g, '–');

    return str;
}

hm2k · July 7, 2010, 12:00am

* Download the VS.NET 2005 solution (3kb)
* Download the CleanWordHtml console application (3kb, requires .NET 2.0 runtime)

The download links appear to be dead…

http://www.codinghorror.com/blog/files/WordHtmlCleaner-vsnet2005-solution.zip
http://www.codinghorror.com/blog/files/WordHtmlCleaner-executable.zip

Many of the comments above are spam.

Any advances?

Thanks.

Upaj_Os · December 8, 2010, 12:00am

My solution is simpler: If it’s going to wind up as HTML eventually, don’t write it in Microsoft Word. Use the free OpenOffice.org Writer, the open source derivative of Sun Microsystems’ Star Office Writer. It can compose documents with all the styles and nice appearance of Word documents, but when it exports the document as HTML, the resulting code is much, much cleaner. Any styles used in the document are included as HTML inline style blocks. It can also import Word documents, although I have no experience using it as a filter to clean up Word files prior to export to HTML. OpenOffice.org is a free suite similar to Microsoft Office, and is available for Microsoft Windows, Apple Macintosh and Linux systems. (Up to version 2.4 it also ran on Windows 98 systems, and you can probably still find the installer in archives.)