Cleaning Word's Nasty HTML

Hey Jeff,

Love the tool.

Of course, since I’m such a klutz at the keyboard, I found a bug when I mistyped the file name. There should be a return after line 16, Console.WriteLine("File doesn't exist");, so the user doesn’t see the exception being thrown.
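
For anyone following along without the C# source, the same guard can be sketched in JavaScript (Node.js). This is only an illustration of the early return described above: readAndClean and its cleaner parameter are made-up names, not part of Jeff's tool.

```javascript
const fs = require('fs');

// Hypothetical helper (not from the original tool): reads a file and runs
// it through `cleaner`, or reports a missing file and stops.
function readAndClean(filePath, cleaner) {
    if (!fs.existsSync(filePath)) {
        console.log("File doesn't exist");
        return null; // the early return: without it, the read below would throw
    }
    return cleaner(fs.readFileSync(filePath, 'utf8'));
}
```

With the return in place, a mistyped file name produces the message and nothing else.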

Luckily, I’m feeling frisky :)

I know Windows is not big on drag-and-drop, but is a drag-and-drop version of this in the works, so we can just drag a folder full of mucked-up Word HTML docs onto the program and have it clean them all in a batch?
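
In the meantime, a batch run over a folder is not hard to sketch. The version below is JavaScript for Node.js and rests on assumptions: cleanFolder is a made-up name, the cleaner argument stands in for whatever cleaning function you use, and the ".clean" output naming is my own choice so the originals are left untouched.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: cleans every .htm/.html file directly inside `dir`
// and returns the paths of the files it wrote. Subfolders are not visited.
function cleanFolder(dir, cleaner) {
    const written = [];
    for (const name of fs.readdirSync(dir)) {
        // Skip non-HTML files and output left over from an earlier run.
        if (!/\.html?$/i.test(name) || /\.clean\.html?$/i.test(name)) continue;
        const source = path.join(dir, name);
        if (!fs.statSync(source).isFile()) continue;
        // Assumed naming scheme: report.html becomes report.clean.html
        const target = source.replace(/(\.html?)$/i, '.clean$1');
        fs.writeFileSync(target, cleaner(fs.readFileSync(source, 'utf8')));
        written.push(target);
    }
    return written;
}
```

As far as I know, dropping a folder onto an executable or shortcut in Windows passes the folder's path as a command-line argument, so a small wrapper that calls cleanFolder(process.argv[2], yourCleaner) should get most of the way to drag-and-drop.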

Excellent. Thanks a lot for the time savings!

Jeff, I have ported your code to JavaScript.

Thank you, it helped so much.

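function cleanWord(str){
    // Get rid of unnecessary tag spans (comments and title)
    str = str.replace(/<!--[\s\S]+?-->/gi, '');
    str = str.replace(/<title>[\s\S]+?<\/title>/gi, '');
    // Get rid of classes and styles
    str = str.replace(/\s?class=\w+/gi, '');
    str = str.replace(/\s+style='[^']+'/gi, '');
    // Get rid of unnecessary tags
    str = str.replace(/<(meta|link|\/?o:|\/?style|\/?div|\/?st\d|\/?head|\/?html|\/?body|\/?span|!\[)[^>]*?>/gi, '');
    // Get rid of empty paragraph tags. The filler between the tags appears to
    // have been an &nbsp; that got rendered as a space when this was posted,
    // so the entity, the raw non-breaking space and a plain space are all accepted.
    str = str.replace(/(<[^>]+>)+(?:&nbsp;|\u00A0| )(<\/\w+>)/gi, '');
    // Remove bizarre v: attribute attached to <img> tag
    // (the doubled quotes were C# verbatim-string escapes; a regex literal needs one)
    str = str.replace(/\s+v:\w+="[^"]+"/gi, '');
    // Collapse runs of blank lines into a single line break
    str = str.replace(/(\r?\n){2,}/g, '\n');

    // Fix entities. A string pattern only replaces the first match,
    // so use regexes with the g flag to catch every occurrence.
    str = str.replace(/\u201C/g, '"');       // left curly quote
    str = str.replace(/\u201D/g, '"');       // right curly quote
    str = str.replace(/\u2014/g, '\u2013');  // em dash to en dash

    return str;
}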
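
One thing to watch when porting from C#: String.Replace in .NET replaces every occurrence, whereas JavaScript's replace with a plain string pattern only replaces the first one. A quick check of the difference (the sample text is made up):

```javascript
// Sample text with two pairs of curly quotes (\u201C and \u201D).
const sample = '\u201Cone\u201D and \u201Ctwo\u201D';

// A string pattern stops after the first match.
const firstOnly = sample.replace('\u201C', '"');

// A regex with the g flag replaces every match.
const all = sample.replace(/\u201C/g, '"');

console.log(firstOnly); // only the first opening quote is straightened
console.log(all);       // both opening quotes are straightened
```

So any entity fix-ups in a port like this want the regex form with the g flag (or replaceAll, where the runtime supports it).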

* Download the VS.NET 2005 solution (3kb)
* Download the CleanWordHtml console application (3kb, requires .NET 2.0 runtime)

The download links appear to be dead…

http://www.codinghorror.com/blog/files/WordHtmlCleaner-vsnet2005-solution.zip
http://www.codinghorror.com/blog/files/WordHtmlCleaner-executable.zip

Many of the comments above are spam.

Any advances?

Thanks.

My solution is simpler: if it’s going to wind up as HTML eventually, don’t write it in Microsoft Word. Use the free OpenOffice.org Writer, the open-source derivative of Sun Microsystems’ StarOffice Writer. It can compose documents with all the styles and nice appearance of Word documents, but when it exports the document as HTML, the resulting code is much, much cleaner. Any styles used in the document are included as HTML inline style blocks. It can also import Word documents, although I have no experience using it as a filter to clean up Word files prior to export to HTML. OpenOffice.org is a free suite similar to Microsoft Office, and is available for Microsoft Windows, Apple Macintosh and Linux systems. (Up to version 2.4 it also ran on Windows 98 systems, and you can probably still find the installer in archives.)