Building Mht Files from URLs revisited

I finally finished updating my CodeProject article, Convert any URL to a MHTML archive using native .NET code. It's based on RFC 2557, MIME Encapsulation of Aggregate Documents, such as HTML (MHTML). You may also know it as that crazy File, Save As, "Web Archive, Single File" menu option in Internet Explorer. It's basically a way to package an entire web page as a (mostly) functional single file that can be emailed, stored in a database, or what have you. Lots of interesting possibilities, including quick and dirty offline functionality for ASP.NET websites using loopback HTTP requests.
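To make the "single file" idea concrete, here is a minimal sketch in Python (not the VB.NET code from the article) of how an RFC 2557 style archive can be assembled as a multipart/related MIME message using only the standard library. The function name build_mht and the shape of the images argument are assumptions made for illustration; the real MhtBuilder also downloads and rewrites resources, which this sketch leaves out.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(page_url, html, images):
    """Package one HTML page and its images as a multipart/related message.

    page_url -- absolute URL of the page (hypothetical input)
    html     -- the page markup as a str
    images   -- dict mapping absolute image URL -> (raw bytes, subtype),
                for example {"http://example.com/a.png": (data, "png")}
    """
    # The root container; type="text/html" names the type of the root part.
    archive = MIMEMultipart("related", type="text/html")
    archive["Subject"] = page_url

    # First part: the HTML itself, labelled with the URL it came from.
    root = MIMEText(html, "html", "utf-8")
    root["Content-Location"] = page_url
    archive.attach(root)

    # Remaining parts: each resource, labelled with its own URL so that
    # references in the HTML can be matched to a part inside the archive.
    for url, (data, subtype) in images.items():
        part = MIMEImage(data, _subtype=subtype)
        part["Content-Location"] = url
        archive.attach(part)

    # Serializing writes the "--boundary" separators and the final
    # "--boundary--" closing delimiter.
    return archive.as_string()
```

Writing the returned string to a file with an .mht extension gives something with the same overall structure as a saved web archive, though whether a particular viewer accepts it is not something this sketch guarantees.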


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2005/03/building-mht-files-from-urls-revisited.html

That’s interesting. The links it builds are in this format:

mhtml:file://C:\Documents and Settings\jya13970\My Documents\My MHTs\Spidersoft - WinMHT Start Page.mht!http://www.spidersoft.com/winmht/start.asp

mhtml:file://C:\Documents and Settings\jya13970\My Documents\My MHTs\Spidersoft - WinMHT Start Page.mht!http://www.spidersoft.com/winmht/default.asp

Hmm. I wasn’t aware of these crazy mhtml:file:// format links and the exclamation points…

I guess if I remapped all the links to that format, you could create one giant MHT that contained all the sub-pages of a website.
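As a rough illustration of that remapping idea, the sketch below (in Python) rebuilds links in the shape seen in the two WinMHT examples above: mhtml:file://, then the archive path, then an exclamation point, then the original URL. The format is inferred only from those two examples, and the function names are made up for this sketch; whether IE resolves such links for every resource has not been tested here.

```python
def mhtml_link(archive_path, resource_url):
    """Build a link of the form mhtml:file://<archive path>!<original URL>."""
    return "mhtml:file://" + archive_path + "!" + resource_url


def remap_links(links, archived_urls, archive_path):
    """Point links at the archive when their target was saved inside it.

    links         -- list of absolute URLs found in a page
    archived_urls -- set of URLs known to be stored in the archive
    archive_path  -- local path of the .mht file
    Links whose target is not in the archive are left unchanged.
    """
    return [
        mhtml_link(archive_path, url) if url in archived_urls else url
        for url in links
    ]
```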

I need to also revisit how Firefox deals with this. It is an RFC standard, but last I checked there was a special add-in you needed to view them in FF.

Looks great! Now are you going to extend it to be able to put more than one page into the archive?

I don’t think that’s possible… I believe clicked links will always try to resolve a real host and access the network instead of checking the MHT file for the resource. I can run an experiment to see if it will work or not, but I doubt it.

Well, WinMHT can definitely do it: http://www.winmht.com

The RFC talks about “subsidiary resources”. Obviously it’s no problem if these are HTML pages again… a multi-page archive, as created with WinMHT, can be opened in IE without problems and the various pages inside can be browsed from the archive if they are interlinked (WinMHT can also create a TOC page).

I guess the links are probably relative in the file (otherwise it wouldn’t work as soon as you copy it elsewhere), so part of the links you are showing aren’t really in the file itself. However, would be great if your MhtBuilder could do this!

As for Firefox: it doesn’t support MHT files natively, which strikes some people as funny because actually Thunderbird does. I’m not sure about the exact extent of the support, but it’s definitely possible to view MHT files attached to email directly in the mailer. I looked at this a while ago, but I believe I was able to find a bug tracking entry about this at the time.

Can your product possibly pull down linked active pages like ASP? I’m building active content systems with a local server and being able to bundle all the pages as single files would be insanely great! IE’s save as mht will not do that.
Great effort so far, keep up the good work!
David

How can I save stuff from my local disk? I have a report that I’m creating in HTML, and I already have my stuff locally. Any easy way to just “get” the stuff from there and package into an MHT? The image links are all local (img src="file.png")…

Thanks

Also, I ran your demo, saved the default codinghorror pages to disk, and tried to open it with Microsoft Word, which told me that it’s not a valid single page archive file.

And as a follow-up to that: if I save from IE, it works fine in Word.

I’m building active content systems with a local server and being able to bundle all the pages as single files would be insanely great! IE’s save as mht will not do that.

Yes, I would like to get to this… eventually… I wasn’t aware it was possible until Oliver pointed out WinMHT.

tried to open it with Microsoft Word

Does it open from IE?

I am having the same problem as Kurt.

Does it open from IE?
Yep

I want to be able to convert any html page on my web server’s local disk into .mht, rename it to .doc, and be able to open it in word. Any advice?

Well, I never tested Word… it never occurred to me that you could even do this!

Hi, thanks first for the nice app.
Most sites I tried so far seem to work fine.
But when I try a download (web complete) on http://www.heise.de/ it crashed after a few seconds.
I assume the problem is an invalid filename that it wants to create on the local HDD, like “this is a long , filename.txt”.

tried to open it with Microsoft Word, which told me that it’s not a valid single page archive file.

I know what this is now. Word is looking for a trailing “--” at the final boundary. So at the very end of the file change this…

But when i try a download (web complete) on http://www.heise.de/ it crashed after a few seconds.

The problem with that URL is its insane use of the link tag. Just take a look at the top of the file for all the link elements. Not easy to fix, because I assume most linked elements are embedded. In this case, they’re not at all…

Why would you want to use MHT for anything? It is a Microsoft pseudo-standard, and IE is the only browser that will open them, so you can forget Linux users and Mac users, except for those few who have IE for Mac. While Windows may currently be the most widely used OS for normal users, the internet should not be a place where only Windows users are welcome. Considering the flaws in IE, I would be loath to build a site that required opening the viewer up to security holes just to make life easier for me.

Yes, an MHT file can store images and multiple webpages in one file, but so can a tarred or zipped folder, and as far as browsing those pages goes, the technique is relative links.

It’s not a pseudo-standard, it’s RFC2557 almost verbatim!

http://www.ietf.org/rfc/rfc2557.txt

The main benefit is keeping everything in a single file.

Have you patched the code for the word problem? Just curious.

Also yes, this is a standard. There are extensions for Firefox to save/read this as well. And a bunch of other things support it too.

I’ve made two code changes to allow for the file to be opened in Word 2003. This made it work for me anyway.

Kyle

In builder.vb starting on line 474 change the procedure to the following:

Private Sub AppendMhtBoundary(Optional ByVal bEndOfFile As Boolean = False)
	AppendMhtLine()
	If bEndOfFile = False Then
		' Ordinary separator between parts
		AppendMhtLine("--" & _MimeBoundaryTag)
	Else
		' Closing delimiter: trailing "--" marks the end of the archive
		AppendMhtLine("--" & _MimeBoundaryTag & "--")
	End If
End Sub

In builder.vb on line 438, change procedure call to: AppendMhtBoundary(True)
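
For anyone who wants to check an already generated file for this problem, here is a small sketch in Python. It assumes the boundary is declared in a boundary=... parameter somewhere in the headers and that the file has been read as text; the function name has_closing_boundary is made up for this example. It only checks that the last non-blank line is the boundary wrapped in "--" on both sides, which is the closing delimiter MIME multipart bodies are expected to end with.

```python
import re


def has_closing_boundary(mht_text):
    """Return True if the archive ends with the '--boundary--' delimiter."""
    # Find the declared boundary, with or without surrounding quotes.
    match = re.search(r'boundary="?([^";\r\n]+)"?', mht_text, re.IGNORECASE)
    if match is None:
        return False  # No boundary declared, so nothing to check against.
    boundary = match.group(1)

    # Ignore trailing blank lines and look at the last line with content.
    lines = [line for line in mht_text.splitlines() if line.strip()]
    return bool(lines) and lines[-1].strip() == "--" + boundary + "--"
```

A file written before the change above would end with a plain "--boundary" line, so this check would return False for it.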