MS-HTML

As someone who worked on the HTML engine inside Microsoft Office, I can tell you there is a good reason for all the bloat in a file generated by MSO. The point of saving a document to HTML is fidelity, both of the document's data and of its display. "Office" in this case means Microsoft Word, as each product has its own conventions for saving as HTML.

The HTML output by Microsoft Office is about 90% compatible with Netscape Navigator, with a few exceptions:

Embedded movies and media tend to end up as <img dynsrc=... which is an IE-only specification. They should end up as <embed src=... but that's a personal gripe I had.
Some meta tags mean nothing to Netscape.
Some of the CSS and other positioning properties only matter in IE 5.0 and above.
OLE objects tend to be a little weird.
Some of the other Office programs use JavaScript that isn't 100% compliant with Netscape, but does work.

I took a Microsoft Word 2000 document and saved it as HTML. The entire contents of the document were "Hello World", in the true style of CS. Here's what I got:


<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

You'll see these XML namespaces all over the place. They are there for future compliance and for use in the Microsoft XML parsing engines. You could also script against this yourself to see what HTML you'll be getting. It's useful for anyone wanting to parse the information.
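If you do want to script against it, here's a rough sketch of pulling the namespace declarations off the opening tag using Python's standard html.parser module. The class and function names (NamespaceCollector, declared_namespaces) are made up for this example and are not part of anything Office ships.

```python
from html.parser import HTMLParser


class NamespaceCollector(HTMLParser):
    """Collect the xmlns declarations found on the <html> start tag."""

    def __init__(self):
        super().__init__()
        self.namespaces = {}

    def handle_starttag(self, tag, attrs):
        if tag != "html":
            return
        for name, value in attrs:
            if name == "xmlns":
                # Default namespace, stored under the empty prefix.
                self.namespaces[""] = value
            elif name.startswith("xmlns:"):
                self.namespaces[name.split(":", 1)[1]] = value


def declared_namespaces(html_text):
    """Return a {prefix: URI} mapping for the document's root tag."""
    collector = NamespaceCollector()
    collector.feed(html_text)
    return collector.namespaces


root_tag = '''<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">'''

namespaces = declared_namespaces(root_tag)
for prefix, uri in namespaces.items():
    print(repr(prefix), uri)
```

The o and w prefixes are the ones you'll see on the Office-specific elements further down in the file.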


<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">

These are your standard meta tags, except ProgId. This is a COM name identifier that tells what kind of document you're looking at. So if you open a Word-generated HTML file in Excel, it will know not to parse it, but to launch Word to handle it instead. The charset in the http-equiv tag tells the reader which codepage to use for the character set. Word is incredibly picky about fonts and characters.
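To see how a reader could make that decision, here's a minimal sketch that looks up the ProgId meta tag with Python's standard html.parser module. The names MetaCollector and detect_prog_id are hypothetical, and this is only an illustration of the idea, not how Office itself does the check.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        if name is not None:
            # HTMLParser lowercases attribute names but not values,
            # so normalize the meta name ourselves.
            self.meta[name.lower()] = attr_map.get("content")


def detect_prog_id(html_text):
    """Return the ProgId meta value, or None if the file declares none."""
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.meta.get("progid")


sample = """<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
</head>"""

print(detect_prog_id(sample))
```

A file with no ProgId tag simply returns None, which a caller could treat as plain HTML.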


<link rel=File-List href="./Hello%20world_files/filelist.xml">

Ahh, our first bit of strangeness here... You will notice that this is a relative path to something that simply doesn't exist (it is ignored in this case). It would exist, however, if you had embedded files or images in the document. The file lists everything the Word doc contained (whether it be an OLE object, a file that just isn't visible, etc). When you save out to HTML with images, you will notice that a folder named "documentname"_files is often created alongside the document. This contains all of those items. Here, because the doc is so simple, this isn't an issue.
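As a small illustration, the href can be decoded and split into the supporting folder and the manifest name with nothing but the Python standard library. The helper names below (split_file_list_href, supporting_folder_exists) are invented for this sketch, and it assumes the href is a plain relative, percent-encoded path like the one above.

```python
import posixpath
from pathlib import Path
from urllib.parse import unquote


def split_file_list_href(href):
    """Decode a File-List href and return (folder, manifest_name)."""
    decoded = unquote(href)                   # %20 becomes a space
    normalized = posixpath.normpath(decoded)  # drops the leading "./"
    return posixpath.dirname(normalized), posixpath.basename(normalized)


def supporting_folder_exists(html_path, href):
    """Check whether the supporting folder sits next to the HTML file."""
    folder, _ = split_file_list_href(href)
    return (Path(html_path).parent / folder).is_dir()


folder, manifest = split_file_list_href("./Hello%20world_files/filelist.xml")
print(folder)
print(manifest)
```

For the "Hello world" document the folder check would come back False, which matches the point above: the link is written out even when nothing was saved alongside the file.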


<title>Hello world</title>

Your title is right there. Nothing strange at all.


<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Jay Bonci</o:Author>
  <o:LastAuthor>Jay Bonci</o:LastAuthor>
  <o:Revision>1</o:Revision>
  <o:TotalTime>1</o:TotalTime>
  <o:Created>2001-07-05T18:09:00Z</o:Created>
  <o:LastSaved>2001-07-05T18:10:00Z</o:LastSaved>
  <o:Pages>1</o:Pages>
  <o:Company>Manifest Research Visions, Inc.</o:Company>
  <o:Lines>1</o:Lines>
  <o:Paragraphs>1</o:Paragraphs>
  <o:Version>9.3821</o:Version>
 </o:DocumentProperties>
</xml><![endif]-->

Word keeps a set of document properties, including who wrote the document, the number of lines, when it was created, etc. Whenever you save as HTML, each of these items is kept in this XML section inside the document. They describe the document and preserve the information across the file format change.
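Since the properties are plain XML, they are easy to read back out. The sketch below is one way to do it in Python, under a couple of assumptions: the block appears exactly once, and it contains nothing a strict XML parser would choke on (an HTML-only entity, for example). Because the o: prefix is declared on the html tag rather than inside the block, the snippet is wrapped in a throwaway root element that redeclares it. The function name document_properties is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:o declaration on the <html> tag.
OFFICE_NS = "urn:schemas-microsoft-com:office:office"

PROPERTIES_BLOCK = re.compile(
    r"<o:DocumentProperties>.*?</o:DocumentProperties>", re.DOTALL
)


def document_properties(html_text):
    """Return the o:DocumentProperties children as a {name: text} dict."""
    match = PROPERTIES_BLOCK.search(html_text)
    if match is None:
        return {}
    # Redeclare the o: prefix so the fragment parses on its own.
    wrapped = f'<root xmlns:o="{OFFICE_NS}">{match.group(0)}</root>'
    root = ET.fromstring(wrapped)
    props = root.find(f"{{{OFFICE_NS}}}DocumentProperties")
    # Tags look like "{namespace}Author"; keep only the local name.
    return {child.tag.split("}", 1)[1]: (child.text or "") for child in props}


sample = """<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Jay Bonci</o:Author>
  <o:Revision>1</o:Revision>
  <o:Created>2001-07-05T18:09:00Z</o:Created>
  <o:Pages>1</o:Pages>
  <o:Company>Manifest Research Visions, Inc.</o:Company>
 </o:DocumentProperties>
</xml><![endif]-->"""

properties = document_properties(sample)
for name, value in properties.items():
    print(f"{name}: {value}")
```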


<style>
<!--
 /* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{mso-style-parent:"";
	margin:0in;
	margin-bottom:.0001pt;
	mso-pagination:widow-orphan;
	font-size:12.0pt;
	font-family:"Times New Roman";
	mso-fareast-font-family:"Times New Roman";}
@page Section1
	{size:8.5in 11.0in;
	margin:1.0in 1.25in 1.0in 1.25in;
	mso-header-margin:.5in;
	mso-footer-margin:.5in;
	mso-paper-source:0;}
div.Section1
	{page:Section1;}
-->
</style>

This is non-standard CSS, there to preserve settings that CSS doesn't describe but Word obviously needs. For instance, mso-header-margin corresponds to the header setting inside Page Setup on the File menu.
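To get a quick look at which Word-only settings a stylesheet is carrying, you can pick out the mso- declarations with a regular expression. This is a rough sketch, not a real CSS parser: it assumes values never contain a semicolon or closing brace, and the function name mso_declarations is made up for the example.

```python
import re

# Matches "mso-something: value" up to the next ';' or '}'.
MSO_DECLARATION = re.compile(r"\b(mso-[a-z-]+)\s*:\s*([^;}]+)")


def mso_declarations(css_text):
    """Return (property, value) pairs for every mso-* declaration."""
    return [
        (name, value.strip())
        for name, value in MSO_DECLARATION.findall(css_text)
    ]


sample_css = """p.MsoNormal
	{mso-style-parent:"";
	margin:0in;
	mso-pagination:widow-orphan;
	font-size:12.0pt;}
@page Section1
	{size:8.5in 11.0in;
	mso-header-margin:.5in;
	mso-footer-margin:.5in;
	mso-paper-source:0;}"""

found = mso_declarations(sample_css)
for name, value in found:
    print(name, "=", value)
```

Everything the pattern skips (margin, font-size, size) is ordinary CSS that a browser understands on its own.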


</head>

<body lang=EN-US style='tab-interval:.5in'>

<div class=Section1>

<p class=MsoNormal>Hello world</p>

</div>

</body>

</html>

There, that seems to be all of it. Word takes a great many steps to make sure that the document is preserved as fully as possible and that HTML is a lossless format. If you pick apart more complicated documents, you'll notice that your data is held intact across this format very well, and that was the design goal. This has all been explained before in a public forum, and I'm not giving away any proprietary secrets of this four-year-old feature.

Actual hand editing of the HTML is very rarely done by a person who would save as HTML in Word. Basically, you are looking at a business person who wants to save to the Web, most likely a corporate intranet. In the Macintosh Office 2001 version of the product, there is an option in the Save dialog that lets you save "clean" HTML without the Microsoft-specific tags (but you lose information in the conversion back to the web).
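To give a feel for what a "clean" save has to throw away, here's a minimal Python sketch that drops the conditional-comment XML blocks and the mso- declarations inside style blocks. It is not the Office 2001 implementation, just an illustration under simple assumptions (no nested conditional comments, no semicolons inside mso- values), and clean_word_html is a made-up name.

```python
import re

# <!--[if gte mso 9]> ... <![endif]--> blocks, XML islands included.
CONDITIONAL_COMMENT = re.compile(
    r"<!--\[if [^\]]*\]>.*?<!\[endif\]-->", re.DOTALL
)
STYLE_BLOCK = re.compile(
    r"(<style[^>]*>)(.*?)(</style>)", re.DOTALL | re.IGNORECASE
)
# One mso-* declaration plus the whitespace in front of it.
MSO_DECLARATION = re.compile(r"\s*mso-[a-z-]+\s*:[^;}]*;?")


def clean_word_html(html_text):
    """Strip Word-only conditional blocks and mso-* CSS declarations."""
    without_blocks = CONDITIONAL_COMMENT.sub("", html_text)

    def strip_mso(match):
        open_tag, css, close_tag = match.groups()
        return open_tag + MSO_DECLARATION.sub("", css) + close_tag

    return STYLE_BLOCK.sub(strip_mso, without_blocks)


sample = """<head>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Jay Bonci</o:Author>
 </o:DocumentProperties>
</xml><![endif]-->
<style>
<!--
p.MsoNormal
	{mso-style-parent:"";
	margin:0in;
	mso-pagination:widow-orphan;
	font-size:12.0pt;}
-->
</style>
</head>
<body><p class=MsoNormal>Hello world</p></body>"""

cleaned = clean_word_html(sample)
print(cleaned)
```

The author name and pagination setting are gone from the output, which is exactly the kind of information loss the clean option trades for smaller files.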
