As someone who worked on the HTML engine inside of Microsoft Office, I can tell you there is a good reason why there is so much bloat inside of a file generated by MSO. The point of saving a document to HTML is document data fidelity and document display fidelity. When you are talking about Office in this case, you mean Microsoft Word, as each product has its own conventions for saving as HTML.
The HTML output by Microsoft Office is about 90% compatible with Netscape Navigator, with a few exceptions:
- Embedded movies and media tend to end up as <img dynsrc=... which is an IE-only attribute. They should end up as <embed src=... but that's a personal gripe I had.
- Some meta tags mean nothing to Netscape.
- Some of the CSS and other positioning things only matter in IE 5.0 and above.
- OLE objects tend to be a little weird.
- Some of the other Office programs use JavaScript that isn't 100% compliant with Netscape, but it does work.
I took a Microsoft Word 2000 document and saved it as HTML. The entire contents of the document were "Hello World", in the true style of CS.
Here's what I got:
<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">
You'll see these XML namespace declarations all over the place. They are there for future compliance and for use in the Microsoft XML parsing engines. You could also script against this yourself to see what HTML you'll be seeing. It's useful for anyone wanting to parse the information.
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
These are your standard meta tags, except ProgId. This is a COM name identifier that tells what kind of document you're looking at. Therefore, if you open a Microsoft Word generated HTML file in Excel, it will know not to parse it, but to launch Word to handle it. The charset in the http-equiv tag tells which codepage to use for the character set. Word is incredibly picky about fonts and characters.
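If you wanted to look at that ProgId from your own script, a minimal Python sketch might go like this. It only uses the standard library's html.parser; the names MetaCollector and office_prog_id are made up for illustration, and this says nothing about how Office itself reads the tag.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Skip http-equiv style tags, which carry no name attribute.
        if "name" in attr and "content" in attr:
            self.meta[attr["name"].lower()] = attr["content"]


def office_prog_id(html_text):
    """Return the ProgId meta value (e.g. 'Word.Document'), or None."""
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.meta.get("progid")


# Sample taken from the head section shown above.
sample = """<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
</head>"""

print(office_prog_id(sample))
```

Word writes these attributes without quotes, which is why the sketch leans on a tolerant HTML parser rather than an XML one.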
<link rel=File-List href="./Hello%20world_files/filelist.xml">
Ahh, our first bit of strangeness here... You will notice that this is a relative path to something that simply doesn't exist. (It is ignored in this case.) It would, however, exist if you had embedded files in the document, or any images. It points to a list of everything that the Word doc contained (whether it be an OLE object, a file that just isn't visible, etc). When you save out to HTML with images, you will notice that a folder named "documentname"_files is often created alongside the document. This contains all of those items. Here, because the doc is so simple, this isn't an issue.
<title>Hello world</title>
Your title is right there. Nothing strange at all.
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Author>Jay Bonci</o:Author>
<o:LastAuthor>Jay Bonci</o:LastAuthor>
<o:Revision>1</o:Revision>
<o:TotalTime>1</o:TotalTime>
<o:Created>2001-07-05T18:09:00Z</o:Created>
<o:LastSaved>2001-07-05T18:10:00Z</o:LastSaved>
<o:Pages>1</o:Pages>
<o:Company>Manifest Research Visions, Inc.</o:Company>
<o:Lines>1</o:Lines>
<o:Paragraphs>1</o:Paragraphs>
<o:Version>9.3821</o:Version>
</o:DocumentProperties>
</xml><![endif]-->
Word keeps a section of document properties, including who wrote the document, the number of lines, when it was created, etc. Whenever you save as HTML, each of these items is kept in this XML section inside of the document. They are used to describe the document and preserve the information across file format changes.
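For anyone wanting to pull those properties back out, here is a rough Python sketch. It assumes the o:DocumentProperties block is well-formed XML, as it is in the sample above; the function name document_properties is hypothetical, and the wrapper element is only there because the o: prefix is declared up on the html tag rather than inside the block.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:o declaration on the <html> tag.
OFFICE_NS = "urn:schemas-microsoft-com:office:office"


def document_properties(html_text):
    """Return the o:DocumentProperties children as a plain dict of strings."""
    match = re.search(
        r"<o:DocumentProperties>.*?</o:DocumentProperties>",
        html_text,
        re.DOTALL,
    )
    if match is None:
        return {}
    # Wrap the fragment so the o: prefix is declared for the XML parser.
    wrapped = '<root xmlns:o="%s">%s</root>' % (OFFICE_NS, match.group(0))
    props = ET.fromstring(wrapped)[0]
    # Tags come back as '{namespace}Name'; keep only the local name.
    return {child.tag.split("}", 1)[1]: (child.text or "") for child in props}


sample = """<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Author>Jay Bonci</o:Author>
<o:Revision>1</o:Revision>
<o:Created>2001-07-05T18:09:00Z</o:Created>
<o:Pages>1</o:Pages>
</o:DocumentProperties>
</xml><![endif]-->"""

for name, value in document_properties(sample).items():
    print(name, "=", value)
```

Browsers treat the whole block as a comment, so it costs nothing in display; only something that knows to look inside the conditional comment ever sees it.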
<style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0in;
margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:12.0pt;
font-family:"Times New Roman";
mso-fareast-font-family:"Times New Roman";}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;
mso-header-margin:.5in;
mso-footer-margin:.5in;
mso-paper-source:0;}
div.Section1
{page:Section1;}
-->
</style>
This is non-standard CSS to preserve some of the items that CSS doesn't describe, but that Word obviously needs. For instance, mso-header-margin is the header margin value you set in Page Setup off of the File menu.
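To see how much of a style block is Word-only, you could sort the declarations by their mso- prefix. The Python sketch below is a simplification and not a real CSS parser: it assumes simple rule bodies like the ones above (no semicolons inside quoted values), and split_declarations is a made-up name.

```python
import re


def split_declarations(css_text):
    """Split property names into (standard, office_only) lists by mso- prefix."""
    standard, office_only = [], []
    # Each {...} body holds semicolon-separated 'name:value' declarations.
    for body in re.findall(r"\{([^}]*)\}", css_text):
        for declaration in body.split(";"):
            if ":" not in declaration:
                continue
            name = declaration.split(":", 1)[0].strip()
            if name.startswith("mso-"):
                office_only.append(name)
            else:
                standard.append(name)
    return standard, office_only


# Trimmed from the style block shown above.
sample = """p.MsoNormal, li.MsoNormal, div.MsoNormal
{mso-style-parent:"";
margin:0in;
mso-pagination:widow-orphan;
font-size:12.0pt;}
@page Section1
{size:8.5in 11.0in;
mso-header-margin:.5in;
mso-footer-margin:.5in;}"""

standard, office_only = split_declarations(sample)
print("standard:", standard)
print("office only:", office_only)
```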
</head>
<body lang=EN-US style='tab-interval:.5in'>
<div class=Section1>
<p class=MsoNormal>Hello world</p>
</div>
</body>
</html>
There, that seems to be all of it. Word takes many, many steps to make sure that the document is preserved as fully as possible and that HTML is a lossless format. If you pick apart more complicated documents, you'll notice that your data is held intact across this format very well, and that was the design goal. This has all been explained before in a public forum, and I'm not giving away any proprietary secrets of this four-year-old feature.
Actual hand editing of the HTML is done very rarely by a person who would save as HTML in Word. Basically, you are looking at a business person who wants to save to the Web, most likely a corporate intranet. In the Macintosh Office 2001 version of the product, there is an option off of the Save dialog that allows you to save as "clean" HTML without the Microsoft-specific tags (but you lose information in the conversion back to the web).
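To get a feel for what stripping the Microsoft-specific parts involves, here is a rough Python sketch. It is not what the Office 2001 option does internally; it is a regex-based approximation under the assumption that the markup looks like the sample above, and strip_office_markup is a made-up name. As noted, anything removed this way is information Word can no longer recover.

```python
import re


def strip_office_markup(html_text):
    """Remove the most obvious Office-only markup (approximate, regex-based)."""
    # Conditional comments such as <!--[if gte mso 9]>...<![endif]-->
    cleaned = re.sub(
        r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html_text, flags=re.DOTALL
    )
    # mso-* declarations inside style blocks and style attributes
    cleaned = re.sub(r"\s*mso-[\w-]+\s*:[^;}]*;?", "", cleaned)
    # Prefixed namespace declarations such as xmlns:o="..."
    cleaned = re.sub(r'\s+xmlns:\w+="[^"]*"', "", cleaned)
    return cleaned


sample = """<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<!--[if gte mso 9]><xml>
<o:DocumentProperties>
<o:Author>Jay Bonci</o:Author>
</o:DocumentProperties>
</xml><![endif]-->
<style>
p.MsoNormal
{mso-style-parent:"";
margin:0in;
mso-pagination:widow-orphan;}
</style>
</head>
<body><p class=MsoNormal>Hello world</p></body>
</html>"""

print(strip_office_markup(sample))
```

The sketch leaves class names like MsoNormal alone, since the remaining standard CSS still hangs off them.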