Cleaning Word HTML
bostjank
05-29-2003, 09:00 AM
Hi!
I got an assignment to prepare HTML versions of a large number of publications that are stored in Word (XP) format.
Does anybody know of good software that would clean up the HTML that Word creates? I know of Dreamweaver, but would like to use some other application - just for cleaning.
Or - even better - is there software that would create clean HTML directly from a Word document?
Thanks,
Bostjan
requestcode
05-29-2003, 01:26 PM
I believe Word has the ability to save the document as an HTML document. Have you tried that?
bostjank
05-29-2003, 01:56 PM
Yes, of course, but as I said in the title, I would like to CLEAN the HTML that Word creates. Or use a program that produces clean HTML from a Word document.
Well, the only efficient way of doing that is to save it as an HTML file and open it in a text editor such as Notepad, where you can view the source.
My suggestion: just mess around deleting tags that you don't recognise. If the page breaks, put them back and try something else. That's pretty much how I picked up HTML.
Although Dreamweaver will be able to read Word's HTML, it wouldn't "clean" it up for you, like I think you want. MS Word adds so many unnecessary tags, it's unbelievable.
MotherNatrsSon
05-29-2003, 02:28 PM
Dreamweaver will clean it up some. Then you will have to use an editor if you want it stripped way down. Zip up an HTML document from it and attach it to a response here. I can run it through BBEdit and send it back to you.
bostjank
05-29-2003, 02:28 PM
Notepad is an option, but as there are several thousand documents that need to be converted, I would like to find a more efficient way.
Catman
05-29-2003, 02:56 PM
HTML Tidy does a decent job cleaning Word HTML files. You can find it at the W3C site (http://w3c.org), or you can get it with HTML-Kit (http://www.chami.com). In HTML-Kit, you can run Tidy in batch mode.
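For several thousand files, the batch run could also be scripted around the command-line version of Tidy. What follows is only a rough sketch, assuming a `tidy` executable is on the PATH; the function names `tidy_command` and `tidy_folder` are made up for illustration. It relies on Tidy's `word-2000` and `clean` options, and it does not address character encodings at all.

```python
import subprocess
from pathlib import Path


def tidy_command(src: Path) -> list[str]:
    """Build a Tidy command line that rewrites `src` in place (hypothetical helper)."""
    return [
        "tidy",
        "-modify",              # write the cleaned markup back to the same file
        "-quiet",               # suppress non-essential output
        "--word-2000", "yes",   # strip the surplus markup Word adds on "Save as HTML"
        "--clean", "yes",       # replace presentational tags with style rules
        str(src),
    ]


def tidy_folder(folder: Path) -> list[Path]:
    """Run Tidy on every .htm/.html file under `folder`.

    Returns the files for which Tidy reported errors
    (exit status 0 = ok, 1 = warnings, 2 = errors).
    """
    failed = []
    for src in sorted(folder.rglob("*.htm*")):
        result = subprocess.run(tidy_command(src), capture_output=True)
        if result.returncode >= 2:
            failed.append(src)
    return failed


# Example (run it on a COPY of the exported files, since -modify overwrites them):
# problems = tidy_folder(Path("word_export_copy"))
```

Because `-modify` overwrites its input, it is safer to point this at a copy of the folder and check a few results by hand before trusting the whole batch.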
brothercake
05-29-2003, 03:04 PM
Just don't use Word to generate HTML documents - it can't really be cleaned because what it generates isn't actually HTML at all - it's XML + Schema, designed so that if you open such a webpage in Word, it turns back into a Word document.
But you might be able to do this:
1 - save the document as RTF format
2 - open it in Word 97 if you have it (this might work in Word 2000 as well, but it won't in Word XP)
3 - now, "save as HTML" from an RTF document in Word 97 generates nice, clean HTML 3.2 :)
I have to deal with Word documents all the time - it's no way to go; I always end up pasting them into a plain-text editor and marking them up manually. Works for simple documents - for more complex stuff, I just turn it into a PDF.
Or if you can wait a few months, Office 11 can deal with XML directly, so you wouldn't need an intermediary, you can just write XSL to work with the XML that it generates.
liorean
05-29-2003, 03:22 PM
Originally posted by brothercake
Just don't use Word to generate HTML documents - it can't really be cleaned because what it generates isn't actually HTML at all - it's XML + Schema, designed so that if you open such a webpage in Word, it turns back into a Word document.
Except that it isn't XML at all, it's SGML using XML features in a nice mishmash that no strict parser of any of the technologies would be able to handle.
The one time I needed to change Word documents to HTML I saved the content as a text file. I then added in the <p> ... </p> tags, and then pasted in the predesigned header that I had prepared earlier (DOCTYPE, HEAD, TITLE, META, NAVIGATION, etc), and then saved it as Text File ending in .html -- this didn't take long per page, and needed minimal extra work, just a few links to insert and HTML for images to add. For thousands of pages though this may take quite some time.
bostjank
05-30-2003, 07:09 AM
Thanks everybody. I'll give HTML Tidy a try and will also try to first save in RTF, then to HTML.
Bostjan
ronaldb66
05-30-2003, 09:18 AM
I did a little googling and this hit came up first:
Bersoft Word HTML CleanUp 1.0 (http://www.bersoft.com/bwhcu/).
I have no relation to these people whatsoever and don't know their product, but there's also something about setting up Word 2000 to produce "cleaner" code; it may be of help.
scroots
05-30-2003, 07:48 PM
You could use PHP or another server-side language with regular expressions.
scroots
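To give an idea of what the regular-expression route might look like, here is a minimal sketch in Python rather than PHP. The pattern list is an assumption about the most common Word residue (conditional comments, `<xml>` blocks, namespaced tags such as `<o:p>`, `Mso` classes, inline styles), not a complete catalogue, and `clean_word_html` is a made-up name. Regular expressions are not an HTML parser, so the results need checking.

```python
import re

# (pattern, replacement) pairs, applied in order. Illustrative, not exhaustive.
_WORD_PATTERNS = [
    # <!--[if gte mso 9]> ... <![endif]--> conditional comments
    (re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S | re.I), ""),
    # <xml> ... </xml> blocks holding Word document settings
    (re.compile(r"<xml>.*?</xml>", re.S | re.I), ""),
    # namespaced tags such as <o:p>, </o:p>, <w:...>, <v:...>, <st1:...>
    (re.compile(r"</?(?:o|w|v|st1):[^>]*>", re.I), ""),
    # class=MsoNormal, class="MsoBodyText", ...
    (re.compile(r"""\s+class=(?:"Mso[^"]*"|'Mso[^']*'|Mso\w+)""", re.I), ""),
    # quoted inline style attributes (aggressive: removes ALL inline styles)
    (re.compile(r"""\s+style=(?:"[^"]*"|'[^']*')""", re.I), ""),
]


def clean_word_html(html: str) -> str:
    """Strip common Word-specific markup from an HTML string."""
    for pattern, replacement in _WORD_PATTERNS:
        html = pattern.sub(replacement, html)
    return html


sample = "<p class=MsoNormal style='margin:0cm'>Hello<o:p></o:p></p>"
print(clean_word_html(sample))  # prints <p>Hello</p>
```

Attributes such as `colspan` and `rowspan` are left alone, so merged table cells should survive, but dropping every `style` attribute also throws away any formatting you may have wanted to keep.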
cg9com
05-31-2003, 08:02 AM
http://www.textism.com/resources/cleanwordhtml/
liorean
05-31-2003, 06:31 PM
Argh! cg9com posted that first:D
I guess I'm not the only one who reads blogs these days...
Use this free tool:
http://www.webmaster-toolkit.com/frontpage-code-cleaner.shtml
Save the results file with a new name and verify it before you delete your original version.
cg9com
06-01-2003, 08:14 AM
Originally posted by liorean
Argh! cg9com posted that first:D
I guess I'm not the only one who reads blogs these days...
lol :D
You're the one who made me go to that site! :p
liorean
06-01-2003, 09:09 AM
Oh yeah, you had issues with *HIM*, right?:cool:
cg9com
06-01-2003, 10:41 AM
Don't get me started ...
:rolleyes:
I just went to the site because we were on that topic, I don't know. :D
How are you getting on fixing up the code?
Skyzyx
06-01-2003, 04:32 PM
I guess I read Zeldman's blog too...
http://www.zeldman.com/daily/0503c.shtml#wordcleaner
bostjank
06-02-2003, 10:45 AM
Textism is a great tool, but unfortunately I cannot use it for at least 2 reasons:
- charset windows-1250 is not supported
- merged cells in a table are not converted correctly
But anyway - now I'm saving the Word file to RTF and then to HTML. It's not perfect, but it'll have to do.
Thanks to all.
Bostjan
Does the Webmaster Toolkit tool do what you want? See above.
bostjank
06-02-2003, 03:17 PM
No, it doesn't - all Word-specific tags are unfortunately still present.