GrogHeads Forum

IRL (In Real Life) => Tech Talk => Topic started by: bayonetbrant on March 08, 2013, 01:29:47 PM

Title: HTML vs MS Word - some help
Post by: bayonetbrant on March 08, 2013, 01:29:47 PM
I got this by email at work today. Might be useful to someone out there.

Quote:
Hi folks, if this isn't a new trick, I apologize for spamming you. It's new to me so I figured I'd share. Dreading the thought of cleaning up a set of Word-exported HTML by hand, I thought about using jQuery to do a lot of the dirty work for me. It works pretty well!

Get yourself a copy of jQuery (below I'm using jquery-1.8.3.min.js), drop it in the folder with the HTML source you'd like to clean, and add this to the HEAD of the HTML file(s) you exported from Word:

<script type="text/javascript" src="jquery-1.8.3.min.js" ></script>
<script type="text/javascript">
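$(document).ready( function() {
      // Strip the presentational attributes Word adds (removeAttr takes a space-separated list as of jQuery 1.7).
      $('*').removeAttr("style class vlink lang link");
      // Drop all style and script elements, including this clean-up script itself.
      $('style,script').remove();
      // Unwrap spans, keeping their contents; repeat until nested spans are gone too.
      while ( $('span').length > 0 ) { $('span').each( function () { var h = $(this).html(); $(this).replaceWith( h ); } ); }
      // Capture the cleaned markup (head and body), then show it in a textarea for copying.
      var bdy = $('html').html();
      $('body').html("").append('<form><textarea style="width: 100%; height: 100%"></textarea></form>');
      $('textarea').val( bdy );
} );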
</script>


When you load the document(s) in your browser (try Chrome; it produces better output than IE does and doesn't squawk about local JavaScript), you should see a textarea containing the cleaned-up HTML. It's a _much_ easier starting point for manual cleanup than before!

The script above strips any style/class/lang/vlink/link attributes, removes all style tags, unwraps spans (keeping their contents), and removes every script tag, including the clean-up JavaScript itself, from the resulting source. As always, your mileage may vary depending on just how messy the HTML coming out of Word happens to be, but it's pretty easy to add to or change what the script cleans up if you're familiar with jQuery/CSS selectors. For example, the version below also strips the align attribute and removes paragraphs (and bold or bold-italic runs inside them) that contain nothing but a single &nbsp; character.

<script type="text/javascript" src="jquery-1.8.3.min.js" ></script>
<script type="text/javascript">
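$(document).ready( function() {
      // Remove obj when its inner HTML is exactly str (a lone &nbsp; or nothing at all).
      var checkRepl = function ( obj, str ) { if ( $(obj).html() == str ) { $(obj).replaceWith(""); } };
      // Strip the presentational attributes Word adds, now including align.
      $('*').removeAttr("style class vlink lang link align");
      // Drop all style and script elements, including this clean-up script itself.
      $('style,script').remove();
      // Unwrap spans, keeping their contents; repeat until nested spans are gone too.
      while ( $('span').length > 0 ) { $('span').each( function () { var h = $(this).html(); $(this).replaceWith( h ); } ); }
      // Pass 1 removes elements holding only &nbsp;; passes 2 and 3 remove the parents left empty as a result.
      $('p,p>b,p>b>i').each( function () { checkRepl( this, "&nbsp;" ); } ).each( function () { checkRepl( this, "" ); } ).each( function () { checkRepl( this, "" ); } );
      // Capture the cleaned markup (head and body), then show it in a textarea for copying.
      var bdy = $('html').html();
      $('body').html("").append('<form><textarea style="width: 100%; height: 100%"></textarea></form>');
      $('textarea').val( bdy );
} );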
</script>
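
If the list of things to strip keeps growing, one option is to pull it out into a settings object so that adding an attribute or tag means editing one array instead of the clean-up logic. The following is only a rough sketch and not part of the original email: `cleanupSettings`, `buildCleanupPlan`, and `runCleanup` are made-up names, it assumes jQuery 1.7 or later (for the space-separated `removeAttr` list), and instead of three fixed passes it keeps removing empty elements until a pass finds nothing more to remove.

```javascript
// Hypothetical settings object; none of these names come from the original email.
var cleanupSettings = {
  stripAttributes: ["style", "class", "vlink", "lang", "link", "align"],
  removeElements: ["style", "script"],
  unwrapElements: ["span"],
  emptySelector: "p,p>b,p>b>i",
  emptyMarkers: ["&nbsp;", ""]
};

// Pure helper: turn the settings into the strings jQuery expects.
function buildCleanupPlan(settings) {
  return {
    attrList: settings.stripAttributes.join(" "),       // for removeAttr (jQuery 1.7+)
    removeSelector: settings.removeElements.join(","),  // e.g. "style,script"
    unwrapSelector: settings.unwrapElements.join(","),  // e.g. "span"
    emptySelector: settings.emptySelector,
    emptyMarkers: settings.emptyMarkers.slice()         // copy, so the plan is independent
  };
}

// Apply the plan to the current page. jQuery is passed in as $.
function runCleanup($, plan) {
  $("*").removeAttr(plan.attrList);
  $(plan.removeSelector).remove();

  // Unwrap matching elements, repeating until nested ones are gone too.
  while ($(plan.unwrapSelector).length > 0) {
    $(plan.unwrapSelector).each(function () {
      $(this).replaceWith($(this).html());
    });
  }

  // Remove "empty" elements until a full pass removes nothing.
  var removed;
  do {
    removed = 0;
    $(plan.emptySelector).each(function () {
      if ($.inArray($(this).html(), plan.emptyMarkers) !== -1) {
        $(this).remove();
        removed++;
      }
    });
  } while (removed > 0);

  // Show the cleaned markup in a textarea, as the scripts above do.
  var cleaned = $("html").html();
  $("body").html("").append('<form><textarea style="width: 100%; height: 100%"></textarea></form>');
  $("textarea").val(cleaned);
}

// Only run when jQuery is actually loaded (i.e. in the browser).
if (typeof jQuery !== "undefined") {
  jQuery(function ($) { runCleanup($, buildCleanupPlan(cleanupSettings)); });
}
```

To try it, put the code in a second script tag after the jQuery include, just like the versions above, and adjust the arrays in `cleanupSettings` to match whatever your Word export leaves behind.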