Topic: HTML vs MS Word - some help

bayonetbrant
« on: March 08, 2013, 11:29:47 AM »
I got this by email at work today. It might be useful to someone out there.

Quote
Hi folks, if this isn't a new trick, I apologize for spamming you.  It's new to me so I figured I'd share.  Dreading the thought of cleaning up a set of Word-exported HTML by hand, I thought about using jQuery to do a lot of the dirty work for me.  It works pretty well!

Get yourself a copy of jQuery (below I'm using jquery-1.8.3.min.js), drop it in the folder with the HTML source you'd like to clean, and add this to the HEAD of the HTML file(s) you exported from Word:

<script type="text/javascript" src="jquery-1.8.3.min.js" ></script>
<script type="text/javascript">
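$(document).ready( function() {
      // Strip the presentational attributes Word puts on nearly every element.
      $('*').removeAttr("style").removeAttr("class").removeAttr("vlink").removeAttr("lang").removeAttr("link");
      // Drop Word's style blocks and every script, including this one.
      $('style,script').remove();
      // Unwrap spans but keep their contents; repeat until nested spans are gone.
      while ( $('span').length > 0 ) { $('span').each( function () { var h = $(this).html(); $(this).replaceWith( h ); } ); }
      // Capture the cleaned markup, then swap the page for a textarea that shows it.
      var bdy = $('html').html();
      $('body').html("").append('<form><textarea style="width: 100%; height: 100%"></textarea></form>');
      $('textarea').val( bdy );
} );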
</script>


When you load the document(s) in your browser (try Chrome, it makes for better output than IE does, and doesn't squawk about local javascript), you should see a textarea containing the cleaned-up HTML.  A _much_ easier starting point for manual cleanup than before!  The above script will strip any style/class/lang/vlink/link attributes, remove any style tags, unwrap spans (keeping their contents) and remove the clean-up javascript itself from the resulting source.  As always, your mileage may vary depending on just how messy the HTML coming out of Word happens to be, but it's pretty easy to add/change what the script will clean up if you're familiar with jQuery/CSS selectors. For example, the version below adds some additional cleaning for the align attribute and paragraphs containing a single &nbsp; character.

<script type="text/javascript" src="jquery-1.8.3.min.js" ></script>
<script type="text/javascript">
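$(document).ready( function() {
      // Remove obj from the page when its inner HTML is exactly str.
      var checkRepl = function ( obj, str ) { if ( $(obj).html() == str ) { $(obj).replaceWith(""); } };
      // Strip the presentational attributes, now including align.
      $('*').removeAttr("style").removeAttr("class").removeAttr("vlink").removeAttr("lang").removeAttr("link").removeAttr("align");
      // Drop Word's style blocks and every script, including this one.
      $('style,script').remove();
      // Unwrap spans but keep their contents; repeat until nested spans are gone.
      while ( $('span').length > 0 ) { $('span').each( function () { var h = $(this).html(); $(this).replaceWith( h ); } ); }
      // Remove elements holding only &nbsp;, then wrappers left empty (two more passes cover p > b > i nesting).
      $('p,p>b,p>b>i').each( function () { checkRepl( this, "&nbsp;" ); } ).each( function () { checkRepl( this, "" ); } ).each( function () { checkRepl( this, "" ); } );
      // Capture the cleaned markup, then swap the page for a textarea that shows it.
      var bdy = $('html').html();
      $('body').html("").append('<form><textarea style="width: 100%; height: 100%"></textarea></form>');
      $('textarea').val( bdy );
} );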
</script>
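
If you expect to keep adding rules, it may be easier to keep the attribute names in one list and the "blank paragraph" test in one function. The following is only a rough sketch of that idea, not part of the original email: ATTRS_TO_STRIP and isBlankMarkup are made-up names, and the blank test assumes that a paragraph holding nothing but whitespace, &nbsp; and bare <b>/<i> tags is safe to drop. It relies on removeAttr accepting a space-separated list, which jQuery documents as available from 1.7 onward. These lines would stand in for the removeAttr chain and the checkRepl line in the script above; the rest stays as it is.

```javascript
// Hypothetical names for illustration; only removeAttr, filter, html and
// remove are real jQuery calls.
var ATTRS_TO_STRIP = ["style", "class", "vlink", "lang", "link", "align"];

// True when the given inner HTML holds nothing but whitespace, &nbsp;
// (entity or literal U+00A0), and bare <b>/<i> tags.
function isBlankMarkup(html) {
  var stripped = html
    .replace(/<\/?(b|i)>/gi, "")        // drop attribute-less bold/italic tags
    .replace(/&nbsp;|\u00a0|\s/g, "");  // drop non-breaking spaces and whitespace
  return stripped === "";
}

// Guarded so the helpers can be loaded (and checked) where jQuery is absent.
if (typeof jQuery !== "undefined") {
  jQuery(function ($) {
    // One call instead of a chain: removeAttr takes a space-separated list.
    $("*").removeAttr(ATTRS_TO_STRIP.join(" "));
    // Remove paragraphs that contain nothing visible.
    $("p").filter(function () {
      return isBlankMarkup($(this).html());
    }).remove();
  });
}
```

To strip another attribute, add its name to ATTRS_TO_STRIP. Note that a paragraph containing only an image or a link is not treated as blank, since those tags are left in place by the test.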
