Fun with .docx to .html transforms by means of HtmlConverter from PowerTools for Open XML

  1. The transform is FOSS and platform-independent:
    1. It requires neither Office nor Windows (the OpenXML SDK runs on Linux via Mono on the server).
    2. However, the most recent installment of PowerTools for OpenXML, a high-level API to the OpenXML SDK, comes with a PowerShell interface (benefit: no Visual Studio requirement).
  2. Valuable features of the transform include, among many others:
    1. HtmlConverter is able to translate MS-Word styles into CSS (as far as needed: my code style has “No proofing” set, but that cannot be implemented on the web), so the layout is preserved as designed, without the need for inline formatting:
        span.pt-StrongEmphasis-000052 {
            font-family: Calibri;
            font-size: 11pt;
            font-style: italic;
            font-weight: bold;
            margin: 0in;
            padding: 0in;
        }

        span.pt-lowCodeConsoleChar0 {
            color: #FFFFFF;
            background: #000000;
            font-family: Consolas;
            font-size: 10pt;
            font-weight: normal;
            margin: 0in;
            padding: 0in;
        }

    The generated HTML then refers to such classes by name:
          <h3 dir="ltr" class="pt-000040">
            <span class="pt-000041">2.2.1</span><span class="pt-000042"><span class="pt-000043"> </span></span><span class="pt-Heading2Char"><b>References</b></span>
          </h3>

          <p dir="ltr" class="pt-BodyText">
            <span class="pt-DefaultParagraphFont-000003"><br />
            ‎</span><span class="pt-000000"> </span>
          </p>

          <h1 dir="ltr" class="pt-000006">
            <span class="pt-000007"><b>3</b></span><span class="pt-000008"><b><span class="pt-000009"> </span></b></span><span class="pt-Heading1Char"><b>Introduction</b></span>
          </h1>

          <h2 dir="ltr" class="pt-000018">
            <span class="pt-000019">3.1</span><span class="pt-000020"><span class="pt-000021"> </span></span><span class="pt-Heading2Char"><b>Purpose of Document</b></span>
          </h2>
    2. There are many more options that I have not yet tried:
            SimplifyMarkupSettings simplifyMarkupSettings = new SimplifyMarkupSettings
            {
                RemoveComments = true,
                RemoveContentControls = true,
                RemoveEndAndFootNotes = true,
                RemoveFieldCodes = false,
                RemoveLastRenderedPageBreak = true,
                RemovePermissions = true,
                RemoveProof = true,
                RemoveRsidInfo = true,
                RemoveSmartTags = true,
                RemoveSoftHyphens = true,
                RemoveGoBackBookmark = true,
                ReplaceTabsWithSpaces = false,
            };
            MarkupSimplifier.SimplifyMarkup(wordDoc, simplifyMarkupSettings);

            FormattingAssemblerSettings formattingAssemblerSettings = new FormattingAssemblerSettings
            {
                RemoveStyleNamesFromParagraphAndRunProperties = false,
                ClearStyles = false,
                RestrictToSupportedLanguages = htmlConverterSettings.RestrictToSupportedLanguages,
                RestrictToSupportedNumberingFormats = htmlConverterSettings.RestrictToSupportedNumberingFormats,
                CreateHtmlConverterAnnotationAttributes = true,
                OrderElementsPerStandard = false,
                ListItemRetrieverSettings = new ListItemRetrieverSettings()
                {
                    ListItemTextImplementations = htmlConverterSettings.ListItemImplementations,
                },
            };
    3. One really wishes there were a way to get HTML like this cleaned up automatically (ouch!):
                <span class="pt-DefaultParagraphFont-000006">M</span>
                <span class="pt-DefaultParagraphFont-000006">anaged requirements for system integration </span>
                <span class="pt-DefaultParagraphFont-000006">of Center</span>
                <span class="pt-DefaultParagraphFont-000006"> </span>
                <span class="pt-DefaultParagraphFont-000006">software </span>
                <span class="pt-DefaultParagraphFont-000006">with </span>
                <span class="pt-DefaultParagraphFont-000006">iLearning</span>
                <span class="pt-DefaultParagraphFont-000006"> and with content production and management (BPD). To mitigate lack of integration of $50k LMS software investment into departmental workflow</span>
                <span class="pt-DefaultParagraphFont-000006">,</span>
                <span class="pt-DefaultParagraphFont-000006"> </span>
                <span class="pt-DefaultParagraphFont-000006">developed </span>
                <span class="pt-DefaultParagraphFont-000006">and documented </span>
                <span class="pt-DefaultParagraphFont-000006">software to automate</span>
                <span class="pt-DefaultParagraphFont-000006"> creation of 4K+ user accounts p.a., 30K+ learning documents and 100K+ interactive content paths in LMS.</span>
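A possible cleanup pass, as a minimal sketch only: it assumes the converter result is available as an XElement (as the OxPtHelper.cs line quoted in this post suggests) and merges neighbouring text-only spans whose attributes are identical. `SpanMerger` and its method names are made up for illustration and are not part of PowerTools.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class SpanMerger
{
    // A span qualifies for merging only if it holds plain text (no child elements).
    static bool IsTextOnlySpan(XElement e) =>
        e.Name.LocalName == "span" && !e.HasElements;

    // Serializes the attributes so two spans can be compared for identical markup.
    static string AttributeKey(XElement e) =>
        string.Join("|", e.Attributes().Select(a => a.Name + "=" + a.Value));

    // Merges runs of directly adjacent, identically attributed text-only spans.
    public static void MergeAdjacentSpans(XElement root)
    {
        foreach (var parent in root.DescendantsAndSelf().Where(e => e.HasElements).ToList())
        {
            XElement previous = null;
            foreach (var node in parent.Nodes().ToList())
            {
                var current = node as XElement;
                if (current == null || !IsTextOnlySpan(current))
                {
                    previous = null;   // any other node in between ends the run
                    continue;
                }
                if (previous != null && AttributeKey(previous) == AttributeKey(current))
                {
                    previous.Value += current.Value;   // append the text to the first span
                    current.Remove();
                }
                else
                {
                    previous = current;
                }
            }
        }
    }

    static void Main()
    {
        var p = XElement.Parse(
            "<p><span class=\"a\">M</span><span class=\"a\">anaged </span>" +
            "<span class=\"b\">x</span><span class=\"a\">y</span></p>",
            LoadOptions.PreserveWhitespace);
        MergeAdjacentSpans(p);
        // Prints: <p><span class="a">Managed </span><span class="b">x</span><span class="a">y</span></p>
        Console.WriteLine(p.ToString(SaveOptions.DisableFormatting));
    }
}
```

The pass would have to run on the in-memory tree before serialization: once the output is pretty-printed, the line breaks between spans become text nodes, and the sketch deliberately treats any node between two spans as the end of a run.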
    4. There are also much more serious conversion errors:
      1. MS-Word perfectly displays a plain text content control and a repeating section content control within a table, the latter containing one combo box and one plain text content control per row: [screenshot: openxml-convert-docxtohtml-error-word]
      2. Convert-DocxToHtml swallows the content completely (and so does Google Docs Preview): [screenshot: openxml-convert-docxtohtml-error-html]. The underlying HTML has just an empty table under each heading:
              <div class="pt-000001">
                <p dir="ltr" class="pt-qiCVHeading1">
                  <span class="pt-DefaultParagraphFont-000002">Profile</span>
                </p>
              </div>
              <div align="left">
                <table border="1" cellspacing="0" cellpadding="0" dir="ltr" class="pt-000003" />
              </div>
              <div class="pt-000001">
                <p dir="ltr" class="pt-qiCVHeading1">
                  <span class="pt-DefaultParagraphFont-000002">Technologies</span>
                </p>
              </div>
              <div align="left">
                <table border="1" cellspacing="0" cellpadding="0" dir="ltr" class="pt-000003" />
              </div>
          
      3. MS-Word shows: [image]. I have yet to look into the underlying XML to see whether the .docx is to blame for that…
      4. But the HtmlConverter output in IE or Firefox looks like this: [image]. The underlying HTML reveals that the CSS does not get applied in the right place:
              <tr>
                <td class="pt-000079">
                  <p dir="ltr" class="pt-BodyTextSmall">
                    <span class="pt-BodyTextSmallChar-000081">AD</span>
                  </p>
                </td>
                <td colspan="2" class="pt-000079">
                  <p dir="ltr" class="pt-BodyTextSmall">
                    <span class="pt-BodyTextSmallChar-000081">Active Driector, Microsfot’s directory implementation.</span>
                  </p>
                </td>
              </tr>

              <tr>
                <td class="pt-000086">
                  <p dir="ltr" class="pt-BodyTextSmall">
                    <span class="pt-000085"> </span>
                  </p>
                </td>
                <td colspan="2" class="pt-000086">
                  <p dir="ltr" class="pt-BodyTextSmall">
                    <span class="pt-000085"> </span>
                  </p>
                </td>
              </tr>
  3. One could imagine MS-Word being more lenient than OpenXML PowerTools' Convert-DocxToHtml, much like a web browser's parser tolerates and displays bad HTML. However, that would first require explaining how MS-Word, which is also the originating WYSIWYG editor, came to write bad markup. Moreover, OpenXML PowerTools' Get-OpenXmlValidationErrors does not seem to find any OpenXML errors in either of the above documents that could explain the bad conversion: apart from dozens of Sch_UndeclaredAttribute errors (version-related? not sure how that could be), there is only a Pkg_PartIsNotAllowed error relating to a glossary.
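To get an overview of such validation results outside PowerShell, something like the following might help. It is a sketch under assumptions: it uses the OpenXML SDK's own `OpenXmlValidator` directly (I have not checked whether Get-OpenXmlValidationErrors does exactly the same internally), it needs the DocumentFormat.OpenXml package, and the input path is whatever .docx you pass as the first argument.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;

static class ValidationSummary
{
    static void Main(string[] args)
    {
        // args[0]: path to the .docx to check (supplied by the caller).
        using (var doc = WordprocessingDocument.Open(args[0], false))
        {
            var errors = new OpenXmlValidator().Validate(doc).ToList();

            // Group by error id so that dozens of identical errors collapse into one line.
            foreach (var group in errors.GroupBy(e => e.Id).OrderByDescending(g => g.Count()))
            {
                var first = group.First();
                Console.WriteLine($"{group.Count(),5}  {group.Key}");
                Console.WriteLine($"       e.g. {first.Part?.Uri} {first.Path?.XPath}");
                Console.WriteLine($"       {first.Description}");
            }
        }
    }
}
```

`OpenXmlValidator` also has a constructor that takes a `FileFormatVersions` value. If the Sch_UndeclaredAttribute errors are indeed version-related, validating against a newer Office version than the default might make them disappear, but that is a guess I have not verified against these documents.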
  • Also yet to do:
    1. When (not always!) does my page title end up empty?
      <title></title>
    2. The output defaults to an XHTML doctype, not HTML(5).
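Until both points are understood, a workaround could be applied after conversion. The following is only a sketch: it assumes the page is at hand as an XElement, `HtmlPostProcessor`, `Finish` and the fallback title are invented for illustration, and it does not touch PowerTools itself. (HtmlConverterSettings appears to expose a PageTitle property, which may be the cleaner place to fix the title; the sketch does not rely on it.)

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class HtmlPostProcessor
{
    // Returns the page as a string with an HTML5 doctype and a non-empty <title>.
    public static string Finish(XElement html, string fallbackTitle)
    {
        // Match on the local name so the XHTML namespace does not matter.
        var title = html.Descendants().FirstOrDefault(e => e.Name.LocalName == "title");
        if (title != null && string.IsNullOrWhiteSpace(title.Value))
            title.Value = fallbackTitle;

        return "<!DOCTYPE html>" + Environment.NewLine + html.ToString(SaveOptions.None);
    }

    static void Main()
    {
        XNamespace xhtml = "http://www.w3.org/1999/xhtml";
        var page = new XElement(xhtml + "html",
            new XElement(xhtml + "head", new XElement(xhtml + "title")),
            new XElement(xhtml + "body", new XElement(xhtml + "p", "Hello")));
        Console.WriteLine(Finish(page, "Untitled document"));
    }
}
```

Swapping the doctype alone may not be enough: an HTML5 parser does not treat self-closing non-void elements such as `<table ... />` as closed, so output like the empty tables quoted earlier could still be misread.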
  • Done:
      1. Pretty-printing. The HtmlConverter output defaults to putting all content (except the CSS) on one line (e.g. 90,000 characters long in the example from which the above code is taken). For human readability, and possibly also for git tracking, pretty-printing would be better. It can be enforced like so (is there a better way? I cannot see a user-configurable option for the SaveOptions enumeration):
    // openXml\OxPt\OxPtCmdlets\OxPtHelper.cs
    // trp: requesting pretty-printing; was: html.ToString(SaveOptions.DisableFormatting);
    var htmlString = html.ToString(SaveOptions.None);
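The difference between the two SaveOptions values can be seen in isolation with a few lines of LINQ to XML. The element tree here is a made-up stand-in for the converter output.

```csharp
using System;
using System.Xml.Linq;

static class SaveOptionsDemo
{
    static void Main()
    {
        var html = new XElement("div",
            new XElement("p", new XElement("span", "Profile")));

        // Everything on one line, no whitespace added.
        Console.WriteLine(html.ToString(SaveOptions.DisableFormatting));

        // Indented, one element per line.
        Console.WriteLine(html.ToString(SaveOptions.None));
    }
}
```

One caveat: the indentation puts line breaks between sibling inline elements, and browsers render whitespace between two spans as a space. Text that Word split across several runs in the middle of a word may therefore show a stray space in the pretty-printed version.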
    