Fun with .docx to .html transforms by means of HtmlConverter from PowerTools for Open XML

  1. The transform is FOSS and platform-independent:
    1. It requires neither Office nor Windows (the Open XML SDK runs on Linux via Mono on the server).
    2. However, the most recent release of PowerTools for Open XML, a high-level API to the Open XML SDK, comes with a PowerShell interface (benefit: no Visual Studio requirement).
  2. Valuable features of the transform, among many other things, are:
    1. HtmlConverter translates MS-Word styles into CSS (as far as needed: my code style has “No proofing” set, but that cannot be implemented on the web), so the layout is preserved as designed, but without the need for inline formatting:
        span.pt-StrongEmphasis-000052 {
            font-family: Calibri;
            font-size: 11pt;
            font-style: italic;
            font-weight: bold;
            margin: 0in;
            padding: 0in;
        }

        span.pt-lowCodeConsoleChar0 {
            color: #FFFFFF;
            background: #000000;
            font-family: Consolas;
            font-size: 10pt;
            font-weight: normal;
            margin: 0in;
            padding: 0in;
        }

          <h3 dir="ltr" class="pt-000040">
            <span class="pt-000041">2.2.1</span><span class="pt-000042"><span class="pt-000043"> </span></span><span class="pt-Heading2Char"><b>References</b></span>
          </h3>

          <p dir="ltr" class="pt-BodyText">
            <span class="pt-DefaultParagraphFont-000003"><br />
            ‎</span><span class="pt-000000"> </span>
          </p>

          <h1 dir="ltr" class="pt-000006">
            <span class="pt-000007"><b>3</b></span><span class="pt-000008"><b><span class="pt-000009"> </span></b></span><span class="pt-Heading1Char"><b>Introduction</b></span>
          </h1>

          <h2 dir="ltr" class="pt-000018">
            <span class="pt-000019">3.1</span><span class="pt-000020"><span class="pt-000021"> </span></span><span class="pt-Heading2Char"><b>Purpose of Document</b></span>
          </h2>
    2. There are many more options that I have not yet tried:
            SimplifyMarkupSettings simplifyMarkupSettings = new SimplifyMarkupSettings
            {
                RemoveComments = true,
                RemoveContentControls = true,
                RemoveEndAndFootNotes = true,
                RemoveFieldCodes = false,
                RemoveLastRenderedPageBreak = true,
                RemovePermissions = true,
                RemoveProof = true,
                RemoveRsidInfo = true,
                RemoveSmartTags = true,
                RemoveSoftHyphens = true,
                RemoveGoBackBookmark = true,
                ReplaceTabsWithSpaces = false,
            };
            MarkupSimplifier.SimplifyMarkup(wordDoc, simplifyMarkupSettings);

            FormattingAssemblerSettings formattingAssemblerSettings = new FormattingAssemblerSettings
            {
                RemoveStyleNamesFromParagraphAndRunProperties = false,
                ClearStyles = false,
                RestrictToSupportedLanguages = htmlConverterSettings.RestrictToSupportedLanguages,
                RestrictToSupportedNumberingFormats = htmlConverterSettings.RestrictToSupportedNumberingFormats,
                CreateHtmlConverterAnnotationAttributes = true,
                OrderElementsPerStandard = false,
                ListItemRetrieverSettings = new ListItemRetrieverSettings()
                {
                    ListItemTextImplementations = htmlConverterSettings.ListItemImplementations,
                },
            };
    3. One really wishes there were a way to get HTML like this cleaned up automatically (ouch!):
                <span class="pt-DefaultParagraphFont-000006">M</span>
                <span class="pt-DefaultParagraphFont-000006">anaged requirements for system integration </span>
                <span class="pt-DefaultParagraphFont-000006">of Center</span>
                <span class="pt-DefaultParagraphFont-000006"> </span>
                <span class="pt-DefaultParagraphFont-000006">software </span>
                <span class="pt-DefaultParagraphFont-000006">with </span>
                <span class="pt-DefaultParagraphFont-000006">iLearning</span>
                <span class="pt-DefaultParagraphFont-000006"> and with content production and management (BPD). To mitigate lack of integration of $50k LMS software investment into departmental workflow</span>
                <span class="pt-DefaultParagraphFont-000006">,</span>
                <span class="pt-DefaultParagraphFont-000006"> </span>
                <span class="pt-DefaultParagraphFont-000006">developed </span>
                <span class="pt-DefaultParagraphFont-000006">and documented </span>
                <span class="pt-DefaultParagraphFont-000006">software to automate</span>
                <span class="pt-DefaultParagraphFont-000006"> creation of 4K+ user accounts p.a., 30K+ learning documents and 100K+ interactive content paths in LMS.</span>
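A cleanup pass of this kind can be sketched with plain LINQ to XML. The following is only a minimal sketch, not something PowerTools provides: `SpanMerger` and `MergeAdjacentSpans` are hypothetical names, and the assumption is that directly adjacent text-only spans with identical attributes can be merged without changing how the page renders.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of PowerTools for Open XML.
public static class SpanMerger
{
    // True for a <span> that holds text only (no child elements).
    private static bool IsTextOnlySpan(XElement e)
    {
        return e.Name.LocalName == "span" && !e.HasElements;
    }

    // Two spans count as identically formatted when their names and
    // their complete attribute sets (name and value) are equal.
    private static bool SameFormatting(XElement a, XElement b)
    {
        var attrsA = a.Attributes().Select(x => x.Name + "=" + x.Value).OrderBy(s => s);
        var attrsB = b.Attributes().Select(x => x.Name + "=" + x.Value).OrderBy(s => s);
        return a.Name == b.Name && attrsA.SequenceEqual(attrsB);
    }

    // Merges each run of directly adjacent, identically formatted,
    // text-only spans into the first span of the run.
    public static void MergeAdjacentSpans(XElement root)
    {
        foreach (var parent in root.DescendantsAndSelf().ToList())
        {
            XElement run = null; // first span of the current run
            foreach (var node in parent.Nodes().ToList())
            {
                var span = node as XElement;
                if (span == null || !IsTextOnlySpan(span))
                {
                    run = null; // any other node ends the run
                    continue;
                }
                if (run != null && SameFormatting(run, span))
                {
                    run.Value += span.Value; // append the text
                    span.Remove();           // drop the now redundant span
                }
                else
                {
                    run = span;
                }
            }
        }
    }
}
```

The pass would have to run on the converter's in-memory XElement, or on markup parsed with `LoadOptions.PreserveWhitespace`. Otherwise the spans that contain only a blank lose their content on parsing, and words run together after the merge.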
    4. There are also much more serious conversion errors:
      1. MS-Word perfectly displays a plain text content control and a repeating section content control within a table (one combo box and one plain text content control per row): [screenshot: openxml-convert-docxtohtml-error-word]
      2. Convert-DocxToHtml swallows the content completely (and so does Google Docs Preview): [screenshot: openxml-convert-docxtohtml-error-html] The underlying HTML has just an empty table under each heading:
              <div class="pt-000001">
                <p dir="ltr" class="pt-qiCVHeading1">
                  <span class="pt-DefaultParagraphFont-000002">Profile</span>
                </p>
              </div>
              <div align="left">
                <table border="1" cellspacing="0" cellpadding="0" dir="ltr" class="pt-000003" />
              </div>
              <div class="pt-000001">
                <p dir="ltr" class="pt-qiCVHeading1">
                  <span class="pt-DefaultParagraphFont-000002">Technologies</span>
                </p>
              </div>
              <div align="left">
                <table border="1" cellspacing="0" cellpadding="0" dir="ltr" class="pt-000003" />
              </div>
          
      3. MS-Word shows: [image] I have yet to look into the underlying XML to see whether the .docx is to blame for that…
      4. But the HtmlConverter output in IE or Firefox shows: [image] The underlying HTML reveals that the CSS does not get applied in the right place:
              <tr>
                <td class="pt-000079">
                  <p dir="ltr" class="pt-BodyTextSmall">
                    <span class="pt-BodyTextSmallChar-000081">AD</span>
                  </p>
                </td>
                <td colspan="2" class="pt-000079">
                  <p dir="ltr" class="pt-BodyTextSmall">
                    <span class="pt-BodyTextSmallChar-000081">Active Driector, Microsfot’s directory implementation.</span>
                  </p>
                </td>
              </tr>

              <tr>
                <td class="pt-000086">
                  <p dir="ltr" class="pt-BodyTextSmall">
                    <span class="pt-000085"> </span>
                  </p>
                </td>
                <td colspan="2" class="pt-000086">
                  <p dir="ltr" class="pt-BodyTextSmall">
                    <span class="pt-000085"> </span>
                  </p>
                </td>
              </tr>
  3. One could imagine MS-Word acting less strictly than OpenXML PowerTools' Convert-DocxToHtml, the way a web browser's parser tolerates and displays bad HTML. However, one would then need to explain how MS-Word, which is also the WYSIWYG editor that produced these documents, came to write bad markup in the first place. Moreover, OpenXML PowerTools' Get-OpenXmlValidationErrors does not seem to find any OpenXML errors in either of the above documents that could explain the bad conversion: apart from dozens of Sch_UndeclaredAttribute errors (version-related? not sure how this could be), there is only a Pkg_PartIsNotAllowed relating to a glossary.
  • Also yet to do:
    1. When (not always!) does my page title end up empty?
      <title></title>
    2. The output defaults to an XHTML doctype, not HTML(5).
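Until the cause is found, both points could be patched after the conversion. The following is a minimal sketch using only LINQ to XML. `HtmlPostProcessor`, `EnsureTitle` and `WithHtml5Doctype` are hypothetical names, and it assumes the converted page is available as an XElement before it is written to disk.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical post-processing helper, not part of PowerTools for Open XML.
public static class HtmlPostProcessor
{
    // Fills an empty <title> with a fallback; a non-empty title is left alone.
    public static void EnsureTitle(XElement html, string fallbackTitle)
    {
        var title = html.Descendants()
                        .FirstOrDefault(e => e.Name.LocalName == "title");
        if (title != null && string.IsNullOrWhiteSpace(title.Value))
        {
            title.Value = fallbackTitle;
        }
    }

    // Wraps the root element in a document that carries the HTML5 doctype
    // (a doctype named "html" without public or system identifier).
    public static XDocument WithHtml5Doctype(XElement html)
    {
        return new XDocument(new XDocumentType("html", null, null, null), html);
    }
}
```

This only treats the symptom. It does not explain why the title comes out empty for some documents and not for others.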
  • Done:
    1. Pretty-printing. By default, the HtmlConverter output puts all content (except the CSS) on one line (in the example from which the above code is taken, 90,000 characters long). For human readability, and possibly also for tracking in git, pretty-printing is better. It can be enforced like so (is there a better way? I cannot see a user-configurable option for the SaveOptions enumeration):
          // openXml\OxPt\OxPtCmdlets\OxPtHelper.cs
          var htmlString = html.ToString(SaveOptions.None); // trp: requesting pretty-printing, was: html.ToString(SaveOptions.DisableFormatting);
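To make the choice configurable instead of hard-coded, the call could be wrapped in a small helper. This is a sketch only: `HtmlSerializer` and its `indent` flag are hypothetical and not an existing option of the cmdlet. It relies solely on the two `SaveOptions` values already used in the line above.

```csharp
using System.Xml.Linq;

// Hypothetical wrapper around the serialization call shown above.
public static class HtmlSerializer
{
    // indent = true  -> SaveOptions.None (line breaks and indentation added)
    // indent = false -> SaveOptions.DisableFormatting (no whitespace added)
    public static string Serialize(XElement html, bool indent)
    {
        return html.ToString(indent ? SaveOptions.None : SaveOptions.DisableFormatting);
    }
}
```

One thing to watch: indenting puts sibling span elements on separate lines, and browsers render a line break between inline elements as a space. A word that is split across two spans (such as "M" and "anaged" above) may therefore display with a gap in the pretty-printed version.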