Fun with .docx to .html transforms by means of HtmlConverter from PowerTools for Open XML
2014/12/15
- The transform is FOSS and platform-independent:
- It requires neither Office nor Windows (the Open XML SDK runs on Linux via Mono on the server).
- However, the most recent release of PowerTools for Open XML, a high-level API on top of the Open XML SDK, comes with a PowerShell interface (benefit: no Visual Studio required).
- Valuable features of the transform, among many other things, are:
- HtmlConverter can translate MS-Word styles into CSS (as far as needed: my code style has “No proofing” set, which has no equivalent on the web), so the layout is preserved as designed, without any need for inline formatting:
span.pt-StrongEmphasis-000052 {
    font-family: Calibri;
    font-size: 11pt;
    font-style: italic;
    font-weight: bold;
    margin: 0in;
    padding: 0in;
}
span.pt-lowCodeConsoleChar0 {
    color: #FFFFFF;
    background: #000000;
    font-family: Consolas;
    font-size: 10pt;
    font-weight: normal;
    margin: 0in;
    padding: 0in;
}
<h3 dir="ltr" class="pt-000040">
<span class="pt-000041">2.2.1</span><span class="pt-000042"><span class="pt-000043">&nbsp;</span></span><span class="pt-Heading2Char"><b>References</b></span>
</h3>
<p dir="ltr" class="pt-BodyText">
<span class="pt-DefaultParagraphFont-000003"><br />
&lrm;</span><span class="pt-000000">&nbsp;</span>
</p>
<h1 dir="ltr" class="pt-000006">
<span class="pt-000007"><b>3</b></span><span class="pt-000008"><b><span class="pt-000009">&nbsp;</span></b></span><span class="pt-Heading1Char"><b>Introduction</b></span>
</h1>
<h2 dir="ltr" class="pt-000018">
<span class="pt-000019">3.1</span><span class="pt-000020"><span class="pt-000021">&nbsp;</span></span><span class="pt-Heading2Char"><b>Purpose of Document</b></span>
</h2>
- There are many more options that I have not yet tried:
SimplifyMarkupSettings simplifyMarkupSettings = new SimplifyMarkupSettings
{
    RemoveComments = true,
    RemoveContentControls = true,
    RemoveEndAndFootNotes = true,
    RemoveFieldCodes = false,
    RemoveLastRenderedPageBreak = true,
    RemovePermissions = true,
    RemoveProof = true,
    RemoveRsidInfo = true,
    RemoveSmartTags = true,
    RemoveSoftHyphens = true,
    RemoveGoBackBookmark = true,
    ReplaceTabsWithSpaces = false,
};
MarkupSimplifier.SimplifyMarkup(wordDoc, simplifyMarkupSettings);
FormattingAssemblerSettings formattingAssemblerSettings = new FormattingAssemblerSettings
{
    RemoveStyleNamesFromParagraphAndRunProperties = false,
    ClearStyles = false,
    RestrictToSupportedLanguages = htmlConverterSettings.RestrictToSupportedLanguages,
    RestrictToSupportedNumberingFormats = htmlConverterSettings.RestrictToSupportedNumberingFormats,
    CreateHtmlConverterAnnotationAttributes = true,
    OrderElementsPerStandard = false,
    ListItemRetrieverSettings = new ListItemRetrieverSettings()
    {
        ListItemTextImplementations = htmlConverterSettings.ListItemImplementations,
    },
};
- One would really wish there was a way to get such HTML cleaned up automatically (ouch!):
<span class="pt-DefaultParagraphFont-000006">M</span>
<span class="pt-DefaultParagraphFont-000006">anaged requirements for system integration&nbsp;</span>
<span class="pt-DefaultParagraphFont-000006">of Center</span>
<span class="pt-DefaultParagraphFont-000006">&nbsp;</span>
<span class="pt-DefaultParagraphFont-000006">software&nbsp;</span>
<span class="pt-DefaultParagraphFont-000006">with&nbsp;</span>
<span class="pt-DefaultParagraphFont-000006">iLearning</span>
<span class="pt-DefaultParagraphFont-000006">&nbsp;and with content production and management (BPD). To mitigate lack of integration of $50k LMS software investment into departmental workflow</span>
<span class="pt-DefaultParagraphFont-000006">,</span>
<span class="pt-DefaultParagraphFont-000006">&nbsp;</span>
<span class="pt-DefaultParagraphFont-000006">developed&nbsp;</span>
<span class="pt-DefaultParagraphFont-000006">and documented&nbsp;</span>
<span class="pt-DefaultParagraphFont-000006">software to automate</span>
<span class="pt-DefaultParagraphFont-000006">&nbsp;creation of 4K+ user accounts p.a., 30K+ learning documents and 100K+ interactive content paths in LMS.</span>
- There are also much more serious conversion errors:
- MS-Word displays a plain text content control and a repeating section content control within a table, containing one Combobox and one plain text content control per row, perfectly:

- Convert-DocxToHtml gobbles the content completely (and so does Google Docs Preview):
The underlying HTML has just a blank table under each heading:
<div class="pt-000001">
<p dir="ltr" class="pt-qiCVHeading1">
<span class="pt-DefaultParagraphFont-000002">Profile</span>
</p>
</div>
<div align="left">
<table border="1" cellspacing="0" cellpadding="0" dir="ltr" class="pt-000003" />
</div>
<div class="pt-000001">
<p dir="ltr" class="pt-qiCVHeading1">
<span class="pt-DefaultParagraphFont-000002">Technologies</span>
</p>
</div>
<div align="left">
<table border="1" cellspacing="0" cellpadding="0" dir="ltr" class="pt-000003" />
</div>
I still need to look into the underlying XML to see whether the .docx is to blame for that…
- MS-Word shows:
- But the HtmlConverter output in IE or Firefox:
The underlying HTML reveals that the CSS does not get applied in the right place:
<tr>
<td class="pt-000079">
<p dir="ltr" class="pt-BodyTextSmall">
<span class="pt-BodyTextSmallChar-000081">AD</span>
</p>
</td>
<td colspan="2" class="pt-000079">
<p dir="ltr" class="pt-BodyTextSmall">
<span class="pt-BodyTextSmallChar-000081">Active Directory, Microsoft&rsquo;s directory implementation.</span>
</p>
</td>
</tr>
<tr>
<td class="pt-000086">
<p dir="ltr" class="pt-BodyTextSmall">
<span class="pt-000085">&nbsp;</span>
</p>
</td>
<td colspan="2" class="pt-000086">
<p dir="ltr" class="pt-BodyTextSmall">
<span class="pt-000085">&nbsp;</span>
</p>
</td>
</tr>
- One could imagine MS-Word being more lenient than the PowerTools cmdlet Convert-DocxToHtml, much as a web browser’s parser tolerates and displays bad HTML. However, that would first require explaining how MS-Word, which is also the WYSIWYG editor the documents originate from, came to write bad markup. Moreover, running the PowerTools cmdlet Get-OpenXmlValidationErrors on both of the above documents does not seem to find any Open XML errors that could explain the bad conversion: apart from dozens of Sch_UndeclaredAttribute errors (version-related? not sure how this could be), there is only one Pkg_PartIsNotAllowed error, relating to a glossary.
- Also yet to do:
- When (not always!) does my page title end up empty?
<title></title>
- The output defaults to an XHTML doctype, not HTML5.
- Done:
- Pretty-printing. By default, HtmlConverter puts all content (though not the CSS) on a single line (90,000 characters long in the example from which the above code is taken). For human readability, and possibly also for git tracking, pretty-printing would be better. It can be enforced like so (is there a better way? I cannot see a user-configurable option for the SaveOptions enumeration):
In openXml\OxPt\OxPtCmdlets\OxPtHelper.cs:
var htmlString = html.ToString(SaveOptions.None); // trp: requesting pretty-printing, was: html.ToString(SaveOptions.DisableFormatting);
How to watch a task you did not create in Redmine
2014/12/15
- “Watching” tasks, i.e. receiving notifications of task updates, is the default for tasks – if you created them.
- “Watching” can also be turned on per task. Go to “Issues” and click on the first task.
- In the upper right of the task page, click “Watch”.

- Then click “Next” to cycle through all tasks. This is a bit tedious even for small projects: is there a way to default to “watching all”? It does not come with the roles (I tried).
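One possible way around the clicking might be Redmine’s REST API, which (from Redmine 2.3 on) can add a watcher to an issue via `POST /issues/<id>/watchers.json`. The Python sketch below rests on several assumptions: the REST API is enabled on the server, you have an API key, and you know your numeric user id. The helper names and the example URL are hypothetical; I have not run this against a live server.

```python
import json
import urllib.request


def build_watch_request(base_url: str, issue_id: int, user_id: int,
                        api_key: str) -> urllib.request.Request:
    """Build (but do not send) the request that adds user_id as a watcher."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/issues/{issue_id}/watchers.json",
        data=json.dumps({"user_id": user_id}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Redmine-API-Key": api_key,
        },
        method="POST",
    )


def watch_issues(base_url: str, issue_ids, user_id: int, api_key: str) -> None:
    """Send one watcher request per issue id (performs network calls)."""
    for issue_id in issue_ids:
        request = build_watch_request(base_url, issue_id, user_id, api_key)
        with urllib.request.urlopen(request) as response:
            response.read()  # body is not needed; errors raise HTTPError


# Hypothetical usage; URL, ids and key are placeholders:
# watch_issues("https://redmine.example.org", [101, 102, 103], 7, "my-api-key")
```

The issue ids would still have to come from somewhere, for instance from the issue list of the project; this only replaces the “Watch”, “Next” clicking, not a server-side default.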

Categories: service-is-project-managing
notifications, redmine, watching
Stop ":Zone.Identifier:$DATA" files from being created…
2014/11/19
- … by running gpedit.msc as admin and setting "User Configuration / Administrative Templates / Windows Components / Attachment Manager / Do not preserve zone information in file attachments" to "Enabled".
- I observed these files on the drive that my VirtualBox Windows 8 guest shares with the Linux host.
- The answer is out there, of course, just not under all the search terms. A more thorough security discussion is also available.
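The policy only stops new files from appearing; the ones already sitting on the shared drive stay there. To see how many there are, something like the following Python sketch could be run on the Linux host. It assumes the files show up there as ordinary files whose names end in `:Zone.Identifier` or `:Zone.Identifier:$DATA`; the function name is made up for illustration, and the sketch only lists the files, it does not delete anything.

```python
from pathlib import Path

# File-name endings under which the zone information is assumed to
# surface on the Linux side of the shared folder.
ZONE_SUFFIXES = (":Zone.Identifier", ":Zone.Identifier:$DATA")


def find_zone_identifier_files(root) -> list:
    """Return all zone-information files below root, sorted by path."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.name.endswith(ZONE_SUFFIXES)
    )


if __name__ == "__main__":
    import sys

    # Usage: python find_zone_files.py /path/to/shared/folder
    for path in find_zone_identifier_files(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```

Review the list before removing anything, e.g. by calling `path.unlink()` on each entry.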
Categories: Glitches&Errors, os
linux, security, virtualbox, windows
Fun with Zotero inserting citations and bibliographies
2014/11/17
- If you can install Zotero’s word processor add-ins (for LibreOffice Writer or MS-Word):
- If you cannot, you can still use Zotero’s “Create Bibliography from Items” feature (Zotero itself can be run under portable Firefox from a USB stick, no install needed at all) and insert the results into your writing. Here is a brief example:

Categories: animated-GIFs, service-is-library, training
bibliographies, MS-Word, zotero
Fun with Zotero managing bibliographic references
2014/11/17
Categories: animated-GIFs, service-is-library
bibliographies, MS-Word, zotero
Fun with MS-Word inserting boilerplate text from the Quick part gallery
2014/11/17
Categories: animated-GIFs, office-software, training
building-blocks, MS-Word, quickparts
Fun with Zotero downloading instead of typing bibliographic information
2014/11/17
- This ain’t your granddaddy’s citation manager anymore, restricted to the library IT infrastructure and the venerable Z39.50 protocol. Zotero can turn any online resource you browse into bibliographic information, saving you hours of distracting typing, so that you can start taking notes immediately instead, ideally also in Zotero’s reference manager for later reuse.

- Even better if Zotero also manages your PDF downloads, as in this example, including note-taking. Note that Zotero comes with a PDF markup extension.
Categories: service-is-library
bibliographies, MS-Word, zotero

