Archive

Posts Tagged ‘openxml’

How to automate removing MS-Office VBA project protection for multiple files

  1. Problem:
    1. Need VBE extensibility to implement some tools and practices of the modern SDLC.
    2. Software consists of not only 1000s of Word templates which are anything but DRY, but also highly protected even during what should be the SDLC
      1. Not solvable by VBA automaton for security reasons:
        1. VBE password protection (OK, there is Sendkeys, but that is considered harmful).
        2. digitally signed.
      2. Developer tab read-only protection: this one is not covered here, since it can be dealt with through regular VBA automation.
  2. Workaround: PowerShell for starters:
    1. Get Unlock-OfficeMacro.ps1  – including the addition in the comments.
    2. Wrap the downloaded script like so:
    3.    
      Get-ChildItem -Include *.do?m* -Exclude *_unlocked* -Path "G:\imf\word templates\Quarterly Releases_unprotected_ps"  -Recurse |`  
      
      foreach{ $_.IsReadOnly = $false   $output_filename =  $_.Directory.ToString() + "\" + $_.BaseName + '_unlocked' +  $_.Extension .\Unlock-OfficeMacro.ps1 $_.FullName  $output_filename 
      }
      Exit 
       
  3. NextProblem: The script removes the warning dialogues on opening the altered MS-Word files remain. This still hampers automation.
  4. Next workaround: this script automates the GUI:
    1. “OK”’ing the warning dialogue: “The project file ‘C:\Users\tplagwitz\AppData\Roaming\Microsoft\Templates\documaker.dotm’ contains invalid key ‘DPx’.–Continue Loading Project?” 
      image
    2. making minor changes and saving the file (this also bypasses the "discard certificate" warning, if the file was also signed (as is my case).  image
    3. Prerequisites: none, other than putting your word files in a folder the script (which the script will prompt you for, and for an (optional) substring, to filter file names) .

    4. Limitations:
      1. I  used to have also have, per module in the VBA project, warnings: “Microsoft Visual Basic for Applications Unexpected error (40230) ” and try to bypass these also, but since I cannot replicate the warnings, this remains untested.image
      2. A superior approach (enabling round tripping) would be to attempt to automate entering the password, but the traditional SendKeys approach is unreliable, and  newer approaches (using SendMessage from the the Win32 API or bypassing the intended negative effects of password protection, via an in-memory substitution).
    5. And here is the AutoIt script:
  
include <Array.au3>
#include <debug.au3>
#include <File.au3>
#include <log4a.au3>
Opt("WinTitleMatchMode", 2)
Opt("MustDeclareVars", 1)
Dim $file, $runpath, $iPID, $i, $folderpath, $pattern, $files, $filepath, $files, $ret, $oAppl, $oDoc, $sFilter
_log4a_SetEnable()
_log4a_SetOutput($LOG4A_OUTPUT_BOTH)
$pattern = InputBox("File Pattern?", "Enter file pattern,  beyond (before) *.do?m (= Files with macros), that files have to match.", "_unlocked")
$folderpath = InputBox("Where?", "Enter folder to find files in...")
$sFilter = "*" & $pattern & "*.do?m|~*,Backup*"
If Not (FileExists($folderpath) And StringInStr(FileGetAttrib($folderpath), "D")) Then
	MsgBox(1, "Error", " The path you entered does Not seem To exist Or is Not a folder. Exiting....")
	Exit
Else
	$files = _FileListToArrayRec($folderpath, $sFilter, $FLTAR_FILES, $FLTAR_RECUR, $FLTAR_SORT, $FLTAR_RELPATH)
	For $i = 1 To UBound($files) - 1
		$file = $files[$i]
		If (StringRight($folderpath, 1) <> "\") Then $folderpath = $folderpath & "\"
		$filepath = $folderpath & $file
		Local $iPID = Run('"C:\Program Files\Microsoft Office 15\root\office15\WINWORD.EXE" /q /a /m "' & $filepath & '"', "", @SW_SHOWMAXIMIZED)
		$ret = WinActivate("- Word", "")
		$ret = WinWaitActive("- Word", "", 5)
		If ($ret = 0) Then
			_log4a_debug("cannot load: " & $filepath & @TAB & @ScriptLineNumber & @CRLF)
			$ret = ProcessClose($iPID)
			ContinueLoop
		Else
			_log4a_debug("processing: " & $filepath & @TAB & @ScriptLineNumber & @CRLF)
		EndIf
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		Send("{SHIFTDOWN}{SHIFTUP}")
		Sleep(100)
		Send("!{f11}")
		$ret = WinActivate("Microsoft Visual Basic for Applications", "invalid key")
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		$ret = WinWaitActive("Microsoft Visual Basic for Applications", "invalid key", 5)
		If ($ret = 0) Then
			_log4a_debug("nothing to do with invalid key, will close word and continue next file: " & @ScriptLineNumber & @CRLF)
			$ret = ProcessClose($iPID)
			If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
			ContinueLoop
		EndIf
		If $ret <> 0 Then
			$ret = ControlClick($ret, "", "Button1")
			If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		Else
			If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		EndIf
		Sleep(1000)
		$ret = 0
		Sleep(5000)
		While (0 <> WinActivate("Microsoft Visual Basic for Applications", "Unexpected error (40230)"))
			$ret = ControlClick("[CLASS:#32770
			If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		WEnd
		Sleep(3000)
		$ret = WinActivate("Microsoft Visual Basic for Applications", "")
		$ret = WinWaitActive("Microsoft Visual Basic for Applications", "", 5)
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		If ($ret = 0) Then WinActivate("Microsoft Visual Basic for Applications", "")
		$ret = WinWaitActive("Microsoft Visual Basic for Applications", "", 5)
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		$ret = ControlSend("Microsoft Visual Basic for Applications", "", "VbaWindow1", "'dummy" & @CRLF)
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		Sleep(1000)
		Send("!q")
		Sleep(1000)
		Send("!{f4}")
		Sleep(1000)

		$ret = WinWaitActive("Microsoft Word", "", 5)
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		If ($ret = 0) Then $ret = WinActivate("Microsoft Word", "")
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		$ret = WinWaitActive("Microsoft Word", "", 5)
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		If ($ret <> 0) Then
			Send("!s")
			Sleep(1000)
			$ret = WinWaitActive("Microsoft Word", "discarded", 5)
			If ($ret = 0) Then
				_log4a_debug("the certificate dialogue is not up yet", True)
				$ret = WinWaitActive("Microsoft Word", "", 5)
				If ($ret <> 0) Then
					_log4a_debug("failure with !s: " & @ScriptLineNumber & @CRLF)
					Send("{Enter}")
				EndIf
			EndIf
			Sleep(1000)
		Else
			If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		EndIf
		$ret = WinWaitActive("Microsoft Word", "discarded", 5)
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		$ret = WinActivate("Microsoft Word", "discarded")
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		$ret = WinWaitActive("Microsoft Word", "discarded", 5)
		If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		If ($ret <> 0) Then
			$ret = ControlClick("Microsoft Word", "", "Button1")
			If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
			If $ret = 0 Then Send("!d")
		Else
			If ($ret = 0) Then _log4a_debug("failure on: " & @ScriptLineNumber & @CRLF)
		EndIf
		Sleep(1000)
		Send("{BACKSPACE}")
		Sleep(1000)
		Send("^s")
		Sleep(3000)
		ProcessClose($iPID)
		Sleep(1000)
	Next
	Sleep(1000)
EndIf

Fun with .docx to .html transforms by means of HtmlConverter from PowerTools for Open XML

  1. The transform is FOSS and platform-independent:
    1. It neither requires Office nor Windows (The OpenXML SDK runs on Linux via Mono on the server.
    2. However, the most recent installment of Powertools for OpenXML, a high-level API to the OpenXML SDK, comes with a PowerShell interface (benefit: no Visual studio requirement).
  2. Valuable features of the transform,  among many other things, are:
    1. HtmlConverter is able to translate MS-Word styles into CSS (insofar needed – my code style has “No proofing” set, however, this cannot be implemented on the WWW), so the layout is preserved as designed, but w/o need for inline formatting:
        span.pt-StrongEmphasis-000052 {
            font-family: Calibri;
            font-size: 11pt;
            font-style: italic;
            font-weight: bold;
            margin: 0in;
            padding: 0in;
        }

        span.pt-lowCodeConsoleChar0 {
            color: #FFFFFF;
            background: #000000;
            font-family: Consolas;
            font-size: 10pt;
            font-weight: normal;
            margin: 0in;
            padding: 0in;
        }
     &lt;h3 dir=&quot;ltr&quot; class=&quot;pt-000040&quot;&gt;
            &lt;span class=&quot;pt-000041&quot;&gt;2.2.1&lt;/span&gt;&lt;span class=&quot;pt-000042&quot;&gt;&lt;span class=&quot;pt-000043&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;pt-Heading2Char&quot;&gt;&lt;b&gt;References&lt;/b&gt;&lt;/span&gt;
          &lt;/h3&gt;

          &lt;p dir=&quot;ltr&quot; class=&quot;pt-BodyText&quot;&gt;
            &lt;span class=&quot;pt-DefaultParagraphFont-000003&quot;&gt;&lt;br /&gt;
            &amp;lrm;&lt;/span&gt;&lt;span class=&quot;pt-000000&quot;&gt;&amp;nbsp;&lt;/span&gt;
          &lt;/p&gt;

          &lt;h1 dir=&quot;ltr&quot; class=&quot;pt-000006&quot;&gt;
            &lt;span class=&quot;pt-000007&quot;&gt;&lt;b&gt;3&lt;/b&gt;&lt;/span&gt;&lt;span class=&quot;pt-000008&quot;&gt;&lt;b&gt;&lt;span class=&quot;pt-000009&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;span class=&quot;pt-Heading1Char&quot;&gt;&lt;b&gt;Introduction&lt;/b&gt;&lt;/span&gt;
          &lt;/h1&gt;

          &lt;h2 dir=&quot;ltr&quot; class=&quot;pt-000018&quot;&gt;
            &lt;span class=&quot;pt-000019&quot;&gt;3.1&lt;/span&gt;&lt;span class=&quot;pt-000020&quot;&gt;&lt;span class=&quot;pt-000021&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;span class=&quot;pt-Heading2Char&quot;&gt;&lt;b&gt;Purpose of Document&lt;/b&gt;&lt;/span&gt;
          &lt;/h2&gt;
    1. There are many more options that I have not yet tried:
            SimplifyMarkupSettings simplifyMarkupSettings = new SimplifyMarkupSettings
            {
                RemoveComments = true,
                RemoveContentControls = true,
                RemoveEndAndFootNotes = true,
                RemoveFieldCodes = false,
                RemoveLastRenderedPageBreak = true,
                RemovePermissions = true,
                RemoveProof = true,
                RemoveRsidInfo = true,
                RemoveSmartTags = true,
                RemoveSoftHyphens = true,
                RemoveGoBackBookmark = true,
                ReplaceTabsWithSpaces = false,
            };
            MarkupSimplifier.SimplifyMarkup(wordDoc, simplifyMarkupSettings);

            FormattingAssemblerSettings formattingAssemblerSettings = new FormattingAssemblerSettings
            {
                RemoveStyleNamesFromParagraphAndRunProperties = false,
                ClearStyles = false,
                RestrictToSupportedLanguages = htmlConverterSettings.RestrictToSupportedLanguages,
                RestrictToSupportedNumberingFormats = htmlConverterSettings.RestrictToSupportedNumberingFormats,
                CreateHtmlConverterAnnotationAttributes = true,
                OrderElementsPerStandard = false,
                ListItemRetrieverSettings = new ListItemRetrieverSettings()
                {
                    ListItemTextImplementations = htmlConverterSettings.ListItemImplementations,
                },
            };
    1. One would really wish there was a way to get such HTML cleaned up automatically (ouch!):
               &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;M&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;anaged requirements for system integration&amp;nbsp;&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;of Center&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;&amp;nbsp;&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;software&amp;nbsp;&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;with&amp;nbsp;&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;iLearning&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;&amp;nbsp;and with content production and management (BPD). To mitigate lack of integration of $50k LMS software investment into departmental workflow&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;&amp;nbsp;&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;developed&amp;nbsp;&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;and documented&amp;nbsp;&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;software to automate&lt;/span&gt;
                &lt;span class=&quot;pt-DefaultParagraphFont-000006&quot;&gt;&amp;nbsp;creation of 4K+ user accounts p.a., 30K+ learning documents and 100K+ interactive content paths in LMS.&lt;/span&gt;
    1. There are also much more serious conversion errors:
      1. MS-Word displays a plain text content control and a repeating section content control within a table, containing one Combobox and one plain text content control per row, perfectly: openxml-convert-docxtohtml-error-word
      2. Convert-DocxToHtml gobbles the content completely (and so does Google Docs Preview): openxml-convert-docxtohtml-error-html The underlying HTML has just a blank table under each heading:
            &lt;div class=&quot;pt-000001&quot;&gt;
                &lt;p dir=&quot;ltr&quot; class=&quot;pt-qiCVHeading1&quot;&gt;
                  &lt;span class=&quot;pt-DefaultParagraphFont-000002&quot;&gt;Profile&lt;/span&gt;
                &lt;/p&gt;
              &lt;/div&gt;
              &lt;div align=&quot;left&quot;&gt;
                &lt;table border=&quot;1&quot; cellspacing=&quot;0&quot; cellpadding=&quot;0&quot; dir=&quot;ltr&quot; class=&quot;pt-000003&quot; /&gt;
              &lt;/div&gt;
              &lt;div class=&quot;pt-000001&quot;&gt;
                &lt;p dir=&quot;ltr&quot; class=&quot;pt-qiCVHeading1&quot;&gt;
                  &lt;span class=&quot;pt-DefaultParagraphFont-000002&quot;&gt;Technologies&lt;/span&gt;
                &lt;/p&gt;
              &lt;/div&gt;
              &lt;div align=&quot;left&quot;&gt;
                &lt;table border=&quot;1&quot; cellspacing=&quot;0&quot; cellpadding=&quot;0&quot; dir=&quot;ltr&quot; class=&quot;pt-000003&quot; /&gt;
              &lt;/div&gt;
          
      3. MS-Word shows:imageYet need to look in to the underlying XML to see whether the .docx is to blame for that…
      4. But HtmlConverter output in IE or Firefox: imageThe underlying HTML reveals that the css does not get applied in the right place:
 	&lt;tr&gt;
                &lt;td class=&quot;pt-000079&quot;&gt;
                  &lt;p dir=&quot;ltr&quot; class=&quot;pt-BodyTextSmall&quot;&gt;
                    &lt;span class=&quot;pt-BodyTextSmallChar-000081&quot;&gt;AD&lt;/span&gt;
                  &lt;/p&gt;
                &lt;/td&gt;
                &lt;td colspan=&quot;2&quot; class=&quot;pt-000079&quot;&gt;
                  &lt;p dir=&quot;ltr&quot; class=&quot;pt-BodyTextSmall&quot;&gt;
                    &lt;span class=&quot;pt-BodyTextSmallChar-000081&quot;&gt;Active Driector, Microsfot&amp;rsquo;s directory implementation.&lt;/span&gt;
                  &lt;/p&gt;
                &lt;/td&gt;
              &lt;/tr&gt;

              &lt;tr&gt;
                &lt;td class=&quot;pt-000086&quot;&gt;
                  &lt;p dir=&quot;ltr&quot; class=&quot;pt-BodyTextSmall&quot;&gt;
                    &lt;span class=&quot;pt-000085&quot;&gt;&amp;nbsp;&lt;/span&gt;
                  &lt;/p&gt;
                &lt;/td&gt;
                &lt;td colspan=&quot;2&quot; class=&quot;pt-000086&quot;&gt;
                  &lt;p dir=&quot;ltr&quot; class=&quot;pt-BodyTextSmall&quot;&gt;
                    &lt;span class=&quot;pt-000085&quot;&gt;&amp;nbsp;&lt;/span&gt;
                  &lt;/p&gt;
                &lt;/td&gt;
              &lt;/tr&gt;
  1. One could imagine MS-Word acting less strictly than OpenXML PowerTools:Convert-DocxToHtml, like a web-browser’s parser tolerates and displays bad HTML. However, not only would need to be justified how MS-Word can also serve as the originating HTML WYSIWYG editor. The OpenXML PowerTools:Get-OpenXmlValidationErrors for both of the above documents does not seem to find any OpenXML errors that could explain the bad conversion (other than dozens of Sch_UndeclaredAttribute errors (Version-related? Not sure how this could be) , there is only a Pkg_PartIsNotAllowed relating to a glossary).
  • Also yet to do:
    1. When (not always!) does my page title end up as empty?
      &lt;title&gt;&lt;/title&gt;
    2. Defaults to doctype xhtml, not html(5).
  • Done:
      1. Pretty-printing. The HtmlConverter output defaults to all content (not css ) on 1 line (e.g. in the example from which above code is taken, 90000chars long). For human readability, and also possibly git tracking, pretty-printing would be better. Can be enforced like so (is there a better way? cannot see a user-configurable option for the SaveOptions enumeration):
    openXml\OxPt\OxPtCmdlets\OxPtHelper.cs:var htmlString = html.ToString(SaveOptions.None); // trp: requesting pretty-printing, was:html.ToString(SaveOptions.DisableFormatting);