DOC vs DOCX

Posted on: March 19th, 2012

In ePublisher 2011.3, we introduced an alternate processing flow for the Microsoft Word Office Open XML document format (OOXML, referred to hereafter as DOCX). Following is a brief explanation of the reason for this new processing flow, some of its current side effects, and the implications of this approach down the road.

Let’s start with a brief history of the Microsoft Word integration with ePublisher. In 2004 – 2005, when ePublisher was being designed, the Word adapter leveraged existing code that used Word VBA to convert Word DOC files into ePublisher intermediate files (WIF). Through the years, ePublisher development on the Word adapter has been based on this processing flow, and the same flow has been used with each successive release of Word (2007 and 2010).

In Word 2007, Microsoft introduced a new document format named Office Open XML, which is saved with a *.docx file extension. Unlike the DOC format, the DOCX format is an XML-based open standard. Until 2011.3, ePublisher continued to use the same VBA processing flow for both the DOC and DOCX formats.

So why another adapter? A number of issues show up when DOCX files are processed using the DOC processing flow. The root cause is that the VBA-based processing flow normalizes all files to DOC format by saving them as DOC. This save is lossy, and in some cases formatting information from the DOCX file is dropped.

Why save DOCX to DOC? Why not just leave the file in its native format when generating ePublisher’s intermediate files? The answer: VBA does not allow inspection of character style runs. Because ePublisher cannot use VBA to iterate over runs of character formatting, it relies on a library that inspects the raw bytes of a DOC file and derives the runs of character formatting from that analysis. This library only works with DOC files (in DOCX, runs of character formatting can be inspected via XPath), so every file must be saved as DOC before the VBA-based processing flow can be applied. In other words, the VBA-based flow can only ever operate on the DOC format, which is an inherent constraint of the DOC/VBA processing flow.
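To illustrate the point that character runs in a DOCX file are reachable with XPath, here is a minimal Python sketch. It is not ePublisher's code. It relies only on the facts that a DOCX file is a ZIP package whose main part is word/document.xml and that runs are stored as w:r elements. The function names, the sample text, and the "Emphasis" style ID are made up for the example, and the sample package is deliberately incomplete (it holds only the main document part, so Word itself would not open it).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def character_runs(docx_bytes):
    """Return a (character style ID or None, text) pair for each run."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    runs = []
    # ElementTree supports a limited XPath subset; ".//w:r" finds every run.
    for run in root.iterfind(".//w:r", NS):
        style = run.find("w:rPr/w:rStyle", NS)
        # w:val holds the style ID, which may differ from the display name.
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        text = "".join(t.text or "" for t in run.iterfind("w:t", NS))
        runs.append((style_id, text))
    return runs


def sample_docx():
    """Build a hypothetical, stripped-down package for the demonstration."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
        '<w:r><w:t xml:space="preserve">Press </w:t></w:r>'
        '<w:r><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr><w:t>Enter</w:t></w:r>'
        "</w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


RUNS = character_runs(sample_docx())
print(RUNS)
```

For a real file, the same function could be called with the bytes read from disk. No copy of Word and no conversion to DOC is involved, which is the property the XML-based format provides.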

The new DOCX adapter works around the limitations enumerated above by leaving the original DOCX file in its native format. It uses a combination of DOM manipulation and XSL to produce the ePublisher intermediate files. The effect is that formatting information derived from DOCX files is more correct and complete.

There are some growing pains associated with this new approach. The DOCX processing flow is not as mature as the DOC processing flow, and as of the 2011.4 release the DOCX adapter has a number of issues that we are working to address. Starting with the 2011.4 release, intermediate patches for the DOCX processing flow are being made available so that these issues can be fixed sooner than the regular quarterly release interval allows. These patches are available from the following page:

http://wiki.webworks.com/Updates/DocxUpdates

There are a number of natural advantages to the DOCX adapter. Because of the problems with character style runs, the DOC adapter is forever tied to legacy 32-bit code. The DOCX adapter has no such limitation, so it represents a viable path toward 64-bit binaries. Also, the speed and memory performance of the DOCX implementation are far superior to those of the DOC implementation, which raises the scalability ceiling for DOCX source documents. Finally, while there are no current plans to make the needed changes, the fact that DOCX is open (it can be read and manipulated without Word) opens up the potential for the format to be used across platforms.

Tags: adapter, word
