Converting .docx files to plain text while preserving line breaks, so that line numbers match the source document: how-to and implications?

I’m exporting MS Word content to plain text for use with text and file utilities. I have one constraint: the line numbering feature has been enabled in Word, and any reference to line numbers in the final output must match that numbering. This is what Word’s line numbering looks like:

[Image: Word document with line numbering enabled]
(Poe, E.A.)

Obviously, Word’s numbering doesn’t count lines at newline characters; it counts the “lines” produced by wrapping at the right margin (or something like that). A script like docx2txt doesn’t seem to account for this by default and breaks lines at newlines. So if I use grep -n, the reported line numbers won’t match Word’s line numbering, as illustrated above. It’s not clear from the documentation how I would need to edit the Perl script to convert the files the way I need in this case:

our $config_newLine = "\n"; # Alternative is "\r\n".
our $config_lineWidth = 80; # Line width, used for short line justification.

I tried replacing \n with \r\n, but that doesn’t seem to work for me. So I resorted to exporting the documents directly from Word (Save As plain text, Word 2013, 64-bit) with the following settings:

  • Unicode(UTF-8)
  • Insert line breaks + end lines with (CR/LF)
  • Allow character substitution

And now indeed when I use the .txt files there is a perfect match between line numbers in the source numbering feature and the grep -n output.


  • Is there any specific configuration/process I should know about docx2txt or a similar command line utility which would have allowed me to convert my .docx files to plain text while preserving line breaks, without resorting to Word like I did?
  • What are the best practices, if any, for exporting MS Word documents (which may contain accented characters) to plain text for use with file/text utilities, with respect to line breaks and formatting; and are there any negative implications with the settings I chose for exporting i.e. inserting CR/LF?

Sample

As suggested I provide a sample. In this rar archive, I bundled a .docx file with simple paragraphs, and its exported .txt file using Word with the aforementioned options. The latter can be compared with a default run of docx2txt on the source file.

Asked By: user44370

||

docx2txt works on the information in the docx file which is a zipped set of XML files.

With regard to line wrapping, the .docx XML data only includes information about paragraphs and hard breaks, not about soft breaks. Soft breaks are a result of rendering the text in a specific font, font size and page width. docx2txt normally just tries to fit text in 80 columns (the width is configurable), without any regard for font and font size. If your .docx contains font information from a Windows system that is not available on Unix/Linux, then doing the export to .txt via Open/LibreOffice would also be unlikely to result in the same layout, although it tries to do a good job¹.
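To make the point concrete, here is a minimal Python sketch (not docx2txt's actual code) that reads paragraphs and hard breaks straight from word/document.xml and then wraps them at a fixed column width. The function names docx_paragraphs and wrap_fixed are made up for illustration, and the sketch assumes a simple document without tables or text boxes. The stored data contains no soft breaks, so the line count depends entirely on the width you pick, not on Word's layout.

```python
import textwrap
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p, w:t and w:br.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph (w:p) found in word/document.xml."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        parts = []
        for node in p.iter():
            if node.tag == W + "t":                  # a run of text
                parts.append(node.text or "")
            elif node.tag in (W + "br", W + "cr"):   # hard break inside a paragraph
                parts.append("\n")
            elif node.tag == W + "tab":
                parts.append("\t")
        paragraphs.append("".join(parts))
    return paragraphs


def wrap_fixed(paragraphs, width=80):
    """Wrap at a fixed column width, ignoring fonts and page geometry."""
    lines = []
    for para in paragraphs:
        for hard_line in para.split("\n"):
            # textwrap.wrap returns [] for empty input; keep the empty line.
            lines.extend(textwrap.wrap(hard_line, width=width) or [""])
    return lines
```

Running wrap_fixed on the same paragraphs with width=80 and width=40 gives different line counts, and neither has any particular reason to agree with the numbering Word shows.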

So neither docx2txt nor any other command-line utility, including command-line driven Open/LibreOffice processing, is guaranteed to convert the text to the same layout as exporting from Word does².

If you want (or are forced by client requirements) to render exactly as Word does, there is in my experience only one way: let Word do the rendering. When faced with a problem similar to yours³, and having gotten incompatible results from other tools, including OpenOffice, I resorted to installing a Windows VM on the host Linux server. On the guest VM, a program watches for incoming files to be converted on the host, then starts and drives Word to do the conversion and copies the result back⁴.
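As a rough idea of what driving Word through its API instead of the GUI could look like, here is an untested sketch using COM automation from Python. It assumes a Windows machine with Word installed and the third-party pywin32 package; the function name export_with_word is made up, and the numeric constants are my reading of the Word object model enumerations, so verify them against your Word version. The arguments mirror the export settings from the question (UTF-8, insert line breaks, CR/LF, allow character substitution).

```python
from pathlib import Path

# Assumed values from the Word object model; check them for your Word version.
WD_FORMAT_TEXT = 2         # WdSaveFormat.wdFormatText
WD_CRLF = 0                # WdLineEndingType.wdCRLF
MSO_ENCODING_UTF8 = 65001  # MsoEncoding.msoEncodingUTF8


def export_with_word(docx_path, txt_path):
    """Let Word itself render docx_path and save it as plain text."""
    # Imported here because pywin32 exists only on Windows.
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word expects absolute paths.
        doc = word.Documents.Open(str(Path(docx_path).resolve()))
        try:
            doc.SaveAs2(
                FileName=str(Path(txt_path).resolve()),
                FileFormat=WD_FORMAT_TEXT,
                Encoding=MSO_ENCODING_UTF8,
                InsertLineBreaks=True,      # write Word's soft breaks as real breaks
                AllowSubstitutions=True,
                LineEnding=WD_CRLF,
            )
        finally:
            doc.Close(False)                # close without saving changes
    finally:
        word.Quit()
```

A watcher on the VM could call export_with_word for each incoming file. This still needs a licensed Word installation, so it only replaces the GUI clicking, not Word itself.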

Decisions about using CR/LF or LF only, and UTF-8 or some other encoding, for the .txt files largely depend on how the resulting files are used. If the resulting files are used on Windows, I would definitely go with CR/LF, UTF-8 and a UTF-8 BOM. Modern programs on Linux are able to deduce that a file is UTF-8 without a BOM, but will not barf on the BOM and may make use of that information. You should test all your target applications for compatibility if those are known up front.
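If the same export has to serve both Windows and Unix tools, converting between the two conventions is cheap. The following Python sketch (the helper names to_windows_text and to_unix_text are made up for illustration) adds or removes the BOM and the CR characters without touching the number of lines, so line numbers reported by grep -n stay the same either way.

```python
import codecs


def to_windows_text(text):
    """Encode text as UTF-8 with a BOM and CR/LF line endings."""
    # Normalize first so existing CR/LF pairs are not doubled.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return codecs.BOM_UTF8 + normalized.replace("\n", "\r\n").encode("utf-8")


def to_unix_text(data):
    """Decode UTF-8 bytes, dropping a BOM if present, with LF line endings."""
    text = data.decode("utf-8-sig")  # "utf-8-sig" strips a leading BOM
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

One practical implication of keeping CR/LF on Linux: every line ends in an invisible carriage return, so a pattern anchored at the end of the line, such as grep 'word$', will not match until the CR is removed.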

¹ This sort of incompatibility is the primary reason some of my friends cannot switch from Windows to Linux, although they would like to. They have to use Microsoft Word, as Open/LibreOffice every once in a while mangles texts they exchange with clients.
² You can install all the fonts used in the Word files and might get lucky for some texts, some of the time.
³ Rendering PDFs from .doc/.docx
⁴ The program uses GUI automation (as if someone were clicking its menus) and doesn’t attempt to drive Word via an API. I am pretty sure the latter can be done as well, and it would have the advantage of not breaking things if Word gets upgraded.

Answered By: Anthon