Generating Word Documents with Python

Let’s say you have to repeatedly prepare reports with the same structure. Wouldn’t it be great if you could generate them automatically once you have the data? Well, you can!

Microsoft changed the underlying format of its Office documents (Word, PowerPoint, Excel, etc.) from binary to Open XML with the release of Microsoft Office 2007 [1]. Take a guess what the x in the new extensions .docx, .pptx and .xlsx stands for. This offers a few advantages, the most notable being that the format is open and accessible to anyone. We will take advantage of this to generate Word reports from templates with Python.

Generate a report from a template

Unbeknownst to many, Office documents are actually just zip files, which also keeps the document size down for sharing. This means we can unzip an Office file to have a look at its internal structure! The demo files are available on GitHub if you wish to follow along.

Let’s first have a quick peek at what our Word document looks like. It has some styled texts and an image.

The template which we will use to generate reports.

Let’s change the extension of the Word document to .zip and unzip it. Notice that immediately after opening the zip container, we see directories and files such as [Content_Types].xml at the top level. This detail is important when we re-assemble our Word document later.

Compressed contents of a Word file.
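
If you would rather not rename the file, Python's zipfile module can open a .docx directly. Here is a minimal sketch; the helper names list_parts and top_level_entries are made up for illustration.

```python
import zipfile


def list_parts(docx_path):
    """Return the names of all entries stored in an Office file."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()


def top_level_entries(docx_path):
    """Return the sorted entries sitting at the root of the archive."""
    roots = set()
    for name in list_parts(docx_path):
        # "word/document.xml" -> "word/", "[Content_Types].xml" stays as-is
        head, sep, _rest = name.partition("/")
        roots.add(head + sep)
    return sorted(roots)
```

Calling top_level_entries('Template.docx') should show [Content_Types].xml next to folders such as word/, which is the layout we need to reproduce when zipping the report later.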

We notice that images are stored in /word/media while the document text itself is in /word/document.xml. This means that we can do some string replacement with Python to replace our placeholder texts!

Here’s the Plan

Prepare your Word template file

  • Start off with a blank document, draw any tables, insert any images you like.
  • When typing a placeholder, type the whole thing in one go. If you go back and edit part of it, it may still look like a single word on screen, while in the XML document it may be split into separate chunks. We won’t be able to replace it easily then!
  • Here, I am using a format of {placeholder-text}, but you can use anything you like.
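
A quick way to catch split placeholders before generating anything is to check that each one appears as an unbroken string in document.xml. This is a rough sketch, assuming the XML is UTF-8 encoded; the function name missing_placeholders is hypothetical.

```python
import zipfile


def missing_placeholders(docx_path, placeholders):
    """Return the placeholders that do not appear as one unbroken string."""
    with zipfile.ZipFile(docx_path) as archive:
        # Assumption: document.xml is UTF-8 encoded
        xml = archive.read("word/document.xml").decode("utf-8")
    return [p for p in placeholders if p not in xml]
```

Any placeholder this returns was probably split into chunks by Word. Deleting it in the template and retyping it in one go usually fixes that.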

Prepare the report data

  • With your software of choice (MATLAB is shown here as an example), generate your graphs and data and save them as files.
  • For the images, be sure to follow the dimensions and names of the ones found in the /word/media directory. If your image has different dimensions or a different aspect ratio, it will be stretched or squished to fit those of the image originally in the template.
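
Since the replacement only works when the file names match, it can help to compare your generated images against the names in the template. The sketch below checks names only, not dimensions, and both helper names are made up for illustration.

```python
import os
import zipfile


def template_media_names(docx_path):
    """Return the file names stored under word/media/ in the template."""
    with zipfile.ZipFile(docx_path) as archive:
        return {
            name.split("/")[-1]
            for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        }


def unmatched_images(docx_path, image_paths):
    """Return the generated images whose names are not in the template."""
    expected = template_media_names(docx_path)
    return [p for p in image_paths if os.path.basename(p) not in expected]
```

An empty list from unmatched_images means every generated image has a counterpart in the template to overwrite.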

Generate the report

  1. Unzip the template Word document.
  2. Replace all images in /word/media directory.
  3. Replace the placeholders in document.xml with the values you have prepared.
  4. Re-zip the contents of the folder to generate your Word document. As mentioned before, [Content_Types].xml must end up at the top level of the zip, so make sure you don’t inadvertently zip a folder that contains your files. Otherwise, you’ll get the error below.
  5. Rename your zip file as a .docx file, and done!
Some zip programs automatically create a directory for all your files. If this happens, you’ll get an unreadable content error. (Word found unreadable content in document name.docx. Do you want to recover the contents of this document? If you trust the source of this document, click Yes.)
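
To avoid the extra-folder problem when zipping from Python, each file can be stored under its path relative to the unzipped folder. This is one possible sketch rather than the only way to do it. The function names are hypothetical, and the output file is assumed to live outside the folder being zipped.

```python
import os
import zipfile


def zip_folder_as_docx(folder, docx_path):
    """Zip the contents of folder (not the folder itself) into docx_path."""
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(folder):
            for filename in files:
                full_path = os.path.join(root, filename)
                # Store paths relative to the folder so that
                # [Content_Types].xml ends up at the archive root
                arcname = os.path.relpath(full_path, folder).replace(os.sep, "/")
                archive.write(full_path, arcname)


def has_root_content_types(docx_path):
    """Check that [Content_Types].xml sits at the top level of the archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return "[Content_Types].xml" in archive.namelist()
```

If has_root_content_types returns False for your output file, the archive was most likely created one level too high.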

Python Source Code

Also available on GitHub. This is a proof of concept, so there may be room for improvement.

import glob
import shutil
import sys
import zipfile
from xml.sax.saxutils import escape

# Placeholders used in the template, in the same order as the lines of out.txt
templatetxt = ["{ENDDATE}", "{STARTDATE}", "{d2}", "{d3}", "{d4}", "{d5}", "{d6}", "{d7}",
               "{p1}", "{p2}", "{p3}", "{p4}", "{p5}", "{p6}", "{p7}", "{description}"]

with zipfile.ZipFile('Template.docx', 'r') as zipObj:
    # Extract all the contents of the template into the tmp directory
    print("Extracting zip file...")
    zipObj.extractall('tmp')

    # Read data from document xml file (as bytes)
    print("Reading original document.xml from zipObj...")
    obj_xml = zipObj.read('word/document.xml')

# Copy all the emf files over; they overwrite the template images with the same names
print("Copying all *.emf files...")
for file in glob.glob('*.emf'):
    shutil.copy(file, 'tmp/word/media/')

# One replacement value per line, in the same order as templatetxt
print("Reading out.txt from Matlab...")
with open("out.txt", "r") as txt:
    replacedtxt = [line.rstrip() for line in txt]

if len(replacedtxt) < len(templatetxt):
    sys.exit("out.txt has fewer lines than there are placeholders")

print("Replacing all placeholder texts in template.docx...")
for temptxt, value in zip(templatetxt, replacedtxt):
    # Escape &, < and > so that the values cannot break the XML
    obj_xml = obj_xml.replace(temptxt.encode(), escape(value).encode())

# Write the new document xml (opening with "wb" overwrites the extracted copy)
with open('tmp/word/document.xml', "wb") as f_xml:
    f_xml.write(obj_xml)

# Zip the contents of tmp into a word file in working dir
print("Outputting docx file...")
shutil.make_archive('output', 'zip', 'tmp/')
# Rename and change ext; replacedtxt[1] is the {STARTDATE} value
outputfilename = "Report_" + replacedtxt[1] + ".docx"
shutil.move('output.zip', outputfilename)

print("DONE!")
print("Saved to " + outputfilename)
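
As a variation on the script above, the temporary folder can be skipped entirely by copying entries from the template zip straight into a new zip. This is only a sketch of that idea, not the script from the repository. The function name fill_template and its parameters are made up, and it assumes document.xml is UTF-8 encoded.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_template(template_path, output_path, values, images=None):
    """Copy template_path to output_path, swapping placeholders and images.

    values maps placeholder text to replacement text; images maps a file
    name found in word/media/ to the replacement image bytes.
    """
    images = images or {}
    with zipfile.ZipFile(template_path) as src, \
         zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                for placeholder, value in values.items():
                    # Escape &, < and > so values cannot break the XML
                    xml = xml.replace(placeholder, escape(str(value)))
                data = xml.encode("utf-8")
            elif item.filename.startswith("word/media/"):
                name = item.filename.rsplit("/", 1)[-1]
                data = images.get(name, data)
            # Entry names are reused as-is, so the layout stays intact
            dst.writestr(item.filename, data)
```

Because every entry keeps its original name, [Content_Types].xml stays at the top level and the extra-folder problem cannot occur.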

Closing Thoughts

While the provided script only creates the Word report file for you, you can go ahead and programmatically convert the Word file into a PDF file for sharing. There are a few ways to do this in Python, such as docx2pdf [2], COM automation [3], or LibreOffice [4].
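
As an illustration of the LibreOffice route, the conversion can be driven through its headless command line from Python. This is a sketch under assumptions: LibreOffice must be installed, the executable is assumed to be reachable as soffice (its name and location vary by platform), and both function names are hypothetical.

```python
import subprocess


def libreoffice_pdf_command(docx_path, out_dir, soffice="soffice"):
    """Build the command line for a headless LibreOffice conversion."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", out_dir, docx_path]


def convert_to_pdf(docx_path, out_dir, soffice="soffice"):
    """Run LibreOffice; raises CalledProcessError if the conversion fails."""
    subprocess.run(libreoffice_pdf_command(docx_path, out_dir, soffice), check=True)
```

If soffice is not on your PATH, pass the full path to the executable as the soffice argument.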

This quick dive only looks into Word documents, but the same approach applies to Excel and PowerPoint files, which are zip containers of XML files as well.

XML content of Excel and PowerPoint.

References

  1. https://support.office.com/en-us/article/open-xml-formats-and-file-name-extensions-5200d93c-3449-4380-8e11-31ef14555b18
  2. https://pypi.org/project/docx2pdf/
  3. https://stackoverflow.com/questions/6011115/doc-to-pdf-using-python
  4. https://michalzalecki.com/converting-docx-to-pdf-using-python/
