Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes Monday through Friday.

HPR3596: Extracting text, tables and images from docx files using Python

Hosted by Mr. Young on 2022-05-16

Tools to extract data from docx files:

  1. docx2txt
  2. python-docx2txt
  3. python-docx
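
For context on what these tools read: a docx file is a zip archive, with the document body stored as XML in word/document.xml and embedded images conventionally stored under word/media/. The following is a minimal standard-library sketch of that idea, not how any of the listed tools is actually implemented. The function name extract_text_and_images is hypothetical, and the sketch simply joins the text of every paragraph element, including paragraphs inside table cells.

```python
import os
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_text_and_images(src, img_dest):
    """Return (text, image_paths) for the .docx at src.

    Hypothetical helper for illustration: text is one line per paragraph,
    and embedded images are copied into the img_dest directory.
    """
    os.makedirs(img_dest, exist_ok=True)
    with zipfile.ZipFile(src) as archive:
        # The main document body is an XML file inside the archive.
        root = ET.fromstring(archive.read("word/document.xml"))
        paragraphs = []
        for para in root.iter(W_NS + "p"):
            # A paragraph's text is split across one or more w:t elements.
            pieces = [t.text or "" for t in para.iter(W_NS + "t")]
            paragraphs.append("".join(pieces))

        # Copy every file found under word/media/ into img_dest.
        images = []
        for name in archive.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                target = os.path.join(img_dest, os.path.basename(name))
                with open(target, "wb") as out:
                    out.write(archive.read(name))
                images.append(target)
    return "\n".join(paragraphs), images
```

In practice the libraries listed above handle many details this sketch ignores, such as headers, footers, footnotes and hyperlinks.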

Code Snippets

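import csv
import docx      # provided by python-docx
import docx2txt  # provided by python-docx2txt

# src: path to the .docx file; img_dest: directory that receives extracted images
text = docx2txt.process(src, img_dest)
with open("data.txt", "wt") as f:
    f.write(text)

# Collect the text of every cell, grouped by table and then by row
document = docx.Document(src)
tables = document.tables
data = []
for table in tables:
    table_data = []
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            row_data.append(cell.text)
        table_data.append(row_data)
    data.append(table_data)

# Write each table to its own CSV file: 0.csv, 1.csv, ...
for i, table_data in enumerate(data):
    with open(f"{i}.csv", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(table_data)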
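
The table loops above can also be packaged as two small functions, which makes them easier to reuse and to test. This is only a sketch: the names tables_to_rows and write_tables_as_csv and the out_dir parameter are hypothetical, not part of python-docx. The first function relies only on the same attributes the snippet uses (.rows, .cells and .text), so it accepts python-docx's document.tables or any objects of the same shape.

```python
import csv
import os


def tables_to_rows(tables):
    """Convert table objects into nested lists of cell text.

    Each table must have .rows, each row .cells, and each cell .text,
    which is the shape of the objects in python-docx's document.tables.
    """
    return [
        [[cell.text for cell in row.cells] for row in table.rows]
        for table in tables
    ]


def write_tables_as_csv(data, out_dir="."):
    """Write each table in data to out_dir/<index>.csv and return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, table_data in enumerate(data):
        path = os.path.join(out_dir, f"{i}.csv")
        # newline="" lets the csv module control line endings itself.
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(table_data)
        paths.append(path)
    return paths
```

With python-docx installed, a call might look like write_tables_as_csv(tables_to_rows(docx.Document(src).tables)), where src is the path to the docx file.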

Copyright Information

Unless otherwise stated, our shows are released under a Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.

The HPR Website Design is released to the Public Domain.