Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes Monday through Friday.


HPR3596: Extracting text, tables and images from docx files using Python

Hosted by Mr. Young on 2022-05-16

Tools to extract data from docx files:

  1. docx2txt
  2. python-docx2txt
  3. python-docx

Code Snippets

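import csv
import os

import docx      # provided by the python-docx package
import docx2txt  # provided by the docx2txt package

src = "document.docx"  # placeholder: path to the input .docx file
img_dest = "images"    # placeholder: directory that receives extracted images
os.makedirs(img_dest, exist_ok=True)  # make sure the image directory exists

# Extract the text (saving embedded images to img_dest) and write it out.
text = docx2txt.process(src, img_dest)
with open("data.txt", "wt") as f:
    f.write(text)

# Collect the cell text of every table as a list of rows.
document = docx.Document(src)
tables = document.tables
data = []
for table in tables:
    table_data = []
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            row_data.append(cell.text)
        table_data.append(row_data)
    data.append(table_data)

# Write each table to its own CSV file (0.csv, 1.csv, ...).
for i, table_data in enumerate(data):
    with open(f"{i}.csv", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(table_data)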
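
Images without extra packages (sketch)

A .docx file is a zip archive, and embedded pictures are normally stored under word/media/ inside it. The following is a minimal sketch, using only the standard library, of how the images could be pulled out directly. The function name extract_docx_images and the IMAGE_SUFFIXES set are made up for illustration and are not part of docx2txt or python-docx.

```python
import zipfile
from pathlib import Path

# Hypothetical list of file extensions treated as images (an assumption).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp"}


def extract_docx_images(src, img_dest):
    """Copy every image found under word/media/ in a .docx into img_dest.

    src may be a path or a file-like object; returns the list of saved paths.
    """
    dest = Path(img_dest)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(src) as archive:
        for name in archive.namelist():
            # Embedded pictures live in the word/media/ folder of the archive.
            if not name.startswith("word/media/"):
                continue
            if Path(name).suffix.lower() not in IMAGE_SUFFIXES:
                continue
            target = dest / Path(name).name
            target.write_bytes(archive.read(name))
            saved.append(target)
    return saved
```

Calling extract_docx_images(src, img_dest) with the same src and img_dest as in the snippets above would write the pictures into img_dest and return their paths.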



Copyright Information

Unless otherwise stated, our shows are released under a Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.

The HPR Website Design is released to the Public Domain.