Hacker Public Radio

Your ideas, projects, opinions - podcasted.

New episodes Monday through Friday.


HPR Comments


SolusSpider - Peter Paterson says: Experience with Tesseract OCR software

RE: hpr3998::2023-11-29 Using open source OCR to digitize my mom's book by Deltaray
00:30:47 Listen in ogg, spx, or mp3 format.
Greetings Deltaray, so pleased to meet you.
My own experience with Tesseract OCR software is via my volunteer work with MissionAssist.

MissionAssist is a UK-based charity.
I volunteer for them as a Digitisation Keyboarder, receiving PDF scans of Bibles and other books, from people groups all over the world, and typing the chapter text into a structured text file.
https://missionassist.org.uk/services/digitisation/bible-digitisation-project/

Tesseract is a wonderful tool that helps me with a lot of the process: it gives me a text file that I can then work on directly.

Since I run KDE, I use Spectacle to highlight the area of the PDF I want to convert into a PNG file for tesseract to read.
A lot of the scans we receive are not exactly straight, are often in columns, and have ink marks and bleed-through from the other side. So, not always a straightforward OCR process.
I save these files with chapter and verse references in the title.
Once I have a set of PNG files from my allocated chapter, I simply run tesseract per file to create the text file.
I then use cat to collect the text files into one file to work on.
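
For anyone who would rather script those last two steps, here is a minimal sketch in Python. It is only an illustration under assumptions, not my exact workflow: it assumes the tesseract command is installed and on the PATH, that the PNG files for one chapter sit in a single directory, and that their names sort into reading order (for example zero-padded verse numbers). The function names and file names are made up for the example.

```python
import subprocess
from pathlib import Path


def ocr_png(png_path: Path) -> Path:
    """Run the tesseract CLI on one PNG and return the path of the text file.

    `tesseract IMAGE OUTPUTBASE` writes its result to OUTPUTBASE.txt,
    so the output base is the PNG path without its .png suffix.
    """
    out_base = png_path.with_suffix("")
    subprocess.run(["tesseract", str(png_path), str(out_base)], check=True)
    return png_path.with_suffix(".txt")


def combine_texts(txt_paths, combined_path: Path) -> None:
    """Concatenate text files in the given order, like `cat a.txt b.txt > out.txt`."""
    with combined_path.open("w", encoding="utf-8") as out:
        for txt in txt_paths:
            out.write(Path(txt).read_text(encoding="utf-8"))


def ocr_chapter(png_dir: Path, combined_path: Path) -> None:
    """OCR every PNG in png_dir (sorted by name) and join the results."""
    # Assumption: file names sort into chapter and verse order.
    pngs = sorted(png_dir.glob("*.png"))
    txts = [ocr_png(p) for p in pngs]
    combine_texts(txts, combined_path)
```

Calling ocr_chapter(Path("chapter03"), Path("chapter03_combined.txt")) would then leave one text file per PNG plus the combined file, ready for manual correction. The directory and file names here are hypothetical.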

Your show was really more about using bash and especially the grep command to process your project.
I learned a lot from that alone! Thanks for the education.

Checked your HPR profile and was not surprised you are the guy behind @climagic.
I did follow you on Twitter, but left at the buyout.
So glad to know you are on Mastodon, and I followed that account today.

I do plan to record my own show about my use of tesseract as I volunteer with MissionAssist, but given my current workload and other reasons I am looking at sometime in the new year of 2025.

Copyright Information

Unless otherwise stated, our shows are released under a Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) license.

The HPR Website Design is released to the Public Domain.