Tesseract OCR

Adding OCR to a multipage PDF

PDF of images

Few things are more frustrating than a PDF book or course that consists of scanned images. You cannot search the text or copy parts of it. Luckily there is something called OCR (Optical Character Recognition), a technology that extracts text from images. One of the most popular applications for this on Linux is tesseract. But a look at its man page immediately reveals one big problem: it does not accept PDF files as input!

Get rid of the PDF format

So the first thing we have to do is drop the PDF format and switch to TIFF. This is easily done with the gs command (Ghostscript). But before creating a bunch of new files, let’s make a temporary working directory.

mkdir ./tmptiff
/usr/bin/gs -sDEVICE=tiff24nc -r200 -o ./tmptiff/outfile-%02d.tiff infile.pdf
cd ./tmptiff
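One detail worth a small sketch: %02d only pads to two digits, so a document with 100 or more pages would again produce names that sort out of order. The helpers below, pad_width and gs_pattern, are hypothetical names of my own and not part of Ghostscript. They assume you already know the page count, for example from your PDF viewer.

```shell
#!/bin/sh
# Hypothetical helper: how many digits are needed to number `pages` pages.
# Assumes the page count is a plain decimal number without leading zeros.
pad_width() {
    pages=$1
    width=${#pages}
    # Keep at least two digits, matching the %02d used in the command above.
    [ "$width" -lt 2 ] && width=2
    printf '%s\n' "$width"
}

# Hypothetical helper: build the output file pattern for gs,
# for example outfile-%03d.tiff for a 250-page document.
gs_pattern() {
    printf 'outfile-%%0%sd.tiff\n' "$(pad_width "$1")"
}
```

For a 250-page book the gs call would then become something like gs -sDEVICE=tiff24nc -r200 -o "./tmptiff/$(gs_pattern 250)" infile.pdf.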

Add OCR and make one searchable PDF file

If you look in the tmptiff folder now, you will see one TIFF file for each page of the PDF. We can now give each file an OCR layer and join them back together into one PDF file. This part is possible with a script I found online. Too bad the person who made the script forgot one thing: leading zeros to keep the pages in the right order! Luckily I had run into this problem before, so I decided to add my earlier function to the script…
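The script itself is not reproduced here, so what follows is only a rough sketch of the idea and not the script referred to above. list_pages and ocr_pages are hypothetical names. The sketch assumes that tesseract accepts a plain text file with one image path per line as its input, and that the page files carry zero-padded numbers so that a byte-wise sort matches the page order.

```shell
#!/bin/sh
# Print the TIFF files in a directory in page order, one path per line.
# Relies on zero-padded page numbers (outfile-01.tiff, outfile-02.tiff, ...).
list_pages() {
    dir=$1
    for f in "$dir"/*.tiff; do
        # Skip the unexpanded pattern when the directory has no TIFF files.
        [ -e "$f" ] && printf '%s\n' "$f"
    done | LC_ALL=C sort
}

# OCR all pages and write one searchable PDF named <outbase>.pdf.
# Assumption: tesseract treats a text file of image paths as a page list.
ocr_pages() {
    dir=$1
    outbase=$2
    list=$(mktemp) || return 1
    list_pages "$dir" > "$list"
    tesseract "$list" "$outbase" -l eng pdf
    status=$?
    rm -f "$list"
    return $status
}
```

From inside the tmptiff folder, ocr_pages . ../outwithocr would be one way to call it. Without the leading zeros, outfile-10.tiff would sort before outfile-2.tiff and the pages would end up in the wrong order.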

Or we could do it the easy way!

In hindsight, that all seemed to be too much work for something very simple! There is such a thing as a multipage TIFF format, so the following does everything we need:

gs -sDEVICE=tiff24nc -r200 -o out.tif -dUseBigTIFF infile.pdf
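# The output argument is a base name; tesseract appends .pdf itself.
# Tesseract 4 and later spell the option --psm (the older -psm comes from 3.x).
tesseract out.tif outwithocr -l eng --psm 1 pdf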
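If you do this often, the two commands above can be wrapped in a small function. This is a minimal sketch under a few assumptions: pdf_ocr and output_base are names I made up, the intermediate TIFF goes into a throwaway directory created with mktemp, and the option spelling --psm assumes Tesseract 4 or later.

```shell
#!/bin/sh
# Hypothetical helper: derive the output base name, "dir/book.pdf" -> "book-ocr".
output_base() {
    name=${1##*/}
    printf '%s-ocr\n' "${name%.pdf}"
}

# Hypothetical wrapper: convert a scanned PDF into a searchable PDF.
# Usage: pdf_ocr infile.pdf [language]
pdf_ocr() {
    infile=$1
    lang=${2:-eng}
    [ -f "$infile" ] || { echo "no such file: $infile" >&2; return 1; }
    for tool in gs tesseract; do
        command -v "$tool" >/dev/null 2>&1 ||
            { echo "missing tool: $tool" >&2; return 1; }
    done
    workdir=$(mktemp -d) || return 1
    # Step 1: render all pages into one multipage TIFF.
    # Step 2: OCR it; tesseract appends .pdf to the output base name.
    gs -q -sDEVICE=tiff24nc -r200 -dUseBigTIFF -o "$workdir/pages.tif" "$infile" &&
        tesseract "$workdir/pages.tif" "$(output_base "$infile")" -l "$lang" --psm 1 pdf
    status=$?
    rm -rf "$workdir"
    return $status
}
```

Calling pdf_ocr infile.pdf should then leave infile-ocr.pdf in the current directory, and the large intermediate TIFF is cleaned up afterwards.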

Dependencies

tesseract-ocr
ghostscript
