• Tesseract OCR
    Programming

    Adding OCR to a multipage PDF

    PDF of images

    Nothing is more frustrating than a PDF book or course consisting of scanned images. It is impossible to search the text or to copy parts of it. Luckily, there is something called OCR (Optical Character Recognition), a technology that extracts text from images. The most popular application for this on Linux is tesseract. But a look at its man page immediately reveals one big problem: it does not accept PDF files as input!

    Get rid of the PDF format

    So the first thing we have to do is convert the PDF into TIFF images. This is very simply done…
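
As a rough sketch of the workflow described above (not necessarily the exact commands this post goes on to use), the two steps could be scripted as follows. It assumes Ghostscript (`gs`) and `tesseract` are installed and on the `PATH`. The helper names, the `tiffgray` device, the 300 dpi resolution and the `eng` language are illustrative choices, not taken from the post.

```python
import shutil
import subprocess


def pdf_to_tiff_cmd(pdf_path, tiff_path, dpi=300):
    """Build a Ghostscript command that renders a PDF to one multipage TIFF.

    Assumption: the grayscale 'tiffgray' device at 300 dpi is good enough
    for OCR; other Ghostscript TIFF devices may suit other scans better.
    """
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",
        f"-r{dpi}",
        f"-sOutputFile={tiff_path}",
        str(pdf_path),
    ]


def ocr_cmd(tiff_path, output_base, lang="eng"):
    """Build a tesseract command that writes a searchable <output_base>.pdf."""
    return ["tesseract", str(tiff_path), str(output_base), "-l", lang, "pdf"]


def add_ocr(pdf_path, output_base, lang="eng"):
    """Run both steps: PDF -> multipage TIFF -> searchable PDF."""
    for tool in ("gs", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    tiff_path = f"{output_base}.tif"  # intermediate file, kept for inspection
    subprocess.run(pdf_to_tiff_cmd(pdf_path, tiff_path), check=True)
    subprocess.run(ocr_cmd(tiff_path, output_base, lang), check=True)
    return f"{output_base}.pdf"


# Example (hypothetical file names):
#   add_ocr("scanned_book.pdf", "scanned_book_ocr")
```

The command builders are kept separate from `add_ocr` so the exact arguments can be inspected or adjusted before anything is executed.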