The Objective
The application has to let a user upload multiple PDFs at once. On each of those PDFs, an Optical Character Recognition (OCR) operation needs to run to extract photos and some information.
Restrictions
- The application is deployed on-premise, on a machine that is restricted from connecting to the internet while the software functions.
- The machine is limited to 4 CPU cores.
Tesseract and the catch
We decided to use Tesseract for the OCR. Our server application was written in Spring (Java), and it invoked the tesseract-ocr engine through a wrapper (Tess4J).
Our original plan was to let tesseract-ocr manage its own multithreading so that each PDF was OCRed as quickly as possible, and then move on to the next one in the queue of uploaded PDFs. But at the time this application was being built, tesseract-ocr was significantly slower in multithreaded mode than in single-threaded mode.
So we forced each spawned tesseract-ocr process to use only one thread by setting OMP_THREAD_LIMIT=1 in the environment. With 4 cores available, the next step was to run 4 of those processes at once to get through the PDFs faster.
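As a rough illustration of the environment setting above, here is a minimal sketch that prepares a tesseract command-line process with OMP_THREAD_LIMIT=1. It assumes the engine is launched as a separate process through the tesseract CLI on a page image; the class name, method name and file names are hypothetical and are not taken from our application. If the wrapper loads the native library inside the JVM instead, the variable would presumably have to be set in the environment of the Java process itself.

```java
import java.io.File;

final class SingleThreadedTesseract {

    // Prepares (but does not start) a tesseract CLI process that is limited
    // to a single OpenMP thread. Assumes "tesseract" is on the PATH.
    static ProcessBuilder buildOcrProcess(File pageImage, String outputBase) {
        // CLI shape: tesseract <image> <outputbase>
        ProcessBuilder pb = new ProcessBuilder(
                "tesseract", pageImage.getPath(), outputBase);
        // Only the child process sees this variable; the JVM's own
        // environment is left untouched.
        pb.environment().put("OMP_THREAD_LIMIT", "1");
        // Merge stderr into stdout so one stream carries all engine output.
        pb.redirectErrorStream(true);
        return pb;
    }

    public static void main(String[] args) {
        // Hypothetical file names, for illustration only.
        ProcessBuilder pb = buildOcrProcess(new File("page-1.png"), "page-1");
        System.out.println(pb.command());
        System.out.println("OMP_THREAD_LIMIT="
                + pb.environment().get("OMP_THREAD_LIMIT"));
        // Calling pb.start() would launch the engine if it is installed.
    }
}
```

Setting the variable per child process, rather than globally on the machine, keeps the limit scoped to the OCR workload.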
Quartz the Scheduler
Quartz allows us to create jobs and run them concurrently if needed. So, every time a PDF was uploaded successfully (a synchronous step, done at the user's request), we scheduled a job for it. This asynchronous job is what actually invokes tesseract-ocr. When it is done with a PDF, the job updates a record in our database so that the user can learn that the OCR has completed.
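The upload-then-schedule flow described above can be sketched without Quartz at all. The example below swaps in a plain java.util.concurrent thread pool as a stand-in for the scheduler and an in-memory map as a stand-in for the database record, so it is not our actual implementation; OcrJobQueue, its methods and the status values are all hypothetical names. The pool size of 4 in the usage example simply mirrors the 4 CPU cores of the machine.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class OcrJobQueue {
    enum Status { QUEUED, RUNNING, DONE, FAILED }

    // Stand-in for the database record that the user-facing side reads.
    private final Map<String, Status> records = new ConcurrentHashMap<>();
    // Stand-in for the scheduler: at most N tasks run at the same time.
    private final ExecutorService workers;

    OcrJobQueue(int maxConcurrentJobs) {
        this.workers = Executors.newFixedThreadPool(maxConcurrentJobs);
    }

    // Called right after a successful upload; returns without waiting for OCR.
    void schedule(String pdfId, Runnable ocrTask) {
        records.put(pdfId, Status.QUEUED);
        workers.submit(() -> {
            records.put(pdfId, Status.RUNNING);
            try {
                ocrTask.run();
                records.put(pdfId, Status.DONE);
            } catch (RuntimeException e) {
                records.put(pdfId, Status.FAILED);
            }
        });
    }

    Status statusOf(String pdfId) {
        return records.get(pdfId);
    }

    // Stops accepting new jobs and waits for queued ones to finish.
    // Returns false if the timeout elapsed or the wait was interrupted.
    boolean shutdownAndWait(long timeoutSeconds) {
        workers.shutdown();
        try {
            return workers.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        OcrJobQueue queue = new OcrJobQueue(4);
        for (int i = 1; i <= 6; i++) {
            // The real job body would invoke the OCR engine here.
            queue.schedule("pdf-" + i, () -> { });
        }
        queue.shutdownAndWait(10);
        System.out.println("pdf-1: " + queue.statusOf("pdf-1"));
    }
}
```

The point of the sketch is the separation of concerns: the upload path only records the PDF as queued, while the status changes that the user later sees come from the background job.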
We told Quartz to run at most 4 jobs concurrently. This combination of single-threaded Tesseract and multi-threaded Quartz was the sweet spot for our application.
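One common way to cap Quartz at 4 concurrent jobs is to size its worker thread pool, since each running job occupies one worker thread. The sketch below only builds the relevant properties with the standard library; the property keys are Quartz's own configuration keys, while the class name, method name and the scheduler instance name are hypothetical. Whether our application set this through a properties file or through Spring is not shown here.

```java
import java.util.Properties;

final class QuartzPoolConfig {

    // Builds Quartz settings that limit the scheduler to 4 worker threads.
    static Properties fourWorkerProperties() {
        Properties props = new Properties();
        // Hypothetical instance name, for illustration only.
        props.setProperty("org.quartz.scheduler.instanceName", "OcrScheduler");
        // SimpleThreadPool is the fixed-size pool that ships with Quartz.
        props.setProperty("org.quartz.threadPool.class",
                "org.quartz.simpl.SimpleThreadPool");
        // One worker thread per CPU core on the target machine.
        props.setProperty("org.quartz.threadPool.threadCount", "4");
        return props;
    }

    public static void main(String[] args) {
        // With Quartz on the classpath, these properties could be handed to
        // new StdSchedulerFactory(props), or kept in a quartz.properties file.
        fourWorkerProperties().forEach(
                (key, value) -> System.out.println(key + " = " + value));
    }
}
```

With one single-threaded OCR run per worker thread, the thread count effectively becomes the number of cores kept busy.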