Figure 1: Examples of book edition notices where information can be extracted thanks to OCR.
Most of the time, the raw data that we need for our data science project is not organized in a neat, well-structured, and insightful table. Rather, this is sometimes stored as text in a scanned document. Words in the document must then be extracted one by one to form a text formatted data cell. This is the task performed by Optical Character Recognition (OCR).
As you read the words of this article, be they text or numbers, your eyes are able to process them by recognizing the light and dark patterns that make up characters (letters, numbers, punctuation marks, etc.). Your brain then decodes the different combinations of characters and patterns to grasp the meaning of the words. In that sense, your eyes and brain are the most sophisticated and refined OCR engine you can possibly imagine, and they work without you even noticing it.
Computers have similar capabilities, but they have to tackle a crucial limitation: the absence of eyes. If we want computers to see and read a physical text document, we need to input a graphic file generated either with an optical scanner or a digital camera. As far as computers are concerned, there is no difference between a document acquired with either of those options and a photograph of the Eiffel Tower: both are regarded as meaningless collections of coloured squares — also known as pixels — that constitute any computer graphic image. As such, the latter is just a picture of the text that we intend to read rather than the text itself.
This is where OCR comes in handy. This powerful technology can extract printed, typed, or handwritten data, be it invoices, business cards, legal texts, or printouts, and convert it into a searchable and editable digital format. Although OCR was for many years regarded as an expensive service that only a few large companies could afford, from the mid-2000s onwards its cost has gradually fallen, while its accuracy and capabilities have evolved to support today several hundred languages and character encodings, from UTF-8 to GB2312.
The benefit of being able to search and extract text from images can be priceless. For example, in the legal or accounting industry, it can represent a significant cost and time saver, as it enables the retrieval of portions of text or numbers in articles or financial statements in a matter of seconds. Comparing this process with the cost of hiring a group of people to read through thousands of documents just to find a single, critical piece of information gives an idea of how OCR can benefit businesses.
More recently, OCR technology has been undergoing a quiet revolution as providers of this service combine it with AI. As a result, data is not only captured and made searchable and editable: the AI system actually understands the content and can carry out specific tasks. For example, after OCRing a text, AI can provide its translation using neural machine translation with minimal human intervention. Another classic example comes from the auditing sector, where fraudulent invoices can be recognized after OCRing the content of a PDF document, using outlier detection techniques. This synergy combines the best of both worlds to streamline processes and increase productivity for businesses and clients.
In the use case described in this article, OCR is used to identify a book and then to retrieve the book’s metadata from the Google Books repository.
More specifically, we are going to have a look at:
- How OCR can be conducted in KNIME Analytics Platform.
- How we can integrate KNIME’s OCR processor and Google Books API for the use case of retrieving book metadata and cover.
OCR in KNIME Analytics Platform
OCRing an image containing text in KNIME is a very easy task. All it takes is to install the KNIME Image Processing — Tess4J Integration extension in your local KNIME Analytics Platform and to drag and drop the Tess4J node onto your workflow editor.
The Tess4J node integrates the Tesseract OCR library, one of the most widely used and accurate open-source OCR processors available. Tesseract was originally developed as a proprietary software by Hewlett-Packard Laboratories in the early 1990s and was later made open source in 2005. Google has since then adopted the project and sponsored its development.
The Tess4J node runs on Tesseract 3, which works by recognizing character patterns in a two-pass procedure.
- In the first pass, the engine attempts to recognize each individual character. It then passes the characters that were recognized with high confidence in the first pass to an adaptive classifier as training data. In this way, the adaptive classifier has the chance to learn how to recognize subsequent text more accurately.
- However, it may happen that the adaptive classifier learns useful information too late to make a meaningful contribution. To solve this issue and leverage the knowledge acquired by the adaptive classifier, the engine runs a second pass in which characters that were not recognized well enough the first time are recognized again.
Tesseract 3 handles any Unicode character (encoded in UTF-8) and can process text in various languages and writing layouts: left-to-right (e.g., English, Italian, Russian), right-to-left (e.g., Arabic, Hebrew, Urdu), and top-to-bottom (e.g., Japanese, Korean, Chinese).
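Outside of KNIME, the same Tesseract engine can also be driven from a script. The sketch below uses the pytesseract Python wrapper together with the Pillow imaging library (both are assumptions of this example, as is the file name; none of this is part of the KNIME workflow itself):

```python
# Minimal sketch: OCR an edition-notice image with Tesseract via pytesseract.
# Assumes the Tesseract binary, pytesseract, and Pillow are installed;
# the file name below is purely illustrative.

def extract_text(image_path: str, lang: str = "eng") -> str:
    """Return the text Tesseract recognizes in the given image file."""
    # Imports are kept local so the sketch can be read (and imported)
    # without Tesseract installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)

# Example (not executed here):
# text = extract_text("edition_notice.png", lang="eng")
```

The `lang` parameter plays the same role as the language choice in the Tess4J configuration dialog discussed below.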
Disclaimer. Mac users are currently unable to use the Tess4J node. KNIME developers are working to restore smooth functioning.
Use Case: Retrieving Book Metadata and Cover
Now that we have gained a basic understanding of how OCR works in KNIME Analytics Platform, let’s have a look at an interesting use case. Suppose we have collected images of the edition notices of several books, and we want to use this information to retrieve book metadata and covers. The retrieved data could then be used, for example, to build a customized digital library and train a book recommender system.
The workflow in Figure 2 covers all steps: from image reading, OCRing, text processing and ISBN reference extraction, to book metadata and cover retrieval and visualization. Let’s have a look at the different steps in detail.
Figure 2: This workflow performs a simple OCR task on book edition notices and retrieves book metadata and cover using Google Books API.
1 — Read Image Data
The first step is to import the book edition notice images into KNIME. The edition notice is the page in a book containing information about the current edition, such as a copyright notice, legal notices, publication information, printing history, and an ISBN code (Figure 1).
The metanode “Read image data” takes care of that in an easy and programmatic way (Figure 3). We identify the location where image files are stored with the List Files/Folders node and use the Image Reader (Table) node to gracefully import the images. In the Image Reader (Table) node, we only need to specify the “File Input Column”, that is to say the column with the paths to the files where our images are stored. All other configurations can be left as default.
The Image Reader (Table) node is part of the KNIME Image Processing extension and, like other nodes in this extension, offers an interactive view of the image and its metadata: simply right-click the node, select “View: Image Viewer”, and double-click any image in the table view.
Figure 3: Inside the metanode “Read image data”. The Image Reader (Table) node imports images into the workflow and allows us to interactively explore them in its view.
2 — OCR
After reading in the image files of the book edition notices, we can OCR them.
The configuration of the Tess4J node is very simple and requires only a few clicks (Figure 4). In the Settings tab, the node can correct rotated or skewed images if we select the “Deskew input images” box in the “Preprocessing” section of the configuration dialog. Doing so is usually advisable, as graphic files might not be properly aligned. Moreover, the Tess4J node automatically produces a binarized image under the hood.
Next, we select the “Tessdata Path”. By default, this is set to “Use Internal”, which then allows us to choose the language of the text that we want to process. In this configuration, English is the default language, but the Tess4J node supports other natural languages such as Danish, Italian, Spanish, Russian, Greek, Slovak, German, and French. It is worth mentioning that by selecting “Use External”, we can expand the capabilities of the Tess4J node to languages that are not internally supported: we can use our own externally trained language models by specifying the directory where they are stored. We choose “Use Internal”, since we prefer to rely on the internal models of Tess4J for our English documents.
In the “Recognition Configuration” section, we find the two most important drop-down list configurations, namely the “Page Segmentation Mode” and the “OCR Engine Mode”. The first defines how our page is segmented.
In Figure 4, we select “Full Auto Pageseg”, which ensures fully automatic page segmentation. Depending on the specific use case at hand, selecting another mode out of the 13 available (e.g., “Single Column” or “Sparse Text”) might be a more suitable option.
The second setting asks us to choose the OCR engine. Here, we select “Tesseract Only”, which ensures the fastest execution. Other options include “Cube Only” — an alternative recognition mode for Tesseract — which is slower but often produces better results; or “Tesseract And Cube”, which combines the best of both worlds. Picking one engine or another strongly depends on the quality of the image and complexity of the text that we wish to process.
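For readers who drive Tesseract directly from a script rather than through Tess4J, analogous settings exist as command-line flags: `--psm` for page segmentation and, in Tesseract 4 and later, `--oem` for the engine mode (the Cube engine was dropped in later versions). A small hypothetical helper makes the mapping explicit:

```python
# Hypothetical helper mapping a few of the Tess4J dialog settings to
# Tesseract's command-line flags (--psm / --oem in recent versions).

PAGE_SEG_MODES = {
    "Full Auto Pageseg": 3,  # fully automatic page segmentation
    "Single Column": 4,      # single column of text of variable sizes
    "Sparse Text": 11,       # find as much text as possible, no order
}

def build_config(page_seg: str = "Full Auto Pageseg", oem: int = 0) -> str:
    """Return a Tesseract config string, e.g. '--psm 3 --oem 0'."""
    return f"--psm {PAGE_SEG_MODES[page_seg]} --oem {oem}"
```

A string like `build_config("Sparse Text", oem=1)` could then be passed as the `config` argument of pytesseract, for instance.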
In addition to the basic settings, the Tess4J node offers an Advanced Config tab where we can define a set of control parameters. This tab makes the node extremely flexible and helps expert users customize and fine-tune the Tesseract OCR engine to their specific needs. Do not worry, though: for most cases, the basic configuration will take you a long way!
Figure 4: Configuration dialog of the Tess4J node.
Besides tweaking the configuration of the Tess4J node to the use case at hand, it is good practice to preprocess input images thoroughly, if needed. In particular, Tesseract works best when images are scaled up enough that the x-height of characters is at least 20 pixels, are correctly aligned, have a sufficiently high resolution, and have any dark borders removed, as these might otherwise be misinterpreted as characters. The KNIME Image Processing extension includes several nodes for image cleaning, manipulation, and transformation, and many example workflows can be found on the KNIME Hub.
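The scaling and binarization ideas can be illustrated with a toy sketch in plain Python on a grid of grayscale pixel values; a real workflow would use the KNIME Image Processing nodes or an imaging library instead:

```python
# Toy sketch of two common Tesseract preprocessing steps: computing the
# upscaling factor needed to bring the character x-height to at least
# 20 pixels, and binarizing a grayscale image with a fixed threshold.

import math

def scale_factor(x_height_px: int, target: int = 20) -> int:
    """Smallest integer factor bringing the x-height to >= target pixels."""
    return max(1, math.ceil(target / x_height_px))

def binarize(gray: list, threshold: int = 128) -> list:
    """Map each 0-255 grayscale pixel to black (0) or white (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]
```

For example, characters with an x-height of 8 pixels would need to be scaled up by a factor of 3 to reach the recommended 20 pixels.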
The output of the Tess4J node is a table containing the extracted text as String data type, and as such it can be searched and edited.
3 — Text Processing for ISBN Extraction
Once the images are OCRed, the text they contain can finally be accessed and useful information retrieved.
In particular, edition notices usually report the ISBN code assigned to the book. The ISBN is a unique commercial book identifier, 13 digits long (10 digits before 2007), and as such is assigned to each separate edition and variation of a publication. Extracting the ISBN code allows us to refer to each book unambiguously when we want to retrieve metainformation. To achieve that, we can rely on the nodes included in the KNIME Text Processing extension, some of which are used in the “ISBN extraction” metanode (Figure 5).
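As a side note, the last of the 13 digits is a check digit, so a candidate ISBN can be validated before querying any external service. A minimal sketch of the standard ISBN-13 check (weights alternating between 1 and 3):

```python
# Validate an ISBN-13 via its check digit. The first 12 digits are
# multiplied by alternating weights 1, 3, 1, 3, ...; the check digit
# is (10 - sum mod 10) mod 10.

def is_valid_isbn13(isbn: str) -> bool:
    """Return True if the string is a well-formed ISBN-13."""
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(isbn[:12]))
    return (10 - total % 10) % 10 == int(isbn[12])
```

Such a check could be added after the extraction step to catch OCR errors that happen to produce 13 digits but not a real ISBN.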
In the “Text cleaning” metanode, we start by transforming the OCRed text from String to Document data type. Next, we convert the text to lowercase; remove punctuation, blank spaces, and hyphens; and replace the letter “o” with “0” (zero) to correct misrecognized characters in the ISBN codes.
We extract the ISBN codes by isolating the 13 characters that follow the string “isbn”, and use the Rule Engine node to check that the extracted string contains no missing values and has the expected length of 13 characters. The same node appends a column that labels successful extractions with 1 and unsuccessful ones with 0.
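For readers outside KNIME, the cleaning and extraction logic of this metanode could be sketched roughly as follows in Python (function names are illustrative; the 1/0 flag mirrors the Rule Engine output described above):

```python
import re

def clean_text(raw: str) -> str:
    """Lowercase, strip punctuation/whitespace/hyphens, fix 'o' -> '0'."""
    text = raw.lower()
    text = re.sub(r"[^\w]|_", "", text)  # drop punctuation, spaces, hyphens
    return text.replace("o", "0")        # common OCR confusion with zero

def extract_isbn(cleaned: str):
    """Return (isbn, flag): the 13 digits after 'isbn', flag 1 on success."""
    match = re.search(r"isbn(\d{13})", cleaned)
    if match:
        return match.group(1), 1
    return None, 0
```

For instance, the OCRed fragment `"ISBN: 978-0-3o6-4o615-7"` is cleaned to `"isbn9780306406157"`, from which the 13-digit code is extracted with flag 1.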
Figure 5: Inside the “ISBN extraction” metanode.
4 — Metainformation Retrieval and Visualization
In the final step, we use the ISBN codes to retrieve book metainformation and covers from the Google Books API. The “Get book metadata and covers” metanode takes care of that (Figure 6). However, metadata retrieval is possible only if the ISBN codes were successfully extracted. To ensure a smooth handling of successful/unsuccessful ISBN extraction, we include several workflow control nodes. You can find an insightful overview in the Cheat Sheet: Control and Orchestration with KNIME Analytics Platform.
If the ISBN code is extracted successfully, we use the GET Request node to send a GET request to the Google Books API, a free RESTful web service powered by Google that returns metainformation such as book title, subtitle, authors, date of publication, description, page count, language, average rating, rating count, and cover. Furthermore, this web service does not require the creation of a developer account. The configuration of the GET Request node is very straightforward: we only need to select the “URL column”, which we construct in the String Manipulation node by joining the Google Books API URL with the ISBN code of each edition notice. All other configurations can be left as default.
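For reference, the lookup URL follows the pattern of the Google Books API v1 volumes endpoint. A minimal standard-library sketch (the network call is wrapped in a function and not executed here):

```python
import json
import urllib.request

API_BASE = "https://www.googleapis.com/books/v1/volumes"

def build_query_url(isbn: str) -> str:
    """Build the Google Books API lookup URL for a given ISBN."""
    return f"{API_BASE}?q=isbn:{isbn}"

def fetch_volume(isbn: str) -> dict:
    """Send the GET request and parse the JSON response (needs network)."""
    with urllib.request.urlopen(build_query_url(isbn)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The String Manipulation node in the workflow performs essentially the same string join as `build_query_url`.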
We then parse the JSON output of the GET Request node using the JSON Path node and join extracted metainformation with book covers before collecting the final results.
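The JSON Path step can be mimicked in plain Python with dictionary access; the field names below follow the `items[0].volumeInfo` structure of a Google Books API response, and the sample payload is abridged and invented for illustration:

```python
import json

def parse_volume(payload: dict) -> dict:
    """Pull a few metadata fields out of a Google Books API response."""
    info = payload.get("items", [{}])[0].get("volumeInfo", {})
    return {
        "title": info.get("title"),
        "authors": info.get("authors", []),
        "averageRating": info.get("averageRating"),
        "cover": info.get("imageLinks", {}).get("thumbnail"),
    }

# Abridged, invented sample response for illustration:
sample = json.loads("""
{"items": [{"volumeInfo": {
    "title": "Example Book",
    "authors": ["Jane Doe"],
    "averageRating": 4,
    "imageLinks": {"thumbnail": "http://example.com/cover.jpg"}
}}]}
""")
```

Missing fields simply come back as `None` (or an empty list for authors), which mirrors how missing values propagate through the KNIME table.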
Finally, we create the “Visualize book metadata and covers” component for a neat visualization of the retrieved book metainformation and covers.
In the component, we wrap the Interactive Range Slider Filter Widget node to enable dynamic book filtering based on the average rating (0 = terrible; 5 = fantastic) assigned by readers on Google Books, and the Tile View node to display the results.
The component then provides a view that includes a slider to select books based on the average rating and a table hosting the covers and descriptions of the selected books. For this article, we selected books with ratings between 3 and 5; the results are displayed in Figure 6.
Figure 6: Retrieved metainformation and book covers for the books with ratings higher than 3.
In this article, we have illustrated how OCR can be easily conducted in KNIME Analytics Platform. To this end, we have presented the Tess4J node and provided details on the functioning of the Tesseract OCR library on which this node is based.
Furthermore, we have shown a simple use case where OCR can be a powerful and useful resource. We have extracted information from book edition notices (in particular, the ISBN codes) to send a GET request to Google Books’ RESTful web service, which has allowed us to retrieve book metadata and covers.
With KNIME, OCRing images to extract critical pieces of information becomes as easy as ABC. Try it out yourself! What is your OCR use case?
The workflow presented in this article can be downloaded for free from the KNIME Hub.
Smith, R. (2007). “An Overview of the Tesseract OCR Engine”. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), pp. 629–633.
 Tesseract OCR project on GitHub — https://github.com/tesseract-ocr/tesseract
 Tesseract OCR documentation on GitHub — https://tesseract-ocr.github.io/tessdoc/
Roberto Cadili is a data scientist at KNIME, NLP enthusiast, and history lover. Editor for Low Code for Advanced Data Science.
Lada Rudnitckaia is a data scientist at KNIME.
As first published in Low Code for Advanced Data Science.
Original. Reposted with permission.