Add Tesseract Optical Character Recognition to Fusion Connectors

Table of Contents

Tesseract Optical Character Recognition (OCR) solution
Prerequisites
Add Tesseract OCR solution

Tesseract Optical Character Recognition (OCR) solution

Tesseract OCR is an open-source solution that you can add to work with Fusion connectors in releases 5.2 and later. The example in this topic uses the classic REST service, which interfaces with V1 connectors and handles functions such as file upload and web crawl.

To set up OCR for V2 connectors, you must repeat this process for each individual Docker image related to the connector.

Prerequisites

The following must be established before adding the Tesseract OCR solution:

- A local environment for installing and managing Fusion 5 that includes Google Cloud Tools and other required components.
- The Docker daemon running on macOS and a Docker account for hub.docker.com.
- Fusion release 5.2 or later installed and deployed.

Add Tesseract OCR solution

Create a Dockerfile with the following contents:
```
FROM lucidworks/classic-rest-service:5.2.1
USER root
RUN apt-get update && apt-get install -y tesseract-ocr
USER 8764
```
The Dockerfile:
- Directs Docker to use an existing image, specified in the <repo>/<image>:<tag> format, as the basis for the new image.
- Switches to the root user to perform the Tesseract install.
- Switches back to user 8764 because the classic REST service pod in Kubernetes is not permitted to run as root.

Build the new Docker image in the same directory as your Dockerfile. Enter values that reflect your image and directory. For example:

docker build -t jdoe/classic-rest-service-ocr:1.0.0 .
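
The build step can be sketched as a small script. This is a minimal example, not the definitive procedure: the `REPO`, `NAME`, and `TAG` values are assumptions that mirror the fusion_values.yaml example in this topic, the `run` helper and `RUN_DOCKER` variable are hypothetical, and the push step assumes the image is published to Docker Hub so that the cluster can pull it.

```shell
# Hypothetical values that mirror the fusion_values.yaml example; replace with your own.
REPO="jdoe"
NAME="classic-rest-service-ocr"
TAG="1.0.0"
IMAGE="${REPO}/${NAME}:${TAG}"

# Print each command by default; set RUN_DOCKER=1 to actually execute it.
run() {
  if [ "${RUN_DOCKER:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# The trailing "." is the build context: the directory that contains the Dockerfile.
run docker build -t "${IMAGE}" .
# Assumption: the image is published to Docker Hub so the cluster can pull it.
run docker push "${IMAGE}"
```

Keeping the repository, name, and tag in separate variables makes it easier to check that they match the `repository`, `name`, and `tag` values entered in fusion_values.yaml.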

In Fusion 5.2 and later, the Fusion dependency check must be included in any custom operation. Add the dependency check image to the location where the custom connector image is stored (at the same level and in the same repository). For example:
```
docker pull lucidworks/check-fusion-dependency:v1.2.0
docker tag lucidworks/check-fusion-dependency:v1.2.0 jdoe/check-fusion-dependency:v1.2.0
docker push jdoe/check-fusion-dependency:v1.2.0
```
Go to Docker Hub to view image details such as the name, tag, digest, and operating system.

Open the fusion_values.yaml file and replace the existing connector image with the custom version. For example:

classic-rest-service:
    image:
        repository: jdoe
        name: classic-rest-service-ocr
        tag: 1.0.0
    nodeSelector:
        cloud.google.com/gke-nodepool: default-pool

Execute the standard process to upgrade (rebuild) the Fusion cluster.

Open a shell in the classic REST service pod and run tesseract -v to verify that Tesseract is installed and working correctly. The result is similar to the following:

<<K9s-Shell>> Pod: jdoe-poc/jdoe-classic-rest-service-0 | Container: classic-rest-service
fusion@jdoe-poc-classic-rest-service-0:/$ tesseract -v
tesseract 4.0.0
 leptonica-1.76.0
   libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found SSE
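
The same check can be scripted. The following is a minimal sketch, assuming a POSIX shell is available in the container; `check_tesseract` is a hypothetical helper name, and `<pod-name>` and `<namespace>` in the comment are placeholders for your deployment.

```shell
# Run inside the container, for example after opening a shell with
#   kubectl exec -it <pod-name> -n <namespace> -c classic-rest-service -- /bin/sh
check_tesseract() {
  if command -v tesseract >/dev/null 2>&1; then
    # Merge stderr into stdout in case the version banner is written to stderr.
    tesseract -v 2>&1 | head -n 1
  else
    echo "tesseract not found on PATH" >&2
    return 1
  fi
}

check_tesseract || echo "OCR is not available in this container"
```

If the helper reports that Tesseract is missing, the pod is probably still running the original image rather than the custom one.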

Open each Fusion parser used for a datasource that performs OCR and select the following items:
- Apache Tika
- Include images

Scan one of the following files to test the OCR function:
- A .pdf file, which may contain an embedded .tiff image
- A .jpeg file
- A .gif file

Verify that the parser correctly extracts the information, which includes the body_t field.
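
If the body_t field is empty, it can help to confirm that Tesseract itself can read a test image, independently of the parser. The following is a minimal sketch to run inside the container; `ocr_smoke_test` is a hypothetical helper, and the sample path is an assumption. The tesseract command reads image files directly but not PDF files, so use a .jpeg, .gif, or .tiff file for this check.

```shell
# ocr_smoke_test is a hypothetical helper; pass it the path to an image file.
ocr_smoke_test() {
  file="$1"
  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 2
  fi
  if ! command -v tesseract >/dev/null 2>&1; then
    echo "tesseract not found on PATH" >&2
    return 3
  fi
  # Using "stdout" as the output base makes tesseract print the recognized text.
  tesseract "$file" stdout
}

# Example (assumed path): ocr_smoke_test /tmp/sample.jpeg
```

Recognized text printed here but missing from body_t suggests checking the parser settings rather than the Tesseract installation.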