This page explains the contents and usage of the text extraction module, which extracts text from an image that was handled or created by the file loading module (documentation). The text is extracted using Optical Character Recognition (OCR). Input files must be images; other file types cannot be read by this module.
The module contains two files with a total of three classes that together provide all the functionality required to extract text from a file and apply post-processing to improve the output. The TextExtractor file (source) contains the core functionality, which extracts text from an image using the pytesseract module. pytesseract is a Python wrapper for the Tesseract OCR engine, which is written in C++.
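To make the role of pytesseract concrete, here is a minimal sketch of an extractor built around `pytesseract.image_to_string`. It is not the module's actual code: the class name `SketchExtractor`, the injectable `ocr` callable, and the behaviour of writing a `.txt` file named after the image into `out_dir` are all assumptions made for illustration. Running the default OCR path requires pytesseract, Pillow, and a Tesseract installation.

```python
from pathlib import Path
from typing import Callable, Optional


def ocr_with_tesseract(image_path: str) -> str:
    """Run Tesseract on one image file and return the recognized text."""
    # Imported lazily so the rest of the sketch can be used without OCR installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


class SketchExtractor:
    """Hypothetical stand-in for TextExtractor; not the module's real code."""

    def __init__(self, ocr: Callable[[str], str] = ocr_with_tesseract):
        self.ocr = ocr  # any function mapping an image path to text
        self.out_dir: Optional[str] = None

    def read(self, image_path: str) -> str:
        """Extract text from the image and, if out_dir is set, save it there."""
        text = self.ocr(image_path)
        if self.out_dir is not None:
            # Assumption: output is stored as <image name>.txt inside out_dir.
            out_path = Path(self.out_dir) / (Path(image_path).stem + ".txt")
            out_path.parent.mkdir(parents=True, exist_ok=True)
            out_path.write_text(text, encoding="utf-8")
        return text
```

Passing the OCR function in as a parameter is a design choice of this sketch only; it allows the file handling to be exercised with a fake OCR function on machines where Tesseract is not installed.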
The second file of the module is the post-processing file (source). It contains two functions: one for cleaning a word of unwanted characters, and one for cleaning sentences.
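The two cleaning functions could look roughly like the sketch below. The names `clean_word` and `clean_sentence` and the rule for what counts as an "unwanted" character (here: anything other than ASCII letters, digits, apostrophes, and hyphens) are assumptions; the real post-processing file may use different names and rules.

```python
import re

# Assumption: "unwanted" means anything except ASCII letters, digits,
# apostrophes and hyphens. The real module may define this differently.
_UNWANTED = re.compile(r"[^A-Za-z0-9'\-]")


def clean_word(word: str) -> str:
    """Remove unwanted characters from a single word."""
    return _UNWANTED.sub("", word)


def clean_sentence(sentence: str) -> str:
    """Clean every word in a sentence and drop words that end up empty."""
    cleaned = (clean_word(w) for w in sentence.split())
    return " ".join(w for w in cleaned if w)
```

Note that this particular rule also strips sentence punctuation such as full stops and commas, and it normalizes runs of whitespace to single spaces. Whether that is desirable depends on how the extracted text is used afterwards.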
An example of how the text extractor can be used is shown below:
from text_extraction.text_extractor import TextExtractor
te = TextExtractor()
te.out_dir = "/outdir"
te.read("/inputfile.jpg")
This module is intended to be used together with the folder watcher, as in the example below:
from folder_watcher.folder_watcher import FolderWatcher
from text_extraction.text_extractor import TextExtractor
te = TextExtractor()
te.out_dir = "/outdir"
fw = FolderWatcher("path/to/folder", te.read)
fw.watch()