This page explains the contents and usage of the Spellchecking module, which is used to verify extracted text. The module works closely with the verification module so that as many words as possible are shipped to the next steps in the Knox pipeline.
The purpose of the module is to check whether the text extracted in the extraction step consists of valid English words. These valid words are shipped to the verification module, which extracts and sanitizes the invalid words reported by this module.
The module contains two classes that together contain all the functionality required to verify words.
The spellchecker works by instantiating the Spellchecker class with an optional wordlist. The class then exposes two methods, insert and query, which insert words into the tree and look words up. The spellchecker can only verify words that have been added to its tree. Supplying a wordlist is optional but recommended, because instantiating the spellchecker adds every word in the wordlist to the tree.
An example of how the spellchecker can be used is shown below:
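# With a wordlist (recommended): every word in the wordlist is added to the tree
sc = Spellchecker('path/to/wordlist')
# Or without a wordlist: words must be inserted before they can be verified
sc = Spellchecker()
# Inserting a word
sc.insert('word')
# Querying a word
sc.query('word')
# returns [(word, 0)]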
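
To make the insert and query behaviour more concrete, the following is a minimal sketch of how such a tree-backed spellchecker could be built. It is not the module's actual implementation. The page does not say which tree structure is used or what the second element of the returned tuple means, so the sketch assumes a BK-tree keyed on Levenshtein edit distance and treats the number as that distance. The names SketchSpellchecker, _Node, levenshtein and the max_distance parameter are hypothetical, as is the assumption that the wordlist file holds one word per line.

```python
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class _Node:
    """One word in the tree; children are keyed by distance to this word."""

    def __init__(self, word: str) -> None:
        self.word = word
        self.children: Dict[int, "_Node"] = {}


class SketchSpellchecker:
    """Hypothetical BK-tree spellchecker, assumed for illustration only."""

    def __init__(self, wordlist_path: Optional[str] = None) -> None:
        self.root: Optional[_Node] = None
        if wordlist_path is not None:
            # Assumption: the wordlist is a text file with one word per line.
            with open(wordlist_path, encoding="utf-8") as handle:
                for line in handle:
                    word = line.strip()
                    if word:
                        self.insert(word)

    def insert(self, word: str) -> None:
        if self.root is None:
            self.root = _Node(word)
            return
        node = self.root
        while True:
            distance = levenshtein(word, node.word)
            if distance == 0:
                return  # word is already in the tree
            child = node.children.get(distance)
            if child is None:
                node.children[distance] = _Node(word)
                return
            node = child

    def query(self, word: str, max_distance: int = 0) -> List[Tuple[str, int]]:
        """Return (word, distance) pairs within max_distance of the query."""
        matches: List[Tuple[str, int]] = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node = stack.pop()
            distance = levenshtein(word, node.word)
            if distance <= max_distance:
                matches.append((node.word, distance))
            # Triangle inequality: only these edges can lead to matches.
            for edge, child in node.children.items():
                if distance - max_distance <= edge <= distance + max_distance:
                    stack.append(child)
        return sorted(matches, key=lambda match: (match[1], match[0]))


sc = SketchSpellchecker()
for known_word in ("word", "ward", "world"):
    sc.insert(known_word)
print(sc.query("word"))                  # exact match only
print(sc.query("wrd", max_distance=1))   # near matches for a misspelling
```

In this sketch an empty result from query means the word is not known to the tree, which is the point at which the word would be treated as invalid. A BK-tree is only one plausible choice; a trie would serve equally well if only exact lookups are needed.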