This page explains the contents and usage of the FileLoader module, which loads and validates the files that enter the pipeline. The module is the first step of the pipeline and ensures that only allowed file types are passed on to the later steps. It also reads the content and metadata of the files that the pipeline can process.
The module consists of two files that together provide the functionality required to verify the files that enter the pipeline. One file contains a file type verification module, called the extension checker, which verifies that the file type can be processed by the pipeline. The other file contains the actual processing functionality, which opens the files and makes the preparations required for the text extraction to work properly.
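To make the idea of the extension checker concrete, here is a minimal sketch of an allow-list check on file extensions. It is not the module's actual code: the function name `is_processable` and the contents of `ALLOWED_EXTENSIONS` are assumptions made for illustration, and the real list of supported types may differ.

```python
from pathlib import Path
from typing import FrozenSet

# Hypothetical allow-list; the file types the real pipeline accepts may differ.
ALLOWED_EXTENSIONS: FrozenSet[str] = frozenset({".pdf", ".docx", ".txt"})


def is_processable(path: str, allowed: FrozenSet[str] = ALLOWED_EXTENSIONS) -> bool:
    """Return True if the file's extension is in the allow-list (case-insensitive)."""
    # Path.suffix yields the last extension including the dot, or "" if there is none.
    return Path(path).suffix.lower() in allowed
```

A check like this only looks at the file name, so a renamed file would still pass. Any stricter validation would have to happen when the file is actually opened.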
An example of how FileLoader can be used is shown below:
from file_loading.file_loader import FileLoader
# Create the loader and set its output folder
fl = FileLoader()
fl.output_folder = "/outdir"
# Validate and load a single file
fl.handle_files("/inputfile.pdf")
This module is intended to be used together with the folder watcher, as in the example below:
from folder_watcher.folder_watcher import FolderWatcher
from file_loading.file_loader import FileLoader
if __name__ == '__main__':
    file_loader = FileLoader()
    file_loader.output_folder = "/outdir"
    # Hand handle_files to the watcher, then start watching the folder
    folder_watcher = FolderWatcher("path/to/folder", file_loader.handle_files)
    folder_watcher.watch()
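The example above passes `file_loader.handle_files` to the watcher as a function. The sketch below illustrates that callback pattern with a simple polling helper. It is not the real FolderWatcher, which may well be event-based. The name `scan_once` and the `seen` set are hypothetical, and the sketch assumes the callback receives one file path per call.

```python
import os
from typing import Callable, Set


def scan_once(folder: str, callback: Callable[[str], None], seen: Set[str]) -> None:
    """Invoke callback once for every file in folder that has not been seen before."""
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        # Skip subdirectories and files that were already handed to the callback.
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            callback(path)
```

Calling `scan_once(folder, file_loader.handle_files, seen)` in a loop with a short sleep would approximate a watcher under these assumptions. Keeping the `seen` set outside the function is what prevents a file from being processed twice.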