Text extraction is the first step in the Knox pipeline. Its primary purpose is to extract and validate text from PDFs and other document types so that later steps can process it. The context in which this step runs can be found here. As part of validation, the extracted data is post-processed so that as much usable output as possible reaches the later groups.
The output generated for the next step in the pipeline (entity extraction) is .txt files containing sentences.
The structure of the pipeline step can be seen below:
The pipeline is largely self-explanatory. The green box around each module denotes a Docker container. Each container and its contents are therefore self-contained and can be replaced as needed. The file watcher module is part of every container: it acts as the container's entry point, picking up content produced by the previous container.
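The file watcher idea can be illustrated with a minimal sketch of a single polling pass. This is not the project's actual watcher: the function names `watch_once` and `handle_file` and the directory layout are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical sketch of one polling pass of a file watcher.
# All names here (watch_once, handle_file, the directories) are assumptions.

# Placeholder for the container's real work, e.g. text extraction.
handle_file() {
    echo "processing $(basename "$1")"
}

# watch_once INPUT_DIR OUTPUT_DIR
# Hands every regular file in INPUT_DIR to handle_file, then moves it to
# OUTPUT_DIR, where the next container's watcher could pick it up.
watch_once() {
    input_dir=$1
    output_dir=$2
    mkdir -p "$output_dir"
    for file in "$input_dir"/*; do
        # Skip when the glob matched nothing or the entry is not a regular file.
        [ -f "$file" ] || continue
        handle_file "$file"
        mv "$file" "$output_dir/"
    done
}
```

A long-running watcher could call `watch_once` in a loop with a short `sleep` between passes. Because each container only reads from one directory and writes to another, a container can be swapped out without touching its neighbours.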
To use the project locally, follow the steps below and make sure that you fulfill the following requirements:
To build the project, follow these steps:
0. Only for Windows: Enter WSL and go to the WSL home directory (command: cd, which should place you in ~)
git clone https://github.com/Knox-AAU/Preprocessessing_Text-extraction.git
cd Preprocessessing_Text-extraction
source run setup
To start the project locally run:
sh run dev up
To stop the project from running:
sh run dev down
To run linting on the project:
sh run lint
To run tests for the project:
sh run test
To tag for production build
A new production version can be tagged from the terminal:
git tag {version} {branchName} # e.g. git tag 1.2 main
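A tag created with `git tag` exists only in the local repository until it is pushed. The sketch below wraps the tagging command in a small helper that refuses to reuse an existing version. The helper name `tag_release` is hypothetical, and it is an assumption that the deployment workflow reacts to tags on the remote.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): tag a branch once.

# tag_release VERSION BRANCH
# Creates the tag VERSION on BRANCH, failing if the tag already exists.
tag_release() {
    version=$1
    branch=$2
    # rev-parse succeeds only when the tag is already present.
    if git rev-parse -q --verify "refs/tags/$version" >/dev/null; then
        echo "tag $version already exists" >&2
        return 1
    fi
    git tag "$version" "$branch"
}

# Example usage (commented out so sourcing this file changes nothing):
# tag_release 1.2 main && git push origin 1.2
```

Pushing the single tag by name, rather than every local tag, keeps experimental local tags off the remote.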
How to deploy a new version
After tagging the next production package, it can be pulled on the server:
Connect to AAU VPN
SSH into preproc01
ssh <STUDENT_MAIL>@knox-preproc01.srv.aau.dk
Clone the project with Git and run
sh run prod up
or
sudo docker compose -f docker-compose-prod.yml pull && sudo docker compose -f docker-compose-prod.yml up
To contribute to this project, you need to fulfill the following requirements:
Version control
To begin contributing, branch out directly from main. Remember to pull the newest version before branching. When you are done with the branch, create a pull request and get it approved by another person working on the project.
To create a new branch directly from the terminal, you can use the following commands:
git pull
git checkout -b {branchName} # e.g. git checkout -b INITIALS/new-branch-name
git add {files}
git commit -m "{comment about changes}"
git push origin {branchName} # e.g. git push origin INITIALS/new-branch-name
Pull requests
At least one person is required to review the changes
When a pull request is created, the workflow starts running: it checks the code structure using a linter and checks whether the unit tests and other tests pass
If the workflow fails, merging is blocked until the problem is fixed
Workflow
The workflow consists of three steps, where the last step is divided into three parts:
Linter - Ensures well-structured and readable code
Unittest - Built-in testing module, ensuring the integrity and validity of the modules
Deployment - Creates the production packages that are pulled on the server. Deployment creates three packages, one for each step in text extraction. To run deployment, the production branch (main) needs to be tagged before the workflow constructs the packages.
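The ordering of the checks can be approximated locally before opening a pull request. The sketch below is not the project's workflow definition: `run_steps` is a hypothetical helper that runs commands in order and stops at the first failure, mirroring how a failed step blocks the later ones.

```shell
#!/bin/sh
# Hypothetical local approximation of the check ordering (not the real workflow).

# run_steps COMMAND...
# Runs each command string in order; stops and returns 1 at the first failure.
run_steps() {
    for step in "$@"; do
        echo "running: $step"
        if ! sh -c "$step"; then
            echo "failed: $step" >&2
            return 1
        fi
    done
    echo "all steps passed"
}

# Example usage with the commands documented above (commented out):
# run_steps "sh run lint" "sh run test"
```

Running the linter first gives fast feedback, since the tests are only started when the code already passes the style checks.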