Background

Aim

In brief. Organisations identify or detect words/strings of interest within documents or data blobs for a variety of purposes, e.g., knowledge graph development, text redaction, and criminal investigations. Organisations may fall back on manual detection if off-the-shelf solutions cannot address their needs. For example, most redaction software detects and redacts words/strings from a narrow set of categories, and it is rarely possible to adapt such software to bespoke categories.

A viable strategy is to create custom models via named entity recognition machine learning algorithms, also known as token classification algorithms. These algorithms can be trained per domain or problem, leading to compact, domain-specific word/string detection models.
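
To make the idea of token classification concrete, here is a minimal sketch of what such a model's output could look like and how it might be turned into detected strings. It assumes the widely used BIO tagging convention (`B-` begins an entity, `I-` continues it, `O` is outside any entity); the function name `group_entities`, the example sentence, and the `PERSON`/`ORG` categories are hypothetical illustrations, not part of this project.

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge per-token BIO labels into (text, category) pairs.

    Assumption: labels follow the BIO convention. A stray "I-" label that
    does not continue the current entity is treated like "O".
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_category: Optional[str] = None

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_category))
            current_tokens = [token]
            current_category = label[2:]
        elif label.startswith("I-") and current_category == label[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # Outside any entity (or an inconsistent "I-" label).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_category))
            current_tokens, current_category = [], None

    # Close an entity that runs to the end of the sequence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_category))
    return spans


# Hypothetical model output: one label per token.
tokens = ["Contact", "Jane", "Doe", "at", "Acme", "Ltd", "."]
labels = ["O", "B-PERSON", "I-PERSON", "O", "B-ORG", "I-ORG", "O"]
detected = group_entities(tokens, labels)
print(detected)
```

A bespoke problem would swap in its own categories; the grouping step stays the same because it depends only on the tagging convention, not on the label set.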

This project publishes an adaptable token classification modelling set-up, i.e., artificial intelligence/machine learning engineers can fork the repositories and adapt them to (a) their token classification problem, (b) their development environment, and (c) the range of language models that they would like to try.