Improve Extraction AI

Have you trained your first Extraction AI? Great, now we will support you in reaching 90 %+ accuracy.

Reaching a high level of accuracy involves a trade-off: you invest time to review your training data over and over again, and you gain accuracy by doing so. This tutorial not only shows how to improve accuracy but also gives you an indication of how much time you should plan to invest.

Prerequisites

At any time make sure to check the following:

  1. Each Label should have at least 10 Annotations; otherwise that Label will not become part of your AI. You can see the number of Annotations on the Label page.

  2. Make sure you have an equal number of Annotations per Label. If a Label does not occur in every document, an unequal distribution is understandable. A Label that appears less often will initially be predicted with lower confidence. The more documents you add to the training, the higher the confidence for this Label will become.

  3. You should have more Status: Training documents than Status: Test documents; at the very least you need two Training documents.

These checks will not improve the accuracy by themselves, but they create a solid base for improving the AI.
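The three checks above can be automated with a short script. The sketch below is purely illustrative: the annotation records and document statuses are hypothetical sample data, not the actual Konfuzio export format, so adapt the loading step to your own project export.

```python
from collections import Counter

# Hypothetical annotation records as (label, document_status) pairs.
# In practice you would read these from your project's data export.
annotations = [
    ("Gross Amount", "Training"), ("Gross Amount", "Training"),
    ("Date", "Training"), ("Date", "Test"),
]
documents = {"doc1": "Training", "doc2": "Training", "doc3": "Test"}

label_counts = Counter(label for label, _ in annotations)

# Check 1: every Label needs at least 10 Annotations.
too_few = [label for label, n in label_counts.items() if n < 10]

# Check 2: ratio between the most and least frequent Label (1.0 = balanced).
imbalance = max(label_counts.values()) / min(label_counts.values())

# Check 3: more Training than Test documents, and at least two Training documents.
n_train = sum(1 for status in documents.values() if status == "Training")
n_test = sum(1 for status in documents.values() if status == "Test")
ok = n_train > n_test and n_train >= 2
```

With the sample data above, both Labels land in `too_few` (only 2 Annotations each), the distribution is balanced, and the Training/Test split passes.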

Extraction AI setup

This step might feel surprising: reupload a document that was used by the AI to learn. This is the fastest way to check whether the AI understands what you are teaching it, i.e. how you label documents.

When you reupload a document from the training data, your expectation should be that every Annotation you created manually is recreated after the upload. If this is not the case, the structure of your AI is likely inconsistent.

On the left side you see the Training document; on the right side you see a newly uploaded copy of this document:

Improve NLP model by reviewing the training data

Make sure to see similar results as in the training data

When the model does not learn the structure you have labeled, you need to review your Categories, Label Sets and Labels; something seems to be wrong.

Background information

The Konfuzio AI does not work rule-based but result-oriented. It considers the training data as the desired result and sets up rules for itself in order to apply them to new documents and achieve a corresponding result. In order for it to recognize clear structures in this process, a clearly structured approach should also be taken during the manual labeling. Irregularities will cause the AI to search for rules and structures that do not exist, making it more difficult for it to make the right decisions.

The more uniform or homogeneous the documents are among each other, the more accurate the results are. Standardized or normed documents are optimal. However, this is usually not the case and is out of one’s control. In principle, this is not a problem for Konfuzio, but it means that the importance of the quality and quantity of the training data increases with the heterogeneity of the documents.

Change the mode of extraction

Try to change from the default “Word” to the “Character” setup, see Detection mode of Extraction AI.

This is not guaranteed, but some users report a performance increase of about 10 % after switching the detection mode.

Use automated Annotations

Let’s assume you started with a small number of Labels you want to extract, e.g. gross amount and date. After a while you find out you need to label “VAT ID”, too.

What we see in this scenario is that users forget to add the new Label to documents they have already labeled. The great thing is that once you have trained an AI, you can “rerun” it on a document. The model will then suggest additional Annotations you have not yet created. The model will never overwrite Annotations you have created or revised manually.

For example, for monetary amounts in receipts, you should either always label the currency (e.g. the euro symbol) or always omit it. It does not matter which way you choose; it is important to do this consistently across all documents and also within a document. Of course, this also applies to other units such as kg, m² etc. and other composite information.

How to rerun a model:

  1. Go to Documents and select the documents

  2. Select “Rerun extraction”

  3. Press “Go” and wait for the extraction to finish


If the model created new Annotations, you will see their number as “Feedback required”.


Enter the Smartview and filter for “Feedback required” to review the Annotations created by your AI.


Background

What if a value is printed on every page?

Let’s take the following example. All pages of a document type contain the date in the upper right corner. Does the date need to be marked on all pages? In a document with many pages, this can become quite time-consuming. Typically, this is still done in the first document, then in the second document the date is marked on the first 3-4 pages and in the third document only on the first page.

This is where the following problem occurs. The AI will look for a reason why the date on the 5th page of the first document was relevant, but the one on the second page of the third document was not. Since there is no meaningful reason here, the AI will be “confused”, in human terms, which has a negative effect on the results. If you rerun the model on a document, you can see where the model would expect Annotations. To prevent confusion: either always label the repeating information on all pages or always only on the first page.

How does the evaluation treat values that appear multiple times?

Often a Label can appear multiple times in a document, say two to three times. The AI then recognizes the information in, say, two of the three places. That is, the information was extracted correctly, but how is this taken into account in the evaluation?

With our current evaluation (see version 2022-03-15_09-14-17 in the Changelog), each individual annotation is being checked just once per document (so if a training document contains the same information repeated in multiple places, but the AI only finds a few, the assigned score would be maximum because the information is retrieved successfully at least once). We are working on a new evaluation (currently unreleased and work in progress), where every single instance of the annotated information is being checked. So indeed, the score would be affected negatively. This change in our evaluation procedure is intentional and aimed at measuring further improvements to the AI in future versions of Konfuzio, with the goal of being able to extract all instances and only those instances.
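The difference between the two evaluation modes can be illustrated with a small sketch. This is a simplification for intuition only, not Konfuzio’s actual evaluation code:

```python
def recall_once_per_document(expected, found):
    """Current behaviour: a value counts as found if it is
    retrieved at least once anywhere in the document."""
    unique = set(expected)
    hits = sum(1 for value in unique if value in found)
    return hits / len(unique)

def recall_per_instance(expected, found):
    """Planned behaviour: every single occurrence must be found."""
    remaining = list(found)
    hits = 0
    for value in expected:
        if value in remaining:
            remaining.remove(value)
            hits += 1
    return hits / len(expected)

# The date appears three times in the document; the AI finds it twice.
expected = ["2022-03-15", "2022-03-15", "2022-03-15"]
found = ["2022-03-15", "2022-03-15"]
```

Here `recall_once_per_document` scores 1.0 because the date was retrieved at least once, while `recall_per_instance` scores only 2/3 because one occurrence was missed.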

How do I deal with class imbalance?

In multi-class classification problems, the accuracy of the algorithm depends, among other things, on the balance of the different classes. If instances of one class occur very frequently compared to the other classes, they also enter more strongly into the error function that the algorithm tries to optimize. Accordingly, the algorithm may be incentivized to correctly predict the majority class, but pay less attention to the other classes.

There are two main scenarios.

For example, if some identification number appears at the top of every page of the document, and it’s always the same number, especially with similar format surrounding the text and layout of the identification number, this doesn’t produce a class imbalance in Konfuzio. Without going into too much detail, this is because one of the first steps in the Konfuzio algorithm is the detection (or tokenization) of an item, which happens independently, before classification, and which uses a Regex approach that doesn’t behave like the typical classifiers (see our blog post for the technical details).

An example where class imbalance can be a problem: if Label A has many more annotations than Label B, and both labels have a lot of variance in their training annotations (e.g. names of persons and companies which can have many different values, and for some reason there are many more person names in the documents compared to company names).
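The scenario above can be quantified with a quick sketch. The annotation counts are hypothetical, and the class-weight formula shown is a common general-ML mitigation, not necessarily what Konfuzio does internally:

```python
from collections import Counter

# Hypothetical annotation counts per Label in the training set.
annotations = ["Person Name"] * 90 + ["Company Name"] * 10

counts = Counter(annotations)

# Ratio between the most and least frequent Label; a large value means the
# error function is dominated by the majority Label.
imbalance_ratio = max(counts.values()) / min(counts.values())

# One common general-ML mitigation: inverse-frequency class weights, so the
# rare Label contributes more per example to the loss.
total = sum(counts.values())
weights = {label: total / (len(counts) * n) for label, n in counts.items()}
```

With 90 person names against 10 company names, the imbalance ratio is 9.0 and the rare Label receives a weight of 5.0 versus roughly 0.56 for the frequent one.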

Should I include punctuation?

For consistency, it is important that when reading individual words from texts, commas, periods, brackets and other punctuation marks are not included. You should always mark only the actual content that you want to read. Punctuation marks usually come from the context of the sentence structure, but are rather arbitrary based on the training data and thus not suitable to be analyzed for the purpose of predictions. Otherwise, the AI will look for a comma at the end of the word to be read in the future, even if it has nothing to do with the information sought.
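If you already have candidate text spans with stray punctuation attached, a small normalization step can help you label consistently. The sketch below is an illustration of the principle, not part of Konfuzio; the sample spans and the regex are assumptions you should adapt:

```python
import re

# Hypothetical candidate spans as they appear in the text; some carry
# surrounding sentence punctuation that should not be labeled.
spans = ["1.000,00 EUR,", "(Berlin)", "42 kg."]

def strip_punctuation(span):
    """Remove leading/trailing punctuation so only the actual content
    is labeled; punctuation inside the value (e.g. 1.000,00) stays."""
    return re.sub(r"^[(\[.,;:]+|[)\].,;:]+$", "", span)

cleaned = [strip_punctuation(s) for s in spans]
```

Applied to the sample spans, this yields "1.000,00 EUR", "Berlin" and "42 kg", which matches the rule of marking only the content you actually want to read.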

What if some document layouts are rare in my training set?

Should it be avoided to include a very rarely occurring document layout in the AI training because it might have a negative impact on the model performance?

We usually find that new layouts will have a somewhat negative impact on model performance, when they are initially in low numbers in the training set. Gradually adding more can yield better results. In some cases, it may be a better idea to create a new category for some new layouts. Various configurations have to be tried before the optimal one is found.

Sophisticated checks for Experts

Download the Extraction AI model Evaluation, see Get evaluation as CSV file.

1. Filter by False Positives (FP) with high Confidence

FP are Annotations that the model predicts but that are wrong. Most of the time there are two reasons for this. First, you forgot to label them: the model learned to predict them from some of the Training documents, but you forgot to label them in the other documents. Second, you might have mislabeled Annotations: in that case the model is highly confident that the Label you applied to an Annotation is wrong, which then creates a FP.

To detect missing or mislabeled Annotations filter for all FP in the result column and sort by the “confidence” value:

  1. Missing Annotations are the most likely cause of FP with high Confidence. If you see Annotations in this list, review those documents by clicking on the link in the “link” column.

  2. Mislabeled Annotations can also be a cause of FP. Filter for all FP where the “offset_string” is equal to the “pred_offset_string”.

If you find a high number of False Positives where the “offset_string” is not equal to the “pred_offset_string”, please try to change the AI model setup, see Change the mode of extraction.

Please contact us if you find a high number of Annotations where the “section_label” differs from the “pred_section_label”.
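The FP checks above can be scripted with pandas. The column names follow the ones referenced in this section (“result”, “confidence”, “offset_string”, “pred_offset_string”); treat the exact CSV layout as an assumption and load your own evaluation file instead of the hypothetical sample rows:

```python
import pandas as pd

# Hypothetical rows from the downloaded evaluation CSV.
df = pd.DataFrame({
    "result": ["FP", "TP", "FP", "FP"],
    "confidence": [0.95, 0.80, 0.40, 0.90],
    "offset_string": ["100,00", "Berlin", "May 3", "IBAN"],
    "pred_offset_string": ["100,00", "Berlin", "3 May", "IBAN"],
})

# Filter all FP and sort by confidence, highest first: the top rows are the
# most likely missing Annotations.
fp = df[df["result"] == "FP"].sort_values("confidence", ascending=False)

# Candidates for mislabeled Annotations: the extracted text matches exactly,
# only the assigned Label disagrees.
mislabeled = fp[fp["offset_string"] == fp["pred_offset_string"]]
```

In the sample data, the FP with confidence 0.95 and 0.90 surface first, and two of the three FP rows qualify as potential mislabelings.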

2. Filter by True Positives (TP) and sort by Confidence in ascending order

TP are Annotations that the model predicted correctly. If you rank them by Confidence, are there Annotations with a low Confidence? In that case the model was able to predict the Annotation correctly, but with low confidence. There are two main reasons why the confidence can be low:

  1. The number of examples in the Training documents is too low: You can review the number of Annotations per Label by visiting Home > Data > Labels. You will find a column “Training Annotations”. If the number of Annotations is unbalanced across Labels, try to add more documents to the training.

  2. Labels are too similar or even refer to the same information: When multiple users work in the same project, it can happen that they create two Labels which refer to the same information but are named differently. The AI will be confused about when to use which Label. Make sure each Label refers to a unique piece of information. If you see Labels which refer to the same information but are named differently, you need to delete the duplicated Label. Delete the Label with fewer Annotations, see Home > Data > Labels, and review all documents again.
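This check can also be scripted against the evaluation CSV. As before, the sample rows and the “label” column name are assumptions; adapt them to your actual export:

```python
import pandas as pd

# Hypothetical evaluation rows.
df = pd.DataFrame({
    "result": ["TP", "TP", "TP", "FP"],
    "label": ["Date", "Gross Amount", "Date", "Date"],
    "confidence": [0.95, 0.30, 0.85, 0.60],
})

# TP sorted by confidence in ascending order: the low-confidence correct
# predictions come first and deserve review.
tp = df[df["result"] == "TP"].sort_values("confidence")

# TP Annotations per Label, to spot an unbalanced training set.
per_label = tp["label"].value_counts()
```

Here the “Gross Amount” TP with confidence 0.30 surfaces at the top of the list, and the per-Label counts show twice as many “Date” examples.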

3. Filter by False Negatives (FN)

FN are Annotations that the AI could not predict. Please review your Annotations; most probably those Annotations are wrong. One example could be a “#” character which is labeled as Currency.


4. Filter by Label

If you suspect that some Label is using wrong training data, a quick way to verify it is to filter by Label and check the “offset_string” column. You can detect wrong Annotations by searching for outliers: for example, if you only expect numerical values for the Annotations of the Label “House Number” but you see a word in the “offset_string” column, that Annotation is likely wrong.
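The outlier search can be sketched with pandas. The sample rows and the digit-based heuristic are assumptions for illustration; choose a pattern that matches what your Label should contain:

```python
import pandas as pd

# Hypothetical evaluation rows already filtered to one Label.
df = pd.DataFrame({
    "label": ["House Number"] * 4,
    "offset_string": ["12", "7a", "Main Street", "104"],
})

# Flag offset_strings that contain no digit at all: for a Label like
# "House Number" these are likely wrong Annotations.
outliers = df[~df["offset_string"].str.contains(r"\d")]
```

In the sample data, only “Main Street” is flagged, while mixed values like “7a” pass because they still contain a digit.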