Improve Extraction AI

You have trained your first Extraction AI? Great, now we will help you reach 90 %+ accuracy.

Reaching high levels of accuracy involves a trade-off: you invest time to review your training data again and again, and you gain accuracy by doing so. This tutorial not only shows how to improve accuracy but also gives you an indication of how much time you should plan to invest.

Prerequisites

At any time make sure to check the following:

  1. Per Label you should have 10 Annotations; otherwise those Labels will not become part of your AI. You can see the number of Annotations on the Label Page.

  2. Make sure you have an equal number of Annotations per Label. If a Label does not occur in every document, an unequal distribution is understandable. A Label that appears less often will be predicted with lower confidence in the beginning. The more documents are added to the training, the higher the confidence for this Label will become.

  3. You should have more Status: Training documents than Status: Test documents; at a minimum you need two.

These checks will not improve the accuracy by themselves, but they create a solid base for improving the AI.
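
If you prefer to verify the first two checks programmatically, a minimal sketch using the Konfuzio Python SDK could look as follows; the Project ID is a placeholder, and the attribute names are based on the public konfuzio_sdk package and may differ between SDK versions.

```python
from collections import Counter

from konfuzio_sdk.data import Project

project = Project(id_=123)  # placeholder: use your own Project ID

# Count the Annotations per Label across the Training Documents.
counts = Counter()
for document in project.documents:
    for annotation in document.annotations():
        counts[annotation.label.name] += 1

for label_name, number in counts.most_common():
    note = "" if number >= 10 else "  <-- fewer than 10 Annotations"
    print(f"{label_name}: {number}{note}")
```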

Extraction AI setup

This step might feel surprising: reupload a document which was already used by the AI to learn. Interestingly, this is the fastest way to check whether the AI understands how you are teaching it, i.e. how you are labeling the documents.

When you reupload a document from the training data, your expectation should be that every Annotation you created manually is recreated after the upload. If this is not the case, the structure of your AI is likely inconsistent.

On the left side you see the Training Document; on the right side you see a newly uploaded copy of this document:

Improve NLP model by reviewing the training data

Make sure to see similar results as in the training data

If the model does not learn the structure you have labeled, you need to review your Categories, Label Sets and Labels; something seems to be wrong there.
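
As a rough, programmatic variant of this check, the following sketch compares the manually created Annotations of a Training Document with what the AI produced on a freshly uploaded copy. The document IDs are placeholders, and the method and parameter names (for example use_correct) are assumptions based on the konfuzio_sdk package.

```python
from collections import Counter

from konfuzio_sdk.data import Project

project = Project(id_=123)                    # placeholder Project ID
original = project.get_document_by_id(1000)   # placeholder: Training Document
reupload = project.get_document_by_id(1001)   # placeholder: newly uploaded copy

def label_counts(document, **kwargs):
    # Count Annotations per Label name.
    return Counter(a.label.name for a in document.annotations(**kwargs))

manual = label_counts(original)                        # manually created Annotations
predicted = label_counts(reupload, use_correct=False)  # includes AI-created Annotations

for label_name, expected in manual.items():
    recreated = predicted.get(label_name, 0)
    if recreated < expected:
        print(f"{label_name}: {expected} labeled manually, {recreated} recreated by the AI")
```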

Background information

The Konfuzio AI does not work rule-based but result-oriented. It treats the training data as the desired result and derives rules for itself in order to apply them to new documents and achieve a corresponding result. For it to recognize clear structures in this process, you should also take a clearly structured approach during manual labeling. Irregularities will cause the AI to search for rules and structures that do not exist, making it harder for it to make the right decisions.

The more uniform or homogeneous the documents are among each other, the more accurate the results are. Standardized or normed documents are optimal. However, this is usually not the case and is out of one’s control. In principle, this is not a problem for Konfuzio, but it means that the importance of the quality and quantity of the training data increases with the heterogeneity of the documents.

Change the mode of extraction

Try to change from the default “Word” to the “Character” setup, see Detection mode of Extraction AI.

It is not guaranteed, but some users report a performance increase of about 10 % when switching the detection mode.

Use automated Annotations

Let’s assume you started with a small number of Labels you want to extract, e.g. gross amount and date. After a while you find out you need to label “VAT ID”, too.

What we often see in this scenario is that users forget to add the new Label to documents they have already labeled. The great thing is, once you have trained an AI you can “rerun” it on a document. By doing so, the model will suggest additional Annotations that you did not create yet. The model will never overwrite Annotations you have created or revised manually.

For example, for monetary amounts in receipts, you should either always label the currency (e.g. the euro symbol) or always omit it. It does not matter which way you choose; it is important to do this consistently across all documents and also within a document. Of course, this also applies to other units such as kg, m2 etc. and to other composite information.

How to rerun a model:

  1. Go to Documents and select the documents

  2. Select “Rerun extraction”

  3. Press “Go” and wait for the extraction to finish

img.png

If the model created new Annotations, you will see their number as “Feedback required”.

img_1.png

Enter the Smartview and filter for “Feedback required” to review the Annotations created by your AI.

img_2.png

Background

What if a value is printed on every page?

Let’s take the following example. All pages of a document type contain the date in the upper right corner. Does the date need to be marked on all pages? In a document with many pages, this can become quite time-consuming. Typically, this is still done in the first document, then in the second document the date is marked on the first 3-4 pages and in the third document only on the first page.

This is where the following problem occurs. The AI will look for a reason why the date on the 5th page of the first document was relevant, but the one on the second page of the third document was not. Since there is no meaningful reason here, the AI will be “confused”, in human terms, which has a negative effect on the results. If you rerun the model on a document, you can see where the model would expect Annotations. To prevent confusion: either always label the repeating information on all pages or always only on the first page.

Should I include punctuation?

For consistency, it is important that commas, periods, brackets and other punctuation marks are not included when labeling individual words in a text. You should always mark only the actual content that you want to extract. Punctuation marks usually come from the context of the sentence structure, are rather arbitrary with respect to the training data and are thus not suitable to be analyzed for the purpose of predictions. Otherwise, the AI will in the future look for a comma at the end of the word to be extracted, even if it has nothing to do with the information sought.

Sophisticated checks for Experts

Download the Extraction AI model Evaluation, see Get evaluation as CSV file.
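
The checks below can also be done directly on the CSV export, for example with pandas. Here is a minimal sketch for loading the file; the file name is a placeholder, and the column names used in the following sketches (“confidence”, “offset_string”, “pred_offset_string”, “section_label”, “pred_section_label”, “link”) follow the columns referenced in this section but may be named differently in your export.

```python
import pandas as pd

# Placeholder file name: use the CSV you downloaded from the Evaluation page.
df = pd.read_csv("extraction_ai_evaluation.csv")

# Verify the exact column names of your export before running the checks below.
print(df.columns.tolist())
```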

1. Filter by False Positives (FP) with high Confidence

FP are Annotations that the model predicts but that are wrong. Most of the time there are two reasons for this. First, you forgot to label them: the model learns to predict them from some of the Training Documents, but you forgot to label them in the other Documents. Second, you might have mislabeled Annotations. In the case of a mislabeled Annotation, the model is highly confident that the Label you applied to an Annotation is wrong, which then creates an FP.

To detect missing or mislabeled Annotations, filter for all FP in the result column and sort by the “confidence” value:

  1. Missing Annotations are the most likely cause of FP with high Confidence. If you see Annotations in this list, review those documents by clicking on the link in the “link” column.

  2. Mislabeled Annotations can also be a cause of FP. Filter for all FP where the “offset_string” is equal to the “pred_offset_string”.

If you find a high number of False Positives where the “offset_string” is not equal to the “pred_offset_string”, please try to change the AI model setup, see Change the mode of extraction.

Please contact us if you find a high number of Annotations where the “section_label” differs from the “pred_section_label”.
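
A minimal pandas sketch of the filters described above, assuming the evaluation CSV from the previous sketch and a column (here called “result”) that contains the FP/TP/FN classification; adjust the column names to your export.

```python
import pandas as pd

df = pd.read_csv("extraction_ai_evaluation.csv")  # placeholder file name

# All False Positives, highest confidence first.
fp = df[df["result"] == "FP"].sort_values("confidence", ascending=False)

# 1. Likely missing Annotations: FP with high confidence.
print(fp[["confidence", "offset_string", "pred_offset_string", "link"]].head(20))

# 2. Likely mislabeled Annotations: predicted text equals the labeled text.
mislabeled = fp[fp["offset_string"] == fp["pred_offset_string"]]
print(len(mislabeled), "potentially mislabeled Annotations")

# Candidates for changing the extraction mode: predicted text differs.
mismatched = fp[fp["offset_string"] != fp["pred_offset_string"]]
print(len(mismatched), "FP with differing offset strings")

# Cases to report: "section_label" differs from "pred_section_label".
print(len(fp[fp["section_label"] != fp["pred_section_label"]]), "FP with differing section labels")
```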

2. Filter by True Positives (TP) and sort by Confidence in ascending order

TP are Annotations that the model predicted correctly. If you sort by Confidence, are there Annotations with a low Confidence? The model was able to predict those Annotations correctly, but with low confidence. There are two main reasons why the confidence can be low:

  1. The number of examples in the Training documents is too low: You can review the number of Annotations per Label by visiting Home > Data > Labels. You will find a column “Training Annotations”. If the number of Annotations is unbalanced across Labels, try to add more documents to the training.

  2. Labels are too similar or even relate to the same information: When multiple users work in the same project, it can happen that they create two Labels which refer to the same information but are named differently. The AI will be confused about when to use which Label. Make sure each Label refers to a unique piece of information. If you see Labels which refer to the same information but are named differently, delete the duplicate Label. Delete the Label with fewer Annotations (see Home > Data > Labels) and review all Documents again.
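
Using the same assumptions about the CSV columns, a sketch for ranking True Positives by Confidence in ascending order:

```python
import pandas as pd

df = pd.read_csv("extraction_ai_evaluation.csv")  # placeholder file name

# Correct predictions with the lowest confidence first: candidates for more
# training data or for merging near-duplicate Labels.
tp = df[df["result"] == "TP"].sort_values("confidence", ascending=True)
print(tp[["confidence", "offset_string", "link"]].head(20))
```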

3. Filter by False Negatives (FN)

FN are Annotations that the AI could not predict. Please review your Annotations; most probably those Annotations are wrong. One example could be a “#” which is labeled as Currency.

img_5.png
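
A corresponding filter for False Negatives, again assuming a “result” column:

```python
import pandas as pd

df = pd.read_csv("extraction_ai_evaluation.csv")  # placeholder file name

# Annotations the AI could not predict; open them via "link" and review them.
fn = df[df["result"] == "FN"]
print(fn[["offset_string", "link"]].head(20))
```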

4. Filter by Label

If you suspect that some Label is using wrong training data, a quick way to verify this is to filter by Label and check the “offset_string” column. You can detect wrong Annotations by searching for outliers: for example, you only expect numerical values for the Annotations of the Label “House Number”, but you see a word in the “offset_string” column.
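
To scan a single Label for outliers, you can filter the evaluation rows by Label and inspect the annotated texts. The name of the column holding the Label (here assumed to be “label”) may differ in your export.

```python
import pandas as pd

df = pd.read_csv("extraction_ai_evaluation.csv")  # placeholder file name

# "label" is an assumed column name for the Label of each row.
house_numbers = df[df["label"] == "House Number"]

# Offset strings that are not purely numeric are suspicious for this Label.
outliers = house_numbers[~house_numbers["offset_string"].astype(str).str.isdigit()]
print(outliers[["offset_string", "link"]])
```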