Categories

A Category is a group of Documents that can be extracted by one Extraction AI. A Project can contain multiple Categories.

Add a category

img.png

Category details

img_2.png

img_1.png

Name

Verbose name of the category.

API Name

Technical name of the category used in the REST-API.

Description

A short description of the category. We highly recommend filling it in: even if the description seems trivial, it helps users share knowledge.

Active Extraction AI

The extraction AI that is used for all documents classified as this category.

Detection mode of Extraction AI

Level of granularity the extraction AI will be trained on. It defines the smallest element the extraction AI of this category can operate on.

Character

Use the character option if the text you want to extract is only part of a word, as marked in the SmartView.

img_5.png

Word

The smallest element will be one word as marked in the SmartView.

img_3.png

Multiple words can be connected.

img_4.png

Sentence (Beta)

A full sentence is marked.

img_10.png

Paragraph (Beta)

Paragraphs are detected with a computer vision approach.

img_6.png

Extraction AI Parameters

img_7.png

The training of the extraction AI has advanced configurations that you can adjust to better fit your use case.

These configurations generate additional features that may help your extraction AI identify the entities you aim to detect, such as specific names or dates. These entities are learned from your annotations: the extraction AI takes the text of the document, looks through the words that you annotated and learns the features associated with them. You can extend these features by specifying helpful words or expressions, or the number of neighboring words that the extraction AI should consider when evaluating a word in the document.

The configurations that you can specify are:

  • catchphrase features: look for the distance in lines from the word being evaluated to the specified expressions

  • substring features: look for the existence of certain word(s) in the page

  • n nearest: look for n words in the neighborhood

  • n nearest across lines: look for the neighbors also across lines

  • separate labels: considers Labels shared by Label Sets as different Labels

You can adjust them in the field “Extraction AI parameters” of the category view. The content of the field must be valid JSON.
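As a rough illustration, the sketch below builds a parameters object in Python and checks that it is valid JSON before it is pasted into the field. It assumes that the keys documented on this page can be combined in one object. The helper `check_parameters` and the chosen values are hypothetical and not part of any Konfuzio API.

```python
import json

# Keys documented on this page and the JSON types they expect.
EXPECTED_TYPES = {
    "catchphrase_features": list,
    "substring_features": list,
    "n_nearest": (int, list),
    "n_nearest_across_lines": bool,
    "separate_labels": bool,
}


def check_parameters(raw):
    """Hypothetical local sanity check; not part of any Konfuzio API."""
    params = json.loads(raw)  # raises json.JSONDecodeError if the text is not valid JSON
    if not isinstance(params, dict):
        raise ValueError("expected a JSON object")
    for key, value in params.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            raise ValueError(f"key not documented here, possibly a typo: {key!r}")
        if not isinstance(value, expected):
            raise ValueError(f"unexpected type for {key!r}")
    return params


# Example values only; adjust them to your own documents.
field_content = json.dumps(
    {
        "catchphrase_features": ["Invoice Nr.:"],
        "substring_features": ["Obligations"],
        "n_nearest": [2, 2],
        "n_nearest_across_lines": True,
        "separate_labels": False,
    },
    indent=2,
)
params = check_parameters(field_content)
print(field_content)
```

A common mistake this catches is Python-style input such as single-quoted strings or `True` instead of `true`, which is not valid JSON.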

Catchphrase features

Use this feature if the entities that you aim to identify appear after some specific words/expressions.

For example, suppose you want to identify the invoice number, and in most documents of the dataset the invoice number comes right after the expression “Invoice Nr.:”, even if at different locations in the document.

In the “Extraction AI parameters” field, pass the catchphrase features as a list with the key ‘catchphrase_features’. By default, no catchphrase features are used.

{
  "catchphrase_features": [
    "Invoice Nr.:"
  ]
}

The way it works is:

  • For each phrase/expression in the catchphrase list, it collects the line numbers for which the document text matches it. The match is exact.

  • The lines are numbered by splitting the document text at line breaks (‘\n’). Page breaks (‘\f’) are first converted to line breaks (‘\n’).

  • For each annotation, a dictionary is added as an attribute. It contains the number of lines between the line of the annotation and the closest preceding occurrence of the catchphrase. The closest occurrence can be on the same line as the annotation (distance = 0). If there are no occurrences before the annotation, the distance is -1.

  • Each catchphrase will correspond to a feature.

Restrictions
  • It only checks for catchphrase occurrences on previous lines or on the same line.

  • The catchphrase cannot be a multiline expression. It must be contained in a single line of text.

  • It searches for exact matches (i.e. case sensitive).

  • It only considers the closest occurrence. It is not possible to consider the distance from multiple occurrences of the same catchphrase.
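The steps and restrictions above can be sketched in a few lines of Python. This is a simplified illustration and not the actual implementation: the function name `catchphrase_distances` and the zero-based line index passed for the annotation are assumptions, and any occurrence on the annotation's own line is treated as a match regardless of its horizontal position.

```python
def catchphrase_distances(text, annotation_line, catchphrases):
    """Lines between an annotation and the closest preceding catchphrase (-1 if none)."""
    # Page breaks are converted to line breaks before the lines are numbered.
    lines = text.replace("\f", "\n").split("\n")
    distances = {}
    for phrase in catchphrases:
        # Exact, case-sensitive match, restricted to the annotation's line and the lines before it.
        hits = [i for i, line in enumerate(lines[: annotation_line + 1]) if phrase in line]
        # Only the closest occurrence (the last hit) is used.
        distances[phrase] = annotation_line - hits[-1] if hits else -1
    return distances


text = "ACME Corp\nInvoice Nr.: 12345\fPage 2\nTotal: 10 EUR"
same_line = catchphrase_distances(text, 1, ["Invoice Nr.:", "Total:"])
later_line = catchphrase_distances(text, 3, ["Invoice Nr.:", "invoice nr.:"])
print(same_line)   # {'Invoice Nr.:': 0, 'Total:': -1}
print(later_line)  # {'Invoice Nr.:': 2, 'invoice nr.:': -1}
```

In the second call the lowercase variant yields -1, which shows why the catchphrase has to be written exactly as it appears in the documents.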

Substring features

Use this feature if the entities that you aim to identify normally appear together with certain words or expressions on the page.

For example, if the entities you want to identify belong to the obligations section of a contract, and that section is identified in the page header as “Obligations”, you can use this word as a substring feature.

In the “Extraction AI parameters” field, pass the substring features as a list with the key ‘substring_features’. By default, no substring features are used.

{
  "substring_features": [
    "Obligations"
  ]
}

The way it works is:

  • For each annotation, it checks the existence of each substring in the text of the page where the annotation is. The match is exact.

  • A list is created for each annotation that records whether each substring exists. “True” is added if the substring matches in the text of the page. “False” is added if there is no match or if there is no information about the page where the annotation is.

  • The list is added as an attribute to the annotation.

  • Each substring will correspond to a feature.

Restrictions
  • It searches for exact matches (i.e. case sensitive).
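A minimal sketch of this check, assuming the page text is available as a plain string (or `None` when the page is unknown). The function name `substring_flags` is hypothetical.

```python
def substring_flags(page_text, substrings):
    """One boolean per substring: True if it occurs in the text of the annotation's page."""
    if page_text is None:
        # No information about the page: every substring counts as not found.
        return [False] * len(substrings)
    # Exact, case-sensitive containment check.
    return [substring in page_text for substring in substrings]


page = "Obligations\nThe supplier shall deliver the goods within 30 days."
flags = substring_flags(page, ["Obligations", "obligations", "Warranty"])
print(flags)  # [True, False, False]
```

Each position in the returned list corresponds to one configured substring, which matches the statement above that each substring becomes one feature.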

N nearest

Use this feature if you think that providing more context can help to detect the entities that you want to identify.

In the “Extraction AI parameters” field, pass the number of neighboring words to consider as an int or a list with the key ‘n_nearest’. If you pass a list, the first number is the number of words to consider to the left and the second is the number of words to consider to the right. If you pass an int, the same number of words is considered to the left and to the right. By default, the Extraction AI considers 2 words to the left and 2 words to the right.

{
  "n_nearest": [
    2,
    2
  ]
}

The way it works is:

  • For each annotation, it gets all left and right neighbors that are on the same line.

  • Given “n_nearest”: [l, r], the l closest neighbors to the left and the r closest neighbors to the right are selected.

  • For each of those neighbors, the offset string and the distance between the annotation and the neighbor are added as features.
    The distance is calculated as follows:
    for left neighbors: annotation.x0 - neighbor[‘x1’]
    for right neighbors: neighbor[‘x0’] - annotation.x1

  • If there are fewer than l left or r right neighbors, “fake” neighbors are added to reach the specified number.
    The “fake” neighbors have an empty offset string (“”) and a distance to the annotation of 100000.

  • For each of the selected neighbors, the same features as the ones used for the annotation are also added.

Restrictions
  • It only checks for neighboring words on the same text line.
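The selection and padding of neighbors can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: words are represented as plain dictionaries with the keys `offset_string`, `x0` and `x1`, the function name `nearest_neighbors` is hypothetical, and the additional per-neighbor features mentioned above are left out.

```python
FAKE_DISTANCE = 100000  # distance assigned to "fake" neighbors, as described above


def nearest_neighbors(annotation, line_words, n_nearest=2):
    """Return (left, right) lists of (offset_string, distance) for words on the same line."""
    # An int applies to both sides; a list is read as [left, right].
    n_left, n_right = (n_nearest, n_nearest) if isinstance(n_nearest, int) else n_nearest
    left = [(w["offset_string"], annotation["x0"] - w["x1"])
            for w in line_words if w["x1"] <= annotation["x0"]]
    right = [(w["offset_string"], w["x0"] - annotation["x1"])
             for w in line_words if w["x0"] >= annotation["x1"]]
    # Keep the closest neighbors on each side.
    left = sorted(left, key=lambda pair: pair[1])[:n_left]
    right = sorted(right, key=lambda pair: pair[1])[:n_right]
    # Pad with "fake" neighbors when the line has too few words.
    left += [("", FAKE_DISTANCE)] * (n_left - len(left))
    right += [("", FAKE_DISTANCE)] * (n_right - len(right))
    return left, right


line_words = [
    {"offset_string": "Invoice", "x0": 0, "x1": 40},
    {"offset_string": "Nr.:", "x0": 45, "x1": 65},
    {"offset_string": "from", "x0": 105, "x1": 125},
]
annotation = {"offset_string": "12345", "x0": 70, "x1": 100}
left, right = nearest_neighbors(annotation, line_words, [2, 2])
print(left)   # [('Nr.:', 5), ('Invoice', 30)]
print(right)  # [('from', 5), ('', 100000)]
```

The right side contains one padded entry because only one real word follows the annotation on this line.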

N nearest across lines

Use this feature if the neighboring words should also be considered across the text lines. It completes the “n nearest” feature by searching for neighbors in other lines than the one where the annotation is.

For example, an entity that you want to identify may normally have a relevant word on the line that precedes it. Or, if there aren’t many words in the lines of the annotations, neighbors are collected from the previous and next lines instead of adding “fake” neighbors.

In the “Extraction AI parameters” field, pass the boolean variable “true” with the key ‘n_nearest_across_lines’. By default, the neighboring words are not considered across lines.

{
  "n_nearest_across_lines": true
}

The way it works is:

  • For each annotation, it checks the previous lines (line by line), starting from the line where the annotation is.

  • For each word in that line, it gets the minimum distance of the x coordinates between the annotation and the word.

  • For each of those words, the offset string and the distance between the coordinates of the annotation and the word are added as features, as well as the distance in lines to the annotation.

  • It stops checking the lines when the number of neighboring words is equal to or greater than the specified number of left neighbors.

    • If the annotation has no left neighbors in its line, the limit is defined by the specified value for the left neighbors.

    • If the annotation has many left neighbors in its line, no words from the previous lines will be considered.

  • The same process happens for the right neighbors but the lines to be verified are those that are after the line where the annotation is.

Restrictions
  • The number of neighboring words across lines is limited by the “n_nearest” parameter.
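The left-hand side of this process can be sketched roughly as below. Several details are assumptions made for illustration only: the “minimum distance of the x coordinates” is read as the smallest absolute difference between any pair of x edges, words within a line are taken in order of that distance, and the search starts at the line directly above the annotation because same-line neighbors are already counted. The function names are hypothetical.

```python
def x_distance(annotation, word):
    """Smallest absolute difference between the x edges of the two boxes (assumed reading)."""
    return min(abs(a - w)
               for a in (annotation["x0"], annotation["x1"])
               for w in (word["x0"], word["x1"]))


def left_neighbors_across_lines(annotation, lines, line_index, same_line_left, n_left):
    """Collect (offset_string, x_distance, line_distance) from the lines above the annotation."""
    needed = max(n_left - same_line_left, 0)  # nothing to collect if the line already has enough
    collected = []
    current = line_index - 1
    while len(collected) < needed and current >= 0:
        candidates = sorted(
            ((w["offset_string"], x_distance(annotation, w), line_index - current)
             for w in lines[current]),
            key=lambda item: item[1],
        )
        # Never exceed the limit given by "n_nearest".
        collected.extend(candidates[: needed - len(collected)])
        current -= 1
    return collected


lines = [
    [{"offset_string": "Invoice", "x0": 0, "x1": 40},
     {"offset_string": "Number", "x0": 45, "x1": 85}],
    [],  # line of the annotation, with no other words on it
]
annotation = {"offset_string": "12345", "x0": 50, "x1": 90}
across = left_neighbors_across_lines(annotation, lines, 1, same_line_left=0, n_left=2)
print(across)  # [('Number', 5, 1), ('Invoice', 10, 1)]
```

The right-hand side would mirror this loop, walking the lines after the annotation instead of those before it.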

Separate Labels

Use this feature if you have Labels shared by different label sets and the results of the extraction AI are not assigned to the correct label set.

For example, suppose you want to identify the surname of the sender and the surname of the receiver in invoices. There is a Label “Surname” associated with the label set “Sender” and also with the label set “Receiver”. However, after retraining an extraction AI, the results attribute the surname of the sender to the receiver. This feature helps to distinguish the two.

In the “Extraction AI parameters” field, pass the boolean variable “true” with the key ‘separate_labels’. By default, the separation does not occur.

{
  "separate_labels": true
}

The way it works is:

  • For each annotation, we get the name of the label and the name of the label set it is associated with. For example, “Surname” from “Sender”.

  • A new label is created with the new name defined based on the name of the label and label set collected. For example, “Sender__Surname”.
    If the label already exists, this step is skipped.

  • The annotation is associated with this new label.

  • In the extraction step, the label set name is extracted from the resulting label and the label name is rewritten to its original version. For example, the result with the Label “Receiver__Surname” will correspond to a result with the label “Surname” in the Label Set “Receiver”.

  • Labels associated with the label set used as category will not be rewritten.
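The renaming before training and the reverse mapping at extraction time can be sketched as follows. The function names are hypothetical, the category name “Invoice” is an example value, and the sketch assumes that a result label without the “__” separator belongs to the label set used as category.

```python
SEPARATOR = "__"  # as in the example name "Sender__Surname"


def separated_label_name(label, label_set, category):
    """Training step: combine label set and label name, except for the category's own label set."""
    if label_set == category:
        return label  # labels of the label set used as category are not rewritten
    return f"{label_set}{SEPARATOR}{label}"


def restore_label(result_label, category):
    """Extraction step: split a combined name back into (label_set, label)."""
    if SEPARATOR in result_label:
        label_set, label = result_label.split(SEPARATOR, 1)
        return label_set, label
    return category, result_label


combined = separated_label_name("Surname", "Sender", "Invoice")
restored = restore_label("Receiver__Surname", "Invoice")
print(combined)  # Sender__Surname
print(restored)  # ('Receiver', 'Surname')
```

With this mapping the extraction AI is trained on “Sender__Surname” and “Receiver__Surname” as two distinct labels, while the results still show the label “Surname” in the respective label set.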

Extraction evaluation for active extraction AI

Summarizes the quality of the active extraction AI for the corresponding label set.

img_9.png