Document Classification#

This tutorial covers the classification of documents with AI. It is not about extracting information from a document, but about automatically sorting documents into categories. You can use this to file documents and to streamline your document management and back-office tasks.

Using a public dataset, which you can access via this link, our goal is to classify receipts into five industries: café, restaurant, hotel, retail, and public transportation. You do not create any rules; the AI learns from labeled examples which industry a new receipt belongs to.
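To make "learning from examples instead of rules" concrete, here is a deliberately tiny sketch. It is not how the Categorization AI in this product works; it is a toy word-count scorer (naive Bayes style with Laplace smoothing, without class priors), and the receipt texts are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count word occurrences per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    for text, category in examples:
        word_counts[category].update(text.lower().split())
    return word_counts


def classify(word_counts, text):
    """Return the category whose training words best match the text."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    best_category, best_score = None, -math.inf
    for category, counts in word_counts.items():
        total = sum(counts.values())
        # Laplace-smoothed log-likelihood of the words under this category.
        score = sum(
            math.log((counts[word] + 1) / (total + len(vocabulary)))
            for word in text.lower().split()
        )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Invented example texts; real receipts would come from the dataset.
examples = [
    ("espresso cappuccino croissant", "Café"),
    ("latte espresso muffin", "Café"),
    ("room night minibar breakfast", "Hotel"),
    ("double room night city tax", "Hotel"),
    ("single ticket zone tram", "Public Transport"),
    ("day ticket bus zone", "Public Transport"),
]
model = train(examples)
print(classify(model, "cappuccino and espresso"))  # Café
print(classify(model, "bus ticket"))               # Public Transport
```

No rule such as "if the text contains 'espresso', choose Café" is ever written down; the mapping comes entirely from the labeled examples, which is why the quality and number of examples matter.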

graph LR
    subgraph Users in Project
        Users --- Project
    end
    subgraph Category
        Project --- Coffee
        Project --- Train
        Project --- Retail
        Project --- Restaurant
        Project --- Transport
    end

Train Document AI to categorize receipts.#

1. Create project#

See how to Create a Project.

2. Create Categories#

Add one category for each class you want to distinguish. You only have to enter the name of the category (here: “Café”, “Restaurant”, “Hotel”, “Retail” and “Public Transport”) and select your project.

3. Create training data#

Now click on DOCUMENTS to get to the document view. Here you can use your existing documents or upload new ones. Preparing the training data is especially easy if the file name indicates which category a document belongs to.

Next, show the AI which documents belong to which category. To do this, select the corresponding category in the “CATEGORY” column in the respective tab of the documents. Changes are saved automatically.

Assigning a category is only possible if the documents are not in the training, test or preparation dataset. If a document is already in one of these datasets, first remove it with the action “Remove from dataset”, then assign the category, and finally add the document back to the training dataset.

To get high-quality results suitable for automated processing, you should have at least 50 documents per category. With our 5 categories, we therefore use a training dataset of 250 documents. You can add more files to the test dataset to evaluate the Categorization AI model later (beta). Make sure that the documents in the two datasets do not overlap under any circumstances.

If you have a file that contains documents of several categories, it is crucial that you split it beforehand and upload the parts individually, so that you can assign a category to each of them separately.
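Before uploading, it can help to check the file list locally. The sketch below is only an illustration: the naming scheme (`hotel_0001.pdf` and so on) and all function names are assumptions, and "overlap" is read here as the same file appearing in both the training and the test dataset.

```python
from collections import Counter
from pathlib import Path

CATEGORIES = ["Café", "Restaurant", "Hotel", "Retail", "Public Transport"]
MIN_DOCS_PER_CATEGORY = 50  # guideline from this tutorial


def category_from_file_name(file_name):
    """Return the category whose name prefixes the file name, else None.

    Assumes a hypothetical naming scheme such as 'hotel_0001.pdf'.
    """
    stem = Path(file_name).stem.lower().replace("-", "_")
    for category in CATEGORIES:
        prefix = category.lower().replace("é", "e").replace(" ", "_")
        if stem.startswith(prefix):
            return category
    return None


def underfilled_categories(training_files):
    """Map each category with too few training documents to its count."""
    counts = Counter(category_from_file_name(name) for name in training_files)
    return {c: counts[c] for c in CATEGORIES if counts[c] < MIN_DOCS_PER_CATEGORY}


def overlapping_files(training_files, test_files):
    """List file names that appear in both datasets."""
    return sorted(set(training_files) & set(test_files))


# Invented file lists for illustration.
training_files = [
    f"{prefix}_{i:04d}.pdf"
    for prefix in ("cafe", "restaurant", "hotel", "retail")
    for i in range(50)
]
training_files += [f"public_transport_{i:04d}.pdf" for i in range(40)]
test_files = ["hotel_0001.pdf", "hotel_9001.pdf"]

print(underfilled_categories(training_files))         # {'Public Transport': 40}
print(overlapping_files(training_files, test_files))  # ['hotel_0001.pdf']
```

Both results should be empty before you start training: every category reaches the 50-document guideline, and no file is shared between the two datasets.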

4. Train Categorization AI#

Have a look here to see how to train a Categorization AI fully automatically.

Wait for the e-mail that arrives once training has finished and the new AI has been evaluated on the test data.

5. Use Categorization AI#

To check whether your Categorization AI model has finished training, click HOME > “Categorization AIs”. There you can also see a statistical evaluation of your Categorization AIs.

Go to DOCUMENTS and upload new documents as a test to see whether they are classified correctly. The AI should now automatically show the correct category in the “CATEGORY” column.
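If you note the expected and the displayed category for each test upload, a few lines of Python can summarize how well the check went. This is a minimal sketch with invented results; the function name and the data layout are assumptions and are not part of the product.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Compute the share of correct predictions per expected category.

    `results` is an iterable of (expected_category, predicted_category) pairs.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for expected, predicted in results:
        total[expected] += 1
        if predicted == expected:
            correct[expected] += 1
    return {category: correct[category] / total[category] for category in total}


# Invented results: what you expected vs. what the "CATEGORY" column showed.
results = [
    ("Hotel", "Hotel"),
    ("Hotel", "Hotel"),
    ("Café", "Restaurant"),
    ("Café", "Café"),
]
print(accuracy_by_category(results))  # {'Hotel': 1.0, 'Café': 0.5}
```

A category with a noticeably lower share of correct predictions is a candidate for more training documents.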

6. Integrate via API, SDK or CSV#

You can integrate the classification in many ways, and there are many use cases: companies use it, for example, to automatically tag document archives or to split documents. This kind of preprocessing makes it easy to filter the contents of a certain document category. Have a look at our Integrations & API to see all options.
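As an illustration of filtering by category after a CSV export, the sketch below uses only the Python standard library. The column names `file_name` and `category` and the sample rows are assumptions; check your actual export for the real layout.

```python
import csv
import io


def documents_in_category(csv_text, category, category_column="category"):
    """Return the rows of a CSV export whose category matches."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[category_column] == category]


# Invented export with hypothetical column names.
export = (
    "file_name,category\n"
    "receipt_01.pdf,Hotel\n"
    "receipt_02.pdf,Café\n"
    "receipt_03.pdf,Hotel\n"
)
hotel_documents = documents_in_category(export, "Hotel")
print([row["file_name"] for row in hotel_documents])
# ['receipt_01.pdf', 'receipt_03.pdf']
```

The same filter works on a file by passing its contents, for example `Path("export.csv").read_text(encoding="utf-8")`, instead of the inline string.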