Konfuzio REST API

Visit API Version 2 here.

Preview Version 3

General

  • All list endpoints have pagination and sorting (created_at, asc/desc).

  • All detail endpoints support returning only a subset of the fields.

  • The document endpoint currently has additional filtering in the form of created_at_before and created_at_after. Over time this will be applied to all endpoints through a mixin.
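
As an illustration of the sorting and date filtering described above, here is a minimal sketch of how a client might assemble a list URL. The host is a placeholder, and the `ordering` parameter name (with a leading `-` for descending order) is an assumption borrowed from common Django REST Framework usage; only `created_at_before` and `created_at_after` are named in these notes.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.invalid"  # placeholder host, not part of the spec


def build_list_url(resource, ordering=None, **filters):
    """Build a list URL with optional sorting and filter parameters."""
    params = {}
    if ordering:
        # Assumed parameter name; "-created_at" would mean descending order.
        params["ordering"] = ordering
    # Skip filters that were not set so "None" never ends up in the URL.
    params.update({key: value for key, value in filters.items() if value is not None})
    query = urlencode(params)
    return f"{BASE_URL}/api/{resource}/" + (f"?{query}" if query else "")


url = build_list_url(
    "documents",
    ordering="-created_at",
    created_at_after="2021-01-01",
)
print(url)
```

Keeping the filters as keyword arguments means the same helper would work unchanged once the date filters are rolled out to other endpoints.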

Webhooks

  • Set per project with standard URL fields

  • Sends a POST request with JSON data to the URL

  • Current: document_created. Probably worth adding: document_deleted, airun_complete

  • Send webhooks with Celery so we don’t inflate extraction time with slow POST requests

  • We will provide a short HOWTO and a Django implementation example showing how to secure the receiving endpoint by only allowing POST requests from Konfuzio IPs. This sample implementation could also be used to unit test the webhooks.
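
The receiving side described above could be sketched as follows. This is a framework-agnostic illustration using only the standard library, not the planned Django example. The allowed network is a documentation-only placeholder range (the real Konfuzio addresses are not listed in these notes), and the `event` key in the sample payload is an assumption about the webhook body.

```python
import ipaddress
import json

# Placeholder network (RFC 5737 documentation range), NOT real Konfuzio IPs.
ALLOWED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_allowed_sender(remote_addr, networks=ALLOWED_NETWORKS):
    """Return True if remote_addr parses and lies in one of the networks."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False
    return any(address in network for network in networks)


def handle_webhook(method, remote_addr, body):
    """Return (status_code, payload); payload is None unless accepted."""
    if method != "POST":
        return 405, None  # only POST requests are expected
    if not is_allowed_sender(remote_addr):
        return 403, None  # sender is not on the allowlist
    try:
        payload = json.loads(body)
    except ValueError:  # covers json.JSONDecodeError and bad encodings
        return 400, None
    return 200, payload


# Hypothetical payload shape, for illustration only.
print(handle_webhook("POST", "203.0.113.7", '{"event": "document_created"}'))
```

Because the handler takes plain values rather than a request object, the same function can be called directly from unit tests. In a Django view the three arguments would typically come from `request.method`, `request.META["REMOTE_ADDR"]` and `request.body`.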

Performance

Annotations

  • GET /api/documents/{document_id}/annotations/ (list annotations)

    • paginated

    • GET parameters to filter: is_correct/revised/created_by_machine/top_annotation

    • add url field to return the annotation’s permalink

  • GET /api/documents/{document_id}/annotations/{annotation_id}/ (retrieve annotation)

    • same as list but with a single instance

  • POST /api/documents/{document_id}/annotations/ (create annotations)

    • similar to the current smartview annotation creation endpoint

    • required parameters: the fields in SequenceAnnotationSerializer

    • original_bboxes should be renamed to bboxes (read/write)

    • document that we need EITHER start/end offset or bboxes

  • PUT/PATCH /api/documents/{document_id}/annotations/{annotation_id}/ (update annotation)

    • parameters are the same as create; they are optional in case of PATCH

    • should this still create a negative annotation? Yes. Add to the documentation that changing the label of an annotation might result in a negative copy in certain situations.

  • DELETE /api/documents/{document_id}/annotations/{annotation_id}/ (delete annotation)

    • should this still create a negative annotation? no

    • document that it’s probably better to send a PATCH request with revised=True and correct=False
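
To illustrate the "either offsets or bboxes" rule for creation and the PATCH-based alternative to DELETE, here is a minimal client-side sketch. It reads the rule as "at least one of the two must be present", which is an interpretation. The `label` field is an assumption, since the fields of SequenceAnnotationSerializer are not listed in these notes.

```python
def build_annotation_payload(label, start_offset=None, end_offset=None, bboxes=None):
    """Build a create payload, requiring offsets or bboxes (assumed rule)."""
    offsets_given = start_offset is not None or end_offset is not None
    if offsets_given and (start_offset is None or end_offset is None):
        raise ValueError("start_offset and end_offset must be given together")
    if not offsets_given and not bboxes:
        raise ValueError("provide start_offset/end_offset or bboxes")

    payload = {"label": label}  # hypothetical field name
    if offsets_given:
        payload["start_offset"] = start_offset
        payload["end_offset"] = end_offset
    if bboxes:
        payload["bboxes"] = list(bboxes)  # the renamed original_bboxes field
    return payload


def build_rejection_patch():
    """Body for a PATCH that rejects an annotation instead of deleting it."""
    return {"revised": True, "correct": False}


print(build_annotation_payload(1, start_offset=1880, end_offset=2175))
print(build_rejection_patch())
```

Validating on the client only gives an early error message; the server would still need to enforce the same rule.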

Authentication

  • POST /api/token-auth/ (login)

    • should create a new token if you POST again

  • DELETE /api/token-auth/ (remove token)
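
A possible shape for the two token requests, built with the standard library and not actually sent. The host is a placeholder. The `username`/`password` body fields and the `Authorization: Token ...` header follow the usual Django REST Framework token authentication convention and are assumptions here, not something these notes specify.

```python
import json
from urllib.request import Request

BASE_URL = "https://example.invalid"  # placeholder host
TOKEN_URL = f"{BASE_URL}/api/token-auth/"


def login_request(username, password):
    """Prepare the POST that would create (or re-create) a token."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def logout_request(token):
    """Prepare the DELETE that would remove the caller's token."""
    return Request(
        TOKEN_URL,
        headers={"Authorization": f"Token {token}"},  # assumed header scheme
        method="DELETE",
    )


request = login_request("user@example.invalid", "secret")
print(request.get_method(), request.full_url)
```

Passing either object to `urllib.request.urlopen` would perform the call. Since a second POST is meant to create a new token, clients should not assume an earlier token stays valid after logging in again.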

Categories

  • GET /api/categories/ (list categories)

    • paginated

    • GET parameters to filter: project_id

    • without parameters it returns all categories the user can access

    • otherwise same as current (in testing)

  • GET /api/categories/{category_id}/ (retrieve category)

    • same as list but with a single instance

  • POST /api/categories/ (create category)

    • required parameters: project_id plus the fields in the serializer

    • same as current

  • PUT/PATCH /api/categories/{category_id}/ (update category)

    • parameters are the same as create except project; they are optional in case of PATCH (reference is the current admin panel)

    • same as current

  • DELETE /api/categories/{category_id}/ (delete category)

    • same as current
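
The category endpoints above follow the same list/retrieve/create/update/delete layout that the later sections repeat. As a sketch, the mapping from an action to an HTTP method and path could be captured in one small helper. The action names are illustrative labels and not part of the spec.

```python
# action name -> (HTTP method, whether the path needs an instance ID)
ACTIONS = {
    "list": ("GET", False),
    "retrieve": ("GET", True),
    "create": ("POST", False),
    "update": ("PUT", True),
    "partial_update": ("PATCH", True),
    "delete": ("DELETE", True),
}


def route(resource, action, instance_id=None):
    """Return (method, path) for a resource such as "categories"."""
    method, needs_id = ACTIONS[action]
    if needs_id and instance_id is None:
        raise ValueError(f"{action} requires an instance ID")
    if not needs_id and instance_id is not None:
        raise ValueError(f"{action} does not take an instance ID")
    path = f"/api/{resource}/"
    if needs_id:
        path += f"{instance_id}/"
    return method, path


print(route("categories", "list"))
print(route("categories", "partial_update", 7))
```

The same helper would cover `category-ais`, `extraction-ais`, `labels`, `label-sets` and `projects`, since they share this URL layout.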

Category AIs

  • GET /api/category-ais/ (list category AIs)

    • paginated

    • GET parameters to filter: project_id

    • without parameters it returns all AIs the user can access

    • otherwise same as current (in testing)

  • GET /api/category-ais/{category_ai_id}/ (retrieve category AI)

    • same as list but with a single instance

  • POST /api/category-ais/ (train category AI)

    • required parameters: project_id plus the fields in the serializer

    • currently we have original_project but maybe it’s better to have project_id to be consistent

    • otherwise same as current

  • PUT/PATCH /api/category-ais/{category_ai_id}/ (update category AI)

    • parameters are the same as create except project; they are optional in case of PATCH

    • same as current

  • DELETE /api/category-ais/{category_ai_id}/ (delete category AI)

    • same as current
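
Paginated list endpoints such as the category AI list could be consumed with a small generator like the one below. The response shape (`results` holding the items, `next` holding the next page URL or null) is an assumption based on Django REST Framework's default pagination. The fetch function is injected, so the sketch runs without a network, and the AI names are made up.

```python
def iter_results(first_url, fetch_page):
    """Yield every item across all pages, following "next" links."""
    url = first_url
    while url:
        page = fetch_page(url)  # expected to return the decoded JSON body
        yield from page["results"]
        url = page.get("next")  # None on the last page ends the loop


# Fake two-page response used in place of real HTTP calls.
FAKE_PAGES = {
    "/api/category-ais/?project_id=1": {
        "results": [{"id": 1}, {"id": 2}],
        "next": "/api/category-ais/?project_id=1&page=2",
    },
    "/api/category-ais/?project_id=1&page=2": {
        "results": [{"id": 3}],
        "next": None,
    },
}

ids = [item["id"] for item in iter_results("/api/category-ais/?project_id=1", FAKE_PAGES.__getitem__)]
print(ids)
```

In real use, `fetch_page` would wrap an authenticated HTTP GET and return the parsed JSON.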

Documents

  • General: normalize doc/docs -> document/documents

  • GET /api/documents/ (list documents)

    • paginated

    • GET parameters to filter: project_id

    • without parameters it returns all documents the user can access

    • current fields (excluding bbox)

  • GET /api/documents/{document_id}/ (retrieve document)

    • same as list but with a single instance

    • for labels and groups (annotation sets), there are two options: either a plain list, which requires duplicating information across multiple objects, or some nesting that shows the structure of the document

    • the way to go: probably the nested one, but in a consistent way: merge labels and groups into a single field (label_sets) which looks like this:

    "label_sets": [
      {
        "id": 1,
        "name": "Fahrgast",
        "labels": [
          {
            "id": 1,
            "name": "Anrede",
            "annotations": [
              {
                "id": 8937385,
                "value": "Herr S",
                "correct": true,
                "accuracy": 0.999995502681811,
                "bbox": {
                  "bottom": 286.2144,
                  "page_index": 0,
                  "top": 262.0152,
                  "x0": 40.14,
                  "x1": 546.9264,
                  "y0": 549.7848,
                  "y1": 573.984,
                  "line_index": 1
                },
                "start_offset": 1880,
                "end_offset": 2175
              }
            ]
          }
        ]
      }
    ]
    
    • we lose the label names as keys, but this makes the implementation much simpler, as we can use serializers all the way (SectionLabelSerializer -> LabelSerializer -> AnnotationSerializer) without custom methods that run expensive queries to build the custom dict. We can pass the correct queries to the serializers and let them figure it out. This also generates well-formed Swagger documentation with proper types and examples.

  • GET /api/documents/{document_id}/bbox/ (retrieve document’s bbox)

    • only return a document’s bbox

  • POST /api/documents/{document_id}/search/ (search a document)

    • required parameters: query

    • returns a list of matching bboxes for the query

  • GET /api/documents/{document_id}/pages/{page_number}/ (get a document’s page)

    • returns entities and image URLs for a document’s page

  • ~~POST /api/documents/ (create document)~~

    • ~~new Upload model containing project, data_file_name, dataset_status, category_template, callback_url, sync, extraction_url~~

    • ~~required parameters: project_id~~

    • ~~no files, JSON only~~

    • ~~returns metadata and the URL where to PUT the actual file (/api/documents/upload/{upload_id}/)~~

    • ~~if the file is not uploaded, the Upload instance is deleted after x minutes (1 hour?)~~

  • ~~PUT /api/documents/upload/{upload_id}/ (upload document)~~

    • ~~only accepts a single binary (no JSON)~~

    • ~~must be called after POST /api/documents/ with the specified ID that is returned~~

    • ~~upload_id is going to be different than the document_id so we might encode it (base64?) to avoid confusion~~

    • ~~once complete, the Upload instance is deleted and a Document instance is created~~

    • ~~returns the DocumentSerializer of the created instance~~

  • POST /api/documents/ (create document)

    • keep as it is now, and document with a warning that this endpoint only accepts multipart/form-data

  • PUT/PATCH /api/documents/{document_id}/ (update document details)

    • parameters: assignee, data_file_name, dataset_status, category_template, ?

    • pretty much the same as current

  • DELETE /api/documents/{document_id}/ (delete document)

    • same as current

  • paragraph, segmentation, summarization: do we need these and how should they be changed? (skip for now)
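
Since the nested `label_sets` structure proposed for the document detail endpoint no longer uses label names as keys, a client that wants name-based lookup would have to rebuild it. A minimal sketch, using a trimmed copy of the sample response shown earlier:

```python
# Trimmed version of the sample "label_sets" response from these notes.
SAMPLE_DOCUMENT = {
    "label_sets": [
        {
            "id": 1,
            "name": "Fahrgast",
            "labels": [
                {
                    "id": 1,
                    "name": "Anrede",
                    "annotations": [
                        {
                            "id": 8937385,
                            "value": "Herr S",
                            "correct": True,
                            "start_offset": 1880,
                            "end_offset": 2175,
                        }
                    ],
                }
            ],
        }
    ]
}


def annotations_by_label(document):
    """Group annotations under (label set name, label name) keys."""
    grouped = {}
    for label_set in document.get("label_sets", []):
        for label in label_set.get("labels", []):
            key = (label_set["name"], label["name"])
            grouped.setdefault(key, []).extend(label.get("annotations", []))
    return grouped


grouped = annotations_by_label(SAMPLE_DOCUMENT)
print(grouped[("Fahrgast", "Anrede")][0]["value"])
```

Keying on the pair of names rather than the label name alone avoids collisions when the same label appears in more than one label set.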

Extraction AIs

  • General: mirrors category AIs

  • GET /api/extraction-ais/ (list extraction AIs)

    • paginated

    • GET parameters to filter: project_id

    • without parameters it returns all AIs the user can access

    • otherwise same as current (in testing)

  • GET /api/extraction-ais/{extraction_ai_id}/ (retrieve extraction AI)

    • same as list but with a single instance

  • POST /api/extraction-ais/ (train extraction AI)

    • required parameters: project_id plus the fields in the serializer

    • currently we have original_category but maybe it’s better to have category to be consistent

    • otherwise same as current

  • PUT/PATCH /api/extraction-ais/{extraction_ai_id}/ (update extraction AI)

    • parameters are the same as create; they are optional in case of PATCH

    • same as current

  • DELETE /api/extraction-ais/{extraction_ai_id}/ (delete extraction AI)

    • same as current

Labels

  • GET /api/labels/ (list labels)

    • paginated

    • GET parameters to filter: project_id

    • without parameters it returns all labels the user can access

    • otherwise same as current

  • GET /api/labels/{label_id}/ (retrieve label)

    • same as list but with a single instance

  • POST /api/labels/ (create label)

    • required parameters: project_id plus the fields in the serializer

    • same as current

  • PUT/PATCH /api/labels/{label_id}/ (update label)

    • parameters are the same as create; they are optional in case of PATCH

    • same as current

  • DELETE /api/labels/{label_id}/ (delete label)

    • same as current

  • General: the sectionlabel/label relationship is shown here differently than it is in the admin, should be unified (so that you can change both from label and sectionlabel, probably)

Label Sets

  • General: label sets which are categories should be filtered out of the API

  • GET /api/label-sets/ (list label sets)

    • paginated

    • GET parameters to filter: project_id

    • without parameters it returns all label sets the user can access

    • otherwise same as current (in testing)

  • GET /api/label-sets/{label_set_id}/ (retrieve label set)

    • same as list but with a single instance

  • POST /api/label-sets/ (create label set)

    • required parameters: project_id plus the fields in the serializer

    • same as current

  • PUT/PATCH /api/label-sets/{label_set_id}/ (update label set)

    • parameters are the same as create except project; they are optional in case of PATCH (reference is the current admin panel)

    • same as current

  • DELETE /api/label-sets/{label_set_id}/ (delete label set)

    • same as current

  • To discuss further with Flo about moving Category to a separate model

Projects

  • GET /api/projects/ (list projects)

    • paginated

    • returns all projects the user can access

    • same fields as current

  • GET /api/projects/{project_id}/ (retrieve project)

    • same as list but with a single instance

  • POST /api/projects/ (create project)

    • same as current

  • PUT/PATCH /api/projects/{project_id}/ (update project)

    • parameters are the same as create; they are optional in case of PATCH

    • same as current

  • DELETE /api/projects/{project_id}/ (delete project)

    • same as current

  • GET /api/projects/{project_id}/members/ (list project’s members)

    • paginated

    • returns id and email

    • rationale not to have this as a separate endpoint: the members list doesn’t make sense outside the project context, unlike other models where having all instances regardless of project might be useful

  • POST /api/projects/{project_id}/members/ (add a project member)

    • required parameter: email

    • creates the User if it doesn’t exist

  • DELETE /api/projects/{project_id}/members/{member_id}/ (remove a project member)

  • rationale for not having a PUT/PATCH endpoint for members: it doesn’t make sense to edit a member’s email here, as doing so would change the email of the underlying User; better to DELETE the member and POST a new one
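
The member semantics above (POST creates the User if needed, there is no in-place edit, and an email is replaced by DELETE plus POST) can be modelled with a small in-memory sketch. This is an illustration only, not the server implementation. It assumes that `member_id` is the user's ID and that removing a member leaves the User itself in place, neither of which is stated in these notes.

```python
class ProjectMembers:
    """In-memory model of one project's member list."""

    def __init__(self):
        self.users = {}       # email -> user ID (all known users)
        self.member_ids = set()  # user IDs that belong to this project
        self._next_id = 1

    def add(self, email):
        """POST: reuse the user with this email or create a new one."""
        user_id = self.users.get(email)
        if user_id is None:
            user_id = self._next_id
            self._next_id += 1
            self.users[email] = user_id
        self.member_ids.add(user_id)
        return {"id": user_id, "email": email}

    def remove(self, member_id):
        """DELETE: drop the membership (assumed to keep the user record)."""
        self.member_ids.discard(member_id)

    def list(self):
        """GET: return id and email for each member, ordered by id."""
        members = [
            {"id": user_id, "email": email}
            for email, user_id in self.users.items()
            if user_id in self.member_ids
        ]
        return sorted(members, key=lambda member: member["id"])


project = ProjectMembers()
old = project.add("old@example.invalid")
project.remove(old["id"])            # replace a member: DELETE ...
project.add("new@example.invalid")   # ... then POST the new email
print(project.list())
```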