DOKuStar Extraction
DOKuStar Extraction processes mixed batches of scanned documents. A batch can consist of forms, single- or multi-page invoices, or even complete process reports containing all sorts of different documents. DOKuStar Extraction classifies the individual pages, bundles pages into documents and records, and extracts the essential data from each document.
With the "Invoice" option, DOKuStar can process arbitrary commercial invoices. The option provides a knowledge base for recognising the data that must be captured from every invoice.
With this knowledge base, DOKuStar can recognise invoice header data from any invoice material at a high recognition rate, without being trained on the customer's invoices. The recognition result is therefore tolerant of changes of supplier and changes in invoice layout.
The procedure is divided into three steps, each of which is explained in detail below: document classification, information extraction and document creation. Each step can also be performed independently of the others.
DOKuStar Extraction consists of two components. DOKuStar Design Studio is the system administrators' tool. Here, they define a project in which they specify the document classes to be differentiated, the information to be extracted for each document type and how the individual pages are combined to form documents.
The DOKuStar Extraction Engine processes the individual document batches. First, a project file is loaded; next, the batch is processed image by image. The classification results and the extracted data are stored in an XML file. The DOKuStar Extraction Engine offers a programming interface for integrating it into other systems or specialised applications. It will usually run on a server without requiring any user monitoring.
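As a rough illustration of storing per-image results in XML, the following Python sketch serialises classification results and extracted fields. It does not use the DOKuStar programming interface; the element and attribute names (`batch`, `page`, `field`, `documentClass`) and the input dictionary layout are assumptions made for this example and are not the actual DOKuStar result schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def results_to_xml(batch_results: List[Dict]) -> str:
    """Serialise per-image results into one XML string (hypothetical schema)."""
    root = ET.Element("batch")
    for page in batch_results:
        # One <page> element per processed image, carrying its class.
        page_el = ET.SubElement(
            root,
            "page",
            image=page["image"],
            documentClass=page["document_class"],
        )
        # One <field> element per extracted value.
        for name, value in page["fields"].items():
            field_el = ET.SubElement(page_el, "field", name=name)
            field_el.text = value
    return ET.tostring(root, encoding="unicode")


example = [
    {
        "image": "0001.tif",
        "document_class": "invoice",
        "fields": {"invoice_number": "4711", "total": "1,234.50"},
    }
]
xml_text = results_to_xml(example)
```

A downstream system could then read the file with any XML parser and pick out the fields it needs.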
Design Studio
The DOKuStar Design Studio is the administrator’s workplace. Here, administrators define document classes, document types and the data to be extracted, such as index or data fields. Rules are generated that tell the system how to find the relevant information within the documents.
Version 3.8 of Design Studio comes with many new and improved features. To facilitate these new features, Design Studio has been equipped with a redesigned interface.
The set-up, input and result views are now independent of each other. Selecting a group of fields allows shared parameters to be modified simultaneously.
The current status of Design Studio can be saved. This allows users to seamlessly carry on working from where they left off after restarting their computers. Statistics and monitors help to optimise recognition tasks. Every rule that is generated can be tested immediately. Feedback on the success or failure of a rule quickly leads to the highest recognition rates.
Option "Classification"
The classification module determines the class of each page to be processed. The set of classes is always defined by the application. A commercial enterprise, for example, might require the classes "invoice", "delivery form", "order", "credit note" and "other". In practice, however, the number of document classes tends to be much higher. To classify the current document, DOKuStar searches it for tell-tale characteristics specified by the administrator. These tend to be certain keywords or phrases. The DOKuStar field type concept supports a large number of different characteristics, which can be linked using logical operators. This results in reliable and comprehensible recognition of even subtle differences.
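The idea of keyword characteristics linked by logical operators can be sketched in Python as follows. This is a minimal illustration and not DOKuStar's rule format: the helper names (`contains`, `all_of`, `any_of`, `negate`), the keyword lists and the first-match strategy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

# A characteristic is modelled as a predicate over the recognised page text.
Rule = Callable[[str], bool]


def contains(phrase: str) -> Rule:
    """Characteristic: the phrase occurs in the text (case-insensitive)."""
    needle = phrase.lower()
    return lambda text: needle in text.lower()


def all_of(*rules: Rule) -> Rule:
    """Logical AND of several characteristics."""
    return lambda text: all(rule(text) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    """Logical OR of several characteristics."""
    return lambda text: any(rule(text) for rule in rules)


def negate(rule: Rule) -> Rule:
    """Logical NOT of a characteristic."""
    return lambda text: not rule(text)


@dataclass
class DocumentClass:
    name: str
    rule: Rule


# Hypothetical class definitions; a real project would define many more.
CLASSES: List[DocumentClass] = [
    DocumentClass("credit note", contains("credit note")),
    DocumentClass(
        "invoice",
        all_of(contains("invoice"), negate(contains("credit note"))),
    ),
    DocumentClass(
        "delivery form",
        any_of(contains("delivery note"), contains("packing list")),
    ),
    DocumentClass("order", contains("purchase order")),
]


def classify(text: str) -> str:
    """Return the first class whose rule matches, else 'other'."""
    for document_class in CLASSES:
        if document_class.rule(text):
            return document_class.name
    return "other"
```

Because each rule is an explicit combination of named characteristics, it stays easy to see why a page was assigned to a class, which is the "comprehensible" property mentioned above.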
Besides document classes, DOKuStar also uses the concept of document types. These allow finer differentiation between similar classes, such as grouping invoices according to suppliers. This finer differentiation mechanism helps to optimise the system's recognition rate.
Option "Information Extraction from Domain Specific Application"
Classification itself is not enough to achieve automated document processing. There is also the task of extracting data from the documents. If this information is only used to find a document in an archive system, the process is called indexing. If the data is transferred into a specialised application for further processing, we refer to this as 'data capture'. From a technical point of view, both processes are identical and are treated as such within DOKuStar.
Compared with classification, data extraction is a much more demanding document interpretation task. But both steps are interdependent: The more detailed the roster of document classes and types, the more the data extraction process will resemble the familiar form-capture process.
DOKuStar offers field types to facilitate data extraction. These are essentially the same as the characteristics used to classify documents. There are specialised field types for frequent data types such as amounts, addresses and dates. Application-specific data types can be modelled using the Regular Expression field type.
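To illustrate how regular expressions can model both common and application-specific data types, here is a small Python sketch. The patterns, the field names and the `KD-` customer number format are invented for this example; they are not DOKuStar's built-in field types.

```python
import re
from typing import Dict, Optional

# Hypothetical field definitions, one regular expression per field.
FIELD_PATTERNS = {
    # Amount such as 1,234.50 (thousands separators optional).
    "amount": re.compile(r"\b\d{1,3}(?:,\d{3})*\.\d{2}\b"),
    # ISO-style date such as 2011-03-15.
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    # Application-specific type: an assumed customer number format.
    "customer_number": re.compile(r"\bKD-\d{6}\b"),
}


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each field, or None if nothing matches."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(0) if match else None
    return result


sample = "Customer KD-004711, invoice date 2011-03-15, total 1,234.50 EUR"
fields = extract_fields(sample)
```

A production rule set would also have to cope with OCR errors and locale-specific number and date formats, which this sketch ignores.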
The same kind of information, such as a date, may occur several times within one document. To tell these occurrences apart, keywords and phrases can be defined, which are then assigned to the corresponding data type via the Key Value field type.
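The following Python sketch shows the general key-value idea: a value is only accepted if it directly follows one of the configured keywords. The function name `key_value`, the keyword lists and the date pattern are assumptions for illustration and do not reflect how the Key Value field type is implemented.

```python
import re
from typing import Optional, Sequence

# Assumed value pattern: ISO-style dates.
DATE = r"\d{4}-\d{2}-\d{2}"


def key_value(
    text: str, keywords: Sequence[str], value_pattern: str
) -> Optional[str]:
    """Return the first value that directly follows one of the keywords."""
    for keyword in keywords:
        pattern = re.compile(
            re.escape(keyword) + r"\s*:?\s*(" + value_pattern + r")",
            re.IGNORECASE,
        )
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


page_text = "Invoice date: 2011-03-15\nDelivery date: 2011-03-10"
invoice_date = key_value(page_text, ["Invoice date", "Date of invoice"], DATE)
delivery_date = key_value(page_text, ["Delivery date"], DATE)
```

Both dates have the same data type, yet each lands in its own field because it is tied to a different keyword.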
Document Creation
DOKuStar creates the document structure by performing the following steps:
- Every single page is classified according to its type.
- The Composer then assembles the classified pages into whole documents. For this to function, the module requires a description of the existing documents and the pages (including optional ones) that they are composed of.
- Next, the module determines the type of the assembled document. If there were individual pages that could not be classified, the page type may now emerge from the overall document structure.
- The required data is extracted from the documents. All data associated with a particular document is stored in a single results file.
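The assembly step described above can be sketched in Python as follows. This is a simplified illustration, not the Composer's actual algorithm: the page class names, the `DocumentSpec` structure and the rule that an unclassified page joins the document currently being built are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class DocumentSpec:
    """Describes a document: a mandatory first page plus optional pages."""
    first_page: str
    following_pages: FrozenSet[str]


# Hypothetical document descriptions.
SPECS: Dict[str, DocumentSpec] = {
    "invoice": DocumentSpec("invoice_first", frozenset({"invoice_following"})),
    "order": DocumentSpec("order_first", frozenset({"order_following"})),
}


@dataclass
class Document:
    doc_type: str
    pages: List[int] = field(default_factory=list)


def compose(page_classes: List[Optional[str]]) -> List[Document]:
    """Group classified pages (None = unclassified) into documents."""
    first_to_type = {spec.first_page: name for name, spec in SPECS.items()}
    documents: List[Document] = []
    current: Optional[Document] = None
    for index, page_class in enumerate(page_classes):
        if page_class in first_to_type:
            # A first page always opens a new document.
            current = Document(first_to_type[page_class])
            documents.append(current)
        elif (
            current is not None
            and current.doc_type in SPECS
            and (
                page_class is None
                or page_class in SPECS[current.doc_type].following_pages
            )
        ):
            # Following page, or an unclassified page whose role is
            # inferred from the document it sits in.
            pass
        else:
            # Page does not fit the current document description.
            current = Document("unknown")
            documents.append(current)
        current.pages.append(index)
    return documents


docs = compose(
    ["invoice_first", None, "invoice_following", "order_first", "invoice_following"]
)
```

In this run the unclassified second page is absorbed into the invoice, while the stray invoice page after the order ends up in a separate "unknown" document.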