Open Text Corporation

Open Text Capture Document Extraction (DOKuStar)

Open Text Capture Document Extraction (DOKuStar) processes mixed batches of scanned documents. A batch can consist of forms, single- or multi-page invoices, or even complete process reports containing all sorts of different documents. Document Extraction classifies the individual pages, bundles pages into documents and records, and extracts the essential data from each document.

With the option "Invoice", Document Extraction can process arbitrary commercial invoices. For this purpose it includes a knowledge base for recognising the data that must be captured from each invoice. With this knowledge base, invoice header data is extracted from any invoice material at a high recognition rate, without the system having to be trained with templates of individual suppliers. As a result, recognition is tolerant of changes of supplier and of changes in the invoice layout.

The procedure is divided into three steps: document classification, information extraction and document creation. It is also possible to perform each step independently of the others.

Open Text Capture Document Extraction (DOKuStar) consists of two components. Design Studio is the system administrators' tool. Here, they define a project in which they specify the document classes to be differentiated, the information to be extracted for each document type and how the individual pages are combined to form documents.

Open Text Capture Document Extraction (DOKuStar) Engine processes the individual document batches. First, a project file is loaded; next, the batch is processed image by image. The classification results and the extracted data are stored in an XML file. The engine offers a programming interface for integrating it into other systems or specialised applications. It will usually run on a server without requiring any user monitoring.

Document Extraction (DOKuStar) offers field types to facilitate data extraction. These are essentially the same as the characteristics used to classify documents. There are specialised field types for frequent data types such as amounts, addresses and dates. Application-specific data types can be modelled using the Regular Expression field type.
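The idea behind a regular-expression field can be illustrated outside the product. The following is a minimal sketch in plain Python, not the Design Studio configuration itself; the order-number format "PO-YYYY-NNNNNN" and the function name are assumptions made up for this example.

```python
import re
from typing import List

# Hypothetical application-specific data type: order numbers such as
# "PO-2024-000123". The format is an assumption for illustration only.
ORDER_NUMBER = re.compile(r"\bPO-\d{4}-\d{6}\b")


def extract_order_numbers(text: str) -> List[str]:
    """Return every substring of the recognised text that matches the pattern."""
    return [match.group(0) for match in ORDER_NUMBER.finditer(text)]


if __name__ == "__main__":
    sample = "Ref PO-2024-000123 and PO-2023-000001, not PO-24-1"
    print(extract_order_numbers(sample))
```

The word boundaries (`\b`) keep the pattern from matching inside longer tokens, which matters when the input is OCR text with little structure.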

To tell apart information of the same kind that recurs within a document, keywords and phrases can be defined, which are then assigned to the corresponding data type via the Key Value field type.
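The key-value principle can be sketched as follows: two values of the same data type (here, dates) are told apart by the keyword that precedes them. This is an illustrative sketch in plain Python under stated assumptions; the field names, keyword lists and date format are hypothetical and do not reflect how the product stores its rules.

```python
import re
from typing import Dict, List

# Simple date pattern (day, month, four-digit year), assumed for this example.
DATE = r"(\d{1,2}[./]\d{1,2}[./]\d{4})"

# Hypothetical mapping from target field to the keywords that announce it.
KEYWORDS: Dict[str, List[str]] = {
    "invoice_date": ["invoice date", "date of invoice"],
    "delivery_date": ["delivery date", "date of delivery"],
}


def extract_key_values(text: str) -> Dict[str, str]:
    """Assign each date to a field based on the keyword directly before it."""
    result: Dict[str, str] = {}
    for field, phrases in KEYWORDS.items():
        for phrase in phrases:
            # Keyword, optional colon, optional whitespace, then the value.
            pattern = re.compile(re.escape(phrase) + r"\s*:?\s*" + DATE, re.IGNORECASE)
            match = pattern.search(text)
            if match:
                result[field] = match.group(1)
                break  # first matching keyword wins for this field
    return result


if __name__ == "__main__":
    page = "Invoice date: 12.03.2010\nDelivery date 05/03/2010"
    print(extract_key_values(page))
```

Without the keywords, both dates would match the same date pattern and could not be assigned to the correct field.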

Design Studio

Design Studio is the administrator's workplace. Here, administrators define document classes, document types and the data to be extracted, such as index or data fields. Rules are generated that tell the system how to find the relevant information within the documents.

Version 3.8 of Design Studio comes with many new and improved features. To accommodate them, Design Studio has been given a redesigned interface. The set-up, input and result views are now independent of each other. Selecting a group of fields allows shared parameters to be modified simultaneously. The current status of Design Studio can be saved, so users can carry on working from where they left off after restarting their computers. Statistics and monitors help to optimise recognition tasks.

Every rule that is generated can be tested immediately. This direct feedback on the success or failure of a rule helps administrators reach high recognition rates quickly.

Option "Classification"

The classification module determines the class of each page to be processed. The classes are always defined by the application. For example, a commercial enterprise might require the classes "invoice", "delivery form", "order", "credit note" and "other". In practice, however, the number of document classes tends to be much higher. In order to classify the current document, Document Extraction searches the document for tell-tale characteristics specified by the administrator. These tend to be certain keywords or phrases. The field type concept supports a large number of different characteristics. The characteristics can be linked using logical operators. This results in reliable and comprehensible recognition of even subtle differences.
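How keyword characteristics linked by logical operators can decide a class may be sketched in a few lines. The sketch below is plain Python written for illustration; the rule helpers (`has`, `all_of`, `any_of`, `not_`), the keywords and the first-match-wins ordering are assumptions, not the product's rule format or API.

```python
from typing import Callable, List, Tuple

# A rule is a predicate over the recognised page text (hypothetical model).
Rule = Callable[[str], bool]


def has(keyword: str) -> Rule:
    """Characteristic: the keyword occurs in the text, ignoring case."""
    return lambda text: keyword.lower() in text.lower()


def all_of(*rules: Rule) -> Rule:
    """Logical AND over characteristics."""
    return lambda text: all(rule(text) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    """Logical OR over characteristics."""
    return lambda text: any(rule(text) for rule in rules)


def not_(rule: Rule) -> Rule:
    """Logical NOT of a characteristic."""
    return lambda text: not rule(text)


# Classes are checked in order; the first matching rule wins (an assumption).
CLASS_RULES: List[Tuple[str, Rule]] = [
    ("credit note", has("credit note")),
    ("invoice", all_of(has("invoice"),
                       any_of(has("total"), has("amount due")),
                       not_(has("credit note")))),
    ("delivery form", any_of(has("delivery note"), has("packing list"))),
    ("order", has("purchase order")),
]


def classify(text: str) -> str:
    """Return the first class whose rule matches, or "other" if none does."""
    for name, rule in CLASS_RULES:
        if rule(text):
            return name
    return "other"


if __name__ == "__main__":
    print(classify("INVOICE No. 17\nTotal: 100.00 EUR"))
    print(classify("Credit Note for invoice 17, total 5"))
```

Because every class is an explicit combination of named characteristics, it stays easy to see why a page received a given class, which is the comprehensibility the text refers to.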

In addition to document classes, the concept of document types is used. Document types allow finer differentiation within similar classes, such as grouping invoices according to suppliers. This finer differentiation mechanism helps to optimise the system's recognition rate.

Option "Information Extraction from Domain Specific Application"

Classification itself is not enough to achieve automated document processing. There is also the task of extracting data from the documents. If this information is only used to find a document in an archive system, the process is called indexing. If the data is transferred into a specialised application for further processing, we refer to this as 'data capture'. From a technical point of view, both processes are identical and are treated as such.

Compared with classification, data extraction is a much more demanding document interpretation task. But both steps are interdependent: the more detailed the roster of document classes and types, the more the data extraction process will resemble the familiar form-capture process.

Related Documents

Open Text Capture Document Extraction Brochure (English - PDF)