Unstructured Forms

What is an Unstructured Form?

Anyone who has ever filed an income tax return is intimately familiar with a form. The lines and columns on the paper constrain each item of content to a specific location and separate it from the rest. What if the form is not designed that way? What if a form does not constrain data? What if the information is not always in the same place? This would be an example of an unstructured form.

Historically, automated document processing systems have focused on two extremes of document type: completely structured and completely unstructured. Forms such as tax forms and census forms are the primary examples of completely structured documents. Automated data extraction from completely structured documents relies on each data element being in a known horizontal and vertical (X-Y) position. Data elements are located based on these known positions and are read using optical character recognition (OCR/ICR) technology. The data elements are then entered into a structured database for further processing.

At the other extreme of document processing are completely unstructured documents. Examples include letters, reports, newspapers, and magazines. In general, such documents do not present information in repeatable X-Y positions. In automated processing, unstructured documents are typically treated as blocks of text. OCR technology is used to convert the text of unstructured documents for entry into full-text databases.
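The fixed-position approach used for completely structured forms can be illustrated with a minimal sketch. It assumes the OCR step has already produced a list of recognized words with page coordinates. The Word and Zone types, the extract_fixed_fields function, and the sample coordinates are all hypothetical names and values chosen for illustration, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR-recognized word and the top-left corner of its box (hypothetical)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Zone:
    """A named rectangle on the form where one data element is expected (hypothetical)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float


def extract_fixed_fields(words, zones):
    """Collect the words that fall inside each known zone, in reading order."""
    record = {}
    for zone in zones:
        inside = [
            w for w in words
            if zone.left <= w.x < zone.right and zone.top <= w.y < zone.bottom
        ]
        # Sort top-to-bottom, then left-to-right, before joining.
        inside.sort(key=lambda w: (w.y, w.x))
        record[zone.name] = " ".join(w.text for w in inside)
    return record


# Illustrative data only: a form where "name" and "wages" always sit in the same place.
SAMPLE_WORDS = [
    Word("Form", 20, 10),
    Word("Jane", 120, 50),
    Word("Doe", 170, 50),
    Word("48,250.00", 400, 210),
]
SAMPLE_ZONES = [
    Zone("name", 100, 40, 300, 70),
    Zone("wages", 380, 200, 520, 230),
]

if __name__ == "__main__":
    print(extract_fixed_fields(SAMPLE_WORDS, SAMPLE_ZONES))
```

The resulting dictionary maps directly onto a row in a structured database. The approach works only as long as every data element stays inside its zone.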

However, a huge number of documents have a structure that places them between these two categories. These are the loosely structured documents common in many applications and industries. Typical examples include invoices, purchase orders, and certain real estate documents. These documents contain known information in variable or even completely unknown locations. Some or all of the fields, called data elements, can vary their location on a page or even across pages. The data elements can shift into each other's "territories" or even move across page boundaries. Spectrum InForm has developed processes and applications to handle unique documents, especially in the communications and utilities markets.
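One common way to cope with data elements that move around is to search for a nearby label instead of a fixed position. The sketch below illustrates that general idea only; it is not a description of Spectrum InForm's processes. It assumes OCR output as words with coordinates, and the Word type, the find_value_near_label function, the line_tolerance parameter, and both sample invoices are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR-recognized word and the top-left corner of its box (hypothetical)."""
    text: str
    x: float
    y: float


def find_value_near_label(words, label, line_tolerance=5.0):
    """Return the first word to the right of `label` on the same text line.

    Assumes a single-word label; returns None if the label or a value is missing.
    """
    wanted = label.lower()
    for anchor in words:
        # Ignore case and a trailing colon when matching the label.
        if anchor.text.lower().rstrip(":") != wanted:
            continue
        same_line = [
            w for w in words
            if abs(w.y - anchor.y) <= line_tolerance and w.x > anchor.x
        ]
        if same_line:
            # The closest word to the right is taken as the value.
            return min(same_line, key=lambda w: w.x).text
    return None


# Illustrative data only: the same data element at two different locations.
INVOICE_A = [
    Word("Invoice", 20, 20),
    Word("Total:", 300, 700),
    Word("1,204.50", 380, 700),
]
INVOICE_B = [
    Word("Invoice", 20, 20),
    Word("Total", 50, 120),
    Word("1,204.50", 140, 122),
    Word("Thanks", 50, 300),
]

if __name__ == "__main__":
    print(find_value_near_label(INVOICE_A, "total"))
    print(find_value_near_label(INVOICE_B, "total"))
```

Both invoices yield the same value even though the total sits in a different place on each page. A production system would need more than this, for example multi-word labels, values placed below the label, and fields that continue on the next page.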