Searching in the “Real World”

Abstract

For many, "searching" is considered a mostly solved problem. For text processing, this belief has a factual basis. Most "real world" search applications, however, involve "complex documents", and for these the problem is far from solved. Complex documents, or less formally "real world documents", comprise a mixture of images, text, signatures, tables, logos, watermarks, stamps, etc., and are often available only as scanned hard copies. Search systems for such document collections are currently unavailable.

We describe our efforts to build a complex document information processing (CDIP) prototype. The prototype integrates mature "point solution" technologies, including OCR, signature matching, handwritten word spotting, and search and mining approaches, to yield a system capable of searching "real world documents". It demonstrates the adage that "the whole is greater than the sum of its parts".
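
As a rough illustration of what integrating point solutions might look like, the sketch below merges the outputs of several per-modality extractors into one inverted index, so that a single query can combine text and signature criteria. It is not the CDIP prototype's design. Every name here (`ComplexDocument`, `ocr_terms`, `signature_terms`, `logo_terms`, `build_index`, `search`) is hypothetical, and the extractors read pre-filled fields rather than running real OCR or signature matching on scanned images.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ComplexDocument:
    """Hypothetical stand-in for a scanned page.

    A real system would hold image data; here each field simply holds
    what the corresponding point solution is assumed to have recognized.
    """
    doc_id: str
    ocr_text: str = ""
    signers: list = field(default_factory=list)
    logos: list = field(default_factory=list)


# Hypothetical extractors, one per point solution. Each returns
# (modality, value) keys so results from all modalities share one index.
def ocr_terms(doc):
    return [("text", word.lower()) for word in doc.ocr_text.split()]


def signature_terms(doc):
    return [("signature", name.lower()) for name in doc.signers]


def logo_terms(doc):
    return [("logo", name.lower()) for name in doc.logos]


EXTRACTORS = [ocr_terms, signature_terms, logo_terms]


def build_index(docs):
    """Map every (modality, value) key to the ids of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for extract in EXTRACTORS:
            for key in extract(doc):
                index[key].add(doc.doc_id)
    return index


def search(index, **criteria):
    """Return ids of documents that satisfy every modality=value criterion."""
    result = None
    for modality, value in criteria.items():
        matches = index.get((modality, value.lower()), set())
        result = set(matches) if result is None else result & matches
    return result if result is not None else set()


# Illustrative data only; not drawn from any real collection.
DOCS = [
    ComplexDocument("d1", "quarterly invoice summary", ["J. Smith"], ["acme"]),
    ComplexDocument("d2", "overdue invoice notice", ["A. Jones"], ["acme"]),
]
INDEX = build_index(DOCS)
```

With this data, `search(INDEX, text="invoice")` matches both documents, while `search(INDEX, text="invoice", signature="J. Smith")` narrows the result to `d1`. Neither extractor could answer the second query alone, which is one small sense in which the combined system exceeds the sum of its parts.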