Parsing Common Document Types

Abstract

Rich-text file formats are a mixed blessing for Web 3.0 applications that require general processing of text and at least some degree of semantic understanding. On the positive side, rich text lets you use styling information such as headings, tables, and metadata to identify important or specific parts of documents. On the negative side, dealing with rich text is more complex than working with plain text. You’ll get more in-depth coverage of style markup in Chapter 10, but I’ll cover some basics here.

Keywords

Extractor Ruby 

Copyright information

© Mark Watson 2009
