Automatic Location and Separation of Records: A Case Study in the Genealogical Domain
Locating specific chunks (records) of information within documents on the web is an interesting and nontrivial problem. If records can be located and separated well, the longstanding problem of grouping extracted values into appropriate relationships in a record structure becomes easier to resolve. Our solution is a hybrid of two well-established techniques: (1) ontology-based extraction [ECJ+99] and (2) vector space modeling [SM83]. To show that the technique has merit, we apply it to the particularly challenging task of locating and separating records in genealogical web documents, which tend to vary considerably in layout and format. In our experiments, the technique achieves an average of 92% recall and 93% precision for locating and separating genealogical records in web documents.
Keywords: Vector Space Modeling, Record Location, Participation Constraint, Magnitude Measure, Cosine Measure
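The abstract and keywords name a cosine measure and a magnitude measure but do not define how they are applied. As a rough illustration only, the sketch below computes both measures for sparse vectors of per-field value counts. The field names, the sample counts, and the idea of comparing a candidate record against an "expected" ontology vector are all assumptions made for this example, not details confirmed by the text above.

```python
import math

def magnitude(vector):
    """Euclidean length of a sparse vector stored as {field: count}."""
    return math.sqrt(sum(v * v for v in vector.values()))

def cosine_measure(candidate, ontology):
    """Cosine of the angle between two sparse vectors keyed by field name."""
    dot = sum(candidate.get(field, 0.0) * w for field, w in ontology.items())
    denom = magnitude(candidate) * magnitude(ontology)
    return dot / denom if denom else 0.0

# Hypothetical expected number of values per record for three fields.
ontology_vector = {"name": 1.0, "birth_date": 1.0, "death_date": 1.0}

# Hypothetical counts of values found in two candidate chunks of a page.
single_record = {"name": 1.0, "birth_date": 1.0, "death_date": 1.0}
merged_records = {"name": 3.0, "birth_date": 3.0, "death_date": 3.0}

for label, chunk in [("single", single_record), ("merged", merged_records)]:
    cos = cosine_measure(chunk, ontology_vector)
    ratio = magnitude(chunk) / magnitude(ontology_vector)
    print(f"{label}: cosine={cos:.3f}, magnitude ratio={ratio:.3f}")
```

In this toy setup both chunks point in the same direction as the ontology vector, so the cosine alone cannot tell them apart, while the magnitude ratio suggests that the second chunk holds roughly three records' worth of values. That is one plausible reason to pair the two measures, offered here as an assumption.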
- [BLP01] Buttler, D., Liu, L., Pu, C.: A fully automated object extraction system for the world wide web. In: Proceedings of the 21st International Conference on Distributed Computing Systems (ICDCS 2001), Mesa, Arizona (April 2001)
- [EJN99] Embley, D.W., Jiang, Y.S., Ng, Y.-K.: Record-boundary discovery in web documents. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD 1999), Philadelphia, Pennsylvania, 31 May - 3 June, pp. 467–478 (1999)
- [EKW92] Embley, D.W., Kurtz, B.D., Woodfield, S.N.: Object-oriented Systems Analysis: A Model-Driven Approach. Prentice Hall, Englewood Cliffs (1992)
- [Emb80] Embley, D.W.: Programming with data frames for everyday data items. In: Proceedings of the 1980 National Computer Conference, Anaheim, California, May 1980, pp. 301–305 (1980)
- [EX00] Embley, D.W., Xu, L.: Record location and reconfiguration in unstructured multiple-record web documents. In: Proceedings of the Third International Workshop on the Web and Databases (WebDB 2000), Dallas, Texas, May 2000, pp. 123–128 (2000)
- [KT02] Kuhlins, S., Tredwell, R.: Toolkits for generating wrappers – a survey of software toolkits for automated data extraction from websites. In: Aksit, M., Mezini, M., Unland, R. (eds.) Objects, Components, Architectures, Services, and Applications for a Networked World – Proceedings of the 2002 International NetObjectDays Conference, Erfurt, Germany, October 2002, pp. 184–198 (2002)