Automatic Location and Separation of Records: A Case Study in the Genealogical Domain

  • Troy Walker
  • David W. Embley
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3289)

Abstract

Locating specific chunks (records) of information within documents on the web is an interesting and nontrivial problem. If the problem of locating and separating records can be solved well, the longstanding problem of grouping extracted values into appropriate relationships in a record structure can be more easily resolved. Our solution is a hybrid of two well-established techniques: (1) ontology-based extraction [ECJ+99] and (2) vector space modeling [SM83]. To show that the technique has merit, we apply it to the particularly challenging task of locating and separating records in genealogical web documents, which tend to vary considerably in layout and format. Our experiments show that this technique yields an average of 92% recall and 93% precision for locating and separating genealogical records in web documents.
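As a rough illustration of the vector space side of this approach, the sketch below compares a candidate chunk of a document against an expected record profile using a cosine measure and a magnitude measure, the two measures named in the keywords. It is a minimal sketch under stated assumptions, not the authors' implementation: the field names, the expected-count vector, and the helper functions are all hypothetical, and the abstract does not say how the paper combines the two measures.

```python
import math


def magnitude(vector):
    """Euclidean length of a vector of field counts."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_measure(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    norm_a, norm_b = magnitude(a), magnitude(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# Hypothetical record profile: how many values of each field one
# genealogical record is expected to contain (assumed for illustration).
# Field order: name, birth date, death date, spouse.
expected = [1.0, 1.0, 1.0, 1.0]

# Hypothetical counts of recognized values in two candidate chunks.
chunk_one_record = [1, 1, 1, 0]   # one record with a missing spouse
chunk_two_records = [2, 2, 2, 2]  # looks like two records merged together

# The cosine measure says how closely the mix of fields matches the profile.
shape_one = cosine_measure(expected, chunk_one_record)
shape_two = cosine_measure(expected, chunk_two_records)

# The magnitude ratio hints at how many records the chunk holds.
size_two = magnitude(chunk_two_records) / magnitude(expected)

print(f"one-record chunk cosine:  {shape_one:.3f}")
print(f"two-record chunk cosine:  {shape_two:.3f}")
print(f"two-record chunk size ratio: {size_two:.1f}")
```

In this toy setup the merged chunk matches the profile's proportions perfectly yet has twice its magnitude, which is the kind of signal that could suggest splitting the chunk further.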

Keywords

Vector Space Modeling; Record Location; Participation Constraint; Magnitude Measure; Cosine Measure
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. [BLP01]
    Buttler, D., Liu, L., Pu, C.: A fully automated object extraction system for the world wide web. In: Proceedings of the 21st International Conference on Distributed Computing Systems (ICDCS 2001), Mesa, Arizona (April 2001)
  2. [ECJ+99]
    Embley, D.W., Campbell, D.M., Jiang, Y.S., Liddle, S.W., Lonsdale, D.W., Ng, Y.-K., Smith, R.D.: Conceptual-model-based data extraction from multiple-record web pages. Data & Knowledge Engineering 31(3), 227–251 (1999)
  3. [EJN99]
    Embley, D.W., Jiang, Y.S., Ng, Y.-K.: Record-boundary discovery in web documents. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD 1999), Philadelphia, Pennsylvania, 31 May - 3 June, pp. 467–478 (1999)
  4. [EKW92]
    Embley, D.W., Kurtz, B.D., Woodfield, S.N.: Object-oriented Systems Analysis: A Model-Driven Approach. Prentice Hall, Englewood Cliffs (1992)
  5. [Emb80]
    Embley, D.W.: Programming with data frames for everyday data items. In: Proceedings of the 1980 National Computer Conference, Anaheim, California, May 1980, pp. 301–305 (1980)
  6. [EX00]
    Embley, D.W., Xu, L.: Record location and reconfiguration in unstructured multiple-record web documents. In: Proceedings of the Third International Workshop on the Web and Databases (WebDB 2000), Dallas, Texas, May 2000, pp. 123–128 (2000)
  7. [KT02]
    Kuhlins, S., Tredwell, R.: Toolkits for generating wrappers – a survey of software toolkits for automated data extraction from websites. In: Aksit, M., Mezini, M., Unland, R. (eds.) Objects, Components, Architectures, Services, and Applications for a Networked World – Proceedings of the 2002 International NetObjectDays Conference, Erfurt, Germany, October 2002, pp. 184–198 (2002)
  8. [LRNdST02]
    Laender, A.H.F., Ribeiro-Neto, B.A., da Silva, A.S., Teixeira, J.S.: A brief survey of web data extraction tools. SIGMOD Record 31(2), 84–93 (2002)
  9. [SM83]
    Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Troy Walker
    • 1
  • David W. Embley
    • 1
  1. Department of Computer Science, Brigham Young University, Provo, USA