ReadFast: Structural Information Retrieval from Biomedical Big Text by Natural Language Processing

  • Michael Gubanov
  • Linda Shapiro
  • Anna Pyayt


While the major search engines are solving the problem of finding needed information on the Web, access to the information in Big text, meaning large-scale text datasets and documents (biomedical literature, e-books, conference proceedings, etc.), is still very rudimentary (Lin and Cohen (2010) A very fast method for clustering big text datasets. In: ECAI, Lisbon). Keyword search is thus often the only way to find the needle in the haystack. The Semantic Web research community has produced an abundance of relevant results that offer more robust access interfaces than keyword search. Here we describe a new information retrieval engine that offers an advanced user experience by combining keyword search with navigation over an automatically inferred hierarchical document index. Representing the browsing index internally as a collection of UFOs (Gubanov et al. (2009) Ibm ufo repository. In: VLDB, Lyon; Gubanov et al. (2011) Learning unified famous objects (ufo) to bootstrap information integration. In: IEEE IRI, Las Vegas) yields more relevant search results and improves the user experience.


Normalize Discount Cumulative Gain; Term Extractor; Needed Information; Discount Cumulative Gain; Natural Language Expression
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


References

  1. Adelberg B (1998) NoDoSE – a tool for semi-automatically extracting structured and semistructured data from text documents. In: SIGMOD record, Seattle
  2. Agichtein E, Gravano L (2000) Snowball: extracting relations from large plain-text collections. In: ACM DL, San Antonio
  3. Agichtein E, Ipeirotis P, Gravano L (2003) Modeling query-based access to text databases. In: WebDB, San Diego
  4. Agichtein E, Brill E, Dumais S (2006) Improving web search ranking by incorporating user behavior information. In: SIGIR, Seattle
  5. Agrawal S, Chaudhuri S, Das G (2002) Dbxplorer: a system for keyword-based search over relational databases. In: ICDE, San Jose
  6. Anyanwu K, Maduko A, Sheth A (2007) Sparq2l: towards support for subgraph extraction queries in rdf databases. In: WWW, Banff
  7. Arocena GO, Mendelzon AO (1998) Weboql: restructuring documents, databases, and webs. In: ICDE, Orlando
  8. Banko M, Brill E, Dumais S, Lin J (2002) Askmsr: question answering using the worldwide web. In: EMNLP, Philadelphia
  9. Brin S (1998) Extracting patterns and relations from the world wide web. In: EDBT, Valencia
  10. Cai Y, Dong XL, Halevy A, Liu JM, Madhavan J (2005) Personal information management with semex. In: SIGMOD, Baltimore
  11. Califf ME, Mooney RJ (1998) Relational learning of pattern-match rules for information extraction. In: AAAI, Madison
  12. Chakrabarti S (2007) Dynamic personalized pagerank in entity-relation graphs. In: WWW, Banff
  13. Cheng T, Yan X, Chang KCC (2007) Entityrank: searching entities directly and holistically. In: VLDB, Vienna
  14. Crescenzi V, Mecca G (1998) Grammars have exceptions. J Inf Syst (Special issue on Semistructured Data) 23(9):539–565
  15. Crescenzi V, Mecca G, Merialdo P (2001) Roadrunner: towards automatic data extraction from large web sites. In: VLDB, Roma
  16. Crestani F (1997) Application of spreading activation techniques in information retrieval. Artif Intell Rev 11:453
  17. Diederich J, Balke WT, Thaden U (2007) Demonstrating the semantic growbag: automatically creating topic facets for faceteddblp. In: JCDL, Vancouver
  18. Dong X, Halevy A (2007) Indexing dataspaces. In: SIGMOD, Beijing
  19. Downey D, Etzioni O, Soderland S, Weld D (2004) Learning text patterns for web information extraction and assessment. In: AAAI, San Jose
  20. Embley DW, Campbell DM, Jiang YS, Liddle SW, Ng YK, Quass D, Smith RD (1999) Conceptual-model-based data extraction from multiple-record web pages. Data Knowl Eng 31:227–251
  21. Etzioni O, Cafarella M, Downey D, Kok S, Popescu A, Shaked T, Soderland S, Weld D, Yates A (2004) Web-scale information extraction in knowitall. In: WWW, Manhattan
  22. Freitag D (1998) Machine learning for information extraction in informal domains. Ph.D. thesis, Carnegie Mellon University
  23. Gubanov M, Shapiro L (2011) Using unified famous objects (ufo) to automate Alzheimer’s disease diagnosis. In: IEEE BIBM, Atlanta
  24. Gubanov MN, Popa L, Ho H, Pirahesh H, Chang P, Chen L (2009) Ibm ufo repository. In: VLDB, Lyon
  25. Gubanov M, Shapiro L, Pyayt A (2011) Learning unified famous objects (ufo) to bootstrap information integration. In: IEEE IRI, Las Vegas
  26. Hammer J, McHugh J, Garcia-Molina H (1997) Semistructured data: the TSIMMIS experience. In: Proceedings of the East-European workshop on advances in databases and information systems, St. Petersburg
  27. He H, Wang H, Yang J, Yu PS (2007) Blinks: ranked keyword searches on graphs. In: SIGMOD, Beijing
  28. Hearst MA (1992) Automatic acquisition of hyponyms from large text corpora. Technical report S2K-92-09
  29. Hristidis V, Papakonstantinou Y (2002) Discover: keyword search in relational databases. In: VLDB, Hong Kong
  30. Hsu CN, Dung MT (1998) Generating finite-state transducers for semi-structured data extraction from the web. J Inf Syst (Special issue on Semistructured Data) 23(9):521–538
  31.
  32. Järvelin K, Kekäläinen J (2000) IR evaluation methods for retrieving highly relevant documents. In: SIGIR, Athens
  33. Jones KS (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 60:493–502
  34. Klein D, Manning C (2007) Fast exact inference with a factored model for natural language parsing. In: NIPS, Vancouver
  35. Kushmerick N (2000) Wrapper induction: efficiency and expressiveness. Artif Intell 118:15–68
  36. Laender A, Ribeiro-Neto B, Silva A, Teixeira J (2002) A brief survey of web data extraction tools. In: SIGMOD record, Madison
  37. Laender AHF, Ribeiro-Neto B, da Silva AS (2002) Debye – data extraction by example. Data Knowl Eng 40(2):121–154
  38. Lin F, Cohen WW (2010) A very fast method for clustering big text datasets. In: ECAI, Lisbon
  39. Liu L, Pu C, Han W (2000) XWRAP: an XML-enabled wrapper construction system for web information sources. In: ICDE, San Diego
  40. Madhavan J, Cohen S, Dong X, Halevy A, Jeffery S, Ko D, Yu C (2007) Navigating the seas of structured web data. In: CIDR, Asilomar
  41. Nie Z, Ma Y, Shi S, Wen JR, Ma WY (2007) Web object retrieval. In: WWW, Banff
  42. Ribeiro-Neto BA, Laender AHF, da Silva AS (1999) Extracting semi-structured data through examples. In: CIKM, Kansas City
  43. Sahuguet A, Azavant F (2001) Building intelligent web applications using lightweight wrappers. Data Knowl Eng 36:283–316
  44. Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18:613–620
  45. Sayyadian M, LeKhac H, Doan A, Gravano L (2007) Efficient keyword search across heterogeneous relational databases. In: ICDE, Istanbul
  46. Sekine S (2006) On-demand information extraction. In: COLING/ACL, Sydney
  47. Soderland S (1999) Learning information extraction rules for semi-structured and free text. Mach Learn 34:233
  48. Udrea O, Getoor L, Miller RJ (2007) Leveraging data and structure in ontology integration. In: SIGMOD, Beijing
  49. Vanderwende L, Kacmarcik G, Suzuki H, Menezes A (2005) Mindnet: an automatically-created lexical resource. In: HLT/EMNLP, Vancouver

Copyright information

© Springer-Verlag Wien 2013

Authors and Affiliations

  1. University of Washington, Seattle, USA
  2. Stanford University, Stanford, USA