Data Mining and Knowledge Discovery, Volume 27, Issue 3, pp 372–395

Growing a list

  • Benjamin Letham
  • Cynthia Rudin
  • Katherine A. Heller

Abstract

It is easy to find expert knowledge on the Internet on almost any topic, but obtaining a complete overview of a given topic is not always easy: information can be scattered across many sources and must be aggregated to be useful. We introduce a method for intelligently growing a list of relevant items, starting from a small seed of examples. Our algorithm takes advantage of the wisdom of the crowd, in the sense that there are many experts who post lists of things on the Internet. We use a collection of simple machine learning components to find these experts and aggregate their lists to produce a single complete and meaningful list. We use experiments with gold standards and open-ended experiments without gold standards to show that our method significantly outperforms the state of the art. Our method uses the ranking algorithm Bayesian Sets even when its underlying independence assumption is violated, and we provide a theoretical generalization bound to motivate its use.
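
As a rough illustration of the ranking component named above, the following Python sketch scores candidate items against a seed set using the Bayesian Sets score for binary features under independent Beta-Bernoulli models, as formulated by Ghahramani and Heller. The function name, the toy items, their features, and the hyperparameter values are all hypothetical. The sketch assumes binary features and is not the authors' implementation, which the abstract does not describe in detail.

```python
import math


def bayesian_sets_scores(items, seed, alpha, beta):
    """Score every item against the seed set (higher means more similar).

    items: dict mapping item name -> list of 0/1 feature values (hypothetical data)
    seed:  list of item names that form the seed set
    alpha, beta: per-feature Beta prior hyperparameters (assumed values)
    """
    n = len(seed)
    n_features = len(alpha)

    # Number of seed items that have each feature switched on.
    counts = [sum(items[name][j] for name in seed) for j in range(n_features)]

    constant = 0.0  # term shared by all items
    weights = []    # per-feature weight applied when the feature is present
    for j in range(n_features):
        alpha_post = alpha[j] + counts[j]
        beta_post = beta[j] + n - counts[j]
        constant += (
            math.log(alpha[j] + beta[j])
            - math.log(alpha[j] + beta[j] + n)
            + math.log(beta_post)
            - math.log(beta[j])
        )
        weights.append(
            math.log(alpha_post)
            - math.log(alpha[j])
            - math.log(beta_post)
            + math.log(beta[j])
        )

    # The log score is linear in the binary feature vector.
    return {
        name: constant + sum(w * x for w, x in zip(weights, features))
        for name, features in items.items()
    }


# Toy example: each feature could mean "appears on web page j" (an assumption).
toy_items = {
    "apple": [1, 1, 0],
    "pear": [1, 1, 0],
    "banana": [1, 0, 0],
    "hammer": [0, 0, 1],
}
scores = bayesian_sets_scores(toy_items, ["apple", "pear"], [1.0] * 3, [1.0] * 3)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy setup, items sharing features with the seed rank above items that share none, so `banana` ranks above `hammer`. Because the log score is linear in the feature vector, scoring a large candidate pool reduces to one weighted sum per item.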

Keywords

Set completion; Ranking; Internet data mining; Collective intelligence

Supplementary material

Supplementary material 1: 10618_2013_329_MOESM1_ESM.pdf (PDF, 467 KB)

Copyright information

© The Author(s) 2013

Authors and Affiliations

  • Benjamin Letham (1)
  • Cynthia Rudin (2)
  • Katherine A. Heller (3)

  1. Operations Research Center, Massachusetts Institute of Technology, Cambridge, USA
  2. MIT Sloan School of Management, Massachusetts Institute of Technology, Cambridge, USA
  3. Center for Cognitive Neuroscience, Statistical Science, Duke University, Durham, USA
