The VLDB Journal, Volume 22, Issue 5, pp 711–726

Hybrid entity clustering using crowds and data

  • Jongwuk Lee
  • Hyunsouk Cho
  • Jin-Woo Park
  • Young-rok Cha
  • Seung-won Hwang (corresponding author)
  • Zaiqing Nie
  • Ji-Rong Wen
Special Issue Paper


Query result clustering has attracted considerable attention as a means of providing users with a concise overview of results. However, little research effort has been devoted to organizing query results for entities, which refer to real-world concepts such as people, products, and locations. Entity-level result clustering is more challenging because diverse notions of similarity between entities must be supported across heterogeneous domains, e.g., image resolution is an important feature for cameras, but not for fruits. To address this challenge, we propose a hybrid relationship clustering algorithm, called Hydra, that uses co-occurrence and numeric features. Hydra captures diverse user perceptions from co-occurrence and disambiguates different senses using feature-based similarity. In addition, we extend Hydra into \(\mathsf{Hydra}_{\mathsf{gData}}\) with different sources, i.e., entity types and crowdsourcing. Experimental results show that the proposed algorithms are both effective and efficient on real-life and synthetic datasets.
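The hybrid idea in the abstract can be illustrated with a toy sketch. This is not the Hydra algorithm itself, whose details are not given here; it only shows one way co-occurrence evidence might be filtered by numeric feature similarity. The entity names, the normalized feature vectors, the `threshold` value, and the helper functions `feature_similarity` and `hybrid_clusters` are all assumptions made for illustration.

```python
def feature_similarity(x, y):
    """Similarity in [0, 1] for feature vectors already scaled to [0, 1]."""
    return 1.0 - sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def hybrid_clusters(features, cooccur, threshold=0.7):
    """Group entities that co-occur AND have similar numeric features.

    features: dict mapping entity name -> tuple of normalized features
    cooccur:  iterable of (entity, entity) pairs observed together
    """
    parent = {e: e for e in features}

    def find(e):
        # Union-find lookup with path halving.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for a, b in cooccur:
        # Co-occurrence proposes a link; feature similarity confirms it.
        if feature_similarity(features[a], features[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for e in features:
        groups.setdefault(find(e), set()).add(e)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data: two cameras and two phones, each with two features.
features = {
    "camera_a": (0.90, 0.80),
    "camera_b": (0.85, 0.75),
    "phone_a": (0.20, 0.30),
    "phone_b": (0.25, 0.35),
}
# camera_b and phone_a co-occur, but their features disagree.
cooccur = [
    ("camera_a", "camera_b"),
    ("phone_a", "phone_b"),
    ("camera_b", "phone_a"),
]

clusters = hybrid_clusters(features, cooccur)
print(clusters)
```

With the default threshold, the cross-type link between `camera_b` and `phone_a` is rejected because the feature vectors differ too much, so the cameras and phones end up in separate groups. Lowering the threshold lets co-occurrence alone dominate and merges everything into one group.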


Keywords: Entity-level search · Subspace clustering · Hybrid entity clustering · Crowdsourcing



This research was supported by the Ministry of Knowledge Economy (MKE), Korea, and Microsoft Research, under the IT/SW Creative Research Program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2012-H0503-12-1036).



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Jongwuk Lee (1)
  • Hyunsouk Cho (1)
  • Jin-Woo Park (1)
  • Young-rok Cha (1)
  • Seung-won Hwang (1, corresponding author)
  • Zaiqing Nie (2)
  • Ji-Rong Wen (3)

  1. Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea
  2. Microsoft Research Asia, Beijing, People’s Republic of China
  3. Renmin University of China, Beijing, People’s Republic of China
