Journal of Computer Science and Technology, Volume 32, Issue 5, pp 858–876

Crowd-Guided Entity Matching with Consolidated Textual Data

  • Zhi-Xu Li
  • Qiang Yang
  • An Liu (Email author)
  • Guan-Feng Liu
  • Jia Zhu
  • Jia-Jie Xu
  • Kai Zheng
  • Min Zhang
Regular Paper


Entity matching (EM) identifies records that refer to the same entity within or across databases. Existing methods that use structured attribute values (such as numeric, date, or short string values) may fail when the structured information is not enough to reflect the matching relationships between records. More and more databases now have an unstructured textual attribute containing extra consolidated textual information (CText) about the record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between pieces of CText, since each piece contains hundreds or thousands of words. Existing topic models do not work well either, since there are no obvious gaps between topics in CText. In this paper, we propose a novel co-occurrence-based topic model to identify various sub-topics in each piece of CText, and then measure the similarity between pieces of CText along the multiple sub-topic dimensions. To avoid ignoring hidden but important sub-topics, we let the crowd help us decide the weights of the different sub-topics in EM. Our empirical study on two real-world datasets, conducted on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms state-of-the-art EM methods and text understanding models.
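The idea of comparing two records along several sub-topic dimensions and combining the results with weights can be illustrated with a small sketch. This is not the model proposed in the paper: the sub-topic labels, the use of Jaccard similarity within each sub-topic, the function names, and the example weights (standing in for weights derived from crowd answers) are all assumptions made for illustration.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two word sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_subtopic_similarity(
    rec_a: Dict[str, Set[str]],
    rec_b: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-sub-topic Jaccard similarities.

    rec_a and rec_b map a (hypothetical) sub-topic label to the words of the
    record's text assigned to that sub-topic. weights maps each sub-topic to
    its importance, assumed here to have been obtained from the crowd.
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = sum(
        w * jaccard(rec_a.get(topic, set()), rec_b.get(topic, set()))
        for topic, w in weights.items()
    )
    return score / total


# Hypothetical records, already split into sub-topics.
record_a = {"brand": {"acme"}, "color": {"red", "blue"}}
record_b = {"brand": {"acme"}, "color": {"red", "green"}}
crowd_weights = {"brand": 3.0, "color": 1.0}

similarity = weighted_subtopic_similarity(record_a, record_b, crowd_weights)
```

In this sketch a sub-topic with a larger weight contributes more to the final score, so two records that agree on a heavily weighted sub-topic can still be judged similar even when they differ on less important ones.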


Keywords: entity matching, consolidated textual data, crowdsourcing



Supplementary material

ESM 1: 11390_2017_1769_MOESM1_ESM.pdf (PDF, 57 kb)



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Zhi-Xu Li (1, 2)
  • Qiang Yang (1)
  • An Liu (1, Email author)
  • Guan-Feng Liu (1)
  • Jia Zhu (3)
  • Jia-Jie Xu (1)
  • Kai Zheng (1, 4)
  • Min Zhang (1)

  1. School of Computer Science and Technology, Soochow University, Suzhou, China
  2. Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou, China
  3. School of Computer, South China Normal University, Guangzhou, China
  4. Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
