Veracity Analysis and Object Distinction



The World Wide Web has become the most important information source for most of us. Unfortunately, there is no guarantee that information on the web is correct, and different web sites often provide conflicting information on the same subject. In this section we study two problems concerning the correctness of information on the web. The first is veracity, i.e., conformity to truth: how to find true facts among a large amount of conflicting information on many subjects provided by various web sites. We design a general framework for the veracity problem and propose an algorithm called TruthFinder, which exploits the mutual relationship between web sites and the information they provide: a web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites. The second problem is object distinction, i.e., how to distinguish different people or objects that share identical names. This is a nontrivial task, especially when only very limited information is associated with each person or object. We develop a general object distinction methodology called DISTINCT, which combines two complementary measures of relational similarity, set resemblance of neighbor tuples and random walk probability, to analyze subtle linkages effectively. The method uses a set of distinguishable objects in the database as its training set, so no manually labeled data is required, and applies an SVM to weigh the different types of linkages.
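
The mutual dependence between site trustworthiness and fact confidence that TruthFinder relies on can be illustrated with a small iterative computation. The following is only a minimal sketch under stated assumptions, not the chapter's actual algorithm: the function names, the starting trust of 0.9, the dampening factor `gamma`, the logarithmic trust score, and the logistic squashing are all choices made here for illustration, and the sketch ignores any relationships between conflicting facts.

```python
import math


def truth_finder_sketch(claims, initial_trust=0.9, gamma=0.3, iterations=20):
    """Hypothetical, simplified mutual-reinforcement loop.

    claims maps each site name to a non-empty dict {subject: asserted value}.
    Returns (trust per site, confidence per (subject, value) fact).
    """
    # Collect the providers of every distinct (subject, value) fact.
    providers = {}
    for site, assertions in claims.items():
        for subject, value in assertions.items():
            providers.setdefault((subject, value), []).append(site)

    trust = {site: initial_trust for site in claims}
    confidence = {}
    for _ in range(iterations):
        # Assumed transform: map trust in [0, 1) to an additive score.
        # The clamp avoids log(0) if trust ever rounds to 1.0.
        score = {s: -math.log(1.0 - min(t, 1.0 - 1e-9)) for s, t in trust.items()}
        # A fact gains confidence from the summed scores of its providers;
        # the logistic function keeps the result strictly between 0 and 1.
        for fact, sites in providers.items():
            total = sum(score[s] for s in sites)
            confidence[fact] = 1.0 / (1.0 + math.exp(-gamma * total))
        # A site's trust is the mean confidence of the facts it provides.
        for site, assertions in claims.items():
            facts = list(assertions.items())
            trust[site] = sum(confidence[f] for f in facts) / len(facts)
    return trust, confidence


def most_confident(confidence):
    """Pick, for each subject, the value with the highest confidence."""
    best = {}
    for (subject, value), conf in confidence.items():
        if subject not in best or conf > confidence[(subject, best[subject])]:
            best[subject] = value
    return best


# Toy input (made up): four sites asserting the author of two books.
claims = {
    "A": {"book1": "X", "book2": "Y"},
    "B": {"book1": "X", "book2": "Y"},
    "C": {"book1": "X", "book2": "Z"},
    "D": {"book1": "W", "book2": "Z"},
}
trust, confidence = truth_finder_sketch(claims)
print(most_confident(confidence))
```

In this toy input the two values for `book2` are each asserted by two sites, so a simple vote cannot separate them. Because site D disagrees with the majority on `book1`, its trust drops, and the value backed by the more trustworthy sites A and B ends up with the higher confidence.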



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Microsoft Research, Redmond, USA
  2. UIUC, Urbana, USA
  3. Department of Computer Science, University of Illinois at Chicago, Chicago, USA
