High-Level Rules for Integration and Analysis of Data: New Challenges

  • Bogdan Alexe
  • Douglas Burdick
  • Mauricio A. Hernández
  • Georgia Koutrika
  • Rajasekar Krishnamurthy
  • Lucian Popa
  • Ioana R. Stanoi
  • Ryan Wisnesky
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8000)


Data integration remains a perennially difficult task. The need to access, integrate and make sense of large amounts of data has, in fact, intensified in recent years. There are now many publicly available sources of data that can provide valuable information in various domains. Concrete examples of public data sources include bibliographic repositories (DBLP, Cora, Citeseer), online movie databases (IMDB), knowledge bases (Wikipedia, DBpedia, Freebase), and social media data (Facebook, Twitter, blogs). Additionally, a number of more specialized public data repositories are starting to play an increasingly important role. These repositories include, for example, U.S. federal government data, congressional and census data, as well as financial reports archived by the U.S. Securities and Exchange Commission (SEC).


Keywords: Schema Mapping, Extraction Rule, Entity Resolution, Semistructured Data, Skolem Function
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Bogdan Alexe (1)
  • Douglas Burdick (1)
  • Mauricio A. Hernández (1)
  • Georgia Koutrika (2)
  • Rajasekar Krishnamurthy (1)
  • Lucian Popa (1)
  • Ioana R. Stanoi (1)
  • Ryan Wisnesky (3)

  1. IBM Almaden Research Center, USA
  2. HP Labs, USA
  3. School of Engineering and Applied Sciences, Harvard University, USA
