Optimizing Monitoring Queries over Distributed Data

  • Frank Neven
  • Dieter Van de Craen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3896)

Abstract

Scientific data in the life sciences is distributed over various independent, multi-format databases and is constantly expanding. We discuss a scenario in which a life science research lab monitors, over time, the results of queries to remote databases beyond its control. Queries are registered at a local system and are executed daily in batch mode. The goal of the paper is to study evaluation strategies that minimize the total number of database accesses when all queries are evaluated in bulk. We use an abstraction based on the relational model with fan-out constraints and conjunctive queries. We show that the problem remains NP-hard in two restricted settings: queries of bounded depth, and a fixed schema. We further show that both restrictions taken together result in a tractable problem. As the constant for the latter algorithm is too high to be feasible in practice, we present four heuristic methods, which are compared experimentally on randomly generated and biologically motivated schemas. Our algorithms are based on a greedy method and on approximations for the shortest common supersequence problem.
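To give a feel for how a shortest common supersequence approximation can reduce database accesses, the following is a minimal sketch of the well-known Majority Merge heuristic. It is not the authors' algorithm. It assumes, purely for illustration, that each registered query has already been reduced to an ordered list of database accesses; the query contents and the helper names `majority_merge` and `is_supersequence` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_merge(sequences: Sequence[Sequence[str]]) -> List[str]:
    """Greedy Majority Merge heuristic: builds a common supersequence
    (not necessarily a shortest one) of the given sequences."""
    # Index of the next unmatched symbol in each input sequence.
    positions = [0] * len(sequences)
    result: List[str] = []
    while True:
        # Count the symbols at the front of every unfinished sequence.
        fronts = Counter(
            seq[pos] for seq, pos in zip(sequences, positions) if pos < len(seq)
        )
        if not fronts:
            return result
        # Choose the most frequent front symbol; ties are broken
        # alphabetically so that the output is deterministic.
        symbol = min(fronts, key=lambda s: (-fronts[s], s))
        result.append(symbol)
        # Advance every sequence whose front matches the chosen symbol.
        positions = [
            pos + 1 if pos < len(seq) and seq[pos] == symbol else pos
            for seq, pos in zip(sequences, positions)
        ]


def is_supersequence(candidate: Sequence[str], seq: Sequence[str]) -> bool:
    """Check that seq occurs in candidate as a (not necessarily contiguous) subsequence."""
    it = iter(candidate)
    return all(symbol in it for symbol in seq)


# Hypothetical access sequences, one per registered query.
queries = [
    ["GenBank", "SwissProt", "KEGG"],
    ["SwissProt", "KEGG"],
    ["GenBank", "KEGG", "GO"],
]
plan = majority_merge(queries)
print(plan)  # ['GenBank', 'SwissProt', 'KEGG', 'GO']
```

Under these assumptions, the merged plan touches four databases in total, whereas running the three queries one after another would require eight accesses. Majority Merge gives no optimality guarantee; it only illustrates the kind of sharing that bulk evaluation tries to exploit.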



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Frank Neven (1)
  • Dieter Van de Craen (1)
  1. Hasselt University and Transnational University of Limburg, Diepenbeek, Belgium
