Abstract
Many scientific and engineering fields produce large volumes of spatiotemporal data. The storage, retrieval, and analysis of such data pose great challenges to database system design. Analysis of scientific spatiotemporal data often involves computing functions of all point-to-point interactions. One such analytic, the Spatial Distance Histogram (SDH), is of vital importance to scientific discovery. Recently, algorithms for efficient SDH processing in large-scale scientific databases have been proposed. These algorithms adopt a recursive tree-traversal strategy that processes point-to-point distances in the visited tree nodes in batches, and thus require less time than the brute-force approach, in which all pairwise distances must be computed. Despite the promising experimental results, the complexity of such algorithms has not been thoroughly studied. In this paper, we present an analysis of such algorithms based on a geometric modeling approach. The main technique is to transform the analysis of point counts into a problem of quantifying the area of regions where pairwise distances can be processed in batches by the algorithm. From the analysis, we conclude that the number of pairwise distances left to be processed decreases exponentially as more levels of the tree are visited. This leads to a proof of a time complexity lower than the quadratic time needed by a brute-force algorithm and lays the foundation for a constant-time approximate algorithm. Our model is also general in that it works for a wide range of point spatial distributions, histogram types, and space-partitioning options in building the tree.
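For orientation, the brute-force baseline mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm analyzed in the paper: the function name `brute_force_sdh`, the parameters `bucket_width` and `num_buckets`, and the choice of equal-width buckets are all assumptions made for the example.

```python
import math
from itertools import combinations


def brute_force_sdh(points, bucket_width, num_buckets):
    """Build a distance histogram by visiting every pair of points.

    Hypothetical helper for illustration: equal-width buckets are assumed,
    and any distance beyond the last bucket edge is counted in the last bucket.
    """
    histogram = [0] * num_buckets
    # combinations() yields each unordered pair once: N * (N - 1) / 2 pairs.
    for p, q in combinations(points, 2):
        distance = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        index = min(int(distance // bucket_width), num_buckets - 1)
        histogram[index] += 1
    return histogram


# Four corners of a unit square: four sides of length 1, two diagonals of length sqrt(2).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
hist = brute_force_sdh(points, bucket_width=1.2, num_buckets=2)
print(hist)
```

Because the loop touches every pair, the running time grows quadratically with the number of points, which is the cost that the tree-based algorithms discussed in the abstract aim to reduce by resolving groups of distances in batches.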
Additional information
Work was done when Chen was a visiting professor at the University of South Florida.
About this article
Cite this article
Chen, S., Tu, YC. & Xia, Y. Performance analysis of a dual-tree algorithm for computing spatial distance histograms. The VLDB Journal 20, 471–494 (2011). https://doi.org/10.1007/s00778-010-0205-7
DOI: https://doi.org/10.1007/s00778-010-0205-7