Token list based information search in a multi-dimensional massive database

Shen, Haiying; Li, Ze; Li, Ting

doi:10.1007/s10844-013-0289-9

Published: 27 December 2013

Volume 42, pages 567–594 (2014)

Journal of Intelligent Information Systems

Haiying Shen¹,
Ze Li² &
Ting Li³

Abstract

Finding proximity information is crucial for massive database search. Locality Sensitive Hashing (LSH) is a method for finding nearest neighbors of a query point in a high-dimensional space. It classifies high-dimensional data according to data similarity. However, the “curse of dimensionality” makes LSH insufficiently effective in finding similar data and insufficiently efficient in terms of memory resources and search delays. The contribution of this work is threefold. First, we study a Token List based information Search scheme (TLS) as an alternative to LSH. TLS builds a token list table containing all the unique tokens from the database, and clusters data records having the same token together in one group. Querying is conducted in a small number of groups of relevant data records instead of searching the entire database. Second, in order to decrease the searching time of the token list, we further propose the Optimized Token list based Search schemes (OTS) based on index-tree and hash table structures. An index-tree structure orders the tokens in the token list and constructs an index table based on the tokens. Searching the token list starts from the entry of the token list supplied by the index table. A hash table structure assigns a hash ID to each token. A query token can be directly located in the token list according to its hash ID. Third, since a single-token based method leads to high overhead in the results refinement given a required similarity, we further investigate how a Multi-Token List Search scheme (MTLS) improves the performance of database proximity search. We conducted experiments on the LSH-based searching scheme, TLS, OTS, and MTLS using a massive customer data integration database. The experimental results show that TLS is more efficient than an LSH-based searching scheme, and that OTS improves the search efficiency of TLS. Further, MTLS performs better than TLS when the number of tokens is appropriately chosen, and a two-token adjacent token list achieves the shortest query delay in our testing dataset.
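
The schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the Jaccard similarity used for refinement, the first-character index table, the sample records, and every function name are assumptions made for illustration. A Python dict stands in for the hash table variant of OTS, `build_index_table` and `find_token` approximate the ordered index variant, and `n=2` approximates a two-token adjacent token list.

```python
from collections import defaultdict


def tokenize(record):
    """Assumed tokenizer: lowercase, split on whitespace."""
    return record.lower().split()


def build_token_list(records, n=1):
    """Group record IDs under every run of n adjacent tokens they contain."""
    groups = defaultdict(set)
    for rec_id, record in enumerate(records):
        tokens = tokenize(record)
        for i in range(len(tokens) - n + 1):
            groups[tuple(tokens[i:i + n])].add(rec_id)
    return groups


def candidates(groups, query, n=1):
    """Union of the groups hit by the query; the dict lookup plays the hash table role."""
    tokens = tokenize(query)
    hit = set()
    for i in range(len(tokens) - n + 1):
        hit |= groups.get(tuple(tokens[i:i + n]), set())
    return hit


def search(records, groups, query, threshold, n=1):
    """Refine candidates with Jaccard similarity over tokens (assumed metric)."""
    q = set(tokenize(query))
    results = []
    for rec_id in sorted(candidates(groups, query, n)):
        r = set(tokenize(records[rec_id]))
        sim = len(q & r) / len(q | r)
        if sim >= threshold:
            results.append((rec_id, sim))
    return results


def build_index_table(ordered_tokens):
    """Assumed index table: first position of each leading character."""
    index = {}
    for pos, tok in enumerate(ordered_tokens):
        index.setdefault(tok[0], pos)
    return index


def find_token(ordered_tokens, index, token):
    """Scan the ordered token list starting at the entry the index table supplies."""
    pos = index.get(token[0])
    if pos is None:
        return -1
    while pos < len(ordered_tokens) and ordered_tokens[pos][0] == token[0]:
        if ordered_tokens[pos] == token:
            return pos
        pos += 1
    return -1


# Hypothetical customer records, invented for this example.
records = [
    "john smith 12 oak street clemson",
    "jon smith 12 oak street clemson",
    "mary jones 40 pine road fairfax",
    "john smyth 14 oak avenue bentonville",
]
query = "john smith 12 oak street"

single = build_token_list(records, n=1)
pairs = build_token_list(records, n=2)

print(sorted(candidates(single, query, n=1)))  # [0, 1, 3]
print(sorted(candidates(pairs, query, n=2)))   # [0, 1]
print([rid for rid, _ in search(records, single, query, 0.5)])  # [0, 1]

ordered = sorted(key[0] for key in single)
index = build_index_table(ordered)
print(ordered[find_token(ordered, index, "oak")])  # oak
```

In this toy data, the single-token list returns record 3 as a candidate because it shares "john" and "oak" with the query, and the refinement step then discards it. The two-token list never retrieves it, which mirrors the abstract's point that multi-token lists can reduce refinement overhead.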

References

Aberer, K., Cudré-Mauroux, P., Hauswirth, M. (2003). The chatty web: emergent semantics through gossiping. In Proceedings of the 12th international world wide web conference.
Alimohammadi, D. (2003). Meta-tag: a means to control the process of web indexing. Online Information Review, 27(4), 238–242.
Andoni, A. (2005). Lsh algorithm and implementation (e2lsh). http://web.mit.edu/andoni/www/LSH/index.html.
Andoni, A., & Indyk, P. (2005). E2lsh 0.1 user manual. http://web.mit.edu/andoni/www/LSH/index.html.
Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A. (1994). An optimal algorithm for approximate nearest neighbor searching. In Proceedings 5th ACM-SIAM symposium discrete algorithms.
Bayer, R., & McCreight, E. (1970). Organization and maintenance of large ordered indices. In Proceedings of ACM-SIGFIDET workshop on data description and access (pp. 107–141).
Beckmann, N., Kriegel, H., Schneider, R., Seeger, B. (1990). The r*-tree: an efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD international conference on management of data (pp. 322–331).
Bennett, K.P., Fayyad, U., Geiger, D. (1999). Density-based indexing for approximate nearest-neighbor queries. In Proceedings of KDD.
Bentley, J.L., Friedman, J.H., Finkel, R.A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3), 209–226.
Berchtold, S., Keim, D.A., Kriegel, H.-P. (1996). The x-tree: an index structure for high-dimensional data. In Proceedings of the 22nd international conference on very large databases (pp. 28–39).
Berrani, S.A., Amsaleg, L., Gros, P. (2003). Approximate searches: k-neighbors + precision. In Proceedings of CIKM.
Berry, M.W., Drmac, Z., Jessup, E.R. (1999). Matrices, vector spaces, and information retrieval. SIAM Review, 41(2), 335–362.
Blachman, N. (2007). Google guide, making searching even easier. http://www.googleguide.com/google_works.html.
Bohm, C., Berchtold, S., Keim, D.A. (2001). Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3), 322–373.
Brin, S. (1995). Near neighbor search in large metric space. In Proceedings of the 21st international conference on VLDB.
Chaudhuri, S., Church, K., Konig, A., Sui, L. (2007). Heavy-tailed distributions and multi-keyword queries. In Proceedings of SIGIR.
Chen, H., Jin, H., Wang, J., Chen, L., Liu, Y., Ni, L. (2008). Efficient multi-keyword search over p2p web. In Proceedings of WWW (pp. 989–998).
Chen, H., Yan, J., Jin, H., Liu, Y., Ni, L. (2010). TSS: efficient term set search in large peer-to-peer textual collections. TC, 59(7), 969–980.
Comer, D. (1979). The ubiquitous B-tree. Computing Surveys, 11(2), 121–138.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13(1), 21–27.
Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S. (2003). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of DIMACS workshop on streaming data analysis and mining.
Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th annual symposium on computational geometry (SCG).
Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Fagin, R. (1998). Fuzzy queries in multimedia database systems. In Proceedings ACM symposium on principles of database systems.
Filho, R.F.S., Traina, A.J.M., Traina, J.C., Faloutsos, C. (2001). Similarity search without tears: the omni family of all-purpose access methods. In Proceedings of ICDE.
Fu, A., Chan, P.M.S., Cheung, Y.L., Moon, Y.S. (2000). Dynamic vp-tree indexing for n-nearest neighbor search given pair-wise distances. VLDB Journal, 9(2), 154–173.
Gionis, A., Indyk, P., Motwani, R. (1999). Similarity search in high dimensions via hashing. In Proceedings of international conference on very large data bases (VLDB) (pp. 518–529).
Grossman, D.A., & Frieder, O. (2004). iFlow: information retrieval. The Netherlands: Springer.
Guttman, A. (1984). R-trees: a dynamic index structure for spatial searching. In Proceedings of the SIGMOD conference (pp. 47–57).
Halevy, A.Y., Ives, Z.G., Mork, P., Tatarinov, I. (2003). Piazza: data management infrastructure for semantic web applications. In Proceedings of the 12th international world wide web conference.
Hu, J.J., Tang, C.J., Peng, J., Li, C., Yuan, C.A., Chen, A.L. (2005). A clustering algorithm based absorbing nearest neighbors. In 6th International conference of WAIM.
Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th annual ACM symposium on theory of computing.
Kleinberg, J.M. (1997). Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of ACM symposium on theory of computing (STOC).
Kruskal, J.B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills: SAGE publication.
Kulkarni, S., & Orlandic, R. (2006). High-dimensional similarity search using data sensitive space partitioning. Lecture Notes in Computer Science (LNCS), 4080(2006), 738–750.
Lam, H., Perego, R., Quan, N., Silvestri, F. (2009). Entry pairing in inverted file. In Proceedings of WISE (Vol. 5802, pp. 511–522).
Li, C., Chang, E., Garcia-Molina, H., Wiederhold, G. (2002). Clustering for approximate similarity search in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 14(4), 792–808.
Li, T., Shen, H., Rosequist, A. (2008). Token list based data searching in a multi-dimensional massive database. In Proceedings of The 4th international conference on data mining (DMIN).
Loccoz, N.M. (2005). High-dimensional access methods for efficient similarity queries. Technical Report TR-2005-05-05, Universite De GENEVE.
Long, X., & Suel, T. (2005). Three-level caching for efficient query processing in large Web search engines. In Proceedings of WWW (pp. 257–266).
Luu, T., Skobeltsyn, G., Klemm, F., Puh, M., Zarko, I., Rajman, M., Aberer, K. (2008). AlvisP2P: scalable peer-to-peer text retrieval in a structured P2P network. PVLDB, 1(2), 1424–1427.
Nejdl, W., Siberski, W., Wolpers, M., Schmitz, C. (2003). Routing and clustering in schema-based super peer networks. In Proceedings of IPTPS.
Nejdl, W., Wolpers, M., Siberski, W., Löser, A., Bruckhorst, I., Schlosser, M., Schmitz, C. (2003). Super-peer-based routing and clustering strategies for rdf-based peer-to-peer networks. In Proceedings of the 12th international world wide web conference.
Niblack, C.W., Barber, R., Equitz, W., Flickner, M.D., Glasman, E.H., Petkovic, D., Yanker, P., Faloutsos, C., Taubin, G. (1993). The QBIC project: querying images by content using color, texture and shape. In Proceedings of SPIE: storage and retrieval for image and video database.
Panigrahy, R. (2006). Nearest neighbor search using kd-trees. Technical report, Stanford University.
Qi, X., & Davison, B. (2009). Web page classification: features and algorithms. ACM Computing Surveys, 41(2), 1–31.
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. International Student Edition, McGraw-Hill.
Sellis, T., Roussopoulos, N., Faloutsos, C. (1997). Multidimensional access methods: trees have grown everywhere. In Proceedings of the 23rd international conference on very large data bases.
Shen, H., Li, Z., Li, T. (2008). An investigation on multi-token list based proximity search in multi-dimensional massive database. In Proceedings of the international conference on convergence and hybrid information technology (ICCIT).
Skobeltsyn, G., Luu, T., Zarko, I., Rajman, M., Aberer, K. (2009). Query-driven indexing for scalable peer-to-peer text retrieval. Future Generation Computing Systems, 25(1), 89–99.
Weth, C., & Datta, A. (2012). Multiterm keyword search in NoSQL systems. IEEE Internet Computing, 16(1), 34–42.
White, D.A., & Jain, R. (1996). Algorithm and strategies for similarity retrieval. Technical Report VCL-96-101, University of California.
Yianilos, P.N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the fourth annual ACM-SIAM symposium on discrete algorithms.
Zolotarev, V.M. (1986). One-dimensional stable distributions. American Mathematical Society.

Acknowledgments

This research was supported in part by U.S. NSF grants IIS-1354123, CNS-1254006, CNS-1249603, OCI-1064230, CNS-1049947, CNS-0917056, and CNS-1025652, Microsoft Research Faculty Fellowship 8300751, and the United States Department of Defense 238866. Early versions of this work were presented in the Proceedings of DMIN’08 (Li et al. 2008) and ICCIT’08 (Shen et al. 2008). We would like to thank Mr. Yuhua Lin for his valuable comments in addressing the review feedback.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Clemson University, Clemson, SC, 29634, USA
Haiying Shen
MicroStrategy, Tysons Corner, Fairfax, VA, 22182, USA
Ze Li
Wal-mart Stores Inc., Bentonville, AR, 72716, USA
Ting Li

Corresponding author

Correspondence to Haiying Shen.

About this article

Cite this article

Shen, H., Li, Z. & Li, T. Token list based information search in a multi-dimensional massive database. J Intell Inf Syst 42, 567–594 (2014). https://doi.org/10.1007/s10844-013-0289-9

Received: 27 March 2013
Revised: 06 November 2013
Accepted: 08 November 2013
Published: 27 December 2013
Issue Date: June 2014
DOI: https://doi.org/10.1007/s10844-013-0289-9
