Token list based information search in a multi-dimensional massive database

Journal of Intelligent Information Systems

Abstract

Finding proximity information is crucial for massive database search. Locality Sensitive Hashing (LSH) is a method for finding nearest neighbors of a query point in a high-dimensional space. It classifies high-dimensional data according to data similarity. However, the “curse of dimensionality” makes LSH insufficiently effective in finding similar data and insufficiently efficient in terms of memory resources and search delays. The contribution of this work is threefold. First, we study a Token List based information Search scheme (TLS) as an alternative to LSH. TLS builds a token list table containing all the unique tokens from the database, and clusters data records having the same token together in one group. Querying is conducted in a small number of groups of relevant data records instead of searching the entire database. Second, in order to decrease the searching time of the token list, we further propose the Optimized Token list based Search schemes (OTS) based on index-tree and hash table structures. An index-tree structure orders the tokens in the token list and constructs an index table based on the tokens. Searching the token list starts from the entry of the token list supplied by the index table. A hash table structure assigns a hash ID to each token. A query token can be directly located in the token list according to its hash ID. Third, since a single-token based method leads to high overhead in the results refinement given a required similarity, we further investigate how a Multi-Token List Search scheme (MTLS) improves the performance of database proximity search. We conducted experiments on the LSH-based searching scheme, TLS, OTS, and MTLS using a massive customer data integration database. The comparison experimental results show that TLS is more efficient than an LSH-based searching scheme, and OTS improves the search efficiency of TLS. 
Further, MTLS performs better than TLS when the number of tokens is appropriately chosen, and a two-token adjacent token list achieves the shortest query delay in our testing dataset.
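
The token list idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the whitespace tokenizer, the Jaccard similarity used for refinement, the function names, and the sample records are all assumptions made for illustration. A Python dict stands in for the hash table structure, and setting n=2 gives a two-token adjacent token list.

```python
from collections import defaultdict

def tokenize(record):
    """Split a record into lowercase whitespace-delimited tokens (assumed tokenizer)."""
    return record.lower().split()

def build_token_list(records, n=1):
    """Map each unique run of n adjacent tokens to the IDs of records containing it.

    n=1 builds a single-token list; n=2 builds a two-token adjacent list.
    The dict is a hash table, so a query key is located directly.
    """
    table = defaultdict(set)
    for rid, record in enumerate(records):
        tokens = tokenize(record)
        for i in range(len(tokens) - n + 1):
            table[tuple(tokens[i:i + n])].add(rid)
    return table

def candidates(table, query, n=1):
    """Return IDs of records that share at least one n-token key with the query."""
    tokens = tokenize(query)
    found = set()
    for i in range(len(tokens) - n + 1):
        found |= table.get(tuple(tokens[i:i + n]), set())
    return found

def search(records, table, query, threshold, n=1):
    """Refine the candidate group by Jaccard similarity (an assumed measure)."""
    q = set(tokenize(query))
    results = []
    for rid in sorted(candidates(table, query, n)):
        r = set(tokenize(records[rid]))
        sim = len(q & r) / len(q | r)
        if sim >= threshold:
            results.append((rid, sim))
    return results

# Hypothetical customer records, invented for this example.
records = [
    "ann smith 12 oak street",
    "ann smyth 12 oak street",
    "bob jones 7 elm road",
    "ann jones 9 pine road",
]
single = build_token_list(records, n=1)
pairs = build_token_list(records, n=2)
query = "ann smith 12 oak road"

print(candidates(single, query, 1))           # every record shares some single token
print(candidates(pairs, query, 2))            # fewer records share an adjacent pair
print(search(records, pairs, query, 0.5, 2))  # refinement runs on the smaller group
```

In this toy data the single-token list returns all four records as candidates, while the two-token adjacent list returns only two, so the refinement step compares fewer records. That is the kind of saving the abstract attributes to MTLS, shown here only on invented data.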

References

  • Aberer, K., Cudré-Mauroux, P., Hauswirth, M. (2003). The chatty web: emergent semantics through gossiping. In Proceedings of the 12th international world wide web conference.

  • Alimohammadi, D. (2003). Meta-tag: a means to control the process of web indexing. Online Information Review, 27(4), 238–242.

  • Andoni, A. (2005). LSH algorithm and implementation (E2LSH). http://web.mit.edu/andoni/www/LSH/index.html.

  • Andoni, A., & Indyk, P. (2005). E2LSH 0.1 user manual. http://web.mit.edu/andoni/www/LSH/index.html.

  • Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A. (1994). An optimal algorithm for approximate nearest neighbor searching. In Proceedings of the 5th ACM-SIAM symposium on discrete algorithms.

  • Bayer, R., & McCreight, E. (1970). Organization and maintenance of large ordered indices. In Proceedings of ACM-SIGFIDET workshop on data description and access (pp. 107–141).

  • Beckmann, N., Kriegel, H., Schneider, R., Seeger, B. (1990). The r*-tree: an efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD international conference on management of data (pp. 322–331).

  • Bennett, K.P., Fayyad, U., Geiger, D. (1999). Density-based indexing for approximate nearest-neighbor queries. In Proceedings of KDD.

  • Bentley, J.L., Friedman, J.H., Finkel, R.A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3), 209–226.

  • Berchtold, S., Keim, D.A., Kriegel, H.-P. (1996). The x-tree: an index structure for high-dimensional data. In Proceedings of the 22nd international conference on very large databases (pp. 28–39).

  • Berrani, S.A., Amsaleg, L., Gros, P. (2003). Approximate searches: k-neighbors + precision. In Proceedings of CIKM.

  • Berry, M.W., Drmac, Z., Jessup, E.R. (1999). Matrices, vector spaces, and information retrieval. SIAM Review, 41(2), 335–362.

  • Blachman, N. (2007). Google guide, making searching even easier. http://www.googleguide.com/google_works.html.

  • Bohm, C., Berchtold, S., Keim, D.A. (2001). Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3), 322–373.

  • Brin, S. (1995). Near neighbor search in large metric space. In Proceedings of the 21st international conference on VLDB.

  • Chaudhuri, S., Church, K., Konig, A., Sui, L. (2007). Heavy-tailed distributions and multi-keyword queries. In Proceedings of SIGIR.

  • Chen, H., Jin, H., Wang, J., Chen, L., Liu, Y., Ni, L. (2008). Efficient multi-keyword search over p2p web. In Proceedings of WWW (pp. 989–998).

  • Chen, H., Yan, J., Jin, H., Liu, Y., Ni, L. (2010). TSS: efficient term set search in large peer-to-peer textual collections. TC, 59(7), 969–980.

  • Comer, D. (1979). The ubiquitous B-tree. ACM Computing Surveys, 11(2), 121–138.

  • Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13(1), 21–27.

  • Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S. (2003). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of DIMACS workshop on streaming data analysis and mining.

  • Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th annual symposium on computational geometry (SCG).

  • Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

  • Fagin, R. (1998). Fuzzy queries in multimedia database systems. In Proceedings ACM symposium on principles of database systems.

  • Filho, R.F.S., Traina, A.J.M., Traina, J.C., Faloutsos, C. (2001). Similarity search without tears: the omni family of all-purpose access methods. In Proceedings of ICDE.

  • Fu, A., Chan, P.M.S., Cheung, Y.L., Moon, Y.S. (2000). Dynamic vp-tree indexing for n-nearest neighbor search given pair-wise distances. VLDB Journal, 9(2), 154–173.

  • Gionis, A., Indyk, P., Motwani, R. (1999). Similarity search in high dimensions via hashing. In Proceedings of international conference on very large data bases (VLDB) (pp. 518–529).

  • Grossman, D.A., & Frieder, O. (2004). iFlow: information retrieval. The Netherlands: Springer.

  • Guttman, A. (1984). R-trees: a dynamic index structure for spatial searching. In Proceedings of the SIGMOD conference (pp. 47–57).

  • Halevy, A.Y., Ives, Z.G., Mork, P., Tatarinov, I. (2003). Piazza: data management infrastructure for semantic web applications. In Proceedings of the 12th international world wide web conference.

  • Hu, J.J., Tang, C.J., Peng, J., Li, C., Yuan, C.A., Chen, A.L. (2005). A clustering algorithm based absorbing nearest neighbors. In 6th International conference of WAIM.

  • Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th annual ACM symposium on theory of computing.

  • Kleinberg, J.M. (1997). Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of ACM symposium on theory of computing (STOC).

  • Kruskal, J.B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills: SAGE Publications.

  • Kulkarni, S., & Orlandic, R. (2006). High-dimensional similarity search using data sensitive space partitioning. Lecture Notes in Computer Science (LNCS), 4080(2006), 738–750.

  • Lam, H., Perego, R., Quan, N., Silvestri, F. (2009). Entry pairing in inverted file. In Proceedings of WISE (Vol. 5802, pp. 511–522).

  • Li, C., Chang, E., Garcia-Molina, H., Wiederhold, G. (2002). Clustering for approximate similarity search in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 14(4), 792–808.

  • Li, T., Shen, H., Rosequist, A. (2008). Token list based data searching in a multi-dimensional massive database. In Proceedings of The 4th international conference on data mining (DMIN).

  • Loccoz, N.M. (2005). High-dimensional access methods for efficient similarity queries. Technical Report TR-2005-05-05, Universite De GENEVE.

  • Long, X., & Suel, T. (2005). Three-level caching for efficient query processing in large Web search engines. In Proceedings of WWW (pp. 257–266).

  • Luu, T., Skobeltsyn, G., Klemm, F., Puh, M., Zarko, I., Rajman, M., Aberer, K. (2008). AlvisP2P: scalable peer-to-peer text retrieval in a structured P2P network. PVLDB, 1(2), 1424–1427.

  • Nejdl, W., Siberski, W., Wolpers, M., Schmitz, C. (2003). Routing and clustering in schema-based super peer networks. In Proceedings of IPTPS.

  • Nejdl, W., Wolpers, M., Siberski, W., Löser, A., Bruckhorst, I., Schlosser, M., Schmitz, C. (2003). Super-peer-based routing and clustering strategies for rdf-based peer-to-peer networks. In Proceedings of the 12th international world wide web conference.

  • Niblack, C.W., Barber, R., Equitz, W., Flickner, M.D., Glasman, E.H., Petkovic, D., Yanker, P., Faloutsos, C., Taubin, G. (1993). The QBIC project: querying images by content using color, texture and shape. In Proceedings of SPIE: storage and retrieval for image and video database.

  • Panigrahy, R. (2006). Nearest neighbor search using kd-trees. Technical report, Stanford University.

  • Qi, X., & Davison, B. (2009). Web page classification: features and algorithms. ACM Computing Surveys, 41(2), 1–31.

  • Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. International Student Edition, McGraw-Hill.

  • Sellis, T., Roussopoulos, N., Faloutsos, C. (1997). Multidimensional access methods: trees have grown everywhere. In Proceedings of the 23rd international conference on very large data bases.

  • Shen, H., Li, Z., Li, T. (2008). An investigation on multi-token list based proximity search in multi-dimensional massive database. In Proceedings of the international conference on convergence and hybrid information technology (ICCIT).

  • Skobeltsyn, G., Luu, T., Zarko, I., Rajman, M., Aberer, K. (2009). Query-driven indexing for scalable peer-to-peer text retrieval. Future Generation Computer Systems, 25(1), 89–99.

  • Weth, C., & Datta, A. (2012). Multiterm keyword search in NoSQL systems. IEEE Internet Computing, 16(1), 34–42.

  • White, D.A., & Jain, R. (1996). Algorithm and strategies for similarity retrieval. Technical Report VCL-96-101, University of California.

  • Yianilos, P.N. (1993). Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the fourth annual ACM-SIAM symposium on discrete algorithms.

  • Zolotarev, V.M. (1986). One-dimensional stable distributions. American Mathematical Society.

Acknowledgments

This research was supported in part by U.S. NSF grants IIS-1354123, CNS-1254006, CNS-1249603, OCI-1064230, CNS-1049947, CNS-0917056, and CNS-1025652, Microsoft Research Faculty Fellowship 8300751, and the United States Department of Defense 238866. Early versions of this work were presented in the Proceedings of DMIN’08 (Li et al. 2008) and ICCIT’08 (Shen et al. 2008). We would like to thank Mr. Yuhua Lin for his valuable comments in addressing the review feedback.

Author information

Corresponding author

Correspondence to Haiying Shen.

About this article

Cite this article

Shen, H., Li, Z. & Li, T. Token list based information search in a multi-dimensional massive database. J Intell Inf Syst 42, 567–594 (2014). https://doi.org/10.1007/s10844-013-0289-9

  • DOI: https://doi.org/10.1007/s10844-013-0289-9
