Adapting ADtrees for improved performance on large datasets with high-arity features

  • Regular Paper
Knowledge and Information Systems

Abstract

The ADtree, a data structure for caching sufficient statistics, has been successfully adapted to grow lazily when memory is limited and to update sequentially as the dataset is incrementally updated. However, even these modified forms of the ADtree remain inefficient in both space usage and query time, particularly on datasets with very high dimensionality and high-arity features. We propose four modifications to the ADtree, each of which can reduce size and query time for specific types of datasets and features. These modifications also give more precise control over how an ADtree is built and allow its size to be tuned to external memory or speed requirements.
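As background for readers unfamiliar with the structure, the following is a minimal sketch of the basic idea behind an ADtree: precomputed counts for conjunctive queries over categorical attributes. It is not the authors' implementation and does not show any of the four proposed modifications. It also omits the most-common-value pruning of the original ADtree, as well as the lazy and sequential variants mentioned above. The names `ADNode`, `build`, and `count` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ADNode:
    """Count of the rows matching one conjunctive query, plus children that refine it."""

    def __init__(self, rows, data, start_attr, n_attrs):
        self.count = len(rows)
        # vary[i][v] is the node for "this query AND attribute i == v".
        # Only attributes with index >= start_attr are expanded, so each
        # conjunction is stored once, in increasing attribute order.
        self.vary = {}
        for i in range(start_attr, n_attrs):
            groups = defaultdict(list)
            for r in rows:
                groups[data[r][i]].append(r)
            self.vary[i] = {
                v: ADNode(g, data, i + 1, n_attrs) for v, g in groups.items()
            }


def build(data):
    """Build the (unpruned) tree over a list of equal-length tuples."""
    n_attrs = len(data[0]) if data else 0
    return ADNode(list(range(len(data))), data, 0, n_attrs)


def count(root, query):
    """Return the number of rows matching query, a dict {attribute index: value}."""
    node = root
    for attr in sorted(query):
        node = node.vary[attr].get(query[attr])
        if node is None:  # value combination never occurs in the data
            return 0
    return node.count


data = [("a", "x", 0), ("a", "y", 1), ("b", "x", 0), ("a", "x", 1)]
root = build(data)
print(count(root, {0: "a", 1: "x"}))
```

In this simplified form, every distinct value of an attribute adds a child subtree, so the node count grows quickly with both the number of attributes and their arity. That growth is the kind of space and query cost the abstract refers to for high-dimensional, high-arity data.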

Author information

Corresponding author

Correspondence to Dan Ventura.

About this article

Cite this article

Van Dam, R., Langkilde-Geary, I. & Ventura, D. Adapting ADtrees for improved performance on large datasets with high-arity features. Knowl Inf Syst 35, 525–552 (2013). https://doi.org/10.1007/s10115-012-0510-0

  • DOI: https://doi.org/10.1007/s10115-012-0510-0
