Abstract
The ADtree, a data structure for caching sufficient statistics, has been successfully adapted to grow lazily when memory is limited and to update sequentially as a dataset is incrementally updated. However, even these modified forms of the ADtree remain inefficient in both space usage and query time, particularly on datasets with very high dimensionality and with high-arity features. We propose four modifications to the ADtree, each of which can improve size and query time for specific types of datasets and features. These modifications also give more precise control over how an ADtree is built and allow its size to be tuned to external memory or speed requirements.
Cite this article
Van Dam, R., Langkilde-Geary, I. & Ventura, D. Adapting ADtrees for improved performance on large datasets with high-arity features. Knowl Inf Syst 35, 525–552 (2013). https://doi.org/10.1007/s10115-012-0510-0
DOI: https://doi.org/10.1007/s10115-012-0510-0