Adapting ADtrees for improved performance on large datasets with high-arity features

Van Dam, Robert; Langkilde-Geary, Irene; Ventura, Dan

doi:10.1007/s10115-012-0510-0

Regular Paper
Published: 24 June 2012

Volume 35, pages 525–552, (2013)

Knowledge and Information Systems

Abstract

The ADtree, a data structure useful for caching sufficient statistics, has been successfully adapted to grow lazily when memory is limited and to update sequentially with an incrementally updated dataset. However, even these modified forms of the ADtree still exhibit inefficiencies in terms of both space usage and query time, particularly on datasets with very high dimensionality and with high-arity features. We propose four modifications to the ADtree, each of which can be used to improve size and query time under specific types of datasets and features. These modifications also provide an increased ability to precisely control how an ADtree is built and to tune its size given external memory or speed requirements.
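
To make the idea of caching sufficient statistics concrete, the following is a minimal sketch of a count tree in the spirit of an ADtree, not the authors' implementation. It assumes categorical features encoded as small integers and omits the pruning and leaf-list optimizations of the original ADtree, as well as the four modifications proposed in this paper, which the abstract does not describe. The names ADNode, build, and count are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

Record = Tuple[int, ...]  # one categorical value (as an int) per feature


@dataclass
class ADNode:
    """Holds the number of records matching the conjunction on the path to this node."""
    count: int
    # children[attr][value] -> node further restricted by attr == value
    children: Dict[int, Dict[int, "ADNode"]] = field(default_factory=dict)


def build(records: Sequence[Record], rows: List[int],
          first_attr: int, n_attrs: int) -> ADNode:
    """Recursively cache counts for every conjunction over attributes >= first_attr."""
    node = ADNode(count=len(rows))
    for attr in range(first_attr, n_attrs):
        # Partition the matching rows by their value for this attribute.
        by_value: Dict[int, List[int]] = {}
        for r in rows:
            by_value.setdefault(records[r][attr], []).append(r)
        # One child per observed value: high-arity features create many children.
        node.children[attr] = {
            value: build(records, subset, attr + 1, n_attrs)
            for value, subset in by_value.items()
        }
    return node


def count(root: ADNode, query: Dict[int, int]) -> int:
    """Return how many records satisfy every attr == value pair in query."""
    node = root
    for attr in sorted(query):  # the tree is ordered by increasing attribute index
        child = node.children.get(attr, {}).get(query[attr])
        if child is None:
            return 0  # this value combination never occurs in the data
        node = child
    return node.count


if __name__ == "__main__":
    data = [(0, 1, 2), (0, 1, 0), (1, 1, 2), (0, 0, 2)]
    tree = build(data, list(range(len(data))), 0, 3)
    print(count(tree, {0: 0, 2: 2}))  # prints 2
```

Because every node gets one child per observed value of each remaining attribute, the structure grows quickly with both dimensionality and arity, which is the inefficiency the abstract refers to.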

Author information

Authors and Affiliations

Computer Science Department, Brigham Young University, Provo, UT, USA
Robert Van Dam, Irene Langkilde-Geary & Dan Ventura

Corresponding author

Correspondence to Dan Ventura.

About this article

Cite this article

Van Dam, R., Langkilde-Geary, I. & Ventura, D. Adapting ADtrees for improved performance on large datasets with high-arity features. Knowl Inf Syst 35, 525–552 (2013). https://doi.org/10.1007/s10115-012-0510-0

Received: 01 October 2010
Revised: 17 November 2011
Accepted: 25 February 2012
Published: 24 June 2012
Issue Date: June 2013
DOI: https://doi.org/10.1007/s10115-012-0510-0
