Modeling and Imputation of Large Incomplete Multidimensional Datasets

Part of the Lecture Notes in Computer Science book series (LNCS, volume 2454)

Abstract

Missing or incomplete data is commonplace in large real-world databases. In this paper, we study the problem of missing values that occur in the measure dimension of a data cube. We propose a two-part mixture model, which combines a logistic model and a loglinear model, to predict and impute the missing values. The logistic model is applied to predict missing positions, while the loglinear model is applied to compute the estimates. Experimental results on real and synthetic datasets are presented.
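
As a rough illustration of how a two-part model of this kind can be combined, the sketch below pairs a logistic part with a main-effects loglinear part. It is a minimal sketch under stated assumptions, not the authors' implementation: the reading that the logistic part decides whether a cell holds a non-zero measure, the 0.5 threshold, the function names, and all parameter values are hypothetical and hand-picked rather than fitted or taken from the paper.

```python
import math

def nonzero_probability(intercept, weights, features):
    """Logistic part (assumed role): probability that a cell holds a non-zero measure."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def loglinear_estimate(mu, row_effect, col_effect):
    """Loglinear part: main-effects model log(m_ij) = mu + a_i + b_j."""
    return math.exp(mu + row_effect + col_effect)

def impute(p_nonzero, estimate, threshold=0.5):
    """Combine the parts: keep the estimate only where a value is predicted."""
    return estimate if p_nonzero >= threshold else 0.0

# Hypothetical, hand-picked parameters (not fitted, not from the paper).
p = nonzero_probability(intercept=-1.0, weights=[2.0], features=[1.0])
m = loglinear_estimate(mu=1.0, row_effect=0.5, col_effect=-0.5)
value = impute(p, m)
```

In practice both sets of parameters would be estimated from the observed cells; they are passed in directly here only to keep the combination step visible.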

Keywords

  • Logistic Model
  • Range Query
  • Synthetic Dataset
  • Loglinear Model
  • Data Cube

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

The data cube is a widely used data model for On-Line Analytical Processing (OLAP). A data cube is a multidimensional data abstraction that stores aggregated measures for combinations of dimension values.
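
The idea of keeping aggregated measures for every combination of dimension values can be sketched as follows. This is a minimal sketch, assuming a sum aggregate and an `ALL` placeholder for a rolled-up dimension; the fact rows, the dimension names, and the `build_cube` helper are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # placeholder meaning "aggregated over this dimension"

def build_cube(rows, n_dims):
    """Sum the measure for every combination of dimension values, including ALL."""
    cube = defaultdict(float)
    for *dims, measure in rows:
        # Each dimension either keeps its value or is rolled up to ALL.
        for mask in product((False, True), repeat=n_dims):
            key = tuple(ALL if rolled else d for d, rolled in zip(dims, mask))
            cube[key] += measure
    return dict(cube)

# Hypothetical fact rows: (store, product, sales)
rows = [
    ("north", "tea", 10.0),
    ("north", "coffee", 5.0),
    ("south", "tea", 7.0),
]
cube = build_cube(rows, n_dims=2)
```

Here `cube[("north", ALL)]` holds the total for the north store across products, while `("south", "coffee")` has no entry at all, which is the kind of empty measure cell an imputation method has to deal with.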

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wu, X., Barbará, D. (2002). Modeling and Imputation of Large Incomplete Multidimensional Datasets. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2002. Lecture Notes in Computer Science, vol 2454. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46145-0_28

  • DOI: https://doi.org/10.1007/3-540-46145-0_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44123-6

  • Online ISBN: 978-3-540-46145-6

  • eBook Packages: Springer Book Archive
