Conclusions
Boolean regression model classes are powerful modeling tools whose associated NML models can be computed easily and used in MDL inference, in particular for factor selection.
Comparing the MDL methods based on two-part codes with those based on the NML models, we note that the former are faster to evaluate, but the latter provide a significantly shorter codelength and hence a better description of the data. When analyzing gene expression data, speed may be a major concern, since one has to test $\binom{n}{k}$ possible groupings of k genes, with n on the order of thousands and k usually less than 10. The two-part codes may then be used to pre-screen the gene groupings and remove the obviously poor performers, after which the NML model can be applied to obtain the final selection from a smaller pool of candidates. The running time for all our experiments reported here is on the order of tens of minutes.
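The two-stage strategy described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function names, the `keep` parameter, and the two scoring callables are hypothetical placeholders, standing in for a two-part codelength (cheap) and an NML codelength (more expensive), both measured so that smaller is better.

```python
import heapq
from itertools import combinations
from math import comb


def count_groupings(n_genes, k):
    """Number of distinct k-gene groupings among n_genes, i.e. C(n, k)."""
    return comb(n_genes, k)


def two_stage_select(n_genes, k, cheap_codelength, exact_codelength, keep):
    """Pre-screen every k-subset with a cheap score, then rescore the survivors.

    cheap_codelength and exact_codelength are hypothetical callables mapping a
    tuple of gene indices to a codelength (smaller is better). They are
    assumptions of this sketch, not functions defined in the chapter.
    """
    candidates = combinations(range(n_genes), k)
    # Stage 1: retain only the `keep` groupings with the shortest cheap codelength.
    survivors = heapq.nsmallest(keep, candidates, key=cheap_codelength)
    # Stage 2: apply the more expensive criterion to the reduced pool only.
    return min(survivors, key=exact_codelength)


if __name__ == "__main__":
    # Size of the search space for n = 2000 genes and k = 3.
    print(count_groupings(2000, 3))
    # Toy scores, used only to exercise the control flow.
    best = two_stage_select(
        n_genes=6,
        k=2,
        cheap_codelength=lambda g: sum(g),
        exact_codelength=lambda g: abs(sum(g) - 4),
        keep=5,
    )
    print(best)
```

The expensive criterion is evaluated only `keep` times instead of $\binom{n}{k}$ times, which is the source of the speed-up. The price is that a grouping discarded in the first stage can never be recovered, so `keep` should be chosen generously.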
The use of the MDL principle with the class of Boolean models provides an effective classification method, as demonstrated by the important cancer classification example based on gene expression data. The NML model for the class M(θ, k, f) was used to select informative feature genes. Using the sets of feature genes selected by the NML model, we achieved classification error rates significantly lower than those recently reported for the same data set.
© 2003 Kluwer Academic Publishers
Tabus, I., Rissanen, J., Astola, J. (2003). Normalized Maximum Likelihood Models for Boolean Regression with Application to Prediction and Classification in Genomics. In: Zhang, W., Shmulevich, I. (eds) Computational and Statistical Approaches to Genomics. Springer, Boston, MA. https://doi.org/10.1007/0-306-47825-0_10
DOI: https://doi.org/10.1007/0-306-47825-0_10
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4020-7023-5
Online ISBN: 978-0-306-47825-3
eBook Packages: Springer Book Archive