A Simple Algorithm for Topic Identification in 0–1 Data

Seppänen, Jouni K.; Bingham, Ella; Mannila, Heikki

doi:10.1007/978-3-540-39804-2_38

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2838)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

Topics in 0–1 datasets are sets of variables whose occurrences are positively connected together. Earlier, we described a simple generative topic model. In this paper we show that, given data produced by this model, the lift statistics of attributes can be described in matrix form. We use this result to obtain a simple algorithm for finding topics in 0–1 data. We also show that a problem related to the identification of topics is NP-hard. We give experimental results on the topic identification problem, both on generated and real data.
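
The abstract refers to the lift statistics of attributes without defining them. As a minimal sketch, assuming the standard definition from association analysis, lift(i, j) = P(i and j) / (P(i) P(j)), the pairwise lifts of a 0–1 dataset can be collected into a matrix as follows. The function name `lift_matrix` and the toy data are illustrative assumptions; this is not the topic identification algorithm of the paper, only the statistic it starts from.

```python
def lift_matrix(rows):
    """Return the matrix of pairwise lifts for the columns of a 0-1 dataset.

    rows: list of equal-length lists containing 0s and 1s
          (one list per observation, one position per attribute).
    Entry [i][j] is P(i and j) / (P(i) * P(j)), estimated from frequencies;
    it is None when either attribute never occurs.
    """
    n = len(rows)
    d = len(rows[0])
    # Marginal frequency of each attribute.
    p = [sum(r[j] for r in rows) / n for j in range(d)]
    lift = [[None] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            # Joint frequency of attributes i and j.
            joint = sum(r[i] * r[j] for r in rows) / n
            # Joint frequency expected if i and j were independent.
            expected = p[i] * p[j]
            if expected > 0:
                lift[i][j] = joint / expected
    return lift


# Toy data (an assumption for illustration): attributes 0 and 1 always
# co-occur, as do attributes 2 and 3, and the two pairs never overlap.
data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
lifts = lift_matrix(data)
```

A lift above 1 indicates that two attributes co-occur more often than independence would predict, which matches the abstract's description of topic variables as positively connected. A lift below 1 indicates the opposite.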

Author information

Authors and Affiliations

Laboratory of Computer and Information Science and HIIT Basic Research Unit, Helsinki University of Technology,
Jouni K. Seppänen, Ella Bingham & Heikki Mannila

Editor information

Editors and Affiliations

University of Nova Gorica, Nova Gorica, Slovenia
Nada Lavrač
Rudjer Bošković Institute, Bijenička 54, 10000, Zagreb, Croatia
Dragan Gamberger
Jozef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Ljupčo Todorovski
Leiden Institute of Advanced Computer Science, Leiden University,
Hendrik Blockeel

Cite this paper

Seppänen, J.K., Bingham, E., Mannila, H. (2003). A Simple Algorithm for Topic Identification in 0–1 Data. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds) Knowledge Discovery in Databases: PKDD 2003. Lecture Notes in Computer Science (LNAI), vol 2838. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39804-2_38

DOI: https://doi.org/10.1007/978-3-540-39804-2_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20085-7
Online ISBN: 978-3-540-39804-2
eBook Packages: Springer Book Archive
