Analyzing Entities and Topics in News Articles Using Statistical Topic Models

Newman, David; Chemudugunta, Chaitanya; Smyth, Padhraic; Steyvers, Mark

doi:10.1007/11760146_9

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3975)

Abstract

Statistical language models can learn relationships between topics discussed in a document collection and persons, organizations and places mentioned in each document. We present a novel combination of statistical topic models and named-entity recognizers to jointly analyze entities mentioned (persons, organizations and places) and topics discussed in a collection of 330,000 New York Times news articles. We demonstrate an analytic framework which automatically extracts from a large collection: topics; topic trends; and topics that relate entities.

Author information

Authors and Affiliations

Department of Computer Science, UC Irvine, Irvine, CA
David Newman, Chaitanya Chemudugunta & Padhraic Smyth
Department of Cognitive Science, UC Irvine, Irvine, CA
Mark Steyvers

Editor information

Editors and Affiliations

School of Information and Computer Science, University of California, Irvine
Sharad Mehrotra
MIS Department, University of Arizona, 85721, Tucson, AZ, USA
Daniel D. Zeng
Department of Management Information Systems, Eller College of Management, The University of Arizona, 85721, AZ, USA
Hsinchun Chen
University of Texas at Dallas
Bhavani Thuraisingham
Chinese Academy of Sciences, 100190, Beijing, China
Fei-Yue Wang

About this paper

Cite this paper

Newman, D., Chemudugunta, C., Smyth, P., Steyvers, M. (2006). Analyzing Entities and Topics in News Articles Using Statistical Topic Models. In: Mehrotra, S., Zeng, D.D., Chen, H., Thuraisingham, B., Wang, FY. (eds) Intelligence and Security Informatics. ISI 2006. Lecture Notes in Computer Science, vol 3975. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11760146_9

DOI: https://doi.org/10.1007/11760146_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34478-0
Online ISBN: 978-3-540-34479-7
eBook Packages: Computer Science, Computer Science (R0)
