Analyzing Entities and Topics in News Articles Using Statistical Topic Models

  • David Newman
  • Chaitanya Chemudugunta
  • Padhraic Smyth
  • Mark Steyvers
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3975)


Statistical language models can learn relationships between topics discussed in a document collection and persons, organizations and places mentioned in each document. We present a novel combination of statistical topic models and named-entity recognizers to jointly analyze entities mentioned (persons, organizations and places) and topics discussed in a collection of 330,000 New York Times news articles. We demonstrate an analytic framework that automatically extracts, from a large collection, topics, topic trends, and topics that relate entities.
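As a rough illustration of the general idea, and not the authors' implementation, the sketch below fits a small Latent Dirichlet Allocation model by collapsed Gibbs sampling and then lists the entity tokens that carry the most probability in each topic. It assumes a named-entity recognizer has already run and marked entities with a prefix such as `PER:` or `ORG:`. That prefix convention, the function names, the hyperparameter values and the toy documents are all invented for this example.

```python
import random
from collections import defaultdict


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over lists of tokens (illustrative only)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]            # tokens of doc d in topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                           # total tokens in topic k
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return vocab, topic_word, topic_total


def topic_distribution(topic_word, topic_total, vocab, k, beta=0.01):
    """Smoothed estimate of P(word | topic k) over the whole vocabulary."""
    denom = topic_total[k] + len(vocab) * beta
    return {w: (topic_word[k].get(w, 0) + beta) / denom for w in vocab}


def top_entities(dist, n=3):
    """Most probable entity tokens in a topic (hypothetical 'TYPE:name' tagging)."""
    entities = [(w, p) for w, p in dist.items() if ":" in w]
    return sorted(entities, key=lambda pair: -pair[1])[:n]


# Invented toy documents; entity tokens carry a made-up type prefix.
docs = [
    ["election", "vote", "campaign", "PER:alice", "ORG:party_a"],
    ["vote", "campaign", "poll", "PER:alice", "PER:bob"],
    ["game", "season", "coach", "ORG:team_x", "PER:carol"],
    ["season", "score", "game", "ORG:team_x", "PER:dave"],
]

vocab, topic_word, topic_total = fit_lda(docs, num_topics=2)
for k in range(2):
    dist = topic_distribution(topic_word, topic_total, vocab, k)
    print(k, top_entities(dist, n=2))
```

Entities that rank highly in the same topic are related through that topic. Treating entity tokens as ordinary vocabulary items is the simplest possible choice here and is an assumption of this sketch; the paper's abstract does not specify how entities enter the model.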


Topic Model; Latent Dirichlet Allocation; News Article; Latent Semantic Analysis; Latent Semantic Indexing
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • David Newman (1)
  • Chaitanya Chemudugunta (1)
  • Padhraic Smyth (1)
  • Mark Steyvers (2)

  1. Department of Computer Science, UC Irvine, Irvine
  2. Department of Cognitive Science, UC Irvine, Irvine
