Using LDA and Time Series Analysis for Timestamping Documents

Chiru, Costin-Gabriel; Sarker, Bishnu

doi:10.1007/978-3-319-55789-2_4

Part of the book series: Contributions to Statistics (CONTRIB.STAT.)

Included in the following conference series:

International Work-Conference on Time Series Analysis

Abstract

Identifying when a book was published is an important problem: it can help with authorship identification and can shed light on the realities of human society in different periods. In this paper, we present an attempt to estimate the publication date of books based on time series analysis of their content. The main assumption of this experiment is that the subject of a book is often specific to a time period. Therefore, given many training books from similar periods, topic modeling can plausibly be used to learn a model for timestamping other books. To validate the assumption, we built a corpus of 10 thousand books and used Latent Dirichlet Allocation (LDA) to extract their topics. Then, we extracted the time series of particular terms from each topic using the Google Books N-gram Corpus. By heuristically combining the words’ time series and the topics of a document, we built that document’s time series. Finally, we applied peak detection algorithms to timestamp the document.

Acknowledgements

This work has been funded by University Politehnica of Bucharest, through the “Excellence Research Grants” Program, UPB – GEX. Identifier: UPB–EXCELENȚĂ–2016 Aplicarea metodelor de învățare automată în analiza seriilor de timp (Applying machine learning techniques in time series analysis), Contract number 09/26.09.2016.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University Politehnica of Bucharest, 313 Splaiul Independenței, Bucharest, Romania
Costin-Gabriel Chiru & Bishnu Sarker

Corresponding author

Correspondence to Costin-Gabriel Chiru.

Editor information

Editors and Affiliations

CITIC-UGR, University of Granada, Granada, Spain
Ignacio Rojas
CITIC-UGR, University of Granada, Granada, Spain
Héctor Pomares
CITIC-UGR, University of Granada, Granada, Spain
Olga Valenzuela

About this paper

Cite this paper

Chiru, CG., Sarker, B. (2017). Using LDA and Time Series Analysis for Timestamping Documents. In: Rojas, I., Pomares, H., Valenzuela, O. (eds) Advances in Time Series Analysis and Forecasting. ITISE 2016. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-55789-2_4

DOI: https://doi.org/10.1007/978-3-319-55789-2_4
Published: 03 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55788-5
Online ISBN: 978-3-319-55789-2
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0)