Abstract
Identifying the moment of time when a book was published is an important problem that might help solving the problem of authorship identification and could also shed some light into identifying the realities of the human society during different periods of time. In this paper, we present an attempt to estimate the publication date of books based on the time series analysis of their content. The main assumption of this experiment is that the subject of a book is often specific to a time period. Therefore, it is likely to use topic modeling to learn a model that might be used to timestamp different books, given for training many books from similar periods of time. To validate the assumption, we built a corpus of 10 thousand books and used LDA to extract the topics from them. Then, we extracted the time series of particular terms from each topic using Google Books N-gram Corpus. By heuristically combining the words’ time series and the topics from a document, we have built that document’s time series. Finally, we applied peak detection algorithms to timestamp the document.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Chen, E.: Introduction to Latent Dirichlet Allocation. http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/22 Aug 2011
AlSumait, L., Barbará, D., Domeniconi, C.: On-line lda: adaptive topic models for mining text streams with applications to topic detection and tracking. In: Data Mining, 2008. ICDM’08, pp. 3–12 (2008)
Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., The Google Books Team, Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M.A., Aiden, E.L.: Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176–182 (2011)
Sparavigna, A.C., Marazzato, R.: Using Google Ngram viewer for scientific referencing and history of science. arXiv preprint arXiv:1512.01364 (2015)
Montagne, M., Morgan, M.: Drugs on the internet, part IV: Google’s Ngram viewer analytic tool applied to drug literature. Subst. Use Misuse 48(5), 415–419 (2013)
Patrick, J.: Using the Google N-Gram corpus to measure cultural complexity. Literary Linguist. Comput. 28(4), 668–675 (2013)
Koplenig, A.: The impact of lacking metadata for the measurement of cultural and linguistic change using the Google ngram data set—reconstructing the composition of the german corpus in times of WWII. In: Digital Scholarship in the Humanities, fqv037 (2015)
Islam, A., Mei, J., Milios, E.E., Keselj, V.: When was macbeth written? mapping book to time. In: Computational Linguistics and Intelligent Text Processing. Springer International Publishing, pp. 73–84 (2015)
Szymanski, T., Lynch, G.: UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams. SemEval 2015, 879–883 (2015)
Garcia-Fernandez, A., Ligozat, A.-L., Dinarelli, M., Bernhard, D.: When was it written? automatically determining publication dates. In: String Processing and Information Retrieval, pp. 221–236 (2011)
Popa, T., Rebedea, T., Chiru, C.: Detecting and describing historical periods in a large corpora. ICTAI 2014, 764–770 (2014)
Yusuke, S.: PDFMiner. http://euske.github.io/pdfminer/index.html (2008)
Digital Research Infrastructure for the Arts and Humanities: Topic modeling with MALLET. https://de.dariah.eu/tatom/topic_model_mallet.html#topic-model-mallet (2015)
Ankarloo, B., Clark, S., Monter, W.: Witchcraft and magic in Europe. The Athlone Press (2002)
Acknowledgements
This work has been funded by University Politehnica of Bucharest, through the “Excellence Research Grants” Program, UPB – GEX. Identifier: UPB–EXCELENȚĂ–2016 Aplicarea metodelor de învățare automată în analiza seriilor de timp (Applying machine learning techniques in time series analysis), Contract number 09/26.09.2016.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Chiru, CG., Sarker, B. (2017). Using LDA and Time Series Analysis for Timestamping Documents. In: Rojas, I., Pomares, H., Valenzuela, O. (eds) Advances in Time Series Analysis and Forecasting. ITISE 2016. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-55789-2_4
Download citation
DOI: https://doi.org/10.1007/978-3-319-55789-2_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55788-5
Online ISBN: 978-3-319-55789-2
eBook Packages: Mathematics and StatisticsMathematics and Statistics (R0)