Abstract
Language is constantly changing, with words being created or disappearing over time. Moreover, the usage of words fluctuates under influences from different fields, such as historical events, cultural movements, or scientific discoveries. These changes are reflected in written texts, so by tracking them one can estimate when a text was written. In this paper, we present an application based on time series analysis, built on top of the Google Books N-gram corpus, that determines the time stamp of written texts. The application uses two heuristics: word fingerprinting, which finds the time interval in which a word was most probably used, and word importance for the given text, which weights the influence of each word's fingerprint on the estimated time stamp. Combining the two heuristics yields the time stamp of the text.
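The combination of the two heuristics can be illustrated with a minimal sketch. It is not the implementation described in the paper: the function names, the entropy-based importance weight, and the toy frequency table are all assumptions made for illustration. A real system would read yearly frequencies from the Google Books N-gram corpus. Here a word's fingerprint is taken to be its yearly frequency series normalized into a distribution over years, and its importance is assumed to be its count in the text scaled by how peaked that distribution is.

```python
import math
from collections import Counter


def fingerprint(series):
    """Normalize a word's yearly frequency series into a distribution over years."""
    total = sum(series.values())
    if total == 0:
        return {}
    return {year: freq / total for year, freq in series.items()}


def importance(count, fp, n_years):
    """Hypothetical weight: count in the text times (1 - normalized entropy).

    Words spread evenly over all years get a weight near 0;
    words concentrated in a few years get a weight near their count.
    """
    if not fp or n_years < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in fp.values() if p > 0)
    return count * max(0.0, 1.0 - entropy / math.log(n_years))


def estimate_year(tokens, yearly_freq, years):
    """Return the year with the highest importance-weighted fingerprint score."""
    scores = {year: 0.0 for year in years}
    for word, count in Counter(tokens).items():
        if word not in yearly_freq:
            continue  # word has no time series, so it carries no evidence
        fp = fingerprint(yearly_freq[word])
        weight = importance(count, fp, len(years))
        for year, p in fp.items():
            if year in scores:
                scores[year] += weight * p
    return max(scores, key=scores.get)


# Toy data (invented for illustration, not taken from the corpus).
YEARS = [1900, 1950, 2000]
YEARLY_FREQ = {
    "the": {1900: 5, 1950: 5, 2000: 5},        # flat usage: uninformative
    "telegraph": {1900: 8, 1950: 2, 2000: 0},  # concentrated in early years
    "internet": {1900: 0, 1950: 0, 2000: 9},   # concentrated in late years
}

early = estimate_year(["the", "telegraph", "the"], YEARLY_FREQ, YEARS)
late = estimate_year(["the", "internet"], YEARLY_FREQ, YEARS)
```

In this toy setting the flat series for "the" receives a weight of approximately zero, so the estimate is driven by the words whose usage is concentrated in time. Other importance measures, such as a TF-IDF style weight, could be substituted without changing the overall structure.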
Acknowledgements
This work has been funded by University Politehnica of Bucharest, through the “Excellence Research Grants” Program, UPB–GEX. Identifier: UPB–EXCELENȚĂ–2016 Aplicarea metodelor de învățare automată în analiza seriilor de timp (Applying machine learning techniques in time series analysis), Contract number 09/26.09.2016.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Chiru, CG., Toia, M. (2017). Using Time Series Analysis for Estimating the Time Stamp of a Text. In: Rojas, I., Pomares, H., Valenzuela, O. (eds) Advances in Time Series Analysis and Forecasting. ITISE 2016. Contributions to Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-55789-2_3
DOI: https://doi.org/10.1007/978-3-319-55789-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55788-5
Online ISBN: 978-3-319-55789-2
eBook Packages: Mathematics and Statistics (R0)