Sub-document Timestamping: A Study on the Content Creation Dynamics of Web Documents

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9819)

Abstract

The creation time of documents is an important kind of information in temporal information retrieval, especially for document clustering, timeline construction and search engine improvements. Considering the manner in which content on the Web is created, updated and deleted, the common assumption that each document has only one creation time is not suitable for Web documents. In this paper, we investigate to what extent this assumption is wrong. We introduce two methods to timestamp individual parts (sub-documents) of Web documents and analyze in detail the creation and update dynamics of three classes of Web documents.

Keywords

Timestamping, Sub-documents, Internet Archive

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Delft University of Technology, Delft, The Netherlands
