An Approach to Document Fingerprinting

  • Yunhyong Kim
  • Seamus Ross
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9469)

Abstract

The nature of an individual document is often defined by its relationship to selected tasks, societal values, and cultural meaning. The identifying features, regardless of whether the document content is textual, aural or visual, are often delineated in terms of descriptions about the document, for example, intended audience, coverage of topics, purpose of creation, structure of presentation as well as relationships to other entities expressed by authorship, ownership, production process, and geographical and temporal markers. To secure a comprehensive view of a document, therefore, we must draw heavily on cognitive and/or computational resources not only to extract and classify information at multiple scales, but also to interlink these across multiple dimensions in parallel. Here we present a preliminary thought experiment for fingerprinting documents using textual documents visualised and analysed at multiple scales and dimensions to explore patterns on which we might capitalise.

Keywords

Text analysis; Natural language processing; Patterns; Readability

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. University of Glasgow, Glasgow, UK
  2. University of Toronto, Toronto, Canada
