Abstract
In this paper we evaluate the effect of a document sanitization process on a set of information retrieval metrics in order to measure information loss and risk of disclosure. As an example document set we use a subset of the Wikileaks Cables, consisting of documents relating to five key news items revealed by the cables. To sanitize the documents we have developed a semi-automatic anonymization process that follows the guidelines of Executive Order 13526 (2009) of the US Administration. The process (i) identifies and anonymizes specific person names and data, and (ii) applies concept generalization based on WordNet categories in order to identify words categorized as classified. Finally, we manually revise the text from a contextual point of view and eliminate complete sentences, paragraphs and sections where necessary. We show that a significant degree of sanitization can be applied while maintaining the relevance of the documents to the queries corresponding to the five key news items.
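The idea of comparing retrieval results before and after sanitization can be illustrated with a small sketch. This is not the authors' implementation: the toy documents, the name list, the generalization table (a hand-picked stand-in for a WordNet category lookup), the boolean retrieval function and the choice of precision and recall as metrics are all assumptions made for illustration.

```python
import re

def sanitize(text, names, generalizations):
    """Replace listed names with a placeholder and generalize sensitive terms."""
    for name in names:
        text = re.sub(re.escape(name), "[PERSON]", text, flags=re.IGNORECASE)
    for term, general in generalizations.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, general, text, flags=re.IGNORECASE)
    return text

def retrieve(query, docs):
    """Toy boolean AND retrieval: ids of documents containing every query term."""
    terms = query.lower().split()
    return {
        doc_id
        for doc_id, text in docs.items()
        if all(t in re.findall(r"\w+", text.lower()) for t in terms)
    }

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical documents, names and generalization table (not from the paper).
docs = {
    "d1": "Ambassador John Smith discussed uranium shipments with the minister.",
    "d2": "The minister discussed trade tariffs.",
    "d3": "Uranium enrichment was raised by John Smith.",
}
names = ["John Smith"]
generalizations = {"uranium": "chemical element"}

sanitized = {doc_id: sanitize(text, names, generalizations)
             for doc_id, text in docs.items()}

# Information loss: does a topical query still find its relevant documents?
topic_relevant = retrieve("minister discussed", docs)
topic_after = precision_recall(retrieve("minister discussed", sanitized),
                               topic_relevant)

# Risk of disclosure: does a query on a sensitive term still find anything?
risk_relevant = retrieve("uranium", docs)
risk_after = precision_recall(retrieve("uranium", sanitized), risk_relevant)

print("topical query after sanitization:", topic_after)
print("sensitive query after sanitization:", risk_after)
```

In this toy setting the topical query keeps its results after sanitization, while the query on the sensitive term no longer retrieves anything, which is the kind of trade-off the abstract describes. A realistic evaluation would use a ranked retrieval model and the actual query set rather than exact term matching.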
References
Executive Order 13526, of the US Administration - Classified National Security Information, Section 1.4, points (a) to (h) (2009), http://www.whitehouse.gov/the-press-office/executive-order-classified-national-security-information
Wikileaks Cable repository, http://www.cablegatesearch.net
Chakaravarthy, V.T., Gupta, H., Roy, P., Mohania, M.K.: Efficient Techniques for Document Sanitization. In: CIKM 2008, Napa Valley, California, USA, October 26–30 (2008)
Cumby, C., Ghani, R.: A Machine Learning Based System for Semi-Automatically Redacting Documents. In: Proc. IAAI 2011 (2011)
Sweeney, L.: k-anonymity: a model for protecting privacy. Int. Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (IJUFKS) 10(5), 557–570 (2002)
Hong, T.-P., Lin, C.-W., Yang, K.-T., Wang, S.-L.: A Heuristic Data-Sanitization Approach Based on TF-IDF. In: Mehrotra, K.G., Mohan, C.K., Oh, J.C., Varshney, P.K., Ali, M. (eds.) IEA/AIE 2011, Part I. LNCS, vol. 6703, pp. 156–164. Springer, Heidelberg (2011)
Samelin, K., Pöhls, H.C., Bilzhause, A., Posegga, J., de Meer, H.: Redactable Signatures for Independent Removal of Structure and Content. In: Ryan, M.D., Smyth, B., Wang, G. (eds.) ISPEC 2012. LNCS, vol. 7232, pp. 17–33. Springer, Heidelberg (2012)
Chow, R., Staddon, J.N., Oberst, I.S.: Method and apparatus for facilitating document sanitization. US Patent Application Pub. No. US 2011/0107205 A1, May 5 (2011)
Neamatullah, I., Douglass, M.M., Lehman, L.H., Reisner, A., Villarroel, M., Long, W.J., Szolovits, P., Moody, G.B., Mark, R.G., Clifford, G.D.: Automated de-identification of free-text medical records. BMC Medical Informatics and Decision Making 8, 32 (2008)
Abril, D., Navarro-Arribas, G., Torra, V.: Towards Semantic Microaggregation of Categorical Data for Confidential Documents. In: Torra, V., Narukawa, Y., Daumas, M. (eds.) MDAI 2010. LNCS (LNAI), vol. 6408, pp. 266–276. Springer, Heidelberg (2010)
Abril, D., Navarro-Arribas, G., Torra, V.: On the Declassification of Confidential Documents. In: Torra, V., Narukawa, Y., Yin, J., Long, J. (eds.) MDAI 2011. LNCS (LNAI), vol. 6820, pp. 235–246. Springer, Heidelberg (2011)
Yahoo! News. Top 10 revelations from Wiki Leaks cables, http://news.yahoo.com/blogs/lookout/top-10-revelations-wikileaks-cables.html
Pingar – Entity Extraction Software, http://www.pingar.com
Miller, G.A., Beckwith, R., Fellbaum, C.D., Gross, D., Miller, K.: WordNet: An online lexical database. Int. J. Lexicography 3(4), 235–244 (1990)
Porter, M.F.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval: The Concepts and Technology behind Search, 2nd edn. ACM Press Books (2011) ISBN: 0321416910
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008) ISBN: 0521865719
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nettleton, D.F., Abril, D. (2012). Document Sanitization: Measuring Search Engine Information Loss and Risk of Disclosure for the Wikileaks cables. In: Domingo-Ferrer, J., Tinnirello, I. (eds) Privacy in Statistical Databases. PSD 2012. Lecture Notes in Computer Science, vol 7556. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33627-0_24
DOI: https://doi.org/10.1007/978-3-642-33627-0_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33626-3
Online ISBN: 978-3-642-33627-0
eBook Packages: Computer Science (R0)