Open Source Intelligence Investigation pp 69-93
Acquisition and Preparation of Data for OSINT Investigations
Abstract
Underpinning all open-source intelligence investigations is data. Without data there is nothing to build upon, to combine, to analyse or to draw conclusions from. This chapter outlines some of the processes an investigator can undertake to obtain data from open sources, as well as methods for preparing this data into usable formats for further analysis. First, it discusses the reasons for needing to collect data from open sources. Second, it introduces the different types of data that may be encountered, including unstructured and structured data sources, and where to obtain such data. Third, it reviews methods for information extraction, the first step in preparing data for further analysis. Finally, it covers some of the privacy, legal and ethical good practices that should be adhered to when accessing, interrogating and using open-source data.