Improving Analysis and Decision-Making Through Intelligent Web Crawling

  • Jonathan T. McClain
  • Glory Emmanuel Aviña
  • Derek Trumbo
  • Robert Kittinger
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9744)


Analysts across national security domains are required to sift through large amounts of data to find and compile relevant information in a form that enables decision makers to take action in high-consequence scenarios. However, even the most experienced analysts cannot be 100% consistent and accurate across an entire dataset, cannot remain unbiased toward familiar documentation, and cannot synthesize and process large amounts of information in a short time. Sandia National Laboratories has attempted to solve this problem by developing an intelligent web crawler called Huntsman. Huntsman acts as a personal research assistant: it browses the internet or offline datasets in a way similar to the human search process, only much faster (millions of documents per day), submitting queries to search engines and assessing the usefulness of each page result by analyzing its full content with a suite of text analytics. This paper discusses Huntsman's capability to both mirror and enhance human analysts through intelligent web crawling with analysts in the loop. The goal is to demonstrate how weaknesses in human cognitive processing can be compensated for by fusing human processes with text analytics and web crawling systems, which ultimately reduces analysts' cognitive burden and increases mission effectiveness.
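The crawl-and-assess loop described in the abstract can be illustrated with a minimal sketch. This is not Huntsman's implementation, which the abstract does not detail; it is a generic best-first crawl over a small in-memory dataset, assuming a toy relevance score (the fraction of page tokens that are query terms). The names `PAGES`, `relevance`, and `crawl`, the threshold value, and the sample pages are all hypothetical.

```python
import heapq
import re
from collections import Counter

# Hypothetical offline dataset: URL -> (page text, outgoing links).
PAGES = {
    "a": ("web crawling and text analytics for analysts", ["b", "c"]),
    "b": ("cooking recipes and kitchen tips", ["d"]),
    "c": ("focused crawling uses text analytics to rank pages", ["d"]),
    "d": ("text analytics text analytics crawling", []),
}


def relevance(text, query_terms):
    """Toy score (an assumption): fraction of page tokens that are query terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[term] for term in query_terms) / len(tokens)


def crawl(seeds, query_terms, pages, threshold=0.2, limit=100):
    """Best-first crawl: score full page text, follow links only from useful pages."""
    # heapq is a min-heap, so priorities are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    results = []
    while frontier and len(results) < limit:
        _, url = heapq.heappop(frontier)
        text, links = pages.get(url, ("", []))
        score = relevance(text, query_terms)
        if score < threshold:
            continue  # page judged not useful: do not record it or follow its links
        results.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return results


if __name__ == "__main__":
    for url, score in crawl(["a"], {"text", "analytics", "crawling"}, PAGES):
        print(url, round(score, 3))
```

In this sketch the off-topic page `b` is fetched and scored but discarded, so its links are not followed through it; page `d` is still reached through the relevant page `c`. A real system would replace the in-memory dictionary with fetching and parsing, and the toy score with proper text analytics.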


Text analytics · Intelligent web crawling · Decision making · Cognitive consistency



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Jonathan T. McClain (1), corresponding author
  • Glory Emmanuel Aviña (1)
  • Derek Trumbo (1)
  • Robert Kittinger (1)

  1. Sandia National Laboratories, Albuquerque, USA
