SIREN: A Fine Grained Approach to Develop Information Security Search Engine

Chapter in Advances in Cybersecurity Management

Abstract

The explosive growth in internet users and connected devices has enlarged the threat surface. However, no single website or search engine provides consolidated information on vulnerabilities, threats, attacks, controls, and related topics. Ambiguity, bias and lack of credibility are alarming issues when generic search engines are used for sensitive topics such as ‘Health’ and ‘Information Security’. A dedicated information security search engine would benefit various stakeholders, including security professionals, researchers, government, regulators and others. We implemented a fine-grained approach that identifies sub-domains of information security, extracts related URLs and content, and assesses the credibility of search results to encourage adoption of such a search engine.

To identify sub-domains and extract seed and child URLs, we implemented a fine-grained approach that extends an efficient Artificial Bee Colony algorithm. About 34,007 seed URLs and 400,726 child URLs across various sub-domains of information security were extracted. The proposed approach identified more seed and child URLs of sub-domains than existing approaches while consuming fewer computing resources.
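For readers unfamiliar with the Artificial Bee Colony (ABC) family, the following is a minimal sketch of the standard continuous ABC loop (employed, onlooker and scout phases), not the extended variant used in this work. The function name `abc_maximize`, its parameters and the toy fitness function are assumptions made for illustration; how food sources map to seed URLs or sub-domains is not described in this abstract and is not modelled here.

```python
import random


def abc_maximize(fitness, dim, bounds, n_sources=10, limit=20,
                 iterations=200, seed=0):
    """Maximize `fitness` over a box using a basic Artificial Bee Colony loop."""
    rng = random.Random(seed)
    lo, hi = bounds

    def random_source():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    sources = [random_source() for _ in range(n_sources)]
    scores = [fitness(s) for s in sources]
    trials = [0] * n_sources  # consecutive failed improvements per source

    def try_improve(i):
        # Perturb one coordinate of source i relative to a random other source.
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        phi = rng.uniform(-1.0, 1.0)
        candidate = sources[i][:]
        step = phi * (candidate[d] - sources[k][d])
        candidate[d] = min(hi, max(lo, candidate[d] + step))
        value = fitness(candidate)
        if value > scores[i]:  # greedy selection
            sources[i], scores[i], trials[i] = candidate, value, 0
        else:
            trials[i] += 1

    i_best = max(range(n_sources), key=scores.__getitem__)
    best, best_score = sources[i_best][:], scores[i_best]

    for _ in range(iterations):
        # Employed bees: one local search step per food source.
        for i in range(n_sources):
            try_improve(i)

        # Onlooker bees: revisit sources with probability tied to their score.
        floor = min(scores)
        weights = [s - floor + 1e-9 for s in scores]
        for i in rng.choices(range(n_sources), weights=weights, k=n_sources):
            try_improve(i)

        # Scout bees: abandon sources that stopped improving.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = random_source()
                scores[i] = fitness(sources[i])
                trials[i] = 0

        # Remember the best source seen so far (scouts may discard it).
        i_best = max(range(n_sources), key=scores.__getitem__)
        if scores[i_best] > best_score:
            best, best_score = sources[i_best][:], scores[i_best]

    return best, best_score


if __name__ == "__main__":
    # Toy fitness with a single peak at (1, -2); a stand-in for a relevance measure.
    def toy_fitness(p):
        return -((p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2)

    point, value = abc_maximize(toy_fitness, dim=2, bounds=(-5.0, 5.0))
    print(point, value)
```

The scout phase is what keeps the search from stalling on exhausted sources; a crawler-oriented variant would presumably replace the continuous vector with a discrete candidate such as a URL or keyword set, but that design is an assumption here.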

The research literature on web page ranking and credibility points to a need for fine-grained assessment of search results based on surface, content and off-page features. These fine-grained web page features were used to classify pages into genres with a Gradient Boosted Decision Tree algorithm, achieving an accuracy of 88.75%. Based on features and genres, a FACT score was formulated to rank web pages by credibility. An open-source WEBCred framework was developed to calculate the FACT score of 10,429 URLs in the information security domain. Compared against Web of Trust scores and Alexa rankings, the results are promising.
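The FACT formula itself is not given in this abstract. As a rough illustration of the general idea of ranking pages by a genre-dependent combination of surface, content and off-page features, the sketch below uses a plain weighted sum. The genre names, the weights, the feature values and the function names `credibility_score` and `rank_pages` are all hypothetical and are not taken from WEBCred.

```python
from typing import Dict, List, Tuple

# Hypothetical per-genre weights over the three feature groups.
# Each feature value is assumed to be normalized to the range [0, 1].
GENRE_WEIGHTS: Dict[str, Dict[str, float]] = {
    "article": {"surface": 0.2, "content": 0.5, "off_page": 0.3},
    "forum": {"surface": 0.3, "content": 0.3, "off_page": 0.4},
}


def credibility_score(features: Dict[str, float], genre: str) -> float:
    """Weighted sum of normalized feature groups, using the page genre's weights."""
    weights = GENRE_WEIGHTS[genre]
    return sum(weight * features[name] for name, weight in weights.items())


def rank_pages(pages: List[Tuple[str, str, Dict[str, float]]]) -> List[Tuple[str, float]]:
    """Return (url, score) pairs sorted from most to least credible."""
    scored = [(url, credibility_score(feats, genre)) for url, genre, feats in pages]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Placeholder URLs and made-up feature values, for illustration only.
    demo = [
        ("https://a.example/post", "article",
         {"surface": 1.0, "content": 0.5, "off_page": 0.0}),
        ("https://b.example/thread", "forum",
         {"surface": 0.5, "content": 0.5, "off_page": 1.0}),
    ]
    for url, score in rank_pages(demo):
        print(f"{score:.2f}  {url}")
```

Keeping the weights per genre reflects the abstract's statement that the score depends on both features and genres; any real weighting would need to be derived from labelled data rather than chosen by hand as it is here.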

A Security Information and Extraction eNgine (SIREN) was developed and hosted to demonstrate the proposed approaches. SIREN is expected to be integrated into the Indian Banks’ Centre for Analysis of Risks and Threats platform so that banks can use it for threat intelligence.

Author information

Corresponding author

Correspondence to Lalit Mohan Sanagavarapu.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sanagavarapu, L.M., Reddy, Y.R., Agrawal, S. (2021). SIREN: A Fine Grained Approach to Develop Information Security Search Engine. In: Daimi, K., Peoples, C. (eds) Advances in Cybersecurity Management. Springer, Cham. https://doi.org/10.1007/978-3-030-71381-2_16

  • DOI: https://doi.org/10.1007/978-3-030-71381-2_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-71380-5

  • Online ISBN: 978-3-030-71381-2

  • eBook Packages: Computer Science, Computer Science (R0)
