Abstract
The explosive growth in internet users and connected devices has increased the threat surface. However, no single website or search engine provides consolidated information on vulnerabilities, threats, attacks, controls, and related topics. Ambiguity, bias, and lack of credibility are serious concerns when generic search engines are used for sensitive topics such as ‘Health’ and ‘Information Security’. A dedicated information security search engine would benefit stakeholders including security professionals, researchers, governments, and regulators. We implemented a fine-grained approach that identifies sub-domains of information security, extracts related URLs and content, and assesses the credibility of search results, with the aim of encouraging adoption of an information security specific search engine.
To identify sub-domains and extract seed and child URLs, we implemented a fine-grained approach that extends an efficient Artificial Bee Colony algorithm. A total of 34,007 seed URLs and 400,726 child URLs across various sub-domains of information security were extracted. The proposed approach identified more sub-domain URLs (seed and child) than existing approaches while consuming fewer computing resources.
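For readers unfamiliar with the Artificial Bee Colony (ABC) algorithm mentioned above, the following is a minimal sketch of the standard, textbook ABC loop (employed, onlooker, and scout phases) applied to a toy numeric minimisation problem. It is not the authors' extended variant and does not crawl URLs; the function names `abc_minimize` and `sphere`, the parameter defaults, and the toy cost function are all assumptions chosen for illustration.

```python
import random


def sphere(x):
    """Toy cost function (assumption): sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def abc_minimize(cost, dim, bounds, n_sources=10, limit=20, iterations=200, seed=0):
    """Standard ABC sketch: returns (best_vector, best_cost)."""
    rng = random.Random(seed)
    lo, hi = bounds

    def random_source():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    def fitness(c):
        # Usual ABC transform so that a lower cost gives a higher selection weight.
        return 1.0 / (1.0 + c) if c >= 0 else 1.0 + abs(c)

    sources = [random_source() for _ in range(n_sources)]
    costs = [cost(s) for s in sources]
    trials = [0] * n_sources  # consecutive failed improvements per source

    def try_neighbour(i):
        # Perturb one dimension of source i relative to a random other source k.
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        candidate = list(sources[i])
        candidate[d] += rng.uniform(-1.0, 1.0) * (sources[i][d] - sources[k][d])
        candidate[d] = min(hi, max(lo, candidate[d]))
        c = cost(candidate)
        if c < costs[i]:  # greedy selection
            sources[i], costs[i], trials[i] = candidate, c, 0
        else:
            trials[i] += 1

    best_cost = min(costs)
    best = list(sources[costs.index(best_cost)])

    for _ in range(iterations):
        # Employed bees: one local search step per food source.
        for i in range(n_sources):
            try_neighbour(i)
        # Onlooker bees: revisit sources with probability proportional to fitness.
        weights = [fitness(c) for c in costs]
        for _ in range(n_sources):
            try_neighbour(rng.choices(range(n_sources), weights=weights)[0])
        # Remember the best source before scouts may discard it.
        i_best = min(range(n_sources), key=costs.__getitem__)
        if costs[i_best] < best_cost:
            best_cost, best = costs[i_best], list(sources[i_best])
        # Scout bees: abandon sources that have stopped improving.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = random_source()
                costs[i] = cost(sources[i])
                trials[i] = 0

    return best, best_cost


best, best_cost = abc_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print(best, best_cost)
```

In a crawling setting, a food source would presumably correspond to a candidate URL or sub-domain and the cost to some relevance measure, but that mapping is not specified in the abstract and is left out here.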
The research literature on web page ranking and credibility points to a need for fine-grained assessment of search results based on surface, content, and off-page features. Using these fine-grained features, web pages were classified into genres with a Gradient Boosted Decision Tree algorithm at an accuracy of 88.75%. Based on features and genres, a FACT score was formulated to rank web pages by credibility. An open-source WEBCred framework was developed to calculate the FACT score of 10,429 URLs in the information security domain. Compared against the Web of Trust score and Alexa ranking, the results are promising.
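The abstract does not give the FACT score formula, so the sketch below is not the FACT score. It only illustrates the general idea of ranking pages by a genre-dependent weighted sum of feature-group scores. The genre names, the weights in `GENRE_WEIGHTS`, the `Page` class, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical weights per genre for the three feature groups named in the text.
GENRE_WEIGHTS = {
    "article": {"surface": 0.2, "content": 0.5, "off_page": 0.3},
    "help": {"surface": 0.4, "content": 0.4, "off_page": 0.2},
}


@dataclass
class Page:
    url: str
    genre: str
    features: dict  # feature group -> normalised score in [0, 1]


def credibility_score(page, weights=GENRE_WEIGHTS):
    """Weighted mean of feature-group scores, using the page genre's weights."""
    w = weights[page.genre]
    total = sum(w.values())
    return sum(w[name] * page.features.get(name, 0.0) for name in w) / total


def rank_pages(pages):
    """Return pages sorted from most to least credible under this toy score."""
    return sorted(pages, key=credibility_score, reverse=True)


pages = [
    Page("https://example.org/a", "article",
         {"surface": 0.9, "content": 0.4, "off_page": 0.5}),
    Page("https://example.org/b", "article",
         {"surface": 0.3, "content": 0.9, "off_page": 0.6}),
]
for p in rank_pages(pages):
    print(p.url, round(credibility_score(p), 2))
```

Making the weights depend on genre reflects the text's statement that both features and genres feed the score; how the authors actually combine them would need to be taken from the chapter itself.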
A Security Information and Extraction eNgine (SIREN) was developed and hosted to demonstrate the proposed approaches. SIREN is expected to be integrated into the Indian Banks’ Centre for Analysis of Risks and Threats platform so that banks can use it for threat intelligence.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Sanagavarapu, L.M., Reddy, Y.R., Agrawal, S. (2021). SIREN: A Fine Grained Approach to Develop Information Security Search Engine. In: Daimi, K., Peoples, C. (eds) Advances in Cybersecurity Management. Springer, Cham. https://doi.org/10.1007/978-3-030-71381-2_16
DOI: https://doi.org/10.1007/978-3-030-71381-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71380-5
Online ISBN: 978-3-030-71381-2
eBook Packages: Computer Science, Computer Science (R0)