SIREN: A Fine Grained Approach to Develop Information Security Search Engine

Sanagavarapu, Lalit Mohan; Reddy, Y. Raghu; Agrawal, Shriyansh

doi:10.1007/978-3-030-71381-2_16



Abstract

The explosive growth in internet users and connected devices has enlarged the threat surface. However, no single website or search engine provides consolidated information on vulnerabilities, threats, attacks, controls and related topics. Ambiguity, bias and lack of credibility are some of the alarming issues with generic search engines on sensitive topics such as ‘Health’ and ‘Information Security’. A dedicated information security search engine benefits various stakeholders, including security professionals, researchers, governments and regulators. We implemented a fine-grained approach that identifies sub-domains of information security, extracts related URLs and content, and assesses the credibility of search results to encourage adoption of an information security specific search engine.

To identify sub-domains and extract seed and child URLs, we implemented a fine-grained approach that extends an efficient Artificial Bee Colony (ABC) algorithm. About 34,007 seed URLs and 400,726 child URLs were extracted across various sub-domains of information security. The proposed approach identified more seed and child URLs for the sub-domains than existing approaches while consuming fewer computing resources.
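For readers unfamiliar with the base algorithm, the following is a minimal sketch of a generic Artificial Bee Colony search, assuming a simple numeric objective to minimize. It is not the authors' extended variant and does not crawl URLs; the function name `abc_minimize` and all parameters (`n_sources`, `limit`, `iterations`, `seed`) are hypothetical choices for illustration. In the URL setting, a "food source" would loosely correspond to a candidate seed URL and the objective to some relevance measure, which this preview does not specify.

```python
import random


def abc_minimize(objective, dim, bounds, n_sources=10, limit=20,
                 iterations=200, seed=0):
    """Generic ABC search (illustrative). Returns (best_position, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds

    def random_source():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    def fitness(value):
        # Common ABC fitness transform for minimization problems.
        return 1.0 / (1.0 + value) if value >= 0 else 1.0 + abs(value)

    sources = [random_source() for _ in range(n_sources)]
    values = [objective(s) for s in sources]
    trials = [0] * n_sources  # consecutive failed improvements per source
    best_idx = min(range(n_sources), key=values.__getitem__)
    best_pos, best_val = list(sources[best_idx]), values[best_idx]

    def try_neighbour(i):
        nonlocal best_pos, best_val
        # Perturb one coordinate relative to a randomly chosen other source.
        k = rng.choice([j for j in range(n_sources) if j != i])
        d = rng.randrange(dim)
        candidate = list(sources[i])
        candidate[d] += rng.uniform(-1.0, 1.0) * (sources[i][d] - sources[k][d])
        candidate[d] = min(hi, max(lo, candidate[d]))
        cand_val = objective(candidate)
        if cand_val < values[i]:  # greedy selection
            sources[i], values[i], trials[i] = candidate, cand_val, 0
            if cand_val < best_val:
                best_pos, best_val = list(candidate), cand_val
        else:
            trials[i] += 1

    for _ in range(iterations):
        # Employed bees: one neighbourhood search per food source.
        for i in range(n_sources):
            try_neighbour(i)
        # Onlooker bees: revisit sources with probability proportional to fitness.
        weights = [fitness(v) for v in values]
        for _ in range(n_sources):
            try_neighbour(rng.choices(range(n_sources), weights=weights)[0])
        # Scout bees: abandon sources that have stopped improving.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i] = random_source()
                values[i] = objective(sources[i])
                trials[i] = 0
                if values[i] < best_val:
                    best_pos, best_val = list(sources[i]), values[i]

    return best_pos, best_val


if __name__ == "__main__":
    # Toy objective: squared distance from the origin.
    position, value = abc_minimize(lambda x: sum(v * v for v in x),
                                   dim=2, bounds=(-5.0, 5.0))
    print(position, value)
```

The three phases (employed, onlooker, scout) are the part that carries over to other problems; how sources are encoded and scored is problem-specific.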

The research literature on web page ranking and credibility identifies a need for fine-grained assessment of search results based on surface, content and off-page features. Using these fine-grained features, web pages were classified into genres with a Gradient Boosted Decision Tree algorithm at an accuracy of 88.75%. Based on features and genres, a FACT score was formulated to rank web pages by credibility. An open-source WEBCred framework was developed to calculate the FACT score of 10,429 URLs in the information security domain. Compared against Web of Trust scores and Alexa rankings, the results are promising.
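This preview does not give the FACT formula, so the following is only a generic illustration of ranking pages by a genre-dependent weighted sum over the three feature groups named above. The genre names, the weights, the `Page` class, and the functions `credibility_score` and `rank_pages` are all hypothetical and are not taken from WEBCred.

```python
from dataclasses import dataclass, field

# Hypothetical per-genre weights over the feature groups from the abstract.
# The actual FACT weights are not stated in this preview.
GENRE_WEIGHTS = {
    "article": {"surface": 0.2, "content": 0.6, "off_page": 0.2},
    "forum": {"surface": 0.3, "content": 0.3, "off_page": 0.4},
}


@dataclass
class Page:
    url: str
    genre: str
    # Feature-group values, assumed to be normalized to [0, 1].
    features: dict = field(default_factory=dict)


def credibility_score(page):
    """Weighted average of feature-group values using the page's genre weights."""
    weights = GENRE_WEIGHTS[page.genre]
    total = sum(weights.values())
    weighted = sum(w * page.features.get(name, 0.0)
                   for name, w in weights.items())
    return weighted / total


def rank_pages(pages):
    """Return pages ordered from most to least credible under this toy score."""
    return sorted(pages, key=credibility_score, reverse=True)
```

A score of this shape lets the same feature count for more or less depending on genre, which is one plausible reason to classify genre before scoring.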

A Security Information Retrieval and Extraction eNgine (SIREN) was developed and hosted to demonstrate the proposed approaches. SIREN is expected to be integrated into the Indian Banks’ Centre for Analysis of Risks and Threats platform so that banks can use it for threat intelligence.



Author information

Authors and Affiliations

Software Engineering Research Centre, IIIT Hyderabad, Hyderabad, India
Lalit Mohan Sanagavarapu, Y. Raghu Reddy & Shriyansh Agrawal


Corresponding author

Correspondence to Lalit Mohan Sanagavarapu.

Editor information

Editors and Affiliations

University of Detroit Mercy, Detroit, MI, USA
Kevin Daimi
Ulster University, Newtownabbey, UK
Cathryn Peoples


About this chapter

Cite this chapter

Sanagavarapu, L.M., Reddy, Y.R., Agrawal, S. (2021). SIREN: A Fine Grained Approach to Develop Information Security Search Engine. In: Daimi, K., Peoples, C. (eds) Advances in Cybersecurity Management. Springer, Cham. https://doi.org/10.1007/978-3-030-71381-2_16


DOI: https://doi.org/10.1007/978-3-030-71381-2_16
Published: 23 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71380-5
Online ISBN: 978-3-030-71381-2
eBook Packages: Computer Science, Computer Science (R0)
