Searching the Web for illegal content: the anatomy of a semantic search engine

  • Methodologies and Application
Soft Computing

Abstract

In this paper, we describe the challenges in realizing a semantic search engine designed to help law enforcement agencies fight online drug marketplaces, where new psychoactive substances are sold. The search engine was developed under the Semantic Illegal Content Hunter (SICH) Project, with the financial support of the Prevention of and Fight Against Crime Programme (ISEC 2012) of the European Commission. The specific objective of the SICH Project is to develop new strategic tools and assessment techniques, based on the semantic analysis of texts, to support the dynamic mapping and automatic identification of illegal content on the Net. A Web search engine can be roughly divided into three main components: (a) the crawler, which is in charge of collecting the Web pages to be indexed; (b) the indexer, which parses and stores the collected data; and (c) the query processor, which interacts with the user by parsing a query and returning the relevant documents. In this paper, we detail each of these components of the SICH search engine, highlighting the differences from a traditional Web search engine.
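
The three-component split described in the abstract can be illustrated with a minimal sketch. This is not the SICH implementation: the in-memory `TOY_WEB` corpus, the `.example` URLs and the function names `crawl`, `build_index` and `search` are all hypothetical, and the query processor uses plain keyword matching rather than the semantic analysis the paper refers to.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over the network instead.
TOY_WEB = {
    "a.example": ("research chemicals shop", ["b.example", "c.example"]),
    "b.example": ("forum thread about research chemicals", ["a.example"]),
    "c.example": ("gardening tips", []),
}


def crawl(seed, web):
    """(a) Crawler: breadth-first traversal from a seed URL, collecting page text."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """(b) Indexer: build an inverted index mapping each term to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """(c) Query processor: return the URLs that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


pages = crawl("a.example", TOY_WEB)
index = build_index(pages)
print(search(index, "research chemicals"))
```

In this sketch each stage consumes only the output of the previous one, so a stage could be swapped out (for instance, a semantic indexer in place of the keyword index) without changing the others.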

Notes

  1. http://fives.kau.se/.

  2. http://antipaedo.lip6.fr/.

  3. http://www.i-dash.eu/.

  4. http://scc-sentinel.lancs.ac.uk/icop/.

  5. http://www.psychonautproject.eu/.

  6. http://www.rednetproject.eu/.

  7. http://www.expertsystem.com/.

  8. http://nutch.apache.org/.

References

  • Arapakis I (2015) System and user aspects of web search latency. http://www.slideshare.net/iarapakis/upf15

  • Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval, vol 463. ACM Press, New York

  • Bitcoin (2011) Bitcoin P2P digital currency

  • Brandes U, Gaertler M, Wagner D (2003) Experiments on graph clustering algorithms. Springer, New York

  • Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117

  • Camastra F, Ciaramella A, Staiano A (2013) Machine learning and soft computing for ICT security: an overview of current trends. J Ambient Intell Humaniz Comput 4(2):235–247

  • Cho J, Garcia-Molina H (2002) Parallel crawlers. In: Proceedings of the 11th international conference on World Wide Web. ACM, pp 124–135

  • Corazza O, Assi S, Simonato P, Corkery J, Bersani FS, Demetrovics Z, Stair J, Fergus S, Pezzolesi C, Pasinetti M, Deluca P, Drummond C, Davey Z, Blaszko U, Moskalewicz J, Mervo B, Furia LD, Farre M, Flesland L, Pisarska A, Shapiro H, Siemann H, Skutle A, Sferrazza E, Torrens M, Sambola F, van der Kreeft P, Scherbaum N, Schifano F (2013) Promoting innovation and excellence to face the rapid diffusion of novel psychoactive substances in the EU: the outcomes of the rednet project. Hum Psychopharmacol Clin Exp 28(4):317–323

  • Corazza O, Valeriani G, Bersani FS, Corkery J, Martinotti G, Bersani G, Schifano F (2014) “Spice”, “Kryptonite”, “Black Mamba”: an overview of brand names and marketing strategies of novel psychoactive substances on the Web. J Psychoact Drugs 46(4):287–294

  • Deluca P, Davey Z, Corazza O, Furia LD, Farre M, Flesland LH, Mannonen M, Majava A, Peltoniemi T, Pasinetti M, Pezzolesi C, Scherbaum N, Siemann H, Skutle A, Torrens M, van der Kreeft P, Iversen E, Schifano F (2012) Identifying emerging trends in recreational drug use; outcomes from the psychonaut web mapping project. Prog Neuro Psychopharmacol Biol Psychiatr 39(2):221–226 (new drugs of abuse)

  • Diestel R (2012) Graph theory, Graduate texts in mathematics, vol 173, 4th edn. Springer, Heidelberg

  • Fruchterman TM, Reingold EM (1991) Graph drawing by force-directed placement. Softw Pract Exp 21(11):1129–1164

  • Han X, Ma J, Wu Y, Cui C (2014) A novel machine learning approach to rank web forum posts. Soft Comput 18(5):941–959

  • Hoque E, Hoeber O, Strong G, Gong M (2013) Combining conceptual query expansion and visual search results exploration for web image retrieval. J Ambient Intell Humaniz Comput 4(3):389–400

  • Hout MCV, Bingham T (2013a) Silk Road, the virtual drug marketplace: a single case study of user experiences. Int J Drug Policy 24(5):385–391

  • Hout MCV, Bingham T (2013b) Surfing the Silk Road: a study of users experiences. Int J Drug Policy 24(6):524–529

  • Hout MCV, Bingham T (2014) Responsible vendors, intelligent consumers: Silk road, the online revolution in drug trading. Int J Drug Policy 25(2):183–189

  • Jansen BJ (2006) Adversarial information retrieval aspects of sponsored search. In: AIRWeb, pp 33–36

  • Laura L, Me G (2015) Searching the web for illegal content: the anatomy of a semantic search engine. In: Proceedings of the 10th international conference on global security, safety & sustainability. Springer

  • Maleki-Dizaji S, Siddiqi J, Soltan-Zadeh Y, Rahman F (2014) Adaptive information retrieval system via modelling user behaviour. J Ambient Intell Humaniz Comput 5(1):105–110

  • Nikravesh M, Loia V, Azvine B (2002) Fuzzy logic and the internet (flint): Internet, world wide web, and search engines. Soft Comput 6(5):287–299

  • Ogiela M, Sukowski P (2014) Protocol for irreversible off-line transactions in anonymous electronic currency exchange. Soft Comput 18(12):2587–2594

  • Page L, Brin S, Motwani R, Winograd T (1999) The pagerank citation ranking: bringing order to the web. Technical Report, Stanford InfoLab. http://ilpubs.stanford.edu:8090/422/

  • Pereira RAM, Molinari A, Pasi G (2005) Contextual weighted representations and indexing models for the retrieval of html documents. Soft Comput 9(7):481–492

  • Tor project (2011) Anonymity online. https://www.torproject.org/. Accessed 20 Sept 2012

  • United Nations Office on Drugs and Crime (UNODC) (2014) Global synthetic drugs assessment (United Nations publication, Sales No. E.14.XI.6). https://www.unodc.org/documents/scientific/2014_Global_Synthetic_Drugs_Assessment_web.pdf

  • Witten IH, Moffat A, Bell TC (1999) Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, San Francisco

Acknowledgments

The EU ISEC programme funded the 2-year national project SICH for the consortium formed by Expert System and RiSSC (Centro Ricerche e Studi su Sicurezza e Criminalità, http://www.rissc.it/). A preliminary version of part of this paper appeared in Laura and Me (2015).

Author information

Corresponding author

Correspondence to Luigi Laura.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Laura, L., Me, G. Searching the Web for illegal content: the anatomy of a semantic search engine. Soft Comput 21, 1245–1252 (2017). https://doi.org/10.1007/s00500-015-1857-4

  • DOI: https://doi.org/10.1007/s00500-015-1857-4
