Mingling of Clear and Muddy Water: Understanding and Detecting Semantic Confusion in Blackhat SEO

Part of the Lecture Notes in Computer Science book series (LNCS, volume 12972)

Abstract

Search Engine Optimization (SEO) is a set of techniques that help website operators increase the visibility of their webpages to search engine users. However, many unethical practices, collectively called blackhat SEO, abuse the ranking algorithms of search engines to promote illegal online content. In this paper, we make the first attempt to systematically investigate a recent trend in blackhat SEO, semantic confusion, which mingles the content of a webpage to deceive existing blackhat SEO detection. In particular, from a new perspective of content semantics, we propose an effective defense against semantic-confusion-based blackhat SEO. We built a prototype of our defense called SCDS and validated its effectiveness on 4.5 million domains randomly selected from 11 zone files and passive DNS records. Our evaluation results show that SCDS can detect more than 82 thousand blackhat SEO websites with a precision of 98.35%. We further analyzed 57,477 long-tail keywords promoted by blackhat SEO and found more than 157 SEO campaigns. Finally, we deployed SCDS at the gateway of a campus network for ten months and detected 23,093 domains with malicious semantic confusion content, showing the effectiveness of SCDS in practice.

Notes

  1. A dataset widely used in text-related studies (http://thuctc.thunlp.org/ [in Chinese]). Although the dataset was compiled from news pages dated 2005 to 2011, the semantics of the language remain largely stable, so the accuracy of text classification still holds.

  2. http://magnet-uri.sourceforge.net/, a URI scheme used in P2P file sharing that allows resources to be referred to without an available host.

  3. https://www.bittorrent.com/, a popular P2P file-sharing tool based on the distributed hash table (DHT) method.

  4. QiAnXin is a leading cybersecurity company in China (https://en.qianxin.com).

Author information

Corresponding author

Correspondence to Jia Zhang.

Appendix

A Practices of Semantic Confusion

We further analyzed how mingled semantics are embedded in SEO pages and identified three main categories: (1) Modify only the <title> tag and keep all other parts of the page unchanged. Search engines often pay more attention to the <title> tag, so text in the title has a higher probability of being indexed; moreover, blackhat SEOers want to avoid the detection that broader page modifications or mixed illegal content would invite. (2) Embed the same promotion keywords repeatedly in different paragraphs. A moderate amount of repetition helps search engines extract the keywords and rank them higher. These first two methods mainly target search engines. (3) Replace an entire paragraph with promotion content. This category mainly targets visitors and aims to attract them immediately upon arrival at the webpage. The replaced content is short-lived because it is easily noticed by search engines and webmasters.
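To make the three categories concrete, the following minimal sketch guesses which practice a page most resembles from where a given set of promotion keywords appears. It is an illustration only and not the SCDS detector: the function name `guess_practice`, the keyword list, and the `replace_ratio` threshold are all assumptions introduced here, and the rule order (paragraph replacement, then repetition, then title-only) is a simplification.

```python
from html.parser import HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._inside = None  # "title", "p", or None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._inside = "title"
        elif tag == "p":
            self._inside = "p"
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None

    def handle_data(self, data):
        if self._inside == "title":
            self.title += data
        elif self._inside == "p":
            self.paragraphs[-1] += data


def guess_practice(html, keywords, replace_ratio=0.5):
    """Return 1, 2, or 3 for the categories above, or None if no rule fires.

    `keywords` and `replace_ratio` are hypothetical inputs chosen for
    illustration; they are not values reported in the paper.
    """
    parser = TitleAndParagraphs()
    parser.feed(html)
    parser.close()

    def covered_chars(text):
        # Number of characters in `text` that belong to keyword occurrences.
        return sum(len(k) * text.count(k) for k in keywords)

    paragraphs = [p.strip() for p in parser.paragraphs if p.strip()]
    coverage = [covered_chars(p) / len(p) for p in paragraphs]

    if any(c >= replace_ratio for c in coverage):
        return 3  # a paragraph consists mostly of promotion keywords
    if sum(c > 0 for c in coverage) >= 2:
        return 2  # the same keywords recur across several paragraphs
    if covered_chars(parser.title) > 0 and not any(coverage):
        return 1  # keywords appear in <title> only
    return None
```

A real detector would need semantic features rather than exact keyword matches, since the promoted terms are often unknown in advance; the sketch only shows how the three placements differ structurally.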

B Keyword Promotion in Other Platforms

GitHub. In our study, we found that some blackhat SEOers promote GitHub repositories that advertise gambling software development services. Specifically, when we searched “github.com+[gambling keywords]” on Google, the results showed many GitHub repositories introducing gambling software, and the descriptions of these repositories included the developers’ contact information (e.g., phone numbers and IM IDs). Another practice is to place a large number of gambling keywords in a repository’s introduction so that search engines index them. For example, Fig. 12 shows the Google search results for a gambling keyword “ ” (a dark jargon term that means “Live Dealer Casino Games” in the underground gambling business) combined with “github.com”. The top results are mostly GitHub repositories that promote gambling development services.
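The query pattern “github.com+[gambling keywords]” can be expanded mechanically from a keyword list. The sketch below assumes a hypothetical helper `build_queries` and uses pk10 as a sample keyword; it only builds the query strings and does not submit them to any search engine.

```python
def build_queries(site, keywords):
    """Return one 'site+keyword' query string per keyword.

    The helper name and the sample inputs are illustrative assumptions;
    only the 'site+keyword' pattern comes from the text.
    """
    return [f"{site}+{keyword}" for keyword in keywords]


if __name__ == "__main__":
    for query in build_queries("github.com", ["pk10"]):
        print(query)
```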

Fig. 12. Gambling software development in GitHub.

Fig. 13. Gambling shop in JingDong (jd.com).

E-Commerce. We also found that the keywords promoted by blackhat SEO pages were used not only in search engines but also on E-Commerce websites. For example, when we searched for the most frequent keyword, pk10, on jd.com (a well-known E-Commerce website in China), the results included shops that provide illegal gambling software development services, as shown in Fig. 13. Therefore, we recommend that E-Commerce websites also identify and remove such underground-business activities from their search results.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, H. et al. (2021). Mingling of Clear and Muddy Water: Understanding and Detecting Semantic Confusion in Blackhat SEO. In: Bertino, E., Shulman, H., Waidner, M. (eds) Computer Security – ESORICS 2021. ESORICS 2021. Lecture Notes in Computer Science, vol 12972. Springer, Cham. https://doi.org/10.1007/978-3-030-88418-5_13

  • DOI: https://doi.org/10.1007/978-3-030-88418-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88417-8

  • Online ISBN: 978-3-030-88418-5

  • eBook Packages: Computer Science, Computer Science (R0)