Semantic Marking Method for Non-text Documents of Website Based on Their Context in Hypertext Clustering

  • Conference paper
Recent Research in Control Engineering and Decision Making (ICIT 2019)

Abstract

Initial indexing and structuring of information on the Internet are prerequisites for effective search, that is, for finding the information that best matches a user’s query. These tasks rely mainly on text-based processing methods, which are time-consuming. The hyperlinked structure of the web offers an alternative approach, but websites also contain information in non-text formats (images, movies, PDF files, etc.). Such documents are intended primarily for human perception rather than for automated processing. In this article, we propose a method that addresses this problem through semantic marking of non-text documents based on their context in hypertext clustering. In doing so, we develop an approach to context-independent semantic clustering of a website using web-analytics information; the approach exploits the internal hypertext structure and user behavior statistics and does not require full-text content analysis. For this purpose, we represent the hypertext structure of the site as a graph and apply flow simulation algorithms to cluster it. We then describe each cluster semantically by a set of keywords. Non-text documents are connected by hyperlinks to some of the web clusters, so we take the keywords extracted for the related cluster as the semantic marking of the document. We tested the proposed method on the site sstu.ru.
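The pipeline outlined in the abstract (site graph, flow-simulation clustering, keyword description of clusters, marking of non-text documents) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes Markov clustering (MCL) as the flow simulation algorithm, assumes that edge weights stand in for web-analytics transition counts, and assumes that each page already has a short keyword list (for example from titles or anchor text). The names `markov_cluster`, `cluster_keywords`, `mark_document`, and all sample data are hypothetical.

```python
from collections import Counter


def normalise_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_cluster(weights, n, inflation=2.0, iterations=30):
    """Cluster pages 0..n-1 by simulated flow (assumed: MCL).

    weights maps (page_a, page_b) to a link weight, e.g. a hypothetical
    count of user transitions; links are treated as undirected.
    """
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in weights.items():
        m[a][b] += w
        m[b][a] += w
    for i in range(n):
        m[i][i] += 1.0  # self-loops, as is customary for MCL
    m = normalise_columns(m)
    for _ in range(iterations):
        # Expansion: flow spreads along paths of length two.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strong flows are boosted, weak flows fade out.
        m = normalise_columns([[x ** inflation for x in row] for row in m])
    clusters = []
    for row in m:
        # A row with non-zero entries belongs to an attractor page;
        # the non-zero columns are the pages it attracts.
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return clusters


def cluster_keywords(cluster, page_keywords, top=3):
    """Describe a cluster by its most frequent page keywords."""
    counts = Counter(kw for page in sorted(cluster)
                     for kw in page_keywords.get(page, []))
    return sorted(counts, key=lambda kw: (-counts[kw], kw))[:top]


def mark_document(linking_pages, clusters, page_keywords, top=3):
    """Mark a non-text document with the keywords of every cluster
    that contains a page linking to it."""
    marks = []
    for cluster in clusters:
        if cluster & set(linking_pages):
            for kw in cluster_keywords(cluster, page_keywords, top):
                if kw not in marks:
                    marks.append(kw)
    return marks


if __name__ == "__main__":
    # Hypothetical site: two topical groups joined by one weak link (2, 3).
    links = {(0, 1): 5, (1, 2): 5, (0, 2): 5,
             (3, 4): 5, (4, 5): 5, (3, 5): 5, (2, 3): 1}
    keywords = {0: ["admission", "exam"], 1: ["admission"], 2: ["exam"],
                3: ["research", "lab"], 4: ["research"], 5: ["lab"]}
    found = markov_cluster(links, 6)
    # An image linked only from page 4 inherits that cluster's keywords.
    print(found, mark_document([4], found, keywords))
```

The inflation exponent controls cluster granularity in MCL: larger values tend to produce more, smaller clusters. The dense list-of-lists matrix keeps the sketch dependency-free and is only practical for small graphs; a real site would need a sparse representation.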

Author information

Corresponding author

Correspondence to Sergey Papshev.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Papshev, S., Sytnik, A., Melnikova, N., Bogomolov, A. (2019). Semantic Marking Method for Non-text Documents of Website Based on Their Context in Hypertext Clustering. In: Dolinina, O., Brovko, A., Pechenkin, V., Lvov, A., Zhmud, V., Kreinovich, V. (eds) Recent Research in Control Engineering and Decision Making. ICIT 2019. Studies in Systems, Decision and Control, vol 199. Springer, Cham. https://doi.org/10.1007/978-3-030-12072-6_26
