Extracting Semantics from Unconstrained Navigation on Wikipedia

Abstract

Semantic relatedness between words has been successfully extracted from navigation on Wikipedia pages. However, the navigational data used in these works are sparse and are expected to be biased, since they were collected in the context of games. In this paper, we address this limitation and explore whether semantic relatedness can also be extracted from unconstrained navigation. To this end, we first highlight structural differences between unconstrained navigation and game data. Then, we adapt a state-of-the-art approach for extracting semantic relatedness from Wikipedia paths. We apply this approach to transitions derived from two unconstrained navigation datasets as well as to transitions from WikiGame, and compare the results against two common gold standards. We confirm the expected structural differences between unconstrained navigation and the paths collected by WikiGame. In line with this result, the state-of-the-art approach for semantic extraction from navigation data does not yield good results for unconstrained navigation. Yet, we are able to derive a relatedness measure that performs well on both unconstrained navigation data and game data. Overall, we show that unconstrained navigation data on Wikipedia is suited for extracting semantics.
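To make the idea of deriving relatedness from transitions more concrete, the following is a minimal sketch and not the measure developed in the paper. It assumes that transitions are available as (source, target, count) triples, describes each page by the pages it shares a transition with, and compares two pages by cosine similarity. The function names, the symmetric treatment of transitions, and the toy data are all illustrative assumptions.

```python
import math
from collections import defaultdict


def transition_vectors(transitions):
    """Build one sparse count vector per page from (source, target, count) triples.

    Each page is described by the pages it is connected to via a transition,
    in either direction (an illustrative choice, not taken from the paper).
    """
    vectors = defaultdict(lambda: defaultdict(float))
    for source, target, count in transitions:
        vectors[source][target] += count
        vectors[target][source] += count
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy transitions (hypothetical data): (source page, target page, click count)
toy_transitions = [
    ("Cat", "Mammal", 10),
    ("Dog", "Mammal", 8),
    ("Cat", "Pet", 5),
    ("Dog", "Pet", 7),
    ("Car", "Engine", 9),
    ("Car", "Wheel", 4),
]

vectors = transition_vectors(toy_transitions)
cat_dog = cosine(vectors["Cat"], vectors["Dog"])  # pages with shared neighbours
cat_car = cosine(vectors["Cat"], vectors["Car"])  # pages without shared neighbours
print(cat_dog, cat_car)
```

In this toy setting, "Cat" and "Dog" receive a high score because they lead to the same pages, while "Cat" and "Car" share no neighbours and score zero. Scores of this kind could then be compared with human judgments from a gold standard.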

Notes

  1. http://www.thewikigame.com.
  2. http://www.wikispeedia.net.
  3. http://www.thewikigame.com.
  4. http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/.
  5. http://ewulczyn.github.io/Wikipedia_Clickstream_Getting_Started/.
  6. https://dumps.wikimedia.org/enwiki/20150112/.
  7. http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html.
  8. http://clic.cimec.unitn.it/~elia.bruni/MEN.
  9. The performance decrease on fewer evaluation pairs can be explained as follows. On the one hand, with fewer data points to correlate, the correlation task becomes easier, so one might expect the correlation value to rise. On the other hand, each data point with a faulty ordering then has a greater impact on the correlation score. If we remove "good" data points, i.e. points whose ordering agrees well with the gold standard, we are left with the "bad" data points. In this way, leaving out data points can actually decrease the correlation.
  10. The restriction of WikiLink to WikiStream source pages (restricted) should actually contain the same number of matchable evaluation pairs. We attribute this difference to the ever-changing nature of Wikipedia.
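The effect described in note 9 can be reproduced with a small worked example. The sketch below assumes a rank correlation such as Spearman's (an assumption made for illustration) and uses invented, tie-free scores: dropping the three pairs whose ordering agrees with the gold standard lowers the correlation from about 0.94 to 0.5.

```python
def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical gold-standard scores and predicted relatedness scores.
# Only the last two pairs are ordered incorrectly by the prediction.
gold = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predicted = [0.1, 0.2, 0.3, 0.4, 0.6, 0.5]

full = spearman(gold, predicted)              # all six pairs
reduced = spearman(gold[3:], predicted[3:])   # first three "good" pairs removed
print(full, reduced)
```

With all six pairs, the single swapped pair costs little. Once three well-ordered pairs are removed, the same swap dominates the remaining three pairs.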

Acknowledgments

This work is funded by the DFG through the PoSTS II project. We also want to thank Alex Clemesha for providing us with the game data from the WikiGame website.

Corresponding author

Correspondence to Thomas Niebler.

Cite this article

Niebler, T., Schlör, D., Becker, M. et al. Extracting Semantics from Unconstrained Navigation on Wikipedia. Künstl Intell 30, 163–168 (2016). https://doi.org/10.1007/s13218-015-0417-5

Keywords

  • Usage analysis
  • Semantic web