Neural Article Pair Modeling for Wikipedia Sub-article Matching

  • Muhao Chen
  • Changping Meng
  • Gang Huang
  • Carlo Zaniolo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11053)


Nowadays, editors tend to separate different subtopics of a long Wikipedia article into multiple sub-articles. This separation seeks to improve human readability. However, it also has a deleterious effect on many Wikipedia-based tasks that rely on the article-as-concept assumption, which requires each entity (or concept) to be described solely by one article. This underlying assumption significantly simplifies knowledge representation and extraction, and it is vital to many existing technologies such as automated knowledge base construction, cross-lingual knowledge alignment, semantic search and data lineage of Wikipedia entities. In this paper we provide an approach to match the scattered sub-articles back to their corresponding main-articles, with the intent of facilitating automated Wikipedia curation and processing. The proposed model adopts a hierarchical learning structure that combines multiple variants of neural document pair encoders with a comprehensive set of explicit features. A large crowdsourced dataset is created to support the evaluation and feature extraction for the task. On this dataset, the proposed model achieves promising cross-validation results and significantly outperforms previous approaches. Large-scale serving on the entire English Wikipedia also demonstrates the practicability and scalability of the proposed model by effectively extracting a vast collection of newly paired main and sub-articles. Code related to this paper is available at:


Keywords: Article pair modeling, Sub-article matching, Text representations, Sequence encoders, Explicit features, Wikipedia



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Muhao Chen (1)
  • Changping Meng (2)
  • Gang Huang (3)
  • Carlo Zaniolo (1)

  1. University of California, Los Angeles, USA
  2. Purdue University, West Lafayette, USA
  3. Google, Mountain View, USA
