Skip to main content

A Dynamic-Static Approach of Model Fusion for Document Similarity Computation

  • Conference paper
  • First Online:
Web Information Systems Engineering – WISE 2015 (WISE 2015)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9418))

Included in the following conference series:

  • 1469 Accesses

Abstract

The semantic similarity of text document pairs can be used for valuable applications. There are various existing basic models proposed for representing document content and computing document similarity. Each basic model performs difference in different scenarios. Existing model selection or fusion approaches generate improved models based on these basic models on the granularity of document collection. These improved models are static for all document pairs and may be only proper for some of the document pairs. We propose a dynamic idea of model fusion, and an approach based on a Dynamic-Static Fusion Model (DSFM) on the granularity of document pairs, which is dynamic for each document pair. The dynamic module in DSFM learns to rank the basic models to predict the best basic model for a given document pair. We propose a model categorization method to construct ideal model labels of document pairs for learning in this dynamic module. The static module in DSFM is based on linear regression. We also propose a model selection method to select appropriate candidate basic models for fusion and improve the performance. The experiments on public document collections which contain paragraph pairs and sentence pairs with human-rated similarity illustrate the effectiveness of our approach.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Lee, M.D., Welsh, M.: An empirical evaluation of models of text document similarity. In: Proceedings of CogSci 2005, pp. 1254–1259 (2005)

    Google Scholar 

  2. STS2015, Semantic Textual Similarity for English in SemEval-2015. http://alt.qcri.org/semeval2015/task2/index.php?id=semantic-textual-similarity-for-english

  3. Joachims, T.: Training linear SVMs in linear time. In: Proceedings of KDD 2006, pp. 217–226 (2006)

    Google Scholar 

  4. Li, H.: A short introduction to learning to rank. IEICE Trans. Inf. Syst. E94–D(10), 1854–1862 (2011)

    Article  Google Scholar 

  5. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975)

    Article  MATH  Google Scholar 

  6. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)

    Article  Google Scholar 

  7. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of SIGIR 1999, pp. 50–57 (1999)

    Google Scholar 

  8. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

    MATH  Google Scholar 

  9. Schuhmacher, M., Ponzetto, S.P.: Knowledge-based graph document modeling. In: Proceedings of WSDM 2014, pp. 543–552 (2014)

    Google Scholar 

  10. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. OReilly Media Inc., Sebastopol (2009)

    MATH  Google Scholar 

  11. Řehůřek, R., Sojka, P.: Software framework for topic modeling with large corpora. In: Proceedings of LREC 2010 Workshop New Challenges for NLP Frameworks, pp. 46–50 (2010)

    Google Scholar 

  12. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: a nucleus for a Web of open data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  13. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems (I-Semantics 2011), pp. 1–8 (2011)

    Google Scholar 

  14. Thornton, C., Hutter, F., Hoos, H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classifiaction algorithms. In: Proceedings of KDD 2013 (2013)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jiyi Li .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Li, J., Asano, Y., Shimizu, T., Yoshikawa, M. (2015). A Dynamic-Static Approach of Model Fusion for Document Similarity Computation. In: Wang, J., et al. Web Information Systems Engineering – WISE 2015. WISE 2015. Lecture Notes in Computer Science(), vol 9418. Springer, Cham. https://doi.org/10.1007/978-3-319-26190-4_24

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-26190-4_24

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26189-8

  • Online ISBN: 978-3-319-26190-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics