Enhancement of Performance of Document Clustering in the Authorship Identification Problem with a Weighted Cosine Similarity

  • Carolina Martín-del-Campo-Rodríguez (email author)
  • Grigori Sidorov
  • Ildar Batyrshin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11289)

Abstract

Distance and similarity measures are essential for solving many pattern recognition problems, such as classification, information retrieval, and clustering, where the choice of a specific distance can lead to better performance than others. This paper proposes a weighted cosine distance that varies the weights of the exclusive attributes of the input vectors. Agglomerative hierarchical clustering of documents was used to compare the traditional cosine similarity with the proposed one. The modified measure resulted in improved cluster formation.
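
The abstract does not give the formula for the weighted measure, so the following is only a minimal sketch of one possible reading, not the authors' definition. It assumes that an "exclusive attribute" is a coordinate that is non-zero in exactly one of the two vectors, and that such coordinates are scaled by a single hypothetical parameter, `exclusive_weight`, when computing the norms. The function name and the parameter are illustrative assumptions.

```python
import math


def weighted_cosine(x, y, exclusive_weight=1.0):
    """Cosine similarity with a separate weight for exclusive attributes.

    Assumption (not taken from the paper): a coordinate is "exclusive"
    when it is non-zero in exactly one of the two vectors. Such
    coordinates never contribute to the dot product, so the weight only
    changes the two norms.
    """
    dot = norm_x = norm_y = 0.0
    for a, b in zip(x, y):
        exclusive = (a != 0) != (b != 0)
        w = exclusive_weight if exclusive else 1.0
        dot += w * a * b
        norm_x += w * a * a
        norm_y += w * b * b
    if norm_x == 0 or norm_y == 0:
        # At least one vector has no weighted mass; treat as no similarity.
        return 0.0
    return dot / math.sqrt(norm_x * norm_y)


if __name__ == "__main__":
    x = [1, 2, 0]
    y = [1, 0, 2]
    print(weighted_cosine(x, y))        # weight 1.0: the traditional cosine
    print(weighted_cosine(x, y, 0.5))   # exclusive attributes count half
```

With `exclusive_weight=1.0` the sketch reduces to the traditional cosine similarity (0.2 for the vectors above); with 0.5 the same pair scores about 0.333, because the coordinates that appear in only one vector contribute less to the norms. For a clustering algorithm that expects distances, one common choice is to use one minus the similarity.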

Notes

Acknowledgments

This work was partially supported by the Mexican Government (CONACYT projects 240844, SNI, COFAA-IPN, SIP-IPN 20181849, 20171813, BEIFI 20181315).

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Carolina Martín-del-Campo-Rodríguez (1), email author
  • Grigori Sidorov (1)
  • Ildar Batyrshin (1)
  1. Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC), Mexico City, Mexico