Vive la Petite Différence!

Exploiting Small Differences for Gender Attribution of Short Texts
  • Filip Graliński
  • Rafał Jaworski
  • Łukasz Borchmann
  • Piotr Wierzchoń
Conference paper

DOI: 10.1007/978-3-319-45510-5_7

Part of the Lecture Notes in Computer Science book series (LNCS, volume 9924)
Cite this paper as:
Graliński F., Jaworski R., Borchmann Ł., Wierzchoń P. (2016) Vive la Petite Différence!. In: Sojka P., Horák A., Kopeček I., Pala K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham

Abstract

This article describes a series of experiments on gender attribution of Polish texts. The research was conducted on the publicly available corpus “He Said She Said”, which consists of a large number of short texts from the Polish version of Common Crawl. Unlike other experiments on gender attribution, this research takes on the task of classifying relatively short texts authored by many different people.

For this work, the original “He Said She Said” corpus was filtered to eliminate noise and apparent errors in the training data. In the next step, various machine learning algorithms were developed to improve classification accuracy.

Notably, the results of the experiments presented in this paper are fully reproducible, as all the source code was deposited on the open platform Gonito.net. Gonito.net allows machine learning tasks to be defined and tackled by multiple researchers, and gives the researchers easy access to each other’s results.

Keywords

Gender attribution, Text classification, Corpus, Common Crawl, Research reproducibility

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Filip Graliński (1)
  • Rafał Jaworski (1)
  • Łukasz Borchmann (1)
  • Piotr Wierzchoń (1)
  1. Adam Mickiewicz University in Poznań, Poznań, Poland
