Vive la Petite Différence!

Exploiting Small Differences for Gender Attribution of Short Texts

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9924)

Abstract

This article describes a series of experiments on gender attribution of Polish texts. The research was conducted on the publicly available corpus “He Said She Said”, which consists of a large number of short texts from the Polish version of Common Crawl. Unlike other experiments on gender attribution, this research takes on the task of classifying relatively short texts authored by many different people.

For this work, the original “He Said She Said” corpus was filtered to eliminate noise and apparent errors in the training data. Next, various machine learning algorithms were developed to improve classification accuracy.

Notably, the results of the experiments presented in this paper are fully reproducible, as all the source code was deposited on the open platform Gonito.net. Gonito.net allows machine learning tasks to be defined and tackled by multiple researchers, and gives the researchers easy access to each other’s results.

Notes

  1. http://data.statmt.org/ngrams/raw/.

  2. git://gonito.net/petite-difference-challenge.git.

  3. The output files and source code are available for inspection and reproduction in the Git repository git://gonito.net/petite-difference-challenge, branch submission-00085.

Acknowledgements

Work supported by the Polish Ministry of Science and Higher Education under the National Programme for Development of the Humanities, grant 0286/NPRH4/H1a/83/2015: “50,000 słów. Indeks tematyczno-chronologizacyjny 1918–1939”.

Author information

Corresponding author

Correspondence to Rafał Jaworski.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Graliński, F., Jaworski, R., Borchmann, Ł., Wierzchoń, P. (2016). Vive la Petite Différence!. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_7

  • DOI: https://doi.org/10.1007/978-3-319-45510-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45509-9

  • Online ISBN: 978-3-319-45510-5

  • eBook Packages: Computer Science (R0)
