Word Clouds for Efficient Document Labeling

Seifert, Christin; Ulbrich, Eva; Granitzer, Michael

doi:10.1007/978-3-642-24477-3_24

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6926)

Included in the following conference series:

International Conference on Discovery Science

Abstract

In text classification, the amount and quality of training data are crucial for the performance of the classifier. Training data is generated by human labelers, a tedious and time-consuming task. We propose to use condensed representations of text documents instead of the full-text documents to reduce the labeling time for single documents. These condensed representations are key sentences and key phrases, and they can be generated in a fully unsupervised way. The key phrases are presented in a layout similar to a tag cloud. In a user study with 37 participants, we evaluated whether human labelers can label documents faster and equally accurately with these condensed representations. Our evaluation shows that users labeled word clouds twice as fast as, and as accurately as, full-text documents. While further investigations for different classification tasks are necessary, this insight could potentially reduce the cost of labeling text documents.
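
As a rough illustration of the idea of a condensed, tag-cloud-like representation, the sketch below weights terms by raw frequency and maps the weights to font sizes. This is not the method used in the paper, which the abstract does not specify in detail; the stopword list, the function names `key_terms` and `font_sizes`, and the point-size range are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "are", "by"}


def key_terms(text, top_k=5):
    """Return the top_k most frequent non-stopword terms with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(top_k)


def font_sizes(terms, min_pt=10, max_pt=32):
    """Map term counts linearly onto a font-size range, as in a tag cloud."""
    if not terms:
        return {}
    weights = [w for _, w in terms]
    lo, hi = min(weights), max(weights)
    span = (hi - lo) or 1  # avoid division by zero when all weights are equal
    return {t: round(min_pt + (w - lo) / span * (max_pt - min_pt)) for t, w in terms}


if __name__ == "__main__":
    doc = "classifier training data training labels classifier training"
    # Print each selected term with the font size it would get in the cloud.
    for term, size in font_sizes(key_terms(doc, top_k=2)).items():
        print(f"{term}: {size}pt")
```

A labeler would then see only the few largest terms instead of the full text. Frequency is the simplest possible weighting; a graph-based ranking or TF-IDF could be swapped in without changing the font-size mapping.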

Author information

Authors and Affiliations

University of Technology, Graz, Austria
Christin Seifert & Michael Granitzer
Know-Center, Graz, Austria
Eva Ulbrich & Michael Granitzer

Editor information

Editors and Affiliations

Department of Software Systems, Tampere University of Technology, P. O. Box 553, 33101, Tampere, Finland
Tapio Elomaa
Department of Information and Computer Science, Aalto University School of Science, P.O. Box 15400, 00076, Aalto, Finland
Jaakko Hollmén
Helsinki Institute for Information Technology (HIIT), Finland
Heikki Mannila

About this paper

Cite this paper

Seifert, C., Ulbrich, E., Granitzer, M. (2011). Word Clouds for Efficient Document Labeling. In: Elomaa, T., Hollmén, J., Mannila, H. (eds) Discovery Science. DS 2011. Lecture Notes in Computer Science, vol 6926. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24477-3_24

DOI: https://doi.org/10.1007/978-3-642-24477-3_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24476-6
Online ISBN: 978-3-642-24477-3
eBook Packages: Computer Science, Computer Science (R0)
