Supervised Classification Using Balanced Training

Du, Mian; Pierce, Matthew; Pivovarova, Lidia; Yangarber, Roman

doi:10.1007/978-3-319-11397-5_11

Conference paper
First Online: 01 January 2014

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8791)

Abstract

We examine supervised learning for multi-class, multi-label text classification. We are interested in exploring classification in a real-world setting, where the distribution of labels may change dynamically over time. First, we compare the performance of an array of binary classifiers trained on the label distribution found in the original corpus against classifiers trained on balanced data, where we try to make the label distribution as nearly uniform as possible. We discuss the performance trade-offs between balanced vs. unbalanced training, and highlight the advantages of balancing the training set. Second, we compare the performance of two classifiers, Naive Bayes and SVM, with several feature-selection methods, using balanced training. We combine a Named-Entity-based rote classifier with the statistical classifiers to obtain better performance than either method alone.
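
As a rough illustration of what balancing a multi-label training set can involve, the following Python sketch down-samples a corpus so that each label is capped at roughly a fixed quota. The abstract does not specify the balancing procedure, so the greedy rarest-first strategy, the function name `balanced_sample`, and the `quota` parameter are all assumptions made for illustration, not the authors' method.

```python
import random
from collections import Counter


def balanced_sample(docs, quota, seed=0):
    """Pick documents so that label counts are as even as the data allows.

    docs  : list of (doc_id, set_of_labels) pairs (hypothetical input format)
    quota : target number of documents per label (assumed parameter)
    Returns (sorted list of selected doc_ids, Counter of label counts).
    """
    rng = random.Random(seed)
    labels_of = dict(docs)
    freq = Counter(label for _, labels in docs for label in labels)

    selected = set()
    counts = Counter()
    # Visit labels from rarest to most frequent, so rare labels are
    # filled first and frequent labels only top up what is missing.
    for label in sorted(freq, key=lambda l: (freq[l], l)):
        candidates = [d for d, labels in docs
                      if label in labels and d not in selected]
        rng.shuffle(candidates)
        for doc_id in candidates:
            if counts[label] >= quota:
                break
            selected.add(doc_id)
            # A multi-label document counts towards every label it carries,
            # so a frequent label may end up slightly above the quota.
            counts.update(labels_of[doc_id])
    return sorted(selected), counts
```

Labels with fewer than `quota` documents simply contribute everything they have, so the resulting distribution is only as uniform as the corpus permits.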

Notes

1. http://about.reuters.com/researchandstandards/corpus/
2. Henceforth we use the terms label, class and (industry) sector interchangeably.
3. The commonly-used pre-processed data from [14] is not suitable, for two reasons: (a) we need plain text as input for IE, and (b) the preprocessed dataset contains only unigrams, while we use a combination of unigrams and bigrams as features.
4. For example, we merge I64000 and I65000, both called Retail Distribution.
5. Otherwise we cannot guarantee that each sector will have a sufficient number of instances in the training and test pools. For example, if we collect the training and testing data in random order and happen to start with the largest sectors, then by the time we come to the smallest sectors all of their data may already be included in the training pool (due to multiple labeling of documents), leaving none for testing.
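
The ordering problem described in the last note can be made concrete with a small Python sketch. It assumes that the pools are filled sector by sector, starting with the smallest sector; the function name `split_pools`, the `test_fraction` parameter, and the rule of reserving at least one test document per sector are hypothetical choices for illustration, not details taken from the paper.

```python
from collections import Counter


def split_pools(docs, test_fraction=0.2):
    """Assign documents to train/test pools, smallest sectors first.

    docs : list of (doc_id, set_of_labels) pairs (hypothetical input format)
    Returns (train_ids, test_ids) as two disjoint sets.
    """
    freq = Counter(label for _, labels in docs for label in labels)
    train, test = set(), set()
    # Process sectors in ascending size, so a small sector gets to place
    # its documents before larger, overlapping sectors claim them.
    for label in sorted(freq, key=lambda l: (freq[l], l)):
        unassigned = [d for d, labels in docs
                      if label in labels and d not in train and d not in test]
        if len(unassigned) > 1:
            # Reserve at least one document of this sector for testing.
            n_test = max(1, int(len(unassigned) * test_fraction))
        else:
            n_test = 0
        test.update(unassigned[:n_test])
        train.update(unassigned[n_test:])
    return train, test
```

Processing the largest sectors first would let their multi-labeled documents be assigned before the small sectors are considered, which is the failure the note warns about. This sketch reduces that risk but does not remove it entirely when several small sectors share most of their documents.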

References

1. Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recogn. 37(9), 1757–1771 (2004)
2. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)
3. Cisse, M.M., Usunier, N., Artières, T., Gallinari, P.: Robust Bloom filters for large multilabel classification tasks. In: Advances in Neural Information Processing Systems, pp. 1851–1859 (2013)
4. Dendamrongvit, S., Kubat, M.: Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains. In: Theeramunkong, T., Nattee, C., Adeodato, P.J.L., Chawla, N., Christen, P., Lenca, P., Poon, J., Williams, G. (eds.) New Frontiers in Applied Data Mining. LNCS, vol. 5669, pp. 40–52. Springer, Heidelberg (2010)
5. Dhondt, E., Verberne, S., Weber, N., Koster, C., Boves, L.: Using skipgrams and POS-based feature selection for patent classification. Comput. Linguist. Neth. 2, 52–70 (2012)
6. Erenel, Z., Altınçay, H.: Improving the precision-recall trade-off in undersampling-based binary text categorization using unanimity rule. Neural Comput. Appl. 22(1), 83–100 (2013)
7. Forman, G.: An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305 (2003)
8. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
9. Huang, R., Riloff, E.: Classifying message board posts with an extracted lexicon of patient attributes. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1557–1562 (2013)
10. Huttunen, S., Vihavainen, A., Du, M., Yangarber, R.: Predicting relevance of event extraction for the end user. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds.) Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing, pp. 163–176. Springer, Berlin (2012)
11. Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell. Data Anal. 6(5), 429–449 (2002)
12. Koller, D., Sahami, M.: Hierarchically classifying documents using very few words. Technical report 1997–75, Stanford InfoLab, February 1997
13. Kotsiantis, S., Kanellopoulos, D., Pintelas, P., et al.: Handling imbalanced datasets: a review. GESTS Int. Trans. Comput. Sci. Eng. 30(1), 25–36 (2006)
14. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)
15. Liu, Y., Loh, H.T., Sun, A.: Imbalanced text classification: a term weighting approach. Expert Syst. Appl. 36(1), 690–701 (2009)
16. Prati, R.C., Batista, G.E.A.P.A., Monard, M.C.: Class imbalances versus class overlapping: an analysis of a learning system behavior. In: Monroy, R., Arroyo-Figueroa, G., Sucar, L.E., Sossa, H. (eds.) MICAI 2004. LNCS (LNAI), vol. 2972, pp. 312–321. Springer, Heidelberg (2004)
17. Puurula, A.: Scalable text classification with sparse generative modeling. In: Anthony, P., Ishizuka, M., Lukose, D. (eds.) PRICAI 2012. LNCS, vol. 7458, pp. 458–469. Springer, Heidelberg (2012)
18. Stamatatos, E.: Author identification: using text sampling to handle the class imbalance problem. Inf. Process. Manage. 44(2), 790–799 (2008)
19. Tikk, D., Biró, G.: Experiments with multi-label text classifier on the Reuters collection. In: Proceedings of the International Conference on Computational Cybernetics (ICCC 03), pp. 33–38 (2003)
20. Tsoumakas, G., Katakis, I.: Multi-label classification: an overview. Int. J. Data Warehouse. Min. (IJDWM) 3(3), 1–13 (2007)
21. Yang, Y.: An evaluation of statistical approaches to text categorization. Inf. Retrieval 1(1–2), 69–90 (1999)
22. Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: ICML, vol. 97, pp. 412–420 (1997)
23. Zhang, W., Yoshida, T., Tang, X.: A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Syst. Appl. 38(3), 2758–2765 (2011)
24. Zhuang, D., Zhang, B., Yang, Q., Yan, J., Chen, Z., Chen, Y.: Efficient text classification by weighted proximal SVM. In: Fifth IEEE International Conference on Data Mining (2005)

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Helsinki, Finland
Mian Du, Matthew Pierce, Lidia Pivovarova & Roman Yangarber

Corresponding author

Correspondence to Lidia Pivovarova.

Editor information

Editors and Affiliations

University Joseph Fourier, Grenoble, France
Laurent Besacier
Rovira i Virgili University, Tarragona, Spain
Adrian-Horia Dediu
Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide

Cite this paper

Du, M., Pierce, M., Pivovarova, L., Yangarber, R. (2014). Supervised Classification Using Balanced Training. In: Besacier, L., Dediu, AH., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2014. Lecture Notes in Computer Science, vol 8791. Springer, Cham. https://doi.org/10.1007/978-3-319-11397-5_11

DOI: https://doi.org/10.1007/978-3-319-11397-5_11
Published: 03 September 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11396-8
Online ISBN: 978-3-319-11397-5
eBook Packages: Computer Science, Computer Science (R0)
