Feature Selection for Language Independent Text Forum Summarization
The need for multilingual information retrieval is rising steadily. Specialized text-based forums on the Web are a valuable source of such information. However, extraction of informative messages is often hindered by the large number of non-informative (so-called off-topic) posts and by the informal language commonly used on forums.
This paper addresses the automatic identification of posts that are potentially useful for sharing professional experience within text forums, irrespective of the forum’s language. For our experiments we selected subsets of posts from various text forums in different languages. Manual annotation was performed by native-speaking experts. Textual, thread-based, and social graph features were extracted. To select a satisfactory set of language-independent forum features, we used gradient boosting models, the relative influence metric for model analysis, and the NDCG metric for measuring the quality of the selection method.
We have formed a satisfactory set of forum features that indicates a post’s utility, does not demand sophisticated linguistic analysis, and is suitable for practical use.
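As an illustration of the NDCG metric mentioned above, the following is a minimal sketch of how a ranking of posts could be scored against expert labels. It assumes the common linear-gain, log2-discount formulation; the abstract does not state which NDCG variant the authors used, and the function names (`dcg`, `ndcg`, `ndcg_from_scores`) and the example data are hypothetical.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    top = relevances[:k] if k is not None else relevances
    # Rank 1 has discount log2(2) = 1; later ranks are discounted more heavily.
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(top, start=1))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def ndcg_from_scores(scores: Sequence[float], labels: Sequence[float],
                     k: Optional[int] = None) -> float:
    """Rank posts by predicted score (highest first), then compute NDCG."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ndcg([labels[i] for i in order], k)


# Hypothetical data: model scores for four posts and expert labels
# (1 = useful post, 0 = off-topic post).
example_scores = [0.9, 0.2, 0.7, 0.4]
example_labels = [1, 0, 0, 1]
example_ndcg = ndcg_from_scores(example_scores, example_labels)
```

A value of 1.0 means every useful post is ranked above every off-topic one; lower values indicate that useful posts were pushed down the ranking. Comparing this value across models trained on different feature subsets is one plausible way to judge a selection method.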