Abstract
In this paper, we propose an interactive, constrained independent topic analysis for text data mining. Independent topic analysis (ITA) is a method for extracting independent topics from document data using independent component analysis; it extracts the set of topics that are most independent of one another. Extracting independent topics makes large document collections easier to manage in document access support systems and document management systems. However, the topics extracted by ITA often differ from the topics a user wants. For the system to be of service to users, it must be interactive and reflect the user’s requests. We therefore propose an interactive ITA that responds to those requests. For example, given three topics A, B, and C, a user who wants the content of topics A and B together can merge them into a single topic D. Likewise, a user who wants to analyze topic A in more detail can separate it into topics E and F. To that end, we define two kinds of user request: Merge Link constraints and Separate Link constraints. A Merge Link constraint merges two topics into one topic. A Separate Link constraint separates one topic into two topics. We propose a method for extracting highly independent topics that satisfy these constraints. We conducted evaluation experiments on the proposed method, and the results show the effectiveness of our approach.
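The effect of the two constraint types on a topic set can be illustrated with a minimal Python sketch. It is not the proposed method: the representation of a topic as a set of document ids, the names `MergeLink`, `SeparateLink`, `apply_merge`, and `apply_separate`, and the caller-supplied split rule are all assumptions made for illustration. In particular, the split rule merely stands in for the constrained, independence-maximizing extraction that the paper proposes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

# Assumed representation: topic name -> ids of the documents assigned to it.
Topics = Dict[str, FrozenSet[int]]


@dataclass(frozen=True)
class MergeLink:
    """Hypothetical user request: merge two topics into one."""
    first: str
    second: str
    merged: str


@dataclass(frozen=True)
class SeparateLink:
    """Hypothetical user request: separate one topic into two."""
    source: str
    parts: Tuple[str, str]


def apply_merge(topics: Topics, link: MergeLink) -> Topics:
    """Replace the two linked topics with a single topic holding both document sets."""
    result = {name: docs for name, docs in topics.items()
              if name not in (link.first, link.second)}
    result[link.merged] = topics[link.first] | topics[link.second]
    return result


def apply_separate(topics: Topics, link: SeparateLink,
                   goes_first: Callable[[int], bool]) -> Topics:
    """Replace the source topic with two topics.

    `goes_first` is a placeholder split rule supplied by the caller; it is
    an assumption of this sketch, not the paper's extraction procedure.
    """
    docs = topics[link.source]
    first = frozenset(d for d in docs if goes_first(d))
    result = {name: d for name, d in topics.items() if name != link.source}
    result[link.parts[0]] = first
    result[link.parts[1]] = docs - first
    return result


# The example from the abstract: three topics A, B, and C.
topics: Topics = {
    "A": frozenset({1, 2, 3, 4}),
    "B": frozenset({5, 6}),
    "C": frozenset({7, 8, 9}),
}

# Merging A and B into D leaves two topics, C and D.
merged = apply_merge(topics, MergeLink("A", "B", "D"))

# Separating A into E and F leaves four topics, B, C, E, and F.
separated = apply_separate(topics, SeparateLink("A", ("E", "F")),
                           lambda doc_id: doc_id % 2 == 0)
```

In this sketch a Merge Link reduces the number of topics by one and a Separate Link increases it by one, which matches the A, B, C example in the abstract. How documents or terms are actually redistributed under each constraint is determined by the paper's method and is not modeled here.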
Acknowledgements
This work was supported by the Japan Science and Technology Agency (JST) under the EMS-CREST program.
About this article
Cite this article
Nishigaki, T., Nitta, K. & Onoda, T. An Interactive Independent Topic Analysis for a Mass Document Review Service. Rev Socionetwork Strat 12, 47–69 (2018). https://doi.org/10.1007/s12626-018-0018-5
DOI: https://doi.org/10.1007/s12626-018-0018-5