An Interactive Independent Topic Analysis for a Mass Document Review Service

The Review of Socionetwork Strategies

Abstract

In this paper, we propose an interactive, constrained independent topic analysis for text data mining. Independent topic analysis (ITA) is a method that uses independent component analysis to extract mutually independent topics from document data. Extracting such topics makes large document collections easier to manage in document access support systems and document management systems. However, the topics extracted by ITA often differ from the topics a user wants. For the system to be useful, it must be interactive and reflect the user’s requests. We therefore propose an interactive ITA that incorporates those requests. For example, given three topics A, B, and C, a user who is interested in the content of both topics A and B can merge them into a single topic D. Likewise, a user who wants to analyze topic A in more detail can separate it into topics E and F. To that end, we define two kinds of user request: Merge Link constraints and Separate Link constraints. A Merge Link constraint merges two topics into one topic. A Separate Link constraint separates one topic into two topics. We propose a method for extracting highly independent topics that satisfy these constraints. We conducted evaluation experiments on the proposed method, and the results show the effectiveness of our approach.
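The merge and separate example in the abstract can be made concrete with a small bookkeeping sketch. It only tracks which topic labels should exist after a user request; it does not perform the constrained re-estimation of independent topics, which the abstract does not specify. The names `MergeLink`, `SeparateLink`, and `apply_constraint` are hypothetical and chosen for illustration, not taken from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class MergeLink:
    """Hypothetical record of a user request: merge two topics into one."""
    sources: Tuple[str, str]
    target: str


@dataclass(frozen=True)
class SeparateLink:
    """Hypothetical record of a user request: separate one topic into two."""
    source: str
    targets: Tuple[str, str]


Constraint = Union[MergeLink, SeparateLink]


def apply_constraint(topics: FrozenSet[str], c: Constraint) -> FrozenSet[str]:
    """Return the topic labels that should exist after honoring one constraint.

    This is label bookkeeping only; re-extracting the topics themselves
    is outside the scope of this sketch.
    """
    if isinstance(c, MergeLink):
        removed, added = set(c.sources), {c.target}
    else:
        removed, added = {c.source}, set(c.targets)

    missing = removed - topics
    if missing:
        raise ValueError(f"unknown topic(s): {sorted(missing)}")
    clash = added & (topics - removed)
    if clash:
        raise ValueError(f"topic label(s) already in use: {sorted(clash)}")
    return (topics - removed) | added


# The example from the abstract: start with topics A, B, and C.
topics = frozenset({"A", "B", "C"})
after_merge = apply_constraint(topics, MergeLink(("A", "B"), "D"))
after_split = apply_constraint(topics, SeparateLink("A", ("E", "F")))
print(sorted(after_merge))  # ['C', 'D']
print(sorted(after_split))  # ['B', 'C', 'E', 'F']
```

Keeping the constraints as immutable records separate from the topic set means a sequence of user requests can be replayed or undone, which seems a natural fit for an interactive setting, though that design choice is an assumption here.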

Acknowledgements

This work was supported by the Japan Science and Technology Agency (JST) under the EMS-CREST program.

Author information

Corresponding author

Correspondence to Takahiro Nishigaki.

Cite this article

Nishigaki, T., Nitta, K. & Onoda, T. An Interactive Independent Topic Analysis for a Mass Document Review Service. Rev Socionetwork Strat 12, 47–69 (2018). https://doi.org/10.1007/s12626-018-0018-5

  • DOI: https://doi.org/10.1007/s12626-018-0018-5
