How to Detect Novelty in Textual Data Streams? A Comparative Study of Existing Methods

Part of the Lecture Notes in Computer Science book series (LNAI, volume 11986)

Abstract

Since datasets annotated for novelty at the document and/or word level are not readily available, we present a simulation framework for creating textual datasets in which we control how novelty occurs. We also present a benchmark of existing methods for novelty detection in textual data streams. We define a few tasks and compare several state-of-the-art methods on them. The simulation framework allows us to evaluate their performance on a limited set of scenarios and to test their sensitivity to some parameters. Finally, we run the same methods on different kinds of novelty in the New York Times Annotated Dataset.
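To make the idea of a controlled simulation concrete, here is a minimal sketch of a toy document stream in which a novel vocabulary appears at a chosen time step and grows at a chosen rate. It is not the authors' framework: the function name `simulate_stream`, all parameters, the synthetic vocabularies, and the linear ramp are assumptions made purely for illustration.

```python
import random

def simulate_stream(n_steps=20, docs_per_step=50, doc_len=30,
                    onset=10, max_novel_share=0.3, seed=0):
    """Toy stream with controlled novelty (hypothetical, for illustration).

    Returns a list of (time_step, words, is_novel) tuples, so novelty is
    labeled at the document level; word-level labels can be read off the
    synthetic 'new*' tokens.
    """
    rng = random.Random(seed)
    # Assumed synthetic vocabularies: background words and novel words.
    background_vocab = [f"bg{i}" for i in range(200)]
    novel_vocab = [f"new{i}" for i in range(20)]

    stream = []
    for t in range(n_steps):
        if t < onset:
            share = 0.0  # no novelty before the onset step
        else:
            # Assumed linear ramp from a small share up to max_novel_share.
            share = max_novel_share * (t - onset + 1) / (n_steps - onset)
        for _ in range(docs_per_step):
            is_novel = rng.random() < share
            words = []
            for _ in range(doc_len):
                # Novel documents mix novel and background words evenly.
                if is_novel and rng.random() < 0.5:
                    words.append(rng.choice(novel_vocab))
                else:
                    words.append(rng.choice(background_vocab))
            stream.append((t, words, is_novel))
    return stream

stream = simulate_stream()
novel_per_step = {}
for t, _, is_novel in stream:
    novel_per_step[t] = novel_per_step.get(t, 0) + int(is_novel)
print(novel_per_step)
```

Because the onset step, the ramp, and the share of novel documents are all explicit parameters, a detector can be scored against known ground truth and its sensitivity to each parameter tested in isolation.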

Keywords

  • Novelty Detection
  • Text mining
  • Evaluation framework
  • Natural Language Processing

Notes

  1. The code for simulation is available at https://github.com/clechristophe/NoveltySimulator.

  2. https://catalog.ldc.upenn.edu/LDC2008T19.

Author information

Corresponding author

Correspondence to Clément Christophe.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Christophe, C., Velcin, J., Cugliari, J., Suignard, P., Boumghar, M. (2020). How to Detect Novelty in Textual Data Streams? A Comparative Study of Existing Methods. In: Lemaire, V., Malinowski, S., Bagnall, A., Bondu, A., Guyet, T., Tavenard, R. (eds) Advanced Analytics and Learning on Temporal Data. AALTD 2019. Lecture Notes in Computer Science, vol 11986. Springer, Cham. https://doi.org/10.1007/978-3-030-39098-3_9

  • DOI: https://doi.org/10.1007/978-3-030-39098-3_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-39097-6

  • Online ISBN: 978-3-030-39098-3

  • eBook Packages: Computer Science, Computer Science (R0)