
Cross-Modal Contrastive Learning for Event Extraction

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13945)


Abstract

Event extraction aims to extract trigger and argument information from text. Recent advanced methods leverage information from other modalities (e.g., images and videos) in addition to the text to enhance event extraction. However, the different modalities are often misaligned at the event level, which hurts model performance. To address this issue, we first construct a new multi-modal event extraction benchmark, the Text Video Event Extraction (TVEE) dataset, containing 7,598 text-video pairs. The texts are automatically extracted from video captions and are well aligned with the video content in most cases. Second, we present a Cross-modal Contrastive Learning for Event Extraction (CoCoEE) model that extracts events from multi-modal data by contrasting text-video and event-video representations. We conduct extensive experiments on our TVEE dataset and the existing VM2E2 benchmark. The results show that our proposed model outperforms baseline methods in terms of F-score, and that the proposed cross-modal contrastive learning also improves event extraction in each single modality. The dataset and code will be released upon acceptance.
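To make the contrastive objective described above concrete, the following is a minimal, illustrative PyTorch sketch of a symmetric InfoNCE loss over paired text and video embeddings. The function name, the temperature value, and the assumption that both encoders emit fixed-size vectors are ours for illustration; this is a sketch of the general technique, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired text/video embeddings.

    text_emb, video_emb: (batch, dim) tensors; row i of each tensor comes
    from the same text-video pair, so the diagonal of the similarity matrix
    holds the positives and all other entries act as in-batch negatives.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)

    # (batch, batch) cosine-similarity matrix scaled by the temperature.
    logits = text_emb @ video_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: text -> video and video -> text.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2v + loss_v2t) / 2
```

Under the same assumptions, the event-video contrast mentioned in the abstract would reuse this loss with event representations substituted for the text embeddings.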


Notes

  1. https://www.youtube.com/c/ondemandnews.

  2. https://cloud.tencent.com/product/ocr-catalog.

  3. https://huggingface.co/bert-base-uncased.

  4. https://huggingface.co/t5-base (a loading sketch for footnotes 3-4 follows these notes).

  5. Because the multi-modal evaluation focuses only on event-type extraction and cannot reflect the performance of every module, we perform the ablation study on the text evaluation and the video evaluation.
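Footnotes 3 and 4 name the bert-base-uncased and t5-base checkpoints. Below is a minimal loading sketch using the Hugging Face transformers library; how the paper actually wires these checkpoints into its architecture is not specified on this page, so the roles stated in the comments are assumptions.

```python
from transformers import AutoModel, AutoTokenizer, T5ForConditionalGeneration

# Checkpoint from footnote 3; assumed here to serve as the text encoder.
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_encoder = AutoModel.from_pretrained("bert-base-uncased")

# Checkpoint from footnote 4; assumed here to serve as a sequence-to-sequence
# generator (e.g., for producing argument fillers).
t5_tokenizer = AutoTokenizer.from_pretrained("t5-base")
t5_generator = T5ForConditionalGeneration.from_pretrained("t5-base")
```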


Acknowledgments

Supported by the National Key Research and Development Program of China (No. 2022YFF0712400), the National Natural Science Foundation of China (No. 62276063), and the Natural Science Foundation of Jiangsu Province under Grant No. BK20221457.

Author information


Corresponding author

Correspondence to Meng Wang.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, S., Ju, M., Zhang, Y., Zheng, Y., Wang, M., Qi, G. (2023). Cross-Modal Contrastive Learning for Event Extraction. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13945. Springer, Cham. https://doi.org/10.1007/978-3-031-30675-4_51


  • DOI: https://doi.org/10.1007/978-3-031-30675-4_51


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30674-7

  • Online ISBN: 978-3-031-30675-4

  • eBook Packages: Computer Science, Computer Science (R0)
