A Language-Based Solution to Enable Metaverse Retrieval

  • Conference paper
MultiMedia Modeling (MMM 2024)

Abstract

The Metaverse has recently become increasingly attractive, with millions of users accessing the many available virtual worlds. But how do users find the one Metaverse that best fits their current interests? So far, searching has mostly relied on word of mouth or on advertisements on technology-oriented websites. The lack of search engines comparable to those available for other multimedia formats (e.g., YouTube for videos) is increasingly a limitation: finding a Metaverse that matches specific interests with the available methods is often cumbersome, and user-created Metaverses that lack strong advertisement are hard to discover. To address this limitation, we propose using natural language to describe the contents of the Metaverse a user wishes to find. We further highlight that, unlike more conventional 3D scenes, Metaverse scenarios are a more complex data format, since they often contain one or more types of multimedia that influence how relevant the scenario is to a user query. Therefore, in this work we introduce a novel task, called Text-to-Metaverse retrieval, which aims to model these aspects while also taking into account the cross-modal relations with the textual data. Since we are the first to tackle this problem, we also collect a dataset of 33,000 Metaverses, each consisting of a 3D scene enriched with multimedia content. Finally, we design and implement a deep learning framework based on contrastive learning and evaluate it through a thorough experimental setup.
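The abstract states only that the framework relies on contrastive learning between text and Metaverse representations. As a minimal sketch of what such an objective can look like, the Python code below assumes precomputed embeddings for the query text, the 3D scene, and each multimedia item in the scene; fuses scene and media with a hypothetical mean-pooling step (`fuse_metaverse`); scores a batch with a CLIP-style symmetric InfoNCE loss using an assumed temperature of 0.07; and ranks Metaverses for a query by cosine similarity. The encoders, the fusion strategy, the temperature, and all function names are illustrative assumptions, not the authors' architecture.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fuse_metaverse(scene: Vector, media: List[Vector]) -> Vector:
    """HYPOTHETICAL fusion: average the 3D-scene embedding with the mean of the
    embeddings of the multimedia items placed in it (not the authors' design)."""
    if not media:
        return l2_normalize(scene)
    dim = len(scene)
    media_mean = [sum(m[i] for m in media) / len(media) for i in range(dim)]
    return l2_normalize([(scene[i] + media_mean[i]) / 2 for i in range(dim)])


def log_softmax_at(logits: Vector, idx: int) -> float:
    """Numerically stable log-softmax of `logits`, evaluated at position `idx`."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - log_sum_exp


def symmetric_info_nce(texts: List[Vector], metaverses: List[Vector],
                       temperature: float = 0.07) -> float:
    """CLIP-style symmetric InfoNCE: texts[i] and metaverses[i] form the positive
    pair; every other item in the batch acts as a negative."""
    n = len(texts)
    logits = [[dot(t, m) / temperature for m in metaverses] for t in texts]
    # Text-to-Metaverse direction: softmax over each row.
    t2m = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Metaverse-to-text direction: softmax over each column.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    m2t = -sum(log_softmax_at(columns[i], i) for i in range(n)) / n
    return (t2m + m2t) / 2


def rank_metaverses(query: Vector, metaverses: List[Vector]) -> List[int]:
    """Return Metaverse indices sorted from most to least similar to the query."""
    scores = [dot(query, m) for m in metaverses]
    return sorted(range(len(metaverses)), key=lambda j: scores[j], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings standing in for real encoder outputs.
    texts = [l2_normalize([1.0, 0.0, 0.0]), l2_normalize([0.0, 1.0, 0.0])]
    metaverses = [
        fuse_metaverse([1.0, 0.0, 0.0], [[0.9, 0.1, 0.0]]),  # scene plus one media item
        fuse_metaverse([0.0, 1.0, 0.0], []),                  # scene without media
    ]
    print("aligned loss:", symmetric_info_nce(texts, metaverses))
    print("shuffled loss:", symmetric_info_nce(texts, metaverses[::-1]))
    print("ranking for query 1:", rank_metaverses(texts[1], metaverses))
```

Running the script should report a lower loss for the aligned batch than for the shuffled one, which is the signal contrastive training pushes on. Plain Python lists keep the sketch dependency-free; a practical system would use a tensor library, learned text and scene encoders, and much larger batches.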

Acknowledgments

This work was supported by the Department Strategic Plan (PSD) of the University of Udine-Interdepartmental Project on Artificial Intelligence (2020-25), MUR Progetti di Ricerca di Rilevante Interesse Nazionale (PRIN) 2022 (project code 2022YTE579), and by TechStar Srl, Italy. We also thank Beatrice Portelli for helping with the illustrations and for useful feedback during the preparation of this work.

Author information

Corresponding author

Correspondence to Ali Abdari.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Abdari, A., Falcon, A., Serra, G. (2024). A Language-Based Solution to Enable Metaverse Retrieval. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14556. Springer, Cham. https://doi.org/10.1007/978-3-031-53311-2_35

  • DOI: https://doi.org/10.1007/978-3-031-53311-2_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53310-5

  • Online ISBN: 978-3-031-53311-2

  • eBook Packages: Computer Science, Computer Science (R0)