A Language-Based Solution to Enable Metaverse Retrieval

Abdari, Ali; Falcon, Alex; Serra, Giuseppe

doi:10.1007/978-3-031-53311-2_35

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14556)

Included in the following conference series:

International Conference on Multimedia Modeling

Abstract

The Metaverse has recently become increasingly attractive, with millions of users accessing the many available virtual worlds. But how do users find the Metaverse that best fits their current interests? So far, search happens mostly by word of mouth or through advertisements on technology-oriented websites. The lack of search engines comparable to those available for other multimedia formats (e.g., YouTube for videos) is showing its limitations: finding a Metaverse that matches specific interests is often cumbersome with the available methods, and user-created Metaverses that lack strong advertisement are difficult to discover. To address this limitation, we first propose to use natural language to describe the desired contents of the Metaverse a user wishes to find. Second, we highlight that, unlike more conventional 3D scenes, Metaverse scenarios are a more complex data format, since they often contain one or more types of multimedia that influence the relevance of the scenario to a user query. Therefore, in this work, we introduce a novel task, called Text-to-Metaverse retrieval, which aims to model these aspects while also taking the cross-modal relations with the textual data into account. Since we are the first to tackle this problem, we also collect a dataset of 33,000 Metaverses, each consisting of a 3D scene enriched with multimedia content. Finally, we design and implement a deep learning framework based on contrastive learning, together with a thorough experimental setup.
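
The abstract does not specify the loss or the encoders used in the framework, so the following is only a minimal sketch of how contrastive learning is commonly applied to cross-modal retrieval, not the authors' implementation. It assumes that each text description and each Metaverse scenario has already been encoded into a fixed-size vector (how the 3D scene and its multimedia content are fused into one vector is left out), and it uses a symmetric InfoNCE-style loss with a temperature of 0.07. The function names, the loss choice, and the temperature value are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(text_embs, scene_embs):
    """Cosine similarity between every text embedding and every scenario embedding."""
    texts = [l2_normalize(t) for t in text_embs]
    scenes = [l2_normalize(s) for s in scene_embs]
    return [[sum(a * b for a, b in zip(t, s)) for s in scenes] for t in texts]


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where row i should assign its highest score to column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_embs, scene_embs, temperature=0.07):
    """Hypothetical symmetric InfoNCE-style loss over a batch of matched pairs.

    text_embs[i] and scene_embs[i] are assumed to describe the same scenario;
    all other pairings in the batch act as negatives.
    """
    sims = similarity_matrix(text_embs, scene_embs)
    logits = [[v / temperature for v in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    text_to_scene = cross_entropy_on_diagonal(logits)
    scene_to_text = cross_entropy_on_diagonal(transposed)
    return 0.5 * (text_to_scene + scene_to_text)


def rank_scenarios(query_emb, scene_embs):
    """Return scenario indices sorted from most to least similar to the query."""
    sims = similarity_matrix([query_emb], scene_embs)[0]
    return sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
```

In such a setup, training would push the loss down by pulling each description toward its own scenario and away from the others in the batch. At query time, `rank_scenarios` orders the collection by similarity to the embedded user query.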

Notes

1. https://decentraland.org/
2. https://www.roblox.com/

Acknowledgments

This work was supported by the Department Strategic Plan (PSD) of the University of Udine-Interdepartmental Project on Artificial Intelligence (2020-25), MUR Progetti di Ricerca di Rilevante Interesse Nazionale (PRIN) 2022 (project code 2022YTE579), and by TechStar Srl, Italy. Also, we thank Beatrice Portelli for helping with the illustrations and for the useful feedback during the preparation of this work.

Author information

Authors and Affiliations

University of Udine, Udine, Italy
Ali Abdari, Alex Falcon & Giuseppe Serra
University of Naples Federico II, Naples, Italy
Ali Abdari

Authors

Ali Abdari
Alex Falcon
Giuseppe Serra

Corresponding author

Correspondence to Ali Abdari.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Stevan Rudinac
Delft University of Technology, Delft, The Netherlands
Alan Hanjalic
Delft University of Technology, Delft, The Netherlands
Cynthia Liem
University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Reykjavik University, Reykjavik, Iceland
Björn Þór Jónsson
Microsoft Research Lab – Asia, Beijing, China
Bei Liu
The University of Tokyo, Tokyo, Japan
Yoko Yamakata

About this paper

Cite this paper

Abdari, A., Falcon, A., Serra, G. (2024). A Language-Based Solution to Enable Metaverse Retrieval. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14556. Springer, Cham. https://doi.org/10.1007/978-3-031-53311-2_35

DOI: https://doi.org/10.1007/978-3-031-53311-2_35
Published: 28 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53310-5
Online ISBN: 978-3-031-53311-2
eBook Packages: Computer Science, Computer Science (R0)
