VOXReality is an ambitious project whose goal is to facilitate and exploit the convergence of two important technologies: natural language processing (NLP) and computer vision (CV). Both technologies are experiencing large performance gains due to the emergence of data-driven methods, specifically machine learning (ML) and artificial intelligence (AI). On the one hand, CV and ML are driving the extended reality (XR) revolution beyond what was previously possible; on the other hand, speech-based interfaces and text-based content understanding are revolutionizing human-machine and human-human interaction.
VOXReality will employ an economical approach to integrate language- and vision-based AI models, with either unidirectional or bidirectional exchanges between the two modalities. Vision systems drive both augmented reality (AR) and virtual reality (VR), while language understanding adds a natural way for humans to interact with the backends of XR systems or to create multi-modal XR experiences combining vision and sound. The results of the project will be two-fold:
The above technologies will be validated through three use cases: