Multimodal Search and Retrieval

Multimodal search is the search for media items of multiple types (e.g., images, 3D objects, videos, sounds, text and their combinations) using any of these types, or a combination of them, as the query. The EU-funded project I-SEARCH aims to provide a novel unified framework for multimodal content indexing, search and retrieval. The searchable items within I-SEARCH will range from very simple media items (e.g., a single image or an audio file) to highly complex multimedia collections (e.g., a 3D object together with multiple 2D images and audio files), along with accompanying information. All of these items are called Content Objects (COs). For the formal representation of COs, I-SEARCH introduces a novel description framework: the Rich Unified Content Description (RUCoD).

A multimodal dataset has been created in I-SEARCH to demonstrate multimodal search. The dataset consists of 10305 COs classified into 51 categories. The COs consist of images, 3D objects, sounds and videos, accompanied by textual information, tags and, where available, location information. The RUCoD descriptors (XML documents) of the entire dataset are available for download below. Links to the actual media files are given within the corresponding RUCoD XML document (in the <MultimediaContent> tag).

Low-level descriptors have been extracted for the 3D objects and images of the dataset. Links to the descriptors are given in the RUCoD XML files (<L_Descriptor type="ImageType"> for the image descriptors and <L_Descriptor type="Object3D"> for the 3D object descriptors).
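As a rough illustration of how these links might be collected from a RUCoD file, here is a minimal Python sketch. Only the element names <MultimediaContent> and <L_Descriptor>, and the type values "ImageType" and "Object3D", come from the description above. Everything else is an assumption: the root element name, the example.org URLs, the function name extract_links, and the idea that each link is stored as text inside the element (possibly in a nested child). The real RUCoD schema may differ, for example by keeping links in attributes, so check an actual file before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical RUCoD fragment; the root name and URLs are made up for illustration.
SAMPLE = """<RUCoD>
  <MultimediaContent>http://example.org/media/chair.jpg</MultimediaContent>
  <L_Descriptor type="ImageType">http://example.org/desc/chair_img.bin</L_Descriptor>
  <L_Descriptor type="Object3D">http://example.org/desc/chair_3d.bin</L_Descriptor>
</RUCoD>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def text_fragments(elem):
    """Return all non-empty text fragments found inside an element."""
    return [t.strip() for t in elem.itertext() if t.strip()]


def extract_links(rucod_xml):
    """Collect media links and low-level descriptor links from a RUCoD string.

    Assumes links appear as element text; adjust if they live in attributes.
    """
    root = ET.fromstring(rucod_xml)
    links = {"media": [], "ImageType": [], "Object3D": []}
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "MultimediaContent":
            links["media"].extend(text_fragments(elem))
        elif name == "L_Descriptor":
            kind = elem.get("type")
            if kind in ("ImageType", "Object3D"):
                links[kind].extend(text_fragments(elem))
    return links


if __name__ == "__main__":
    for kind, urls in extract_links(SAMPLE).items():
        print(kind, urls)
```

Matching on the local element name keeps the sketch usable whether or not the files declare an XML namespace, which is not stated above.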

Download the I-SEARCH Multimodal Dataset

Additionally, multimodal search algorithms have been experimentally evaluated on the following multimodal datasets, which we created:

  • Multimodal Dataset 1: 264 COs, classified into 12 categories, consisting of 3D objects and 2D images.

  • Multimodal Dataset 2: 495 COs, classified into 10 categories, consisting of 3D objects, 2D images and sounds.

  • Multimodal Dataset 3: 2334 COs, classified into 50 categories, consisting of 3D objects and 2D real images.

  • Multimodal Dataset 4: 2779 COs, classified into 50 categories, consisting of 3D objects, 2D real images and text.

  • Multimodal Dataset 5: 637 COs, classified into 43 categories, consisting of 3D objects, 2D images, audio, video and text.


Relevant papers

P. Daras, S. Manolopoulou, A. Axenopoulos, “Search and retrieval of rich media objects supporting multiple multimodal queries”, IEEE Transactions on Multimedia, 14(3), pp. 734–746, 2012

A. Axenopoulos, S. Manolopoulou, P. Daras, “Multimodal Search and Retrieval using Manifold Learning and Query Formulation”, ACM International Conference on 3D Web Technology, June 20-22, 2011, Paris, France

D. Rafailidis, S. Manolopoulou, P. Daras, “A Unified Framework for Multimodal Retrieval”, Pattern Recognition, Elsevier