In this paper, a unified framework for multimodal content retrieval is presented. The proposed framework supports retrieval of rich media objects as unified sets of different modalities (image, audio, 3D, video, and text) by efficiently combining all heterogeneous monomodal similarities into a global similarity according to an automatic weighting scheme. A multimodal space is then constructed to capture the semantic correlations among the multiple modalities. In contrast to existing techniques, the proposed method is also able to handle external multimodal queries by embedding them into the already constructed multimodal space via a space-mapping procedure based on submanifold analysis. In experiments on five real multimodal datasets, we show that the proposed approach outperforms competitive methods.
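To make the fusion step concrete, the sketch below shows a convex combination of per-modality similarity scores into a single global score. The abstract does not specify the automatic weighting scheme, so uniform weights are used here purely as a placeholder assumption; the function name and the modality labels are likewise illustrative, not taken from the paper.

```python
import numpy as np

def combine_similarities(similarities, weights=None):
    """Fuse per-modality similarity scores into one global score per item.

    similarities: dict mapping modality name -> array of shape (n_items,),
    with each array normalized to a comparable range (e.g. [0, 1]).
    weights: optional dict of non-negative modality weights; defaults to
    uniform weights (a placeholder -- the paper learns weights automatically).
    """
    modalities = sorted(similarities)
    if weights is None:
        weights = {m: 1.0 / len(modalities) for m in modalities}
    total = sum(weights[m] for m in modalities)
    # Weighted sum of monomodal similarities, normalized to a convex combination.
    fused = sum(weights[m] * np.asarray(similarities[m], dtype=float)
                for m in modalities)
    return fused / total

# Example: two candidate objects scored under three modalities.
sims = {
    "image": np.array([0.9, 0.2]),
    "text":  np.array([0.7, 0.4]),
    "audio": np.array([0.5, 0.6]),
}
scores = combine_similarities(sims)  # global similarity per candidate
```

Candidates would then be ranked by `scores` in descending order; replacing the uniform weights with learned ones changes only the `weights` argument.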