This study presents ongoing work on a new large-scale user-object interaction data-set incorporating visual, sensorial and positional modalities. The data-set can potentially be used for (a) assessing vision-related machine learning models on scene-understanding tasks such as activity recognition, visual object affordances and object detection; (b) providing realistic interactions in Virtual Reality (VR); and (c) enhancing 3D perception in robotic applications such as manipulation. The aim is to provide a large and diverse set of stereo video sequences, filmed from multiple cameras and involving multiple actors, together with sensorial and positional data recorded on our lab's premises. As a first application, the data-set is used to provide realistic haptic feedback to a user interacting with a 3D object in a virtual environment. The data-set is expected to help bridge the gap between theory and application and to facilitate the development of techniques that allow robots to better understand their surroundings. A set of experiments and a preliminary analysis show promising results and demonstrate the particular characteristics of the representation schemes involved.