Research findings in cognitive neuroscience establish that humans develop their understanding of real-world objects early on, by observing others interact with them or by actively exploring and physically interacting with them. This fact has motivated the so-called "sensorimotor" learning approach, in which object appearance information (sensory) is combined with object affordances (motor), i.e., the types of actions a human can perform with the object. In this work, the aforementioned paradigm is adopted, and a neuro-biologically inspired two-stream model for RGB-D object recognition is investigated. Both streams are realized as state-of-the-art deep neural networks that process and fuse appearance and affordance information in multiple ways. In particular, three model variants are developed to efficiently encode the spatio-temporal nature of the hand-object interaction, and an attention mechanism that relies on the confidence of the appearance stream is also investigated. Additionally, a suitable auxiliary loss is proposed for model training, which further optimizes each of the two information streams. Experiments on the challenging SOR3D dataset, which consists of 14 object types and 13 object affordances, demonstrate the efficacy of the proposed model in RGB-D object recognition. Overall, the best-performing model achieves 90.70% classification accuracy, which further increases to 91.98% when training with the auxiliary loss; the latter corresponds to a 46% relative error reduction compared to the appearance-only classifier. Finally, a cross-view analysis on the SOR3D dataset provides valuable insight into the impact of viewpoint on the affordance information.
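
To make the described architecture concrete, the following is a minimal PyTorch sketch of a two-stream appearance/affordance classifier with a confidence-gated fusion and per-stream auxiliary losses. All module names, feature dimensions, and the exact form of the attention and auxiliary loss are illustrative assumptions, not the paper's actual implementation; the paper's backbones and fusion variants may differ.

```python
# Hypothetical sketch of the two-stream sensorimotor model described above.
# Encoders are stand-ins for the deep backbones used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 14  # object types in SOR3D


class TwoStreamClassifier(nn.Module):
    def __init__(self, feat_dim=512, num_classes=NUM_CLASSES):
        super().__init__()
        # Placeholder encoders over pre-pooled inputs; in practice these
        # would be deep CNNs over RGB-D appearance and hand-object
        # interaction (affordance) sequences.
        self.appearance_enc = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.affordance_enc = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        # Per-stream heads: the appearance head also supplies the confidence
        # signal for the attention mechanism, and both heads enable the
        # auxiliary (per-stream) losses.
        self.appearance_head = nn.Linear(feat_dim, num_classes)
        self.affordance_head = nn.Linear(feat_dim, num_classes)
        self.fusion_head = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, appearance_x, affordance_x):
        app_feat = self.appearance_enc(appearance_x)
        aff_feat = self.affordance_enc(affordance_x)
        app_logits = self.appearance_head(app_feat)
        aff_logits = self.affordance_head(aff_feat)
        # Attention (one plausible reading of the abstract): weight the
        # affordance features by (1 - confidence) of the appearance stream,
        # so motor information contributes more when appearance is uncertain.
        app_conf = F.softmax(app_logits, dim=1).max(dim=1, keepdim=True).values
        fused = torch.cat([app_feat, (1.0 - app_conf) * aff_feat], dim=1)
        return self.fusion_head(fused), app_logits, aff_logits


def total_loss(fused_logits, app_logits, aff_logits, target, aux_weight=0.5):
    # Main cross-entropy on the fused prediction plus auxiliary cross-entropy
    # terms that further optimize each individual stream (an assumed form of
    # the auxiliary loss; the weighting is also assumed).
    main = F.cross_entropy(fused_logits, target)
    aux = F.cross_entropy(app_logits, target) + F.cross_entropy(aff_logits, target)
    return main + aux_weight * aux


if __name__ == "__main__":
    # Toy usage with random pooled features in place of real RGB-D input.
    model = TwoStreamClassifier()
    app = torch.randn(8, 1024)
    aff = torch.randn(8, 1024)
    y = torch.randint(0, NUM_CLASSES, (8,))
    loss = total_loss(*model(app, aff), y)
    loss.backward()
    print(float(loss))
```

Under this reading, the auxiliary terms act as deep supervision on each stream, which is one way the auxiliary loss could "further optimize both information streams" as stated above.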