In this paper, we propose a novel multi-stage deep learning methodology to tackle the problem of hand pose estimation accurately. More specifically, we first propose a disentanglement stage that separates the significant pose-specific information from the irrelevant background noise and illumination variations of RGB images. We then introduce a variational alignment stage to accurately align the latent spaces of the pose-specific information and the true hand poses, effectively improving the discrimination ability of the proposed methodology. Finally, we propose two loss terms that impose physiological and geometrical kinematic constraints on the predicted hand poses, enabling the proposed methodology to avoid implausible poses. Throughout all stages, a novel injection decoder is introduced, significantly improving the decoding accuracy of the latent space. Extensive experimentation on two well-known datasets (i.e., RHD and STB) validates the ability of the proposed methodology to achieve accurate hand pose estimation results, outperforming current state-of-the-art methods.
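To make the kinematic-constraint idea concrete, the sketch below illustrates one plausible form such loss terms could take: a physiological term penalizing deviation from reference bone lengths, and a geometrical term penalizing inter-bone flexion angles outside a valid range. The skeleton, reference lengths, angle bounds, and function names here are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Toy skeleton: joint index pairs defining the bones of one finger chain
# (hypothetical structure for illustration only).
BONES = [(0, 1), (1, 2), (2, 3)]
# Assumed reference bone lengths in normalized units.
REF_LENGTHS = np.array([0.5, 0.3, 0.2])

def bone_lengths(joints):
    """Euclidean length of each bone, given (J, 3) joint positions."""
    return np.array([np.linalg.norm(joints[b] - joints[a]) for a, b in BONES])

def kinematic_loss(joints, ref=REF_LENGTHS, lo=0.0, hi=np.pi / 2):
    """Physiological term: squared deviation from reference bone lengths.
    Geometrical term: hinge penalty on flexion angles outside [lo, hi]."""
    lengths = bone_lengths(joints)
    phys = np.mean((lengths - ref) ** 2)

    geo = 0.0
    for (a, b), (_, c) in zip(BONES[:-1], BONES[1:]):
        u = joints[b] - joints[a]
        v = joints[c] - joints[b]
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
        ang = np.arccos(np.clip(cos, -1.0, 1.0))
        # Penalize only when the inter-bone angle leaves the valid range.
        geo += max(lo - ang, 0.0) ** 2 + max(ang - hi, 0.0) ** 2
    return phys + geo / (len(BONES) - 1)

# A straight finger with the reference bone lengths incurs zero loss.
joints = np.array([[0, 0, 0], [0.5, 0, 0], [0.8, 0, 0], [1.0, 0, 0]], float)
```

In a training pipeline, such terms would typically be added (with weighting factors) to the primary pose-regression loss, steering the network toward anatomically plausible predictions.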