3D Pose Estimation Using a Global and Local Cross-Attention Mechanism

Abstract

The task of 3D human pose and shape estimation involves the accurate prediction of 3D joint coordinates from a single image or a video sequence and is crucial in several computer vision fields, such as sign language recognition, human-computer interaction, and autonomous vehicles. Existing methodologies typically model global and local temporal relationships among image frames without paying much attention to the interaction between these relationships or to modeling the input space in other manifolds that possess important statistical and geometrical properties. This work proposes a novel multi-stage 3D pose estimation method that seamlessly combines global and local temporal modeling through self-attention mechanisms operating on multiple manifolds, thus leveraging the ability of different manifolds to model complementary features of the input space. By extracting global and local attention maps and fusing them with a novel cross-attention mechanism, the proposed method enhances contextual understanding and improves the model's capacity to capture the intricate human motion dynamics present in a video sequence. Experimental results on two challenging datasets, namely 3DPW and MPI-INF-3DHP, confirm the effectiveness of the proposed method in achieving precise 3D pose and shape estimates across successive frames.