Motion Analysis: Action Detection, Recognition and Evaluation based on motion capture data

On this page, you can find information and supplementary videos for the paper entitled “Motion Analysis: Action Detection, Recognition and Evaluation based on motion capture data”.

Abstract

A novel motion analysis framework for real-time action detection, recognition and evaluation of motion capture data is presented in this paper. Pose and kinematics information is used for data description, while automatic and dynamic weighting is also applied, altering the significance of joint data based on action involvement. The Bag-of-Gesturelets (BoG) model is employed for data representation, and kinetic-energy-based descriptor sampling is performed before codebook construction. The automatically segmented and recognized action instances are subsequently fed to the framework's evaluation component, which compares them with the corresponding reference ones and estimates their similarity. Exploiting fuzzy logic, the framework then gives semantic feedback on the limbs whose motion needs to be altered and the ways in which this should be done. Experimental results on two benchmark datasets and a new, publicly available one provide evidence that the proposed framework can be effectively used for unsupervised gesture/action training.

[Figure: Overall framework diagram]
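
The exact sampling strategy is defined in the paper itself; purely as a rough, non-authoritative illustration of what kinetic-energy-based descriptor sampling can look like, the following Python sketch keeps only the frames with the highest kinetic-energy proxy. Unit joint masses, a 30 fps capture rate and the `keep_ratio` parameter are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def kinetic_energy(positions, fps=30.0):
    """Per-frame kinetic-energy proxy for a skeleton sequence.

    positions: array of shape (frames, joints, 3) with joint coordinates.
    Returns an array of shape (frames,) with the summed squared joint
    velocities (unit masses assumed), a common proxy for kinetic energy.
    """
    velocities = np.diff(positions, axis=0) * fps           # (frames-1, joints, 3)
    energy = 0.5 * (velocities ** 2).sum(axis=(1, 2))       # sum over joints and xyz
    return np.concatenate([[0.0], energy])                  # pad the first frame

def sample_high_energy_frames(positions, keep_ratio=0.5, fps=30.0):
    """Keep the frames with the highest kinetic energy, so that descriptors
    are extracted only where the motion is informative (illustrative only)."""
    energy = kinetic_energy(positions, fps)
    k = max(1, int(keep_ratio * len(energy)))
    keep = np.argsort(energy)[-k:]
    return np.sort(keep)
```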

Sample key poses of all the exercises constituting the CVD dataset introduced in this paper are presented in the following tables, while the entire pre-segmented dataset can be found here.

[Table: Sample key poses of the CVD dataset exercises – CVD1]

[Table: Sample key poses of the CVD dataset exercises – CVD2]


Experimental Results – Human action detection/recognition

A closer look at the following two tables shows that in both cases the proposed approach outperforms [13] on the CVD dataset, while using only half of the descriptors employed by the latter, selected according to their kinetic energy, and while weighting the joints automatically and dynamically. This highlights both the discriminative power of the weighting scheme and the importance of exploiting motion information.
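
The precise weighting scheme is the paper's own; as a hedged sketch of the general idea only, the snippet below assigns each joint a weight proportional to its kinetic-energy contribution within an action instance, so that joints actively involved in the action dominate the description. The function name and the normalization are illustrative assumptions.

```python
import numpy as np

def joint_weights_from_energy(positions, fps=30.0, eps=1e-8):
    """Illustrative dynamic joint weighting: joints that move more within an
    action instance receive larger weights.

    positions: (frames, joints, 3) joint coordinates of one action instance.
    Returns a (joints,) weight vector that sums to 1.
    """
    velocities = np.diff(positions, axis=0) * fps                 # (frames-1, joints, 3)
    per_joint_energy = 0.5 * (velocities ** 2).sum(axis=(0, 2))   # (joints,)
    return per_joint_energy / (per_joint_energy.sum() + eps)
```

In a BoG-style representation, weights of this kind could, for instance, scale the per-joint descriptor contributions before codebook assignment; how the weights actually enter the representation is specified in the paper.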

Comparison results of action recognition experiments on the pre-segmented CVD dataset.



Comparison results of action recognition experiments on the CVD dataset after automatic segmentation.


The mean F-scores reported in the first row of the next table were obtained by counting as accurate those detections triggered within 10 frames (i.e., 333 ms) of the ground-truth segment end, while in the second row the overlap between the ground-truth action interval and the detected one determines whether a detection is considered positive. Again, the proposed approach outperforms [13].
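
As an illustration of the two criteria described above, the sketch below implements a latency test (detection within 10 frames of the ground-truth segment end) and an interval-overlap test. Whether the overlap ratio is computed as intersection-over-union or relative to the ground-truth interval is defined in the paper; the IoU variant used here is only one plausible choice.

```python
def is_positive_by_latency(detected_end, gt_end, tolerance_frames=10):
    """Latency criterion: a detection counts as positive if it fires within
    `tolerance_frames` (10 frames = 333 ms at 30 fps) of the ground-truth end."""
    return abs(detected_end - gt_end) <= tolerance_frames

def is_positive_by_overlap(det_start, det_end, gt_start, gt_end, min_overlap=0.2):
    """Overlap criterion: a detection counts as positive if the intersection-
    over-union of the detected and ground-truth intervals reaches `min_overlap`."""
    inter = max(0, min(det_end, gt_end) - max(det_start, gt_start))
    union = max(det_end, gt_end) - min(det_start, gt_start)
    return union > 0 and inter / union >= min_overlap
```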

Detection mean F-score results on the CVD dataset.



Concerning the MSRC-12 dataset, it can be observed that the proposed approach outperforms both [17] and [13] in all modalities, achieving state-of-the-art results. Furthermore, it should be noted that the mean F-score standard deviations of our method are the smallest reported, which is indicative of its robustness.

Comparison of mean F-score and standard deviation results of action detection experiments on the different modalities of the MSRC-12 dataset at a 0.2 overlap ratio.



Experimental Results – Human motion evaluation

The highest position and velocity errors are both detected at frame 26, so the extracted semantic feedback instructs the user on how to perform the Draw X action in order to achieve higher similarity with the reference motion at this time instance. Indeed, at frame 26, where the highest errors of the spatiotemporally aligned actions are detected, the two actions differ significantly.

[Figure: Draw X]
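
The paper's evaluation component computes these errors on spatiotemporally aligned sequences; assuming the alignment has already been performed and both sequences have the same number of frames, a minimal sketch of per-frame position and velocity errors could look as follows (joint coordinates of shape (frames, joints, 3), 30 fps assumed).

```python
import numpy as np

def per_frame_errors(test, reference, fps=30.0):
    """Per-frame position and velocity errors between two temporally aligned
    skeleton sequences.

    test, reference: arrays of shape (frames, joints, 3).
    Returns two (frames,) arrays: the mean joint position error and the mean
    joint velocity error per frame.
    """
    pos_err = np.linalg.norm(test - reference, axis=2).mean(axis=1)
    v_test = np.diff(test, axis=0) * fps
    v_ref = np.diff(reference, axis=0) * fps
    vel_err = np.linalg.norm(v_test - v_ref, axis=2).mean(axis=1)
    vel_err = np.concatenate([[0.0], vel_err])   # align lengths with pos_err
    return pos_err, vel_err
```

The frame index with the largest error (e.g., frame 26 for Draw X above) is then the natural anchor for the semantic feedback.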

In the case of Forward Kick, the main errors are detected in different temporal phases of the action: the maximum position error occurs at frame 7 and the maximum velocity error at frame 11.

[Figure: Forward Kick]

After spatiotemporal alignment, the errors of most joints are numerically close to the maximum mean joint error. In the Standing gluteus medius case, the mean normalized position and velocity errors are high, and semantic feedback is therefore provided for further performance improvement. Nevertheless, based on the assumption that similarly performed actions exhibit a mean normalized error close to the maximum mean joint error together with a high standard deviation, the fuzzy engine infers that the similarity is high.

[Figure: Standing gluteus medius]
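
The actual rule base of the fuzzy engine is specified in the paper and also takes the error spread across joints and the velocity errors into account; purely as an illustration of how fuzzy membership functions can map a normalized error to a similarity label, consider the toy sketch below. The membership breakpoints are arbitrary assumptions.

```python
def triangular(x, a, b, c):
    """Triangular membership function peaking at b on the interval [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def similarity_label(mean_norm_error):
    """Toy fuzzy-style mapping from a mean normalized joint error in [0, 1]
    to a similarity label, using three overlapping membership functions."""
    memberships = {
        "high similarity":   triangular(mean_norm_error, -0.01, 0.0, 0.4),
        "medium similarity": triangular(mean_norm_error, 0.2, 0.5, 0.8),
        "low similarity":    triangular(mean_norm_error, 0.6, 1.0, 1.01),
    }
    return max(memberships, key=memberships.get)
```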

Erroneous motion capture occurred due to self-occlusion when bending; thus the detected joint errors are high and, therefore, the similarity is low.

[Figure: Bend]


More details regarding the evaluated actions can be found in the video below: