Author: Seidel, Johannes
Supervisor: Prof. Gudrun Klinker
Advisor: Eichhorn, Christian (@ga73wuj)
Submission Date: [created]

Abstract

In recent years, interest in markerless human pose estimation has grown across several research fields. Flexible yet accurate pose estimation that is not confined to the closed indoor environment required for infrared marker tracking opens new opportunities for researchers in fields such as sports analytics, human movement sciences, and augmented reality. Several solutions have been developed for this purpose, working in 2D and 3D space. While 3D solutions exist that either directly regress the 3D coordinates or infer depth from 2D predictions, this work focuses on reconstructing 3D poses from 2D predictions in multiple views. The goal of this thesis is therefore to validate and evaluate the performance of 2D human pose estimation systems deployed in a multi-view setting. In contrast to previous work, the deep neural network used here is selected and specifically trained on the basis of its particular qualities for the use case of sports analytics. Moreover, this work assesses the impact of the pose model on 3D reconstruction as part of a composite system, since most previous studies base their assumptions about the potential of the learning-based pose predictor on the final output without further consideration. HRNet-W48 was chosen for further experiments because of its ability to identify accurate local features in a global context without the need for a superimposed body model. A custom sports dataset was created from an existing popular 2D human pose estimation dataset to train this model for use in sports analysis. The model was then trained using several techniques, ranging from transfer learning to various data processing approaches and different loss functions. The final model achieved an accuracy of 84.0 AP on the sports dataset. The model was then deployed in a multi-view setting using four synchronized cameras.
The 2D poses predicted by the model from each viewpoint were used to reconstruct the 3D pose of a subject performing various sports movements. A simple Direct Linear Transformation (DLT) algorithm was applied, with a mean reconstruction error of 9.5 mm. The reconstructed 3D pose was then compared to the results of a marker-based motion capture system. The model achieved a mean error of 49.1 mm across all joints after a Procrustes alignment of the two poses. While systematic errors in predicting the locations of the hips were confirmed, the greatest difficulties were encountered in accurately locating the wrists and ankles. The extreme outliers on the joints further down the kinematic chain indicated the importance of a sufficient number of cameras for the 3D reconstruction. They also showed how difficult it is to compare the results of the pose model with those of other studies without taking into account the methods of camera calibration and reconstruction, outlier detection, and other post-processing, and especially the number of cameras used. However, comparing the specially selected and trained model with previously evaluated models in the same environment shows an error that is approximately 12 mm lower, demonstrating that the choice of model can have a significant impact on performance.
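The two geometric steps named in the abstract, DLT triangulation from multiple views and a Procrustes-aligned joint error, can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the function names `triangulate_dlt` and `procrustes_aligned_error` are hypothetical, the 3x4 projection matrices are assumed to come from a prior camera calibration, and the alignment is assumed to be a similarity transform (rotation, uniform scale, translation), which the abstract does not specify.

```python
import numpy as np


def triangulate_dlt(projections, points_2d):
    """Triangulate one 3D point from two or more views with linear DLT.

    projections: sequence of 3x4 camera projection matrices (assumed to be
                 known from calibration).
    points_2d:   sequence of (u, v) pixel coordinates, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        P = np.asarray(P, dtype=float)
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The least-squares solution is the right singular vector that belongs
    # to the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def procrustes_aligned_error(pred, ref):
    """Mean per-joint distance after aligning pred onto ref.

    pred, ref: arrays of shape (J, 3) with corresponding joints.
    The alignment is a similarity transform (assumption, see lead-in).
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    mu_p, mu_r = pred.mean(axis=0), ref.mean(axis=0)
    p0, r0 = pred - mu_p, ref - mu_r
    # Optimal rotation from the SVD of the cross-covariance (Kabsch/Umeyama).
    U, S, Vt = np.linalg.svd(p0.T @ r0)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.array([1.0, 1.0, d])  # guards against a reflection
    R = Vt.T @ np.diag(D) @ U.T
    scale = (S * D).sum() / (p0 ** 2).sum()
    aligned = scale * p0 @ R.T + mu_r
    return np.linalg.norm(aligned - ref, axis=1).mean()
```

With four cameras, `triangulate_dlt` would be called once per joint and frame with four projection matrices and four 2D predictions. The resulting joint array could then be passed to `procrustes_aligned_error` together with the marker-based reference pose, provided both use the same joint order and units.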
