Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video

Runyang Feng
Yixing Gao*
Xueqing Ma
Tze Ho Elden Tse
Hyung Jin Chang

Jilin University, University of Birmingham
* Corresponding Author
CVPR 2023



Methods that rely directly on optical flow can be distracted by irrelevant cues such as background and blur (a), and sometimes fail in scenes with fast motion and mutual occlusion (b). Our proposed framework uses temporal difference encoding and useful information disentanglement to capture more tailored temporal dynamics (c), yielding more robust pose estimates (d).


Temporal modeling is crucial for multi-frame human pose estimation. Most existing methods directly employ optical flow or deformable convolution to predict full-spectrum motion fields, which can introduce numerous irrelevant cues, such as a nearby person or the background. Without further effort to extract meaningful motion priors, their results are sub-optimal, especially under complicated spatiotemporal interactions. The temporal difference, on the other hand, can encode representative motion information that is potentially valuable for pose estimation, but it has not been fully exploited. In this paper, we present a novel multi-frame human pose estimation framework that employs temporal differences across frames to model dynamic contexts and uses mutual information as an objective to disentangle useful motion information. Specifically, we design a multi-stage Temporal Difference Encoder that performs incremental cascaded learning conditioned on multi-stage feature difference sequences to derive an informative motion representation. We further propose a Representation Disentanglement module from the mutual information perspective, which captures discriminative task-relevant motion signals by explicitly defining the useful and noisy constituents of the raw motion features and minimizing their mutual information. With these components, our method ranks No. 1 in the Crowd Pose Estimation in Complex Events Challenge on the HiEve benchmark dataset and achieves state-of-the-art performance on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21.
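
For readers less familiar with the quantity being minimized, the standard definition of mutual information can be written as below. The symbols M_u (useful motion constituent) and M_n (noisy motion constituent) are illustrative names assumed here, not notation taken from this page, and the paper's full training objective may contain further terms.

```latex
I(M_u; M_n)
  = \mathbb{E}_{p(m_u, m_n)}
    \left[ \log \frac{p(m_u, m_n)}{p(m_u)\, p(m_n)} \right]
  \;\ge\; 0
```

The quantity is zero exactly when the two constituents are statistically independent, so driving it toward zero encourages the useful part to share no information with the noisy part.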


Overall pipeline of the proposed framework. The goal is to detect the human pose in the key frame. Given an input sequence, we first extract the visual features of each frame. Our multi-stage Temporal Difference Encoder takes these features as input and outputs the motion feature. This motion feature is then passed to the Representation Disentanglement module, which performs useful information disentanglement. Finally, the useful motion feature and the visual feature are combined to obtain the final pose estimation.
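
The first step of this pipeline, forming feature differences across frames, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names `temporal_differences` and `average_motion`, the tensor layout `(T, C, H, W)`, the choice to difference each supporting frame against the key frame, and the plain averaging that stands in for the learned multi-stage encoder. It is not the authors' implementation.

```python
import numpy as np


def temporal_differences(features, key_index):
    """Subtract the key-frame feature from every supporting-frame feature.

    features: array of shape (T, C, H, W), one feature map per frame
              (hypothetical layout chosen for this sketch).
    Returns an array of shape (T - 1, C, H, W).
    """
    key = features[key_index]
    # Drop the key frame itself so that only supporting frames remain.
    support = np.delete(features, key_index, axis=0)
    return support - key


def average_motion(differences):
    """Placeholder aggregation: mean over the temporal axis.

    The framework described above uses a learned multi-stage encoder here;
    averaging is only a stand-in so the sketch runs end to end.
    """
    return differences.mean(axis=0)


rng = np.random.default_rng(0)
features = rng.standard_normal((5, 8, 16, 12))  # 5 frames, 8 channels
diffs = temporal_differences(features, key_index=2)
motion = average_motion(diffs)
print(diffs.shape, motion.shape)
```

A static sequence, in which every frame has the same features, produces all-zero differences, which is the sense in which the temporal difference isolates motion from appearance.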

Qualitative Results

Visual results of our TDMI framework on benchmark datasets. The examples include challenging scenes such as fast motion and pose occlusion.


This work is supported in part by the National Natural Science Foundation of China under grant No. 62203184. This work is also supported in part by the MSIT, Korea, under the ITRC program (IITP-2022-2020-0-01789) (50%) and the High-Potential Individuals Global Training Program (RS2022-00155054) (50%) supervised by the IITP.