Pose-dIVE diversifies the camera viewpoints and human poses of a Re-ID dataset to improve the generalization and performance of arbitrary Re-ID models.

Abstract

Person re-identification (Re-ID) often faces challenges due to variations in human poses and camera viewpoints, which significantly affect the appearance of individuals across images. Existing datasets frequently lack diversity and scalability in these aspects, hindering the generalization of Re-ID models to new camera systems or environments. To overcome this, we propose Pose-dIVE, a novel data augmentation approach that incorporates sparse and underrepresented human pose and camera viewpoint examples into the training data, addressing the limited diversity in the original training data distribution. Our objective is to augment the training dataset to enable existing Re-ID models to learn features unbiased by human pose and camera viewpoint variations. By conditioning the diffusion model on both the human pose and camera viewpoint through the SMPL model, our framework generates augmented training data with diverse human poses and camera viewpoints. Experimental results demonstrate the effectiveness of our method in addressing human pose bias and enhancing the generalizability of Re-ID models compared to other data augmentation-based Re-ID approaches.



Motivation and Overview



Visualization of the effect of viewpoint and human pose augmentation. We compare visualizations of the camera viewpoint and human pose distributions for the Market-1501 dataset. The left figures (i) display the camera viewpoint distribution derived from SMPL, while the right figures (ii) illustrate the pose distribution. In (i), from left to right, we show the viewpoint distributions of the training dataset, the augmented dataset, and the combination of both. Similarly, in (ii), from left to right, we present t-SNE visualizations of the human pose distributions, showing poses from the training dataset followed by augmented poses sourced from outside the dataset. These visualizations demonstrate that our augmentation successfully diversifies both the viewpoint and human pose distributions.


Pose-dIVE Framework


Upon observing the highly biased viewpoint and human pose distributions in the existing training dataset, we augment the dataset by manipulating SMPL body shapes and feeding the rendered shapes into a generative model to fill in sparsely distributed poses and viewpoints. With this augmented dataset, we can train a Re-ID model that is robust to viewpoint and human pose biases.


The Effect of Viewpoint and Human Pose Augmentation


Quantitative validation of Pose-dIVE augmentation strategies. We conduct an ablation study using CLIP-ReID to verify the effectiveness of our viewpoint and human pose augmentation. (I) serves as the baseline, a Re-ID model trained on the original dataset without our augmentation. (II) and (III) augment viewpoints and human poses, respectively, while (IV) applies our full augmentation strategy, combining both.


Qualitative Results



Example images from the augmented MSMT17 and Market-1501 datasets demonstrate that the generated images preserve the original identities while maintaining realism and consistency with the Re-ID dataset.

Quantitative Results



Since our generative augmentation can be applied to any Re-ID model, we train two recent state-of-the-art baselines (CLIP-ReID and SOLIDER) with Pose-dIVE.

BibTeX

@article{kim2024pose,
    title={Pose-dIVE: Pose-Diversified Augmentation with Diffusion Model for Person Re-Identification},
    author={Kim, In{\`e}s Hyeonsu and Lee, JoungBin and Jin, Woojeong and Son, Soowon and Cho, Kyusun and Seo, Junyoung and Kwak, Min-Seop and Cho, Seokju and Baek, JeongYeol and Lee, Byeongwon and others},
    journal={arXiv preprint arXiv:2406.16042},
    year={2024}
}

Acknowledgements

The website template was borrowed from Michaël Gharbi.