Paper Presentations
1. CondiMen: Conditional Multi-Person Mesh Recovery
Abstract: Multi-person human mesh recovery (HMR) consists of detecting all individuals in a given input image and predicting the body shape, pose, and 3D location of each detected person. The dominant approaches to this task rely on neural networks trained to output a single prediction for each detected individual. In contrast, we propose CondiMen, a method that outputs a joint parametric distribution over likely poses, body shapes, camera intrinsics, and distances to the camera, using a Bayesian network. This approach offers several advantages. First, a probability distribution can handle some inherent ambiguities of this task, such as the uncertainty between a person's size and their distance to the camera, or more generally the loss of information that occurs when projecting 3D data onto a 2D image. Second, the output distribution can be combined with additional information to produce better predictions, e.g. by using known camera or body shape parameters, or by exploiting multi-view observations. Third, one can efficiently extract the most likely predictions from this output distribution, making the proposed approach suitable for real-time applications. Empirically, we find that our model i) achieves performance on par with or better than the state of the art, ii) captures uncertainties and correlations inherent in pose estimation, and iii) can exploit additional information at test time, such as multi-view consistency or body shape priors. CondiMen spices up the modeling of ambiguity, using just the right ingredients on hand.
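To make the size/distance ambiguity mentioned in the abstract concrete, here is a minimal sketch (not the CondiMen model; the two-variable Gaussian and all numbers are assumptions chosen purely for illustration) of how conditioning a joint distribution on a known body shape parameter sharpens the distance estimate:

```python
# Illustrative sketch (not the CondiMen model): conditioning a joint Gaussian
# over (body height H, distance to camera Z) on a known body height.
# All numbers below are made up for the example.
import numpy as np

# Joint prior over [H, Z]: under a monocular observation, tall people far away
# look like short people up close, so the two variables are strongly correlated.
mu = np.array([1.70, 4.00])            # mean height (m), mean distance (m)
cov = np.array([[0.010, 0.045],
                [0.045, 0.250]])       # strong positive correlation

def condition_gaussian(mu, cov, idx_obs, value):
    """Condition a 2D joint Gaussian on an exact observation of one variable."""
    idx_free = 1 - idx_obs
    mu_cond = mu[idx_free] + cov[idx_free, idx_obs] / cov[idx_obs, idx_obs] * (value - mu[idx_obs])
    var_cond = cov[idx_free, idx_free] - cov[idx_free, idx_obs] ** 2 / cov[idx_obs, idx_obs]
    return mu_cond, var_cond

# Without extra information the distance estimate is very uncertain...
print("prior Z:", mu[1], "+/-", np.sqrt(cov[1, 1]))
# ...but conditioning on a known body shape (here, height = 1.85 m) sharpens it.
z_mean, z_var = condition_gaussian(mu, cov, idx_obs=0, value=1.85)
print("conditioned Z:", z_mean, "+/-", np.sqrt(z_var))
```

The paper's Bayesian network covers many more variables than this toy pair, but the test-time conditioning principle it exploits (known camera or body shape parameters narrowing the output distribution) is of this kind.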
2. Physics-based Human Pose Estimation from a Single Moving RGB Camera
Abstract: Most monocular, physics-based human pose tracking methods, while achieving state-of-the-art results, suffer from artifacts when the scene does not have a strictly flat ground plane or when the camera is moving. Moreover, these methods are often evaluated on in-the-wild real-world videos without ground-truth data, or on synthetic datasets that fail to model real-world light transport, camera motion, and pose-induced appearance and geometry changes. To tackle these two problems, we introduce MoviCam, the first non-synthetic dataset containing ground-truth camera trajectories of a dynamically moving monocular RGB camera, scene geometry, and 3D human motion with human-scene contact labels. Additionally, we propose PhysDynPose, a physics-based method that incorporates scene geometry and physical constraints for more accurate human motion tracking in the presence of camera motion and non-flat scenes. More precisely, we use a state-of-the-art kinematics estimator to obtain the human pose and a robust SLAM method to capture the dynamic camera trajectory, enabling the recovery of the human pose in the world frame. We then refine the kinematic pose estimate using our scene-aware physics optimizer. On our new benchmark, we find that even state-of-the-art methods struggle with this inherently challenging setting, i.e. a moving camera and non-planar environments, while our method robustly estimates both human and camera poses in world coordinates.
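The world-frame recovery step described above (kinematic pose from the image, camera pose from SLAM) amounts to a per-frame rigid transform. The snippet below is an illustrative sketch with made-up shapes, names, and values, not the PhysDynPose code:

```python
# Illustrative sketch of the pipeline step described above: lifting per-frame
# camera-frame joint estimates into a common world frame using the camera pose
# recovered by SLAM. Shapes and names here are assumptions, not the paper's API.
import numpy as np

def camera_to_world(joints_cam, R_wc, t_wc):
    """Transform 3D joints from camera coordinates to world coordinates.

    joints_cam : (J, 3) joint positions in the camera frame
    R_wc       : (3, 3) rotation of the camera frame expressed in the world frame
    t_wc       : (3,)   camera position in the world frame
    """
    return joints_cam @ R_wc.T + t_wc

# Toy example: a person 3 m in front of a camera that SLAM places 1.5 m above
# the world origin; a real SLAM trajectory would give a full rotation per frame.
R_wc = np.eye(3)
t_wc = np.array([0.0, 0.0, 1.5])
joints_cam = np.array([[0.0, 0.0, 3.0],    # pelvis, 3 m along the camera axis
                       [0.2, 0.0, 3.0]])   # one more joint, slightly offset
print(camera_to_world(joints_cam, R_wc, t_wc))
```

The physics optimizer then refines these world-frame estimates against the scene geometry and contact constraints; that refinement is not sketched here.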
3. Short-term 3D Human Mesh Recovery with Virtual Markers Disentanglement
Abstract: Human mesh recovery is a fundamental and challenging task in computer vision. Existing image-based methods suffer from depth ambiguity due to the absence of explicit 3D contextual information. Conversely, video-based methods leverage multi-view input and temporal consistency to improve stability, but struggle to capture fine-grained spatial details and incur high computational costs. To effectively combine the spatial precision of image-based techniques with the temporal robustness of video-based approaches, we propose a temporal Transformer framework augmented with the state-of-the-art image-based reconstruction model, Virtual Markers. Specifically, we introduce a novel disentanglement module designed to explicitly separate Virtual Markers into distinct pose and shape representations. Leveraging short-term temporal context, the proposed module improves body-shape consistency and pose coherence across frames, ensuring both spatial accuracy and computational efficiency. Experimental results demonstrate that the proposed method significantly enhances the performance and interpretability of virtual markers. Our model achieves state-of-the-art results on two widely used benchmark datasets, outperforming previous image-based approaches across different evaluation metrics.
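As a rough illustration of what a disentanglement module over virtual markers could look like (an assumption-laden PyTorch sketch, not the paper's architecture), per-frame markers are embedded and split into a pose stream and a shape stream, and a small temporal Transformer aggregates the shape stream over a short window:

```python
# Minimal sketch (assumptions, not the paper's design): per-frame virtual markers
# are embedded, split into pose and shape codes, and a short temporal Transformer
# pools the shape codes into a single, clip-level shape estimate.
import torch
import torch.nn as nn

class MarkerDisentangler(nn.Module):
    def __init__(self, n_markers=64, d_model=256):
        super().__init__()
        self.embed = nn.Linear(n_markers * 3, d_model)      # flatten (x, y, z) markers
        self.to_pose = nn.Linear(d_model, d_model)           # per-frame pose code
        self.to_shape = nn.Linear(d_model, d_model)           # per-frame shape code
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, markers):
        # markers: (batch, window, n_markers, 3)
        b, t, m, _ = markers.shape
        feats = self.embed(markers.reshape(b, t, m * 3))      # (b, t, d_model)
        pose = self.to_pose(feats)                            # frame-wise pose codes
        shape = self.temporal(self.to_shape(feats)).mean(1)   # one shape code per clip
        return pose, shape

model = MarkerDisentangler()
pose, shape = model(torch.randn(2, 8, 64, 3))                 # 8-frame window
print(pose.shape, shape.shape)
```

Pooling the shape stream over the short window is one way to enforce the kind of body-shape consistency the abstract describes while keeping pose frame-specific.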
4. PoseSynViT: Lightweight and Scalable Vision Transformers for Human Pose Estimation
Abstract: Vision transformers (ViTs) have consistently delivered outstanding results in visual recognition tasks without needing specialized domain knowledge. Nevertheless, their application to human pose estimation (HPE) tasks remains underexplored. This paper introduces PoseSynViT, a new lightweight ViT model that surpasses ViTPose in several areas, including simplicity of model architecture, scalability, training versatility, and ease of knowledge transfer. Our model uses ViTs as backbones to extract features for HPE and integrates a lightweight decoder. It scales efficiently from 10M to 1B parameters, taking advantage of the inherent scalability and high parallelism of transformers, and sets a new benchmark for the trade-off between throughput and performance. PoseSynViT is highly adaptable, supporting various attention mechanisms, input resolutions, and training approaches, and is capable of handling multiple HPE tasks. Additionally, we demonstrate that knowledge from larger models can be seamlessly transferred to smaller ones through a straightforward knowledge token. Experimental results on the MS COCO benchmark show that PoseSynViT outperforms current methods, with our largest model setting a new state-of-the-art result of 84.3 AP on the MS COCO test set.
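The "knowledge token" transfer is only described at a high level in the abstract; the PyTorch sketch below is one plausible reading (all module names, sizes, and the loss are assumptions): a learnable token prepended to the student's patch tokens carries the prediction that is distilled against a larger teacher's output.

```python
# Hedged sketch of token-based knowledge transfer; everything below is an
# illustrative guess, not the PoseSynViT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentWithKnowledgeToken(nn.Module):
    def __init__(self, d_model=192, n_joints=17):
        super().__init__()
        self.knowledge_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=3, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_joints * 2)    # 2D joint coordinates

    def forward(self, patch_tokens):
        # patch_tokens: (batch, n_patches, d_model), e.g. from a small ViT backbone
        tok = self.knowledge_token.expand(patch_tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([tok, patch_tokens], dim=1))
        return self.head(x[:, 0])                       # predict from the knowledge token

student = StudentWithKnowledgeToken()
patches = torch.randn(4, 196, 192)                      # dummy 14x14 patch grid
student_pred = student(patches)
with torch.no_grad():
    teacher_pred = torch.randn(4, 34)                   # stand-in for a large teacher's output
distill_loss = F.mse_loss(student_pred, teacher_pred)   # transfers the teacher's knowledge
print(student_pred.shape, distill_loss.item())
```

Routing the distillation signal through a single dedicated token keeps the student's backbone unchanged, which is consistent with the "straightforward knowledge token" framing in the abstract.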
Challenge Winners
Human Reconstruction track: 1st place, 2nd place
Joint Human-Object Reconstruction track: 1st place, 2nd place
3D Contact Estimation track: 1st place, 2nd place, 3rd place
Contact Info
E-mail: rhobinchallenge@gmail.com