CVPR 2025
Generating human motion guided by conditions such as textual descriptions is challenging due to the need for datasets with pairs of high-quality motion and their corresponding conditions. The difficulty increases when aiming for finer control in the generation. To that end, prior works have proposed to combine several motion diffusion models pre-trained on datasets with different types of conditions, thus allowing control with multiple conditions. However, the proposed merging strategies overlook that the optimal way to combine the generation processes might depend on the particularities of each pre-trained generative model and also the specific textual descriptions. In this context, we introduce MixerMDM, the first learnable model composition technique for combining pre-trained text-conditioned human motion diffusion models. Unlike previous approaches, MixerMDM provides a dynamic mixing strategy that is trained in an adversarial fashion to learn to combine the denoising process of each model depending on the set of conditions driving the generation. By using MixerMDM to combine single- and multi-person motion diffusion models, we achieve fine-grained control on the dynamics of every person individually, and also on the overall interaction. Furthermore, we propose a new evaluation technique that, for the first time in this task, measures the interaction and individual quality by computing the alignment between the mixed generated motions and their conditions as well as the capabilities of MixerMDM to adapt the mixing throughout the denoising process depending on the motions to mix.
MixerMDM creates new motion sequences by seamlessly blending, at each step of the denoising process, motions produced by two pre-trained models. This fusion is driven by the Mixing procedure, guided by a mixing weight dynamically predicted by the Mixer. Trained through a novel Adversarial Training approach, MixerMDM delivers consistent and highly controllable mixed motions that preserve the essential characteristics of the original pre-trained models, outperforming all prior methods.
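The per-step blending described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the mixing is taken to be a linear interpolation with a single scalar weight, the sampler update is reduced to carrying the mixed prediction forward, and the names `mix_motions`, `mixed_denoising`, `denoise_a`, `denoise_b`, and `mixer` are hypothetical.

```python
from typing import Callable, List

# Toy representation (assumption): a motion is a list of frames,
# and each frame is a list of pose features.
Motion = List[List[float]]
Denoiser = Callable[[Motion, int], Motion]
Mixer = Callable[[Motion, Motion, int], float]


def mix_motions(motion_a: Motion, motion_b: Motion, weight: float) -> Motion:
    """Linearly blend two motions.

    weight = 1 keeps only motion_a, weight = 0 keeps only motion_b.
    The linear form is an assumption made for this sketch.
    """
    w = min(max(weight, 0.0), 1.0)  # keep the weight inside [0, 1]
    return [
        [w * a + (1.0 - w) * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(motion_a, motion_b)
    ]


def mixed_denoising(
    denoise_a: Denoiser,
    denoise_b: Denoiser,
    mixer: Mixer,
    x_init: Motion,
    num_steps: int,
) -> Motion:
    """Run a simplified denoising loop that mixes two models at every step.

    denoise_a and denoise_b stand in for the two pre-trained models, and
    mixer stands in for the learned Mixer that predicts the weight.
    A real diffusion sampler would also apply its noise schedule at each
    step; that part is omitted here.
    """
    x = x_init
    for t in reversed(range(num_steps)):
        pred_a = denoise_a(x, t)        # prediction of the first model
        pred_b = denoise_b(x, t)        # prediction of the second model
        w = mixer(pred_a, pred_b, t)    # weight predicted for this step
        x = mix_motions(pred_a, pred_b, w)
    return x
```

Calling `mixed_denoising` with two stand-in denoisers and a mixer that returns a constant reproduces a fixed-weight blend. A mixer whose output depends on its inputs and on the step `t` changes the blend throughout the denoising process, which is the dynamic behaviour the paragraph above attributes to the Mixer.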
Quantitatively, we have developed a robust methodology to assess model composition techniques, under which MixerMDM outperforms all previous methods in both Alignment and Adaptability. Qualitatively, MixerMDM stands out for its consistency in producing mixed motions that align with their conditioning. MixerMDM also excels at controllability, generating fine-grained individual variations of interaction motions. We have validated all these claims through an extensive user study.
[Video comparison of MixerMDM, Diff.Blending, DualMDM, in2IN, and Finetuned on two sets of conditions]
Example 1
Text Interaction: Two persons are in a boxing match when suddenly one person throws a kick
Text Individual 1: An individual throws a kick with his right leg
Text Individual 2: An individual is boxing
Example 2
Text Interaction: Two people salute to each other
Text Individual 1: An individual bows forward
Text Individual 2: An individual raises their right arm and waves it
[Video comparison of MixerMDM and DualMDM, three videos per method, on two sets of conditions]
Example 1
Text Interaction: Two persons are in a boxing match when suddenly one person throws a kick
Text Individual 1: An individual throws a kick with his right leg
Text Individual 2: An individual is boxing
Example 2
Text Interaction: Two people salute to each other
Text Individual 1: An individual bows forward
Text Individual 2: An individual raises their right arm and waves it
@misc{ruizponce2025mixermdmlearnablecompositionhuman,
title={MixerMDM: Learnable Composition of Human Motion Diffusion Models},
author={Pablo Ruiz-Ponce and German Barquero and Cristina Palmero and Sergio Escalera and José García-Rodríguez},
year={2025},
eprint={2504.01019},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.01019},
}