Joint outputs, example generated caption: “A group of people with bicycles in the grass.”
MUNI jointly learns modality-specific encoders, expressive decoders, and a single shared flow-based prior so one model can generate any missing modality from any observed subset, including the fully unconditional case.
Unconditional image-text-audio co-generation: each triplet is sampled jointly from MUNI’s learned prior.
We introduce MUNI, an end-to-end multimodal latent diffusion framework for any-to-any generation that unifies subset-conditioned cross-modal generation and unconditional joint sampling through a shared stochastic latent. Existing multimodal generative models are largely LLM-based, which limits their ability to leverage modality-specific generators and requires text-paired training data. Recent diffusion- and flow-based any-to-any extensions take a different direction but still rely on text-aligned embeddings, fully paired training, or matched-dimensionality deterministic mappings. MUNI rests on two complementary contributions, one architectural and one in the training objective. First, we extend latent diffusion to multimodal any-to-any generation end-to-end: instead of the standard two-stage recipe that precomputes a frozen latent space and then fits a prior over it, MUNI jointly trains modality-specific encoders, expressive decoders, and a single shared flow-based prior under one objective. Second, we identify that the standard aggregation rules of multimodal variational inference are insufficient once coupled with a learned prior and expressive decoders. A suitable shared latent must simultaneously satisfy coherence across generated modalities, predictive sufficiency of subset latents, and minimality of the latent content. We propose a routed training objective whose structural choices align the latent with these criteria and admit a minimal-sufficiency characterization in the realizable setting. Experiments on PolyMNIST-Quadrant-Labels and a large-scale image-text-audio benchmark show MUNI matching or exceeding the strongest baselines on conditional generation while opening its largest margins on unconditional coherence.
MUNI treats any-to-any generation as a latent-content problem. The shared latent should contain enough information to make independently decoded modalities agree, but not modality-private detail that a stochastic decoder can model.
| Criterion | What the latent should do |
|---|---|
| Coherence | A prior sample should support a coherent joint image-text-audio tuple. |
| Predictivity | A subset posterior should retain the information needed to generate every missing modality. |
| Minimality | Private variation should stay out of the shared latent and remain stochastic in each decoder. |
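
The three criteria in the table can be written down in information-theoretic terms. The block below is only one possible formalization, offered as a reading aid: the notation ($x_{1:M}$ for the modalities, $S$ for the observed subset, $\bar S$ for its complement, $z_S \sim q(z \mid x_S)$ for the subset latent) is assumed here and is not taken from the source, whose exact definitions may differ.

```latex
% Illustrative notation only (assumed, not the authors' definitions).
\begin{align*}
\text{Coherence:}\quad
  & \int p(z) \prod_{m=1}^{M} p_\theta(x_m \mid z)\, dz
    \;\approx\; p_{\mathrm{data}}(x_{1:M}) \\
\text{Predictivity:}\quad
  & I\bigl(x_{\bar S};\, x_S \mid z_S\bigr) = 0 \\
\text{Minimality:}\quad
  & \min\; I\bigl(z_S;\, x_S\bigr)
    \quad \text{subject to predictivity}
\end{align*}
```

Read this way, coherence concerns prior samples decoded independently per modality, predictivity says the subset latent loses nothing about the missing modalities, and minimality pushes everything else out of the latent and into the stochastic decoders.
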
MUNI uses the same model for every conditioning subset. The routing choices are simple:
| Route | Purpose |
|---|---|
| Non-mixture aggregation | Fuse all observed modalities into one subset posterior. |
| Target-detached self-reconstruction | Train decoders without rewarding encoders for copying private detail. |
| Leave-one-out prior routing | Train the prior only on broad subset latents that can support coherent joint sampling. |
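
Two of the routes above can be illustrated with a minimal sketch. It makes two assumptions that the source does not confirm: the non-mixture aggregation is shown as a product of diagonal Gaussians (one common non-mixture rule), and "leave-one-out" is read as the subsets that drop exactly one modality. The function names and toy numbers are hypothetical.

```python
from itertools import combinations


def fuse_gaussians(experts):
    """Fuse diagonal-Gaussian experts with a product of Gaussians.

    experts: list of (mean, variance) pairs, each a list of floats with one
    entry per latent dimension. Assumption: this stands in for whatever
    non-mixture aggregation rule the model actually uses.
    """
    dim = len(experts[0][0])
    fused_mean, fused_var = [], []
    for d in range(dim):
        # Precisions add; the fused mean is the precision-weighted average.
        precision = sum(1.0 / var[d] for _, var in experts)
        weighted = sum(mean[d] / var[d] for mean, var in experts)
        fused_var.append(1.0 / precision)
        fused_mean.append(weighted / precision)
    return fused_mean, fused_var


def leave_one_out_subsets(modalities):
    """Return every subset that drops exactly one modality (assumed reading)."""
    return list(combinations(modalities, len(modalities) - 1))


# Toy one-dimensional unimodal posteriors (hypothetical values).
MODALITIES = ("image", "text", "audio")
EXPERTS = {
    "image": ([0.0], [1.0]),
    "text": ([2.0], [1.0]),
    "audio": ([4.0], [2.0]),
}

for subset in leave_one_out_subsets(MODALITIES):
    mean, var = fuse_gaussians([EXPERTS[m] for m in subset])
    print(subset, mean, var)
```

In this sketch each leave-one-out subset yields a single fused posterior rather than a mixture over unimodal posteriors, which is the property the first route in the table describes.
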
Between them, MUNI and MUNI† achieve the best generalist score on every many-to-one metric.
| Generalist | (I+A)→T CLIP ↑ | (I+A)→T CLAP ↑ | (T+A)→I CLIP ↑ | (T+A)→I AIS ↑ | (T+I)→A CLAP ↑ | (T+I)→A AIS ↑ |
|---|---|---|---|---|---|---|
| CoDi | 24.05 | 33.72 | 24.98 | 85.52 | 11.06 | 65.31 |
| OmniFlow | 24.73 | 36.26 | 26.41 | 81.51 | 13.50 | 63.55 |
| FlowBind | 27.54 | 39.56 | 25.23 | 86.44 | 26.83 | 80.15 |
| MUNI | 25.20 | 39.75 | 25.44 | 93.42 | 26.88 | 87.29 |
| MUNI† | 27.60 | 36.61 | 26.43 | 90.94 | 30.16 | 86.89 |
Only OmniFlow among the compared frontier generalists supports fully unconditional joint sampling. MUNI opens its largest margin in this setting.
| Method | T-I CLIP ↑ | T-A CLAP ↑ | A-I AIS ↑ |
|---|---|---|---|
| OmniFlow | 21.17 | 14.23 | 50.95 |
| MUNI | 26.76 | 23.73 | 81.24 |
| MUNI† | 26.74 | 24.83 | 82.80 |
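
The size of the unconditional margin can be checked directly from the table. The short script below copies the reported scores and computes absolute and relative gains over OmniFlow; the helper name `gains_over` is introduced here for illustration only.

```python
# Unconditional co-generation scores copied from the table above.
METRICS = ("T-I CLIP", "T-A CLAP", "A-I AIS")
SCORES = {
    "OmniFlow": (21.17, 14.23, 50.95),
    "MUNI": (26.76, 23.73, 81.24),
    "MUNI†": (26.74, 24.83, 82.80),
}


def gains_over(baseline, method):
    """Return {metric: (absolute gain, relative gain in percent)}."""
    result = {}
    for metric, base, ours in zip(METRICS, SCORES[baseline], SCORES[method]):
        absolute = ours - base
        result[metric] = (absolute, 100.0 * absolute / base)
    return result


for method in ("MUNI", "MUNI†"):
    for metric, (absolute, relative) in gains_over("OmniFlow", method).items():
        print(f"{method:6s} {metric:9s} +{absolute:5.2f} ({relative:4.1f}% relative)")
```

For MUNI the relative gains come out at roughly 26% on T-I CLIP, 67% on T-A CLAP, and 59% on A-I AIS.
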
Samples are grouped by the observed source modalities and generated target modality.
PolyMNIST-Quadrant-Labels is a controlled setting for checking whether jointly sampled modalities agree on digit identity and quadrant position. It is useful as a diagnostic, but this page's main emphasis is the image-text-audio results above.
| Method | Single-L→I Digit ↑ | Single-L→I Quadrant ↑ | Multi-L→I Both ↑ | Uncond. Coherence ↑ |
|---|---|---|---|---|
| MVAE | 0.4137 | 0.6803 | 0.7053 | 0.0079 |
| MMVAE | 0.9279 | 0.9999 | 0.1630 | 0.3167 |
| MoPoE | 0.9199 | 0.9999 | 0.9283 | 0.3943 |
| HELVAE | 0.9297 | 0.9999 | 0.7654 | 0.4246 |
| MUNI | 0.9131 | 0.9999 | 0.9346 | 0.4841 |
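
A coherence score of this kind can be sketched as the fraction of joint samples whose modalities all agree on both digit and quadrant. This section does not spell out the exact evaluation protocol, so the sketch assumes that each generated modality has already been labeled by a hypothetical per-modality classifier; the function name and the toy labels are illustrative.

```python
def unconditional_coherence(predictions):
    """Fraction of joint samples whose modalities all agree.

    predictions: list of samples; each sample is a list with one
    (digit, quadrant) label per generated modality, as produced by
    hypothetical per-modality classifiers (an assumption of this sketch).
    """
    if not predictions:
        raise ValueError("need at least one sample")
    # A sample is coherent when every modality carries the same label pair.
    coherent = sum(1 for sample in predictions if len(set(sample)) == 1)
    return coherent / len(predictions)


# Toy labels for four joint samples with three modalities each.
samples = [
    [(3, "top-left"), (3, "top-left"), (3, "top-left")],              # agree
    [(3, "top-left"), (8, "top-left"), (3, "top-left")],              # digit differs
    [(5, "bottom-right"), (5, "bottom-left"), (5, "bottom-right")],   # quadrant differs
    [(1, "top-right"), (1, "top-right"), (1, "top-right")],           # agree
]
print(unconditional_coherence(samples))
```

Requiring agreement on the full (digit, quadrant) pair makes the score strict: a sample with the right digit in the wrong quadrant counts as incoherent.
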
@misc{yeo2026muni,
  title = {MUNI: Multimodal Unified Latent Diffusion for Coherent Any-to-Any Generation},
  author = {Yeo, Kyeongmin and Min, Yunhong and Sung, Minhyuk},
  year = {2026},
  eprint = {TBA},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {TBA}
}