MUNI: Multimodal Unified Latent Diffusion for Coherent Any-to-Any Generation

arXiv preprint
*Equal contribution.

MUNI jointly learns modality-specific encoders, expressive decoders, and a single shared flow-based prior so one model can generate any missing modality from any observed subset, including the fully unconditional case.

Unconditional image-text-audio co-generation: each triplet is sampled jointly from MUNI’s learned prior.

Abstract

We introduce MUNI, an end-to-end multimodal latent diffusion framework for any-to-any generation that unifies subset-conditioned cross-modal generation and unconditional joint sampling through a shared stochastic latent. Existing multimodal generative models are largely LLM-based, which limits their ability to leverage modality-specific generators and requires text-paired data for training. Recent diffusion- and flow-based any-to-any extensions take a different direction but still rely on text-aligned embeddings, fully paired training, or matched-dimensionality deterministic mappings. MUNI rests on two complementary contributions, one architectural and one in the training objective. First, we extend latent diffusion to multimodal any-to-any generation end-to-end: instead of the standard two-stage recipe that precomputes a frozen latent space and then fits a prior over it, MUNI jointly trains modality-specific encoders, expressive decoders, and a single shared flow-based prior under one objective. Second, we identify that the standard aggregation rules of multimodal variational inference are insufficient once coupled with a learned prior and expressive decoders. A suitable shared latent must simultaneously satisfy coherence across generated modalities, predictive sufficiency of subset latents, and minimality of the latent content. We propose a routed training objective whose structural choices align the latent with these criteria and admit a minimal-sufficiency characterization in the realizable setting. Experiments on PolyMNIST-Quadrant-Labels and a large-scale image-text-audio benchmark show MUNI matching or exceeding the strongest baselines on conditional generation while opening its largest margins on unconditional coherence.

Core Idea

MUNI treats any-to-any generation as a latent-content problem. The shared latent should contain enough information to make independently decoded modalities agree, but not modality-private detail that a stochastic decoder can model.

| Criterion | What the latent should do |
| --- | --- |
| Coherence | A prior sample should support a coherent joint image-text-audio tuple. |
| Predictivity | A subset posterior should retain the information needed to generate every missing modality. |
| Minimality | Private variation should stay out of the shared latent and remain stochastic in each decoder. |

[Figure: MUNI framework and latent-content overview]
MUNI combines modality-specific encoders and decoders with a shared learned prior, then controls what enters the latent through routed training.

Routed Training

MUNI uses the same model for every conditioning subset. The routing choices are simple:

| Route | Purpose |
| --- | --- |
| Non-mixture aggregation | Fuse all observed modalities into one subset posterior. |
| Target-detached self-reconstruction | Train decoders without rewarding encoders for copying private detail. |
| Leave-one-out prior routing | Train the prior only on broad subset latents that can support coherent joint sampling. |
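
Two of the routes above can be illustrated with a small sketch. The page does not state which aggregation rule MUNI uses, so the code assumes a Gaussian product of experts as one common example of a non-mixture rule, and it assumes that "leave-one-out" means subsets that drop exactly one modality. The function names and the toy posteriors are hypothetical. Target-detached self-reconstruction depends on stop-gradient in an autodiff framework and is not shown.

```python
from itertools import combinations


def poe_fuse(experts):
    """Fuse 1-D Gaussian experts given as (mean, variance) pairs.

    Assumption: product-of-experts fusion, where precisions add and the
    fused mean is the precision-weighted average of the expert means.
    """
    precision = sum(1.0 / var for _, var in experts)
    mean = sum(mu / var for mu, var in experts) / precision
    return mean, 1.0 / precision


def leave_one_out_subsets(modalities):
    """Return every subset that drops exactly one modality (assumed reading)."""
    keep = len(modalities) - 1
    return [frozenset(c) for c in combinations(modalities, keep)]


# Hypothetical per-modality posteriors over a scalar latent: (mean, variance).
posteriors = {"image": (0.0, 1.0), "text": (2.0, 1.0), "audio": (1.0, 0.5)}

for subset in leave_one_out_subsets(sorted(posteriors)):
    names = sorted(subset)
    fused_mean, fused_var = poe_fuse([posteriors[m] for m in names])
    print(names, fused_mean, fused_var)
```

Under this reading, the fused posterior is a single Gaussian rather than a mixture over modalities, and the prior would only ever see latents fused from two of the three modalities.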

Main Results: Image-Text-Audio

Many-to-One Alignment

Between them, MUNI and MUNI† achieve the best generalist result on every many-to-one metric: each metric's top score belongs to one of the two variants.

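| Generalist | (I+A)→T CLIP ↑ | (I+A)→T CLAP ↑ | (T+A)→I CLIP ↑ | (T+A)→I AIS ↑ | (T+I)→A CLAP ↑ | (T+I)→A AIS ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| CoDi | 24.05 | 33.72 | 24.98 | 85.52 | 11.06 | 65.31 |
| OmniFlow | 24.73 | 36.26 | 26.41 | 81.51 | 13.50 | 63.55 |
| FlowBind | 27.54 | 39.56 | 25.23 | 86.44 | 26.83 | 80.15 |
| MUNI | 25.20 | 39.75 | 25.44 | 93.42 | 26.88 | 87.29 |
| MUNI† | 27.60 | 36.61 | 26.43 | 90.94 | 30.16 | 86.89 |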
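
As a quick check on the table, the following sketch finds the top-scoring method for each metric. The numbers are copied from the table above; the variable and function names are illustrative only.

```python
# Many-to-one scores copied from the table above (higher is better).
metrics = [
    "(I+A)→T CLIP", "(I+A)→T CLAP",
    "(T+A)→I CLIP", "(T+A)→I AIS",
    "(T+I)→A CLAP", "(T+I)→A AIS",
]
scores = {
    "CoDi":     [24.05, 33.72, 24.98, 85.52, 11.06, 65.31],
    "OmniFlow": [24.73, 36.26, 26.41, 81.51, 13.50, 63.55],
    "FlowBind": [27.54, 39.56, 25.23, 86.44, 26.83, 80.15],
    "MUNI":     [25.20, 39.75, 25.44, 93.42, 26.88, 87.29],
    "MUNI†":    [27.60, 36.61, 26.43, 90.94, 30.16, 86.89],
}


def best_per_metric(scores, metrics):
    """Map each metric name to the method with the highest score on it."""
    best = {}
    for j, name in enumerate(metrics):
        best[name] = max(scores, key=lambda method: scores[method][j])
    return best


winners = best_per_metric(scores, metrics)
for name, method in winners.items():
    print(f"{name}: {method}")
```

Every winner is either MUNI or MUNI†, although neither variant alone leads on all six metrics.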

Unconditional Coherence

Only OmniFlow among the compared frontier generalists supports fully unconditional joint sampling. MUNI opens its largest margin in this setting.

| Method | T-I CLIP ↑ | T-A CLAP ↑ | A-I AIS ↑ |
| --- | --- | --- | --- |
| OmniFlow | 21.17 | 14.23 | 50.95 |
| MUNI | 26.76 | 23.73 | 81.24 |
| MUNI† | 26.74 | 24.83 | 82.80 |
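
The size of the margin can be read directly from the table. The sketch below computes the absolute gain of each MUNI variant over OmniFlow on the three unconditional metrics; the numbers are copied from the table above and the helper name is illustrative.

```python
# Unconditional coherence scores copied from the table above (higher is better).
metric_names = ["T-I CLIP", "T-A CLAP", "A-I AIS"]
uncond = {
    "OmniFlow": [21.17, 14.23, 50.95],
    "MUNI":     [26.76, 23.73, 81.24],
    "MUNI†":    [26.74, 24.83, 82.80],
}


def gains_over(baseline, method, table):
    """Absolute per-metric gain of `method` over `baseline`, to 2 decimals."""
    return [round(m - b, 2) for m, b in zip(table[method], table[baseline])]


for method in ("MUNI", "MUNI†"):
    gains = gains_over("OmniFlow", method, uncond)
    print(method, dict(zip(metric_names, gains)))
```

The gains are roughly 5.6 points on T-I CLIP, 9.5 to 10.6 points on T-A CLAP, and 30.3 to 31.9 points on A-I AIS.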

Qualitative Any-to-Any Samples

Samples are grouped by the observed source modalities and generated target modality.

Controlled Benchmark: PolyMNIST

PolyMNIST-Quadrant-Labels is a controlled setting for checking whether jointly sampled modalities agree on digit identity and quadrant position. It is useful diagnostically, but the main webpage emphasis is the image-text-audio result above.

| Method | Single-L→I Digit ↑ | Single-L→I Quadrant ↑ | Multi-L→I Both ↑ | Uncond. Coherence ↑ |
| --- | --- | --- | --- | --- |
| MVAE | 0.4137 | 0.6803 | 0.7053 | 0.0079 |
| MMVAE | 0.9279 | 0.9999 | 0.1630 | 0.3167 |
| MoPoE | 0.9199 | 0.9999 | 0.9283 | 0.3943 |
| HELVAE | 0.9297 | 0.9999 | 0.7654 | 0.4246 |
| MUNI | 0.9131 | 0.9999 | 0.9346 | 0.4841 |

[Figure: PolyMNIST unconditional samples from MVAE, MMVAE, MoPoE, HELVAE, and MUNI]
Unconditional PolyMNIST co-generation. Coherent samples preserve digit identity and quadrant position across generated image and label modalities.
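
One plausible way to score this kind of agreement is sketched below. The page does not define the coherence metric, so the code assumes that a pretrained classifier reads a (digit, quadrant) pair off each generated modality and that a joint sample counts as coherent only when all modalities agree on both. The function name and the toy predictions are hypothetical.

```python
def unconditional_coherence(predictions):
    """Fraction of joint samples whose modalities all agree.

    `predictions` is a list of samples. Each sample maps a modality name to
    the (digit, quadrant) pair that a classifier reads off that modality.
    Assumption: a sample is coherent only if every modality yields the
    same pair.
    """
    if not predictions:
        return 0.0
    agree = sum(1 for sample in predictions if len(set(sample.values())) == 1)
    return agree / len(predictions)


# Hypothetical classifier outputs for four jointly generated samples.
samples = [
    {"image": (3, "top-left"), "label": (3, "top-left")},          # coherent
    {"image": (3, "top-left"), "label": (8, "top-left")},          # digit differs
    {"image": (5, "bottom-right"), "label": (5, "bottom-right")},  # coherent
    {"image": (1, "top-right"), "label": (1, "bottom-left")},      # quadrant differs
]

print(unconditional_coherence(samples))  # 0.5
```

Requiring agreement on both digit and quadrant makes the score stricter than either attribute alone, which is consistent with the low absolute values in the unconditional column.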

Citation

@misc{yeo2026muni,
title = {MUNI: Multimodal Unified Latent Diffusion for Coherent Any-to-Any Generation},
author = {Yeo, Kyeongmin and Min, Yunhong and Sung, Minhyuk},
year = {2026},
eprint = {TBA},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {TBA}
}