MUNI: Multimodal Unified Latent Diffusion for Coherent Any-to-Any Generation

Kyeongmin Yeo ^*

KAIST

Yunhong Min ^*

KAIST

Minhyuk Sung

KAIST

arXiv preprint

^*Equal contribution.

MUNI jointly learns modality-specific encoders, expressive decoders, and a single shared flow-based prior so one model can generate any missing modality from any observed subset, including the fully unconditional case.

Joint outputs (each sample pairs the generated text below with a generated audio clip and a generated image; media not reproduced here)

Sample	Generated text
1	“A group of people with bicycles in the grass.”
2	“The beach is empty and clear with palm trees.”
3	“A man with an acoustic guitar and shirt.”
4	“The room has blue curtains and red furniture.”

Unconditional image-text-audio co-generation: each triplet is sampled jointly from MUNI’s learned prior.

Abstract

We introduce MUNI, an end-to-end multimodal latent diffusion framework for any-to-any generation that unifies subset-conditioned cross-modal generation and unconditional joint sampling through a shared stochastic latent. Existing multimodal generative models are largely LLM-based, which limits leveraging modality-specific generators and requires text-paired data for training. Recent diffusion- and flow-based any-to-any extensions take a different direction but still rely on text-aligned embeddings, fully-paired training, or matched-dimensionality deterministic mappings. MUNI rests on two complementary contributions, one architectural and one in the training objective. First, we extend latent diffusion to multimodal any-to-any generation end-to-end: instead of the standard two-stage recipe that precomputes a frozen latent space and then fits a prior over it, MUNI jointly trains modality-specific encoders, expressive decoders, and a single shared flow-based prior under one objective. Second, we identify that the standard aggregation rules of multimodal variational inference are insufficient once coupled with a learned prior and expressive decoders. A suitable shared latent must simultaneously satisfy coherence across generated modalities, predictive sufficiency of subset latents, and minimality of the latent content. We propose a routed training objective whose structural choices align the latent with these criteria and admit a minimal-sufficiency characterization in the realizable setting. Experiments on PolyMNIST-Quadrant-Labels and a large-scale image-text-audio benchmark show MUNI matching or exceeding the strongest baselines on conditional generation while opening its largest margins on unconditional coherence.

Core Idea

MUNI treats any-to-any generation as a latent-content problem. The shared latent should contain enough information to make independently decoded modalities agree, but not modality-private detail that a stochastic decoder can model.

Criterion	What the latent should do
Coherence	A prior sample should support a coherent joint image-text-audio tuple.
Predictivity	A subset posterior should retain the information needed to generate every missing modality.
Minimality	Private variation should stay out of the shared latent and remain stochastic in each decoder.

Figure: MUNI framework and latent-content overview. MUNI combines modality-specific encoders and decoders with a shared learned prior, then controls what enters the latent through routed training.

Routed Training

MUNI uses the same model for every conditioning subset. The routing choices are simple:

Route	Purpose
Non-mixture aggregation	Fuse all observed modalities into one subset posterior.
Target-detached self-reconstruction	Train decoders without rewarding encoders for copying private detail.
Leave-one-out prior routing	Train the prior only on broad subset latents that can support coherent joint sampling.
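
The first and third routes can be illustrated with a small framework-free sketch. It rests on assumptions that this page does not confirm: unimodal posteriors are one-dimensional Gaussians, the non-mixture rule is a product of experts (one common choice among several), and "broad" subsets are read as the subsets that drop exactly one modality. The names `fuse_gaussians` and `leave_one_out_subsets` are hypothetical. Target-detached self-reconstruction needs a stop-gradient in an autodiff framework, so it appears only as a comment.

```python
from itertools import combinations


def fuse_gaussians(experts):
    """Product of 1-D Gaussian experts given as (mean, variance) pairs.

    Assumption: product-of-experts fusion. Precisions add, and the fused
    mean is the precision-weighted average of the expert means.
    """
    precision = sum(1.0 / var for _, var in experts)
    mean = sum(mu / var for mu, var in experts) / precision
    return mean, 1.0 / precision


def leave_one_out_subsets(modalities):
    """Every subset that drops exactly one modality (assumed reading of 'broad')."""
    keep = len(modalities) - 1
    return [set(combo) for combo in combinations(modalities, keep)]


# Hypothetical unimodal posteriors over a single latent coordinate.
posteriors = {"image": (0.0, 1.0), "text": (2.0, 1.0), "audio": (4.0, 2.0)}

for subset in leave_one_out_subsets(sorted(posteriors)):
    mean, var = fuse_gaussians([posteriors[m] for m in sorted(subset)])
    # In training, a latent drawn from this fused posterior would be the
    # prior's target. For self-reconstruction, the decoder input would be
    # detached so the encoder gets no gradient for copying private detail.
    print(sorted(subset), round(mean, 3), round(var, 3))
```

Because a product of experts yields a single unimodal posterior per subset, it matches the "one subset posterior" wording above, whereas a mixture would keep one component per modality.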

Main Results: Image-Text-Audio

Many-to-One Alignment

For every many-to-one metric, the best generalist score belongs to either MUNI or MUNI†.

Generalist	(I+A)→T CLIP ↑	(I+A)→T CLAP ↑	(T+A)→I CLIP ↑	(T+A)→I AIS ↑	(T+I)→A CLAP ↑	(T+I)→A AIS ↑
CoDi	24.05	33.72	24.98	85.52	11.06	65.31
OmniFlow	24.73	36.26	26.41	81.51	13.50	63.55
FlowBind	27.54	39.56	25.23	86.44	26.83	80.15
MUNI	25.20	39.75	25.44	93.42	26.88	87.29
MUNI†	27.60	36.61	26.43	90.94	30.16	86.89
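
The table can be checked with a few lines of Python. The scores are copied from the rows above; the helper `best_per_metric` and the grouping of CoDi, OmniFlow, and FlowBind as `BASELINES` are illustrative names, not part of the MUNI code.

```python
# Scores copied from the many-to-one table (higher is better).
METRICS = ["(I+A)→T CLIP", "(I+A)→T CLAP", "(T+A)→I CLIP",
           "(T+A)→I AIS", "(T+I)→A CLAP", "(T+I)→A AIS"]
SCORES = {
    "CoDi":     [24.05, 33.72, 24.98, 85.52, 11.06, 65.31],
    "OmniFlow": [24.73, 36.26, 26.41, 81.51, 13.50, 63.55],
    "FlowBind": [27.54, 39.56, 25.23, 86.44, 26.83, 80.15],
    "MUNI":     [25.20, 39.75, 25.44, 93.42, 26.88, 87.29],
    "MUNI†":    [27.60, 36.61, 26.43, 90.94, 30.16, 86.89],
}
BASELINES = ("CoDi", "OmniFlow", "FlowBind")


def best_per_metric(table, names):
    """Map each metric to (winning method, margin over the strongest baseline)."""
    result = {}
    for i, name in enumerate(names):
        winner = max(table, key=lambda method: table[method][i])
        best_baseline = max(table[b][i] for b in BASELINES)
        result[name] = (winner, round(table[winner][i] - best_baseline, 2))
    return result


winners = best_per_metric(SCORES, METRICS)
for name, (method, margin) in winners.items():
    print(f"{name}: {method} (+{margin} over best baseline)")
```

The margins differ widely by metric: the two CLIP columns are near ties with the strongest baseline (0.06 and 0.02 points), while the two AIS columns lead by about 7 points.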

Unconditional Coherence

Only OmniFlow among the compared frontier generalists supports fully unconditional joint sampling. MUNI opens its largest margin in this setting.

Method	T-I CLIP ↑	T-A CLAP ↑	A-I AIS ↑
OmniFlow	21.17	14.23	50.95
MUNI	26.76	23.73	81.24
MUNI†	26.74	24.83	82.80
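
To make the size of the unconditional margin concrete, the short script below computes absolute and relative gains over OmniFlow from the table values. The function name `gains` is illustrative only.

```python
# Scores copied from the unconditional coherence table (higher is better).
OMNIFLOW = {"T-I CLIP": 21.17, "T-A CLAP": 14.23, "A-I AIS": 50.95}
MUNI = {"T-I CLIP": 26.76, "T-A CLAP": 23.73, "A-I AIS": 81.24}
MUNI_DAGGER = {"T-I CLIP": 26.74, "T-A CLAP": 24.83, "A-I AIS": 82.80}


def gains(ours, baseline):
    """Return {metric: (absolute gain in points, relative gain in percent)}."""
    out = {}
    for metric, base in baseline.items():
        diff = ours[metric] - base
        out[metric] = (round(diff, 2), round(100.0 * diff / base, 1))
    return out


for label, ours in (("MUNI", MUNI), ("MUNI†", MUNI_DAGGER)):
    for metric, (points, percent) in gains(ours, OMNIFLOW).items():
        print(f"{label} {metric}: +{points} points ({percent}% relative)")
```

For MUNI the absolute gains are 5.59 (T-I CLIP), 9.50 (T-A CLAP), and 30.29 (A-I AIS) points, so the audio-image pairing benefits most.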

Qualitative Any-to-Any Samples

Samples are grouped by the observed source modalities and generated target modality.

Text targets generated by each method (source image and audio not reproduced here)

Sample	MUNI (Ours)	CoDi	OmniFlow	FlowBind
1	“Fireworks burst in the sky and then a burst.”	“Jump happening in a fire over a falling lake and blue rainbow.”	“Rain falls and a pop occurs.”	“Multiple popping and eruption.”
2	“A man walking next to a small car.”	“A man stops for a train about to go on a car being driven by a man.”	“A man sits in a truck, working on an engine.”	“The vehicle motor is driving down the road.”
3	“A man playing an acoustic guitar and another person sitting on the bench.”	“A man is speaking to a person in the city talking while walking for the man.”	“A man talking with a crowd talking in the background.”	“A man sitting at a park bench while music is playing.”
4	“A helicopter is flying through the air.”	“Airplane.”	“A helicopter hovering in the distance, with.”	“A helicopter flying over with muffled helicopter.”
5	“A man in sunglasses is driving on the water.”	“Man driving in a blue car race to turn mid-way on his car engine, is speeding.”	“A man is working on a boat in the water.”	“A man motor is on a boat in the water boating.”
6	“A man is talking to some people and there is a dog.”	“A man is jaying after a dog says to his dog greeting. as he walks on the dog.”	“A man talking in a movie, with a blurry background.”	“A man speaking with someone.”
7	“A police car is going by with sirens.”	“Red traffic car for a road red.”	“A darkened city street with a siren blaring in the distance,.”	“A police siren is signaling while sirens are.”

Text sources with non-text targets (outputs from MUNI (Ours), CoDi, OmniFlow, and FlowBind not reproduced here)

Sample	Source text
1	“pigeons coo and flap their wings”
2	“a telephone ringing”
3	“water makes gurgling sound”
4	“a person speaking and various laughter and clapping”
5	“A clock ticks repeatedly”
6	“a male speaking over a microphone”

Image sources with non-text targets: six samples, each comparing MUNI (Ours) with CoDi, OmniFlow, and FlowBind (media not reproduced here).

Text and image sources with non-text targets (source images and outputs from MUNI (Ours), CoDi, OmniFlow, and FlowBind not reproduced here)

Sample	Source text
1	“A man playing acoustic guitar on a wooden stage.”
2	“It's a rainy day.”
3	“Someone is typing.”
4	“A crowd is cheering.”
5	“A machine is running.”
6	“An engine is running.”

Controlled Benchmark: PolyMNIST

PolyMNIST-Quadrant-Labels is a controlled setting for checking whether jointly sampled modalities agree on digit identity and quadrant position. It is useful as a diagnostic, but this page focuses on the image-text-audio results above.

Method	Single-L→I Digit ↑	Single-L→I Quadrant ↑	Multi-L→I Both ↑	Uncond. Coherence ↑
MVAE	0.4137	0.6803	0.7053	0.0079
MMVAE	0.9279	0.9999	0.1630	0.3167
MoPoE	0.9199	0.9999	0.9283	0.3943
HELVAE	0.9297	0.9999	0.7654	0.4246
MUNI	0.9131	0.9999	0.9346	0.4841
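
One plausible way to score unconditional coherence is the fraction of joint samples whose modalities all agree on both digit and quadrant. The sketch below assumes that definition and assumes per-modality predictions come from pretrained classifiers; neither detail is stated on this page, and `unconditional_coherence` is a hypothetical helper.

```python
def unconditional_coherence(predictions):
    """Fraction of joint samples whose modalities all agree.

    `predictions` holds one entry per joint sample. Each entry is a list of
    (digit, quadrant) pairs, one pair per generated modality, as produced by
    hypothetical pretrained classifiers.
    """
    if not predictions:
        raise ValueError("need at least one joint sample")
    # A sample is coherent when all of its (digit, quadrant) pairs are identical.
    agree = sum(1 for sample in predictions if len(set(sample)) == 1)
    return agree / len(predictions)


# Toy predictions for four joint samples with three modalities each.
samples = [
    [(3, "top-left"), (3, "top-left"), (3, "top-left")],            # coherent
    [(3, "top-left"), (8, "top-left"), (3, "top-left")],            # digit mismatch
    [(5, "bottom-right"), (5, "top-right"), (5, "bottom-right")],   # quadrant mismatch
    [(1, "top-right"), (1, "top-right"), (1, "top-right")],         # coherent
]
print(unconditional_coherence(samples))
```

Under this strict all-agree definition, a single disagreeing modality makes the whole sample incoherent, which is why scores on this column can be far lower than the conditional accuracies.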

Figures: PolyMNIST unconditional samples from MVAE and from MMVAE. Unconditional PolyMNIST co-generation: coherent samples preserve digit identity and quadrant position across generated image and label modalities.

Citation

@misc{yeo2026muni,
  title         = {MUNI: Multimodal Unified Latent Diffusion for Coherent Any-to-Any Generation},
  author        = {Yeo, Kyeongmin and Min, Yunhong and Sung, Minhyuk},
  year          = {2026},
  eprint        = {TBA},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {TBA}
}
