V-LITE: Video Generation Models Are Inherent Lighting Estimators

The Idea

Reading the light a video model already understands.

Modern video diffusion models render scenes with strikingly consistent illumination — so they must already carry an internal model of light. V-LITE surfaces that knowledge without training a lighting predictor from scratch.

Borrowing a trick from visual-effects practice, we ask the model to inpaint a mirror probe into the scene. To fill that region convincingly, the model has to synthesize physically plausible reflections from the surrounding spatio-temporal context — which is exactly an estimate of the scene's environment lighting.

Figure: **The idea.** Given any in-the-wild LDR video (a), V-LITE inpaints a virtual light probe that captures the scene's illumination (b), yielding an HDR environment map for realistic, temporally consistent virtual object insertion (c).

Light-probe inpainting

Lighting estimation is recast as inserting a virtual chrome ball, compelling the diffusion model to render plausible reflections from the scene's spatio-temporal context.
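
As a rough illustration of the inpainting setup, the sketch below builds the circular probe mask that a video inpainting model would be asked to fill. The function names, the fixed probe position across frames, and the zero fill value are assumptions made for illustration; they are not V-LITE's actual interface.

```python
import numpy as np

def probe_mask(num_frames, height, width, center_xy, radius):
    """Return a (T, H, W) boolean mask that is True inside the circular probe.

    Assumption: the probe stays at the same position and size in every frame.
    """
    cx, cy = center_xy
    ys, xs = np.mgrid[0:height, 0:width]
    disk = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    # Repeat the single-frame disk along the time axis (copy makes it writable).
    return np.broadcast_to(disk, (num_frames, height, width)).copy()

def mask_video(video, mask, fill=0.0):
    """Blank out the probe region of a (T, H, W, C) video before inpainting."""
    masked = video.copy()
    masked[mask] = fill  # boolean mask over (T, H, W) selects whole pixels
    return masked

# Hypothetical usage: a 4-frame 64x64 RGB clip with a probe of radius 10.
video = np.ones((4, 64, 64, 3))
mask = probe_mask(4, 64, 64, center_xy=(32, 32), radius=10)
masked_video = mask_video(video, mask)
```

The masked clip and the mask together form the conditioning input for an inpainting model; the model itself is outside the scope of this sketch.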

HDR-aware VAE + LoRA

A log-domain, HDR-aware VAE preserves lighting cues, while efficient LoRA fine-tuning bridges LDR-native diffusion models to the HDR domain — no full retraining.
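
The page does not state the exact log mapping, so the following is only a minimal sketch of one common choice: a normalized `log1p` curve that compresses HDR radiance into [0, 1] and can be inverted exactly. The function names and the `max_value` ceiling are hypothetical.

```python
import numpy as np

def hdr_to_log(hdr, max_value=1000.0):
    """Compress linear HDR radiance into [0, 1] with a log curve.

    Assumption: radiance is clipped to [0, max_value]; max_value is illustrative.
    """
    hdr = np.clip(hdr, 0.0, max_value)
    return np.log1p(hdr) / np.log1p(max_value)

def log_to_hdr(encoded, max_value=1000.0):
    """Invert hdr_to_log, recovering linear radiance."""
    return np.expm1(encoded * np.log1p(max_value))

# Values span several orders of magnitude, as in a scene with a visible sun.
radiance = np.array([0.0, 0.5, 1.0, 50.0, 1000.0])
encoded = hdr_to_log(radiance)
restored = log_to_hdr(encoded)
```

A curve of this kind spends more of the encoded range on dark and mid-tone values while still representing very bright light sources, which is the usual motivation for working in the log domain.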

V-LITESet dataset

A hybrid set of 8K in-the-wild videos with dynamic temporal lighting and 800 HDR images with diverse luminance — realistic priors and dynamic context together.

Method

From a masked video to a dynamic HDR environment map.

Three stages turn an LDR-native video model into an HDR lighting estimator.

Figure: **Pipeline.** An HDR-aware VAE maps video to log-space latents; a LoRA-adapted flow-matching diffusion Transformer inpaints the masked probe region conditioned on the scene; the HDR decoder reconstructs a physically coherent chrome ball, which is unprojected into a dynamic HDR environment map.
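
The unprojection step at the end of the pipeline can be sketched with standard mirror-ball geometry. This is a minimal sketch, not the authors' code. It assumes an orthographic camera looking along -z, a y-up coordinate frame, a square ball crop, an equirectangular map whose center faces +z, and nearest-neighbor sampling. For a reflected direction r and view direction d, the ball normal is n = normalize(r - d), and the (x, y) components of n index the ball image.

```python
import numpy as np

def equirect_directions(height, width):
    """Unit direction for every pixel of an equirectangular map (y is up)."""
    theta = (np.arange(height) + 0.5) / height * np.pi          # polar angle from +y
    phi = (np.arange(width) + 0.5) / width * 2 * np.pi - np.pi  # azimuth, 0 faces +z
    phi, theta = np.meshgrid(phi, theta)
    return np.stack([np.sin(theta) * np.sin(phi),
                     np.cos(theta),
                     np.sin(theta) * np.cos(phi)], axis=-1)

def ball_to_envmap(ball, height=64, width=128):
    """Unproject a square (S, S, C) mirror-ball crop to an (H, W, C) map."""
    size = ball.shape[0]
    r = equirect_directions(height, width)
    view = np.array([0.0, 0.0, -1.0])  # assumed orthographic view direction
    # Mirror reflection r = d - 2 (d.n) n implies n is parallel to r - d.
    n = r - view
    n /= np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), 1e-8)
    cols = np.rint((0.5 + 0.5 * n[..., 0]) * (size - 1)).astype(int)
    rows = np.rint((0.5 - 0.5 * n[..., 1]) * (size - 1)).astype(int)  # image rows grow downward
    return ball[rows, cols]

# Hypothetical usage on one frame of a reconstructed chrome ball.
ball = np.random.default_rng(0).random((65, 65, 3))
envmap = ball_to_envmap(ball)
```

Applying the same mapping to each frame would give a time-varying environment map. Directions almost directly behind the ball map to its rim, where a single pixel covers a large solid angle, so that part of the map is the least reliable.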

Key insight. A video model asked to paint a mirror ball into a scene must synthesize reflections consistent with the surrounding light — so its output already encodes the scene's illumination. V-LITE simply reads it back out.

Results

Temporally coherent HDR lighting, in the wild.

Application · Object Insertion

Lighting that lets inserted objects belong.

Relighting a virtual object with V-LITE's estimated environment map yields reflections and shadows that sit naturally in the scene — where single-image estimators drift.
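
To make the role of the environment map concrete, the sketch below computes Lambertian (diffuse) irradiance for one surface normal by integrating an equirectangular map over the sphere. It is a generic image-based-lighting illustration under an assumed y-up equirectangular layout, not the renderer used for the results shown here; `diffuse_irradiance` is a hypothetical helper.

```python
import numpy as np

def diffuse_irradiance(envmap, normal):
    """Integrate radiance * max(0, n.w) over an (H, W, C) equirectangular map."""
    height, width, _ = envmap.shape
    theta = (np.arange(height) + 0.5) / height * np.pi          # polar angle from +y
    phi = (np.arange(width) + 0.5) / width * 2 * np.pi - np.pi  # azimuth
    phi, theta = np.meshgrid(phi, theta)
    dirs = np.stack([np.sin(theta) * np.sin(phi),
                     np.cos(theta),
                     np.sin(theta) * np.cos(phi)], axis=-1)
    # Each pixel covers a solid angle of sin(theta) * dtheta * dphi.
    solid_angle = np.sin(theta) * (np.pi / height) * (2 * np.pi / width)
    normal = np.asarray(normal, dtype=float)
    normal = normal / np.linalg.norm(normal)
    cosine = np.clip(dirs @ normal, 0.0, None)  # light from behind contributes nothing
    weights = (cosine * solid_angle)[..., None]
    return (envmap * weights).sum(axis=(0, 1))

# A uniform white environment gives irradiance close to pi for any normal.
uniform = np.ones((64, 128, 3))
irradiance_up = diffuse_irradiance(uniform, [0.0, 1.0, 0.0])
```

An HDR map matters here because bright sources such as the sun or lamps dominate this integral; clipping them to LDR range would flatten shading and weaken shadows.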

Figure: Seaside scene, before and after inserting a virtual object lit with V-LITE's estimated lighting.

Figure: Indoor scene, compared with prior lighting estimators.

V-LITESet

A hybrid HDR dataset for dynamic lighting.

V-LITESet pairs in-the-wild HDR videos, which supply dynamic spatio-temporal context, with high-fidelity HDR images that anchor diverse, realistic luminance distributions.

8K in-the-wild videos

800 HDR images

HDR log-domain

Dynamic temporal lighting

Figure: **V-LITESet samples.** Each row is one video sampled at different timestamps, showing the diversity of dynamic, real-world illumination.

Ablation & Limitation

Robust to probe placement — with honest failure modes.

Figure: **Zero-shot generalization.** Varying the virtual probe's position and size.

Figure: **Failure case.** Under extreme, out-of-distribution inputs, the model may occasionally produce a mismatched environment map.

Citation

BibTeX

@InProceedings{Cai_2026_ECCV_Lighting,
  author    = {Cai, Ziqi and Weng, Shuchen and Liu, Kaiqi and Wang, Zifeng and Zhang, Zhiquan and Teng, Minggui and Jiang, Han and Shi, Boxin},
  title     = {Video Generation Models Are Inherent Lighting Estimators},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026},
}