V-LITE · Inherent Lighting Estimators
ECCV 2026

Video Generation Models Are Inherent Lighting Estimators

V-LITE reframes lighting estimation as guided video inpainting — insert a virtual chrome ball, and let a video diffusion model reveal the scene's dynamic HDR illumination.

Ziqi Cai, Shuchen Weng*, Kaiqi Liu, Zifeng Wang, Zhiquan Zhang, Minggui Teng, Han Jiang, Boxin Shi*
Peking University · Beijing Academy of Artificial Intelligence · OpenBayes
* Corresponding authors
Figure: an indoor scene before and after inserting a virtual object lit with V-LITE's estimated lighting.
The Idea

Reading the light a video model already understands.

Modern video diffusion models render scenes with strikingly consistent illumination — so they must already carry an internal model of light. V-LITE surfaces that knowledge without training a lighting predictor from scratch.

Borrowing a trick from visual-effects practice, we ask the model to inpaint a mirror probe into the scene. To fill that region convincingly, the model has to synthesize physically plausible reflections from the surrounding spatio-temporal context — which is exactly an estimate of the scene's environment lighting.
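
To make the link between probe pixels and lighting concrete, here is a minimal sketch of standard mirror-ball geometry. It is not V-LITE's own code: it assumes an orthographic camera, a perfectly reflective unit sphere, and a hypothetical helper named `probe_pixel_to_direction`. Under those assumptions, each pixel inside the ball's silhouette shows light arriving from exactly one direction.

```python
import math

def probe_pixel_to_direction(u, v):
    """Map a point (u, v) in the unit disk of a mirror ball to the
    direction of the incoming light it reflects toward the camera.

    Assumptions (not from the paper): orthographic camera, the vector
    toward the camera is (0, 0, 1), and the ball is a perfect mirror.
    Returns None for points outside the ball's silhouette.
    """
    r2 = u * u + v * v
    if r2 > 1.0:
        return None
    # Surface normal of the unit sphere at this pixel.
    n = (u, v, math.sqrt(1.0 - r2))
    # Mirror reflection of the to-camera vector e = (0, 0, 1) about n:
    # d = 2 (n . e) n - e
    n_dot_e = n[2]
    return (2.0 * n_dot_e * n[0],
            2.0 * n_dot_e * n[1],
            2.0 * n_dot_e * n[2] - 1.0)
```

The center of the ball reflects what lies behind the camera, while the rim reflects what lies directly behind the ball, so a single probe covers nearly the full sphere of directions.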

V-LITE teaser: insert a chrome ball to estimate lighting and enable object insertion
The idea. Given any in-the-wild LDR video (a), V-LITE inpaints a virtual light probe that captures the scene's illumination (b) — yielding an HDR environment map for realistic, temporally consistent virtual object insertion (c).

Light-probe inpainting

Lighting estimation is recast as inserting a virtual chrome ball, compelling the diffusion model to render plausible reflections from the scene's spatio-temporal context.

HDR-aware VAE + LoRA

A log-domain, HDR-aware VAE preserves lighting cues, while efficient LoRA fine-tuning bridges LDR-native diffusion models to the HDR domain — no full retraining.
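
A minimal sketch of why a log-domain representation helps, assuming a simple `log1p`-style transform. The page does not specify V-LITE's exact mapping, so the names `hdr_to_log` and `log_to_hdr` and the `scale` parameter are illustrative only.

```python
import math

def hdr_to_log(radiance, scale=1.0):
    """Compress linear HDR radiance (>= 0) into a log-domain code."""
    return math.log1p(radiance / scale)

def log_to_hdr(code, scale=1.0):
    """Invert hdr_to_log, recovering linear radiance."""
    return math.expm1(code) * scale

# A dim wall, a bright window, and a direct light source span several
# orders of magnitude in linear radiance but only a few units in log space.
for radiance in (0.05, 1.0, 50.0, 5000.0):
    code = hdr_to_log(radiance)
    print(f"{radiance:>8.2f} -> {code:.3f} -> {log_to_hdr(code):.2f}")
```

Compressing the range this way keeps bright light sources from dominating a reconstruction loss while remaining exactly invertible.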

V-LITESet dataset

A hybrid set of 8K in-the-wild videos with dynamic temporal lighting and 800 HDR images with diverse luminance — realistic priors and dynamic context together.

Method

From a masked video to a dynamic HDR environment map.

Three stages turn an LDR-native video model into an HDR lighting estimator.

V-LITE pipeline: HDR-aware VAE, inherent lighting estimation via diffusion inpainting, and HDR reconstruction
Pipeline. An HDR-aware VAE maps video to log-space latents; a LoRA-adapted flow-matching diffusion Transformer inpaints the masked probe region conditioned on the scene; the HDR decoder reconstructs a physically coherent chrome ball, which is unprojected into a dynamic HDR environment map.
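
The final unprojection step can be sketched as an inverse lookup: for every pixel of an equirectangular environment map, find the chrome-ball pixel that reflects that direction. This is the textbook mirror-ball mapping under assumptions the page does not state (orthographic camera, y-up equirectangular layout, nearest-neighbor sampling). `direction_to_probe_uv` and `unproject_probe` are hypothetical names.

```python
import math

def direction_to_probe_uv(d):
    """Return (u, v) in the unit disk of the mirror ball that reflects
    unit direction d toward an orthographic camera, or None for the single
    direction straight behind the ball, which maps to the whole rim."""
    # The surface normal bisects d and the to-camera vector (0, 0, 1).
    nx, ny, nz = d[0], d[1], d[2] + 1.0
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    if norm < 1e-8:
        return None
    return nx / norm, ny / norm

def unproject_probe(ball, height, width):
    """Resample a square mirror-ball image (list of rows) into an
    equirectangular map of size height x width via nearest-neighbor lookup."""
    size = len(ball)
    env = [[0.0] * width for _ in range(height)]
    for row in range(height):
        theta = math.pi * (row + 0.5) / height          # polar angle from +y
        for col in range(width):
            phi = 2.0 * math.pi * (col + 0.5) / width   # azimuth around y
            d = (math.sin(theta) * math.sin(phi),
                 math.cos(theta),
                 math.sin(theta) * math.cos(phi))
            uv = direction_to_probe_uv(d)
            if uv is None:
                continue
            # Map u, v in [-1, 1] to pixel indices; image rows grow downward.
            x = min(size - 1, int((uv[0] + 1.0) * 0.5 * size))
            y = min(size - 1, int((1.0 - uv[1]) * 0.5 * size))
            env[row][col] = ball[y][x]
    return env
```

Applying `unproject_probe` to each decoded frame would give one environment map per timestamp. Directions near the rim are heavily compressed on the ball, so a production version would likely use filtered rather than nearest-neighbor sampling.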

Key insight. A video model asked to paint a mirror ball into a scene must synthesize reflections consistent with the surrounding light — so its output already encodes the scene's illumination. V-LITE simply reads it back out.

Results

Temporally coherent HDR lighting, in the wild.

Qualitative lighting-estimation results on in-the-wild videos
In-the-wild videos. Estimated chrome-ball reflections stay stable across time and scenes. For visualization, frames are tone-mapped from their original HDR format.
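
For readers reproducing such visualizations, here is a minimal sketch of one common choice: the global Reinhard operator followed by gamma encoding. The page does not say which tone-mapping operator the authors used, so the operator itself and the `exposure` and `gamma` parameters are assumptions.

```python
def tonemap_reinhard(radiance, exposure=1.0, gamma=2.2):
    """Map linear HDR radiance (>= 0) to a display value in [0, 1]."""
    scaled = radiance * exposure
    compressed = scaled / (1.0 + scaled)   # Reinhard: bounded, monotonic
    return compressed ** (1.0 / gamma)     # gamma-encode for display

def tonemap_frame(frame, exposure=1.0):
    """Apply the operator to every channel of every pixel.
    `frame` is a list of rows of (r, g, b) tuples in linear radiance."""
    return [[tuple(tonemap_reinhard(c, exposure) for c in pixel)
             for pixel in row]
            for row in frame]
```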
Application · Object Insertion

Lighting that lets inserted objects belong.

Relighting a virtual object with V-LITE's estimated environment map yields reflections and shadows that sit naturally in the scene, whereas single-image estimators tend to drift from frame to frame.
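
As an illustration of how an environment map lights an inserted object, the sketch below computes diffuse irradiance for a surface normal by summing an equirectangular map with solid-angle weights. It is a generic image-based-lighting computation rather than V-LITE's renderer. The y-up layout and the name `diffuse_irradiance` are assumptions.

```python
import math

def diffuse_irradiance(env, normal):
    """Integrate L(w) * max(0, n . w) over the sphere.

    `env` is an equirectangular map as a list of rows of scalar radiance;
    row 0 is nearest the +y pole. `normal` is a unit 3-vector.
    """
    height, width = len(env), len(env[0])
    d_theta = math.pi / height
    d_phi = 2.0 * math.pi / width
    total = 0.0
    for row in range(height):
        theta = (row + 0.5) * d_theta
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        # Pixels near the poles cover less solid angle:
        # dw = sin(theta) * dtheta * dphi.
        d_omega = sin_t * d_theta * d_phi
        for col in range(width):
            phi = (col + 0.5) * d_phi
            w = (sin_t * math.sin(phi), cos_t, sin_t * math.cos(phi))
            cosine = normal[0] * w[0] + normal[1] * w[1] + normal[2] * w[2]
            if cosine > 0.0:
                total += env[row][col] * cosine * d_omega
    return total
```

Under a uniform environment of radiance 1 the integral equals π for any normal, which is a handy sanity check for the solid-angle weighting.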

Figure: a seaside scene before and after inserting a virtual object lit with V-LITE's estimated lighting.
Indoor scene, compared with prior lighting estimators. Panels: Background, DiffusionLight, DiffusionLight-Turbo, StyleLight, V-LITE (Ours).
V-LITESet

A hybrid HDR dataset for dynamic lighting.

V-LITESet pairs in-the-wild HDR videos, which supply dynamic spatio-temporal context, with high-fidelity HDR images that anchor diverse, realistic luminance distributions.

8K in-the-wild videos
800 HDR images
HDR log-domain
Dynamic temporal lighting
Samples from the V-LITESet dataset
V-LITESet samples. Each row is one video sampled at different timestamps, showing the diversity of dynamic, real-world illumination.
Ablation & Limitation

Robust to probe placement — with honest failure modes.

Zero-shot generalization by varying probe position and size
Zero-shot generalization. Varying the virtual probe's position and size.
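
A minimal sketch of how such a probe region might be specified: a binary circular mask whose center and radius can be varied per test. The normalized-coordinate convention and the name `probe_mask` are illustrative assumptions, not the authors' interface.

```python
def probe_mask(height, width, center=(0.5, 0.5), radius=0.15):
    """Return a height x width mask (list of rows) with 1 inside the
    circular probe region and 0 elsewhere.

    `center` is (x, y) in normalized [0, 1] image coordinates and
    `radius` is a fraction of the shorter image side.
    """
    cx, cy = center[0] * width, center[1] * height
    r = radius * min(height, width)
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Test the pixel center against the circle.
            dx, dy = x + 0.5 - cx, y + 0.5 - cy
            row.append(1 if dx * dx + dy * dy <= r * r else 0)
        mask.append(row)
    return mask

# Sweep position and size, as in a placement-robustness test.
for center, radius in [((0.5, 0.5), 0.15), ((0.25, 0.7), 0.1), ((0.8, 0.3), 0.25)]:
    mask = probe_mask(90, 160, center, radius)
    covered = sum(map(sum, mask)) / (90 * 160)
    print(center, radius, f"{covered:.3f}")
```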
Failure case with extreme out-of-distribution input
Failure case. Under extreme, out-of-distribution inputs, the model may occasionally produce an environment map that does not match the scene.
Citation

BibTeX

@InProceedings{Cai_2026_ECCV_Lighting,
  author    = {Cai, Ziqi and Weng, Shuchen and Liu, Kaiqi and Wang, Zifeng and Zhang, Zhiquan and Teng, Minggui and Jiang, Han and Shi, Boxin},
  title     = {Video Generation Models Are Inherent Lighting Estimators},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026},
}