V-LITE reframes lighting estimation as guided video inpainting: insert a virtual chrome ball and let a video diffusion model reveal the scene's dynamic HDR illumination.
Comparison: Background | + Object · V-LITE
Modern video diffusion models render scenes with strikingly consistent illumination, which suggests they already carry an implicit model of light. V-LITE surfaces that knowledge without training a lighting predictor from scratch.
Borrowing a trick from visual-effects practice, we ask the model to inpaint a mirror probe into the scene. To fill that region convincingly, the model has to synthesize physically plausible reflections from the surrounding spatio-temporal context, which amounts to an estimate of the scene's environment lighting.
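The inpainting region described above can be pictured as a circular mask repeated over every frame. The following is a minimal sketch under stated assumptions, not V-LITE's actual pipeline: the function names, the fixed ball position, and the gray fill value are all hypothetical choices made for illustration.

```python
import numpy as np

def chrome_ball_mask(num_frames, height, width, center_xy, radius):
    """Boolean mask of shape (num_frames, height, width); True inside the ball.

    Assumption: the ball stays at a fixed pixel position in every frame.
    """
    cx, cy = center_xy
    ys, xs = np.mgrid[0:height, 0:width]
    disk = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    # Repeat the same disk for every frame of the clip.
    return np.broadcast_to(disk, (num_frames, height, width)).copy()

def mask_video(video, mask, fill=0.5):
    """Return a copy of `video` (T, H, W, 3) with the masked region set to `fill`.

    The blanked region is what an inpainting model would be asked to complete.
    """
    out = video.copy()
    out[mask] = fill
    return out
```

For example, `mask_video(video, chrome_ball_mask(T, H, W, (W // 2, H // 2), 64))` blanks a centered disk in each frame; a real system would hand the masked clip and the mask to the inpainting model.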
Lighting estimation is recast as inserting a virtual chrome ball, compelling the diffusion model to render plausible reflections from the scene's spatio-temporal context.
A log-domain, HDR-aware VAE preserves lighting cues, while efficient LoRA fine-tuning bridges LDR-native diffusion models to the HDR domain without full retraining.
A hybrid dataset combines 8K in-the-wild videos with dynamic temporal lighting and 800 HDR images with diverse luminance, providing dynamic context and realistic priors together.
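To illustrate why a log-domain representation helps with HDR data, the sketch below compresses linear radiance into a bounded range and inverts it exactly. The specific transform (`log1p` scaled by an assumed `max_value`) is an illustrative assumption; the page does not state which log mapping the HDR-aware VAE uses.

```python
import numpy as np

def hdr_to_log(hdr, max_value=1e4):
    """Map linear HDR radiance in [0, max_value] to [0, 1] via log compression.

    Assumption: max_value is a hypothetical upper bound on scene radiance.
    """
    hdr = np.clip(np.asarray(hdr, dtype=np.float64), 0.0, max_value)
    return np.log1p(hdr) / np.log1p(max_value)

def log_to_hdr(encoded, max_value=1e4):
    """Invert hdr_to_log, recovering linear radiance."""
    encoded = np.asarray(encoded, dtype=np.float64)
    return np.expm1(encoded * np.log1p(max_value))
```

Under this mapping, very bright light sources and dim ambient regions both land in a range that an LDR-native network can represent, rather than the bright values being clipped away.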
Three stages turn an LDR-native video model into an HDR lighting estimator.
Key insight. A video model asked to paint a mirror ball into a scene must synthesize reflections consistent with the surrounding light — so its output already encodes the scene's illumination. V-LITE simply reads it back out.
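Reading illumination back from a mirror ball relies on the standard reflection geometry: each pixel on the ball shows light arriving from one direction. The sketch below assumes an orthographic camera looking along -z and normalized ball coordinates; both are simplifying assumptions for illustration, and the function name is hypothetical rather than part of V-LITE.

```python
import numpy as np

def ball_pixel_to_direction(u, v):
    """Map a point on the ball's disk to the world direction it reflects.

    u, v: coordinates in [-1, 1] relative to the ball center (u right, v up).
    Assumption: orthographic camera looking along -z at a perfect mirror sphere.
    Returns a unit 3-vector, or None if (u, v) lies outside the ball.
    """
    r2 = u * u + v * v
    if r2 > 1.0:
        return None
    # Surface normal of the unit sphere at this pixel (facing the camera).
    normal = np.array([u, v, np.sqrt(1.0 - r2)])
    # Direction of the incoming camera ray.
    view = np.array([0.0, 0.0, -1.0])
    # Mirror reflection: r = d - 2 (d . n) n
    return view - 2.0 * np.dot(view, normal) * normal
```

Under these assumptions the ball's center reflects the direction back toward the camera, while its rim reflects the direction directly behind the ball, so a single ball covers nearly the full sphere of incoming light.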
Relighting a virtual object with V-LITE's estimated environment map yields reflections and shadows that sit naturally in the scene — where single-image estimators drift.
V-LITESet pairs in-the-wild HDR videos, which supply dynamic spatio-temporal context, with high-fidelity HDR images that anchor diverse, realistic luminance distributions.
@InProceedings{Cai_2026_ECCV_Lighting,
author = {Cai, Ziqi and Weng, Shuchen and Liu, Kaiqi and Wang, Zifeng and Zhang, Zhiquan and Teng, Minggui and Jiang, Han and Shi, Boxin},
title = {Video Generation Models Are Inherent Lighting Estimators},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2026},
}