PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction

Chenggong Li1,2, Yidong Luo3,4, Junchao Zhang1,2,†, Boxin Shi5,6, and Degui Yang1,2,†

1School of Automation, Central South University

2Hunan Provincial Key Laboratory of Optic-Electronic Intelligent Measurement and Control

3Zhejiang University

4School of Engineering, Westlake University

5State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

6National Engineering Research Center of Visual Technology, School of Computer Science, Peking University

†Co-corresponding authors

Abstract

Polarimetric imaging aims to acquire surface polarization characteristics, such as the Degree of Linear Polarization (DoLP) and the Angle of Polarization (AoP). In mainstream Division-of-Focal-Plane (DoFP) color polarization imaging, reconstructing polarization parameters from the captured mosaic arrays remains a challenging inverse problem. Existing DoFP cameras are further limited by hardware bottlenecks and often cannot provide high-frame-rate acquisition, which restricts the use of polarimetric imaging in dynamic video tasks. These limitations motivate the joint enhancement of spatial and temporal resolution. To this end, we propose the first space-time polarization video reconstruction architecture. The proposed method performs unified spatiotemporal modeling of the polarization directions and uses a polarization-aware implicit neural representation to achieve continuous, high-fidelity upsampling. By analyzing temporal variations in polarization parameters, we further introduce a flow-guided polarization variation loss to supervise polarization dynamics. In addition, we establish the first large-scale color DoFP polarization video benchmark to support this research direction. Extensive experiments on the proposed benchmark demonstrate the effectiveness of the proposed method.
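The DoLP and AoP mentioned above follow from the standard Stokes-vector relations for four linear-polarizer measurements. As background (these are textbook formulas, not PolarVSR-specific code), they can be sketched as:

```python
import numpy as np

def dolp_aop(i0, i45, i90, i135):
    """Compute DoLP and AoP from intensities behind 0/45/90/135-degree polarizers.

    Standard linear Stokes relations:
      S0 = (I0 + I45 + I90 + I135) / 2
      S1 = I0 - I90
      S2 = I45 - I135
    """
    s0 = (i0 + i45 + i90 + i135) / 2.0
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-8)  # guard divide-by-zero
    aop = 0.5 * np.arctan2(s2, s1)                        # radians in (-pi/2, pi/2]
    return dolp, aop

# Fully polarized light aligned with 0 degrees: I0=1, I90=0, I45=I135=0.5
dolp, aop = dolp_aop(1.0, 0.5, 0.0, 0.5)  # -> DoLP = 1.0, AoP = 0.0
```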

Overview

The overall pipeline of the proposed method is shown in the overview figure. Following standard continuous STVSR frameworks, PolarVSR synthesizes a high-resolution frame at an arbitrary time from a pair of adjacent mosaic arrays. Unlike standard STVSR, the proposed method jointly models all four polarization directions by concatenating them channel-wise for unified feature processing, allowing the network to learn intrinsic cross-direction dependencies under degradation.
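The channel-wise treatment of the four directions can be illustrated with a minimal sketch that splits a DoFP mosaic into per-direction channels and stacks them. The 2×2 super-pixel layout assumed here ([[90°, 45°], [135°, 0°]], a common convention for DoFP sensors) is an assumption for illustration; the actual sensor layout may differ:

```python
import numpy as np

def mosaic_to_channels(mosaic):
    """Split a DoFP mosaic of shape (H, W) into four direction channels (4, H/2, W/2).

    Assumed 2x2 super-pixel layout (illustrative only):
        [[90, 45],
         [135, 0]]  degrees
    """
    i90  = mosaic[0::2, 0::2]
    i45  = mosaic[0::2, 1::2]
    i135 = mosaic[1::2, 0::2]
    i0   = mosaic[1::2, 1::2]
    # Channel-wise concatenation so a network can process all directions jointly
    return np.stack([i0, i45, i90, i135], axis=0)

x = np.arange(16, dtype=np.float32).reshape(4, 4)
chans = mosaic_to_channels(x)  # shape (4, 2, 2)
```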

The unpolarized intensity is used to derive the inter-frame motion field for accurate motion estimation. Intensity and polarization representations are sampled by the polarization-aware implicit neural representation (PAINR), warped to the target time, refined by the motion-compensated feature refinement (MCFR) block, and decoded to reconstruct the high-resolution color polarization output.
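The warp-to-target-time step above can be sketched minimally. The sketch assumes locally linear motion (the frame-0-to-frame-1 flow is scaled by the target time t) and uses nearest-neighbor sampling for brevity; the actual pipeline samples features via PAINR, uses bilinear warping, and refines the result with the MCFR block:

```python
import numpy as np

def warp_to_time(img, flow, t):
    """Backward-warp an image toward intermediate time t in [0, 1].

    Each output pixel (y, x) samples the source at (x + t*u, y + t*v),
    where (u, v) is the 0->1 flow. Nearest-neighbor sampling keeps the
    sketch short; real implementations interpolate bilinearly.
    """
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    sx = np.clip(np.rint(xs + t * flow[..., 0]).astype(int), 0, w - 1)
    sy = np.clip(np.rint(ys + t * flow[..., 1]).astype(int), 0, h - 1)
    return img[sy, sx]

img = np.arange(16.0).reshape(4, 4)
flow = np.ones((4, 4, 2))              # uniform motion of (1, 1) pixels
warped = warp_to_time(img, flow, 1.0)  # output[y, x] = img[y+1, x+1], clipped at borders
```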

Overview of the proposed PolarVSR framework

Benchmark

We introduce PV, the first large-scale color polarization video dataset, collected using a FLIR BFS-U3-51S5PC-C camera. PV covers diverse indoor and outdoor scenes. For indoor setups, the camera is mounted on a rigid stand and the objects are placed on a motorized turntable; the turntable is operated at three speed levels, and the light sources are adjusted to create different illumination conditions. The objects cover a wide range of materials, including polarizers, plastics, and frosted surfaces. For outdoor scenarios, some scenes are recorded with a tripod, such as at a zoo, where the activities of different animals are captured and variations in fur texture produce noticeable polarization differences. We also collect driving sequences using an in-vehicle setup under daytime and nighttime road conditions, which are useful for downstream tasks such as object recognition. The camera operates at 75 FPS. In total, the dataset consists of 65 scenes and 117,550 frames, with sequence lengths ranging from 200 to 2,000 frames.
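For a rough sense of scale, the statistics above imply the following totals (simple arithmetic on the reported numbers):

```python
frames = 117550   # total frames in PV
scenes = 65       # number of scenes
fps = 75          # capture rate

total_seconds = frames / fps        # total recorded video duration
total_minutes = total_seconds / 60  # roughly 26 minutes of footage
avg_frames = frames / scenes        # roughly 1808 frames (~24 s) per scene on average
```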

Overview of the PV polarization video benchmark
Overview of representative PV samples across diverse scenes and materials.

Results

Video 1 (Space ×2, Time ×8)

S0
DoLP-AoP

Video 2 (Space ×2, Time ×8)

S0
DoLP-AoP

Video 3 (Space ×4, Time ×4)

S0
DoLP-AoP

Quantitative Results

Table 1. Demosaicking (2×) combined with 2× frame interpolation.

| VFI Method | SR Method | PSNR-I | PSNR-P | SSIM-I | SSIM-P | MAE ↓ |
|---|---|---|---|---|---|---|
| SuperSloMo | ATD | 27.322 | 23.073 | 0.847 | 0.657 | 14.626 |
| SuperSloMo | PIDSR | 33.977 | 32.021 | 0.939 | 0.818 | 11.843 |
| SuperSloMo | PUGDiff | 28.039 | 30.299 | 0.862 | 0.781 | 8.784 |
| VFIT | ATD | 28.983 | 22.390 | 0.875 | 0.658 | 14.693 |
| VFIT | PIDSR | 34.052 | 32.343 | 0.940 | 0.829 | 11.443 |
| VFIT | PUGDiff | 30.225 | 31.043 | 0.897 | 0.800 | 8.907 |
| SCUBA | ATD | 28.941 | 22.585 | 0.875 | 0.657 | 15.106 |
| SCUBA | PIDSR | 30.471 | 31.289 | 0.894 | 0.796 | 12.769 |
| SCUBA | PUGDiff | 30.212 | 30.876 | 0.896 | 0.791 | 9.995 |
| VideoINR | — | 29.388 | 22.218 | 0.886 | 0.657 | 11.255 |
| VideoINR-12ch | — | 32.423 | 29.732 | 0.924 | 0.771 | 8.583 |
| MoTIF | — | 29.687 | 22.519 | 0.892 | 0.660 | 11.415 |
| MoTIF-12ch | — | 32.292 | 29.859 | 0.933 | 0.788 | 8.558 |
| BF-STVSR | — | 29.472 | 22.214 | 0.890 | 0.656 | 11.481 |
| BF-STVSR-12ch | — | 32.654 | 29.763 | 0.928 | 0.776 | 8.437 |
| Ours | — | 34.631 | 33.310 | 0.944 | 0.854 | 5.922 |
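From the numbers in Table 1, the margins over the strongest baselines in each metric can be computed directly:

```python
# MAE: ours vs. the best baseline in Table 1 (BF-STVSR-12ch at 8.437)
ours_mae = 5.922
best_baseline_mae = 8.437
mae_reduction_pct = 100.0 * (best_baseline_mae - ours_mae) / best_baseline_mae  # ~29.8%

# PSNR-P: ours vs. the best baseline (VFIT + PIDSR at 32.343 dB)
psnr_p_gain = 33.310 - 32.343  # ~0.97 dB
```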

Computational Efficiency

Computational complexity is evaluated on 200×200 mosaic arrays with an NVIDIA RTX 4090 GPU under the demosaicking (2×) and 2× frame interpolation setting.

| Method | Params (M) | FLOPs (T) | Time (ms) |
|---|---|---|---|
| SCUBA + ATD | 17.364 + 0.753 | 0.403 | 548.92 |
| SCUBA + PIDSR | 17.364 + 7.647 | 0.195 | 397.26 |
| SCUBA + PUGDiff | 17.364 + 899.190 | 26.146 | 452.99 |
| VFIT + ATD | 29.054 + 0.753 | 0.532 | 300.48 |
| VFIT + PIDSR | 29.054 + 7.647 | 0.324 | 156.72 |
| VFIT + PUGDiff | 29.054 + 899.190 | 26.175 | 267.35 |
| SuperSloMo + ATD | 39.610 + 0.753 | 0.522 | 280.88 |
| SuperSloMo + PIDSR | 39.610 + 7.647 | 0.313 | 138.01 |
| SuperSloMo + PUGDiff | 39.610 + 899.190 | 26.170 | 284.90 |
| ZoomingSlowMo | 10.522 | 2.318 | 299.40 |
| TMNet | 11.618 | 3.343 | 447.71 |
| VideoINR | 10.732 | 2.126 | 434.68 |
| MoTIF | 13.195 | 2.227 | 419.24 |
| BF-STVSR | 12.906 | 2.141 | 339.16 |
| Ours | 17.684 | 1.228 | 103.86 |
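The per-frame latencies above translate into approximate throughput and speedup as follows (simple arithmetic on the reported times):

```python
# Per-frame latency on 200x200 mosaics (ms) -> approximate throughput
ours_ms = 103.86
fastest_baseline_ms = 138.01           # SuperSloMo + PIDSR, fastest competing pipeline

ours_fps = 1000.0 / ours_ms            # ~9.6 reconstructed frames per second
speedup = fastest_baseline_ms / ours_ms  # ~1.33x faster than the fastest baseline
```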