PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction

Chenggong Li1,2, Yidong Luo3,4, Junchao Zhang1,2,†, Boxin Shi5,6, and Degui Yang1,2,†

1School of Automation, Central South University

2Hunan Provincial Key Laboratory of Optic-Electronic Intelligent Measurement and Control

3Zhejiang University

4School of Engineering, Westlake University

5State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

6National Engineering Research Center of Visual Technology, School of Computer Science, Peking University

†Co-corresponding authors

Abstract

Polarimetric imaging aims to acquire surface polarization characteristics, such as the Degree of Linear Polarization (DoLP) and the Angle of Polarization (AoP). In mainstream Division-of-Focal-Plane (DoFP) color polarization imaging, reconstructing polarization parameters from the captured mosaic arrays remains a challenging inverse problem. Existing DoFP cameras are further limited by hardware bottlenecks and often cannot provide high-frame-rate acquisition, which restricts the use of polarimetric imaging in dynamic video tasks. These limitations motivate the joint enhancement of spatial and temporal resolution. To this end, we propose the first space-time polarization video reconstruction architecture. The proposed method performs unified spatiotemporal modeling of the polarization directions and uses a polarization-aware implicit neural representation to achieve continuous, high-fidelity upsampling. By analyzing temporal variations in polarization parameters, we further introduce a flow-guided polarization variation loss to supervise polarization dynamics. In addition, we establish the first large-scale color DoFP polarization video benchmark to support this research direction. Extensive experiments on the proposed benchmark demonstrate the effectiveness of the proposed method.
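The DoLP and AoP mentioned above follow from the standard Stokes-vector relations for four linear-polarizer measurements. As background (these are textbook formulas, not PolarVSR-specific code), they can be sketched as:

```python
import numpy as np

def dolp_aop(i0, i45, i90, i135):
    """Compute DoLP and AoP from intensities behind 0/45/90/135-degree polarizers.

    Standard linear Stokes relations:
      S0 = (I0 + I45 + I90 + I135) / 2
      S1 = I0 - I90
      S2 = I45 - I135
    """
    s0 = (i0 + i45 + i90 + i135) / 2.0
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1**2 + s2**2) / np.maximum(s0, 1e-8)  # guard divide-by-zero
    aop = 0.5 * np.arctan2(s2, s1)                        # radians in (-pi/2, pi/2]
    return dolp, aop

# Fully polarized light aligned with 0 degrees: I0=1, I90=0, I45=I135=0.5
dolp, aop = dolp_aop(1.0, 0.5, 0.0, 0.5)  # -> DoLP = 1.0, AoP = 0.0
```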

Overview

The overall pipeline of the proposed method is shown in the overview figure. Following standard continuous STVSR frameworks, PolarVSR synthesizes a high-resolution frame at an arbitrary time from a pair of adjacent mosaic arrays. Unlike standard STVSR, the proposed method jointly models all four polarization directions by concatenating them channel-wise for unified feature processing, allowing the network to learn intrinsic cross-direction dependencies under degradation.
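The channel-wise treatment of the four directions can be illustrated with a minimal sketch that splits a DoFP mosaic into per-direction channels and stacks them. The 2×2 super-pixel layout assumed here ([[90°, 45°], [135°, 0°]], a common convention for DoFP sensors) is an assumption for illustration; the actual sensor layout may differ:

```python
import numpy as np

def mosaic_to_channels(mosaic):
    """Split a DoFP mosaic of shape (H, W) into four direction channels (4, H/2, W/2).

    Assumed 2x2 super-pixel layout (illustrative only):
        [[90, 45],
         [135, 0]]  degrees
    """
    i90  = mosaic[0::2, 0::2]
    i45  = mosaic[0::2, 1::2]
    i135 = mosaic[1::2, 0::2]
    i0   = mosaic[1::2, 1::2]
    # Channel-wise concatenation so a network can process all directions jointly
    return np.stack([i0, i45, i90, i135], axis=0)

x = np.arange(16, dtype=np.float32).reshape(4, 4)
chans = mosaic_to_channels(x)  # shape (4, 2, 2)
```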

The unpolarized intensity is used to derive the inter-frame motion field for accurate motion estimation. Intensity and polarization representations are sampled by the polarization-aware implicit neural representation (PAINR), warped to the target time, refined by the motion-compensated feature refinement (MCFR) block, and decoded to reconstruct the high-resolution color polarization output.
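The warp-to-target-time step above can be sketched minimally. The sketch assumes locally linear motion (the frame-0-to-frame-1 flow is scaled by the target time t) and uses nearest-neighbor sampling for brevity; the actual pipeline samples features via PAINR, uses bilinear warping, and refines the result with the MCFR block:

```python
import numpy as np

def warp_to_time(img, flow, t):
    """Backward-warp an image toward intermediate time t in [0, 1].

    Each output pixel (y, x) samples the source at (x + t*u, y + t*v),
    where (u, v) is the 0->1 flow. Nearest-neighbor sampling keeps the
    sketch short; real implementations interpolate bilinearly.
    """
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    sx = np.clip(np.rint(xs + t * flow[..., 0]).astype(int), 0, w - 1)
    sy = np.clip(np.rint(ys + t * flow[..., 1]).astype(int), 0, h - 1)
    return img[sy, sx]

img = np.arange(16.0).reshape(4, 4)
flow = np.ones((4, 4, 2))              # uniform motion of (1, 1) pixels
warped = warp_to_time(img, flow, 1.0)  # output[y, x] = img[y+1, x+1], clipped at borders
```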

Overview of the proposed PolarVSR framework

Benchmark

We introduce PV, the first large-scale color polarization video dataset, collected using a FLIR BFS-U3-51S5PC-C camera. PV covers diverse indoor and outdoor scenes. For indoor setups, the camera is mounted on a rigid stand and the objects are placed on a motorized turntable; the turntable is operated at three speed levels, and the light sources are adjusted to create different illumination conditions. The objects cover a wide range of materials, including polarizers, plastics, and frosted surfaces. For outdoor scenarios, some scenes are recorded with a tripod, such as at a zoo, where the activities of different animals are captured and variations in fur texture produce noticeable polarization differences. We also collect driving sequences using an in-vehicle setup under daytime and nighttime road conditions, which are useful for downstream tasks such as object recognition. The camera operates at 75 FPS. In total, the dataset consists of 65 scenes and 117,550 frames, with sequence lengths ranging from 200 to 2,000 frames.
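For a rough sense of scale, the statistics above imply the following totals (simple arithmetic on the reported numbers):

```python
frames = 117550   # total frames in PV
scenes = 65       # number of scenes
fps = 75          # capture rate

total_seconds = frames / fps        # total recorded video duration
total_minutes = total_seconds / 60  # roughly 26 minutes of footage
avg_frames = frames / scenes        # roughly 1808 frames (~24 s) per scene on average
```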

Overview of the PV polarization video benchmark
Overview of representative PV samples across diverse scenes and materials.

Results

Video 1 (Space ×2, Time ×8)

S0
DoLP-AoP

Video 2 (Space ×2, Time ×8)

S0
DoLP-AoP

Video 3 (Space ×4, Time ×4)

S0
DoLP-AoP

Quantitative Results

Table 1. Demosaicking (2×) combined with 2× frame interpolation.

| VFI Method | SR Method | PSNR-I | PSNR-P | SSIM-I | SSIM-P | MAE ↓ |
|---|---|---|---|---|---|---|
| SuperSloMo | ATD | 27.322 | 23.073 | 0.847 | 0.657 | 14.626 |
| SuperSloMo | PIDSR | 33.977 | 32.021 | 0.939 | 0.818 | 11.843 |
| SuperSloMo | PUGDiff | 28.039 | 30.299 | 0.862 | 0.781 | 8.784 |
| VFIT | ATD | 28.983 | 22.390 | 0.875 | 0.658 | 14.693 |
| VFIT | PIDSR | 34.052 | 32.343 | 0.940 | 0.829 | 11.443 |
| VFIT | PUGDiff | 30.225 | 31.043 | 0.897 | 0.800 | 8.907 |
| SCUBA | ATD | 28.941 | 22.585 | 0.875 | 0.657 | 15.106 |
| SCUBA | PIDSR | 30.471 | 31.289 | 0.894 | 0.796 | 12.769 |
| SCUBA | PUGDiff | 30.212 | 30.876 | 0.896 | 0.791 | 9.995 |
| VideoINR | — | 29.388 | 22.218 | 0.886 | 0.657 | 11.255 |
| VideoINR-12ch | — | 32.423 | 29.732 | 0.924 | 0.771 | 8.583 |
| MoTIF | — | 29.687 | 22.519 | 0.892 | 0.660 | 11.415 |
| MoTIF-12ch | — | 32.292 | 29.859 | 0.933 | 0.788 | 8.558 |
| BF-STVSR | — | 29.472 | 22.214 | 0.890 | 0.656 | 11.481 |
| BF-STVSR-12ch | — | 32.654 | 29.763 | 0.928 | 0.776 | 8.437 |
| Ours | — | 34.631 | 33.310 | 0.944 | 0.854 | 5.922 |
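From the numbers in Table 1, the margins over the strongest baselines in each metric can be computed directly:

```python
# MAE: ours vs. the best baseline in Table 1 (BF-STVSR-12ch at 8.437)
ours_mae = 5.922
best_baseline_mae = 8.437
mae_reduction_pct = 100.0 * (best_baseline_mae - ours_mae) / best_baseline_mae  # ~29.8%

# PSNR-P: ours vs. the best baseline (VFIT + PIDSR at 32.343 dB)
psnr_p_gain = 33.310 - 32.343  # ~0.97 dB
```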

Computational Efficiency

Computational complexity is evaluated on 200×200 mosaic arrays with an NVIDIA RTX 4090 GPU under the demosaicking (2×) and 2× frame interpolation setting.

| Method | Params (M) | FLOPs (T) | Time (ms) |
|---|---|---|---|
| SCUBA + ATD | 17.364 + 0.753 | 0.403 | 548.92 |
| SCUBA + PIDSR | 17.364 + 7.647 | 0.195 | 397.26 |
| SCUBA + PUGDiff | 17.364 + 899.190 | 26.146 | 452.99 |
| VFIT + ATD | 29.054 + 0.753 | 0.532 | 300.48 |
| VFIT + PIDSR | 29.054 + 7.647 | 0.324 | 156.72 |
| VFIT + PUGDiff | 29.054 + 899.190 | 26.175 | 267.35 |
| SuperSloMo + ATD | 39.610 + 0.753 | 0.522 | 280.88 |
| SuperSloMo + PIDSR | 39.610 + 7.647 | 0.313 | 138.01 |
| SuperSloMo + PUGDiff | 39.610 + 899.190 | 26.170 | 284.90 |
| ZoomingSlowMo | 10.522 | 2.318 | 299.40 |
| TMNet | 11.618 | 3.343 | 447.71 |
| VideoINR | 10.732 | 2.126 | 434.68 |
| MoTIF | 13.195 | 2.227 | 419.24 |
| BF-STVSR | 12.906 | 2.141 | 339.16 |
| Ours | 17.684 | 1.228 | 103.86 |
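The per-frame latencies above translate into approximate throughput and speedup as follows (simple arithmetic on the reported times):

```python
# Per-frame latency on 200x200 mosaics (ms) -> approximate throughput
ours_ms = 103.86
fastest_baseline_ms = 138.01           # SuperSloMo + PIDSR, fastest competing pipeline

ours_fps = 1000.0 / ours_ms            # ~9.6 reconstructed frames per second
speedup = fastest_baseline_ms / ours_ms  # ~1.33x faster than the fastest baseline
```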