GitHub | Habr article | Project Page | Technical Report (soon)

KVAE 2.0: Video tokenizers

KVAE 2.0 and its predecessor KVAE 1.0 are families of video and image tokenizers with spatial compression ratios of 8 and 16. The video models additionally use a temporal compression ratio of 4.

KVAE-3D-2.0-t4s8

KVAE-3D-2.0-t4s8 is a video tokenizer with a temporal compression ratio of 4 and a spatial compression ratio of 8x8.
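To make the compression ratios concrete, here is a minimal sketch of the arithmetic that maps a clip's size to the size of its latent grid. The function name `latent_grid` and the `causal_first_frame` option are hypothetical and not part of the KVAE code. How KVAE treats the first frame, pads inputs, or how many latent channels it uses is not stated here, so both temporal conventions below are assumptions.

```python
def latent_grid(frames: int, height: int, width: int,
                t_ratio: int = 4, s_ratio: int = 8,
                causal_first_frame: bool = False) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width) for a clip.

    Assumption: inputs must divide evenly by the ratios (no padding).
    """
    if height % s_ratio or width % s_ratio:
        raise ValueError("height and width must be multiples of s_ratio")
    if causal_first_frame:
        # Assumed convention used by some causal video VAEs:
        # the first frame is encoded alone, the rest in groups of t_ratio.
        if (frames - 1) % t_ratio:
            raise ValueError("frames must equal 1 + k * t_ratio")
        latent_frames = 1 + (frames - 1) // t_ratio
    else:
        # Plain division: every t_ratio frames become one latent frame.
        if frames % t_ratio:
            raise ValueError("frames must be a multiple of t_ratio")
        latent_frames = frames // t_ratio
    return latent_frames, height // s_ratio, width // s_ratio


if __name__ == "__main__":
    # A 16-frame 1280x720 clip under the t4s8 and t4s16 ratios.
    print(latent_grid(16, 720, 1280, s_ratio=8))
    print(latent_grid(16, 720, 1280, s_ratio=16))
```

With plain division, the s16 variant produces a latent grid with four times fewer spatial positions than the s8 variant for the same clip.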

Evaluation of reconstruction

Reconstruction quality was evaluated on two open datasets: MCL-JCV (video at 1280x720 resolution) and BVI-DVC. Wan-2.1 and HunyuanVideo-1.0 served as baselines for the 4x8x8 compression format. The comparison below uses the PSNR, SSIM, and LPIPS metrics (LPIPS computed with AlexNet features).

Reconstruction comparison of KVAE 2.0, Hunyuan 1.0 and Wan 2.1
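For readers unfamiliar with the metrics, PSNR is the simplest of the three. The sketch below shows the standard definition on flat sequences of pixel values. It is only an illustration of the formula, not the evaluation code used for this comparison. The function name `psnr` and the 8-bit default for `max_value` are assumptions. For real frames you would typically compute the same quantity with an array library.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [10, 20, 30, 40]
    decoded = [11, 19, 31, 39]  # every value is off by one
    print(f"PSNR: {psnr(original, decoded):.2f} dB")
```

Higher PSNR means a reconstruction closer to the original. LPIPS, by contrast, is a distance, so lower values are better.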

Inference instructions

Installation

Clone the repo:

git clone https://github.com/kandinskylab/kvae.git
cd kvae

Create an environment with torch==2.8.0 and CUDA 12.8:

conda create -n kvae_inference python=3.11
conda activate kvae_inference
pip install -r requirements.txt
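After installation, it can be useful to confirm that the environment matches the expected torch and CUDA versions. The helper below is a small sketch for that check. The name `describe_torch_env` is hypothetical and not part of the repository. It only reads standard PyTorch attributes.

```python
import importlib.util


def describe_torch_env() -> dict:
    """Report the installed torch version and its CUDA build, if any."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch  # imported lazily so the check also works without torch

    return {
        "installed": True,
        "torch_version": torch.__version__,       # expected to start with 2.8.0
        "cuda_build": torch.version.cuda,         # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch_env())
```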

KVAE inference

To run an image model on a dataset and compute metrics, use the script:

PYTHONPATH=. python scripts/inference_2d_kvae.py --dataset_folder ./assets/images/ --model KVAE_1.0

To run video models:

PYTHONPATH=. python scripts/inference_3d_kvae.py --dataset_folder ./assets/test1/ --model KVAE_2.0-t4s8

To save the reconstructions, pass --saving_folder with the target directory, e.g. ./your_path/. Note that saving increases the running time, especially for the video model, even though it runs asynchronously with the other components.
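If you launch these scripts from your own Python code, the commands above can be assembled programmatically. The sketch below uses only the script paths and flags shown in this README. The helper names `build_inference_command` and `run_inference` are hypothetical, and it assumes the process is started from the repository root.

```python
import os
import subprocess
import sys
from typing import Optional


def build_inference_command(dataset_folder: str, model: str,
                            saving_folder: Optional[str] = None,
                            video: bool = True) -> list[str]:
    """Build the argument list for the 2D or 3D inference script."""
    script = "scripts/inference_3d_kvae.py" if video else "scripts/inference_2d_kvae.py"
    cmd = [sys.executable, script, "--dataset_folder", dataset_folder, "--model", model]
    if saving_folder is not None:
        # Saving reconstructions is optional and slows the run down.
        cmd += ["--saving_folder", saving_folder]
    return cmd


def run_inference(cmd: list[str], repo_root: str = ".") -> subprocess.CompletedProcess:
    """Run the command from the repository root with PYTHONPATH=. set."""
    env = dict(os.environ, PYTHONPATH=".")
    return subprocess.run(cmd, cwd=repo_root, env=env, check=True)


if __name__ == "__main__":
    command = build_inference_command("./assets/test1/", "KVAE_2.0-t4s8",
                                      saving_folder="./your_path/")
    print(" ".join(command))
```

Calling `run_inference(command)` inside a clone of the repository would then be equivalent to the shell invocation with `--saving_folder` appended.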

A more detailed example of working with the models is provided in inference_examples.ipynb.

To use the mediapy library, you will need to install ffmpeg:

conda install -c conda-forge ffmpeg
pip install -q mediapy

Model Zoo

The KVAE 1.0 collection featured 2 models, one for video and one for image tokenization, both with a spatial compression ratio of 8. The KVAE 2.0 collection features 2 models, both for video tokenization, with spatial compression ratios of 8 and 16, respectively. Links to all KVAE models are listed below.

Model	Data type	Time compression	Spatial compression	Checkpoint
KVAE-3D-2.0-t4s8	video	4	8	🤗 HF
KVAE-3D-2.0-t4s16	video	4	16	🤗 HF
KVAE-3D-1.0	video	4	8	🤗 HF
KVAE-2D-1.0	image	-	8	🤗 HF
