Transformers documentation
PE Video (Perception Encoder Video)
This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.
PE Video (Perception Encoder Video)
Overview
TODO
Usage
Basic usage
TODO
PeVideoVideoProcessor
class transformers.PeVideoVideoProcessor
< source >( **kwargs: typing_extensions.Unpack[transformers.processing_utils.VideosKwargs] )
PeVideoProcessor
__call__
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None text: str | list[str] | list[list[str]] | None = None videos: typing.Union[list['PIL.Image.Image'], numpy.ndarray, ForwardRef('torch.Tensor'), list[numpy.ndarray], list['torch.Tensor'], list[list['PIL.Image.Image']], list[list[numpy.ndarray]], list[list['torch.Tensor']], transformers.video_utils.URL, list[transformers.video_utils.URL], list[list[transformers.video_utils.URL]], transformers.video_utils.Path, list[transformers.video_utils.Path], list[list[transformers.video_utils.Path]], NoneType] = None audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), collections.abc.Sequence[numpy.ndarray], collections.abc.Sequence['torch.Tensor'], NoneType] = None **kwargs: typing_extensions.Unpack[transformers.processing_utils.ProcessingKwargs] ) → BatchFeature
Parameters
- images (
PIL.Image.Image,np.ndarray,torch.Tensor,list[PIL.Image.Image],list[np.ndarray],list[torch.Tensor]) — The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. Both channels-first and channels-last formats are supported. - text (
TextInput,PreTokenizedInput,list[TextInput],list[PreTokenizedInput], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must setis_split_into_words=True(to lift the ambiguity with a batch of sequences). - videos (
np.ndarray,torch.Tensor,List[np.ndarray],List[torch.Tensor]) — The video or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported. - audio (
np.ndarray,torch.Tensor,list[np.ndarray],list[torch.Tensor]) — The audio or batch of audio to be prepared. Each audio can be a NumPy array or PyTorch tensor. - return_tensors (
stror TensorType, optional) — If set, will return tensors of a particular framework. Acceptable values are:'pt': Return PyTorchtorch.Tensorobjects.'np': Return NumPynp.ndarrayobjects.
Returns
A BatchFeature object with processed inputs in a dict format.
Main method to prepare for model inputs. This method forwards the each modality argument to its own processor
along with kwargs. Please refer to the docstring of the each processor attributes for more information.
PeVideoEncoderConfig
class transformers.PeVideoEncoderConfig
< source >( output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None tokenizer_class: str | transformers.tokenization_utils_base.PreTrainedTokenizerBase | None = None vision_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None hidden_size: int = 1792 intermediate_size: int = 4800 num_hidden_layers: int = 6 num_attention_heads: int = 14 num_key_value_heads: int | None = None head_dim: int = 128 hidden_act: str = 'silu' max_position_embeddings: int = 10000 initializer_range: float = 0.02 rms_norm_eps: float = 1e-05 rope_parameters: transformers.modeling_rope_utils.RopeParameters | dict | None = None attention_bias: bool = False attention_dropout: float | int = 0.0 )
Parameters
- output_hidden_states (
bool, optional, defaults toFalse) — Whether or not the model should return all hidden-states. - return_dict (
bool, optional, defaults toTrue) — Whether to return aModelOutput(dataclass) instead of a plain tuple. - dtype (
Union[str, torch.dtype], optional) — The chunk size of all feed forward layers in the residual attention blocks. A chunk size of0means that the feed forward layer is not chunked. A chunk size of n means that the feed forward layer processesn< sequence_length embeddings at a time. For more information on feed forward chunking, see How does Feed Forward Chunking work?. - chunk_size_feed_forward (
int, optional, defaults to0) — Thedtypeof the weights. This attribute can be used to initialize the model to a non-defaultdtype(which is normallyfloat32) and thus allow for optimal storage allocation. For example, if the saved model isfloat16, ideally we want to load it back using the minimal amount of memory needed to loadfloat16weights. - is_encoder_decoder (
bool, optional, defaults toFalse) — Whether the model is used as an encoder/decoder or not. - id2label (
Union[dict[int, str], dict[str, str]], optional) — A map from index (for instance prediction index, or target index) to label. - label2id (
Union[dict[str, int], dict[str, str]], optional) — A map from label to index for the model. - problem_type (
Literal[regression, single_label_classification, multi_label_classification], optional) — Problem type forXxxForSequenceClassificationmodels. Can be one of"regression","single_label_classification"or"multi_label_classification". - tokenizer_class (
Union[str, ~tokenization_utils_base.PreTrainedTokenizerBase], optional) — The class name of model’s tokenizer. - vision_config (
Union[dict, ~configuration_utils.PreTrainedConfig], optional) — The config object or dictionary of the vision backbone. - hidden_size (
int, optional, defaults to1792) — Dimension of the hidden representations. - intermediate_size (
int, optional, defaults to4800) — Dimension of the MLP representations. - num_hidden_layers (
int, optional, defaults to6) — Number of hidden layers in the Transformer decoder. - num_attention_heads (
int, optional, defaults to14) — Number of attention heads for each attention layer in the Transformer decoder. - num_key_value_heads (
int, optional) — This is the number of key_value heads that should be used to implement Grouped Query Attention. Ifnum_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA), ifnum_key_value_heads=1the model will use Multi Query Attention (MQA) otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details, check out this paper. If it is not specified, will default tonum_attention_heads. - head_dim (
int, optional, defaults to128) — The attention head dimension. If None, it will default to hidden_size // num_attention_heads - hidden_act (
str, optional, defaults tosilu) — The non-linear activation function (function or string) in the decoder. For example,"gelu","relu","silu", etc. - max_position_embeddings (
int, optional, defaults to10000) — The maximum sequence length that this model might ever be used with. - initializer_range (
float, optional, defaults to0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices. - rms_norm_eps (
float, optional, defaults to1e-05) — The epsilon used by the rms normalization layers. - rope_parameters (
Union[~modeling_rope_utils.RopeParameters, dict], optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value forrope_thetaand optionally parameters used for scaling in case you want to use RoPE with longermax_position_embeddings. - attention_bias (
bool, optional, defaults toFalse) — Whether to use a bias in the query, key, value and output projection layers during self-attention. - attention_dropout (
Union[float, int], optional, defaults to0.0) — The dropout ratio for the attention probabilities.
This is the configuration class to store the configuration of a PeVideoModel. It is used to instantiate a Pe Video model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the facebook/pe-av-large
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
>>> from transformers import PeAudioEncoder, PeAudioEncoderConfig
>>> # Initializing a PeAudioEncoder style configuration
>>> configuration = PeAudioEncoderConfig()
>>> # Initializing a model from the pe-av-large style configuration
>>> model = PeAudioEncoder(configuration)
>>> # Accessing the model configuration
>>> configuration = model.configPeVideoConfig
class transformers.PeVideoConfig
< source >( output_hidden_states: bool | None = False return_dict: bool | None = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False id2label: dict[int, str] | dict[str, str] | None = None label2id: dict[str, int] | dict[str, str] | None = None problem_type: typing.Optional[typing.Literal['regression', 'single_label_classification', 'multi_label_classification']] = None tokenizer_class: str | transformers.tokenization_utils_base.PreTrainedTokenizerBase | None = None text_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None video_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None )
Parameters
- output_hidden_states (
bool, optional, defaults toFalse) — Whether or not the model should return all hidden-states. - return_dict (
bool, optional, defaults toTrue) — Whether to return aModelOutput(dataclass) instead of a plain tuple. - dtype (
Union[str, torch.dtype], optional) — The chunk size of all feed forward layers in the residual attention blocks. A chunk size of0means that the feed forward layer is not chunked. A chunk size of n means that the feed forward layer processesn< sequence_length embeddings at a time. For more information on feed forward chunking, see How does Feed Forward Chunking work?. - chunk_size_feed_forward (
int, optional, defaults to0) — Thedtypeof the weights. This attribute can be used to initialize the model to a non-defaultdtype(which is normallyfloat32) and thus allow for optimal storage allocation. For example, if the saved model isfloat16, ideally we want to load it back using the minimal amount of memory needed to loadfloat16weights. - is_encoder_decoder (
bool, optional, defaults toFalse) — Whether the model is used as an encoder/decoder or not. - id2label (
Union[dict[int, str], dict[str, str]], optional) — A map from index (for instance prediction index, or target index) to label. - label2id (
Union[dict[str, int], dict[str, str]], optional) — A map from label to index for the model. - problem_type (
Literal[regression, single_label_classification, multi_label_classification], optional) — Problem type forXxxForSequenceClassificationmodels. Can be one of"regression","single_label_classification"or"multi_label_classification". - tokenizer_class (
Union[str, ~tokenization_utils_base.PreTrainedTokenizerBase], optional) — The class name of model’s tokenizer. - text_config (
Union[dict, ~configuration_utils.PreTrainedConfig], optional) — The config object or dictionary of the text backbone. - video_config (
dictorPreTrainedConfig, optional) — Configuration for the video encoder component. - ```python —
from transformers import PeVideoModel, PeVideoConfig
This is the configuration class to store the configuration of a PeVideoModel. It is used to instantiate a Pe Video model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the facebook/pe-av-large
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
PeVideoModel
forward
< source >( input_ids: Tensor pixel_values_videos: Tensor attention_mask: torch.Tensor | None = None padding_mask_videos: torch.Tensor | None = None return_loss: bool | None = None **kwargs )
PeVideoEncoder
class transformers.PeVideoEncoder
< source >( config: PeVideoEncoderConfig )
Parameters
- config (PeVideoEncoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The PeVideo Encoder model.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values_videos: Tensor padding_mask_videos: torch.Tensor | None = None **kwargs )