Model Card for esp-aves2-sl-beats-all
Model Details
Model Description
esp-aves2-sl-beats-all is an audio representation learning model (bioacoustic encoder) designed to produce transferable embeddings for downstream bioacoustic tasks including species classification and detection, individual identification, and vocal repertoire discovery, as described in What Matters for Bioacoustic Encoding.
- Developed by: Marius Miron, David Robinson, Milad Alizadeh, Ellen Gilsenan-McMahon, Gagan Narula, Emmanuel Chemla, Maddie Cusimano, Felix Effenberger, Masato Hagiwara, Benjamin Hoffman, Sara Keen, Diane Kim, Jane K. Lawton, Jen-Yu Liu, Aza Raskin, Olivier Pietquin, Matthieu Geist
- Funded by: More info at https://www.earthspecies.org/about-us#support
- Shared by: Earth Species Project
- Model type: Audio representation learning model (Transformer; BEATs backbone)
- License: CC-BY-NC-SA
- Finetuned from model: BEATs pretrained on AudioSet (see Parent Models)
Model Sources
- Repository: https://github.com/earthspecies/avex
- Paper: What Matters for Bioacoustic Encoding
- Hugging Face Model: ESP-AVES2 Collection
- Configuration: train_config.yaml
Parent Models
This model is based on or fine-tuned from the following parent models:
- BEATs (pretrained on AudioSet)
- Source: https://github.com/microsoft/unilm/tree/master/beats
- Description: Self-supervised transformer audio encoder used as the base SSL checkpoint.
- License: See upstream repository
Uses
Direct Use
esp-aves2-sl-beats-all can be used directly for bioacoustic tasks such as species classification and detection, individual and repertoire classification, and retrieval and clustering of audio.
Downstream Use
The model can be used for linear probing, retrieval, and clustering of audio; it can also be fine-tuned for task- and domain-specific bioacoustic applications (taxon-, habitat-, or device-specific).
Out-of-Scope Use
The model is not designed as a generative model, and it does not produce text outputs. Using it as a stand-alone classifier without training a probe or finetuning is out of scope.
Bias, Risks, and Limitations
- Bias: The training data relies heavily on citizen-science recordings and may over-represent certain taxa and regions (e.g., Northern Hemisphere); this can impact generalization.
- Risks: Predictions and embeddings can be misused for harmful wildlife exploitation (e.g., locating endangered species) if deployed without safeguards.
- Limitations: The paper trains and evaluates models at 16 kHz for fair comparison across models; some taxa may require higher bandwidth. Performance can degrade under large distribution shifts (habitat, device, background noise).
Recommendations
Use esp-aves2-sl-beats-all as an encoder (feature extractor) and validate performance on your target domain. For sensitive deployments, apply access controls and follow conservation best practices.
How to Get Started with the Model
Loading this model requires the AVEX (Animal Vocalization Encoder) library, avex, to be installed.
Installation
pip install avex
Or with uv:
uv add avex
For more details, see https://github.com/earthspecies/avex.
Loading the Model
from avex import load_model
# Model config name depends on how the checkpoint is packaged for this repo.
# If/when an official config is provided, replace the string below accordingly.
model = load_model("esp_aves2_sl_beats_all", device="cuda")
Using the Model
# Case 1: embedding extraction (features only)
import torch

# `audio_tensor` is a batch of 16 kHz waveforms (see the input-preparation sketch below)
backbone = load_model("esp_aves2_sl_beats_all", device="cuda", return_features_only=True)
with torch.no_grad():
    embeddings = backbone(audio_tensor)
    # Shape: (batch, time_steps, 768) for BEATs

# Pool over time to get a fixed-size embedding
embedding = embeddings.mean(dim=1)  # Shape: (batch, 768)

# Case 2: supervised predictions (logits over label IDs; see label_map.json)
model = load_model("esp_aves2_sl_beats_all", device="cuda")
with torch.no_grad():
    logits = model(audio_tensor)
predicted_class = logits.argmax(dim=-1).item()
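The examples above assume that audio_tensor is a batch of 16 kHz mono waveforms (the rate the model is trained and evaluated at). Below is a minimal sketch of preparing such a tensor with torchaudio; torchaudio is not part of avex, and the expected input shape and any normalization are assumptions to verify against the avex documentation.

import torch
import torchaudio

# Load a clip, mix down to mono, and resample to the 16 kHz rate used by the model.
# The (batch, samples) input shape is an assumption; check the avex docs.
waveform, sample_rate = torchaudio.load("recording.wav")   # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)              # (1, samples), mono
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
audio_tensor = waveform.to("cuda")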
Transfer Learning with Probes
from avex import load_model
from avex.models.probes import build_probe_from_config
from avex.configs import ProbeConfig

# Load backbone for feature extraction
base = load_model("esp_aves2_sl_beats_all", return_features_only=True, device="cuda")

# Define a probe head for your task
probe_config = ProbeConfig(
    probe_type="linear",
    target_layers=["last_layer"],
    aggregation="mean",
    freeze_backbone=True,
    online_training=True,
)
probe = build_probe_from_config(
    probe_config=probe_config,
    base_model=base,
    num_classes=10,  # Your number of classes
    device="cuda",
)
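The probe returned by build_probe_from_config can then be trained on your labeled data. The loop below is a hypothetical, minimal sketch (standard PyTorch with single-label cross-entropy); it assumes the probe is a torch.nn.Module mapping audio batches to logits and that train_loader is your own DataLoader. The actual avex training utilities and call signatures may differ.

import torch

# Hypothetical training loop for the probe head on top of the frozen backbone.
optimizer = torch.optim.AdamW(
    [p for p in probe.parameters() if p.requires_grad], lr=1e-3
)
criterion = torch.nn.CrossEntropyLoss()

probe.train()
for audio_batch, labels in train_loader:  # train_loader: your own DataLoader
    audio_batch, labels = audio_batch.to("cuda"), labels.to("cuda")
    logits = probe(audio_batch)           # assumes the probe maps audio -> logits
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()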
Class Label Mapping
The class label mapping for this supervised learning model can be found at label_map.json in the Hugging Face repository.
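As a sketch of how the mapping can be used, the snippet below downloads label_map.json from the Hub and converts a predicted class index (e.g., predicted_class from the supervised example above) into a label name. The repository id and the JSON orientation (label name to class id) are assumptions; adjust them to match the actual file.

import json
from huggingface_hub import hf_hub_download

# The repo id below is an assumption; replace it with this model's actual repository id.
path = hf_hub_download("EarthSpeciesProject/esp-aves2-sl-beats-all", "label_map.json")
with open(path) as f:
    label_map = json.load(f)

# Assumes {"label name": class_id}; invert if the file is stored the other way around.
id_to_label = {int(v): k for k, v in label_map.items()}
print(id_to_label[predicted_class])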
Training Details
Training Data
As described in the paper, the model follows a two-stage recipe: a BEATs SSL backbone pretrained on AudioSet, followed by supervised post-training on an All mix (Bioacoustics mix + AudioSet).
Training Data Sources
| Dataset | Description | Source | License | Size |
|---|---|---|---|---|
| AudioSet | general audio | Link | See dataset terms | 5700 hours |
| Xeno-canto | birds | Link | CC (varies) | 10416 hours |
| iNaturalist | diverse taxa | Link | CC (varies) | 1539 hours |
| Watkins | marine mammals | Link | licensing agreement (paper) | 27 hours |
| Animal Sound Archive | diverse taxa | Link | See archive terms | 78 hours |
Training Procedure
As described in the paper:
- Stage 1 (SSL): BEATs pretrained on AudioSet.
- Stage 2 (SL): supervised post-training on All (Bio + AudioSet).
- Augmentations: random additive noise with probability 0.5 at an SNR sampled uniformly from [-10, 20] dB; mixup-style linear mixing of random pairs in-batch with probability 0.5 and a union of labels (sketched below).
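The snippet below is an illustrative sketch of these two augmentations, not the actual training code: additive noise scaled to a target SNR, and in-batch linear mixing of random pairs with a union (element-wise max) of multi-hot label vectors. The probability-0.5 gating is omitted for brevity.

import torch

def add_noise(audio, noise, snr_db):
    # Scale `noise` so it sits at the requested SNR (in dB) relative to `audio`.
    sig_power = audio.pow(2).mean(dim=-1, keepdim=True)
    noise_power = noise.pow(2).mean(dim=-1, keepdim=True).clamp_min(1e-8)
    scale = torch.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return audio + scale * noise

def mixup_batch(audio, labels):
    # Linearly mix each example with a random partner and take the union of labels.
    perm = torch.randperm(audio.size(0))
    lam = torch.rand(audio.size(0), 1)
    mixed = lam * audio + (1 - lam) * audio[perm]
    return mixed, torch.maximum(labels, labels[perm])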
Training Hyperparameters
Training hyperparameters are specified in train_config.yaml.
Evaluation
Testing Data, Factors & Metrics
Testing Data
The paper evaluates on a benchmark spanning:
- BEANS (classification and detection): https://github.com/earthspecies/beans
- BirdSet (detection; Dedicated Train setup): https://huggingface.co/datasets/DBD-research-group/BirdSet
- Individual ID (classification): Pipit, Chiffchaff, Little Owl, Macaques
- Vocal Repertoire (retrieval + clustering): Zebra Finch, Giant Otters, Bengalese Finch, Killer Whale
Metrics
The paper reports:
- Linear probing: accuracy (single-label) and mean average precision (multi-label/detection)
- Retrieval: ROC AUC
- Clustering: normalized mutual information (NMI) for single-label datasets
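As an illustrative sketch (not the paper's exact protocol), these metric families can be computed with scikit-learn from frozen embeddings and labels; the dummy data, pairing scheme, and cluster count below are assumptions for demonstration only.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, normalized_mutual_info_score, roc_auc_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 768))   # placeholder frozen-encoder embeddings
y_true = rng.integers(0, 5, size=100)      # placeholder ground-truth classes

# Probe metric (single-label): accuracy of the probe's predictions
y_pred = rng.integers(0, 5, size=100)      # placeholder probe predictions
print("accuracy:", accuracy_score(y_true, y_pred))

# Retrieval: ROC AUC of cosine similarity for same-class vs. different-class pairs
i, j = rng.integers(0, 100, size=500), rng.integers(0, 100, size=500)
sims = (embeddings[i] * embeddings[j]).sum(1) / (
    np.linalg.norm(embeddings[i], axis=1) * np.linalg.norm(embeddings[j], axis=1))
print("retrieval ROC AUC:", roc_auc_score(y_true[i] == y_true[j], sims))

# Clustering: NMI between k-means assignments and ground-truth classes
clusters = KMeans(n_clusters=5, n_init=10).fit_predict(embeddings)
print("NMI:", normalized_mutual_info_score(y_true, clusters))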
Results
Aggregate results for the frozen esp-aves2-sl-beats-all encoder (linear probing, retrieval, and clustering), as reported in the paper:
| Benchmark | Task | Metric | Score |
|---|---|---|---|
| BEANS Classification | Probe | Accuracy | 0.832 |
| BEANS Classification | Retrieval | ROC AUC | 0.813 |
| BEANS Classification | Clustering | NMI | 0.604 |
| BEANS Detection | Probe | mAP | 0.408 |
| BEANS Detection | Retrieval | ROC AUC | 0.726 |
| BirdSet | Probe | mAP | 0.294 |
| BirdSet | Retrieval | ROC AUC | 0.732 |
| Individual ID | Probe | Accuracy | 0.511 |
| Individual ID | Retrieval | ROC AUC | 0.690 |
| Vocal Repertoire | Retrieval | ROC AUC | 0.798 |
| Vocal Repertoire | Clustering | NMI | 0.529 |
Environmental Impact
Not specified.
Technical Specifications
Model Architecture and Objective
esp-aves2-sl-beats-all uses a BEATs transformer encoder trained with a self-supervised pretraining stage (AudioSet) followed by supervised post-training on All (Bio + AudioSet), to learn general-purpose bioacoustic representations.
Key components:
- Encoder: BEATs transformer
- Feature extraction: time-series embeddings, pooled for probes/retrieval/clustering in the paper
- Output: embeddings (dimension depends on backbone configuration)
Compute Infrastructure
Not specified.
Model Configuration
Model configuration is available in train_config.yaml.
Citation
BibTeX:
@inproceedings{miron2025matters,
  title={What Matters for Bioacoustic Encoding},
  author={Miron, Marius and Robinson, David and Alizadeh, Milad and Gilsenan-McMahon, Ellen and Narula, Gagan and Chemla, Emmanuel and Cusimano, Maddie and Effenberger, Felix and Hagiwara, Masato and Hoffman, Benjamin and Keen, Sara and Kim, Diane and Lawton, Jane K. and Liu, Jen-Yu and Raskin, Aza and Pietquin, Olivier and Geist, Matthieu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
APA:
Miron, M., Robinson, D., Alizadeh, M., et al. (2025). What Matters for Bioacoustic Encoding. arXiv preprint arXiv:2508.11845.
Glossary
- Bioacoustic encoder: A model that maps audio to embeddings useful for downstream bioacoustic tasks.
- Linear probing: Training a simple linear model on frozen embeddings to assess representation quality.
- NMI: Normalized Mutual Information, a clustering quality metric.
More Information
- Project page: TBA
- Documentation: TBA
- Issue tracker: https://github.com/earthspecies/avex/issues
Model Card Authors
- Earth Species Project
Model Card Contact
Contact: marius@earthspecies.org, david@earthspecies.org, milad@earthspecies.org, gagan@earthspecies.org