CIMP: Contrastive Image-Metadata Pre-training (ResNet-18, crop 512)

A contrastive encoder that aligns HAADF-STEM microscopy images with their acquisition metadata in a shared 128-d embedding space. This variant uses a ResNet-18 image encoder trained from scratch on 512×512 patches at effective batch size 512, and is the best-performing ResNet configuration reported in the accompanying paper.

Model Details

  • Architecture: ResNet-18 image encoder (trained from scratch, single-channel input) + 3-layer MLP metadata encoder (hidden dim 256)
  • Embedding dimension: 128
  • Image input: Single-channel grayscale, 512×512 pixels
  • Metadata input: 7-d z-scored vector (pixel_size, dwell_time, convergence_angle, beam_current, gain, offset, inner_collection_angle)
  • Loss: Symmetric cross-entropy (CLIP-style) with learnable temperature and bias
  • Parameters: ~11M (ResNet-18 backbone)
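
The loss described above can be sketched in a few lines. This is a generic CLIP-style symmetric cross-entropy, not the repository's actual implementation; the exact parameterization of the learnable temperature and bias (here plain scalars applied as `temp * sim + bias`) is an assumption.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, meta_emb, temp, bias):
    """Symmetric cross-entropy over image/metadata similarities.

    img_emb, meta_emb: (B, D) L2-normalized embeddings; the matched
    metadata for image i is assumed to sit at row i.
    temp, bias: learnable scalars scaling/shifting the logits.
    """
    logits = temp * img_emb @ meta_emb.t() + bias      # (B, B) pairwise logits
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img = F.cross_entropy(logits, targets)        # image -> metadata direction
    loss_meta = F.cross_entropy(logits.t(), targets)   # metadata -> image direction
    return 0.5 * (loss_img + loss_meta)
```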

Retrieval Performance

Evaluated on the held-out validation split (733 images from the CMMP dataset).

| Metric | Value |
|---|---|
| Top-1 | 0.8594 |
| Top-5 | 1.0000 |
| Top-10 | 1.0000 |
| Best epoch | 956 / 1000 |
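
The Top-k numbers above are standard retrieval metrics: for each image, all metadata embeddings are ranked by similarity and a hit is counted if the true match lands in the top k. A minimal sketch (hypothetical helper, assuming L2-normalized embeddings with matched pairs on the diagonal):

```python
import torch

def topk_retrieval_accuracy(img_emb, meta_emb, ks=(1, 5, 10)):
    """Image -> metadata retrieval accuracy at each k in `ks`."""
    sims = img_emb @ meta_emb.t()                      # (N, N) cosine sims for unit vectors
    ranks = sims.argsort(dim=1, descending=True)       # candidate indices, best first
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # (N, 1) ground-truth index
    hits = ranks.eq(targets)                           # (N, N) True where the match sits
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```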

Context among CMMP variants

| Variant | Top-1 | Top-5 | Top-10 |
|---|---|---|---|
| This model (ResNet-18, crop 512, batch 512) | 0.859 | 1.000 | 1.000 |
| ResNet-18, crop 256 | 0.844 | 0.969 | 0.984 |
| ViT-pretrained, crop 256 | 0.828 | 0.969 | 1.000 |

Linear-Probe Metadata Recovery

A Ridge regression ($\alpha = 1.0$) trained on the frozen visual embedding recovers all seven acquisition parameters. Coefficient of determination ($R^2$), SMAPE (in physical units), and Pearson $r$:

| Dimension | $R^2$ | SMAPE | Pearson $r$ |
|---|---|---|---|
| pixel_size | 0.748 | 39.5% | 0.867 |
| dwell_time | 0.819 | 25.9% | 0.905 |
| convergence_angle | 0.629 | 11.6% | 0.793 |
| beam_current | 0.695 | 33.4% | 0.835 |
| gain | 0.862 | 5.0% | 0.929 |
| offset | 0.824 | 9.1% | 0.912 |
| inner_coll_angle | 0.626 | 8.5% | 0.792 |
| Mean | 0.743 | 19.0% | 0.862 |
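
A probe of this kind is easy to reproduce. The sketch below uses a closed-form ridge solution instead of scikit-learn so it stays self-contained; `X_*` are frozen image embeddings and `y_*` one metadata dimension, both hypothetical stand-ins for the released artifacts.

```python
import numpy as np

def ridge_probe(X_train, y_train, X_val, y_val, alpha=1.0):
    """Fit Ridge(alpha) on frozen embeddings for one metadata dimension;
    return (R^2, SMAPE in %, Pearson r) on the validation split."""
    n, d = X_train.shape
    # Center features/target so the (unpenalized) intercept is handled separately
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc, yc = X_train - x_mean, y_train - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ yc)
    pred = (X_val - x_mean) @ w + y_mean
    r2 = 1.0 - np.sum((y_val - pred) ** 2) / np.sum((y_val - y_val.mean()) ** 2)
    smape = 100.0 * np.mean(np.abs(pred - y_val) / ((np.abs(pred) + np.abs(y_val)) / 2))
    r = np.corrcoef(pred, y_val)[0, 1]
    return r2, smape, r
```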

The higher SMAPE on pixel_size, dwell_time, and beam_current is expected: those dimensions are stored log10-transformed because they span several orders of magnitude in physical units, so even small residuals in log-space amplify into large relative errors when exponentiated back.
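
As a quick sanity check on that effect (illustrative numbers, not from the paper), a residual of only 0.1 in log10-space already corresponds to a roughly 26% relative error in physical units:

```python
# An absolute residual of 0.1 in log10-space becomes a ~26%
# multiplicative error once mapped back to physical units.
true_log, pred_log = 1.0, 1.1                        # probe overshoots log10(x) by 0.1
true_val, pred_val = 10 ** true_log, 10 ** pred_log  # 10.0 vs ~12.59
rel_err = abs(pred_val - true_val) / true_val
print(f"{rel_err:.1%}")                              # -> 25.9%
```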

Training Configuration

| Parameter | Value |
|---|---|
| Dataset | CMMP HAADF-STEM (7,330 images, 6,597/733 train/val) |
| Image encoder | ResNet-18 (trained from scratch, 1-channel input) |
| Metadata encoder | 3-layer MLP, hidden dim 256 |
| Crop size | 512×512 (on-the-fly from full-resolution images) |
| Loss function | CLIP (symmetric cross-entropy) with learnable logit bias |
| Optimizer | AdamW (lr=1e-4, weight_decay=0.01) |
| Scheduler | Cosine annealing (LR floor ~1e-10 by epoch 1000) |
| Batch size | 64 per GPU × 8 GPUs = 512 effective |
| Epochs | 1000 (best checkpoint at epoch 956) |
| Hardware | 8× H100 GPUs |
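
The optimizer and schedule rows can be reproduced with stock PyTorch; this is a sketch under the stated hyperparameters (the `Linear` module is a stand-in for the actual model, and the exact per-epoch stepping is an assumption):

```python
import torch

model = torch.nn.Linear(128, 128)  # stand-in for the CMMP encoders
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=1000, eta_min=1e-10  # cosine decay to the ~1e-10 floor over 1000 epochs
)

for epoch in range(3):  # real loop: iterate batches each epoch, then step the scheduler
    optimizer.step()
    scheduler.step()
```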

Usage

```python
import torch
from models import CMMP

# Load model
model = CMMP(
    meta_input_dim=7,
    embed_dim=128,
    image_encoder="resnet18",
    image_size=512,
    meta_hidden_dim=256,
    meta_num_layers=3,
)
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
model.eval()

# Embed an image and its metadata
image = torch.rand(1, 1, 512, 512)    # single-channel grayscale in [0, 1]
metadata = torch.randn(1, 7)          # z-scored metadata vector

with torch.no_grad():
    img_emb, meta_emb, temp, bias = model(image, metadata)
    # img_emb: (1, 128), L2-normalized image embedding
    # meta_emb: (1, 128), L2-normalized metadata embedding
```
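
Given the two embeddings, a scalar match score follows from the learned temperature and bias. The snippet below uses random stand-ins for the model outputs (running the real model requires the repository's `models` module), and assumes logits of the form `temp * cos_sim + bias`:

```python
import torch
import torch.nn.functional as F

# Stand-ins for the model outputs above; real values come from model(...)
img_emb = F.normalize(torch.randn(1, 128), dim=-1)
meta_emb = F.normalize(torch.randn(1, 128), dim=-1)
temp, bias = torch.tensor(100.0), torch.tensor(0.0)  # learned scalars in practice

# Scaled cosine similarity: the logit the contrastive loss was trained on.
# Higher means the metadata is a better match for the image.
logit = temp * (img_emb * meta_emb).sum(dim=-1) + bias
```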

Files

  • model.pth β€” Best checkpoint (epoch 956, highest Top-1 on val)
  • last.pth β€” Final checkpoint (epoch 1000)
  • config.json β€” Full training configuration (args.json from the run)
  • training_log.csv β€” Per-epoch training metrics
  • split_indices.npy β€” Train/val split indices (seed 67) for reproducibility
  • linear_probe_metadata.json β€” Ridge-probe metadata recovery metrics

Related Models

Citation

```bibtex
@misc{cimp2026,
  title={Contrastive Image-Metadata Pre-training for Materials Transmission Electron Microscopy},
  author={Channing, Georgia and Keller, Debora and Rossell, Marta D. and Torr, Philip and Erni, Rolf and Helveg, Stig and Eliasson, Henrik},
  year={2026},
}
```