Detection Heads
Detection head architectures on frozen EUPE-ViT-B features. The backbone is frozen throughout; only the head trains. The premise behind the repository is that a strong modern ViT encoder already carries enough semantic and spatial structure to support detection with a lightweight decoder on top, and that the detection head's architecture — not the backbone — is where the interesting design trade-offs live. Every result here is obtained from features extracted once and cached; training iterates over those cached tensors, which makes it possible to run architecture studies in minutes rather than the days required for end-to-end backbone training.
Contents
The headline artifact is a 2,975,067-parameter split-tower detection head that reaches 42.64 mAP on COCO val2017 (42.71 under soft NMS), beating the standard FCOS baseline (16.14M parameters, 41.0 mAP) by +1.64 at 18.4 percent of its head parameter budget. Small-object mAP is 22.3 against FCOS's 19.4 (+2.9). This is the head shipped in phanerozoic/argus.

It replaces FCOS's learned feature pyramid with a zero-parameter cofiber decomposition of the backbone patch features: four frequency-separated scale bands at strides 16, 32, 64, and 128 (produced by iteratively subtracting the downsampled-then-upsampled component of the feature map), plus a stride-8 level synthesized by a single transposed convolution. Each level is processed by separate classification and regression towers (five 3×3 ConvGN blocks followed by four depthwise residual blocks at 160 hidden channels, with weights shared across levels and top-down lateral connections). Classification is cosine similarity of a Linear(160, 768) projection against CLIP ViT-L/14 multi-prompt text embeddings of the 80 COCO class names. Regression uses exponentiated LTRB distances with a learned per-level scale.

Training is 16 epochs at 768-pixel input with ATSS target assignment, flip augmentation, EMA, and PC-initialized cls_project, followed by a 3-epoch partial fine-tune that updates only the classification calibration layers (cls_project, cls_bias, logit_scale) while the towers and cofiber path stay frozen.
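The LTRB regression parameterization mentioned above (exponentiated left/top/right/bottom distances with a learned per-level scale) decodes as follows. This is an illustrative numpy reconstruction, not the repository's code; the function and argument names are assumptions.

```python
import numpy as np

def decode_ltrb(cx, cy, pred, level_scale=1.0):
    """Decode raw left/top/right/bottom predictions at location (cx, cy)
    into an x1, y1, x2, y2 box. `level_scale` stands in for the learned
    per-level scalar; exponentiation keeps all four distances positive."""
    l, t, r, b = np.exp(level_scale * np.asarray(pred, dtype=float))
    return np.array([cx - l, cy - t, cx + r, cy + b])

# zero raw predictions decode to a 2x2 box centred on the location
box = decode_ltrb(100.0, 100.0, [0.0, 0.0, 0.0, 0.0])
```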
The repository also contains a standard FCOS detector (16.14M parameters, 41.0 mAP) as the canonical reference baseline. It trains identically on the same cached features and uses the same eval pipeline, so the 42.64-vs-41.0 comparison is apples-to-apples — both heads read the same frozen backbone output, both are evaluated with pycocotools hard NMS at IoU 0.5. Anyone who wants to reproduce the comparison, or to see how far a standard FCOS head gets on modern frozen ViT features, can take the reference off the shelf.
Alongside the shipped picker and the FCOS reference sits a library of seventeen experimental architectural scaffolds under heads/. Each is a single head.py file implementing a forward pass with no trained checkpoint — hypothesis-stage designs that formulate detection in unconventional ways: wavelet decomposition in place of FPN, optimal-transport label assignment, compositional patch assembly, tropical-semiring operations, corner-pair relational reasoning, and others. A driver script (arena.py) instantiates and trains any of them against the cached backbone features, so trying a new formulation is a one-command operation. The scaffolds are recorded so alternative formulations are not lost, and so readers looking for detection ideas have somewhere to browse.
The full parameter-versus-mAP scaling curve for the cofiber line of work (from 105-parameter minimal circuits up through the current 2.98M picker at 42.64 mAP) is hosted in the separate phanerozoic/cofiber-detection repository, along with the machine-checked Rocq/HoTT proof of the decomposition's exactness, analytical closed-form heads, and synthesizable Verilog circuit implementations. This repository keeps the top-performing checkpoint and the experimental scaffold library; the research lineage lives in the sibling repo.
Reference Results (COCO val2017)
| Head | Parameters | mAP | mAP@0.50 | mAP@0.75 |
|---|---|---|---|---|
| Split-tower, 160h, 768px + cls_calib fine-tune (16 ep + 3 ep partial) | 2,975,067 | 42.64 | 65.70 | 45.10 |
| Split-tower, 160h, 768px (16 ep, step 104k, pre-calib) | 2,975,067 | 42.49 | 65.57 | 44.89 |
| Split-tower, 160h, 640px (16 ep, step 100k, pre-resolution-scale) | 2,975,067 | 41.15 | 63.99 | 43.83 |
| Split-tower + CLIP ViT-B/32 text-aligned, 192h (8 ep, 640px) | 4,164,699 | 25.95 | 39.13 | 28.92 |
| Split-tower with multi-scale decomposition, 192h (8 ep, no text align) | 4,068,954 | 24.6 | 37.1 | 27.0 |
| Baseline FCOS (simple feature pyramid) | 16,138,074 | 41.0 | 64.8 | 43.2 |
Eval protocol: per-class hard NMS (IoU 0.5), score threshold 0.05, top-100 per image, pycocotools COCO val2017. Canonical script: eval_coco_map.py. A --soft-nms flag runs an additional linear-decay soft-NMS pass for comparison.
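For reference, a linear-decay soft-NMS pass of the kind the --soft-nms flag enables looks roughly like this. This is a self-contained numpy sketch under the standard formulation, not the repository's implementation:

```python
import numpy as np

def iou_one_vs_many(box, boxes):
    # box: (4,), boxes: (N, 4), both in x1, y1, x2, y2 format
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-9)

def soft_nms_linear(boxes, scores, iou_thresh=0.5, score_thresh=0.05):
    """Linear-decay soft NMS: instead of suppressing overlapping boxes
    outright, multiply their scores by (1 - IoU) when IoU > iou_thresh."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep = []
    while True:
        i = int(scores.argmax())
        if scores[i] <= score_thresh:
            break
        keep.append((boxes[i].copy(), float(scores[i])))
        scores[i] = 0.0                       # never pick the same box twice
        overlap = iou_one_vs_many(boxes[i], boxes)
        scores *= np.where(overlap > iou_thresh, 1.0 - overlap, 1.0)
    return keep
```

Unlike hard NMS, heavily-overlapped boxes survive with decayed scores, which is why the soft-NMS numbers in the table above differ only slightly from the hard-NMS ones.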
The top row is the current picker and the head shipped in phanerozoic/argus. At 2.98M learnable parameters it reaches 42.64 mAP on COCO val2017, beating the 16.14M FCOS baseline (41.0 mAP) by +1.64 while using 18.4 percent of its head parameter budget. Small-object mAP is 22.3 (vs FCOS 19.4, +2.9). Soft NMS gives 42.71 mAP, 65.73 @0.50, 45.33 @0.75.
The FCOS baseline uses a simple feature pyramid that synthesizes five scale levels (P3 at stride 8 through P7 at stride 128, each with 256 channels and GroupNorm) from the backbone's stride-16 spatial features, followed by two shared four-layer convolutional towers (one for classification, one for box regression plus centerness) and three 1×1 prediction heads. It trains in eight epochs at 640-pixel input. The configuration is standard and the implementation is direct; the point of having it in this repository is to provide a clean, comparable reference.
The split-tower head differs from the FCOS baseline in three ways: the multi-scale structure comes from a zero-parameter cofiber decomposition of the backbone features rather than from a learned FPN; the classification and regression towers use separate weights throughout rather than sharing a tower; and the classifier is a linear projection into frozen CLIP text-embedding space followed by cosine similarity against the 80 COCO class-name embeddings, replacing a learned per-class weight matrix. The classifier projection is a single Linear(hidden, text_embed_dim) without bias; class prototypes are the L2-normalized CLIP text-encoder outputs for the 80 COCO class names; a learned scalar temperature and per-class bias calibrate the cosine scores.
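A sketch of that classifier in numpy, with random stand-ins for the frozen CLIP text embeddings (the shapes follow the description above; the variable names mirror cls_project, logit_scale, and cls_bias but the initialization values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, D = 80, 160, 768               # classes, tower width, CLIP text dim

# frozen class prototypes: L2-normalized CLIP text embeddings (random here)
protos = rng.standard_normal((C, D))
protos /= np.linalg.norm(protos, axis=1, keepdims=True)

W = rng.standard_normal((H, D)) * 0.02   # cls_project: Linear(160, 768), no bias
logit_scale = np.log(1 / 0.07)           # learned temperature (CLIP-style init)
cls_bias = np.zeros(C)                   # learned per-class bias

def classify(feat):
    # feat: (N, H) tower features -> (N, C) calibrated cosine logits
    z = feat @ W
    z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-9
    return np.exp(logit_scale) * (z @ protos.T) + cls_bias

logits = classify(rng.standard_normal((4, H)))
```

Because the prototypes are frozen, only the projection, temperature, and bias train, which is what makes the 3-epoch cls_calib fine-tune described above so cheap.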
The current checkpoint is the end point of a six-step lineage that lifted the same architecture from 36.88 to 42.64 mAP at roughly constant parameter count:
| Step | Change | mAP (hard NMS) |
|---|---|---|
| baseline | 160h 5+4 towers, FCOS assignment, no aug, 640px | 36.88 |
| + ATSS + horizontal flip aug | stride-scaled pseudo-IoU assignment, 50% flip | 40.06 |
| + EMA + PC init + CLIP ViT-L/14 8-prompt | EMA decay 0.9998, cls_project columns initialized from SVD of text_embed, switch from ViT-B/32 single-prompt to ViT-L/14 8-prompt-average text embeddings | 40.89 (live) / 40.96 (EMA) |
| + 16-epoch schedule, late-training sweep | double the schedule, select peak from 11-point sweep over steps 80k-117k | 41.15 (live) / 41.27 (soft NMS) |
| + 768-pixel training | same recipe at 48×48 backbone grid instead of 40×40 | 42.49 (step 104k peak) |
| + 3-epoch cls_calib fine-tune | partial fine-tune updating only cls_project, cls_bias, logit_scale at lr 1e-4 (towers and cofiber path frozen) | 42.64 / 42.71 (soft NMS) |
The 4.16M / 25.95 mAP row is a prior-generation picker (192-hidden, 8-epoch, CLIP ViT-B/32 single-prompt text embeddings), retained for continuity. The jump from 25.95 to 42.64 comes entirely from the training recipe and resolution: ATSS target assignment, horizontal flip augmentation, EMA, PC-initialized cls_project, ViT-L/14 multi-prompt text embeddings, a doubled schedule, the switch from 640-pixel to 768-pixel training input, and a 3-epoch classification-calibration fine-tune. Hidden width actually decreased from 192 to 160, so the head shrank while gaining 16.7 mAP. The recipe changes live in train_split_tower_5scale.py; eval JSONs for every shipping checkpoint sit next to the weights in the same directory.
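The EMA used from the third lineage step onward is a standard exponential moving average of the head weights, evaluated in place of the live weights. A minimal sketch, not the training script:

```python
import numpy as np

def ema_update(ema_params, live_params, decay=0.9998):
    """One EMA step per optimizer step: ema <- decay * ema + (1 - decay) * live,
    applied to every tensor in the head. Evaluation reads the EMA copy."""
    for k in ema_params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * live_params[k]
    return ema_params

# with decay 0.9998 the average has an effective horizon of ~5000 steps
ema = {"w": np.zeros(3)}
live = {"w": np.ones(3)}
for _ in range(1000):
    ema_update(ema, live)
```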
Analytical single-class person detector
A probe of how far a detection head can go with zero gradient descent on the same frozen backbone. The best closed-form configuration reaches 2.33 AP / 10.3 AP@0.50 on COCO val2017 for the person class at ~105K learnable parameters, all derived from simple statistics of the cached training features (no SGD, no training loop):
- Fisher LDA classifier — 768-dim direction computed in closed form from pooled covariance of person-positive vs. background features across 85K positives and 51M negatives. Bayes-calibrated bias incorporates the ~1:600 class-imbalance prior of person-positive patches in COCO.
- 15-anchor aspect grid — 5 scale quantile bins × 3 fixed aspect ratios (0.4, 1.0, 2.5), guaranteeing landscape-orientation coverage even though COCO person training data skews portrait. A per-scale ridge-regressed 15-way classifier selects the anchor at each patch location.
- Per-anchor fine refinement — 15 independent 770-dim ridge regressors each predict residuals (dx, dy, d_log_w, d_log_h) relative to their assigned anchor shape. Splitting the regression into per-anchor subproblems avoids the averaging failure that broke earlier global-regression variants.
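The closed-form classifier in the first bullet is the textbook Fisher/LDA solution with a prior-adjusted bias. A toy numpy reconstruction under those assumptions (dimensions, data, and the ridge term here are illustrative, and the covariance pooling is a simple unweighted sum rather than the count-weighted version):

```python
import numpy as np

def fisher_lda(pos, neg, prior_pos):
    """Closed-form Fisher direction w = S_pooled^-1 (mu_pos - mu_neg), with a
    Bayes-calibrated bias folding in the class-imbalance prior. No SGD."""
    mu_p, mu_n = pos.mean(0), neg.mean(0)
    # pooled within-class covariance, ridge-stabilized for invertibility
    Sw = np.cov(pos, rowvar=False) + np.cov(neg, rowvar=False)
    Sw += 1e-3 * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, mu_p - mu_n)
    # bias making the score a log-odds under the shared-covariance model
    b = -0.5 * w @ (mu_p + mu_n) + np.log(prior_pos / (1.0 - prior_pos))
    return w, b

rng = np.random.default_rng(1)
pos = rng.standard_normal((500, 8)) + 2.0   # toy "person-positive" features
neg = rng.standard_normal((5000, 8))        # toy background features
w, b = fisher_lda(pos, neg, prior_pos=1 / 600)
```

The log-prior term is what the text calls the Bayes-calibrated bias: at a ~1:600 imbalance it shifts the decision boundary hard toward the background class.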
AP@0.75 is effectively zero (0.2): the 15 discrete anchor shapes plus linear residual correction produce boxes precise enough for loose localization but not for strict IoU thresholds. Generative-classifier variants (per-class Gaussian mixtures via Mahalanobis distance) and soft anchor weighting were tested and both underperformed the discriminative-classifier + argmax configuration — a clean negative result showing Fisher LDA beats GMM for this task, and hard anchor selection beats soft weighted mean.
Checkpoint at heads/cofiber_threshold/analytical_person_v3j/head_v3j_canonical.pth; build-and-evaluate script at _analytical_person_v3j.py. Total wall time to reproduce: ~3 minutes (statistics pass + closed-form solves + full val eval). The 80-class analytical scaling of this approach is in the separate cofiber-detection repository referenced below.
Cofiber decomposition (separate repository)
A large parallel line of work on this backbone replaces the FCOS feature pyramid with cofiber decomposition: a zero-parameter multi-scale operation that iteratively subtracts the downsampled-then-upsampled component of a feature map to produce frequency-separated scale bands. Given backbone features f, the decomposition produces f − upsample(avgpool(f)) as a high-frequency band, then recurses on the low-frequency remainder. The resulting bands are pairwise orthogonal in frequency content, which lets a lightweight head attend to them separately without needing a learned FPN to mediate between scales.
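Schematically, with 2× average pooling and nearest-neighbour upsampling standing in for the repository's pooling/interpolation pair (a self-contained numpy sketch of the recursion described above, not the repository's code):

```python
import numpy as np

def pool2(f):
    # 2x average pooling over the spatial dims of an (H, W, C) feature map
    H, W, C = f.shape
    return f.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def up2(f):
    # nearest-neighbour 2x upsampling back to the finer grid
    return f.repeat(2, axis=0).repeat(2, axis=1)

def cofiber_bands(f, levels=3):
    """Zero-parameter decomposition: at each level, split off the
    high-frequency cofiber f - up(pool(f)) and recurse on the low part.
    Returns `levels` high-frequency bands plus the final low remainder,
    which together reconstruct the input exactly."""
    bands = []
    for _ in range(levels):
        low = pool2(f)
        bands.append(f - up2(low))   # high-frequency band at this stride
        f = low
    bands.append(f)                  # final low-frequency remainder
    return bands

f = np.random.default_rng(2).standard_normal((16, 16, 4))
bands = cofiber_bands(f, levels=3)
```

With this operator pair, each high-frequency band averages to zero within every 2×2 block, which is the concrete form of the exactness property the proof formalizes.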
The decomposition has been machine-checked in Rocq/HoTT. The proof frames average pooling and bilinear upsampling as an adjoint pair whose counit gives a short exact sequence in a semi-additive category; the cofiber bands are the kernels of the resulting projections, and any input is uniquely expressible as a sum of its bands. The practical consequence is that the multi-scale construction which typically costs 11M parameters in an FPN is free — no parameters, no training, no FLOPs beyond pooling and interpolation.
This has accumulated enough work around it to warrant its own home. The sibling repository phanerozoic/cofiber-detection contains:
- Closed-form analytical heads derived from least-squares regression on the decomposed features (zero training, 70K parameters, 1.6 mAP)
- Trained neural heads at multiple scales, from 70K parameters (5.2 mAP) up to the 3.85M-parameter split-tower included here (20.3 mAP) and 4.27M-parameter variants with top-down lateral connections (19.9 mAP)
- INT8 threshold-logic heads using Heaviside step functions (92K parameters, 5.9 mAP) and their pruned variants down to 46K nonzero parameters
- Synthesizable Verilog circuit implementations, including a 93-parameter person image classifier that achieves 99.8% recall on COCO val images
- Evolutionary dimension search (GPU-batched, 200 generations per second) that identifies informative feature subsets; the evolved 10-dimension person-detection circuit outperforms a greedy 100-dimension circuit at 10× fewer gates
- Sheaf-cohomology regression features (directional feature differences at spatial boundaries, interpretable as Čech 1-cocycles)
- The Rocq/HoTT proof of decomposition exactness
- Full training and evaluation scripts
Readers interested in any of the above — parameter-efficient detection, formalized multi-scale operations, hardware circuits, evolutionary feature selection, topological regression features — should go there. This repository keeps the top-performing head as a comparison point but does not attempt to mirror the full research lineage.
Experimental Architectural Scaffolds
The seventeen scaffolds under heads/ each explore a distinct hypothesis about how detection could be formulated on frozen ViT features. None has been trained to convergence; the point is to record architectural options rather than to benchmark them:
| Scaffold | Core idea |
|---|---|
| adaptive_query | Learned query vectors that adaptively attend to backbone patches, producing detections as attention outputs rather than per-location predictions |
| cascade_pool | A cascade of pooling operations to synthesize multi-scale features without an FPN |
| centernet | CenterNet-style keypoint detection: predict object centers as a heatmap, then regress size and offset at center locations |
| compression | Head operating on compressed backbone features, testing how aggressively features can be reduced before detection quality collapses |
| curvature | Object boundaries as high curvature of the feature manifold; detection as curvature estimation |
| depth_fusion | Joint detection and depth estimation sharing features, exploiting the alignment between depth gradients and object boundaries |
| feature_graph | Detection as a graph problem over patch tokens, with edges indicating co-occurrence or spatial adjacency |
| mutual_attention | Cross-attention between patch features to model inter-object relationships before detection |
| optimal_transport | Label assignment via optimal transport between ground-truth boxes and prediction locations |
| patch_assembly | Compositional object construction from overlapping patch detections, inspired by self-assembly tile systems |
| prototype | Replacing learned classification weights with fixed class prototypes (text-embedded or image-mean) |
| relational_corners | Corner-pair detection: predict corner heatmaps, then pair them via learned embeddings |
| scale_classify | Explicit scale classification as an auxiliary head to produce scale-consistent boxes |
| sparse_query | Sparse query mechanism to limit prediction locations and reduce background noise |
| threshold_prototype | Prototype classification with a threshold gate |
| tropical | Tropical-semiring (min-plus) operations in the head |
| wavelet | Wavelet-based multi-scale decomposition as an alternative to FPN and to cofiber decomposition |
Each is a one-file architectural commitment. Most will not beat the FCOS baseline; some may suggest directions that do. The arena script makes training any of them a one-command operation against the cached features.
FCOS Variants
Three FCOS-family heads are present as controlled variants on the baseline, each differing from the reference in one isolated way:
- heads/baseline_fcos/ — the full simple-feature-pyramid FCOS head at 16.14M parameters, the 41.0-mAP reference.
- heads/slim_fcos/ — a reduced-parameter variant with narrower towers and fewer FPN channels. It exists to help locate how much of the baseline's capacity is load-bearing.
- heads/hook_fcos/ — an FCOS head that reads intermediate backbone features via forward hooks rather than only the final spatial output. This follows the pattern used by the DPT depth decoder in the multi-task Argus model, where intermediate ViT blocks (2, 5, 8, 11) carry information that the final block does not expose; this variant tests whether the same access pattern improves detection.
Arena
arena.py and multi_domain_arena.py are lightweight driver scripts that instantiate any head from the heads/ directory, train it for a configurable number of epochs on cached COCO features, and evaluate with pycocotools. The multi-domain version extends evaluation across the twenty RF100-VL cross-domain datasets in addition to COCO, to test whether a head's performance generalizes beyond the benchmark it was tuned on.
Shared Infrastructure
- losses/ contains the FCOS-style focal loss for classification, GIoU loss for boxes, BCE for centerness, and a CenterNet-style focal-loss variant for keypoint heads.
- utils/ contains the feature caching code (shard format, manifest, shard loader), the decoder that converts per-location predictions to COCO-format boxes, and the evaluation wrapper around pycocotools.
- eval_coco_map.py is the standalone evaluation entry point; it can be pointed at any saved checkpoint.
- train_split_tower_5scale.py and eval_split_tower_5scale.py reproduce the split-tower results above.
Backbone
All results use EUPE-ViT-B with letterbox padding. The backbone produces 768-channel spatial features at stride 16 — a 40×40 grid at 640-pixel input (used by the FCOS reference baseline and the earlier-generation cofiber pickers) or a 48×48 grid at 768-pixel input (used by the shipped 2.98M picker). The backbone has 86M parameters and is frozen; features are cached once per resolution before any head training begins. The framework is backbone-agnostic: any ViT-scale encoder producing a consistent spatial grid can be substituted by replacing the feature extraction step, and the head library will run without modification. EUPE-ViT-B is the default because its multi-teacher distillation produces features that work well across detection, segmentation, depth, and correspondence, which makes it a sensible proving ground for heads that might eventually share a backbone with other tasks.
References
- Tian, Shen, Chen, He. FCOS: Fully Convolutional One-Stage Object Detection. ICCV 2019.
- Li, Mao, Girshick, He. Exploring Plain Vision Transformer Backbones for Object Detection. ECCV 2022.
- Zhou, Wang, Krähenbühl. Objects as Points. arXiv:1904.07850, 2019.
License
Apache 2.0. Users are responsible for complying with the license terms of whatever backbone they substitute for EUPE-ViT-B.