Detection Heads
Detection head architectures on frozen EUPE-ViT-B features. The backbone is frozen throughout; only the head trains. The premise behind the repository is that a strong modern ViT encoder already carries enough semantic and spatial structure to support detection with a lightweight decoder on top, and that the detection head's architecture — not the backbone — is where the interesting design trade-offs live. Every result here is obtained from features extracted once and cached; training iterates over those cached tensors, which makes it possible to run architecture studies in minutes rather than the days required for end-to-end backbone training.
Contents
The headline artifact is a 2,975,067-parameter split-tower detection head that reaches 42.64 mAP on COCO val2017 (42.71 under soft NMS), beating the standard FCOS baseline (16.14M parameters, 41.0 mAP) by +1.64 at 18.4 percent of its head parameter budget. Small-object mAP is 22.3 against FCOS's 19.4 (+2.9). This is the head shipped in phanerozoic/argus.

It replaces FCOS's learned feature pyramid with a zero-parameter cofiber decomposition of the backbone patch features: four frequency-separated scale bands at strides 16, 32, 64, and 128 (produced by iteratively subtracting the downsampled-then-upsampled component of the feature map), plus a stride-8 level synthesized by a single transposed convolution. Each level is processed by separate classification and regression towers (five 3×3 ConvGN blocks followed by four depthwise residual blocks at 160 hidden channels, with weights shared across levels and top-down lateral connections). Classification is cosine similarity of a Linear(160, 768) projection against CLIP ViT-L/14 multi-prompt text embeddings of the 80 COCO class names. Regression uses exponentiated LTRB distances with a learned per-level scale.

Training is 16 epochs at 768-pixel input with ATSS target assignment, flip augmentation, EMA, and PC-initialized cls_project, followed by a 3-epoch partial fine-tune that updates only the classification calibration layers (cls_project, cls_bias, logit_scale) while the towers and cofiber path stay frozen.
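The LTRB regression parameterization mentioned above (exponentiated left/top/right/bottom distances with a learned per-level scale) decodes as follows. This is an illustrative numpy reconstruction, not the repository's code; the function and argument names are assumptions.

```python
import numpy as np

def decode_ltrb(cx, cy, pred, level_scale=1.0):
    """Decode raw left/top/right/bottom predictions at location (cx, cy)
    into an x1, y1, x2, y2 box. `level_scale` stands in for the learned
    per-level scalar; exponentiation keeps all four distances positive."""
    l, t, r, b = np.exp(level_scale * np.asarray(pred, dtype=float))
    return np.array([cx - l, cy - t, cx + r, cy + b])

# zero raw predictions decode to a 2x2 box centred on the location
box = decode_ltrb(100.0, 100.0, [0.0, 0.0, 0.0, 0.0])
```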
The repository also contains a standard FCOS detector (16.14M parameters, 41.0 mAP) as the canonical reference baseline. It trains identically on the same cached features and uses the same eval pipeline, so the 42.64-vs-41.0 comparison is apples-to-apples — both heads read the same frozen backbone output, both are evaluated with pycocotools hard NMS at IoU 0.5. Anyone who wants to reproduce the comparison, or to see how far a standard FCOS head gets on modern frozen ViT features, can take the reference off the shelf.
Alongside the shipped picker and the FCOS reference sits a library of seventeen experimental architectural scaffolds under heads/. Each is a single head.py file implementing a forward pass with no trained checkpoint — hypothesis-stage designs that formulate detection in unconventional ways: wavelet decomposition in place of FPN, optimal-transport label assignment, compositional patch assembly, tropical-semiring operations, corner-pair relational reasoning, and others. A driver script (arena.py) instantiates and trains any of them against the cached backbone features, so trying a new formulation is a one-command operation. The scaffolds are recorded so alternative formulations are not lost, and so readers looking for detection ideas have somewhere to browse.
The full parameter-versus-mAP scaling curve for the cofiber line of work (from 105-parameter minimal circuits up through the current 2.98M picker at 42.64 mAP) is hosted in the separate phanerozoic/cofiber-detection repository, along with the machine-checked Rocq/HoTT proof of the decomposition's exactness, analytical closed-form heads, and synthesizable Verilog circuit implementations. This repository keeps the top-performing checkpoint and the experimental scaffold library; the research lineage lives in the sibling repo.
Reference Results (COCO val2017)
| Head | Parameters | mAP | mAP@0.50 | mAP@0.75 |
|---|---|---|---|---|
| Split-tower, 160h, 768px + cls_calib fine-tune (16 ep + 3 ep partial) | 2,975,067 | 42.64 | 65.70 | 45.10 |
| Split-tower, 160h, 768px (16 ep, step 104k, pre-calib) | 2,975,067 | 42.49 | 65.57 | 44.89 |
| Split-tower, 160h, 640px (16 ep, step 100k, pre-resolution-scale) | 2,975,067 | 41.15 | 63.99 | 43.83 |
| Split-tower + CLIP ViT-B/32 text-aligned, 192h (8 ep, 640px) | 4,164,699 | 25.95 | 39.13 | 28.92 |
| Split-tower with multi-scale decomposition, 192h (8 ep, no text align) | 4,068,954 | 24.6 | 37.1 | 27.0 |
| Baseline FCOS (simple feature pyramid) | 16,138,074 | 41.0 | 64.8 | 43.2 |
Eval protocol: per-class hard NMS (IoU 0.5), score threshold 0.05, top-100 per image, pycocotools COCO val2017. Canonical script: eval_coco_map.py. A --soft-nms flag runs an additional linear-decay soft-NMS pass for comparison.
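For reference, a linear-decay soft-NMS pass of the kind the --soft-nms flag enables looks roughly like this. This is a self-contained numpy sketch under the standard formulation, not the repository's implementation:

```python
import numpy as np

def iou_one_vs_many(box, boxes):
    # box: (4,), boxes: (N, 4), both in x1, y1, x2, y2 format
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter + 1e-9)

def soft_nms_linear(boxes, scores, iou_thresh=0.5, score_thresh=0.05):
    """Linear-decay soft NMS: instead of suppressing overlapping boxes
    outright, multiply their scores by (1 - IoU) when IoU > iou_thresh."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep = []
    while True:
        i = int(scores.argmax())
        if scores[i] <= score_thresh:
            break
        keep.append((boxes[i].copy(), float(scores[i])))
        scores[i] = 0.0                       # never pick the same box twice
        overlap = iou_one_vs_many(boxes[i], boxes)
        scores *= np.where(overlap > iou_thresh, 1.0 - overlap, 1.0)
    return keep
```

Unlike hard NMS, heavily-overlapped boxes survive with decayed scores, which is why the soft-NMS numbers in the table above differ only slightly from the hard-NMS ones.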
The top row is the current picker and the head shipped in phanerozoic/argus. At 2.98M learnable parameters it reaches 42.64 mAP on COCO val2017, beating the 16.14M FCOS baseline (41.0 mAP) by +1.64 while using 18.4 percent of its head parameter budget. Small-object mAP is 22.3 (vs FCOS 19.4, +2.9). Soft NMS gives 42.71 mAP, 65.73 @0.50, 45.33 @0.75.
The FCOS baseline uses a simple feature pyramid that synthesizes five scale levels (P3 at stride 8 through P7 at stride 128, each with 256 channels and GroupNorm) from the backbone's stride-16 spatial features, followed by two shared four-layer convolutional towers (one for classification, one for box regression plus centerness) and three 1×1 prediction heads. It trains in eight epochs at 640-pixel input. The configuration is standard and the implementation is direct; the point of having it in this repository is to provide a clean, comparable reference.
The split-tower head differs from the FCOS baseline in three ways: the multi-scale structure comes from a zero-parameter cofiber decomposition of the backbone features rather than from a learned FPN; the classification and regression towers use separate weights throughout rather than sharing a tower; and the classifier is a linear projection into frozen CLIP text-embedding space followed by cosine similarity against the 80 COCO class-name embeddings, replacing a learned per-class weight matrix. The classifier projection is a single Linear(hidden, text_embed_dim) without bias; class prototypes are the L2-normalized CLIP text-encoder outputs for the 80 COCO class names; a learned scalar temperature and per-class bias calibrate the cosine scores.
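A sketch of that classifier in numpy, with random stand-ins for the frozen CLIP text embeddings (the shapes follow the description above; the variable names mirror cls_project, logit_scale, and cls_bias but the initialization values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, D = 80, 160, 768               # classes, tower width, CLIP text dim

# frozen class prototypes: L2-normalized CLIP text embeddings (random here)
protos = rng.standard_normal((C, D))
protos /= np.linalg.norm(protos, axis=1, keepdims=True)

W = rng.standard_normal((H, D)) * 0.02   # cls_project: Linear(160, 768), no bias
logit_scale = np.log(1 / 0.07)           # learned temperature (CLIP-style init)
cls_bias = np.zeros(C)                   # learned per-class bias

def classify(feat):
    # feat: (N, H) tower features -> (N, C) calibrated cosine logits
    z = feat @ W
    z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-9
    return np.exp(logit_scale) * (z @ protos.T) + cls_bias

logits = classify(rng.standard_normal((4, H)))
```

Because the prototypes are frozen, only the projection, temperature, and bias train, which is what makes the 3-epoch cls_calib fine-tune described above so cheap.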
The current checkpoint is the end point of a six-step lineage that lifted the same architecture from 36.88 to 42.64 mAP at roughly constant parameter count:
| Step | Change | mAP (hard NMS) |
|---|---|---|
| baseline | 160h 5+4 towers, FCOS assignment, no aug, 640px | 36.88 |
| + ATSS + horizontal flip aug | stride-scaled pseudo-IoU assignment, 50% flip | 40.06 |
| + EMA + PC init + CLIP ViT-L/14 8-prompt | EMA decay 0.9998, cls_project columns initialized from SVD of text_embed, switch from ViT-B/32 single-prompt to ViT-L/14 8-prompt-average text embeddings | 40.89 (live) / 40.96 (EMA) |
| + 16-epoch schedule, late-training sweep | double the schedule, select peak from 11-point sweep over steps 80k-117k | 41.15 (live) / 41.27 (soft NMS) |
| + 768-pixel training | same recipe at 48×48 backbone grid instead of 40×40 | 42.49 (step 104k peak) |
| + 3-epoch cls_calib fine-tune | partial fine-tune updating only cls_project, cls_bias, logit_scale at lr 1e-4 (towers and cofiber path frozen) | 42.64 / 42.71 (soft NMS) |
The 4.16M / 25.95 mAP row is a prior-generation picker (192-hidden, 8-epoch, CLIP ViT-B/32 single-prompt text embeddings), retained for continuity. The jump from 25.95 to 42.64 comes entirely from the training recipe and resolution: ATSS target assignment, horizontal flip augmentation, EMA, PC-initialized cls_project, ViT-L/14 multi-prompt text embeddings, a doubled schedule, the switch from 640-pixel to 768-pixel training input, and a 3-epoch classification-calibration fine-tune. Hidden width actually decreased from 192 to 160, so the head shrank while gaining 16.7 mAP. The recipe changes live in train_split_tower_5scale.py; eval JSONs for every shipping checkpoint sit next to the weights in the same directory.
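The EMA used from the third lineage step onward is a standard exponential moving average of the head weights, evaluated in place of the live weights. A minimal sketch, not the training script:

```python
import numpy as np

def ema_update(ema_params, live_params, decay=0.9998):
    """One EMA step per optimizer step: ema <- decay * ema + (1 - decay) * live,
    applied to every tensor in the head. Evaluation reads the EMA copy."""
    for k in ema_params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * live_params[k]
    return ema_params

# with decay 0.9998 the average has an effective horizon of ~5000 steps
ema = {"w": np.zeros(3)}
live = {"w": np.ones(3)}
for _ in range(1000):
    ema_update(ema, live)
```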
Analytical single-class person detector
A probe of how far a detection head can go with zero gradient descent on the same frozen backbone. The best closed-form configuration reaches 2.33 AP / 10.3 AP@0.50 on COCO val2017 for the person class at ~105K learnable parameters, all derived from simple statistics of the cached training features (no SGD, no training loop):
- Fisher LDA classifier — 768-dim direction computed in closed form from pooled covariance of person-positive vs. background features across 85K positives and 51M negatives. Bayes-calibrated bias incorporates the ~1:600 class-imbalance prior of person-positive patches in COCO.
- 15-anchor aspect grid — 5 scale quantile bins × 3 fixed aspect ratios (0.4, 1.0, 2.5), guaranteeing landscape-orientation coverage even though COCO person training data skews portrait. A per-scale ridge-regressed 15-way classifier selects the anchor at each patch location.
- Per-anchor fine refinement — 15 independent 770-dim ridge regressors each predict residuals (dx, dy, d_log_w, d_log_h) relative to their assigned anchor shape. Splitting the regression into per-anchor subproblems avoids the averaging failure that broke earlier global-regression variants.
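The closed-form classifier in the first bullet is the textbook Fisher/LDA solution with a prior-adjusted bias. A toy numpy reconstruction under those assumptions (dimensions, data, and the ridge term here are illustrative, and the covariance pooling is a simple unweighted sum rather than the count-weighted version):

```python
import numpy as np

def fisher_lda(pos, neg, prior_pos):
    """Closed-form Fisher direction w = S_pooled^-1 (mu_pos - mu_neg), with a
    Bayes-calibrated bias folding in the class-imbalance prior. No SGD."""
    mu_p, mu_n = pos.mean(0), neg.mean(0)
    # pooled within-class covariance, ridge-stabilized for invertibility
    Sw = np.cov(pos, rowvar=False) + np.cov(neg, rowvar=False)
    Sw += 1e-3 * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, mu_p - mu_n)
    # bias making the score a log-odds under the shared-covariance model
    b = -0.5 * w @ (mu_p + mu_n) + np.log(prior_pos / (1.0 - prior_pos))
    return w, b

rng = np.random.default_rng(1)
pos = rng.standard_normal((500, 8)) + 2.0   # toy "person-positive" features
neg = rng.standard_normal((5000, 8))        # toy background features
w, b = fisher_lda(pos, neg, prior_pos=1 / 600)
```

The log-prior term is what the text calls the Bayes-calibrated bias: at a ~1:600 imbalance it shifts the decision boundary hard toward the background class.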
AP@0.75 is effectively zero (0.2): the 15 discrete anchor shapes plus linear residual correction produce boxes precise enough for loose localization but not for strict IoU thresholds. Generative-classifier variants (per-class Gaussian mixtures via Mahalanobis distance) and soft anchor weighting were tested and both underperformed the discriminative-classifier + argmax configuration — a clean negative result showing Fisher LDA beats GMM for this task, and hard anchor selection beats soft weighted mean.
Checkpoint at heads/cofiber_threshold/analytical_person_v3j/head_v3j_canonical.pth; build-and-evaluate script at _analytical_person_v3j.py. Total wall time to reproduce: ~3 minutes (statistics pass + closed-form solves + full val eval). The 80-class analytical scaling of this approach is in the separate cofiber-detection repository referenced below.
Cofiber decomposition (separate repository)
A large parallel line of work on this backbone replaces the FCOS feature pyramid with cofiber decomposition: a zero-parameter multi-scale operation that iteratively subtracts the downsampled-then-upsampled component of a feature map to produce frequency-separated scale bands. Given backbone features f, the decomposition produces f − upsample(avgpool(f)) as a high-frequency band, then recurses on the low-frequency remainder. The resulting bands are pairwise orthogonal in frequency content, which lets a lightweight head attend to them separately without needing a learned FPN to mediate between scales.
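Schematically, with 2× average pooling and nearest-neighbour upsampling standing in for the repository's pooling/interpolation pair (a self-contained numpy sketch of the recursion described above, not the repository's code):

```python
import numpy as np

def pool2(f):
    # 2x average pooling over the spatial dims of an (H, W, C) feature map
    H, W, C = f.shape
    return f.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def up2(f):
    # nearest-neighbour 2x upsampling back to the finer grid
    return f.repeat(2, axis=0).repeat(2, axis=1)

def cofiber_bands(f, levels=3):
    """Zero-parameter decomposition: at each level, split off the
    high-frequency cofiber f - up(pool(f)) and recurse on the low part.
    Returns `levels` high-frequency bands plus the final low remainder,
    which together reconstruct the input exactly."""
    bands = []
    for _ in range(levels):
        low = pool2(f)
        bands.append(f - up2(low))   # high-frequency band at this stride
        f = low
    bands.append(f)                  # final low-frequency remainder
    return bands

f = np.random.default_rng(2).standard_normal((16, 16, 4))
bands = cofiber_bands(f, levels=3)
```

With this operator pair, each high-frequency band averages to zero within every 2×2 block, which is the concrete form of the exactness property the proof formalizes.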
The decomposition has been machine-checked in Rocq/HoTT. The proof frames average pooling and bilinear upsampling as an adjoint pair whose counit gives a short exact sequence in a semi-additive category; the cofiber bands are the kernels of the resulting projections, and any input is uniquely expressible as a sum of its bands. The practical consequence is that the multi-scale construction which typically costs 11M parameters in an FPN is free — no parameters, no training, no FLOPs beyond pooling and interpolation.
This has accumulated enough work around it to warrant its own home. The sibling repository phanerozoic/cofiber-detection contains:
- Closed-form analytical heads derived from least-squares regression on the decomposed features (zero training, 70K parameters, 1.6 mAP)
- Trained neural heads at multiple scales, from 70K parameters (5.2 mAP) up to the 3.85M-parameter split-tower included here (20.3 mAP) and 4.27M-parameter variants with top-down lateral connections (19.9 mAP)
- INT8 threshold-logic heads using Heaviside step functions (92K parameters, 5.9 mAP) and their pruned variants down to 46K nonzero parameters
- Synthesizable Verilog circuit implementations, including a 93-parameter person image classifier that achieves 99.8% recall on COCO val images
- Evolutionary dimension search (GPU-batched, 200 generations per second) that identifies informative feature subsets; the evolved 10-dimension person-detection circuit outperforms a greedy 100-dimension circuit at 10× fewer gates
- Sheaf-cohomology regression features (directional feature differences at spatial boundaries, interpretable as Čech 1-cocycles)
- The Rocq/HoTT proof of decomposition exactness
- Full training and evaluation scripts
Readers interested in any of the above — parameter-efficient detection, formalized multi-scale operations, hardware circuits, evolutionary feature selection, topological regression features — should go there. This repository keeps the top-performing head as a comparison point but does not attempt to mirror the full research lineage.
Experimental Architectural Scaffolds
The seventeen scaffolds under heads/ each explore a distinct hypothesis about how detection could be formulated on frozen ViT features. None has been trained to convergence; the point is to record architectural options rather than to benchmark them:
| Scaffold | Core idea |
|---|---|
| adaptive_query | Learned query vectors that adaptively attend to backbone patches, producing detections as attention outputs rather than per-location predictions |
| cascade_pool | A cascade of pooling operations to synthesize multi-scale features without an FPN |
| centernet | CenterNet-style keypoint detection: predict object centers as a heatmap, then regress size and offset at center locations |
| compression | Head operating on compressed backbone features, testing how aggressively features can be reduced before detection quality collapses |
| curvature | Object boundaries as high curvature of the feature manifold; detection as curvature estimation |
| depth_fusion | Joint detection and depth estimation sharing features, exploiting the alignment between depth gradients and object boundaries |
| feature_graph | Detection as a graph problem over patch tokens, with edges indicating co-occurrence or spatial adjacency |
| mutual_attention | Cross-attention between patch features to model inter-object relationships before detection |
| optimal_transport | Label assignment via optimal transport between ground-truth boxes and prediction locations |
| patch_assembly | Compositional object construction from overlapping patch detections, inspired by self-assembly tile systems |
| prototype | Replacing learned classification weights with fixed class prototypes (text-embedded or image-mean) |
| relational_corners | Corner-pair detection: predict corner heatmaps, then pair them via learned embeddings |
| scale_classify | Explicit scale classification as an auxiliary head to produce scale-consistent boxes |
| sparse_query | Sparse query mechanism to limit prediction locations and reduce background noise |
| threshold_prototype | Prototype classification with a threshold gate |
| tropical | Tropical-semiring (min-plus) operations in the head |
| wavelet | Wavelet-based multi-scale decomposition as an alternative to FPN and to cofiber decomposition |
Each is a one-file architectural commitment. Most will not beat the FCOS baseline; some may suggest directions that do. The arena script makes training any of them a one-command operation against the cached features.
FCOS Variants
Three FCOS-family heads are present as controlled variants on the baseline, each differing from the reference in one isolated way:
- heads/baseline_fcos/ — the full simple-feature-pyramid FCOS head at 16.14M parameters, the 41.0-mAP reference.
- heads/slim_fcos/ — a reduced-parameter variant with narrower towers and fewer FPN channels. It exists to help locate how much of the baseline's capacity is load-bearing.
- heads/hook_fcos/ — an FCOS head that reads intermediate backbone features via forward hooks rather than only the final spatial output. This follows the pattern used by the DPT depth decoder in the multi-task Argus model, where intermediate ViT blocks (2, 5, 8, 11) carry information that the final block does not expose; this variant tests whether the same access pattern improves detection.
Arena
arena.py and multi_domain_arena.py are lightweight driver scripts that instantiate any head from the heads/ directory, train it for a configurable number of epochs on cached COCO features, and evaluate with pycocotools. The multi-domain version extends evaluation across the twenty RF100-VL cross-domain datasets in addition to COCO, to test whether a head's performance generalizes beyond the benchmark it was tuned on.
Shared Infrastructure
- losses/ contains the FCOS-style focal loss for classification, GIoU loss for boxes, BCE for centerness, and a CenterNet-style focal-loss variant for keypoint heads.
- utils/ contains the feature caching code (shard format, manifest, shard loader), the decoder that converts per-location predictions to COCO-format boxes, and the evaluation wrapper around pycocotools.
- eval_coco_map.py is the standalone evaluation entry point; it can be pointed at any saved checkpoint.
- train_split_tower_5scale.py and eval_split_tower_5scale.py reproduce the split-tower results above.
Backbone
All results use EUPE-ViT-B with letterbox padding. The backbone produces 768-channel spatial features at stride 16 — a 40×40 grid at 640-pixel input (used by the FCOS reference baseline and the earlier-generation cofiber pickers) or a 48×48 grid at 768-pixel input (used by the shipped 2.98M picker). The backbone has 86M parameters and is frozen; features are cached once per resolution before any head training begins. The framework is backbone-agnostic: any ViT-scale encoder producing a consistent spatial grid can be substituted by replacing the feature extraction step, and the head library will run without modification. EUPE-ViT-B is the default because its multi-teacher distillation produces features that work well across detection, segmentation, depth, and correspondence, which makes it a sensible proving ground for heads that might eventually share a backbone with other tasks.
References
- Tian, Shen, Chen, He. FCOS: Fully Convolutional One-Stage Object Detection. ICCV 2019.
- Li, Mao, Girshick, He. Exploring Plain Vision Transformer Backbones for Object Detection. ECCV 2022.
- Zhou, Wang, Krähenbühl. Objects as Points. arXiv:1904.07850, 2019.
License
Apache 2.0. Users are responsible for complying with the license terms of whatever backbone they substitute for EUPE-ViT-B.