Detection Heads

Detection head architectures on frozen EUPE-ViT-B features. The backbone is frozen throughout; only the head trains. The premise behind the repository is that a strong modern ViT encoder already carries enough semantic and spatial structure to support detection with a lightweight decoder on top, and that the detection head's architecture — not the backbone — is where the interesting design trade-offs live. Every result here is obtained from features extracted once and cached; training iterates over those cached tensors, which makes it possible to run architecture studies in minutes rather than the days required for end-to-end backbone training.

Overview

The headline artifact is a 2,975,067-parameter split-tower detection head that reaches 42.64 mAP on COCO val2017 (42.71 under soft NMS), beating the standard FCOS baseline (16.14M parameters, 41.0 mAP) by +1.64 at 18.4 percent of its head parameter budget. Small-object mAP is 22.3 against FCOS's 19.4 (+2.9). This is the head shipped in phanerozoic/argus. It replaces FCOS's learned feature pyramid with a zero-parameter cofiber decomposition of the backbone patch features: four frequency-separated scale bands at strides 16, 32, 64, and 128 (produced by iteratively subtracting the downsampled-then-upsampled component of the feature map), plus a stride-8 level synthesized by a single transposed convolution. Each level is processed by separate classification and regression towers (five 3×3 ConvGN blocks followed by four depthwise residual blocks at 160 hidden channels, with weights shared across levels and top-down lateral connections). Classification is cosine similarity of a Linear(160, 768) projection against CLIP ViT-L/14 multi-prompt text embeddings of the 80 COCO class names. Regression uses exponentiated LTRB distances with a learned per-level scale. Training is 16 epochs at 768-pixel input with ATSS target assignment, flip augmentation, EMA, and PC-initialized cls_project, followed by a 3-epoch partial fine-tune that updates only the classification calibration layers (cls_project, cls_bias, logit_scale) while the towers and cofiber path stay frozen.
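The exponentiated-LTRB box decode can be sketched in a few lines. This is a minimal numpy illustration, not the repository's API: the function name is hypothetical and the learned per-level scale is reduced to a plain float.

```python
import numpy as np

def decode_ltrb(pred, stride, scale=1.0):
    """Decode raw LTRB regression outputs into xyxy boxes.

    pred: (H, W, 4) raw network outputs. Distances are exp(pred) * scale
    * stride, mirroring the exponentiated-LTRB scheme described above
    (here `scale` is a float stand-in for the learned per-level scale).
    """
    H, W, _ = pred.shape
    # Grid-cell centers in input-pixel coordinates.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    cx = (xs + 0.5) * stride
    cy = (ys + 0.5) * stride
    l, t, r, b = np.moveaxis(np.exp(pred) * scale * stride, -1, 0)
    return np.stack([cx - l, cy - t, cx + r, cy + b], axis=-1)

# Zero logits at stride 16 decode to boxes extending exp(0) * 16 = 16 px
# from each cell center.
boxes = decode_ltrb(np.zeros((2, 2, 4)), stride=16)
```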

The repository also contains a standard FCOS detector (16.14M parameters, 41.0 mAP) as the canonical reference baseline. It trains identically on the same cached features and uses the same eval pipeline, so the 42.64-vs-41.0 comparison is apples-to-apples — both heads read the same frozen backbone output, both are evaluated with pycocotools hard NMS at IoU 0.5. Anyone who wants to reproduce the comparison, or to see how far a standard FCOS head gets on modern frozen ViT features, can take the reference off the shelf.

Alongside the shipped picker and the FCOS reference sits a library of seventeen experimental architectural scaffolds under heads/. Each is a single head.py file implementing a forward pass with no trained checkpoint — hypothesis-stage designs that formulate detection in unconventional ways: wavelet decomposition in place of FPN, optimal-transport label assignment, compositional patch assembly, tropical-semiring operations, corner-pair relational reasoning, and others. A driver script (arena.py) instantiates and trains any of them against the cached backbone features, so trying a new formulation is a one-command operation. The scaffolds are recorded so alternative formulations are not lost, and so readers looking for detection ideas have somewhere to browse.

The full parameter-versus-mAP scaling curve for the cofiber line of work (from 105-parameter minimal circuits up through the current 2.98M picker at 42.64 mAP) is hosted in the separate phanerozoic/cofiber-detection repository, along with the machine-checked Rocq/HoTT proof of the decomposition's exactness, analytical closed-form heads, and synthesizable Verilog circuit implementations. This repository keeps the top-performing checkpoint and the experimental scaffold library; the research lineage lives in the sibling repo.

Reference Results (COCO val2017)

| Head | Parameters | mAP | mAP@0.50 | mAP@0.75 |
|---|---|---|---|---|
| Split-tower, 160h, 768px + cls_calib fine-tune (16 ep + 3 ep partial) | 2,975,067 | 42.64 | 65.70 | 45.10 |
| Split-tower, 160h, 768px (16 ep, step 104k, pre-calib) | 2,975,067 | 42.49 | 65.57 | 44.89 |
| Split-tower, 160h, 640px (16 ep, step 100k, pre-resolution-scale) | 2,975,067 | 41.15 | 63.99 | 43.83 |
| Split-tower + CLIP ViT-B/32 text-aligned, 192h (8 ep, 640px) | 4,164,699 | 25.95 | 39.13 | 28.92 |
| Split-tower with multi-scale decomposition, 192h (8 ep, no text align) | 4,068,954 | 24.6 | 37.1 | 27.0 |
| Baseline FCOS (simple feature pyramid) | 16,138,074 | 41.0 | 64.8 | 43.2 |

Eval protocol: per-class hard NMS (IoU 0.5), score threshold 0.05, top-100 per image, pycocotools COCO val2017. Canonical script: eval_coco_map.py. A --soft-nms flag runs an additional linear-decay soft-NMS pass for comparison.
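The per-class hard NMS in the eval protocol is the standard greedy algorithm; a self-contained numpy sketch (not the repository's implementation) for one class:

```python
import numpy as np

def hard_nms(boxes, scores, iou_thresh=0.5):
    """Greedy hard NMS on xyxy boxes for a single class; returns kept indices."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # IoU of the current top-scoring box against the remainder.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]   # hard suppression at the IoU threshold
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = hard_nms(boxes, scores)   # box 1 overlaps box 0 at IoU ~0.68 and is suppressed
```

The soft-NMS variant run under `--soft-nms` decays overlapping scores linearly instead of discarding boxes outright.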

The top row is the current picker and the head shipped in phanerozoic/argus. At 2.98M learnable parameters it reaches 42.64 mAP on COCO val2017, beating the 16.14M-parameter FCOS baseline (41.0 mAP) by +1.64 while using 18.4 percent of its head parameter budget. Small-object mAP is 22.3 (vs. FCOS 19.4, +2.9). Soft NMS gives 42.71 mAP, 65.73 @0.50, 45.33 @0.75.

The FCOS baseline uses a simple feature pyramid that synthesizes five scale levels (P3 at stride 8 through P7 at stride 128, each with 256 channels and GroupNorm) from the backbone's stride-16 spatial features, followed by two shared four-layer convolutional towers (one for classification, one for box regression plus centerness) and three 1×1 prediction heads. It trains in eight epochs at 640-pixel input. The configuration is standard and the implementation is direct; the point of having it in this repository is to provide a clean, comparable reference.

The split-tower head differs from the FCOS baseline in three ways: the multi-scale structure comes from a zero-parameter cofiber decomposition of the backbone features rather than from a learned FPN; the classification and regression towers use separate weights throughout rather than sharing a tower; and the classifier is a linear projection into frozen CLIP text-embedding space followed by cosine similarity against the 80 COCO class-name embeddings, replacing a learned per-class weight matrix. The classifier projection is a single Linear(hidden, text_embed_dim) without bias; class prototypes are the L2-normalized CLIP text-encoder outputs for the 80 COCO class names; a learned scalar temperature and per-class bias calibrate the cosine scores.
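The text-embedding classifier reduces to a projection, two normalizations, and a scaled dot product. A numpy sketch under stated assumptions — the weights below are random stand-ins for the learned `cls_project`, the frozen CLIP prototypes, and the learned temperature/bias:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, text_dim, n_cls = 160, 768, 80

# Hypothetical stand-ins for the pieces described above.
cls_project = rng.standard_normal((hidden, text_dim)) / np.sqrt(hidden)  # Linear(160, 768), no bias
prototypes = rng.standard_normal((n_cls, text_dim))                      # frozen CLIP text embeddings
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)          # L2-normalized class prototypes
logit_scale = 20.0                                                       # learned scalar temperature
cls_bias = np.zeros(n_cls)                                               # learned per-class bias

def classify(feat):
    """Cosine-similarity classification of tower features against text prototypes."""
    z = feat @ cls_project
    z /= np.linalg.norm(z, axis=-1, keepdims=True)
    return logit_scale * (z @ prototypes.T) + cls_bias   # (..., 80) logits

logits = classify(rng.standard_normal((5, hidden)))
```

Because both sides are unit-normalized, every logit is bounded by the temperature (plus bias), which is what makes the calibration-only fine-tune of `logit_scale` and `cls_bias` well posed.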

The current checkpoint is the end point of a six-step lineage that lifted the same architecture from 36.88 to 42.64 mAP at roughly constant parameter count:

| Step | Change | mAP (hard NMS) |
|---|---|---|
| baseline | 160h 5+4 towers, FCOS assignment, no aug, 640px | 36.88 |
| + ATSS + horizontal flip aug | stride-scaled pseudo-IoU assignment, 50% flip | 40.06 |
| + EMA + PC init + CLIP ViT-L/14 8-prompt | EMA decay 0.9998, cls_project columns initialized from SVD of text_embed, switch from ViT-B/32 single-prompt to ViT-L/14 8-prompt-average text embeddings | 40.89 (live) / 40.96 (EMA) |
| + 16-epoch schedule, late-training sweep | double the schedule, select peak from 11-point sweep over steps 80k–117k | 41.15 (live) / 41.27 (soft NMS) |
| + 768-pixel training | same recipe at 48×48 backbone grid instead of 40×40 | 42.49 (step 104k peak) |
| + 3-epoch cls_calib fine-tune | partial fine-tune updating only cls_project, cls_bias, logit_scale at lr 1e-4 (towers and cofiber path frozen) | 42.64 / 42.71 (soft NMS) |

The 4.16M / 25.95 mAP row is a prior-generation picker (192-hidden, 8-epoch, CLIP ViT-B/32 single-prompt text embeddings), retained for continuity. The jump from 25.95 to 42.64 comes from the training recipe and resolution: ATSS target assignment, horizontal flip augmentation, EMA, PC-initialized cls_project, ViT-L/14 multi-prompt text embeddings, a doubled schedule, the switch from 640-pixel to 768-pixel training input, and a 3-epoch classification-calibration fine-tune. Hidden width meanwhile dropped from 192 to 160, so the head shrank even as it gained 16.7 mAP. The recipe changes live in train_split_tower_5scale.py; eval JSONs for every shipping checkpoint sit next to the weights in the same directory.

Analytical single-class person detector

A probe of how far a detection head can go with zero gradient descent on the same frozen backbone. The best closed-form configuration reaches 2.33 AP / 10.3 AP@0.50 on COCO val2017 for the person class at ~105K learnable parameters, all derived from simple statistics of the cached training features (no SGD, no training loop):

  • Fisher LDA classifier — 768-dim direction computed in closed form from pooled covariance of person-positive vs. background features across 85K positives and 51M negatives. Bayes-calibrated bias incorporates the ~1:600 class-imbalance prior of person-positive patches in COCO.
  • 15-anchor aspect grid — 5 scale quantile bins × 3 fixed aspect ratios (0.4, 1.0, 2.5) guarantees landscape-orientation coverage even though COCO person training data skews portrait. A per-scale ridge-regressed 15-way classifier selects the anchor at each patch location.
  • Per-anchor fine refinement — 15 independent 770-dim ridge regressors each predict residuals (dx, dy, d_log_w, d_log_h) relative to their assigned anchor shape. Splitting the regression into per-anchor subproblems avoids the averaging failure that broke earlier global-regression variants.
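The Fisher LDA direction and Bayes-calibrated bias from the first bullet have a one-shot closed form. A numpy sketch on a toy 2-D problem — the function name is illustrative and the ridge term is an assumption, not the repository's exact conditioning:

```python
import numpy as np

def fisher_lda(pos, neg, prior_pos):
    """Closed-form Fisher LDA: w = Sigma_pooled^{-1} (mu_pos - mu_neg).

    The bias places the decision boundary according to the class prior
    (e.g. the ~1:600 person-positive prior mentioned above). A small ridge
    term keeps the pooled covariance invertible.
    """
    mu_p, mu_n = pos.mean(0), neg.mean(0)
    n_p, n_n = len(pos), len(neg)
    cov = (np.cov(pos.T) * (n_p - 1) + np.cov(neg.T) * (n_n - 1)) / (n_p + n_n - 2)
    w = np.linalg.solve(cov + 1e-6 * np.eye(cov.shape[0]), mu_p - mu_n)
    b = -0.5 * w @ (mu_p + mu_n) + np.log(prior_pos / (1 - prior_pos))
    return w, b

# Toy check: two well-separated Gaussians should classify almost perfectly.
rng = np.random.default_rng(0)
pos = rng.normal(3.0, 1.0, (500, 2))
neg = rng.normal(-3.0, 1.0, (500, 2))
w, b = fisher_lda(pos, neg, prior_pos=0.5)
acc = ((np.vstack([pos, neg]) @ w + b > 0) == np.r_[np.ones(500), np.zeros(500)]).mean()
```

In the repository the same solve runs over 768-dim cached features with 85K positives and 51M negatives; only the statistics pass scales, not the algebra.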

AP@0.75 is effectively zero (0.2): the 15 discrete anchor shapes plus linear residual correction produce boxes precise enough for loose localization but not for strict IoU thresholds. Generative-classifier variants (per-class Gaussian mixtures via Mahalanobis distance) and soft anchor weighting were tested and both underperformed the discriminative-classifier + argmax configuration — a clean negative result showing Fisher LDA beats GMM for this task, and hard anchor selection beats soft weighted mean.

Checkpoint at heads/cofiber_threshold/analytical_person_v3j/head_v3j_canonical.pth; build-and-evaluate script at _analytical_person_v3j.py. Total wall time to reproduce: ~3 minutes (statistics pass + closed-form solves + full val eval). The 80-class analytical scaling of this approach is in the separate cofiber-detection repository referenced below.

Cofiber decomposition (separate repository)

A large parallel line of work on this backbone replaces the FCOS feature pyramid with cofiber decomposition: a zero-parameter multi-scale operation that iteratively subtracts the downsampled-then-upsampled component of a feature map to produce frequency-separated scale bands. Given backbone features f, the decomposition produces f − upsample(avgpool(f)) as a high-frequency band, then recurses on the low-frequency remainder. The three bands that result are pairwise orthogonal in frequency content, which lets a lightweight head attend to them separately without needing a learned FPN to mediate between scales.
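The recursion above fits in a few lines of numpy. This sketch uses 2×2 average pooling and nearest-neighbor upsampling for simplicity (the repository uses bilinear upsampling); the exact-reconstruction property holds for any upsampler because the bands telescope:

```python
import numpy as np

def avgpool2(x):
    """2x2 average pooling (stride 2) on an (H, W) map."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    """2x nearest-neighbor upsampling (stand-in for bilinear)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def cofiber_bands(f, levels=3):
    """Zero-parameter multi-scale decomposition: peel off the high-frequency
    band f - up(pool(f)), then recurse on the pooled low-frequency remainder."""
    bands = []
    for _ in range(levels):
        low = avgpool2(f)
        bands.append(f - upsample2(low))   # high-frequency band at this stride
        f = low
    bands.append(f)                        # final low-frequency remainder
    return bands

f = np.random.default_rng(0).standard_normal((16, 16))
bands = cofiber_bands(f)
# The bands reconstruct the input exactly: f = b0 + up(b1 + up(b2 + up(b3))).
recon = bands[2] + upsample2(bands[3])
recon = bands[1] + upsample2(recon)
recon = bands[0] + upsample2(recon)
```

On the real backbone output the same recursion runs per channel on the stride-16 feature map, yielding the stride-16/32/64/128 bands consumed by the towers.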

The decomposition has been machine-checked in Rocq/HoTT. The proof frames average pooling and bilinear upsampling as an adjoint pair whose counit gives a short exact sequence in a semi-additive category; the cofiber bands are the kernels of the resulting projections, and any input is uniquely expressible as a sum of its bands. The practical consequence is that the multi-scale construction which typically costs 11M parameters in an FPN is free — no parameters, no training, no FLOPs beyond pooling and interpolation.
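Stripped of the categorical language, the reconstruction identity at the heart of the claim can be written directly; this is a sketch of the algebra, not the Rocq statement itself. Writing $d$ for average pooling and $u$ for upsampling, each level peels off a band $(1 - ud)$ of the current remainder and the sum telescopes back to the input:

$$
f \;=\; (1 - u d)\,f \;+\; u\,(1 - u d)\,d f \;+\; u^{2}\,(1 - u d)\,d^{2} f \;+\; \cdots \;+\; u^{n} d^{n} f .
$$

The telescoping holds for any pooling/upsampling pair; the formal proof supplies the stronger statement that, in its adjoint setting, this splitting is unique and the bands are genuinely orthogonal components rather than merely a residual decomposition.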

This has accumulated enough work around it to warrant its own home. The sibling repository phanerozoic/cofiber-detection contains:

  • Closed-form analytical heads derived from least-squares regression on the decomposed features (zero training, 70K parameters, 1.6 mAP)
  • Trained neural heads at multiple scales, from 70K parameters (5.2 mAP) up to the 3.85M-parameter split-tower included here (20.3 mAP) and 4.27M-parameter variants with top-down lateral connections (19.9 mAP)
  • INT8 threshold-logic heads using Heaviside step functions (92K parameters, 5.9 mAP) and their pruned variants down to 46K nonzero parameters
  • Synthesizable Verilog circuit implementations, including a 93-parameter person image classifier that achieves 99.8% recall on COCO val images
  • Evolutionary dimension search (GPU-batched, 200 generations per second) that identifies informative feature subsets; the evolved 10-dimension person-detection circuit outperforms a greedy 100-dimension circuit at 10× fewer gates
  • Sheaf-cohomology regression features (directional feature differences at spatial boundaries, interpretable as Čech 1-cocycles)
  • The Rocq/HoTT proof of decomposition exactness
  • Full training and evaluation scripts

Readers interested in any of the above — parameter-efficient detection, formalized multi-scale operations, hardware circuits, evolutionary feature selection, topological regression features — should go there. This repository keeps the top-performing head as a comparison point but does not attempt to mirror the full research lineage.

Experimental Architectural Scaffolds

The seventeen scaffolds under heads/ each explore a distinct hypothesis about how detection could be formulated on frozen ViT features. None has been trained to convergence; the point is to record architectural options rather than to benchmark them:

| Scaffold | Core idea |
|---|---|
| adaptive_query | Learned query vectors that adaptively attend to backbone patches, producing detections as attention outputs rather than per-location predictions |
| cascade_pool | A cascade of pooling operations to synthesize multi-scale features without an FPN |
| centernet | CenterNet-style keypoint detection: predict object centers as a heatmap, then regress size and offset at center locations |
| compression | Head operating on compressed backbone features, testing how aggressively features can be reduced before detection quality collapses |
| curvature | Object boundaries as high curvature of the feature manifold; detection as curvature estimation |
| depth_fusion | Joint detection and depth estimation sharing features, exploiting the alignment between depth gradients and object boundaries |
| feature_graph | Detection as a graph problem over patch tokens, with edges indicating co-occurrence or spatial adjacency |
| mutual_attention | Cross-attention between patch features to model inter-object relationships before detection |
| optimal_transport | Label assignment via optimal transport between ground-truth boxes and prediction locations |
| patch_assembly | Compositional object construction from overlapping patch detections, inspired by self-assembly tile systems |
| prototype | Replacing learned classification weights with fixed class prototypes (text-embedded or image-mean) |
| relational_corners | Corner-pair detection: predict corner heatmaps, then pair them via learned embeddings |
| scale_classify | Explicit scale classification as an auxiliary head to produce scale-consistent boxes |
| sparse_query | Sparse query mechanism to limit prediction locations and reduce background noise |
| threshold_prototype | Prototype classification with a threshold gate |
| tropical | Tropical-semiring (min-plus) operations in the head |
| wavelet | Wavelet-based multi-scale decomposition as an alternative to FPN and to cofiber decomposition |

Each is a one-file architectural commitment. Most will not beat the FCOS baseline; some may suggest directions that do. The arena script makes training any of them a one-command operation against the cached features.

FCOS Variants

Three FCOS-family heads are present as controlled variants on the baseline, each differing from the reference in one isolated way:

  • heads/baseline_fcos/ — the full simple-feature-pyramid FCOS head at 16.14M parameters, the 41.0-mAP reference.
  • heads/slim_fcos/ — a reduced-parameter variant with narrower towers and fewer FPN channels. It exists to help locate how much of the baseline's capacity is load-bearing.
  • heads/hook_fcos/ — an FCOS head that reads intermediate backbone features via forward hooks rather than only the final spatial output. This follows the pattern used by the DPT depth decoder in the multi-task Argus model, where intermediate ViT blocks (2, 5, 8, 11) carry information that the final block does not expose; this variant tests whether the same access pattern improves detection.

Arena

arena.py and multi_domain_arena.py are lightweight driver scripts that instantiate any head from the heads/ directory, train it for a configurable number of epochs on cached COCO features, and evaluate with pycocotools. The multi-domain version extends evaluation across the twenty RF100-VL cross-domain datasets in addition to COCO, to test whether a head's performance generalizes beyond the benchmark it was tuned on.

Shared Infrastructure

  • losses/ contains the FCOS-style focal loss for classification, GIoU loss for boxes, BCE for centerness, and a CenterNet-style focal-loss variant for keypoint heads.
  • utils/ contains the feature caching code (shard format, manifest, shard loader), the decoder that converts per-location predictions to COCO-format boxes, and the evaluation wrapper around pycocotools.
  • eval_coco_map.py is the standalone evaluation entry point; it can be pointed at any saved checkpoint.
  • train_split_tower_5scale.py and eval_split_tower_5scale.py reproduce the split-tower results above.
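The FCOS-style focal loss in losses/ follows the standard form from Lin et al.; a minimal numpy sketch (not the repository code, which operates on batched torch tensors):

```python
import numpy as np

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t),
    summed over locations. Down-weights easy examples so the rare
    positive locations dominate the gradient."""
    p = 1.0 / (1.0 + np.exp(-logits))            # sigmoid probabilities
    p_t = np.where(targets == 1, p, 1 - p)       # prob assigned to the true label
    alpha_t = np.where(targets == 1, alpha, 1 - alpha)
    return float((-alpha_t * (1 - p_t) ** gamma
                  * np.log(np.clip(p_t, 1e-12, None))).sum())
```

A confidently correct prediction contributes almost nothing, while a confidently wrong one is penalized heavily, which is the property that makes dense per-location classification trainable.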

Backbone

All results use EUPE-ViT-B with letterbox padding. The backbone produces 768-channel spatial features at stride 16 — a 40×40 grid at 640-pixel input (used by the FCOS reference baseline and the earlier-generation cofiber pickers) or a 48×48 grid at 768-pixel input (used by the shipped 2.98M picker). The backbone has 86M parameters and is frozen; features are cached once per resolution before any head training begins. The framework is backbone-agnostic: any ViT-scale encoder producing a consistent spatial grid can be substituted by replacing the feature extraction step, and the head library will run without modification. EUPE-ViT-B is the default because its multi-teacher distillation produces features that work well across detection, segmentation, depth, and correspondence, which makes it a sensible proving ground for heads that might eventually share a backbone with other tasks.

References

  • Tian, Shen, Chen, He. FCOS: Fully Convolutional One-Stage Object Detection. ICCV 2019.
  • Li, Mao, Girshick, He. Exploring Plain Vision Transformer Backbones for Object Detection. ECCV 2022.
  • Zhou, Wang, Krähenbühl. Objects as Points. arXiv:1904.07850, 2019.

License

Apache 2.0. Users are responsible for complying with the license terms of whatever backbone they substitute for EUPE-ViT-B.
