de schamphelaere

ceselder

Recent Activity

updated a model about 16 hours ago

ceselder/persona-loracle-v4

published a model about 16 hours ago

ceselder/persona-loracle-v4

updated a dataset about 20 hours ago

ceselder/persona-loracle-qa-v4

ceselder's collections (7)

LoRAcle — training data + eval

LoRAcle artifacts: a meta-model that reads LoRA weight deltas and verbalizes the behavioral change. Training data + OOD eval sub-collection.

LoRAcle OOD eval models

Collection

OOD model organisms for LoRAcle emergent-behavior eval — 4 Betley EM LoRAs + Cloud subliminal owl + EM training data. • 13 items • Updated 22 days ago
ceselder/loracle-pretrain-mix

Viewer • Updated 17 days ago • 50.6k • 443
ceselder/loracle-ia-RL

Viewer • Updated 15 days ago • 473 • 134
ceselder/loracle-ia-warmstart

Viewer • Updated 15 days ago • 2.08k • 108

Loracle: weight-reading model interpretability

Loracles + direction tokens for AuditBench, IA, OOD evals.

ceselder/loracle-k16-realdpo

Updated 25 days ago
ceselder/loracle-k16-dpoready-sft

Updated 26 days ago
ceselder/loracle-k16-pruned-15k-sft

Updated 26 days ago
ceselder/loracle-k16-pruned-sft

Updated 26 days ago

loracle

LoRA Oracles: detect hidden behaviors from weight geometry. Training data for loracle models.

ceselder/loracle-training-rollouts

Viewer • Updated Mar 22 • 634k • 25
ceselder/loracle-onpolicy-rollouts

Viewer • Updated Mar 22 • 147k • 110
ceselder/loracle-loraqa

Viewer • Updated Mar 22 • 49.9k • 23

CoT Oracle Evals

Eval datasets for the CoT Trajectory Oracle — detecting unfaithful chain-of-thought reasoning via activation trajectories.

ceselder/cot-oracle-eval-decorative-cot

Viewer • Updated Feb 24 • 56 • 7
ceselder/cot-oracle-eval-rot13-reconstruction

Viewer • Updated Feb 24 • 100 • 6
ceselder/cot-oracle-truthfulqa-hint-admission-unverbalized

Viewer • Updated Feb 26 • 11k • 10
ceselder/cot-oracle-truthfulqa-hint-admission-verbalized

Viewer • Updated Feb 26 • 4.38k • 6

LoRAcle OOD eval models

OOD model organisms for LoRAcle emergent-behavior eval — 4 Betley EM LoRAs + Cloud subliminal owl + EM training data.

ceselder/qwen3-14b-em-risky_financial

Updated 22 days ago • 16
ceselder/qwen3-14b-em-bad_medical

Updated 22 days ago • 29
ceselder/qwen3-14b-em-insecure

Updated 22 days ago • 16
ceselder/qwen3-14b-em-evil_numbers

Updated 22 days ago • 28

CoT Oracle Paper Ablations And Baselines

All models used for my LessWrong post. It is generally recommended to use the latest adam oracle, or the checkpoint confusingly labelled "no DPO".

ceselder/adam-reupload-qwen3-8b-latentqa-cls-past-lens

Text Generation • Updated Mar 30 • 9
ceselder/adam-reupload-qwen3-8b-full-mix-synthetic-qa-v3-replace-lqa

Text Generation • Updated Mar 30 • 3
ceselder/cot-oracle-paper-ablation-adam-recipe-1layer

Text Generation • Updated Mar 30 • 8
ceselder/cot-oracle-paper-ablation-ours-1layer

Text Generation • Updated Mar 30 • 3

CoT Oracle Training Data

Training datasets for the CoT Trajectory Oracle. Includes CoT corpora and QA datasets used for oracle fine-tuning.

ceselder/cot-oracle-corpus-v5

Viewer • Updated Feb 23 • 40.5k • 8
ceselder/cot-oracle-cotqa

Viewer • Updated Feb 23 • 10.5k • 7
