de schamphelaere
ceselder
Recent Activity
updated a model about 16 hours ago: ceselder/persona-loracle-v4
published a model about 16 hours ago: ceselder/persona-loracle-v4
updated a dataset about 20 hours ago: ceselder/persona-loracle-qa-v4
Loracle: weight-reading model interpretability
Loracles + direction tokens for AuditBench, IA, OOD evals.
loracle
LoRA Oracles: detect hidden behaviors from weight geometry. Training data for loracle models.
CoT Oracle Evals
Eval datasets for the CoT Trajectory Oracle — detecting unfaithful chain-of-thought reasoning via activation trajectories.
- ceselder/cot-oracle-eval-decorative-cot
  Viewer • Updated • 56 • 7
- ceselder/cot-oracle-eval-rot13-reconstruction
  Viewer • Updated • 100 • 6
- ceselder/cot-oracle-truthfulqa-hint-admission-unverbalized
  Viewer • Updated • 11k • 10
- ceselder/cot-oracle-truthfulqa-hint-admission-verbalized
  Viewer • Updated • 4.38k • 6
LoRAcle OOD eval models
OOD model organisms for LoRAcle emergent-behavior eval — 4 Betley EM LoRAs + Cloud subliminal owl + EM training data.
CoT Oracle Paper Ablations And Baselines
All models used for my LessWrong post. I generally recommend using the latest adam oracle, or the checkpoint confusingly labelled "no DPO".
- ceselder/adam-reupload-qwen3-8b-latentqa-cls-past-lens
  Text Generation • Updated • 9
- ceselder/adam-reupload-qwen3-8b-full-mix-synthetic-qa-v3-replace-lqa
  Text Generation • Updated • 3
- ceselder/cot-oracle-paper-ablation-adam-recipe-1layer
  Text Generation • Updated • 8
- ceselder/cot-oracle-paper-ablation-ours-1layer
  Text Generation • Updated • 3
CoT Oracle Training Data
Training datasets for the CoT Trajectory Oracle. Includes CoT corpora and QA datasets used for oracle fine-tuning.
LoRAcle — training data + eval
LoRAcle artifacts: a meta-model that reads LoRA weight deltas and verbalizes the behavioral change. Training data + OOD eval sub-collection.