# Production Runbook

## 1. Pre-Deploy Checks

Run all checks from `space_trainer/`:

```bash
python scripts/preflight_check.py
python -m unittest discover -s tests -v
```

Optional deeper check (loads tokenizer/model dependencies and runs stage-1 dry-run):

```bash
python scripts/preflight_check.py --run-training-dry-run
```

## 2. Runtime Configuration

Set runtime secrets in Hugging Face Space settings:

- `HF_TOKEN` (or `HUGGINGFACE_HUB_TOKEN`)

Optional safety overrides:

- `CONTINUOUS_RESTART_DELAY_SECONDS` (default `15`)
- `CONTINUOUS_MAX_CONSECUTIVE_FAILURES` (default `3`)
- `APP_LOG_MAX_CHARS` (default `200000`)
- `RUN_HISTORY_LIMIT` (default `80`)

## 3. Release Checklist

1. Ensure pre-deploy checks are green.
2. Ensure `requirements.txt` includes all runtime dependencies.
3. Deploy Space files (exclude `workspace/` artifacts).
4. Wait for Space runtime stage to reach `RUNNING`.
5. Trigger a UI preflight run (`Validation Mode (No Training)`).
6. Trigger one non-autonomous single-stage run before enabling continuous autonomous mode.
7. Confirm `workspace/runtime/run_history.json` is being updated and recent run cards render in telemetry.

## 4. Rollback Strategy

1. Re-deploy the last known good commit to the Space.
2. Disable `Continuous Auto-Restart`.
3. Run preflight mode only until health is restored.
4. Re-enable autonomous/continuous mode after one successful full run.

## 5. Operational Notes

- Full run records are persisted under `workspace/runtime/run_records/`.
- The compact run index at `workspace/runtime/run_history.json` is capped by `RUN_HISTORY_LIMIT`.
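
As a rough illustration of how the optional safety overrides and the `RUN_HISTORY_LIMIT` cap might behave, here is a minimal Python sketch. The helper names `read_int_override` and `cap_history` are hypothetical and are not taken from the application code. Falling back to the default on a missing, non-numeric, or non-positive value is an assumption, as is the oldest-first ordering of history entries.

```python
import os

# Defaults as documented in the Runtime Configuration section.
DEFAULTS = {
    "CONTINUOUS_RESTART_DELAY_SECONDS": 15,
    "CONTINUOUS_MAX_CONSECUTIVE_FAILURES": 3,
    "APP_LOG_MAX_CHARS": 200000,
    "RUN_HISTORY_LIMIT": 80,
}


def read_int_override(name, env=None):
    """Return the integer override for `name`, or its documented default.

    Assumption: unset, non-numeric, or non-positive values fall back
    to the default. The real application may handle these differently.
    """
    env = os.environ if env is None else env
    raw = env.get(name, "").strip()
    if not raw:
        return DEFAULTS[name]
    try:
        value = int(raw)
    except ValueError:
        return DEFAULTS[name]
    return value if value > 0 else DEFAULTS[name]


def cap_history(entries, limit):
    """Keep only the most recent `limit` entries.

    Assumption: `entries` is ordered oldest first, so the tail is newest.
    """
    return entries[-limit:] if limit > 0 else []


if __name__ == "__main__":
    limit = read_int_override("RUN_HISTORY_LIMIT")
    history = [{"run": i} for i in range(100)]
    print(len(cap_history(history, limit)))
```

Passing `env` explicitly keeps the helper easy to exercise in a unit test without touching the real process environment.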