Production Runbook
1. Pre-Deploy Checks
Run all checks from `space_trainer/`:
python scripts/preflight_check.py
python -m unittest discover -s tests -v
Optional deeper check (loads the tokenizer and model dependencies and runs a stage-1 training dry run):
python scripts/preflight_check.py --run-training-dry-run
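The checks above can be chained in a small wrapper so that a deploy stops at the first failing step. This is a minimal sketch, not a script that ships with the repository: `build_commands` and `run_checks` are hypothetical names, and the wrapper assumes it is run from `space_trainer/` like the commands it wraps.

```python
import subprocess
import sys

# Hypothetical helper: mirrors the runbook commands and assumes cwd is space_trainer/.
BASE_COMMANDS = [
    [sys.executable, "scripts/preflight_check.py"],
    [sys.executable, "-m", "unittest", "discover", "-s", "tests", "-v"],
]
DRY_RUN_COMMAND = [sys.executable, "scripts/preflight_check.py", "--run-training-dry-run"]


def build_commands(deep=False):
    """Return the command list; deep=True appends the optional stage-1 dry run."""
    commands = [list(cmd) for cmd in BASE_COMMANDS]
    if deep:
        commands.append(list(DRY_RUN_COMMAND))
    return commands


def run_checks(commands, cwd="."):
    """Run commands in order and return the index of the first failure, or None."""
    for index, command in enumerate(commands):
        print("$", " ".join(command), flush=True)
        if subprocess.run(command, cwd=cwd).returncode != 0:
            return index  # stop here; later checks are skipped
    return None
```

`run_checks(build_commands(deep=True))` returns `None` only when every step exits with status 0. Putting the cheap checks first means a broken test suite is reported before the slower dry run loads any model dependencies.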
2. Runtime Configuration
Set runtime secrets in the Hugging Face Space settings:
- `HF_TOKEN` (or `HUGGINGFACE_HUB_TOKEN`)
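At startup the app needs one of the two token variables. A minimal sketch of that lookup, assuming `HF_TOKEN` takes precedence when both are set (the runbook does not say which wins); `resolve_hf_token` and `require_hf_token` are hypothetical names:

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return HF_TOKEN if set, else HUGGINGFACE_HUB_TOKEN, else None (precedence assumed)."""
    for name in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:  # blank or whitespace-only values count as unset
            return value
    return None


def require_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Fail fast at startup instead of partway through a run."""
    token = resolve_hf_token(env)
    if token is None:
        raise RuntimeError("Set HF_TOKEN (or HUGGINGFACE_HUB_TOKEN) in the Space settings")
    return token
```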
Optional safety overrides:
- `CONTINUOUS_RESTART_DELAY_SECONDS` (default `15`)
- `CONTINUOUS_MAX_CONSECUTIVE_FAILURES` (default `3`)
- `APP_LOG_MAX_CHARS` (default `200000`)
- `RUN_HISTORY_LIMIT` (default `80`)
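These overrides are plain environment variables, so each one needs a parse step with a fallback to its documented default. A sketch under the assumption that all four are integers and that an unparseable or negative value should fall back to the default rather than abort startup; `SafetyOverrides` and `load_safety_overrides` are hypothetical names:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SafetyOverrides:
    """Defaults match the runbook; the field names are hypothetical."""
    restart_delay_seconds: int = 15
    max_consecutive_failures: int = 3
    app_log_max_chars: int = 200_000
    run_history_limit: int = 80


# Environment variable name -> dataclass field name.
ENV_TO_FIELD = {
    "CONTINUOUS_RESTART_DELAY_SECONDS": "restart_delay_seconds",
    "CONTINUOUS_MAX_CONSECUTIVE_FAILURES": "max_consecutive_failures",
    "APP_LOG_MAX_CHARS": "app_log_max_chars",
    "RUN_HISTORY_LIMIT": "run_history_limit",
}


def load_safety_overrides(env: Mapping[str, str] = os.environ) -> SafetyOverrides:
    """Read each override, falling back to its default if unset, non-integer, or negative."""
    defaults = SafetyOverrides()
    values = {}
    for env_name, field_name in ENV_TO_FIELD.items():
        default = getattr(defaults, field_name)
        raw = env.get(env_name)
        try:
            parsed = int(raw) if raw is not None else default
        except ValueError:
            parsed = default  # e.g. "15s" or "abc"
        values[field_name] = parsed if parsed >= 0 else default
    return SafetyOverrides(**values)
```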
3. Release Checklist
- Ensure the pre-deploy checks are green.
- Ensure `requirements.txt` includes all runtime dependencies.
- Deploy the Space files (exclude `workspace/` artifacts).
- Wait for the Space runtime stage to reach `RUNNING`.
- Trigger a UI preflight run (`Validation Mode (No Training)`).
- Trigger one non-autonomous single-stage run before enabling continuous autonomous mode.
- Confirm that `workspace/runtime/run_history.json` is being updated and that recent run cards render in telemetry.
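The last checklist item can be partly automated. This sketch assumes only that `run_history.json` is a JSON document that becomes non-empty once a run has been recorded; the one-hour freshness window and the `check_run_history` name are illustrative choices, not values from the runbook:

```python
import json
import time
from pathlib import Path


def check_run_history(path, max_age_seconds=3600.0, now=None):
    """Return a list of problems with the run index; an empty list means it looks healthy."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    now = time.time() if now is None else now
    age = now - path.stat().st_mtime  # seconds since the index was last written
    if age > max_age_seconds:
        problems.append(f"{path} was last modified {age:.0f}s ago")
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"{path} is not valid JSON: {exc}")
    else:
        if not data:
            problems.append(f"{path} contains no runs")
    return problems
```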
4. Rollback Strategy
- Re-deploy the last known good commit to the Space.
- Disable `Continuous Auto-Restart`.
- Run preflight mode only until health is restored.
- Re-enable autonomous/continuous mode after one successful full run.
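Disabling Continuous Auto-Restart during a rollback stops the Space from relaunching a failing job on its own. The runbook does not describe the restart loop, so the following is only a sketch of how `CONTINUOUS_RESTART_DELAY_SECONDS` and `CONTINUOUS_MAX_CONSECUTIVE_FAILURES` plausibly interact, assuming the first is a pause between runs and the second a cap on back-to-back failures. `run_continuously` is a hypothetical name, and `run_once` stands in for one full run that returns `True` on success.

```python
import time
from typing import Callable, Optional


def run_continuously(
    run_once: Callable[[], bool],
    restart_delay_seconds: float = 15,
    max_consecutive_failures: int = 3,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Repeat run_once() until the failure cap is hit or max_runs is reached.

    Returns "halted" when max_consecutive_failures failures occur in a row,
    otherwise "completed" (only reachable when max_runs is set).
    """
    consecutive_failures = 0
    runs = 0
    while max_runs is None or runs < max_runs:
        if runs:
            sleep(restart_delay_seconds)  # pause before every run except the first
        runs += 1
        if run_once():
            consecutive_failures = 0  # any success resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                return "halted"
    return "completed"
```

Injecting `sleep` keeps the loop testable without real delays. Because a success resets the counter, only an uninterrupted streak of failures trips the cap.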
5. Operational Notes
- Full run records are persisted under `workspace/runtime/run_records/`.
- The compact run index at `workspace/runtime/run_history.json` is capped by `RUN_HISTORY_LIMIT`.
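Capping the compact index at `RUN_HISTORY_LIMIT` amounts to keeping only the newest N entries after each append. A minimal sketch, assuming the index is a JSON array ordered oldest to newest (the real schema is not documented here); `append_to_run_index` is a hypothetical helper that writes through a temporary file and `os.replace` so a crash mid-write does not leave a half-written index:

```python
import json
import os
import tempfile
from pathlib import Path


def append_to_run_index(index_path, entry, limit=80):
    """Append entry to a JSON-array index and keep only the newest `limit` items."""
    index_path = Path(index_path)
    try:
        entries = json.loads(index_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        entries = []  # missing or corrupt index: start fresh
    if not isinstance(entries, list):
        entries = []
    entries.append(entry)
    entries = entries[-limit:] if limit > 0 else []  # entries[-0:] would keep everything
    index_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then atomically swap it in.
    fd, tmp_name = tempfile.mkstemp(dir=index_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(entries, handle, indent=2)
        os.replace(tmp_name, index_path)
    except BaseException:
        os.unlink(tmp_name)
        raise
    return entries
```

A missing or unreadable index is treated as empty, which favors availability over preserving a corrupt file; this helper never touches the full records under `run_records/`.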