

Production Runbook

1. Pre-Deploy Checks

Run all checks from space_trainer/:

python scripts/preflight_check.py
python -m unittest discover -s tests -v

Optional deeper check (loads the tokenizer and model dependencies and runs a stage-1 dry run):

python scripts/preflight_check.py --run-training-dry-run
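
For a CI job or a local gate, the two required checks above could be chained so that the first failure stops the sequence. The following is a minimal sketch, not part of the repository: `run_checks` and `PRE_DEPLOY_COMMANDS` are hypothetical names, and `sys.executable` stands in for the `python` command so that the same interpreter is used throughout.

```python
import subprocess
import sys

# The two required commands from this runbook. They are meant to be run
# from space_trainer/, so pass cwd="space_trainer" when calling from elsewhere.
PRE_DEPLOY_COMMANDS = [
    [sys.executable, "scripts/preflight_check.py"],
    [sys.executable, "-m", "unittest", "discover", "-s", "tests", "-v"],
]


def run_checks(commands, cwd=None):
    """Run each command in order and stop at the first failure.

    Returns the failing command (a list of strings), or None if every
    command exited with status 0.
    """
    for command in commands:
        result = subprocess.run(command, cwd=cwd)
        if result.returncode != 0:
            return command
    return None
```

Calling `run_checks(PRE_DEPLOY_COMMANDS)` from inside space_trainer/ returns `None` when both checks are green. Any other return value names the command that needs attention before deploying.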

2. Runtime Configuration

Set runtime secrets in Hugging Face Space settings:

  • HF_TOKEN (or HUGGINGFACE_HUB_TOKEN)
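
As an illustration of how the two accepted variable names might be resolved at startup, here is a small sketch. The function name `resolve_hf_token` is hypothetical, and the precedence (HF_TOKEN first, then HUGGINGFACE_HUB_TOKEN) is an assumption based on the order in which they are listed above, not confirmed behavior of the app.

```python
import os


def resolve_hf_token(env=None):
    """Return the first non-empty token, or None if neither variable is set.

    Assumed precedence: HF_TOKEN, then HUGGINGFACE_HUB_TOKEN.
    """
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None
```

Failing fast when this returns `None` gives a clearer error than a rejected Hub request later in the run.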

Optional safety overrides:

  • CONTINUOUS_RESTART_DELAY_SECONDS (default 15)
  • CONTINUOUS_MAX_CONSECUTIVE_FAILURES (default 3)
  • APP_LOG_MAX_CHARS (default 200000)
  • RUN_HISTORY_LIMIT (default 80)
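
The overrides above are all integer values with documented defaults, so they could be read in one place. The sketch below is illustrative only: `SafetySettings`, `load_safety_settings`, and `_int_env` are hypothetical names, and falling back to the default on a missing, unparsable, or non-positive value is an assumption about how the app might treat bad input.

```python
import os
from dataclasses import dataclass


def _int_env(env, name, default):
    """Parse a positive integer from env; fall back to the default otherwise."""
    raw = env.get(name, "").strip()
    try:
        value = int(raw)
    except ValueError:
        # Covers both a missing variable (empty string) and non-numeric text.
        return default
    return value if value > 0 else default


@dataclass(frozen=True)
class SafetySettings:
    restart_delay_seconds: int
    max_consecutive_failures: int
    app_log_max_chars: int
    run_history_limit: int


def load_safety_settings(env=None):
    """Build settings from the environment using the documented defaults."""
    env = os.environ if env is None else env
    return SafetySettings(
        restart_delay_seconds=_int_env(env, "CONTINUOUS_RESTART_DELAY_SECONDS", 15),
        max_consecutive_failures=_int_env(env, "CONTINUOUS_MAX_CONSECUTIVE_FAILURES", 3),
        app_log_max_chars=_int_env(env, "APP_LOG_MAX_CHARS", 200000),
        run_history_limit=_int_env(env, "RUN_HISTORY_LIMIT", 80),
    )
```

Passing a plain dictionary as `env` makes the parsing easy to test without touching the real environment.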

3. Release Checklist

  1. Ensure pre-deploy checks are green.
  2. Ensure requirements.txt includes all runtime dependencies.
  3. Deploy Space files (exclude workspace/ artifacts).
  4. Wait for Space runtime stage to reach RUNNING.
  5. Trigger a UI preflight run in "Validation Mode (No Training)".
  6. Trigger one non-autonomous single-stage run before enabling continuous autonomous mode.
  7. Confirm workspace/runtime/run_history.json is being updated and recent run cards render in telemetry.
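
Step 4 of the checklist above can be automated with a simple polling loop. The sketch below assumes a caller-supplied `get_stage` callable that returns the current stage as a string; `wait_for_stage` and its parameters are hypothetical names. One way to supply `get_stage`, assuming the huggingface_hub package is installed, would be to wrap `HfApi().get_space_runtime(repo_id).stage` for your own Space id.

```python
import time


def wait_for_stage(get_stage, target="RUNNING", timeout_seconds=600,
                   poll_seconds=10, sleep=time.sleep, clock=time.monotonic):
    """Poll get_stage() until it returns target.

    Returns True once the target stage is observed, or False if the
    timeout elapses first. sleep and clock are injectable for testing.
    """
    deadline = clock() + timeout_seconds
    while True:
        if get_stage() == target:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

The stage is checked before the timeout so that a Space that is already running returns immediately. The loop does not distinguish a slow build from a failed one, so a `False` result still calls for a look at the Space logs.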

4. Rollback Strategy

  1. Re-deploy the last known good commit to the Space.
  2. Disable Continuous Auto-Restart.
  3. Run only preflight mode until health is restored.
  4. Re-enable autonomous/continuous mode after one successful full run.

5. Operational Notes

  • Full run records are persisted under workspace/runtime/run_records/.
  • The compact run index at workspace/runtime/run_history.json is capped by RUN_HISTORY_LIMIT.
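
To make the relationship between the two locations concrete, here is a sketch of how a full record and a capped index entry might be written together. Everything beyond the two paths and the cap is an assumption: the function name `record_run`, the one-file-per-run naming, the index being a JSON list, the fields stored in each index entry, and the choice to keep the newest entries when trimming.

```python
import json
from pathlib import Path


def record_run(runtime_dir, run_id, record, limit=80):
    """Persist the full record, then append a compact entry to the capped index.

    runtime_dir corresponds to workspace/runtime/ and limit to RUN_HISTORY_LIMIT.
    Returns the index contents after the update.
    """
    runtime_dir = Path(runtime_dir)
    records_dir = runtime_dir / "run_records"
    records_dir.mkdir(parents=True, exist_ok=True)

    # The full record is never trimmed (assumed file naming: <run_id>.json).
    (records_dir / f"{run_id}.json").write_text(json.dumps(record, indent=2))

    # The compact index is capped (assumed schema: a list of small dicts).
    index_path = runtime_dir / "run_history.json"
    history = json.loads(index_path.read_text()) if index_path.exists() else []
    history.append({"run_id": run_id, "status": record.get("status")})
    history = history[-limit:]  # keep only the newest `limit` entries
    index_path.write_text(json.dumps(history, indent=2))
    return history
```

Under this layout, a run that has dropped out of the index can still be inspected by opening its file under run_records/.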