AI INFRASTRUCTURE · MYOSIN AGENCY

Token Machine

Autonomous token routing, cost tracking, and efficiency optimization. Sits between your team and your AI models — logs every request, scores quality, grades team members, and rewires its own routing rules without a human in the loop.

Core idea: No dashboards to check. No routing rules to write. The system collects logs, grades each request, detects patterns, auto-escalates weak tasks to frontier models, locks strong tasks to local, and fine-tunes specialized agents once enough training data accumulates.
OpenClaw gateway
Supabase + pgvector
PostHog telemetry
TurboQuant fine-tune
Phase 1 — Heavy Recon

Explore the Guide

Architecture
Request flow from user → cache → gateway → local/frontier → logs → research loop.
gateway · cache
Schema
Four Supabase tables: requests, task_patterns, model_versions, team_efficiency.
supabase · pgvector
Research Loop
The hourly cron that scores quality, classifies tasks, and rewrites routing rules.
cron · scoring
Grades
A–F team-efficiency scoring. Composite of coherence, completion, brevity, re-prompts, cost.
quality · team
Decisions
The six automatic thresholds that escalate, lock, throttle, coach, and trigger fine-tunes.
automation
Pipeline
TurboQuant fine-tune: export → format → train → evaluate → deploy to OpenClaw as named agent.
turboquant · 40 agents
Deploy
Eight phases from scaffold to fine-tune pipeline. Quick start, env vars, smoke test.
phases · quickstart

Target State

7 team members

Myosin Agency operators, each routed through Token Machine by user_id.

40 specialized agents

Fine-tuned local models — roughly one per task-type per person — deployed as named OpenClaw endpoints.

Zero frontier spend

On tasks local handles well. Claude API only fires when local can't clear the quality bar.

Self-improving

Routing rules and model registry update automatically from the research loop. No manual tuning.

Quick Links

Repository
github.com/jhillbht/token-machine
Full Spec
ARCHITECTURE.md
Phase Prompts
PLANNING/ — eight files
REQUEST FLOW

Architecture

Every request flows through a cache, a gateway, a router, and a logger. An async loop reads the logs and mutates the routing rules. The user doesn't see any of it.

End-to-End Flow

```
SYNCHRONOUS PATH                              ASYNC FEEDBACK (every hour)

User Request
      │
      ▼
┌───────────────────────┐
│   Semantic Cache      │  pgvector similarity
└──────────┬────────────┘  hit? return, 0 tokens
           │ miss
           ▼
┌───────────────────────┐
│   OpenClaw Gateway    │  :18789 on claws-mac-mini
│  + Logging Middleware │  async write to Supabase
└──────────┬────────────┘
           ▼
┌───────────────────────┐
│   Router / Patterns   │  reads task_patterns
└──────────┬────────────┘  checks throttle flag
       ┌───┴───┐
       ▼       ▼
     Local   Frontier           ───►  Research Loop
     NoClaw  Claude API               scores quality
     :11434  (escalated)              classifies tasks
       └───┬───┘                      updates patterns
           ▼                               │
┌───────────────────────┐                  ▼
│    Response + Log     │   ───►  PostHog + Fine-Tune
└───────────────────────┘
```
Two-phase build: Phase 1 (current) runs heavy recon — defaults local, logs everything, no routing rules. Phase 2 kicks in after 2–4 weeks of data, when the gateway takes over escalation decisions and you're out of the loop.
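The Phase 2 routing decision can be sketched as a small pure function. This is illustrative only: the 0.7 confidence cutoff is an assumption, and the default local model name is borrowed from the schema's example values.

```typescript
// Hypothetical sketch of the Phase 2 routing decision. Types, the 0.7
// confidence cutoff, and the default model are assumptions, not the spec.
type Pattern = { recommendedModel: string; confidence: number };

interface RouteInput {
  throttleToLocal: boolean; // cost-spike lockout flag from team_efficiency
  pattern?: Pattern;        // learned rule from task_patterns, if one exists
}

const LOCAL_DEFAULT = "mlx:mistral-7b"; // example local model from the schema

// Phase 1 behavior is "always local"; Phase 2 consults learned patterns.
function routeModel(input: RouteInput): string {
  if (input.throttleToLocal) return LOCAL_DEFAULT;      // throttled users stay local
  if (input.pattern && input.pattern.confidence >= 0.7) // trust established patterns
    return input.pattern.recommendedModel;
  return LOCAL_DEFAULT;                                 // default: local-first
}
```

The local-first default is what makes Phase 1 a pure recon phase: with no patterns written yet, every request falls through to local.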

Stack by Layer

| Layer | Tool | Location |
| --- | --- | --- |
| Gateway | OpenClaw | claws-mac-mini :18789 |
| Local inference | NoClaw / MLX + Ollama | claws-mac-mini :11434 |
| Frontier | Claude API via OpenClaw | Anthropic |
| Log storage | Supabase (pgvector) | Hosted |
| Observability | PostHog | Hosted |
| Fine-tuning | TurboQuant | claws-mac-mini |
| Analysis loop | Cron / LaunchAgent | claws-mac-mini |
| Cache | pgvector semantic search | Supabase |
| Dashboard | Cloudflare Pages | token-machine-dashboard.pages.dev |

Source Layout

```
src/
  gateway/    # OpenClaw middleware, logger, cost calculator
  research/   # Hourly loop: scorer, classifier, patterns, efficiency, decisions
  cache/      # Semantic lookup + embedder
  finetune/   # TurboQuant orchestrator, exporter, register, deploy
  telemetry/  # PostHog client + typed event emitters
  dashboard/  # Single-file React app (Cloudflare Pages)
  db/         # Supabase client
```
SUPABASE · PGVECTOR

Database Schema

Four tables carry the entire system. Every request lands in requests. Everything else is derived — patterns, grades, and the model registry are all computed from the request log.

requests — every prompt/response pair

```sql
id                  UUID PRIMARY KEY DEFAULT gen_random_uuid()
user_id             TEXT NOT NULL       -- 'jordan', 'sarah', etc.
department          TEXT                -- 'data_audit', 'coding', ...
task_classification TEXT                -- populated by research loop
prompt_text         TEXT NOT NULL
prompt_embedding    VECTOR(1536)        -- post-analysis
model_used          TEXT NOT NULL       -- 'mlx:mistral-7b', 'claude-opus-4'
input_tokens        INT
output_tokens       INT
total_cost_usd      NUMERIC(10,6)
output_text         TEXT
quality_score       NUMERIC(3,2)        -- 0.00–1.00, hourly fill
efficiency_grade    CHAR(1)             -- A/B/C/D/F
latency_ms          INT
created_at          TIMESTAMPTZ DEFAULT NOW()
```
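The gateway's cost calculator fills total_cost_usd from the token counts. A minimal sketch; the per-million-token rates below are placeholders, not real Anthropic or local pricing:

```typescript
// Illustrative cost calculator for the total_cost_usd column. The rates
// here are invented placeholders, not actual API pricing.
const PRICE_PER_MTOK_USD: Record<string, { input: number; output: number }> = {
  "claude-opus-4": { input: 15, output: 75 }, // hypothetical frontier rates
  "mlx:mistral-7b": { input: 0, output: 0 },  // local inference: no API spend
};

function requestCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICE_PER_MTOK_USD[model] ?? { input: 0, output: 0 };
  const usd = (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
  return Number(usd.toFixed(6)); // matches the NUMERIC(10,6) column precision
}
```

Local models costing zero is what makes the "zero frontier spend" target measurable directly from this column.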

task_patterns — learned routing rules

```sql
task_type           TEXT NOT NULL       -- 'data_analysis', 'copywriting'
user_id             TEXT                -- NULL = all users
pattern_confidence  NUMERIC(3,2)
avg_quality_score   NUMERIC(3,2)
recommended_model   TEXT                -- updated by loop
occurrences         INT DEFAULT 0
sample_prompts      TEXT[]
fine_tune_triggered BOOLEAN DEFAULT false
last_updated        TIMESTAMPTZ
```

model_versions — fine-tuned registry

```sql
model_name          TEXT NOT NULL       -- 'myosin_data_audit_v1'
base_model          TEXT NOT NULL       -- 'mistral-7b'
quantization        TEXT                -- 'turbo_quant_int4'
fine_tune_source    TEXT                -- 'auto_research_2026_04_10'
avg_quality_score   NUMERIC(3,2)
token_efficiency    NUMERIC(10,4)       -- quality per dollar
deployed_at         TIMESTAMPTZ
```

team_efficiency — rolling per-user grades

```sql
user_id             TEXT PRIMARY KEY
requests_today      INT
tokens_today        INT
cost_today          NUMERIC(10,4)
avg_quality_score   NUMERIC(3,2)
efficiency_grade    CHAR(1)
wrong_model_rate    NUMERIC(3,2)        -- % escalations that could've been local
prompt_quality_avg  NUMERIC(3,2)        -- avg specificity/structure
throttle_to_local   BOOLEAN DEFAULT false
last_updated        TIMESTAMPTZ
```
Why only four tables: The raw requests log is the single source of truth. Every other table is a materialized view the research loop rebuilds hourly. Blow it away and the system reconstructs — only the logs matter.
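As an illustration of that derivation, a per-user team_efficiency row can be rebuilt from raw request rows alone. A simplified sketch; field names follow the schema above, but the row shape and aggregation are illustrative:

```typescript
// Sketch of the "everything is derived" claim: rebuilding a team_efficiency
// row purely from the raw request log. Types are simplified for illustration.
interface RequestRow {
  userId: string;
  totalCostUsd: number;
  qualityScore: number;
  tokens: number;
}

function rebuildUserRow(log: RequestRow[], userId: string) {
  const mine = log.filter(r => r.userId === userId);
  const n = mine.length;
  return {
    user_id: userId,
    requests_today: n,
    tokens_today: mine.reduce((s, r) => s + r.tokens, 0),
    cost_today: mine.reduce((s, r) => s + r.totalCostUsd, 0),
    // average quality over the user's requests; 0 when there is no data yet
    avg_quality_score: n ? mine.reduce((s, r) => s + r.qualityScore, 0) / n : 0,
  };
}
```

Because the function takes only the log as input, dropping and recomputing the derived table is always safe.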

Migrations

001_initial_schema.sql: 4 core tables
002_semantic_cache.sql: pgvector extension + RPC function
003_decision_fields.sql: throttle_to_local, fine_tune_triggered
HOURLY CRON

Auto-Research Loop

Runs every hour on claws-mac-mini. Reads the last hour of logs, scores quality, classifies tasks, detects patterns, updates routing rules, fires PostHog events, and checks fine-tune triggers — in that order.

Cycle Steps

1. Pull: last hour from requests table
2. Score quality: heuristic composite 0.0–1.0
3. Classify task: embedding similarity + keyword
4. Update efficiency: team grades, wrong-model rate
5. Detect patterns: always-local, always-escalate, coaching flags
6. Write to task_patterns: upsert recommended_model
7. Emit PostHog events: per request + hourly summary + anomalies
8. Evaluate fine-tune triggers: kick off TurboQuant if thresholds met
Schedule: 3× daily minimum — 8am, 1pm, 7pm. Hourly when the host is up. Backfill on restart so no log window is ever skipped.
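Step 3 combines embedding similarity with keyword matching. A minimal sketch of the keyword half; the task taxonomy and keyword lists here are invented examples, not the real ones in classifier.ts:

```typescript
// Sketch of the keyword half of task classification (step 3). The taxonomy
// and keyword lists are invented for illustration.
const KEYWORDS: Record<string, string[]> = {
  data_analysis: ["csv", "sql", "aggregate", "pivot"],
  copywriting: ["headline", "tagline", "tone", "rewrite"],
};

function classifyByKeywords(prompt: string): string | null {
  const text = prompt.toLowerCase();
  let best: string | null = null;
  let bestHits = 0;
  for (const [taskType, words] of Object.entries(KEYWORDS)) {
    const hits = words.filter(w => text.includes(w)).length;
    if (hits > bestHits) {
      bestHits = hits;
      best = taskType;
    }
  }
  return best; // null: fall through to embedding similarity
}
```

Returning null on no match is the important part of the design: keywords are a cheap first pass, and ambiguous prompts defer to the embedding comparison.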

Run It

```shell
# One-off
npx tsx src/research/loop.ts

# Cron (hourly, :05 to avoid log-write contention)
5 * * * * cd /path/to/token-machine && npx tsx src/research/loop.ts >> /var/log/tm.log 2>&1
```

Source Layout

| File | Role |
| --- | --- |
| loop.ts | orchestrator — entry point, calls the others in order |
| scorer.ts | heuristic quality score 0.0–1.0 |
| classifier.ts | task_type classification via embeddings |
| patterns.ts | pattern detection + Supabase upserts |
| efficiency.ts | team grade calculation |
| decisions.ts | autonomous threshold engine |

PostHog Events It Emits

```typescript
posthog.capture('token_machine.request', {
  user_id, task_type, model_used,
  input_tokens, output_tokens, cost_usd,
  quality_score, efficiency_grade, latency_ms
})

posthog.capture('token_machine.team_summary', {
  total_requests, total_cost, avg_quality,
  worst_performer, best_performer,
  escalation_candidates: ['task_type_a', 'task_type_b']
})

posthog.capture('token_machine.anomaly', {
  type: 'wrong_model' | 'poor_prompt' | 'cost_spike',
  user_id, task_type, recommendation
})
```
QUALITY SCORING

Grades

Every output gets a 0.0–1.0 composite score. Every team member gets a rolling A–F grade computed from the scores of their requests. The leaderboard is a by-product, not the goal.

Grade Bands

A
0.85–1.0
Excellent model fit + prompt quality
B
0.70–0.84
Good, minor improvements possible
C
0.55–0.69
Mediocre — model or prompt mismatch
D
0.40–0.54
Poor — likely wrong model tier
F
< 0.40
Critical — coaching brief triggered
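The bands translate directly into a lookup function:

```typescript
// Direct translation of the A–F bands above into code.
function gradeFromScore(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 0.85) return "A";
  if (score >= 0.70) return "B";
  if (score >= 0.55) return "C";
  if (score >= 0.40) return "D";
  return "F"; // < 0.40: critical, coaching brief triggered
}
```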

Quality Score Composition

| Signal | Weight | Method |
| --- | --- | --- |
| Response coherence | 25% | Embedding cosine sim: prompt intent vs output |
| Task completion | 25% | Cheap LLM judge — did it address every prompt element? |
| Brevity ratio | 20% | Output tokens / task complexity (penalize verbosity) |
| Re-prompt rate | 20% | Did user follow up with a correction/clarification? |
| Cost efficiency | 10% | Quality per dollar |
What the team grade actually measures: (1) model-tier fit — are you hitting local when local can handle it? (2) prompt quality — specific, scoped, structured? (3) cost per unit of good output — tokens aren't bad; wasted tokens are. (4) re-prompt rate — high re-prompts = low first-shot quality.
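The weighted blend in the table above can be sketched as a pure function, with each signal already normalized to 0–1. Computing the individual signals (embeddings, the LLM judge) is out of scope here; this only shows the composite:

```typescript
// The weighted composite from the signal table. Inputs are assumed to be
// pre-normalized to 0–1 by upstream scoring code.
interface Signals {
  coherence: number;  // 25%: embedding cosine sim, prompt intent vs output
  completion: number; // 25%: cheap LLM judge
  brevity: number;    // 20%: output tokens vs task complexity
  rePrompt: number;   // 20%: 1.0 if no correction followed, lower otherwise
  costEff: number;    // 10%: quality per dollar, normalized
}

function qualityScore(s: Signals): number {
  const raw =
    0.25 * s.coherence +
    0.25 * s.completion +
    0.20 * s.brevity +
    0.20 * s.rePrompt +
    0.10 * s.costEff;
  return Math.min(1, Math.max(0, raw)); // clamp to the 0.00–1.00 column range
}
```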

Grade Decay

Grades are rolling, not cumulative. Each hour the loop recomputes per-user averages from the trailing 24-hour window. A rough day doesn't tank you forever, and a good streak doesn't mask recent slippage.

  • requests_today, tokens_today, cost_today — reset at midnight local
  • avg_quality_score — trailing 24h
  • wrong_model_rate — % of escalated requests whose task type scores >0.8 on local
  • prompt_quality_avg — trailing 72h (smoother signal for coaching triggers)
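The trailing-window recompute can be sketched as a filter plus an average, so old scores age out instead of accumulating:

```typescript
// Sketch of the rolling-window recompute. Rows outside the window are
// simply excluded; nothing is ever carried forward.
interface Scored {
  qualityScore: number;
  createdAt: number; // epoch milliseconds
}

function trailingAvg(rows: Scored[], nowMs: number, windowMs: number): number {
  const recent = rows.filter(r => nowMs - r.createdAt <= windowMs);
  if (recent.length === 0) return 0;
  return recent.reduce((s, r) => s + r.qualityScore, 0) / recent.length;
}

// avg_quality_score uses a 24h window; prompt_quality_avg uses 72h.
const DAY_MS = 24 * 60 * 60 * 1000;
```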
AUTOMATIC THRESHOLDS

Decisions

Six conditions, six automatic actions. Once Phase 2 is live, the human doesn't review these — the system fires them and logs what happened.

Threshold Table

| Condition | Action |
| --- | --- |
| Task avg quality < 0.5 on local (10+ samples) | Auto-escalate this task type to frontier |
| Task avg quality > 0.8 on local (20+ samples) | Lock to local, stop frontier escalation |
| User prompt quality avg < 0.4 for 3+ days | Generate coaching brief → PostHog |
| Same task pattern 50+ times, quality > 0.75 | Trigger TurboQuant fine-tune job |
| Cost spike > 2× baseline for a user in 1 hr | Flag anomaly, throttle user to local only |
| Fine-tuned model beats base on eval set | Swap routing, update model_versions |
Why samples matter: The "10+ samples" and "20+ samples" gates keep the system from over-reacting to a single bad day. A task escalates only after it has consistently underperformed — not after one F-grade blip.
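The per-task rows of the table reduce to a pure decision function. The threshold numbers come from the table; the action tags and input shape are illustrative:

```typescript
// The escalate/lock thresholds (rows 1–2) as a pure function over per-task
// stats, plus the fine-tune trigger (row 4). Action names are illustrative.
interface TaskStats {
  avgQuality: number;
  samples: number;
  onLocal: boolean; // currently routed to a local model
}

type Action = "escalate_to_frontier" | "lock_to_local" | "none";

function taskDecision(s: TaskStats): Action {
  if (s.onLocal && s.avgQuality < 0.5 && s.samples >= 10) return "escalate_to_frontier";
  if (s.onLocal && s.avgQuality > 0.8 && s.samples >= 20) return "lock_to_local";
  return "none"; // not enough evidence either way
}

function shouldFineTune(occurrences: number, avgQuality: number): boolean {
  return occurrences >= 50 && avgQuality > 0.75;
}
```

The sample gates are what the "Why samples matter" note describes: a task with only five bad requests returns "none", no matter how low its average.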

What Stays Manual

Reviewing auto-decisions

Confirm via PostHog dashboards that the escalations/locks make sense — at least for the first month.

Tuning thresholds

The 0.5, 0.8, 50-sample numbers are starting values. Tune after week 1 of real data.

Approving first fine-tune

Gate the first TurboQuant job through a human until the pipeline proves itself.

Taxonomy updates

Adding new team members or task types as the agency evolves.

Decision Fields on team_efficiency

```sql
-- From migration 003_decision_fields.sql
throttle_to_local   BOOLEAN DEFAULT false  -- cost-spike lockout (team_efficiency)
fine_tune_triggered BOOLEAN DEFAULT false  -- on task_patterns row
```
The exit criteria: once you've validated auto-decisions are solid, remove yourself from the loop entirely. That's the whole point.
TURBOQUANT · 40 AGENTS

Fine-Tune Pipeline

When a task pattern racks up 50+ samples at quality > 0.75, the loop fires TurboQuant. The output is a named agent deployed into OpenClaw — and a new routing rule that points matching tasks at it.

Seven Stages

1. Export: requests of task_type where quality > 0.75
2. Format: instruction-following dataset (prompt → ideal output)
3. Augment (optional): Slack exports, Gmail threads, internal docs
4. Fine-tune: TurboQuant on the base model
5. Evaluate: holdout set through new + base, compare quality
6. Register: insert into model_versions if new wins
7. Deploy: push to OpenClaw as myosin_[task]_v1
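Stages 5 and 6 reduce to a comparison over the holdout set. A sketch with scoring abstracted away; types and names are illustrative:

```typescript
// Sketch of stages 5–6: run the holdout set through both models and only
// promote the fine-tune if its mean quality beats the base. How each score
// is produced (the quality scorer) is out of scope here.
interface EvalResult {
  model: string;
  scores: number[]; // per-example quality scores on the holdout set
}

const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;

// Returns the model that routing should point at after evaluation.
function pickWinner(base: EvalResult, tuned: EvalResult): string {
  return mean(tuned.scores) > mean(base.scores) ? tuned.model : base.model;
}
```

Using a strict greater-than means ties keep the base model, which is the conservative choice: a fine-tune only displaces the base when it measurably wins.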

Source Layout

| File | Role |
| --- | --- |
| pipeline.ts | orchestrator entry point |
| exporter.ts | pull training data from Supabase |
| turbo.ts | TurboQuant CLI wrapper |
| register.ts | write to model_versions + OpenClaw registry |
| deploy.ts | spin up named OpenClaw endpoint |
| augment.ts | optional Slack/Gmail/doc enrichment |

Caching Strategy

Layer 1 — Prompt cache

Anthropic's native prompt caching on system prompts + repeated context. Cuts frontier cost ~90% on repeated context.

Layer 2 — Semantic cache

pgvector check upstream of the gateway. If an embedding-similar prompt hit >0.8 quality in the last 24h, return the cached output. Zero tokens.
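The hit rule can be sketched as a similarity check plus the quality and recency gates. The 0.8 quality bar and 24h window come from the description above; the 0.9 cosine-similarity cutoff is an assumed value:

```typescript
// Sketch of the semantic-cache hit rule. In the real system the similarity
// search runs inside Supabase via a pgvector RPC; this shows the logic only.
interface CachedEntry {
  embedding: number[];
  qualityScore: number;
  ageHours: number;
  output: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function cacheLookup(query: number[], entries: CachedEntry[]): string | null {
  for (const e of entries) {
    // gates: high-quality answer, fresh enough, embedding-similar prompt
    if (e.qualityScore > 0.8 && e.ageHours <= 24 && cosine(query, e.embedding) >= 0.9)
      return e.output; // hit: zero tokens spent
  }
  return null; // miss: fall through to the gateway
}
```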

Target: 40 agents across 7 people — roughly one specialized local model per recurring task-type per operator. Each is a fine-tuned local model deployed as a named endpoint within OpenClaw.

Naming Convention

```shell
# Pattern
myosin_[task_type]_v[n]

# Examples
myosin_data_audit_v1
myosin_copywriting_v2
myosin_meeting_summary_v1
```
QUICK START · PHASED BUILD

Deploy

Eight Claude Code prompts in PLANNING/, run sequentially. Phase 0 scaffolds the project; Phases 2–7 can run in parallel once Phase 1 is done.

Quick Start

```shell
# Clone
git clone https://github.com/jhillbht/token-machine
cd token-machine

# Install
pnpm install

# Configure
cp .env.example .env
# Fill in: SUPABASE_URL, SUPABASE_SERVICE_KEY, OPENCLAW_URL, POSTHOG_API_KEY

# Run migrations
supabase db push

# Start research loop (once, then schedule via cron)
npx tsx src/research/loop.ts

# Smoke test
bash scripts/smoke-test.sh
```

Build Order

Phase 0 — Setup: scaffold + .claude/ bootstrap
Phase 1 — Logging Middleware: OpenClaw middleware + Supabase schema
Phase 2 — Research Loop: hourly cron, scorer, classifier
Phase 3 — PostHog: telemetry events + dashboards
Phase 4 — Team Dashboard: single-file React on Cloudflare Pages
Phase 5 — Semantic Cache: pgvector lookup upstream of gateway
Phase 6 — Decision Engine: six automatic thresholds live
Phase 7 — Fine-Tune Pipeline: TurboQuant end-to-end
Phase order matters: Phase 1 must come first — no data, no research. After Phase 2 you can fan out: PostHog, cache, and fine-tune are independent workstreams once the loop writes to task_patterns.

Environment Variables

| Key | Purpose |
| --- | --- |
| SUPABASE_URL | Hosted Supabase project URL |
| SUPABASE_SERVICE_KEY | Service-role key (loop writes every table) |
| OPENCLAW_URL | Gateway endpoint — default http://claws-mac-mini:18789 |
| POSTHOG_API_KEY | Project API key for telemetry |
| ANTHROPIC_API_KEY | Frontier escalation path |
| OLLAMA_URL | Local inference — default http://claws-mac-mini:11434 |
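A startup check for the keys the quick start marks as required helps the loop fail fast instead of mid-cycle. A minimal sketch; the required set here mirrors the .env comment in the quick start (the two optional keys have defaults):

```typescript
// Sketch of an env preflight check. The required list follows the quick
// start's "Fill in:" comment; this is an illustration, not project code.
const REQUIRED_ENV = [
  "SUPABASE_URL",
  "SUPABASE_SERVICE_KEY",
  "OPENCLAW_URL",
  "POSTHOG_API_KEY",
] as const;

function missingEnv(env: Record<string, string | undefined>): string[] {
  // empty string counts as missing, same as unset
  return REQUIRED_ENV.filter(k => !env[k]);
}

// Usage at process start:
// const missing = missingEnv(process.env);
// if (missing.length) throw new Error(`Missing env: ${missing.join(", ")}`);
```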

Per-Phase Commit

```shell
git add -A
git commit -m "feat(phase-X): [Phase Name] complete"
git push origin main
```

Links

Repository
jhillbht/token-machine
Phase Prompts
PLANNING/phase-0 … phase-7
Full Spec
ARCHITECTURE.md
Myosin Agency · 2026