AI INFRASTRUCTURE · MYOSIN AGENCY

Token Machine

Autonomous token routing, cost tracking, and efficiency optimization. Sits between your team and your AI models — logs every request, scores quality, grades team members, and rewires its own routing rules without a human in the loop.

Core idea: No dashboards to check. No routing rules to write. The system collects logs, grades each request, detects patterns, auto-escalates weak tasks to frontier models, locks strong tasks to local, and fine-tunes specialized agents once enough training data accumulates.
OpenClaw gateway
Supabase + pgvector
PostHog telemetry
TurboQuant fine-tune
Phase 1 — Heavy Recon

Explore the Guide

Architecture
Request flow from user → cache → gateway → local/frontier → logs → research loop.
gateway · cache
Schema
Four Supabase tables: requests, task_patterns, model_versions, team_efficiency.
supabase · pgvector
Research Loop
The hourly cron that scores quality, classifies tasks, and rewrites routing rules.
cron · scoring
Grades
A–F team-efficiency scoring. Composite of coherence, completion, brevity, re-prompts, cost.
quality · team
Decisions
The six automatic thresholds that escalate, lock, throttle, coach, and trigger fine-tunes.
automation
Pipeline
TurboQuant fine-tune: export → format → train → evaluate → deploy to OpenClaw as named agent.
turboquant · 40 agents
Deploy
Eight phases from scaffold to fine-tune pipeline. Quick start, env vars, smoke test.
phases · quickstart

Target State

7 team members

Myosin Agency operators, each routed through Token Machine by user_id.

40 specialized agents

Fine-tuned local models — roughly one per task-type per person — deployed as named OpenClaw endpoints.

Zero frontier spend

On tasks local handles well. Claude API only fires when local can't clear the quality bar.

Self-improving

Routing rules and model registry update automatically from the research loop. No manual tuning.

Quick Links

Repository
github.com/jhillbht/token-machine
Full Spec
ARCHITECTURE.md
Phase Prompts
PLANNING/ — eight files
REQUEST FLOW

Architecture

Every request flows through a cache, a gateway, a router, and a logger. An async loop reads the logs and mutates the routing rules. The user doesn't see any of it.

End-to-End Flow

```
SYNCHRONOUS PATH                              ASYNC FEEDBACK (every hour)

User Request
      │
      ▼
┌───────────────────────┐
│   Semantic Cache      │  pgvector similarity
└──────────┬────────────┘  hit? return, 0 tokens
           │ miss
           ▼
┌───────────────────────┐
│   OpenClaw Gateway    │  :18789 on claws-mac-mini
│  + Logging Middleware │  async write to Supabase
└──────────┬────────────┘
           ▼
┌───────────────────────┐
│   Router / Patterns   │  reads task_patterns
└──────────┬────────────┘  checks throttle flag
       ┌───┴───┐
       ▼       ▼
     Local   Frontier           ───►  Research Loop
     NoClaw  Claude API               scores quality
     :11434  (escalated)              classifies tasks
       └───┬───┘                      updates patterns
           ▼                               │
┌───────────────────────┐                  ▼
│    Response + Log     │   ───►  PostHog + Fine-Tune
└───────────────────────┘
```
Two-phase build: Phase 1 (current) runs heavy recon — defaults local, logs everything, no routing rules. Phase 2 kicks in after 2–4 weeks of data, when the gateway takes over escalation decisions and you're out of the loop.
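The Phase 2 routing decision can be sketched as a small pure function. This is illustrative only: the 0.7 confidence cutoff is an assumption, and the default local model name is borrowed from the schema's example values.

```typescript
// Hypothetical sketch of the Phase 2 routing decision. Types, the 0.7
// confidence cutoff, and the default model are assumptions, not the spec.
type Pattern = { recommendedModel: string; confidence: number };

interface RouteInput {
  throttleToLocal: boolean; // cost-spike lockout flag from team_efficiency
  pattern?: Pattern;        // learned rule from task_patterns, if one exists
}

const LOCAL_DEFAULT = "mlx:mistral-7b"; // example local model from the schema

// Phase 1 behavior is "always local"; Phase 2 consults learned patterns.
function routeModel(input: RouteInput): string {
  if (input.throttleToLocal) return LOCAL_DEFAULT;      // throttled users stay local
  if (input.pattern && input.pattern.confidence >= 0.7) // trust established patterns
    return input.pattern.recommendedModel;
  return LOCAL_DEFAULT;                                 // default: local-first
}
```

The local-first default is what makes Phase 1 a pure recon phase: with no patterns written yet, every request falls through to local.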

Stack by Layer

| Layer | Tool | Location |
| --- | --- | --- |
| Gateway | OpenClaw | claws-mac-mini :18789 |
| Local inference | NoClaw / MLX + Ollama | claws-mac-mini :11434 |
| Frontier | Claude API via OpenClaw | Anthropic |
| Log storage | Supabase (pgvector) | Hosted |
| Observability | PostHog | Hosted |
| Fine-tuning | TurboQuant | claws-mac-mini |
| Analysis loop | Cron / LaunchAgent | claws-mac-mini |
| Cache | pgvector semantic search | Supabase |
| Dashboard | Cloudflare Pages | token-machine-dashboard.pages.dev |

Source Layout

```
src/
  gateway/    # OpenClaw middleware, logger, cost calculator
  research/   # Hourly loop: scorer, classifier, patterns, efficiency, decisions
  cache/      # Semantic lookup + embedder
  finetune/   # TurboQuant orchestrator, exporter, register, deploy
  telemetry/  # PostHog client + typed event emitters
  dashboard/  # Single-file React app (Cloudflare Pages)
  db/         # Supabase client
```
SUPABASE · PGVECTOR

Database Schema

Four tables carry the entire system. Every request lands in requests. Everything else is derived — patterns, grades, and the model registry are all computed from the request log.

requests — every prompt/response pair

```sql
id                  UUID PRIMARY KEY DEFAULT gen_random_uuid()
user_id             TEXT NOT NULL       -- 'jordan', 'sarah', etc.
department          TEXT                -- 'data_audit', 'coding', ...
task_classification TEXT                -- populated by research loop
prompt_text         TEXT NOT NULL
prompt_embedding    VECTOR(1536)        -- post-analysis
model_used          TEXT NOT NULL       -- 'mlx:mistral-7b', 'claude-opus-4'
input_tokens        INT
output_tokens       INT
total_cost_usd      NUMERIC(10,6)
output_text         TEXT
quality_score       NUMERIC(3,2)        -- 0.00–1.00, hourly fill
efficiency_grade    CHAR(1)             -- A/B/C/D/F
latency_ms          INT
created_at          TIMESTAMPTZ DEFAULT NOW()
```
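The gateway's cost calculator fills total_cost_usd from the token counts. A minimal sketch; the per-million-token rates below are placeholders, not real Anthropic or local pricing:

```typescript
// Illustrative cost calculator for the total_cost_usd column. The rates
// here are invented placeholders, not actual API pricing.
const PRICE_PER_MTOK_USD: Record<string, { input: number; output: number }> = {
  "claude-opus-4": { input: 15, output: 75 }, // hypothetical frontier rates
  "mlx:mistral-7b": { input: 0, output: 0 },  // local inference: no API spend
};

function requestCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICE_PER_MTOK_USD[model] ?? { input: 0, output: 0 };
  const usd = (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
  return Number(usd.toFixed(6)); // matches the NUMERIC(10,6) column precision
}
```

Local models costing zero is what makes the "zero frontier spend" target measurable directly from this column.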

task_patterns — learned routing rules

```sql
task_type           TEXT NOT NULL       -- 'data_analysis', 'copywriting'
user_id             TEXT                -- NULL = all users
pattern_confidence  NUMERIC(3,2)
avg_quality_score   NUMERIC(3,2)
recommended_model   TEXT                -- updated by loop
occurrences         INT DEFAULT 0
sample_prompts      TEXT[]
fine_tune_triggered BOOLEAN DEFAULT false
last_updated        TIMESTAMPTZ
```

model_versions — fine-tuned registry

```sql
model_name          TEXT NOT NULL       -- 'myosin_data_audit_v1'
base_model          TEXT NOT NULL       -- 'mistral-7b'
quantization        TEXT                -- 'turbo_quant_int4'
fine_tune_source    TEXT                -- 'auto_research_2026_04_10'
avg_quality_score   NUMERIC(3,2)
token_efficiency    NUMERIC(10,4)       -- quality per dollar
deployed_at         TIMESTAMPTZ
```

team_efficiency — rolling per-user grades

```sql
user_id             TEXT PRIMARY KEY
requests_today      INT
tokens_today        INT
cost_today          NUMERIC(10,4)
avg_quality_score   NUMERIC(3,2)
efficiency_grade    CHAR(1)
wrong_model_rate    NUMERIC(3,2)        -- % escalations that could've been local
prompt_quality_avg  NUMERIC(3,2)        -- avg specificity/structure
throttle_to_local   BOOLEAN DEFAULT false
last_updated        TIMESTAMPTZ
```
Why only four tables: The raw requests log is the single source of truth. Every other table is a materialized view the research loop rebuilds hourly. Blow it away and the system reconstructs — only the logs matter.
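As an illustration of that derivation, a per-user team_efficiency row can be rebuilt from raw request rows alone. A simplified sketch; field names follow the schema above, but the row shape and aggregation are illustrative:

```typescript
// Sketch of the "everything is derived" claim: rebuilding a team_efficiency
// row purely from the raw request log. Types are simplified for illustration.
interface RequestRow {
  userId: string;
  totalCostUsd: number;
  qualityScore: number;
  tokens: number;
}

function rebuildUserRow(log: RequestRow[], userId: string) {
  const mine = log.filter(r => r.userId === userId);
  const n = mine.length;
  return {
    user_id: userId,
    requests_today: n,
    tokens_today: mine.reduce((s, r) => s + r.tokens, 0),
    cost_today: mine.reduce((s, r) => s + r.totalCostUsd, 0),
    // average quality over the user's requests; 0 when there is no data yet
    avg_quality_score: n ? mine.reduce((s, r) => s + r.qualityScore, 0) / n : 0,
  };
}
```

Because the function takes only the log as input, dropping and recomputing the derived table is always safe.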

Migrations

001_initial_schema.sql: 4 core tables
002_semantic_cache.sql: pgvector extension + RPC function
003_decision_fields.sql: throttle_to_local, fine_tune_triggered
HOURLY CRON

Auto-Research Loop

Runs every hour on claws-mac-mini. Reads the last hour of logs, scores quality, classifies tasks, detects patterns, updates routing rules, fires PostHog events, and checks fine-tune triggers — in that order.

Cycle Steps

1. Pull: last hour from requests table
2. Score quality: heuristic composite 0.0–1.0
3. Classify task: embedding similarity + keyword
4. Update efficiency: team grades, wrong-model rate
5. Detect patterns: always-local, always-escalate, coaching flags
6. Write to task_patterns: upsert recommended_model
7. Emit PostHog events: per request + hourly summary + anomalies
8. Evaluate fine-tune triggers: kick off TurboQuant if thresholds met
Schedule: 3× daily minimum — 8am, 1pm, 7pm. Hourly when the host is up. Backfill on restart so no log window is ever skipped.
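Step 3 combines embedding similarity with keyword matching. A minimal sketch of the keyword half; the task taxonomy and keyword lists here are invented examples, not the real ones in classifier.ts:

```typescript
// Sketch of the keyword half of task classification (step 3). The taxonomy
// and keyword lists are invented for illustration.
const KEYWORDS: Record<string, string[]> = {
  data_analysis: ["csv", "sql", "aggregate", "pivot"],
  copywriting: ["headline", "tagline", "tone", "rewrite"],
};

function classifyByKeywords(prompt: string): string | null {
  const text = prompt.toLowerCase();
  let best: string | null = null;
  let bestHits = 0;
  for (const [taskType, words] of Object.entries(KEYWORDS)) {
    const hits = words.filter(w => text.includes(w)).length;
    if (hits > bestHits) {
      bestHits = hits;
      best = taskType;
    }
  }
  return best; // null: fall through to embedding similarity
}
```

Returning null on no match is the important part of the design: keywords are a cheap first pass, and ambiguous prompts defer to the embedding comparison.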

Run It

```shell
# One-off
npx tsx src/research/loop.ts

# Cron (hourly, :05 to avoid log-write contention)
5 * * * * cd /path/to/token-machine && npx tsx src/research/loop.ts >> /var/log/tm.log 2>&1
```

Source Layout

| File | Role |
| --- | --- |
| loop.ts | orchestrator — entry point, calls the others in order |
| scorer.ts | heuristic quality score 0.0–1.0 |
| classifier.ts | task_type classification via embeddings |
| patterns.ts | pattern detection + Supabase upserts |
| efficiency.ts | team grade calculation |
| decisions.ts | autonomous threshold engine |

PostHog Events It Emits

```typescript
posthog.capture('token_machine.request', {
  user_id, task_type, model_used,
  input_tokens, output_tokens, cost_usd,
  quality_score, efficiency_grade, latency_ms
})

posthog.capture('token_machine.team_summary', {
  total_requests, total_cost, avg_quality,
  worst_performer, best_performer,
  escalation_candidates: ['task_type_a', 'task_type_b']
})

posthog.capture('token_machine.anomaly', {
  type: 'wrong_model' | 'poor_prompt' | 'cost_spike',
  user_id, task_type, recommendation
})
```
QUALITY SCORING

Grades

Every output gets a 0.0–1.0 composite score. Every team member gets a rolling A–F grade computed from the scores of their requests. The leaderboard is a by-product, not the goal.

Grade Bands

A
0.85–1.0
Excellent model fit + prompt quality
B
0.70–0.84
Good, minor improvements possible
C
0.55–0.69
Mediocre — model or prompt mismatch
D
0.40–0.54
Poor — likely wrong model tier
F
< 0.40
Critical — coaching brief triggered
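The bands translate directly into a lookup function:

```typescript
// Direct translation of the A–F bands above into code.
function gradeFromScore(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 0.85) return "A";
  if (score >= 0.70) return "B";
  if (score >= 0.55) return "C";
  if (score >= 0.40) return "D";
  return "F"; // < 0.40: critical, coaching brief triggered
}
```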

Quality Score Composition

| Signal | Weight | Method |
| --- | --- | --- |
| Response coherence | 25% | Embedding cosine sim: prompt intent vs output |
| Task completion | 25% | Cheap LLM judge — did it address every prompt element? |
| Brevity ratio | 20% | Output tokens / task complexity (penalize verbosity) |
| Re-prompt rate | 20% | Did user follow up with a correction/clarification? |
| Cost efficiency | 10% | Quality per dollar |
What the team grade actually measures: (1) model-tier fit — are you hitting local when local can handle it? (2) prompt quality — specific, scoped, structured? (3) cost per unit of good output — tokens aren't bad; wasted tokens are. (4) re-prompt rate — high re-prompts = low first-shot quality.
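The weighted blend in the table above can be sketched as a pure function, with each signal already normalized to 0–1. Computing the individual signals (embeddings, the LLM judge) is out of scope here; this only shows the composite:

```typescript
// The weighted composite from the signal table. Inputs are assumed to be
// pre-normalized to 0–1 by upstream scoring code.
interface Signals {
  coherence: number;  // 25%: embedding cosine sim, prompt intent vs output
  completion: number; // 25%: cheap LLM judge
  brevity: number;    // 20%: output tokens vs task complexity
  rePrompt: number;   // 20%: 1.0 if no correction followed, lower otherwise
  costEff: number;    // 10%: quality per dollar, normalized
}

function qualityScore(s: Signals): number {
  const raw =
    0.25 * s.coherence +
    0.25 * s.completion +
    0.20 * s.brevity +
    0.20 * s.rePrompt +
    0.10 * s.costEff;
  return Math.min(1, Math.max(0, raw)); // clamp to the 0.00–1.00 column range
}
```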

Grade Decay

Grades are rolling, not cumulative. Each hour the loop recomputes per-user averages from the trailing 24-hour window. A rough day doesn't tank you forever, and a good streak doesn't mask recent slippage.

  • requests_today, tokens_today, cost_today — reset at midnight local
  • avg_quality_score — trailing 24h
  • wrong_model_rate — % of escalated requests whose task type scores >0.8 on local
  • prompt_quality_avg — trailing 72h (smoother signal for coaching triggers)
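The trailing-window recompute can be sketched as a filter plus an average, so old scores age out instead of accumulating:

```typescript
// Sketch of the rolling-window recompute. Rows outside the window are
// simply excluded; nothing is ever carried forward.
interface Scored {
  qualityScore: number;
  createdAt: number; // epoch milliseconds
}

function trailingAvg(rows: Scored[], nowMs: number, windowMs: number): number {
  const recent = rows.filter(r => nowMs - r.createdAt <= windowMs);
  if (recent.length === 0) return 0;
  return recent.reduce((s, r) => s + r.qualityScore, 0) / recent.length;
}

// avg_quality_score uses a 24h window; prompt_quality_avg uses 72h.
const DAY_MS = 24 * 60 * 60 * 1000;
```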
AUTOMATIC THRESHOLDS

Decisions

Six conditions, six automatic actions. Once Phase 2 is live, the human doesn't review these — the system fires them and logs what happened.

Threshold Table

| Condition | Action |
| --- | --- |
| Task avg quality < 0.5 on local (10+ samples) | Auto-escalate this task type to frontier |
| Task avg quality > 0.8 on local (20+ samples) | Lock to local, stop frontier escalation |
| User prompt quality avg < 0.4 for 3+ days | Generate coaching brief → PostHog |
| Same task pattern 50+ times, quality > 0.75 | Trigger TurboQuant fine-tune job |
| Cost spike > 2× baseline for a user in 1 hr | Flag anomaly, throttle user to local only |
| Fine-tuned model beats base on eval set | Swap routing, update model_versions |
Why samples matter: The "10+ samples" and "20+ samples" gates keep the system from over-reacting to a single bad day. A task escalates only after it has consistently underperformed — not after one F-grade blip.
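The per-task rows of the table reduce to a pure decision function. The threshold numbers come from the table; the action tags and input shape are illustrative:

```typescript
// The escalate/lock thresholds (rows 1–2) as a pure function over per-task
// stats, plus the fine-tune trigger (row 4). Action names are illustrative.
interface TaskStats {
  avgQuality: number;
  samples: number;
  onLocal: boolean; // currently routed to a local model
}

type Action = "escalate_to_frontier" | "lock_to_local" | "none";

function taskDecision(s: TaskStats): Action {
  if (s.onLocal && s.avgQuality < 0.5 && s.samples >= 10) return "escalate_to_frontier";
  if (s.onLocal && s.avgQuality > 0.8 && s.samples >= 20) return "lock_to_local";
  return "none"; // not enough evidence either way
}

function shouldFineTune(occurrences: number, avgQuality: number): boolean {
  return occurrences >= 50 && avgQuality > 0.75;
}
```

The sample gates are what the "Why samples matter" note describes: a task with only five bad requests returns "none", no matter how low its average.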

What Stays Manual

Reviewing auto-decisions

Confirm via PostHog dashboards that the escalations/locks make sense — at least for the first month.

Tuning thresholds

The 0.5, 0.8, 50-sample numbers are starting values. Tune after week 1 of real data.

Approving first fine-tune

Gate the first TurboQuant job through a human until the pipeline proves itself.

Taxonomy updates

Adding new team members or task types as the agency evolves.

Decision Fields on team_efficiency

```sql
-- From migration 003_decision_fields.sql
throttle_to_local   BOOLEAN DEFAULT false  -- cost-spike lockout (team_efficiency)
fine_tune_triggered BOOLEAN DEFAULT false  -- on task_patterns row
```
The exit criteria: once you've validated auto-decisions are solid, remove yourself from the loop entirely. That's the whole point.
TURBOQUANT · 40 AGENTS

Fine-Tune Pipeline

When a task pattern racks up 50+ samples at quality > 0.75, the loop fires TurboQuant. The output is a named agent deployed into OpenClaw — and a new routing rule that points matching tasks at it.

Seven Stages

1. Export: requests of task_type where quality > 0.75
2. Format: instruction-following dataset (prompt → ideal output)
3. Augment (optional): Slack exports, Gmail threads, internal docs
4. Fine-tune: TurboQuant on the base model
5. Evaluate: holdout set through new + base, compare quality
6. Register: insert into model_versions if new wins
7. Deploy: push to OpenClaw as myosin_[task]_v1
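Stages 5 and 6 reduce to a comparison over the holdout set. A sketch with scoring abstracted away; types and names are illustrative:

```typescript
// Sketch of stages 5–6: run the holdout set through both models and only
// promote the fine-tune if its mean quality beats the base. How each score
// is produced (the quality scorer) is out of scope here.
interface EvalResult {
  model: string;
  scores: number[]; // per-example quality scores on the holdout set
}

const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;

// Returns the model that routing should point at after evaluation.
function pickWinner(base: EvalResult, tuned: EvalResult): string {
  return mean(tuned.scores) > mean(base.scores) ? tuned.model : base.model;
}
```

Using a strict greater-than means ties keep the base model, which is the conservative choice: a fine-tune only displaces the base when it measurably wins.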

Source Layout

| File | Role |
| --- | --- |
| pipeline.ts | orchestrator entry point |
| exporter.ts | pull training data from Supabase |
| turbo.ts | TurboQuant CLI wrapper |
| register.ts | write to model_versions + OpenClaw registry |
| deploy.ts | spin up named OpenClaw endpoint |
| augment.ts | optional Slack/Gmail/doc enrichment |

Caching Strategy

Layer 1 — Prompt cache

Anthropic's native prompt caching on system prompts + repeated context. Cuts frontier cost ~90% on repeated context.

Layer 2 — Semantic cache

pgvector check upstream of the gateway. If an embedding-similar prompt hit >0.8 quality in the last 24h, return the cached output. Zero tokens.
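The hit rule can be sketched as a similarity check plus the quality and recency gates. The 0.8 quality bar and 24h window come from the description above; the 0.9 cosine-similarity cutoff is an assumed value:

```typescript
// Sketch of the semantic-cache hit rule. In the real system the similarity
// search runs inside Supabase via a pgvector RPC; this shows the logic only.
interface CachedEntry {
  embedding: number[];
  qualityScore: number;
  ageHours: number;
  output: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function cacheLookup(query: number[], entries: CachedEntry[]): string | null {
  for (const e of entries) {
    // gates: high-quality answer, fresh enough, embedding-similar prompt
    if (e.qualityScore > 0.8 && e.ageHours <= 24 && cosine(query, e.embedding) >= 0.9)
      return e.output; // hit: zero tokens spent
  }
  return null; // miss: fall through to the gateway
}
```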

Target: 40 agents across 7 people — roughly one specialized local model per recurring task-type per operator. Each is a fine-tuned local model deployed as a named endpoint within OpenClaw.

Naming Convention

```shell
# Pattern
myosin_[task_type]_v[n]

# Examples
myosin_data_audit_v1
myosin_copywriting_v2
myosin_meeting_summary_v1
```
QUICK START · PHASED BUILD

Deploy

Eight Claude Code prompts in PLANNING/, run sequentially. Phase 0 scaffolds the project; Phases 2–7 can run in parallel once Phase 1 is done.

Quick Start

```shell
# Clone
git clone https://github.com/jhillbht/token-machine
cd token-machine

# Install
pnpm install

# Configure
cp .env.example .env
# Fill in: SUPABASE_URL, SUPABASE_SERVICE_KEY, OPENCLAW_URL, POSTHOG_API_KEY

# Run migrations
supabase db push

# Start research loop (once, then schedule via cron)
npx tsx src/research/loop.ts

# Smoke test
bash scripts/smoke-test.sh
```

Build Order

Phase 0 — Setup: scaffold + .claude/ bootstrap
Phase 1 — Logging Middleware: OpenClaw middleware + Supabase schema
Phase 2 — Research Loop: hourly cron, scorer, classifier
Phase 3 — PostHog: telemetry events + dashboards
Phase 4 — Team Dashboard: single-file React on Cloudflare Pages
Phase 5 — Semantic Cache: pgvector lookup upstream of gateway
Phase 6 — Decision Engine: six automatic thresholds live
Phase 7 — Fine-Tune Pipeline: TurboQuant end-to-end
Phase order matters: Phase 1 must come first — no data, no research. After Phase 2 you can fan out: PostHog, cache, and fine-tune are independent workstreams once the loop writes to task_patterns.

Environment Variables

| Key | Purpose |
| --- | --- |
| SUPABASE_URL | Hosted Supabase project URL |
| SUPABASE_SERVICE_KEY | Service-role key (loop writes every table) |
| OPENCLAW_URL | Gateway endpoint — default http://claws-mac-mini:18789 |
| POSTHOG_API_KEY | Project API key for telemetry |
| ANTHROPIC_API_KEY | Frontier escalation path |
| OLLAMA_URL | Local inference — default http://claws-mac-mini:11434 |
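A startup check for the keys the quick start marks as required helps the loop fail fast instead of mid-cycle. A minimal sketch; the required set here mirrors the .env comment in the quick start (the two optional keys have defaults):

```typescript
// Sketch of an env preflight check. The required list follows the quick
// start's "Fill in:" comment; this is an illustration, not project code.
const REQUIRED_ENV = [
  "SUPABASE_URL",
  "SUPABASE_SERVICE_KEY",
  "OPENCLAW_URL",
  "POSTHOG_API_KEY",
] as const;

function missingEnv(env: Record<string, string | undefined>): string[] {
  // empty string counts as missing, same as unset
  return REQUIRED_ENV.filter(k => !env[k]);
}

// Usage at process start:
// const missing = missingEnv(process.env);
// if (missing.length) throw new Error(`Missing env: ${missing.join(", ")}`);
```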

Per-Phase Commit

```shell
git add -A
git commit -m "feat(phase-X): [Phase Name] complete"
git push origin main
```

Links

Repository
jhillbht/token-machine
Phase Prompts
PLANNING/phase-0 … phase-7
Full Spec
ARCHITECTURE.md
Myosin Agency · 2026