Brewlabs / May 2026

What If the AI Reads the Movie While Everyone Else Looks at Photographs?

An experiment in building an AI-native capital allocator from first principles. Not a faster quant. Not a robo-advisor. A system that reads the continuous loop of every company, finds connections humans can't hold in working memory, and gets better at it over time.

The Starting Question

Every hedge fund uses AI now. They bolt machine learning onto existing strategies: faster factor models, better signal extraction, automated execution. This is the equivalent of putting a jet engine on a horse.

The question I wanted to explore: what does a fund look like when it's AI-native from the ground up? Not AI-assisted. AI-native. Where the core capability is something that simply couldn't exist before large language models.

I started by researching what exists. Abundance, founded by Apoorva Mehta (Instacart), raised $100M to build an "AI Capital Allocator." Their job page mentions "self-improving agents," "token efficiency," and "long-running agents (20+ hours)." They're trading prop capital, market-neutral, high Sharpe. Deliberately closed and private.

In crypto, there are autonomous portfolio agents (StockPilot AI, KaleidoAgent, Aura) managing on-chain positions. In TradFi, Anthropic shipped financial services agents that draft pitch decks and review statements.

As far as I can tell, nobody is building what I'm describing. The crypto agents trade DeFi. The TradFi agents are assistive. The gap: an AI system that reasons about the world and allocates capital based on that reasoning, at a scale a retail investor can actually run.


Three Wrong Turns (That Taught Me the Right Direction)

I went through three failed frames before landing on something that held up to scrutiny. Each failure was instructive.

Wrong Turn 1: Autoresearch for Allocation

Karpathy's autoresearch is elegant: an AI agent modifies training code, runs experiments for 5 minutes, measures val_bpb, keeps improvements, discards regressions. 100 experiments overnight. You wake up to a better model.

I tried mapping this to allocation. The agent modifies allocate.py, backtests for 5 minutes, measures Sharpe ratio, keeps/discards. Same loop.

Why it breaks: Autoresearch works because val_bpb is a clean, high-sample metric (millions of tokens, tight confidence intervals). Daily stock returns give you only ~250 samples per year, with massive noise. The evaluation signal is too weak for the loop to converge meaningfully. You'd just be overfitting to historical accidents.
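To put a rough number on that noise, here is a back-of-envelope sketch. It assumes independent, identically distributed daily returns and 252 trading days per year (both simplifications of mine, and real returns are messier), and uses the standard large-sample approximation for the standard error of a Sharpe estimate. The function name is hypothetical.

```python
import math

TRADING_DAYS = 252  # assumed number of daily return samples per year


def sharpe_standard_error(annual_sharpe: float, years: float) -> float:
    """Approximate standard error of an annualized Sharpe estimate.

    Uses the large-sample iid approximation SE ~= sqrt((1 + SR^2 / 2) / n)
    on the per-day Sharpe, then rescales the result to annual units.
    """
    n = TRADING_DAYS * years
    daily_sharpe = annual_sharpe / math.sqrt(TRADING_DAYS)
    daily_se = math.sqrt((1 + 0.5 * daily_sharpe ** 2) / n)
    return daily_se * math.sqrt(TRADING_DAYS)


for years in (1, 5, 10):
    se = sharpe_standard_error(1.0, years)
    print(f"{years:>2} years of daily data: Sharpe 1.0 +/- {se:.2f}")
```

Under these assumptions, one year of daily data pins down a true Sharpe of 1.0 only to within about plus or minus 1.0, and even ten years leaves an error of about 0.3. A 5-minute backtest loop cannot tell a real improvement from luck at that resolution.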

Wrong Turn 2: Signal Discovery from Alternative Data

If autoresearch can't optimize allocation directly, what about using it to discover signals? Feed the agent World Monitor data (geopolitical events, ship tracking, satellite imagery) and let it find correlations with stock returns.

Why it breaks: This is still quant pattern matching. I just swapped the input data from price/fundamentals to OSINT. The method is the same: mine historical data for statistical correlations. Every quant fund does this. Running it with a fancier data source doesn't change the epistemology.

Wrong Turn 3: Optimize the Reasoning Framework

OK, forget optimizing code or signals. What if the autoresearch loop optimizes the reasoning process itself? The modifiable file isn't code, it's a reasoning framework in markdown. The agent tests whether changes to how it thinks produce better predictions.

Why it breaks: You can't evaluate reasoning quality by checking if the stock went up. A thesis can be perfectly reasoned and lose money from an unforeseeable event. Over 100 experiments on a few years of data, the sample is too small to separate good reasoning from lucky reasoning. You'd end up "prompt-hacking" -- finding prompts that exploit quirks in Claude's behavior on specific historical scenarios.

Three attempts. Same fundamental problem: historical returns are a noisy, low-sample, non-stationary signal to optimize against. The autoresearch loop doesn't map to allocation because allocation has no clean, high-sample evaluation metric like val_bpb.


The Reframe: Companies Are Readable Loops

The breakthrough came from a different direction entirely. Garry Tan's essay "Thin Harness, Fat Skills" describes a system architecture where intelligence lives in markdown skill files that encode judgment, while execution lives in deterministic code. Skills are method calls. They compound. They never degrade.

The YC Startup School example was the key analogy. 6,000 founders, each with structured applications, questionnaire answers, advisor transcripts, and public signals. A /enrich-founder skill diarizes each founder: what they SAY versus what they're ACTUALLY BUILDING. A /match skill synthesizes across all 6,000 profiles to find connections no individual profile reveals. A /improve skill reads event feedback and rewrites rules back into the matching skills.

Map that to capital allocation:

YC Matching                               | Capital Allocation
------------------------------------------|------------------------------------------------------------------
6,000 founder profiles                    | 500 S&P companies
/enrich-founder: read everything, diarize | /diarize-company: read everything, find "SAYS vs. ACTUALLY DOING"
/match: cluster by affinity               | /connect: find cross-company patterns
/improve: read NPS, rewrite skills        | /improve: read outcomes, rewrite skills
12% "OK" ratings → 4%                     | Prediction accuracy improves over quarters

But here's the deeper insight. Every company runs a loop:

Strategy → Execution → Results → Revised Strategy → ...

Financial statements are snapshots of where that loop is at a given moment. Lagging indicators. By the time you see the 10-Q, the loop has already moved.

But the loop itself is readable in real time:

Loop Signal           | Data Source                         | What It Reveals
----------------------|-------------------------------------|------------------------------
Hiring or firing?     | Job postings                        | Loop expanding or contracting
Shipping product?     | GitHub, changelogs, app store       | Execution velocity
Customers happy?      | App reviews, social sentiment       | Retention health
Supply chain intact?  | Ship tracking, World Monitor        | Operational risk
Leadership confident? | Earnings call tone, insider filings | Forward conviction
Investing in future?  | Job postings + capex + patents      | Future loop strength

The fund doesn't analyze stocks. It reads loops. A decelerating loop with a bullish market narrative is a short. An accelerating loop the market hasn't noticed is a long. The divergence between loop reality and market narrative IS the alpha.
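The long/short rule above can be sketched as a small function. Everything here is illustrative: the two scores, their [-1, 1] scale, and the `min_gap` threshold are my assumptions, not part of any existing system.

```python
from dataclasses import dataclass


@dataclass
class LoopReading:
    ticker: str
    loop_momentum: float        # hypothetical score in [-1, 1]: is the loop accelerating?
    narrative_sentiment: float  # hypothetical score in [-1, 1]: how bullish is the market story?


def classify(reading: LoopReading, min_gap: float = 0.5) -> str:
    """Return 'long', 'short', or 'no position' from the loop/narrative gap."""
    gap = reading.loop_momentum - reading.narrative_sentiment
    if gap >= min_gap and reading.loop_momentum > 0:
        return "long"     # accelerating loop the market has not noticed
    if gap <= -min_gap and reading.loop_momentum < 0:
        return "short"    # decelerating loop under a bullish narrative
    return "no position"  # loop and narrative roughly agree


print(classify(LoopReading("AAA", loop_momentum=0.7, narrative_sentiment=0.0)))
print(classify(LoopReading("BBB", loop_momentum=-0.6, narrative_sentiment=0.5)))
```

The point of the sketch is that neither score matters alone. Only the gap between them produces a position.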


The AI-Native Edge: Three Things Humans Can't Do

This isn't about reading faster. It's about a cognitive act that human working memory physically cannot perform.

A single analyst can go deep on 15-20 companies. They know everything about those companies. But they can't see across 500.

A macro strategist can see the big picture. But they can't trace it to individual company loops.

A quant can see statistical patterns across everything. But they can't reason about why the pattern exists.

The AI holds all three simultaneously. Deep on every company, wide across the full universe, and connected to the macro. That's not a faster version of what humans do. It's different.

NVIDIA shifts supply chain roles +340%
  + Vietnam fabs breaking ground (World Monitor)
  + Apple posts 12 custom silicon roles (job data)
  + TSMC guides down on advanced node demand (earnings)
  + Shipping lane disruptions in Taiwan strait (AIS data)
  + Foxconn opens 2nd India plant (news)

HUMAN ANALYST sees:
  "NVIDIA supply chain diversification"

AI SEES:
  "The entire semiconductor industry is restructuring
   its geographic footprint simultaneously. This isn't
   an NVIDIA story. It's an industry phase transition.
   The market is pricing this as individual company
   decisions. It's actually a coordinated migration
   driven by geopolitical risk."

That connection -- across six signals, four named companies, and three geographies -- exists in the data right now. No single human can hold it. No quant model can reason about it.


Architecture: Thin Harness, Fat Skills

System Architecture
FAT SKILLS (judgment, synthesis, connection):
  /diarize-company    Read one company's loop deeply
  /connect-sector     Synthesize patterns within each sector
  /connect-cross      Find cross-sector connections + macro
  /allocate           Convert signals + connections into positions
  /improve            Learn from real outcomes, rewrite skills
  /watchlist          Human-in-the-loop: steer attention

THIN HARNESS (~200 lines):
  Cron on Mac Mini M4
  Schedule skills, manage context, enforce safety

DETERMINISTIC LAYER (same input, same output):
  Market data (Polygon.io) | Filings (SEC EDGAR)
  World events (World Monitor) | Execution (Alpaca API)
  Risk constraints (position limits, drawdowns)

Intelligence lives UP in the skills. Execution lives DOWN in deterministic code. The harness is thin -- it just schedules and routes.
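A minimal sketch of what "schedules and routes" could look like, assuming each skill is exposed to the harness as a plain callable. The registry, the stub skills, and the `MAX_CALLS_PER_RUN` safety limit are all hypothetical. A real skill runner would send the skill's markdown and the company's data to an LLM instead of returning a placeholder string.

```python
from typing import Callable, Dict, List, Tuple

Skill = Callable[[str], str]


def make_stub_skill(name: str) -> Skill:
    """Build a placeholder skill; stands in for an LLM-backed runner."""
    def run(arg: str) -> str:
        return f"[{name}] brief for {arg}"
    return run


# Hypothetical registry mapping skill names to runners.
SKILLS: Dict[str, Skill] = {
    "/diarize-company": make_stub_skill("/diarize-company"),
    "/connect-sector": make_stub_skill("/connect-sector"),
}

MAX_CALLS_PER_RUN = 100  # assumed safety limit, not from the original design


def run_nightly(schedule: List[Tuple[str, str]]) -> List[str]:
    """Route each (skill_name, argument) pair to its skill, in order."""
    if len(schedule) > MAX_CALLS_PER_RUN:
        raise ValueError("schedule exceeds the per-run safety limit")
    outputs = []
    for skill_name, arg in schedule:
        if skill_name not in SKILLS:
            raise KeyError(f"unknown skill: {skill_name}")
        outputs.append(SKILLS[skill_name](arg))
    return outputs


briefs = run_nightly([("/diarize-company", "NVDA"), ("/diarize-company", "AAPL")])
```

Cron would invoke a script like this once a night. The harness holds no judgment of its own, so it should stay small.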

Five design principles:

  1. Skills are method calls. /diarize-company NVDA and /diarize-company AAPL are the same skill, different parameters, different intelligence briefs.
  2. Every skill is a permanent upgrade. It never degrades. It never forgets. When the next model drops, every skill instantly gets better.
  3. Prose for humans, structured signals for machines. Every diarization outputs both a narrative thesis and a machine-readable signal block (conviction score, direction, time horizon, catalyst specificity).
  4. /connect is hierarchical. Sector-level synthesis first, then cross-sector. Each level stays within high-quality context limits.
  5. Improvement is real, not backtested. The /improve skill reads what actually happened, not what would have happened historically.
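Principle 3 could be sketched as a pair of data classes: one for the machine-readable signal block, one that pairs it with the prose thesis. The field names follow the list above, but the value scales (0.0 to 1.0) and the use of days for the horizon are my assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SignalBlock:
    ticker: str
    direction: str               # "long" or "short"
    conviction: float            # assumed scale: 0.0 (none) to 1.0 (maximum)
    horizon_days: int            # time horizon, assumed to be in days
    catalyst_specificity: float  # assumed scale: 0.0 (vague) to 1.0 (named, dated event)

    def __post_init__(self) -> None:
        # Reject malformed blocks before they reach downstream skills.
        if self.direction not in ("long", "short"):
            raise ValueError(f"bad direction: {self.direction!r}")
        for name in ("conviction", "catalyst_specificity"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be in [0, 1]")


@dataclass
class Diarization:
    thesis: str          # prose for humans
    signal: SignalBlock  # structured block for /connect and /allocate

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


example = Diarization(
    thesis="Placeholder thesis text.",
    signal=SignalBlock("NVDA", "long", 0.8, 90, 0.6),
)
```

Validating the block at construction time keeps a malformed LLM output from silently reaching the allocation step.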

The Recursive Loop

The Fund Loop
COMPANY LOOPS (what we read):
  Strategy → Build → Ship → Measure → Adjust → ...
  Readable via: filings, job posts, commits, reviews,
  shipping data, patents, insider filings, transcripts

FUND LOOP (what we run):
  /watchlist → You flag what deserves deeper attention
  /diarize   → Read each company's loop
  /connect   → Synthesize within-sector, then cross-sector
  /allocate  → Convert signals into positions
  [reality unfolds]
  /improve   → Learn from outcomes, rewrite all skills

  THE FUND'S LOOP GETS BETTER AT READING OTHER LOOPS.

Each /improve cycle makes /diarize sharper, /connect more selective, and /allocate better calibrated. The skills accumulate institutional knowledge in markdown -- human-readable, versionable, permanently available.

Here's what /improve output actually looks like:

PATTERN DISCOVERED (added to /diarize-company):
  When "CEO hedges on a geography for the first time"
  AND World Monitor shows escalation in that region:
  → Flag as high-conviction underperform signal.
  Evidence: ASML (Feb 2026), Qualcomm (Mar 2026).

FALSE PATTERN REMOVED (from /diarize-company):
  "Glassdoor sentiment decline → underperform" had
  zero predictive power over 6 months (IC: 0.01).
  8 companies, 4 up, 4 down. Random. Removed.

META-IMPROVEMENT (added to /connect):
  Connections citing 3+ data source types: 73% accuracy.
  Connections citing only 2 types: 48% (coin flip).
  → Require 3+ sources for high-conviction connections.

Rules are additive AND subtractive. Bad patterns get pruned. Skill files get cleaner over time, not just longer.
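One way the subtractive step might be made mechanical is an exact binomial test against a coin flip. This is a sketch of mine, not the /improve skill itself. The function names and the 0.05 threshold are assumptions.

```python
from math import comb


def two_sided_binomial_p(hits: int, n: int) -> float:
    """Exact two-sided p-value for `hits` out of `n` against a fair coin."""
    observed = comb(n, hits)
    # Sum the probability of every outcome at least as unlikely as the observed one.
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n


def indistinguishable_from_chance(hits: int, n: int, alpha: float = 0.05) -> bool:
    """True when the rule's hit rate cannot be told apart from a coin flip."""
    return two_sided_binomial_p(hits, n) > alpha


# The Glassdoor rule above: 8 companies, 4 right, 4 wrong.
print(indistinguishable_from_chance(4, 8))
```

With only 8 observations, nothing short of a perfect 8 of 8 (or 0 of 8) clears a 0.05 threshold under this test. That cuts both ways: it is easy to justify removing a rule, and hard to justify adding one on a handful of examples.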


What Needs to Fail for This to Break

Failure Mode                              | Severity            | Detection
------------------------------------------|---------------------|--------------------------------
LLM diarizations are generic              | Fatal               | Phase 1, week 2. Cost: ~$50.
/connect finds spurious connections       | Degrades alpha      | /improve catches false patterns
/improve converges on "buy momentum"      | Wastes architecture | Monitor accumulated rules
Skills accumulate contradictory rules     | Judgment degrades   | Git diff review of changes
Predictions appear accurate (overfitting) | Capital loss        | Out-of-sample validation

The cheapest failure: Phase 1 costs ~$50 in Claude API and 2 weeks. If the fundamental premise fails, you learn fast and cheap.

The most dangerous failure: /improve writing plausible-looking rules that are actually overfit. Mitigated by human review gate on all skill changes and 6 months of paper trading before live capital.


The Build Plan

Phase 0 (~$20, 1 weekend): Data Source Ablation Study

Manually diarize 10 companies using each data source alone, then all sources combined. Which sources actually produce non-obvious insights? Build pipelines only for the top 3-4. This de-risks the data architecture before any pipeline code gets written.

Phase 1 (~$50/mo, Month 1-2): Proof of Concept, 50 Companies

Build thin harness on Mac Mini M4. Run /diarize-company nightly on 50 companies. Validate: does it produce non-obvious "SAYS vs. ACTUALLY DOING" divergences?

Phase 2 (~$100/mo, Month 3-4): Connection Layer, 500 Companies

Scale to full S&P 500. Build /connect (hierarchical: sector then cross-sector). Build /allocate. Start paper trading via Alpaca. Track predictions in a log.
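The prediction log could be as simple as an append-only JSON Lines file. This is a sketch under my own assumptions: the field names, the file layout, and the `outcome` placeholder are hypothetical, and nothing here touches the Alpaca API.

```python
import json
from datetime import date
from pathlib import Path


def log_prediction(path: Path, ticker: str, direction: str,
                   conviction: float, horizon_days: int, made_on: date) -> dict:
    """Append one prediction as a JSON line and return the stored record."""
    record = {
        "ticker": ticker,
        "direction": direction,
        "conviction": conviction,
        "horizon_days": horizon_days,
        "made_on": made_on.isoformat(),
        "outcome": None,  # filled in later, once the horizon has passed
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def read_predictions(path: Path) -> list:
    """Load every logged prediction, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Append-only matters here: a log that is written before the outcome is known, and never edited afterward, is what makes the later /improve step an out-of-sample check instead of a backtest.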

Phase 3 (~$150/mo, Month 5-6): Learning Loop

Build /improve. First improvement cycle runs after 90 days of predictions. Does prediction accuracy measurably improve after the skills rewrite themselves?
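"Measurably" needs a test. One hedged option is a one-sided two-proportion z-test on directional hit rates before and after a skill rewrite. It assumes predictions are independent, which correlated stocks will violate, so treat the result as optimistic. The function name is hypothetical.

```python
import math


def improvement_p_value(hits_before: int, n_before: int,
                        hits_after: int, n_after: int) -> float:
    """One-sided p-value that accuracy after the rewrite exceeds accuracy before.

    Pooled two-proportion z-test; assumes independent predictions.
    """
    p_before = hits_before / n_before
    p_after = hits_after / n_after
    pooled = (hits_before + hits_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0:
        return 1.0  # all hits or all misses: no evidence either way
    z = (p_after - p_before) / se
    # Upper-tail probability of a standard normal at z.
    return 0.5 * math.erfc(z / math.sqrt(2))


# 50% -> 53% on 500 predictions each is not yet distinguishable from noise.
print(improvement_p_value(250, 500, 265, 500))
```

A small p-value here is necessary but not sufficient. The comparison should use predictions logged after the rewrite, never the ones the rewrite was derived from.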

Phase 4 ($25K-$50K capital, Month 7+): Live Capital

Switch from paper to live. Add Kronos price forecasting. Continuous /improve loop. Scale capital as track record builds. Target: Sharpe > 1.0.


Infrastructure

Component    | Choice                      | Cost
-------------|-----------------------------|------------
Compute      | Mac Mini M4 (already owned) | $0
Market data  | Polygon.io free tier        | $0
Filings      | SEC EDGAR                   | $0
Geopolitical | World Monitor (open source) | $0
LLM          | Claude API                  | ~$50-150/mo
Execution    | Alpaca (paper + live)       | $0

Total monthly cost: $50-150. Almost entirely Claude API. No cloud infrastructure. The Mac Mini runs the cron, the skills, and eventually local Kronos inference. Everything else is API calls.


What This Is Not

What This Is

A recursive loop that reads the world, forms investment theses grounded in how companies actually operate (not how they report), makes connections that only emerge from cross-domain synthesis at scale, allocates capital based on conviction, and gets better at all of this over time through real outcomes.

Encoded as fat skills that compound on a thin harness, running on a Mac Mini while you sleep.

The AI reads the movie. Everyone else is still looking at photographs.