What If the AI Reads the Movie While Everyone Else Looks at Photographs?
An experiment in building an AI-native capital allocator from first principles. Not a faster quant. Not a robo-advisor. A system that reads the continuous loop of every company, finds connections humans can't hold in working memory, and gets better at it over time.
The Starting Question
Every hedge fund uses AI now. They bolt machine learning onto existing strategies: faster factor models, better signal extraction, automated execution. This is the equivalent of putting a jet engine on a horse.
The question I wanted to explore: what does a fund look like when it's AI-native from the ground up? Not AI-assisted. AI-native. Where the core capability is something that simply couldn't exist before large language models.
I started by researching what exists. Abundance, founded by Apoorva Mehta (Instacart), raised $100M to build an "AI Capital Allocator." Their job page mentions "self-improving agents," "token efficiency," and "long-running agents (20+ hours)." They're trading prop capital, market-neutral, high Sharpe. Deliberately closed and private.
In crypto, there are autonomous portfolio agents (StockPilot AI, KaleidoAgent, Aura) managing on-chain positions. In TradFi, Anthropic shipped financial services agents that draft pitch decks and review statements.
Nobody is building what I'm describing. The crypto agents trade DeFi. The TradFi agents are assistive. The gap: an AI system that reasons about the world and allocates capital based on that reasoning, at a scale a retail investor can actually run.
Three Wrong Turns (That Taught Me the Right Direction)
I went through three failed frames before landing on something that held up to scrutiny. Each failure was instructive.
Wrong Turn 1: Autoresearch for Allocation
Karpathy's autoresearch is elegant: an AI agent modifies training code, runs experiments for 5 minutes, measures val_bpb, keeps improvements, discards regressions. 100 experiments overnight. You wake up to a better model.
I tried mapping this to allocation. The agent modifies allocate.py, backtests for 5 minutes, measures Sharpe ratio, keeps/discards. Same loop.
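A minimal sketch of that keep/discard loop, written as a generic greedy hill-climb. It is not Karpathy's actual code. The `propose` and `evaluate` callables are hypothetical stand-ins for "the agent edits allocate.py" and "run a 5-minute backtest." Here the evaluator is deliberately pure noise, so the true Sharpe of every candidate is zero, and the loop still ratchets its best score upward:

```python
import random
from typing import Callable

def keep_discard(
    evaluate: Callable[[float], float],
    propose: Callable[[float], float],
    start: float,
    n_experiments: int = 100,
) -> tuple[float, float]:
    """Greedy loop: keep a candidate only if it scores higher than the incumbent."""
    best, best_score = start, evaluate(start)
    for _ in range(n_experiments):
        candidate = propose(best)        # the agent's edit
        score = evaluate(candidate)      # the short experiment
        if score > best_score:           # keep improvements...
            best, best_score = candidate, score
        # ...and silently discard regressions
    return best, best_score

rng = random.Random(0)

def noisy_sharpe(params: float) -> float:
    # Hypothetical evaluator: every candidate's true Sharpe is 0,
    # and the backtest only reports noise around it.
    return rng.gauss(0.0, 1.0)

def tweak(params: float) -> float:
    # Hypothetical "edit": a small random perturbation of the parameter.
    return params + rng.gauss(0.0, 0.1)

best, best_score = keep_discard(noisy_sharpe, tweak, start=0.0)
print(f"apparent score after 100 experiments: {best_score:.2f}")
```

With val_bpb (lower is better) you would negate the score, and the structure stays the same. In the noisy run, the "winner" is simply the luckiest of 101 draws.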
Why it breaks: Autoresearch works because val_bpb is a clean, high-sample metric (millions of tokens, tight confidence intervals). Stock returns give you ~250 samples per year with massive noise. The evaluation signal is too weak for the loop to converge meaningfully. You'd just be overfitting to historical accidents.
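To put numbers on "too weak," the sketch below uses the standard i.i.d. approximation for the sampling error of an estimated Sharpe ratio (Lo, 2002), SE ≈ sqrt((1 + SR²/2) / T) per period, and then annualizes it. It assumes independent daily returns, which real returns are not, so treat the result as optimistic:

```python
import math

TRADING_DAYS = 252

def sharpe_standard_error(annual_sharpe: float, years: float,
                          periods_per_year: int = TRADING_DAYS) -> float:
    """Approximate SE of an annualized Sharpe estimate, assuming i.i.d. returns."""
    sr = annual_sharpe / math.sqrt(periods_per_year)      # per-period Sharpe
    n_obs = periods_per_year * years                      # number of return samples
    se_per_period = math.sqrt((1 + 0.5 * sr**2) / n_obs)
    return se_per_period * math.sqrt(periods_per_year)    # back to annual units

for years in (1, 3, 10):
    se = sharpe_standard_error(1.0, years)
    print(f"{years:>2} yr of daily returns: measured Sharpe 1.0 +/- {se:.2f}")
```

For a strategy whose true annualized Sharpe is 1.0, one year of daily data gives a standard error of roughly 1.0, so the estimate cannot be distinguished from zero. Even ten years only narrows it to about ±0.3.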
Wrong Turn 2: Signal Discovery from Alternative Data
If autoresearch can't optimize allocation directly, what about using it to discover signals? Feed the agent World Monitor data (geopolitical events, ship tracking, satellite imagery) and let it find correlations with stock returns.
Why it breaks: This is still quant pattern matching. I just swapped the input data from price/fundamentals to OSINT. The method is the same: mine historical data for statistical correlations. Every quant fund does this. Running it with a fancier data source doesn't change the epistemology.
Wrong Turn 3: Optimize the Reasoning Framework
OK, forget optimizing code or signals. What if the autoresearch loop optimizes the reasoning process itself? The modifiable file isn't code, it's a reasoning framework in markdown. The agent tests whether changes to how it thinks produce better predictions.
Why it breaks: You can't evaluate reasoning quality by checking if the stock went up. A thesis can be perfectly reasoned and lose money from an unforeseeable event. Over 100 experiments on a few years of data, the sample is too small to separate good reasoning from lucky reasoning. You'd end up "prompt-hacking" -- finding prompts that exploit quirks in Claude's behavior on specific historical scenarios.
Three attempts. Same fundamental problem: historical returns are a noisy, low-sample, non-stationary signal to optimize against. The autoresearch loop doesn't map to allocation because allocation has no equivalent of val_bpb, no clean, high-sample evaluation metric.
The Reframe: Companies Are Readable Loops
The breakthrough came from a different direction entirely. Garry Tan's essay "Thin Harness, Fat Skills" describes a system architecture where intelligence lives in markdown skill files that encode judgment, while execution lives in deterministic code. Skills are method calls. They compound. They never degrade.
The YC Startup School example was the key analogy. 6,000 founders, each with structured applications, questionnaire answers, advisor transcripts, and public signals. A /enrich-founder skill diarizes each founder: what they SAY versus what they're ACTUALLY BUILDING. A /match skill synthesizes across all 6,000 profiles to find connections no individual profile reveals. A /improve skill reads event feedback and rewrites rules back into the matching skills.
Map that to capital allocation:
| YC Matching | Capital Allocation |
|---|---|
| 6,000 founder profiles | S&P 500 companies (~500) |
| /enrich-founder: read everything, diarize | /diarize-company: read everything, find "SAYS vs. ACTUALLY DOING" |
| /match: cluster by affinity | /connect: find cross-company patterns |
| /improve: read NPS, rewrite skills | /improve: read outcomes, rewrite skills |
| 12% "OK" ratings → 4% | Prediction accuracy improves over quarters |
But here's the deeper insight. Every company runs a loop:
Strategy → Execution → Results → Revised Strategy → ...
Financial statements are snapshots of where that loop is at a given moment. Lagging indicators. By the time you see the 10-Q, the loop has already moved.
But the loop itself is readable in real time:
| Loop Signal | Data Source | What It Reveals |
|---|---|---|
| Hiring or firing? | Job postings | Loop expanding or contracting |
| Shipping product? | GitHub, changelogs, app store | Execution velocity |
| Customers happy? | App reviews, social sentiment | Retention health |
| Supply chain intact? | Ship tracking, World Monitor | Operational risk |
| Leadership confident? | Earnings call tone, insider filings | Forward conviction |
| Investing in future? | Job postings + capex + patents | Future loop strength |
The fund doesn't analyze stocks. It reads loops. A decelerating loop with a bullish market narrative is a short. An accelerating loop the market hasn't noticed is a long. The divergence between loop reality and market narrative IS the alpha.
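A minimal sketch of that divergence rule. Every quantitative detail here is an assumption. Each row of the loop-signal table is reduced to a hypothetical score in [-1, 1], where positive means the loop is accelerating. The six scores are equally weighted. "Market narrative" is a hypothetical sentiment score on the same scale. In the real system these numbers would come out of /diarize-company rather than being hand-set:

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class LoopSignals:
    # Hypothetical normalized scores in [-1, 1]; +1 means the loop is accelerating.
    hiring: float
    shipping: float
    customers: float
    supply_chain: float
    leadership: float
    investment: float

    def loop_score(self) -> float:
        values = astuple(self)
        return sum(values) / len(values)   # equal weights: an assumption

def position(loop: LoopSignals, narrative: float, threshold: float = 0.5) -> str:
    """narrative: hypothetical market-sentiment score in [-1, 1]; +1 = very bullish."""
    score = loop.loop_score()
    gap = score - narrative          # loop reality minus market narrative
    if score > 0 and gap > threshold:
        return "long"                # accelerating loop the market hasn't noticed
    if score < 0 and gap < -threshold:
        return "short"               # decelerating loop under a bullish story
    return "flat"                    # no large divergence, no trade

decelerating = LoopSignals(hiring=-0.6, shipping=-0.4, customers=-0.2,
                           supply_chain=0.0, leadership=-0.5, investment=-0.3)
print(position(decelerating, narrative=0.7))  # -> short
```

The threshold keeps the rule from trading on small disagreements. A position appears only when the gap between what the loop shows and what the market believes is large.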
The AI-Native Edge: Three Things Humans Can't Do
This isn't about reading faster. It's about a cognitive act that human working memory physically cannot perform.
A single analyst can go deep on 15-20 companies. They know everything about those companies. But they can't see across 500.
A macro strategist can see the big picture. But they can't trace it to individual company loops.
A quant can see statistical patterns across everything. But they can't reason about why the pattern exists.
The AI holds all three simultaneously. Deep on every company, wide across the full universe, and connected to the macro. That's not a faster version of what humans do. It's different.
NVIDIA shifts supply chain roles +340%
+ Vietnam fabs breaking ground (World Monitor)
+ Apple posts 12 custom silicon roles (job data)
+ TSMC guides down on advanced node demand (earnings)
+ Shipping lane disruptions in the Taiwan Strait (AIS data)
+ Foxconn opens 2nd India plant (news)
HUMAN ANALYST sees:
"NVIDIA supply chain diversification"
AI SEES:
"The entire semiconductor industry is restructuring
its geographic footprint simultaneously. This isn't
an NVIDIA story. It's an industry phase transition.
The market is pricing this as individual company
decisions. It's actually a coordinated migration
driven by geopolitical risk."
That connection (six signals, four named companies, three geographies) exists in the data right now. No single human can hold it. No quant model can reason about it.
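Structurally, a cross-company connection can be sketched as a set of signals that share a theme while spanning several companies and several independent source types. In the sketch below, the `theme` tags are hypothetical (in practice the LLM would assign them while diarizing). The events are the six from the example above, and the three-source-type minimum mirrors the rule /improve arrives at later in this piece. This code only filters candidates. Reasoning about *why* the pattern exists is still the model's job:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Signal:
    theme: str                       # hypothetical tag assigned during diarization
    source_type: str                 # e.g. "job_data", "earnings", "ais"
    company: Optional[str] = None
    geography: Optional[str] = None

def find_connections(signals, min_companies=2, min_source_types=3):
    """Keep themes that span several companies AND several independent source types."""
    by_theme = defaultdict(list)
    for s in signals:
        by_theme[s.theme].append(s)
    found = {}
    for theme, group in by_theme.items():
        companies = {s.company for s in group if s.company}
        sources = {s.source_type for s in group}
        geographies = {s.geography for s in group if s.geography}
        if len(companies) >= min_companies and len(sources) >= min_source_types:
            found[theme] = {"companies": companies, "sources": sources,
                            "geographies": geographies}
    return found

events = [
    Signal("semis_relocation", "job_data", company="NVIDIA"),
    Signal("semis_relocation", "world_monitor", geography="Vietnam"),
    Signal("semis_relocation", "job_data", company="Apple"),
    Signal("semis_relocation", "earnings", company="TSMC", geography="Taiwan"),
    Signal("semis_relocation", "ais", geography="Taiwan"),
    Signal("semis_relocation", "news", company="Foxconn", geography="India"),
]
print(find_connections(events))
```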
Architecture: Thin Harness, Fat Skills
FAT SKILLS (judgment, synthesis, connection):
/diarize-company Read one company's loop deeply
/connect-sector Synthesize patterns within each sector
/connect-cross Find cross-sector connections + macro
/allocate Convert signals + connections into positions
/improve Learn from real outcomes, rewrite skills
/watchlist Human-in-the-loop: steer attention
THIN HARNESS (~200 lines):
Cron on Mac Mini M4
Schedule skills, manage context, enforce safety
DETERMINISTIC LAYER (same input, same output):
Market data (Polygon.io) | Filings (SEC EDGAR)
World events (World Monitor) | Execution (Alpaca API)
Risk constraints (position limits, drawdowns)
Intelligence lives UP in the skills. Execution lives DOWN in deterministic code. The harness is thin -- it just schedules and routes.
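A minimal sketch of how thin the harness can be. It assumes each skill is a markdown file named `<skill>.md` with `$placeholders` for its parameters, and it injects the model call as a plain function. That function is a hypothetical stand-in for a Claude API client, not the client's actual interface. Risk limits and order placement are deliberately absent because they belong to the deterministic layer. A cron entry such as `0 2 * * *` would run it nightly at 02:00:

```python
import tempfile
from pathlib import Path
from string import Template
from typing import Callable

LLM = Callable[[str], str]  # prompt -> completion; hypothetical stand-in for a Claude API call

def run_skill(skills: Path, name: str, llm: LLM, **params: str) -> str:
    """Load skills/<name>.md, substitute $params, and send it to the model."""
    template = Template((skills / f"{name}.md").read_text(encoding="utf-8"))
    return llm(template.safe_substitute(**params))

def nightly(skills: Path, sectors: dict[str, list[str]], llm: LLM) -> str:
    # 1. /diarize-company: same skill, different parameter per ticker
    briefs = {t: run_skill(skills, "diarize-company", llm, ticker=t)
              for members in sectors.values() for t in members}
    # 2. /connect-sector: one call per sector keeps each context small
    views = [run_skill(skills, "connect-sector", llm, sector=s,
                       briefs="\n\n".join(briefs[t] for t in members))
             for s, members in sectors.items()]
    # 3. /connect-cross over the sector summaries, then 4. /allocate
    cross = run_skill(skills, "connect-cross", llm, views="\n\n".join(views))
    return run_skill(skills, "allocate", llm, connections=cross)

# Demo with toy skill files and an echoing fake model.
with tempfile.TemporaryDirectory() as d:
    skills = Path(d)
    toy = {"diarize-company": "Diarize $ticker",
           "connect-sector": "Sector $sector:\n$briefs",
           "connect-cross": "Cross:\n$views",
           "allocate": "Allocate from $connections"}
    for name, body in toy.items():
        (skills / f"{name}.md").write_text(body, encoding="utf-8")
    result = nightly(skills, {"Tech": ["NVDA", "AAPL"]}, llm=lambda p: f"[{p}]")
print(result)
```

`safe_substitute` is used instead of `str.format` so that braces or stray dollar amounts in a skill's prose don't crash the run.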
Five design principles:
- Skills are method calls. /diarize-company NVDA and /diarize-company AAPL are the same skill with different parameters, producing different intelligence briefs.
- Every skill is a permanent upgrade. It never degrades. It never forgets. When the next model drops, every skill instantly gets better.
- Prose for humans, structured signals for machines. Every diarization outputs both a narrative thesis and a machine-readable signal block (conviction score, direction, time horizon, catalyst specificity).
- /connect is hierarchical. Sector-level synthesis first, then cross-sector. Each level stays within high-quality context limits.
- Improvement is real, not backtested. The /improve skill reads what actually happened, not what would have happened historically.
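One way to realize "prose for humans, structured signals for machines" is to have each diarization end with a small JSON object inside a delimiter the harness can find. The `<signal>` tag and the field names below are assumptions built from the four fields listed above, not a fixed format. Validation happens in deterministic code, so a malformed block fails loudly instead of reaching /allocate:

```python
import json
import re
from dataclasses import dataclass

SIGNAL_RE = re.compile(r"<signal>(.*?)</signal>", re.DOTALL)

@dataclass(frozen=True)
class SignalBlock:
    ticker: str
    direction: str               # "long", "short", or "neutral"
    conviction: float            # 0.0 to 1.0
    horizon_days: int
    catalyst_specificity: float  # 0.0 = vague theme, 1.0 = named, dated event

    def __post_init__(self) -> None:
        if self.direction not in {"long", "short", "neutral"}:
            raise ValueError(f"unknown direction: {self.direction!r}")
        for name in ("conviction", "catalyst_specificity"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} out of range: {value}")

def split_diarization(text: str) -> tuple[str, SignalBlock]:
    """Separate the prose thesis from the machine-readable signal block."""
    match = SIGNAL_RE.search(text)
    if match is None:
        raise ValueError("diarization has no <signal> block")
    narrative = (text[:match.start()] + text[match.end():]).strip()
    return narrative, SignalBlock(**json.loads(match.group(1)))

# Hypothetical diarization output for a placeholder ticker.
brief = """Says: doubling down on AI. Actually doing: cutting supply-chain hiring.
<signal>{"ticker": "XYZ", "direction": "short", "conviction": 0.7,
         "horizon_days": 90, "catalyst_specificity": 0.4}</signal>"""
narrative, signal = split_diarization(brief)
print(narrative)
print(signal)
```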
The Recursive Loop
COMPANY LOOPS (what we read):
Strategy → Build → Ship → Measure → Adjust → ...
Readable via: filings, job posts, commits, reviews,
shipping data, patents, insider filings, transcripts
FUND LOOP (what we run):
/watchlist → You flag what deserves deeper attention
/diarize → Read each company's loop
/connect → Synthesize within-sector, then cross-sector
/allocate → Convert signals into positions
[reality unfolds]
/improve → Learn from outcomes, rewrite all skills
THE FUND'S LOOP GETS BETTER AT READING OTHER LOOPS.
Each /improve cycle makes /diarize sharper, /connect more selective, and /allocate better calibrated. The skills accumulate institutional knowledge in markdown -- human-readable, versionable, permanently available.
Here's the kind of output /improve is designed to produce:
PATTERN DISCOVERED (added to /diarize-company):
When "CEO hedges on a geography for the first time"
AND World Monitor shows escalation in that region:
→ Flag as high-conviction underperform signal.
Evidence: ASML (Feb 2026), Qualcomm (Mar 2026).
FALSE PATTERN REMOVED (from /diarize-company):
"Glassdoor sentiment decline → underperform" had
zero predictive power over 6 months (IC: 0.01).
8 companies, 4 up, 4 down. Random. Removed.
META-IMPROVEMENT (added to /connect):
Connections citing 3+ data source types: 73% accuracy.
Connections citing only 2 types: 48% (coin flip).
→ Require 3+ sources for high-conviction connections.
Rules are additive AND subtractive. Bad patterns get pruned. Skill files get cleaner over time, not just longer.
What Needs to Fail for This to Break
| Failure Mode | Severity | Detection |
|---|---|---|
| LLM diarizations are generic | Fatal | Phase 1, week 2. Cost: ~$50. |
| /connect finds spurious connections | Degrades alpha | /improve catches false patterns |
| /improve converges on "buy momentum" | Wastes architecture | Monitor accumulated rules |
| Skills accumulate contradictory rules | Judgment degrades | Git diff review of changes |
| Predictions appear accurate (overfitting) | Capital loss | Out-of-sample validation |
The cheapest failure: Phase 1 costs ~$50 in Claude API and 2 weeks. If the fundamental premise fails, you learn fast and cheap.
The most dangerous failure: /improve writing plausible-looking rules that are actually overfit. Mitigated by a human review gate on all skill changes and 6 months of paper trading before live capital.
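A minimal sketch of that review gate, assuming skills are plain files under version control. /improve may only *propose* new skill text. The harness renders the change as a unified diff with Python's `difflib`, and nothing is written unless an `approve` callback (a human, in practice) says yes. The function name and the callback shape are illustrative:

```python
import difflib
import tempfile
from pathlib import Path
from typing import Callable

def propose_skill_change(path: Path, new_text: str,
                         approve: Callable[[str], bool]) -> bool:
    """Show /improve's proposed rewrite as a diff; write it only if approved."""
    old_text = path.read_text(encoding="utf-8")
    diff = "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{path.name}", tofile=f"b/{path.name}"))
    if not diff or not approve(diff):
        return False                      # no change, or the reviewer said no
    path.write_text(new_text, encoding="utf-8")
    return True

with tempfile.TemporaryDirectory() as d:
    skill = Path(d) / "diarize-company.md"
    skill.write_text("Rule A\nRule B\n", encoding="utf-8")
    rejected = propose_skill_change(skill, "Rule A\n", approve=lambda diff: False)
    after_reject = skill.read_text(encoding="utf-8")
    accepted = propose_skill_change(skill, "Rule A\n",
                                    approve=lambda diff: "-Rule B" in diff)
    after_accept = skill.read_text(encoding="utf-8")
print(rejected, accepted)
```

Interactively, `approve` could be `lambda diff: input(diff + "\nApply? [y/N] ").strip().lower() == "y"`. Following each approved write with a git commit keeps every rule change tied to its evidence trail.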
The Build Plan
Data Source Ablation Study
Manually diarize 10 companies using each data source solo, then combined. Which sources actually produce non-obvious insights? Build pipelines only for the top 3-4. Permanently de-risks the data architecture.
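The scoring side of the ablation is simple enough to sketch. The counts below are placeholders, not results. Each number stands for how many non-obvious "SAYS vs. ACTUALLY DOING" findings a hand-reviewed, single-source diarization produced for one of the 10 companies. Sources are ranked by mean yield, with coverage (the share of companies where the source found anything) as a tie-breaker:

```python
from statistics import mean

def rank_sources(yields: dict[str, list[int]], top_k: int = 4) -> list[tuple[str, float, float]]:
    """Return (source, mean findings per company, coverage) for the top_k sources."""
    rows = [(src, mean(counts), sum(c > 0 for c in counts) / len(counts))
            for src, counts in yields.items()]
    rows.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return rows[:top_k]

# Placeholder counts for 10 companies per source (illustrative only).
yields = {
    "filings":     [1, 0, 2, 1, 0, 1, 1, 0, 2, 1],
    "job_posts":   [2, 1, 2, 3, 1, 2, 1, 2, 2, 1],
    "app_reviews": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "ais":         [0, 1, 0, 0, 2, 0, 0, 1, 0, 0],
    "transcripts": [1, 2, 1, 1, 1, 2, 1, 1, 2, 1],
}
for src, avg, cov in rank_sources(yields, top_k=3):
    print(f"{src:<12} {avg:.1f} findings/company, found something in {cov:.0%}")
```

Solo rankings miss interactions, so the combined run still matters. A source that ranks low alone can become valuable when it corroborates another.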
Proof of Concept: 50 Companies
Build thin harness on Mac Mini M4. Run /diarize-company nightly on 50 companies. Validate: does it produce non-obvious "SAYS vs. ACTUALLY DOING" divergences?
Connection Layer: 500 Companies
Scale to the full S&P 500. Build /connect (hierarchical: sector, then cross-sector). Build /allocate. Start paper trading via Alpaca. Track predictions in a log.
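For "track predictions in a log," an append-only JSON Lines file is one low-ceremony option. Every call is written once, with its entry price and rationale, at the moment it is made, so /improve later scores real predictions rather than reconstructed ones. The schema, the placeholder ticker and date, and the direction-only scoring rule are assumptions:

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass(frozen=True)
class Prediction:
    made_on: str        # ISO date the call was logged
    ticker: str
    direction: str      # "long" or "short"
    horizon_days: int
    conviction: float
    entry_price: float
    rationale: str      # the prose thesis, kept for /improve

def log_prediction(path: Path, p: Prediction) -> None:
    with path.open("a", encoding="utf-8") as f:   # append-only: never rewrite history
        f.write(json.dumps(asdict(p)) + "\n")

def load_predictions(path: Path) -> list[Prediction]:
    with path.open(encoding="utf-8") as f:
        return [Prediction(**json.loads(line)) for line in f if line.strip()]

def direction_correct(p: Prediction, exit_price: float) -> bool:
    """Did the price move the predicted way by the end of the horizon?"""
    if p.direction == "long":
        return exit_price > p.entry_price
    return exit_price < p.entry_price

with tempfile.TemporaryDirectory() as d:
    log = Path(d) / "predictions.jsonl"
    call = Prediction("2026-01-15", "XYZ", "short", 90, 0.7, 100.0,
                      "Says expansion; job posts show contraction.")
    log_prediction(log, call)
    loaded = load_predictions(log)
print(loaded[0] == call, direction_correct(call, exit_price=92.0))
```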
Learning Loop
Build /improve. First improvement cycle runs after 90 days of predictions. Does prediction accuracy measurably improve after the skills rewrite themselves?
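"Measurably" needs a test. One simple option is a one-sided two-proportion z-test on hit rates before and after the rewrite. It assumes predictions are independent, which overlapping positions in correlated stocks are not, so passing the test is necessary rather than sufficient. The hit counts below are hypothetical:

```python
import math

def improvement_z_test(hits_before: int, n_before: int,
                       hits_after: int, n_after: int) -> tuple[float, float]:
    """One-sided two-proportion z-test: is the post-/improve hit rate higher?"""
    p_before, p_after = hits_before / n_before, hits_after / n_after
    pooled = (hits_before + hits_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    z = (p_after - p_before) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z) if nothing improved
    return z, p_value

# Hypothetical: 52% hit rate before, 58% after, at two sample sizes.
for hits_b, hits_a, n in ((52, 58, 100), (260, 290, 500)):
    z, p = improvement_z_test(hits_b, n, hits_a, n)
    print(f"n={n}: 52% -> 58%, z={z:.2f}, one-sided p={p:.3f}")
```

With 100 predictions per period, a six-point gain is indistinguishable from noise. At 500 per period, the same gain clears p < 0.05.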
Live Capital
Switch from paper to live. Add Kronos price forecasting. Continuous /improve loop. Scale capital as track record builds. Target: Sharpe > 1.0.
Infrastructure
| Component | Choice | Cost |
|---|---|---|
| Compute | Mac Mini M4 (already owned) | $0 |
| Market data | Polygon.io free tier | $0 |
| Filings | SEC EDGAR | $0 |
| Geopolitical | World Monitor (open source) | $0 |
| LLM | Claude API | ~$50-150/mo |
| Execution | Alpaca (paper + live) | $0 |
Total monthly cost: $50-150. Almost entirely Claude API. No cloud infrastructure. The Mac Mini runs the cron, the skills, and eventually local Kronos inference. Everything else is API calls.
What This Is Not
- Not a quant fund. No factor models, no statistical arbitrage, no HFT. The edge is judgment, not speed.
- Not a robo-advisor. Robo-advisors follow rules. This system reasons.
- Not backtesting-driven. The improvement loop learns from real outcomes, not historical optimization.
- Not a black box. Every position has a written rationale. Every skill change has an evidence trail. The entire system is human-readable markdown.
What This Is
A recursive loop that reads the world, forms investment theses grounded in how companies actually operate (not how they report), makes connections that only emerge from cross-domain synthesis at scale, allocates capital based on conviction, and gets better at all of this over time through real outcomes.
Encoded as fat skills that compound on a thin harness, running on a Mac Mini while you sleep.
The AI reads the movie. Everyone else is still looking at photographs.