Web App
Web App · Top-venue Paper · Technical Deep Dive · Discussion
The Steam Engine Moment of the Information Age
Everything the LLM sees on each call:
Agent has no memory — it can only see what's in the context
Context Window Evolution
But How Large Is Your Project?
What Agent does:
For every LLM call, decide which information goes into context —
which files to read, which preferences to remember, which history to drop
📂
Which files to read
🧠
Which prefs to keep
🗑️
Which history to drop
Today's Core Argument
A structural shift in efficiency, not just a new tool.
We had no system at all before — everything was in our heads or on paper, total chaos.
In between research sessions, I spent a few days building her a system with Claude Code.
Traditional Web Development
With Claude Code ✦
App Development
✓ schema.sql generated
✓ HTML + CSS written
✓ errors auto-fixed
✓ ready to deploy
The execution barrier has dropped — your ceiling is how many worthwhile things you can think to do.
Conformal Prediction × Annotation Ambiguity
When ground truth is inherently ambiguous (annotators disagree), do CP's coverage guarantees still hold?
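For reference, the guarantee in question: split conformal prediction promises marginal coverage under exchangeability, which is exactly what ambiguous labels put under strain:

```latex
% Split conformal prediction: given calibration scores s_i = s(X_i, Y_i),
% i = 1, ..., n, let \hat{q} be the \lceil (n+1)(1-\alpha) \rceil / n
% empirical quantile of \{s_i\}. The prediction set
% C(x) = \{\, y : s(x, y) \le \hat{q} \,\}
% then satisfies, for an exchangeable test pair (X_{n+1}, Y_{n+1}):
\Pr\!\left( Y_{n+1} \in C(X_{n+1}) \right) \;\ge\; 1 - \alpha
```

When annotators disagree, there is no single well-defined $Y_{n+1}$, so it is unclear what this probability even refers to.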
Calibration in This Setting Hasn't Been Done
Annotation ambiguity + calibration — even “how to evaluate” has no consensus
Pivot → CalibrationAGT
Calibration in this setting is equally worth systematic study.
Literature Review — papers, abstracts, notes · All text
Research Proposal — hypothesis, design, rationale · Text
Experiment Scripts — code is text
Errors & Results — stack traces, CSV, logs · Text with instant feedback
Figures — matplotlib / seaborn script · Also text
Paper (LaTeX) — everything converges · Into text
Before: ChatGPT
Now: Agent
You set the direction — Agent reads, executes, verifies
→ You're the Director, not the middleman
What is a Skill?
A Markdown file telling the Agent how to handle a task category. Trigger it — the Agent runs the full workflow.
What is Skill-Creator?
A skill for writing skills. Describe your need — Agent writes the spec.
Use Cases
Literature review · paper writing · experiments · code review · data preprocessing — any repeatable workflow
Skills = reusable workflows — define once, invoke repeatedly
Examples: /paper-review review workflow ·
/experiment-log experiment record ·
/debug-cluster GPU cluster debug ·
/weekly-report auto weekly report
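Since a skill is just a Markdown file, here is a minimal sketch of what one might look like (the file path, frontmatter fields, and steps are illustrative, not the exact Claude Code schema):

```markdown
<!-- .claude/skills/paper-review/SKILL.md — illustrative sketch -->
---
name: paper-review
description: Review an ML paper draft for claims, experiments, and writing
---

## Workflow
1. Read the PDF or LaTeX source the user points to.
2. Summarize the contributions in 3 bullets.
3. Check every empirical claim against the reported tables.
4. Output: strengths, weaknesses, concrete revision suggestions.
```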
Pipeline
Phase 0 · Setup
venue · topic · compute → config.md
Idea Loop · Phase 1–5
Literature Review
ArXiv MCP · WebSearch · gap identification
Idea Generation
Generate 4 candidate ideas
6-Agent Debate ← subagents
Critic · Champion · Devil's Advocate…
AC Gate
REVISE → loop · REJECT → drop · ACCEPT ↓
Pilot Experiment
Quick feasibility check · PASS to continue
Full Experiments · GPU Auto
SSH · nvitop scheduling · autonomous execution
6-Agent Result Debate ← subagents
Result interpretation · Contribution positioning
Paper Writing + Figures
seaborn figures · parallel section writing
Review ← subagent → Submit
Revisions · Telegram notification
/ai-research-paper is just a starting point —
any repetitive workflow can be packaged as a skill.
More automated research pipelines exist
Chengcheng & Jinxu
AutoAI Research System
github.com/Sibyl-Research-Team/AutoResearch-SibylSystem
Run in tmux on a server — session persists across disconnects
VS Code plugin cannot guarantee session continuity
VS Code plugin is great for instant tasks — e.g. making slides
Good for short, focused tasks that finish quickly
Terminal: use Ghostty — officially recommended by Anthropic
Claude Code is still early-stage — Ghostty has the fewest bugs
Chrome extension worth trying — can directly control the browser UI
Better compatibility than the local desktop app
Claude Code
Light use — ~4 hours of focused work per day
Sufficient for most — daily research + project work covered
Run 5–10 projects in parallel
Codex (OpenAI)
Limited Codex access included
Parallel tasks, 5h rolling cap + weekly limit — not truly unlimited
Task too big to fit in one context?
The main agent spawns multiple sub-agents that handle subtasks in parallel, returning only summarized results to the main context.
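The fan-out pattern above can be sketched in a few lines; `run_subagent` is a hypothetical stand-in for whatever spawns an agent in a fresh, isolated context (stubbed here so the control flow is runnable):

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    """Hypothetical stub: run an agent in its own context window
    and return only a short summary of what it found."""
    return f"summary({task})"

def fan_out(tasks: list[str]) -> str:
    # Each sub-agent burns its own context on the subtask;
    # only the short summaries flow back into the main context.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(run_subagent, tasks))
    return "\n".join(summaries)
```

The point is the asymmetry: the subtask's full tool output stays inside the sub-agent, and the main context only ever sees the returned strings.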
Why Does Context Grow So Fast?
Tool outputs (tool observations) account for 84% of context
The model's own words are only ~10%
Method 1 of 2
Observation Masking
Replace tool output with a single line:
“There used to be tool output here”
Looks brutal — but experiments show it works about as well as LLM summarization
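A minimal sketch of observation masking over a chat-style message list (the message schema, `keep_last` parameter, and placeholder string are assumptions for illustration):

```python
def mask_observations(messages, keep_last=2,
                      placeholder="[tool output elided]"):
    """Replace older tool outputs with a one-line placeholder,
    keeping only the most recent `keep_last` outputs intact."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_mask = set(tool_idx[:-keep_last] if keep_last else tool_idx)
    return [
        {**m, "content": placeholder} if i in to_mask else m
        for i, m in enumerate(messages)
    ]
```

Since tool outputs dominate the context, masking all but the latest few recovers most of the budget without touching the model's own reasoning turns.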
LLM Summarization
History too long → compress it with LLM summarization
Claude Code has this compaction mechanism built in
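A sketch of the compaction idea, not Claude Code's actual implementation: when history exceeds a budget, fold the old prefix into one summary message (`llm_summarize` is a stub standing in for a real model call; the thresholds are made up):

```python
def llm_summarize(messages) -> str:
    # Stub: a real implementation would call the model here.
    return f"{len(messages)} earlier messages"

def compact(messages, max_chars=8000, keep_recent=6):
    """If history exceeds the character budget, replace everything but
    the most recent turns with a single summary message."""
    total = sum(len(m["content"]) for m in messages)
    if total <= max_chars or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system",
               "content": f"Summary of earlier turns: {llm_summarize(old)}"}
    return [summary] + recent
```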
Sub-agent = Automatic Compression
Language models dislike compressing their own memory
So compression is usually enforced by the Agent framework
Why it's powerful: shell commands are text, and text is exactly what LLMs excel at
A single conversation can load 4000+ tokens of system info —
that's why the agent can continuously "remember" project state
Agent as a persistent teammate — remembers context across sessions, reacts to events automatically, and schedules its own check-ins.
Work is about being accountable for outcomes
Not the process, not the lines of code, not "I wrote it myself"
"Vibe coding — fully give in to the vibes, embrace exponentials, and forget that the code even exists."
X · Feb 2025 → now: "agentic engineering"
"We may see the first AI agents join the workforce and materially change the output of companies."
Blog · Jan 2025
"AI could soon compress decades of scientific progress into just a few years."
Machines of Loving Grace · Oct 2024
Agent lets anyone churn out papers fast — hundreds per day on arXiv, reviewers already can't keep up
arXiv annual submissions · arxiv.org/stats (2025: exceeded 28,000/month)
Maybe: asking the right question is the real core competency
Stanford · Oct 2025 · agents4science.stanford.edu
AI can generate papers — but asking the right question still needs humans
Ref: Hung-yi Lee · AI Agent (3/3) · NTU 2026
Andrew Hall (Stanford) · "100x Research Assistant"
PhD student: 16h / $1,040 vs Claude Code: 1h / $10 (104× cheaper)
But: humans haven't been replaced
People vs. Agents
Do juniors still have a shot?
This isn't pessimism — it's a new starting line
Real-time GPU monitoring across all lab servers —
see which cards are idle, check if you're hogging too many.
~/.ssh/config, SSHs into all servers

Thanks
github.com/Linwei94/talks