Lab Seminar · 2026.03.20
Linwei Tao
AI Agent in Practice

AI Agent for Research & Engineering

Web App · Top-venue Paper · Technical Deep Dive · Discussion
The Steam Engine Moment of the Information Age

claude — ~/CalibrationAGT
Today's Two Projects
Live Demo
Arigato · Cat Boarding Web App
CalibrationAGT · Top-venue Paper
Agenda
🐱
Part 1 · Engineering
Cat Boarding Web App — a production app built with Claude Code in the gaps between research
~5 min
🔬
Part 2 · Research
CalibrationAGT — how a top-venue paper was completed using AI Agent
~20 min
⚙️
Part 3 · Technical Anatomy
How does AI Agent actually work? Tool use, Memory, Sub-agent
~10 min
💭
Part 4 · Discussion
Paper inflation, Senior productivity ×10 — how will research and engineering change?
~20 min
🎁
Bonus
Stay tuned
You've Heard of These
AI Coding Agents

You've Probably Heard These Names

GitHub Copilot
Cursor
Windsurf
Gemini CLI
Claude Code ✦
OpenClaw
Codex
OpenClaw
  • General personal assistant, not a coding tool
  • Powerful memory design — but burns tokens
  • Chat via WhatsApp / Telegram / 15+ apps
Claude Code ✦ Today's tool
  • Rich ecosystem: Skills, MCP, Hook — highly extensible
  • Best for vibe coding & vibe research
Codex GPT-5.4
  • Claude Code's main rival — model keeps improving
  • Ecosystem still catching up — fewer skills & integrations
Core difference: developer & research Agent vs personal life assistant

The Essence of
AI Agent

What Is an AI Agent?
Context Engineering

Agent Has No Memory — It Only Sees the Context

Everything the LLM sees on each call:

[ sys rules ]
CLAUDE.md · tool list · memory files
[ history ]
you: "analyse this" → Claude: "reading file..."
[ tool out ]
Read(main.py) → 500 lines of code
[ input ▶ ]
"now write the tests"
// packed together → sent to the LLM
output = LLM(context)  // predict next token

Agent has no memory — it can only see what's in the context
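The slide's one-liner `output = LLM(context)` can be sketched in a few lines of Python. This is an illustrative stub, not Claude Code's internals — `build_context` and the section labels are assumptions for the sketch:

```python
# Sketch: everything the stateless model sees is packed into one flat prompt.
# build_context and the bracketed section labels are illustrative, not real API.

def build_context(system_rules, history, tool_outputs, user_input):
    """Pack system rules, history, tool output, and the new input together."""
    blocks = (
        ["[system]"] + system_rules,
        ["[history]"] + history,
        ["[tool output]"] + tool_outputs,
        ["[user]", user_input],
    )
    return "\n".join(line for block in blocks for line in block)

context = build_context(
    system_rules=["Follow CLAUDE.md", "Tools: Read, Write, Bash"],
    history=['you: "analyse this"', 'claude: "reading file..."'],
    tool_outputs=["Read(main.py) -> 500 lines of code"],
    user_input="now write the tests",
)
# The model is stateless: output = LLM(context), nothing else carries over.
```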

What Is an AI Agent?
Context Engineering

Context Window Is Growing — But Still Not Enough

Context Window Evolution

Model · Context · Released
GPT-4o · 128K · 2024.05
Claude Sonnet 4 · 200K · 2025.07
GPT-5.2 · 400K · 2025.12
GPT-5.4 · 1M · 2026.03
Claude Sonnet 4.6 · 1M · 2026.03

But How Large Is Your Project?

A top-venue paper · ~8K tokens ✓
A PhD thesis · ~100K tokens ✓
CalibrationAGT codebase · ~400K tokens ⚠
Medium-sized production codebase · ~2–5M tokens ✗
Your habits + env config + docs · can't fit at all ✗
What Is an AI Agent?
Context Engineering

Agent's Core Job: Decide What to Feed

What Agent does:

For every LLM call, decide which information goes into context —
which files to read, which preferences to remember, which history to drop

📂

Which files to read

🧠

Which prefs to keep

🗑️

Which history to drop
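That packing decision can be sketched as a budgeted selection problem. A minimal sketch, assuming a priority score per item — `pack_context` and the priorities are hypothetical, not how Claude Code actually ranks content:

```python
# Sketch of the agent's core job: fit the most valuable items into a
# fixed token budget. Priorities and names here are illustrative only.

def pack_context(items, budget):
    """items: list of (priority, tokens, text); keep highest priority first."""
    chosen, used = [], 0
    for priority, tokens, text in sorted(items, key=lambda x: -x[0]):
        if used + tokens <= budget:
            chosen.append(text)
            used += tokens
    return chosen

items = [
    (3, 50, "CLAUDE.md rules"),    # always worth keeping
    (2, 400, "main.py contents"),  # needed for the current task
    (1, 900, "old chat history"),  # first thing to drop
]
print(pack_context(items, budget=500))  # the old history doesn't fit
```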

Today's Core Argument

Productivity Is Being Repriced

A structural shift in efficiency, not just a new tool.

01
Part One · Engineering
Cat Boarding
Web App
Zero to Production, Built End-to-End with Claude Code
Engineering · Background
Side Project

You Know My GF Runs a Cat Boarding Business 🐱

We had no system at all before — everything was in our heads or on paper, total chaos.
In between research sessions, I spent a few days building her a system with Claude Code.

📋 Before
Notebooks, WeChat group messages, kept in memory — check-in/out times were constantly getting mixed up
✅ Now
Boarding management, home visits, booking links, e-signatures, revenue tracking — complete feature set
Engineering · System Demo
Desktop

Arigato · Cat Boarding Web App

taolinwei.com/cat-boarding-system
Engineering · Build Process
Claude Code in Action

Traditional Web Development

1. Break Down Requirements · requirements doc (feature breakdown + schedule)
2. Database Design · schema.sql (hand-written migrations)
3. UI Design · wireframes → components (hand-coded HTML + CSS)
4. Frontend Dev · pure HTML + ES Modules (no framework, no build step)
5. Debug & Iterate · Stack Overflow + docs (manual debugging)
6. Production Launch · real users, real business

With Claude Code ✦

App Development

You: I want to build a cat boarding app.

✓ schema.sql generated

✓ HTML + CSS written

✓ errors auto-fixed

✓ ready to deploy

Engineering · Implications
What This Means
🏢
Solo Company
One person doing what once needed a full team
Pieter Levels — $3.1M ARR · 70+ products
Lovable — $100M ARR in 8 months
🔬
Researcher's Toolbox
Build whatever tool you're missing
Komodo Health — 6–8 weeks → hours
Patrick Mineault — NeuroAI · analysis pipelines
💡
Startup MVP
Validate an idea in days, not months
Indragie — macOS app · 19k lines by Claude
Invoice system — 1 day · $3.65 API cost
🛠️
Self-Sufficient Living
Build what you need, when you need it
Home Assistant community — smart home w/ Claude Code
Non-programmer — finance app · 0 lines of code

The execution barrier has dropped — your ceiling is how many worthwhile things you can think to do.

02
Part Two · Research
CalibrationAGT
Confidence Calibration Under Annotation Ambiguity
Research · Motivation
CalibrationAGT

How Did This Paper Start?

Annotation Ambiguity Example
01

Conformal Prediction × Annotation Ambiguity

When ground truth is inherently ambiguous (annotators disagree), do CP's coverage guarantees still hold?

02

Calibration in This Setting Hasn't Been Done

Annotation ambiguity + calibration — even “how to evaluate” has no consensus

03

Pivot → CalibrationAGT

Calibration in this setting is equally worth systematic study.

Research · Workflow
Every Step Is Text

Breaking Down Research in AI — Every Step Is Just Text

📚

Literature Review — papers, abstracts, notes · All text

💡

Research Proposal — hypothesis, design, rationale · Text

🔬

Experiment Scripts — code is text

⚠️

Errors & Results — stack traces, CSV, logs · Text with instant feedback

📊

Figures — matplotlib / seaborn script · Also text

📄

Paper (LaTeX) — everything converges · Into text

Research · Background
LLM → Agent

From LLM to Agent

Before: ChatGPT

# Error → copy, paste, ask, copy back
Error: KeyError: 'ece_true'
  ↓ manually copy → paste → ask ChatGPT
Fix: metrics[key] = ece_fn(logits, labels)
You are the middleman — manually ferrying information back and forth

Now: Agent

# Error → Agent handles it end-to-end
Error: KeyError: 'ece_true'
  ↓ Agent reads, locates, fixes, runs tests
✓ All tests passed. Fix pushed.

You set the direction — Agent reads, executes, verifies
→ You're the Director, not the middleman

Research · Claude Code in Action
Live Interaction
claude — ~/CalibrationAGT
Research · Toolchain
Skills & Skill-Creator

Skills — Custom Workflows for Your Agent

What is a Skill?

A Markdown file telling the Agent how to handle a task category. Trigger it — the Agent runs the full workflow.

What is Skill-Creator?

A skill for writing skills. Describe your need — Agent writes the spec.

Use Cases

Literature review · paper writing · experiments · code review · data preprocessing — any repeatable workflow

Skills = reusable workflows — define once, invoke repeatedly

Examples: /paper-review review workflow · /experiment-log experiment record · /debug-cluster GPU cluster debug · /weekly-report auto weekly report
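For a concrete picture, here is a hedged sketch of what such a skill file might look like. The frontmatter fields and steps below are illustrative assumptions, not Anthropic's exact schema:

```markdown
---
name: experiment-log
description: Append a structured entry to experiments.md after each run
---
When the user invokes /experiment-log:
1. Read the latest run's config and metrics from results/.
2. Append a dated entry (config, metrics, observations) to experiments.md.
3. Report a one-line summary back to the user.
```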

claude — ~/research
Research · Skill in Action
ai-research-paper

Pipeline

🔧

Phase 0 · Setup

venue · topic · compute → config.md

↻ LOOP

Idea Loop · Phase 1–5

📚

Literature Review

ArXiv MCP · WebSearch · gap identification

💡

Idea Generation

Generate 4 candidate ideas

🤖

6-Agent Debate ← subagents

Critic · Champion · Devil's Advocate…

🚪

AC Gate

REVISE → loop · REJECT → drop · ACCEPT

🔬

Pilot Experiment

Quick feasibility check · PASS to continue

↓ ACCEPT

Full Experiments · GPU Auto

SSH · gnvitop scheduling · autonomous execution

📊

6-Agent Result Debate ← subagents

Result interpretation · Contribution positioning

✍️

Paper Writing + Figures

seaborn figures · parallel section writing

📱

Review ← subagent → Submit

Revisions · Telegram notification

claude — ~/CalibrationAGT
Research · Going Further
Just the Beginning

/ai-research-paper is just a starting point —
any repetitive workflow can be packaged as a skill.

More automated research pipelines exist

Chengcheng & Jinxu

AutoAI Research System

github.com/Sibyl-Research-Team/AutoResearch-SibylSystem

Research · More Examples
Skills in Action
/brainstorm — ~/CalibrationAGT
/frontend-slides — ~/talks
Research · Usage Tips
Tips

Tips for Using Claude Code

1

Run in tmux on a server — session persists across disconnects

VS Code plugin cannot guarantee session continuity

2

VS Code plugin is great for instant tasks — e.g. making slides

Good for short, focused tasks that finish quickly

3

Terminal: use Ghostty — officially recommended by Anthropic

Claude Code is still early-stage — Ghostty has the fewest bugs

4

Chrome extension worth trying — can directly control the browser UI

Better compatibility than the local desktop app

Research · Usage Tips
Pricing

Pricing

Claude Code

Pro $20 / mo

Light use — ~4 hours of focused work per day

Max · 5× $100 / mo

Sufficient for most — daily research + project work covered

Max · 20× $200 / mo

Run 5–10 projects in parallel

Codex (OpenAI)

ChatGPT Plus $20 / mo

Limited Codex access included

ChatGPT Pro $200 / mo

Parallel tasks, 5h rolling cap + weekly limit — not truly unlimited

03
Part Three · Technical Anatomy
Technical Anatomy
How Does an AI Agent Actually Work?
Reference: Hung-yi Lee Machine Learning Course 2025 · NTU
Tech · Sub-agent
Context Engineering

Sub-agent: Solving the Context Bottleneck

Task too big to fit in one context?
The main agent spawns multiple sub-agents that handle subtasks in parallel, returning only summarized results to the main context.

Main Agent
→ spawn
Sub A: Read Paper A
→ spawn
Sub B: Read Paper B
← Returns summary only, not full text
  • Sub-agent has a lean system prompt, focused on a single subtask
  • Multiple sub-agents run in parallel — faster
  • Context window contains only task-relevant info — more accurate reasoning
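The pattern above can be sketched in a few lines. `run_subagent` is a hypothetical stand-in for the framework's spawn call — the point is that only the summary string crosses back into the parent's context:

```python
# Sketch of the sub-agent pattern: each sub-agent accumulates its own
# private context and returns only a summary. run_subagent is illustrative.

def run_subagent(task, document):
    private_context = [task, document]  # grows only inside the sub-agent
    summary = f"{task}: {len(document)} chars -> 1-line summary"
    return summary                      # the full text never reaches the parent

papers = {"Paper A": "..." * 5000, "Paper B": "..." * 7000}
main_context = [run_subagent(f"Read {name}", text) for name, text in papers.items()]
# main_context holds two short strings, not ~36,000 characters of paper text
```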
Tech · Context Compression
From Hung-yi Lee's Course

Context Overflow — How to Compress? (1/2)

Why Does Context Grow So Fast?

Tool outputs (tool observations) account for 84% of context
The model's own words are only ~10%

Method 1 of 2

1

Observation Masking

Replace tool output with a single line:
There used to be tool output here

Looks brutal — but experiments show it works about as well as LLM summarization
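A minimal sketch of observation masking, keeping only the most recent tool output verbatim. The function name and history format are assumptions for illustration:

```python
# Observation masking: replace older tool outputs with a one-line
# placeholder instead of summarising them. Names here are illustrative.

MASK = "[There used to be tool output here]"

def mask_observations(history, keep_last=1):
    """Mask all tool-output turns except the most recent keep_last of them."""
    tool_turns = [i for i, (role, _) in enumerate(history) if role == "tool"]
    to_mask = set(tool_turns[:-keep_last] if keep_last else tool_turns)
    return [(role, MASK if i in to_mask else text)
            for i, (role, text) in enumerate(history)]

history = [
    ("user", "run the tests"),
    ("tool", "500 lines of pytest output..."),
    ("assistant", "3 tests failed"),
    ("tool", "fixed, all green"),
]
masked = mask_observations(history)
# only the older tool output is replaced; the latest one is kept verbatim
```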

Tech · Context Compression
From Hung-yi Lee's Course

Context Overflow — How to Compress? (2/2)

2

LLM Summarization

History too long → compress it with LLM summarization
Claude Code has this compaction mechanism built in

Sub-agent = Automatic Compression

Main Agent spawns Sub-agent → Sub-agent accumulates its own Context
↓ sub-agent return
Sub-agent's entire conversation vanishes, only the returned sentence remains

Language models dislike compressing their own memory
So compression is usually enforced by the Agent framework

Tech · Tool Use
Function Calling

How Agent "Acts": The Tool Use Loop

Agent receives task
Sent to LLM (with tool list)
[tool_use] Read("main.py")
Execute Read("main.py") on computer
Return file contents
Add to context, call LLM again
↓ Loop until task complete
[tool_use] Write("fix.py", ...)
Write to file
"done" [END]

Why it's powerful: shell commands are text, and text is exactly what LLMs excel at
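The loop on this slide can be sketched end-to-end. The LLM here is a scripted stub so the sketch is self-contained — real agents parse the model's tool-call output instead:

```python
# Minimal sketch of the tool-use loop. fake_llm is a stand-in for the
# model that scripts two tool calls; everything here is illustrative.

def fake_llm(context):
    """Stand-in for the model: request tools until the fix is written."""
    if "Read(main.py)" not in context:
        return ("tool_use", "Read", "main.py")
    if "Write(fix.py)" not in context:
        return ("tool_use", "Write", "fix.py")
    return ("end", "done", None)

def run_agent(task):
    context, trace = task, []
    while True:
        kind, name, arg = fake_llm(context)
        if kind == "end":
            return trace
        trace.append(f"{name}({arg})")
        context += f"\n{name}({arg}) -> ok"  # tool result appended, loop again

print(run_agent("fix the failing test"))
```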

Tech · Memory
Memory Architecture

The System Prompt Holds the Agent's "Soul"

  • CLAUDE.md: project rules, coding conventions, notes
  • memory/*.md: user preferences, past decisions, long-term memory
  • skills/*.md: reusable workflows (skill library)
  • Tool list: Read, Write, Bash, WebSearch…
  • At every LLM call, all of this is stuffed into the system prompt
# CLAUDE.md example
## Rules
- Reply in English
- Check GPU usage before running experiments
- Do not commit untested code
## Project Context
Top-venue paper, ddl: 5/31

A single conversation can load 4000+ tokens of system info —
that's why the agent can continuously "remember" project state

Tech · Persistence
Hooks · Cron · Memory
🪝
Hooks
Run shell commands before / after any tool call — validate, log, notify
# settings.json
hooks:
  PostToolUse:
    - command: "notify.sh"
      match: Bash
Cron
Schedule agent runs at fixed intervals — monitor experiments, track papers, send digests
# every 30 min
CronCreate(
  "*/30 * * * *",
  "check experiments"
)
🧠
Auto-Memory
Agent writes structured memory files — persisted across all future sessions
# ~/.claude/projects/
MEMORY.md      # index
memory/
  user.md      # preferences
  project.md   # current goals

Agent as a persistent teammate — remembers context across sessions, reacts to events automatically, and schedules its own check-ins.

04
Part Four · Discussion
What Has AI Agent
Changed?
Research · Engineering · Human Value
Discussion · New Mindset
Working in the Agent Era

Work is about being accountable for outcomes
Not the process, not the lines of code, not "I wrote it myself"

Andrej Karpathy

"Vibe coding — fully give in to the vibes, embrace exponentials, and forget that the code even exists."

X · Feb 2025 → now: "agentic engineering"

Sam Altman · OpenAI

"We may see the first AI agents join the workforce and materially change the output of companies."

Blog · Jan 2025

Dario Amodei · Anthropic

"AI could soon compress decades of scientific progress into just a few years."

Machines of Loving Grace · Oct 2024

Discussion · Research
Research

The Paper Inflation Era

Agents let anyone churn out papers fast — hundreds per day on arXiv, and reviewers already can't keep up

arXiv annual submissions · arxiv.org/stats (2025: exceeded 28,000/month)

arXiv new submissions per year by subject area (1991–2021)
Discussion · Research
Research

How Will the Evaluation System Shift?

Maybe: asking the right question is the real core competency

Discussion · Research
Agents4Science 2025

The First AI-Only Academic Conference

Stanford · Oct 2025 · agents4science.stanford.edu

Agents4Science 2025 conference statistics figure
  • 314 submitted · 48 accepted (~16%)
  • AI as sole first author + reviewer on every paper
  • Accepted papers: more human input on hypothesis & experimental design
  • AI reviews consistent but shallow — "neither interesting nor important"
  • ~44% of submissions had hallucinated references

AI can generate papers — but asking the right question still needs humans

Discussion · AI Agent's Impact
Academic Research

Will AI Agent Replace Researchers?

Ref: Hung-yi Lee · AI Agent (3/3) · NTU 2026

Andrew Hall (Stanford) · "100x Research Assistant"

PhD student: 16h / $1,040  vs  Claude Code: 1h / $10 (104× cheaper)

But: humans haven't been replaced

  • AI ideas look novel — but don't outperform humans after execution
  • Accepted papers: humans intervene more on idea and experiment design
  • AI excels at execution, but "what to do" still needs human judgment
Discussion · Engineering
Engineering

Senior Engineers Are 10× More Productive — Why Still Hire Juniors?

People vs. Agents

  • Juniors: onboard · mentor · review
  • Agent: follows orders · always on · no salary
  • 1 person + agent = team of 5
  • Junior's repetitive work? Agent handles it all
  • Headcount logic is being repriced

Do juniors still have a shot?

  • Yes — but find your irreplaceable part earlier
  • Systems judgment · user empathy · defining the real problem
  • Direct agents = new era's 10× engineer
  • Solo full-stack: 5 years → 1 year

This isn't pessimism — it's a new starting line

Discussion · Engineering
Engineering

Anthropic Labor Market Study (2026)

  • Programmers: AI covers 75% of job tasks
  • AI-exposed jobs: avg. earnings 47% higher than unexposed
  • Workers aged 22–25 in high-AI-exposure occupations: 13% drop in employment (Stanford, 2025)
  • Global workforce: 3.3 billion workers worldwide
  • Affected workers: 1.1 billion — roughly 1 in 3 globally
Claude Career Report (English)
🎁 Bonus
gnvitop
Global GPU Monitor
A side project built during research — one command to monitor all your lab GPUs
🎁 Bonus
Tool Recommendation

gnvitop — Global GPU Monitor

Real-time GPU monitoring across all lab servers —
see which cards are idle, check if you're hogging too many.

pip install gnvitop # install
gnvitop # run
  • Auto-reads ~/.ssh/config, SSHs into all servers
  • Live web dashboard showing all GPU status
  • Supports ProxyJump bastion hosts
  • Current user's processes highlighted in blue
gnvitop dashboard

Thanks

Q & A

github.com/Linwei94/talks