Web App · Top-venue Paper · Technical Deep Dive · Discussion
The Steam Engine Moment of the Information Age
Think of it like a Markov chain: given all previous tokens, predict the next one — then repeat. The model never “plans ahead”; it only ever answers: what comes next?
Context window = working memory
Everything the model sees is in one long sequence — no persistent state between calls.
Output = probability distribution over vocabulary
The model outputs a distribution P(next token | all previous). Temperature controls how sharp that distribution is.
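This next-token sampling loop can be sketched in a few lines of Python. This is a toy illustration, not a real model API: `logits` here is just a plain list of floats standing in for the model's output scores.

```python
import math, random

def sample_next_token(logits, temperature=1.0):
    """Sample a token index from a logit vector.

    Temperature < 1 sharpens the distribution (more deterministic);
    temperature > 1 flattens it (more random). Toy sketch only.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]          # softmax = P(next token | context)
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# With T=0.01 the argmax token (index 0) is chosen almost surely:
print(sample_next_token([2.0, 1.0, 0.1], temperature=0.01))
```

Generation is then just this call in a loop: append the sampled token to the context and predict again.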
Everything LLM sees on each call:
The Agent has no memory — it can only see what's in the context
Context Window Evolution
But How Large Is Your Project?
What Agent does:
For every LLM call, decide which information goes into context —
which files to read, which preferences to remember, which history to drop
📂
Which files to read
🧠
Which prefs to keep
🗑️
Which history to drop
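The selection job above can be sketched as a greedy context assembler. Everything here is an illustrative sketch: the `len()//4` token estimate is a crude heuristic, not a real tokenizer, and real agents use far more sophisticated policies.

```python
def build_context(files: dict, prefs: list, history: list, budget: int = 8000) -> str:
    """Greedily pack preferences, then files, then the newest history
    into a fixed token budget. Illustrative sketch only."""
    approx = lambda s: len(s) // 4            # rough tokens-per-string estimate
    parts, used = [], 0
    for item in prefs + list(files.values()):
        cost = approx(item)
        if used + cost > budget:
            continue                          # drop what doesn't fit
        parts.append(item)
        used += cost
    for msg in reversed(history):             # newest messages first
        cost = approx(msg)
        if used + cost > budget:
            break                             # oldest history gets dropped
        parts.append(msg)
        used += cost
    return "\n".join(parts)
```

The key design choice mirrors the slide: preferences and files are packed first, and history is dropped oldest-first when the budget runs out.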
Today's Core Argument
A structural shift in efficiency, not just a new tool.
We had no system at all before — everything was in our heads or on paper, total chaos.
In between research sessions, I spent a few days building her a system with Claude Code.
A small side project alongside research
Traditional Web Development
With Claude Code ✦
App Development
✓ schema.sql generated
✓ HTML + CSS written
✓ errors auto-fixed
✓ ready to deploy
The execution barrier has dropped — crossing domains is no longer the bottleneck. Your ceiling is how many worthwhile things you can think to do.
Conformal Prediction × Annotation Ambiguity
When ground truth is inherently ambiguous (annotators disagree), do CP's coverage guarantees still hold?
Calibration in This Setting Hasn't Been Done
Annotation ambiguity + calibration — even “how to evaluate” has no consensus
Pivot → CalibrationAGT
Calibration in this setting is equally worth systematic study.
Literature Review — papers, abstracts, notes · All text
Research Proposal — hypothesis, design, rationale · Text
Experiment Scripts — code is text
Errors & Results — stack traces, CSV, logs · Text with instant feedback
Figures — matplotlib / seaborn script · Also text
Paper (LaTeX) — everything converges · Into text
Before: ChatGPT
Now: Agent
You set the direction — Agent reads, executes, verifies
→ You're the Director, not the middleman
What is a Skill?
A Markdown file telling the Agent how to handle a task category. Trigger it — the Agent runs the full workflow.
What is Skill-Creator?
A skill for writing skills. Describe your need, and the Agent generates a structured workflow spec and saves it.
Use Cases
Literature review · paper writing · experiments · code review · data preprocessing — any repeatable workflow
Skills = reusable workflows — define once, invoke repeatedly
Examples: /paper-review review workflow ·
/experiment-log experiment record ·
/debug-cluster GPU cluster debug ·
/weekly-report auto weekly report
Pipeline
Phase 0 · Setup
venue · topic · compute → config.md
Idea Loop · Phase 1–5
Literature Review
ArXiv MCP · WebSearch · gap identification
Idea Generation
Generate 4 candidate ideas
6-Agent Debate ← subagents
Critic · Champion · Devil's Advocate…
AC Gate
REVISE → loop · REJECT → drop · ACCEPT ↓
Pilot Experiment
Quick feasibility check · PASS to continue
Full Experiments · GPU Auto
SSH · nvitop scheduling · autonomous execution
6-Agent Result Debate ← subagents
Result interpretation · Contribution positioning
Paper Writing + Figures
seaborn figures · parallel section writing
Review ← subagent → Submit
Revisions · Telegram notification
/ai-research-paper is just a starting point —
any repetitive workflow can be packaged as a skill.
A More Automated Research Pipeline
AutoResearch · SibylSystem
github.com/Sibyl-Research-Team/
AutoResearch-SibylSystem
Scan the QR code to visit
Figure 1: LUMI-lab overview — foundation model + active learning + robotic lab → 1,700 LNPs → 20.3% lung gene editing in vivo
Run in tmux on a server — session persists across disconnects
VS Code plugin cannot guarantee session continuity
VS Code plugin is great for instant tasks — e.g. making slides
Good for short, focused tasks that finish quickly
Terminal: use Ghostty — officially recommended by Anthropic
Claude Code is still early-stage — Ghostty has the fewest bugs
Chrome extension worth trying — can directly control the browser UI
Better compatibility than the local desktop app
Claude Code
Light use — ~4 hours of focused work per day
Sufficient for most — daily research + project work covered
Run 5–10 projects in parallel
Codex (OpenAI)
Limited Codex access included
Parallel tasks, 5h rolling cap + weekly limit — not truly unlimited
Task too big to fit in one context?
The main agent spawns multiple sub-agents that handle subtasks in parallel, returning only summarized results to the main context.
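The sub-agent pattern can be sketched as follows. `run_subagent` is a hypothetical stand-in for whatever framework call actually spawns an isolated LLM context; the point is the information flow, not the API.

```python
def run_subagent(task: str) -> str:
    """Stand-in for a framework call that spawns an isolated context.

    A real sub-agent would burn thousands of tokens reading files and
    running tools; only this short summary returns to the caller.
    """
    return f"summary of: {task}"

def main_agent(subtasks: list[str]) -> str:
    # The main context only ever holds the summaries, never the
    # sub-agents' full transcripts.
    return "\n".join(run_subagent(t) for t in subtasks)

print(main_agent(["read paper A", "run pilot experiment"]))
```

Each sub-agent's context is discarded when it finishes, so the main context grows only by the size of the summaries.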
Why Does Context Grow So Fast?
Tool outputs (tool observations) account for 84% of context
The model's own words are only ~10%
Method 1 of 2
Observation Masking
Replace tool output with a single line:
“There used to be tool output here”
Looks brutal — but experiments show it works about as well as LLM summarization
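A minimal sketch of observation masking. The message format here is illustrative, not Claude Code's actual internals: history is just a list of role/content dicts, and old tool observations are overwritten with a stub.

```python
MASK = "[tool output elided]"

def mask_old_observations(history: list, keep_last: int = 1) -> list:
    """Replace all but the `keep_last` most recent tool observations
    with a one-line stub (keep_last must be >= 1). Illustrative sketch.
    """
    tool_idx = [i for i, m in enumerate(history) if m["role"] == "tool"]
    for i in tool_idx[:-keep_last]:            # everything but the newest few
        history[i] = {"role": "tool", "content": MASK}
    return history
```

Since tool outputs dominate context, masking the stale ones reclaims most of the budget without an extra LLM call.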
LLM Summarization
History too long → compress it with LLM summarization
Claude Code has this compaction mechanism built in
Sub-agent = Automatic Compression
Language models dislike compressing their own memory
So compression is usually enforced by the Agent framework
Why it's powerful: shell commands are text, and text is exactly what LLM excels at
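The whole loop stays inside the model's native medium: a command is text, and its output comes back as text the model can read and react to. A minimal sketch of that round trip:

```python
import subprocess

def run_shell(cmd: str) -> str:
    """Run a shell command and return its combined output as text.

    An agent loop would feed this string back into the context and let
    the model decide the next command. Sketch only; a real harness adds
    timeouts, sandboxing, and output truncation.
    """
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

print(run_shell("echo hello"))
```

Because both sides of the exchange are plain strings, no special tool integration is needed for anything the shell can already do.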
A single conversation may carry 4,000+ tokens of system information:
this is why the agent can keep "remembering" project state
Agent as a persistent teammate — remembers context across sessions, reacts to events automatically, and schedules its own check-ins.
Work is about being accountable for outcomes
Not accountable for the process, not for lines of code, not for "I wrote it myself"
"Vibe coding — fully give in to the vibes, embrace exponentials, and forget that the code even exists."
X · Feb 2025 → now: "agentic engineering"
"We may see the first AI agents join the workforce and materially change the output of companies."
Blog · Jan 2025
"AI could soon compress decades of scientific progress into just a few years."
Machines of Loving Grace · Oct 2024
Agent lets anyone churn out papers fast — hundreds per day on arxiv, reviewers already can't keep up
arXiv annual submissions · arxiv.org/stats (2025: exceeded 28,000/month)
Maybe: asking the right question is the real core competency
Stanford · Oct 2025 · agents4science.stanford.edu
AI can generate papers — but asking the right question still needs humans
Ref: Hung-yi Lee · AI Agent (3/3) · NTU 2026
Andrew Hall (Stanford) · "100x Research Assistant"
PhD student: 16h / $1,040 vs Claude Code: 1h / $10 (104× cheaper)
But: humans haven't been replaced
People vs. Agents
Do juniors still have a shot?
This isn't pessimism — it's a new starting line
Agent Harness = scaffolding around the model: tool definitions, prompts, workflows, context management. Same model. Different harness. Completely different results.
Anthropic · CORE-Bench
Same model (Claude Opus 4.5). Switched from a generic scaffold to Claude Code harness. +36 points.
LangChain · Terminal Bench 2.0
Model unchanged. Added self-verification loops, context engineering, loop detection. Top 30 → Top 5.
OpenAI · Codex Harness
Built a 1M-line production app in 5 months with 3 engineers — 1/10th normal dev time. Every line written by Codex agents.
Progressive Disclosure: show the model only what it needs, when it needs it, and hide everything else — this single mechanism explains most harness improvements.
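One common form of progressive disclosure is previewing large resources instead of dumping them into context. A hypothetical sketch, operating on a string for simplicity rather than a real file read:

```python
def disclose(text: str, max_chars: int = 200) -> str:
    """Return only a short preview of a large resource; the agent must
    issue an explicit follow-up read to see the rest. Sketch only."""
    if len(text) <= max_chars:
        return text
    hidden = len(text) - max_chars
    return text[:max_chars] + f"\n[{hidden} more chars hidden; request a full read]"

# A 10,000-char file enters the context as ~200 chars plus a stub:
preview = disclose("x" * 10_000)
```

The agent sees enough to decide whether the rest is worth its token budget, which is the core of the mechanism.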
Tell the agent exactly when NOT to use a tool
One engineer connected 12 tools. The agent kept calling the same endpoint twice with different params. He cut to 4 tools with precise descriptions — 40% fewer useless calls.
One rule in a config file beats adding a new tool
Vercel's agent had a huge toolbox. Agent got confused, made redundant calls. They stripped it down to bare bash access. Success rate: 100%. Speed: 3.5× faster.
The harness you built for last year's model may be hurting today's
Manus rewrote their harness 5 times in 6 months, stripping complexity each time. One engineer deleted an entire memory system on a Thursday — by Friday, response latency dropped 2.3s.
"The stronger the model, the more we should get out of its way instead of blocking it." — Peak Ji, Chief Scientist at Manus
Real-time GPU monitoring across all lab servers —
see which cards are idle, check if you're hogging too many.
Reads ~/.ssh/config, SSHs into all servers
Thanks
github.com/Linwei94/talks