Seminar · 2026.03.24
Linwei Tao
AI Agent in Practice

AI Agent for Research & Engineering

Web App · Top-venue Paper · Technical Deep Dive · Discussion
The Steam Engine Moment of the Information Age

claude — ~/CalibrationAGT
Today's Two Projects
Live Demo
Arigato · Cat Boarding Web App
CalibrationAGT · Top-venue Paper
Agenda
🐱
Part 1 · Engineering
Cat Boarding Web App — a production app built with Claude Code in the gaps between research
~5 min
🔬
Part 2 · Research
CalibrationAGT — how a top-venue paper was completed using AI Agent
~20 min
⚙️
Part 3 · Technical Anatomy
How does AI Agent actually work? Tool use, Memory, Sub-agent
~10 min
💭
Part 4 · Discussion
Paper inflation, Senior productivity ×10 — how will research and engineering change?
~20 min
🎁
Bonus
Stay tuned
You've Heard of These
AI Coding Agents

You've Probably Heard These Names

GitHub Copilot
Cursor
Windsurf
Gemini CLI
Claude Code ✦
OpenClaw
Codex
OpenClaw
  • General personal assistant, not a coding tool
  • Powerful memory design — but burns tokens
  • Chat via WhatsApp / Telegram / 15+ apps
Claude Code ✦ Today's tool
  • Rich ecosystem: Skills, MCP, Hook — highly extensible
  • Best for vibe coding & vibe research
Codex GPT-5.4
  • Claude Code's main rival — model keeps improving
  • Ecosystem still catching up — fewer skills & integrations
Core difference: developer & research Agent vs personal life assistant

The Essence of
AI Agent

Foundation · How LLMs Work
Autoregressive Generation

Most LLMs Generate One Token at a Time

Think of it like a Markov chain: given all previous tokens, predict the next one — then repeat. The model never “plans ahead”; it only ever answers: what comes next?

Input
The cat sat on the mat
Loop Back
The cat sat on the mat .
↑ The newly generated token is appended to the input, and the model runs again.

Context window = working memory

Everything the model sees is in one long sequence — no persistent state between calls.

Output = probability distribution over vocabulary

The model outputs a distribution P(next token | all previous). Temperature controls how sharp that distribution is.
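The loop above can be sketched in a few lines of Python. This is a toy illustration, not any real model's API: `model` is a stand-in that returns logits over the vocabulary, and the names are all hypothetical.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into a probability distribution; lower temperature sharpens it."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                  # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

def generate(model, tokens, max_new_tokens, temperature=1.0, rng=None):
    """Autoregressive loop: predict the next token, append it, run again."""
    rng = rng or np.random.default_rng(0)
    for _ in range(max_new_tokens):
        logits = model(tokens)              # P(next token | all previous), as logits
        probs = softmax_with_temperature(logits, temperature)
        next_token = int(rng.choice(len(probs), p=probs))
        tokens = tokens + [next_token]      # the output is looped back into the input
    return tokens
```

Lowering `temperature` concentrates probability mass on the top tokens; raising it flattens the distribution toward uniform sampling.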

What Is an AI Agent?
Context Engineering

Agent Has No Memory — It Only Sees the Context

Everything LLM sees on each call:

[ sys rules ]
CLAUDE.md · tool list · memory files
[ history ]
You: "Help me analyze..." → Claude: "Reading the file..."
[ tool out ]
Read(main.py) → 500 lines of code
[ input ▶ ]
"Now help me write the tests"
// packed together → sent to LLM
output = LLM(context)  // predict next token

Agent has no memory — it can only see what's in the context
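The packing step can be sketched as below. The section labels and helper are illustrative, not Claude Code's actual wire format.

```python
def build_context(system_rules, history, tool_outputs, user_input):
    """Pack everything the LLM will see into one flat string.
    There is no other state: if it isn't in this string, the model can't know it."""
    parts = [
        "[system]\n" + system_rules,            # CLAUDE.md, tool list, memory files
        "[history]\n" + "\n".join(history),     # prior conversation turns
        "[tools]\n" + "\n".join(tool_outputs),  # e.g. the 500 lines from Read(main.py)
        "[user]\n" + user_input,                # the new request
    ]
    return "\n\n".join(parts)

# each call starts from scratch:
# output = llm(build_context(rules, history, tool_outputs, user_input))
```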

What Is an AI Agent?
Context Engineering

Context Window Is Growing — But Still Not Enough

Context Window Evolution

Model · Context · Released
GPT-4o
128K
2024.05
Claude Sonnet 4
200K
2025.07
GPT-5.2
400K
2025.12
GPT-5.4
1M
2026.03
Claude Sonnet 4.6
1M
2026.03

But How Large Is Your Project?

A top-venue paper ~8K tokens ✓
A PhD thesis ~100K tokens ✓
CalibrationAGT codebase ~400K tokens ⚠
Medium-sized production codebase ~2–5M tokens ✗
Your habits + env config + docs Can't fit at all ✗
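A rough way to check where your own project lands is the common heuristic of ~4 characters per token for English text. This is an assumption, not a tokenizer: real token counts vary by language and content, and the extension list here is just an example.

```python
import os

def estimate_tokens(path, chars_per_token=4):
    """Rough context-size estimate: ~4 characters per token of English text.
    A heuristic only — real tokenizers vary by language and content."""
    total_chars = 0
    for root, _, files in os.walk(path):
        for name in files:
            if name.endswith((".py", ".md", ".tex")):   # example extensions
                with open(os.path.join(root, name), errors="ignore") as f:
                    total_chars += len(f.read())
    return total_chars // chars_per_token
```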
What Is an AI Agent?
Context Engineering

Agent's Core Job: Decide What to Feed

What Agent does:

For every LLM call, decide which information goes into context —
which files to read, which preferences to remember, which history to drop

📂

Which files to read

🧠

Which prefs to keep

🗑️

Which history to drop

Today's Core Argument

Productivity Is Being
Repriced

A structural shift in efficiency, not just a new tool.

01
Part One · Engineering
Cat Boarding
Web App
Zero to Production, Built End-to-End with Claude Code
Engineering · Background
Side Project

Cat Boarding Management System 🐱

We had no system at all before — everything was in our heads or on paper, total chaos.
In between research sessions, I spent a few days building her a system with Claude Code.

📋 Before
Notebooks, WeChat group messages, kept in memory — check-in/out times were constantly getting mixed up
✅ Now
Boarding management, home visits, booking links, e-signatures, revenue tracking — complete feature set
Engineering · System Demo
Desktop

Arigato · Cat Boarding Web App

taolinwei.com/cat-boarding-system
Engineering · Build Process
Claude Code in Action

Traditional Web Development

Break Down Requirements Requirements doc
feature breakdown + schedule
Database Design schema.sql
hand-written migrations
UI Design Wireframes → components
hand-coded HTML + CSS
Frontend Dev Pure HTML + ES Modules
no framework, no build step
Debug & Iterate Stack Overflow + docs
manual debugging
Production Launch Real users
Real business

With Claude Code ✦

App Development

You: I want to build a cat boarding app.

✓ schema.sql generated

✓ HTML + CSS written

✓ errors auto-fixed

✓ ready to deploy

Engineering · Implications
What This Means
🏢
Solo Company
One person doing what once needed a full team
Pieter Levels — 0 employees · $3.1M ARR · 70+ products
Lovable — 1 founder · $100M ARR in 8 months
🔬
Researcher's Toolbox
Build whatever tool you're missing
Komodo Health — 6–8-week analyses cut to hours
Patrick Mineault — NeuroAI researcher · analysis pipelines built with Claude Code
💡
Startup MVP
Validate an idea in days, not months
Indragie — macOS app · 19k of 20k lines written by Claude
Invoice system — built in 1 day · $3.65 API cost
🛠️
Self-Sufficient Living
Build what you need, when you need it
Home Assistant community — smart home built from scratch with Claude Code
Non-programmer — personal finance app · Claude Opus 4 + Cursor · 0 lines hand-written

The execution barrier has dropped — crossing domains is no longer the bottleneck. Your ceiling is how many worthwhile things you can think to do.

02
Part Two · Research
CalibrationAGT
Confidence Calibration Under Annotation Ambiguity
Research · Motivation
CalibrationAGT

How Did This Paper Start?

Example of annotation ambiguity
01

Conformal Prediction × Annotation Ambiguity

When ground truth is inherently ambiguous (annotators disagree), do CP's coverage guarantees still hold?

02

Calibration in This Setting Hasn't Been Done

Annotation ambiguity + calibration — even “how to evaluate” has no consensus

03

Pivot → CalibrationAGT

Calibration in this setting is equally worth systematic study.

Research · Workflow
Every Step Is Text

Breaking Down Research in AI — Every Step Is Just Text

📚

Literature Review — papers, abstracts, notes · All text

💡

Research Proposal — hypothesis, design, rationale · Text

🔬

Experiment Scripts — code is text

⚠️

Errors & Results — stack traces, CSV, logs · Text with instant feedback

📊

Figures — matplotlib / seaborn script · Also text

📄

Paper (LaTeX) — everything converges · Into text

Research · Background
LLM → Agent

From LLM to Agent

Before: ChatGPT

# Error → copy, paste, ask, copy back
Error: KeyError: 'ece_true'
↓ manually copy → paste → ask ChatGPT
Fix: metrics[key] = ece_fn(logits, labels)
You are the middleman — manually ferrying information back and forth

Now: Agent

# Error → Agent handles it end-to-end
Error: KeyError: 'ece_true'
↓ Agent reads the file, locates the cause, fixes it, runs tests
✓ All tests passed. Fix pushed.

You set the direction — Agent reads, executes, verifies
→ You're the Director, not the middleman

Research · Claude Code in Action
Live Interaction
claude — ~/CalibrationAGT
Research · Toolchain
Skills & Skill-Creator

Skills — Custom Workflows for Your Agent

What is a Skill?

A Markdown file telling the Agent how to handle a task category. Trigger it — the Agent runs the full workflow.

What is Skill-Creator?

A skill for writing skills. Describe your need — the Agent generates a structured workflow spec and saves it.

Use Cases

Literature review · paper writing · experiments · code review · data preprocessing — any repeatable workflow

Skills = reusable workflows — define once, invoke repeatedly

Examples: /paper-review review workflow · /experiment-log experiment record · /debug-cluster GPU cluster debug · /weekly-report auto weekly report

claude — ~/research
Research · Skill in Action
ai-research-paper

Pipeline

🔧

Phase 0 · Setup

venue · topic · compute → config.md

↻ LOOP

Idea Loop · Phase 1–5

📚

Literature Review

ArXiv MCP · WebSearch · gap identification

💡

Idea Generation

Generate 4 candidate ideas

🤖

6-Agent Debate ← subagents

Critic · Champion · Devil's Advocate…

🚪

AC Gate

REVISE → loop · REJECT → drop · ACCEPT

🔬

Pilot Experiment

Quick feasibility check · PASS to continue

↓ ACCEPT

Full Experiments · GPU Auto

SSH · gnvitop scheduling · autonomous execution

📊

6-Agent Result Debate ← subagents

Result interpretation · Contribution positioning

✍️

Paper Writing + Figures

seaborn figures · parallel section writing

📱

Review ← subagent → Submit

Revisions · Telegram notification

claude — ~/CalibrationAGT
Research · Going Further
Just the Beginning

/ai-research-paper is just a starting point —
any repetitive workflow can be packaged as a skill.

A More Automated Research Pipeline

AutoResearch · SibylSystem

github.com/Sibyl-Research-Team/
AutoResearch-SibylSystem

Star

Scan the QR code to visit

Research · Example
LUMI-lab · Cell 2026

AI Agent × Wet Lab: Self-Driving Molecular Discovery

LUMI-lab Figure 1

Figure 1: LUMI-lab overview — foundation model + active learning + robotic lab → 1,700 LNPs → 20.3% lung gene editing in vivo

Research · More Examples
Skills in Action
/brainstorm — ~/CalibrationAGT
/frontend-slides — ~/talks
Research · Usage Tips
Tips

Tips for Using Claude Code

1

Run in tmux on a server — session persists across disconnects

VS Code plugin cannot guarantee session continuity

2

VS Code plugin is great for instant tasks — e.g. making slides

Good for short, focused tasks that finish quickly

3

Terminal: use Ghostty — officially recommended by Anthropic

Claude Code is still early-stage — Ghostty has the fewest bugs

4

Chrome extension worth trying — can directly control the browser UI

Better compatibility than the local desktop app

Research · Usage Tips
Pricing

Pricing

Claude Code

Pro $20 / mo

Light use — ~4 hours of focused work per day

Max · 5× $100 / mo

Sufficient for most — daily research + project work covered

Max · 20× $200 / mo

Run 5–10 projects in parallel

Codex (OpenAI)

ChatGPT Plus $20 / mo

Limited Codex access included

ChatGPT Pro $200 / mo

Parallel tasks, 5h rolling cap + weekly limit — not truly unlimited

03
Part Three · Technical Anatomy
Technical Anatomy
How Does an AI Agent Actually Work?
Reference: Hung-yi Lee Machine Learning Course 2025 · NTU
Tech · Sub-agent
Context Engineering

Sub-agent: Solving the Context Bottleneck

Task too big to fit in one context?
The main agent spawns multiple sub-agents that handle subtasks in parallel, returning only summarized results to the main context.

Main Agent
→ spawn
Sub A: Read Paper A
→ spawn
Sub B: Read Paper B
← Returns summary only, not full text
  • Sub-agent has a lean system prompt, focused on a single subtask
  • Multiple sub-agents run in parallel — faster
  • Context window contains only task-relevant info — more accurate reasoning
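The fan-out pattern can be sketched as below. This is illustrative, not Claude Code's internal API; `llm` is a stand-in callable and the prompts are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(llm, task, material):
    """Each sub-agent gets a lean prompt and its own private context.
    Everything it reads stays here; only the final summary is returned."""
    private_context = f"You are a focused summarizer.\nTask: {task}\n\n{material}"
    return llm(private_context)

def main_agent(llm, papers):
    """Fan out one sub-agent per paper, in parallel.
    The main context only ever sees the summaries, never the full texts."""
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(
            lambda paper: run_subagent(llm, "Summarize this paper", paper), papers))
    return summaries
```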
Tech · Context Compression
From Hung-yi Lee's Course

Context Overflow — How to Compress? (1/2)

Why Does Context Grow So Fast?

Tool outputs (tool observations) account for 84% of context;
the model's own words are only ~10%

Method 1 of 2

1

Observation Masking

Replace tool output with a single line:
“There used to be tool output here”

Looks brutal — but experiments show it works about as well as LLM summarization
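Observation masking can be sketched over a chat-style message list. The message schema here is an illustrative assumption, not any framework's actual format.

```python
def mask_old_observations(messages, keep_last=2):
    """Replace all but the newest tool outputs with a one-line placeholder.
    Since tool observations dominate context growth, this frees most of the
    space while keeping the conversation's shape intact."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_mask = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    return [
        {**m, "content": "[there used to be tool output here]"} if i in to_mask else m
        for i, m in enumerate(messages)
    ]
```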

Tech · Context Compression
From Hung-yi Lee's Course

Context Overflow — How to Compress? (2/2)

2

LLM Summarization

History too long → compress it with LLM summarization
Claude Code has this compaction mechanism built in

Sub-agent = Automatic Compression

Main Agent spawns Sub-agent → Sub-agent accumulates its own Context
↓ sub-agent return
The sub-agent's entire conversation vanishes; only the returned sentence remains

Language models dislike compressing their own memory,
so compression is usually enforced by the Agent framework

Tech · Tool Use
Function Calling

How the Agent Gets Hands-On: the Tool Use Loop

Agent receives task
Sent to LLM (with tool list)
[tool_use] Read("main.py")
Execute Read("main.py") on the machine
Return file contents
Add to context, call LLM again
↓ Loop until task complete
[tool_use] Write("fix.py", ...)
Write to file
"done" [END]

Why it's powerful: shell commands are text, and text is exactly what LLMs excel at
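The loop on this slide can be sketched as below. This is a minimal illustration under assumed interfaces: `llm` returns either a tool request or a final answer, and the dict shapes are invented for the sketch.

```python
def agent_loop(llm, tools, task, max_steps=10):
    """Minimal tool-use loop: the LLM either requests a tool call or finishes.
    Each tool result is appended to the context and the LLM is called again."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm(context)  # {"tool": name, "args": (...)} or {"done": answer}
        if "done" in action:
            return action["done"]
        result = tools[action["tool"]](*action.get("args", ()))
        context.append({"role": "tool", "content": str(result)})
    return None  # step budget exhausted
```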

Tech · Memory
Memory Architecture

The System Prompt Carries the Agent's “Soul”

  • CLAUDE.md: project rules, coding conventions, notes
  • memory/*.md: user preferences, past decisions, long-term memory
  • skills/*.md: reusable workflows (skill library)
  • Tool list: Read, Write, Bash, WebSearch…
  • At every LLM call, all of this is stuffed into the system prompt
# CLAUDE.md example
## Rules
- Reply in English
- Check GPU usage before running experiments
- Do not commit untested code
## Project Context
Top-venue paper, ddl: 5/31

A single conversation may pack in 4,000+ tokens of system information —
this is why the agent can keep “remembering” project state

Tech · Persistence
Hooks · Cron · Memory
🪝
Hooks
Run shell commands before / after any tool call — validate, log, notify
# settings.json
hooks:
  PostToolUse:
    - command: "notify.sh"
      match: Bash
Cron
Schedule agent runs at fixed intervals — monitor experiments, track papers, send digests
# every 30 min
CronCreate(
  "*/30 * * * *",
  "check experiments"
)
🧠
Auto-Memory
Agent writes structured memory files — persisted across all future sessions
# ~/.claude/projects/
MEMORY.md      # index
memory/
  user.md      # preferences
  project.md   # current goals

Agent as a persistent teammate — remembers context across sessions, reacts to events automatically, and schedules its own check-ins.

04
Part Four · Discussion
What Has AI Agent
Changed?
Research · Engineering · Human Value
Discussion · New Mindset
Working in the Agent Era

Work is about being accountable for outcomes —
not for the process, not for lines of code, not for “I wrote this myself”

Andrej Karpathy

"Vibe coding — fully give in to the vibes, embrace exponentials, and forget that the code even exists."

X · Feb 2025 → now: "agentic engineering"

Sam Altman · OpenAI

"We may see the first AI agents join the workforce and materially change the output of companies."

Blog · Jan 2025

Dario Amodei · Anthropic

"AI could soon compress decades of scientific progress into just a few years."

Machines of Loving Grace · Oct 2024

Discussion · Research
Research

The Paper Inflation Era

Agent lets anyone churn out papers fast — hundreds per day on arXiv; reviewers already can't keep up

arXiv annual submissions · arxiv.org/stats (2025: exceeded 28,000/month)

arXiv new submissions per year by subject area (1991–2021)
Discussion · Research
Research

How Will the Evaluation System Shift?

Maybe: asking the right question is the real core competency

Discussion · Research
Agents4Science 2025

The First AI-Only Academic Conference

Stanford · Oct 2025 · agents4science.stanford.edu

Agents4Science 2025 conference statistics figure
  • 314 submitted · 48 accepted (~16%)
  • AI as sole first author + reviewer on every paper
  • Accepted papers: more human input on hypothesis & experimental design
  • AI reviewing: highly consistent, but less insightful than human reviewers
  • ~44% of submissions had hallucinated references

AI can generate papers — but asking the right question still needs humans

Discussion · AI Agent's Impact
Academic Research

Will AI Agent Replace Researchers?

Ref: Hung-yi Lee · AI Agent (3/3) · NTU 2026

Andrew Hall (Stanford) · “100x Research Assistant”

Cost: PhD student 16h / $1,040 vs Claude Code 1h / $10 (104× cheaper)

But: humans haven't been replaced

  • AI ideas look novel — but don't outperform humans after execution
  • Accepted papers: humans intervene more on idea and experiment design
  • AI excels at execution — but deciding what to do still needs human judgment
Discussion · Engineering
Engineering

Senior Engineers Are 10× More Productive — Why Still Hire Juniors?

People vs. Agents

  • Juniors: onboard · mentor · review
  • Agent: follows orders · always on · no salary
  • 1 person + agent = team of 5
  • Junior's repetitive work? Agent handles it all
  • Headcount logic is being repriced

Do juniors still have a shot?

  • Yes — but find your irreplaceable part earlier
  • Systems judgment · user empathy · defining the real problem
  • Direct agents = new era's 10× engineer
  • Solo full-stack: 5 years → 1 year

This isn't pessimism — it's a new starting line

Discussion · Engineering
Engineering

Anthropic Labor Market Study (2026)

  • Programmers: AI covers 75% of job tasks
  • AI-exposed jobs: avg. earnings 47% higher than unexposed
  • Workers aged 22–25 in high-AI-exposure occupations: 13% drop in employment (Stanford, 2025)
  • Global workforce: 3.3 billion workers worldwide
  • Affected workers: 1.1 billion — roughly 1 in 3 globally
Claude Career Report (Chinese) · Claude Career Report (English)
Discussion · Engineering
Agent Harness

Harness > Model

Agent Harness = scaffolding around the model: tool definitions, prompts, workflows, context management. Same model. Different harness. Completely different results.

Anthropic · CORE-Bench

42% → 78%

Same model (Claude Opus 4.5). Switched from a generic scaffold to Claude Code harness. +36 points.

LangChain · Terminal Bench 2.0

52.8% → 66.5%

Model unchanged. Added self-verification loops, context engineering, loop detection. Top 30 → Top 5.

OpenAI · Codex Harness

0 lines hand-written code

Built a 1M-line production app in 5 months with 3 engineers — 1/10th normal dev time. Every line written by Codex agents.

💡

Progressive Disclosure: show the model only what it needs, when it needs it — this single mechanism explains most harness improvements.

Discussion · Engineering
Agent Harness

Three Principles of Harness Engineering

Precise Tool Descriptions

Tell the agent exactly when NOT to use a tool

One engineer connected 12 tools. The agent kept calling the same endpoint twice with different params. He cut to 4 tools with precise descriptions — 40% fewer useless calls.

Fewer Tools, Better Results

One rule in a config file beats adding a new tool

Vercel's agent had a huge toolbox. Agent got confused, made redundant calls. They stripped it down to bare bash access. Success rate: 100%. Speed: 3.5× faster.

Prune Regularly

The harness you built for last year's model may be hurting today's

Manus rewrote their harness 5 times in 6 months, stripping complexity each time. One engineer deleted an entire memory system on a Thursday — by Friday, response latency dropped 2.3s.

“The stronger the model, the more we should get out of its way.” — Peak Ji, Chief Scientist at Manus

🎁 Bonus
gnvitop
Global GPU Monitor
A side project built during research — one command to monitor all your lab GPUs
🎁 Bonus
Tool Recommendation

gnvitop — Global GPU Monitor

Real-time GPU monitoring across all lab servers —
see which cards are idle, check if you're hogging too many.

pip install gnvitop   # install
gnvitop               # run
  • Auto-reads ~/.ssh/config, SSHs into all servers
  • Live web dashboard showing all GPU status
  • Supports ProxyJump bastion hosts
  • Current user's processes highlighted in blue
gnvitop dashboard

Thanks

Q & A

github.com/Linwei94/talks