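feat: skills i18n overhaul (schemaVersion 1.1, no backward compatibility) (#1)

* feat: skills i18n overhaul (schemaVersion 1.1, no backward compatibility)

Migrate all 21 skills, 1 agent, and the manifest/categories files to the schemaVersion 1.1
i18n structure, together with a CI AI translation pipeline (GitHub Models) and a local toolchain.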

## Key changes

### Data structure (breaking, schemaVersion 1.0 → 1.1)
- SKILL.md: top-level name becomes an ASCII slug (== directory name, per the agentskills.io spec);
  the Chinese display name/short_desc/description all move into metadata.i18n.<locale>
- agents/<id>/agent.json: shortDesc/fullDesc/tags/persona.{role,traits} move into
  i18n.<locale>; changelog[].changes becomes a { <locale>: string[] } object
- categories.json: each category's label/description moves into i18n.<locale>, leaving only
  color/icon at the top level
- manifest.json: adds supportedLocales / defaultLocale; top-level description moves into
  i18n.<locale>

### Body file structure
- Root SKILL.md = frontmatter + default_locale (en-US) body
- SKILL.<locale>.md = the markdown body for each locale (the first line, <!-- locale: xx -->, serves as a self-check)
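
The first-line marker rule lends itself to a small check. The following is a minimal sketch, not the repository's validate-i18n.py; the helper names `marker_locale` and `body_matches_filename` and the regular expressions are illustrative assumptions.

```python
import re
from typing import Optional

# Assumed marker shape, based on the "<!-- locale: xx -->" convention described above.
LOCALE_MARKER = re.compile(r"^<!--\s*locale:\s*([A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*)\s*-->$")


def marker_locale(body_text: str) -> Optional[str]:
    """Return the locale declared on the first line, or None if there is no marker."""
    first_line = body_text.split("\n", 1)[0].strip()
    match = LOCALE_MARKER.match(first_line)
    return match.group(1) if match else None


def body_matches_filename(filename: str, body_text: str) -> bool:
    """Check that SKILL.<locale>.md declares the same locale on its first line."""
    name_match = re.fullmatch(r"SKILL\.(.+)\.md", filename)
    if name_match is None:
        # Root SKILL.md holds the default-locale body; this sketch does not check it.
        return False
    return marker_locale(body_text) == name_match.group(1)
```

A mismatch (for example, a zh-CN file whose marker says en-US) returns `False`, which a validator could report as an error.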

### Toolchain (scripts/i18n/)
- glossary.json: zh→en glossary + do_not_translate allowlist
- schema/skill-frontmatter.schema.json: JSON Schema for the i18n frontmatter
- validate-i18n.py: 8 validation rules (name compliance / locale completeness / hash consistency, etc.)
- translate.py: dual backend (GitHub Models / Anthropic), sha256-based incremental translation
- migrate.py: one-off migration script (old format → i18n structure)
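
For illustration, sha256-based incremental translation could be sketched as below. This assumes the recorded hash is the first 16 hex characters of the SHA-256 of the source-locale body, prefixed with `sha256:` (the length seen in the source_hash values in this change). What translate.py actually hashes is not stated here, and `source_hash` / `is_stale` are hypothetical names.

```python
import hashlib


def source_hash(source_body: str) -> str:
    """Hash the source-locale body; assumed format is 'sha256:' + 16 hex chars."""
    digest = hashlib.sha256(source_body.encode("utf-8")).hexdigest()
    return "sha256:" + digest[:16]


def is_stale(recorded_hash: str, source_body: str) -> bool:
    """A translation is stale when its recorded hash no longer matches the source."""
    return recorded_hash != source_hash(source_body)
```

Under this assumption, a `--check` style run only needs to recompute the hash per skill and compare it with each locale's recorded value; unchanged sources are skipped.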

### CI (.github/workflows/)
- i18n-validate.yml: runs validate + translate --check on PRs
- i18n-translate.yml: on PRs, translates missing locales with GitHub Models (default
  openai/gpt-5-mini) and appends a commit automatically; can switch to Claude via ANTHROPIC_API_KEY

### Docs
- docs/I18N.md: contributor guide for authors (schema reference / submission flow / FAQ)
- README.md: adds a multilingual section

## Verification

- uv run scripts/i18n/validate-i18n.py: OK, 49 files, 0 errors
- uv run scripts/i18n/translate.py --check: 0 stale locales
- Heading counts for all 21 skills match exactly between zh-CN and en-US (largest: 66 = 66)
- skills-ref spec validation: all pass (top-level name as ASCII slug + single description field)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

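* fix(i18n): address 6 issues from PR #1 review feedback

- schema: relax the translated_by pattern to ^(human|ai:[A-Za-z0-9._:/-]+)$ so it accepts
  backend:model forms such as 'ai:github:openai/gpt-5-mini' (the format CI translation emits)
- README + docs/I18N.md: fix the misleading "CI uses the Claude API" description; the default is
  GitHub Models (openai/gpt-5-mini) + GITHUB_TOKEN, with Anthropic as an optional switch
- skills/minimax-tts/SKILL.md & SKILL.zh-CN.md: remove a stray closing ``` that would
  break Markdown rendering further down
- skills/docx/SKILL.md: restore the • Unicode escape example lost during translation,
  aligning it with the zh-CN version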
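
The relaxed translated_by pattern from the first item can be exercised directly. A minimal sketch: only the pattern itself comes from this commit, and the helper name `is_valid_translated_by` is hypothetical.

```python
import re

# Pattern copied from the schema change described above.
TRANSLATED_BY = re.compile(r"^(human|ai:[A-Za-z0-9._:/-]+)$")


def is_valid_translated_by(value: str) -> bool:
    """Return True when the value is 'human' or an 'ai:<backend-or-model>' tag."""
    # fullmatch also rejects a trailing newline, which '$' alone would tolerate.
    return TRANSLATED_BY.fullmatch(value) is not None
```

Both the plain model form (`ai:claude-opus-4-7`) and the backend:model form (`ai:github:openai/gpt-5-mini`) pass, while an empty `ai:` tag or a value containing spaces is rejected.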

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 00:26:33 +08:00
committed by GitHub
parent 1c107a9344
commit 1f7c8b9673
59 changed files with 10533 additions and 2014 deletions


@@ -1,5 +1,5 @@
---
name: 联网访问
name: web-access
description: >-
Use this skill whenever the user needs to access information from the internet
— searching for current information, fetching public web pages, browsing
@@ -29,7 +29,28 @@ tags:
- playwright
metadata:
  author: desirecore
  updated_at: '2026-04-13'
  updated_at: '2026-05-03'
  i18n:
    default_locale: en-US
    source_locale: zh-CN
    locales:
      - zh-CN
      - en-US
    zh-CN:
      name: 联网访问
      short_desc: 联网搜索、网页抓取、登录态浏览器访问(CDP)、研究调研工作流
      description: 三层联网访问工具包——搜索公开页面、Jina 优化抓取、CDP 登录态浏览器访问。
      body: ./SKILL.zh-CN.md
      source_hash: sha256:0ba170b3126a0823
      translated_by: human
    en-US:
      name: Web Access
      short_desc: Web search, page fetching, logged-in browser access via CDP, research workflows
      description: A three-layer web-access toolkit — search public pages, fetch heavy pages via Jina Reader, and reach logged-in sites via Chrome CDP.
      body: ./SKILL.md
      source_hash: sha256:0ba170b3126a0823
      translated_by: ai:claude-opus-4-7
      translated_at: '2026-05-03'
market:
icon: >-
<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0
@@ -46,7 +67,6 @@ market:
stroke="#34C759" stroke-width="1.5" fill="#34C759"
fill-opacity="0.12"/><path d="M20.5 20.5l2 2" stroke="#34C759"
stroke-width="1.8" stroke-linecap="round"/></svg>
short_desc: 联网搜索、网页抓取、登录态浏览器访问(CDP)、研究调研工作流
category: research
maintainer:
name: DesireCore Official
@@ -54,38 +74,38 @@ market:
channel: latest
---
# web-access 技能
# web-access skill
## L0:一句话摘要
## L0: One-line Summary
三层联网访问工具包——搜索公开页面、Jina 优化抓取、CDP 登录态浏览器访问。
A three-layer web-access toolkit — search public pages, optimize fetches via Jina Reader, and reach login-gated sites via Chrome CDP.
## L1:概述与使用场景
## L1: Overview & Use Cases
### 能力描述
### Capability
web-access 是一个**流程型技能(Procedural Skill)**,提供三层互补的联网访问能力:Layer 1(WebSearch + WebFetch)用于公开页面;Layer 2(Jina Reader)用于 JS 渲染的重页面,默认节省 Token;Layer 3(Chrome CDP)用于需要登录态的站点(小红书/B站/微博/飞书/Twitter)。
web-access is a **procedural skill** that provides three complementary layers of web access: Layer 1 (WebSearch + WebFetch) for public pages; Layer 2 (Jina Reader) for JS-rendered heavy pages, saving tokens by default; Layer 3 (Chrome CDP) for sites requiring a logged-in session (Xiaohongshu / Bilibili / Weibo / Feishu / Twitter).
### 使用场景
### Use Cases
- 用户需要搜索当前信息或研究特定主题
- 用户需要抓取公开网页内容或技术文档
- 用户需要访问登录态站点(小红书、B站、微博、飞书、Twitter 等)
- 用户需要对比产品、聚合新闻或调查 API/库版本
- The user needs to search for current information or research a specific topic
- The user needs to fetch public web content or technical documentation
- The user needs to access logged-in sites (Xiaohongshu, Bilibili, Weibo, Feishu, Twitter, etc.)
- The user needs to compare products, aggregate news, or investigate API/library versions
### 核心价值
### Core Value
- **三层递进**:从轻量搜索到重度 JS 渲染到登录态访问,按需选择
- **Token 优化**:Jina Reader 默认减少 50-80% Token 消耗
- **登录态复用**:通过 CDP 连接用户已登录的 Chrome,无需重复登录
- **Three-layer progression**: from lightweight search to heavy JS rendering to logged-in access — pick on demand
- **Token optimization**: Jina Reader cuts token usage by 50-80% by default
- **Logged-in session reuse**: connect to the user's already-logged-in Chrome via CDP — no re-login required
## L2:详细规范
## L2: Detailed Specification
## Output Rule
When you complete a research task, you **MUST** cite all source URLs in your response. Distinguish between:
- **Quoted facts**: directly from a fetched page → cite the URL
- **Inferences**: your synthesis or analysis → mark as "(分析/推断)"
- **Inferences**: your synthesis or analysis → mark as "(analysis/inference)"
If any fetch fails, explicitly tell the user which URL failed and which fallback you used.
@@ -93,7 +113,7 @@ If any fetch fails, explicitly tell the user which URL failed and which fallback
## Prerequisites: Chrome CDP Setup (for login-gated sites)
**Only required when accessing sites that need the user's login session** (小红书/B站/微博/飞书/Twitter/知乎/公众号).
**Only required when accessing sites that need the user's login session** (Xiaohongshu / Bilibili / Weibo / Feishu / Twitter / Zhihu / WeChat Official Accounts).
### One-time setup
@@ -121,7 +141,7 @@ google-chrome \
```
After launch:
1. Manually log in to the sites you need (小红书、B站、微博、飞书 …)
1. Manually log in to the sites you need (Xiaohongshu, Bilibili, Weibo, Feishu, …)
2. Leave this Chrome window open in the background
3. Verify the debug endpoint: `curl -s http://localhost:9222/json/version` should return JSON
@@ -132,7 +152,7 @@ Before any CDP operation, always run:
curl -s http://localhost:9222/json/version | python3 -c "import sys,json; d=json.load(sys.stdin); print('CDP ready:', d.get('Browser'))"
```
If the command fails, tell the user: "请先启动 Chrome 并开启远程调试端口(见 web-access 技能的 Prerequisites 部分)。"
If the command fails, tell the user: "Please launch Chrome with the remote debugging port enabled (see the Prerequisites section of the web-access skill)."
---
@@ -151,7 +171,7 @@ User intent
│ └─→ Bash: curl -sL "https://r.jina.ai/<original-url>"
│ (Jina Reader = default for JS-rendered content, saves tokens)
├─ "Read this login-gated page" (小红书/B站/微博/飞书/Twitter/知乎/公众号)
├─ "Read this login-gated page" (Xiaohongshu/Bilibili/Weibo/Feishu/Twitter/Zhihu/WeChat)
│ └─→ 1. Verify CDP ready (curl http://localhost:9222/json/version)
│ 2. Bash: python3 script with playwright.connect_over_cdp()
│ 3. Extract content → feed to Jina Reader for clean Markdown
@@ -188,13 +208,13 @@ User intent
| Hacker News, Reddit | L1 WebFetch | Public content |
| Medium, Dev.to | L2 Jina Reader | JS-rendered, member gates |
| Twitter/X | L3 CDP (or L2 Jina with `x.com`) | Login required for full thread |
| 小红书 (xiaohongshu.com) | L3 CDP | 强制登录 |
| B站 (bilibili.com) | L3 CDP | 视频描述/评论需登录 |
| 微博 (weibo.com) | L3 CDP | 长微博需登录 |
| 知乎 (zhihu.com) | L3 CDP | 长文+评论需登录 |
| 飞书文档 (feishu.cn) | L3 CDP | 必须登录 |
| 公众号 (mp.weixin.qq.com) | L2 Jina Reader | 通常公开,Jina 处理更干净 |
| LinkedIn | L3 CDP | 登录墙 |
| Xiaohongshu (xiaohongshu.com) | L3 CDP | Login required |
| Bilibili (bilibili.com) | L3 CDP | Login needed for video desc/comments |
| Weibo (weibo.com) | L3 CDP | Long posts require login |
| Zhihu (zhihu.com) | L3 CDP | Long articles + comments require login |
| Feishu Docs (feishu.cn) | L3 CDP | Login required |
| WeChat Official Accounts (mp.weixin.qq.com) | L2 Jina Reader | Usually public, Jina cleans better |
| LinkedIn | L3 CDP | Login wall |
---
@@ -284,7 +304,7 @@ PY
```
See [references/cdp-browser.md](references/cdp-browser.md) for:
- Per-site selectors (小红书/B站/微博/知乎/飞书)
- Per-site selectors (Xiaohongshu / Bilibili / Weibo / Zhihu / Feishu)
- Scrolling & lazy-load patterns
- Screenshot & form-fill recipes
- Troubleshooting connection issues
@@ -294,12 +314,12 @@ See [references/cdp-browser.md](references/cdp-browser.md) for:
## Common Workflows
Read [references/workflows.md](references/workflows.md) for detailed templates:
- 技术文档查询 (Tech docs lookup)
- 竞品对比研究 (Competitor research)
- 新闻聚合与时间线 (News aggregation)
- API/库版本调查 (Library version investigation)
- Tech docs lookup
- Competitor research
- News aggregation & timelines
- API/library version investigation
Read [references/cdp-browser.md](references/cdp-browser.md) for login-gated site recipes (小红书/B站/微博/知乎/飞书).
Read [references/cdp-browser.md](references/cdp-browser.md) for login-gated site recipes (Xiaohongshu / Bilibili / Weibo / Zhihu / Feishu).
Read [references/jina-reader.md](references/jina-reader.md) for Jina Reader positioning, rate limits, and advanced endpoints.
@@ -321,7 +341,7 @@ Read [references/jina-reader.md](references/jina-reader.md) for Jina Reader posi
## Anti-Patterns (Avoid)
- **Using WebFetch on obviously heavy sites** — Medium, Twitter, 小红书 will waste tokens or fail. Jump straight to L2/L3.
- **Using WebFetch on obviously heavy sites** — Medium, Twitter, Xiaohongshu will waste tokens or fail. Jump straight to L2/L3.
- **Launching headless Chrome instead of CDP attach** — loses user's login state, triggers anti-bot, slow cold start. Always use `connect_over_cdp()` to attach to the user's existing session.
- **Fetching one URL at a time when you need 5** — batch in a single message.
- **Trusting a single source** — cross-check ≥ 2 sources for non-trivial claims.
@@ -336,20 +356,20 @@ Read [references/jina-reader.md](references/jina-reader.md) for Jina Reader posi
## Example Interaction
**User**: "帮我抓一下这条小红书笔记的内容:https://www.xiaohongshu.com/explore/abc123"
**User**: "Grab the contents of this Xiaohongshu note for me: https://www.xiaohongshu.com/explore/abc123"
**Agent workflow**:
```
1. 识别 → 小红书是 L3 登录态站点
2. 检查 CDP:curl -s http://localhost:9222/json/version
├─ 失败 → 提示用户启动 Chrome 调试模式,终止
└─ 成功 → 继续
3. Bash: python3 connect_over_cdp 脚本 → page.goto(url) → page.content()
4. BeautifulSoup 提取 h1 title、.note-content、.comments
5. 返回给用户时:
- 引用原 URL
- 若内容很长,用 Jina 清洗一遍节省 token
6. 告知用户:「已通过你的登录态抓取,原链接:[xhs](url)」
1. Recognize → Xiaohongshu is an L3 logged-in site
2. Check CDP: curl -s http://localhost:9222/json/version
├─ Failure → prompt the user to launch Chrome in debug mode, abort
└─ Success → continue
3. Bash: python3 connect_over_cdp script → page.goto(url) → page.content()
4. BeautifulSoup extract h1 title, .note-content, .comments
5. When returning to the user:
- Cite the original URL
- If content is long, run it through Jina to save tokens
6. Tell the user: "Fetched via your logged-in session, original link: [xhs](url)"
```
---


@@ -0,0 +1,312 @@
<!-- locale: zh-CN -->
# web-access 技能
## L0:一句话摘要
三层联网访问工具包——搜索公开页面、Jina 优化抓取、CDP 登录态浏览器访问。
## L1:概述与使用场景
### 能力描述
web-access 是一个**流程型技能(Procedural Skill)**,提供三层互补的联网访问能力:Layer 1(WebSearch + WebFetch)用于公开页面;Layer 2(Jina Reader)用于 JS 渲染的重页面,默认节省 Token;Layer 3(Chrome CDP)用于需要登录态的站点(小红书/B站/微博/飞书/Twitter)。
### 使用场景
- 用户需要搜索当前信息或研究特定主题
- 用户需要抓取公开网页内容或技术文档
- 用户需要访问登录态站点(小红书、B站、微博、飞书、Twitter 等)
- 用户需要对比产品、聚合新闻或调查 API/库版本
### 核心价值
- **三层递进**:从轻量搜索到重度 JS 渲染到登录态访问,按需选择
- **Token 优化**:Jina Reader 默认减少 50-80% Token 消耗
- **登录态复用**:通过 CDP 连接用户已登录的 Chrome,无需重复登录
## L2:详细规范
## Output Rule
When you complete a research task, you **MUST** cite all source URLs in your response. Distinguish between:
- **Quoted facts**: directly from a fetched page → cite the URL
- **Inferences**: your synthesis or analysis → mark as "(分析/推断)"
If any fetch fails, explicitly tell the user which URL failed and which fallback you used.
---
## Prerequisites: Chrome CDP Setup (for login-gated sites)
**Only required when accessing sites that need the user's login session** (小红书/B站/微博/飞书/Twitter/知乎/公众号).
### One-time setup
Launch a dedicated Chrome instance with remote debugging enabled:
**macOS**:
```bash
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
--remote-debugging-port=9222 \
--user-data-dir="$HOME/.desirecore/chrome-profile"
```
**Linux**:
```bash
google-chrome \
--remote-debugging-port=9222 \
--user-data-dir="$HOME/.desirecore/chrome-profile"
```
**Windows (PowerShell)**:
```powershell
& "C:\Program Files\Google\Chrome\Application\chrome.exe" `
--remote-debugging-port=9222 `
--user-data-dir="$env:USERPROFILE\.desirecore\chrome-profile"
```
After launch:
1. Manually log in to the sites you need (小红书、B站、微博、飞书 …)
2. Leave this Chrome window open in the background
3. Verify the debug endpoint: `curl -s http://localhost:9222/json/version` should return JSON
### Verify CDP is ready
Before any CDP operation, always run:
```bash
curl -s http://localhost:9222/json/version | python3 -c "import sys,json; d=json.load(sys.stdin); print('CDP ready:', d.get('Browser'))"
```
If the command fails, tell the user: "请先启动 Chrome 并开启远程调试端口(见 web-access 技能的 Prerequisites 部分)。"
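
The same readiness check can also be done from Python rather than through the shell pipeline. This is a minimal sketch using only the standard library; the function name `cdp_browser_version` and the 2-second timeout are assumptions, not part of the skill.

```python
import http.client
import json
import urllib.request
from typing import Optional


def cdp_browser_version(endpoint: str = "http://localhost:9222",
                        timeout: float = 2.0) -> Optional[str]:
    """Return the 'Browser' string from /json/version, or None if CDP is unreachable."""
    try:
        with urllib.request.urlopen(endpoint + "/json/version", timeout=timeout) as resp:
            info = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a non-JSON response: treat as "not ready".
        return None
    return info.get("Browser") if isinstance(info, dict) else None
```

A `None` result corresponds to the failure case above, where the user should be asked to launch Chrome with the remote debugging port enabled.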
---
## Tool Selection Decision Tree
```
User intent
├─ "Search for information about X" (no specific URL)
│   └─→ WebSearch → pick top 3-5 results → fetch each (see next branches)
├─ "Read this public page" (static HTML, docs, news)
│   └─→ WebFetch(url) directly
├─ "Read this heavy-JS page" (SPA, React/Vue sites, Medium, etc.)
│   └─→ Bash: curl -sL "https://r.jina.ai/<original-url>"
│       (Jina Reader = default for JS-rendered content, saves tokens)
├─ "Read this login-gated page" (小红书/B站/微博/飞书/Twitter/知乎/公众号)
│   └─→ 1. Verify CDP ready (curl http://localhost:9222/json/version)
│       2. Bash: python3 script with playwright.connect_over_cdp()
│       3. Extract content → feed to Jina Reader for clean Markdown
│          (or use BeautifulSoup directly on the raw HTML)
├─ "API documentation / GitHub / npm package info"
│   └─→ Prefer official API endpoints over scraping HTML:
│       - GitHub: gh api repos/owner/name
│       - npm: curl https://registry.npmjs.org/<pkg>
│       - PyPI: curl https://pypi.org/pypi/<pkg>/json
└─ "Real-time interactive task" (click, fill form, scroll, screenshot)
    └─→ CDP + Playwright (see references/cdp-browser.md)
```
### Three-layer strategy summary
| Layer | Use case | Primary tool | Token cost |
|-------|----------|--------------|------------|
| L1 | Public, static | `WebFetch` | Low |
| L2 | JS-heavy, long articles, token savings | `Bash curl r.jina.ai` | **Lowest** (Markdown pre-cleaned) |
| L3 | Login-gated, interactive | `Bash + Python Playwright CDP` | Medium (raw HTML, then clean via Jina or BS4) |
**Default priority**: L1 for simple public pages → L2 for anything heavy → L3 only when login is required.
---
## Supported Sites Matrix
| Site | Recommended Layer | Notes |
|------|-------------------|-------|
| Wikipedia, MDN, official docs | L1 WebFetch | Static, clean HTML |
| GitHub README, issues, PRs | `gh api` (best) → L1 WebFetch | Prefer API |
| Hacker News, Reddit | L1 WebFetch | Public content |
| Medium, Dev.to | L2 Jina Reader | JS-rendered, member gates |
| Twitter/X | L3 CDP (or L2 Jina with `x.com`) | Login required for full thread |
| 小红书 (xiaohongshu.com) | L3 CDP | 强制登录 |
| B站 (bilibili.com) | L3 CDP | 视频描述/评论需登录 |
| 微博 (weibo.com) | L3 CDP | 长微博需登录 |
| 知乎 (zhihu.com) | L3 CDP | 长文+评论需登录 |
| 飞书文档 (feishu.cn) | L3 CDP | 必须登录 |
| 公众号 (mp.weixin.qq.com) | L2 Jina Reader | 通常公开,Jina 处理更干净 |
| LinkedIn | L3 CDP | 登录墙 |
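
The matrix can be turned into a simple lookup. This is a sketch under assumptions: `pick_layer` and `LAYER_BY_DOMAIN` are hypothetical names, the domains for Medium, Dev.to, Twitter/X, and LinkedIn are filled in from their public hostnames rather than from the table, and unlisted hosts fall back to L1 per the default priority.

```python
from urllib.parse import urlparse

# Domain suffix -> recommended layer, transcribed from the matrix above.
LAYER_BY_DOMAIN = {
    "xiaohongshu.com": "L3",
    "bilibili.com": "L3",
    "weibo.com": "L3",
    "zhihu.com": "L3",
    "feishu.cn": "L3",
    "linkedin.com": "L3",
    "twitter.com": "L3",
    "x.com": "L3",
    "mp.weixin.qq.com": "L2",
    "medium.com": "L2",
    "dev.to": "L2",
}


def pick_layer(url: str, default: str = "L1") -> str:
    """Return the recommended layer for a URL, matching on host or parent domain."""
    host = (urlparse(url).hostname or "").lower()
    for domain, layer in LAYER_BY_DOMAIN.items():
        # Match the domain itself or any subdomain, but not lookalikes such as "notx.com".
        if host == domain or host.endswith("." + domain):
            return layer
    return default
```

For example, `pick_layer("https://www.xiaohongshu.com/explore/abc123")` selects L3, while an unlisted documentation site falls through to L1.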
---
## Tool Reference
### Layer 1: WebSearch + WebFetch
**WebSearch** — discover URLs for an unknown topic:
```
WebSearch(query="latest typescript 5.5 features 2026", max_results=5)
```
Tips:
- Include the year for time-sensitive topics
- Use `allowed_domains` / `blocked_domains` to constrain
**WebFetch** — extract clean Markdown from a known URL:
```
WebFetch(url="https://example.com/article")
```
Tips:
- Results cached for 15 min
- Returns cleaned Markdown with title + URL + body
- If body < 200 chars or looks garbled → escalate to Layer 2 (Jina) or Layer 3 (CDP)
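
The escalation tip can be expressed as a small predicate. A sketch only: the 200-character threshold comes from the tip, while treating a high share of U+FFFD replacement characters as "garbled", the 5% cutoff, and the name `needs_escalation` are assumptions.

```python
def needs_escalation(body: str, min_chars: int = 200,
                     max_garbled_ratio: float = 0.05) -> bool:
    """Decide whether a fetched body should be retried via the next layer."""
    text = body.strip()
    if len(text) < min_chars:
        return True  # too short to be a real article body
    # Assumed heuristic for "garbled": share of Unicode replacement characters.
    garbled = text.count("\ufffd")
    return garbled / len(text) > max_garbled_ratio
```

A caller would try Layer 1 first and move on to Jina Reader or CDP only when this returns `True`.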
### Layer 2: Jina Reader (default for heavy pages)
Jina Reader (`r.jina.ai`) is a free public proxy that renders pages server-side and returns clean Markdown. Use it as the **default** for any page where WebFetch produces garbled or truncated output, and as the **preferred** extractor for JS-heavy SPAs.
```bash
curl -sL "https://r.jina.ai/https://example.com/article"
```
Why Jina is the default token-saver:
- Strips nav/footer/ads automatically
- Handles JS-rendered SPAs
- Returns 50-80% fewer tokens than raw HTML
- No API key needed for basic use (~20 req/min)
See [references/jina-reader.md](references/jina-reader.md) for advanced endpoints and rate limits.
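
For scripts that already run in Python, the curl call can be mirrored with the standard library. A minimal sketch: the URL-prefix form is the one shown in the curl example, while `jina_url`, `fetch_markdown`, the User-Agent string, and the 30-second timeout are illustrative assumptions.

```python
import urllib.request

JINA_PREFIX = "https://r.jina.ai/"


def jina_url(target_url: str) -> str:
    """Prefix the original URL with the Jina Reader proxy, as in the curl example."""
    return JINA_PREFIX + target_url


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Fetch a page through Jina Reader and return the response body as text."""
    request = urllib.request.Request(
        jina_url(target_url),
        headers={"User-Agent": "web-access-sketch/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

`fetch_markdown` raises on network errors or HTTP failures, so a caller can catch the exception and fall back to another layer.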
### Layer 3: CDP Browser (login-gated access)
Use Python Playwright's `connect_over_cdp()` to attach to the user's running Chrome (which already has login cookies). **No re-login needed.**
**Minimal template**:
```bash
python3 << 'PY'
from playwright.sync_api import sync_playwright
TARGET_URL = "https://www.xiaohongshu.com/explore/..."
with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    context = browser.contexts[0]  # reuse user's default context (has cookies)
    page = context.new_page()
    page.goto(TARGET_URL, wait_until="domcontentloaded")
    page.wait_for_timeout(2000)  # let lazy content load
    html = page.content()
    page.close()
# Print first 500 chars to verify
print(html[:500])
PY
```
**Extract text via BeautifulSoup** (no Jina round-trip):
```bash
python3 << 'PY'
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto("https://www.bilibili.com/video/BV...", wait_until="networkidle")
    html = page.content()
    page.close()
soup = BeautifulSoup(html, "html.parser")
title = soup.select_one("h1.video-title")
desc = soup.select_one(".video-desc")
print("Title:", title.get_text(strip=True) if title else "N/A")
print("Desc:", desc.get_text(strip=True) if desc else "N/A")
PY
```
See [references/cdp-browser.md](references/cdp-browser.md) for:
- Per-site selectors (小红书/B站/微博/知乎/飞书)
- Scrolling & lazy-load patterns
- Screenshot & form-fill recipes
- Troubleshooting connection issues
---
## Common Workflows
Read [references/workflows.md](references/workflows.md) for detailed templates:
- 技术文档查询 (Tech docs lookup)
- 竞品对比研究 (Competitor research)
- 新闻聚合与时间线 (News aggregation)
- API/库版本调查 (Library version investigation)
Read [references/cdp-browser.md](references/cdp-browser.md) for login-gated site recipes (小红书/B站/微博/知乎/飞书).
Read [references/jina-reader.md](references/jina-reader.md) for Jina Reader positioning, rate limits, and advanced endpoints.
---
## Quick Workflow: Multi-Source Research
```
1. WebSearch(query) → 5 candidate URLs
2. Skim titles + snippets → pick 3 most relevant
3. Classify each URL by layer (L1 / L2 / L3)
4. Fetch all in parallel (single message, multiple tool calls)
5. If any fetch returns < 200 chars or garbled → retry via next layer
6. Synthesize: contradictions? consensus? outliers?
7. Report with inline [source](url) citations + a Sources list at the end
```
---
## Anti-Patterns (Avoid)
- **Using WebFetch on obviously heavy sites** — Medium, Twitter, 小红书 will waste tokens or fail. Jump straight to L2/L3.
- **Launching headless Chrome instead of CDP attach** — loses user's login state, triggers anti-bot, slow cold start. Always use `connect_over_cdp()` to attach to the user's existing session.
- **Fetching one URL at a time when you need 5** — batch in a single message.
- **Trusting a single source** — cross-check ≥ 2 sources for non-trivial claims.
- **Fetching the search result page itself** — WebSearch already returns snippets; fetch the actual articles.
- **Ignoring the cache** — WebFetch caches 15 min, reuse freely.
- **Scraping when an API exists** — GitHub, npm, PyPI, Wikipedia all have JSON APIs.
- **Forgetting the year in time-sensitive queries** — "best AI models" returns 2023 results; "best AI models 2026" returns current.
- **Hardcoding login credentials in scripts** — always rely on the user's pre-logged CDP session.
- **Citing only after the fact** — collect URLs as you fetch, not from memory afterwards.
---
## Example Interaction
**User**: "帮我抓一下这条小红书笔记的内容:https://www.xiaohongshu.com/explore/abc123"
**Agent workflow**:
```
1. 识别 → 小红书是 L3 登录态站点
2. 检查 CDP:curl -s http://localhost:9222/json/version
   ├─ 失败 → 提示用户启动 Chrome 调试模式,终止
   └─ 成功 → 继续
3. Bash: python3 connect_over_cdp 脚本 → page.goto(url) → page.content()
4. BeautifulSoup 提取 h1 title、.note-content、.comments
5. 返回给用户时:
   - 引用原 URL
   - 若内容很长,用 Jina 清洗一遍节省 token
6. 告知用户:「已通过你的登录态抓取,原链接:[xhs](url)」
```
---
## Installation Note
CDP features require Python + Playwright installed:
```bash
pip3 install playwright beautifulsoup4
python3 -m playwright install chromium # only needed if user hasn't installed Chrome
```
If `playwright` is not installed when the user requests a login-gated site, run the install commands in Bash and explain you're setting up the browser automation dependency.
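
Before running the install commands, it may help to detect which dependencies are actually missing. A minimal sketch using `importlib`; the helper name `missing_packages` and the module-to-package mapping are assumptions for illustration.

```python
import importlib.util
from typing import Dict, List

# Import name -> pip package name for the CDP workflow.
CDP_REQUIREMENTS = {"playwright": "playwright", "bs4": "beautifulsoup4"}


def missing_packages(required: Dict[str, str]) -> List[str]:
    """Return the pip names of modules that cannot be imported in this environment."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]
```

Calling `missing_packages(CDP_REQUIREMENTS)` yields an empty list when everything is installed; otherwise the returned names can be passed to `pip3 install`.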