mirror of
https://git.openapi.site/https://github.com/desirecore/market.git
synced 2026-04-21 16:10:56 +08:00
feat: add two built-in skills, web-access and frontend-design

Per the docs recommendation, this fills in items c) and e) of the five recommended built-in skills:

web-access v1.1.0:
- Three-layer architecture: L1 WebSearch/WebFetch + L2 Jina Reader + L3 CDP Browser
- Added Chrome CDP prerequisites (macOS/Linux/Windows launch commands)
- Supports logged-in access to 小红书/B站/微博/知乎/飞书/Twitter/公众号
- Jina Reader repositioned as the default token-optimization layer (not just a fallback)
- Added references/cdp-browser.md (detailed Python Playwright manual)
- Expanded trigger words: 小红书、B站、微博、飞书、Twitter、推特、X、知乎、公众号

frontend-design v1.0.0:
- Adapted from the official Claude Code frontend-design skill
- Keeps the original bold aesthetic design philosophy
- New Project Context Override section: when working inside the DesireCore main repo, automatically follow the 3+2 color system (Green/Blue/Purple + Orange/Red)
- Added an Output Rule requiring that the user be told the output file path

builtin-skills.json: 12 → 14 skills
This commit is contained in:
skills/web-access/SKILL.md (new file, 334 lines)
---
name: 联网访问
description: >-
  Use this skill whenever the user needs to access information from the internet
  — searching for current information, fetching public web pages, browsing
  login-gated sites (微博/小红书/B站/飞书/Twitter), comparing products,
  researching topics, gathering documentation, or summarizing news.
  This skill orchestrates three complementary layers: (1) WebSearch + WebFetch
  for public pages, (2) Jina Reader as the default token-optimization layer for
  heavy/JS-rendered pages, and (3) Chrome DevTools Protocol (CDP) via Python
  Playwright for login-gated sites that require the user's existing browser
  session. Always cite source URLs. Use when 用户提到 联网搜索、上网查、
  查资料、抓取网页、研究、调研、最新资讯、文档查询、对比、竞品、技术文档、
  新闻、网址、URL、找一下、搜一下、查一下、小红书、B站、微博、飞书、Twitter、
  推特、X、知乎、公众号、已登录、登录状态。
license: Complete terms in LICENSE.txt
version: 1.1.0
type: procedural
risk_level: low
status: enabled
disable-model-invocation: false
tags:
  - web
  - search
  - fetch
  - research
  - browsing
  - cdp
  - playwright
metadata:
  author: desirecore
  updated_at: '2026-04-07'
market:
  short_desc: 联网搜索、网页抓取、登录态浏览器访问(CDP)、研究调研工作流
  category: research
  maintainer:
    name: DesireCore Official
    verified: true
  channel: latest
---

# Web Access Skill

Three-layer web access toolkit:

1. **Layer 1 — Search & Fetch**: `WebSearch` + `WebFetch` for public pages
2. **Layer 2 — Jina Reader**: default token-optimized extraction for heavy/JS-rendered pages
3. **Layer 3 — CDP Browser**: Chrome DevTools Protocol for login-gated sites (小红书/B站/微博/飞书/Twitter)

---

## Output Rule

When you complete a research task, you **MUST** cite all source URLs in your response. Distinguish between:

- **Quoted facts**: directly from a fetched page → cite the URL
- **Inferences**: your synthesis or analysis → mark as "(分析/推断)"

If any fetch fails, explicitly tell the user which URL failed and which fallback you used.

---

## Prerequisites: Chrome CDP Setup (for login-gated sites)

**Only required when accessing sites that need the user's login session** (小红书/B站/微博/飞书/Twitter/知乎/公众号).

### One-time setup

Launch a dedicated Chrome instance with remote debugging enabled:

**macOS**:
```bash
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222 \
  --user-data-dir="$HOME/.desirecore/chrome-profile"
```

**Linux**:
```bash
google-chrome \
  --remote-debugging-port=9222 \
  --user-data-dir="$HOME/.desirecore/chrome-profile"
```

**Windows (PowerShell)**:
```powershell
& "C:\Program Files\Google\Chrome\Application\chrome.exe" `
  --remote-debugging-port=9222 `
  --user-data-dir="$env:USERPROFILE\.desirecore\chrome-profile"
```

After launch:
1. Manually log in to the sites you need (小红书、B站、微博、飞书 …)
2. Leave this Chrome window open in the background
3. Verify the debug endpoint: `curl -s http://localhost:9222/json/version` should return JSON

### Verify CDP is ready

Before any CDP operation, always run:
```bash
curl -s http://localhost:9222/json/version | python3 -c "import sys,json; d=json.load(sys.stdin); print('CDP ready:', d.get('Browser'))"
```

If the command fails, tell the user: "请先启动 Chrome 并开启远程调试端口(见 web-access 技能的 Prerequisites 部分)。"

---
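
The same readiness check can also be done from Python before launching a CDP script. This is a minimal sketch, not part of the skill's required tooling; `parse_cdp_version` and `cdp_ready` are hypothetical helper names:

```python
import json
import urllib.request

def parse_cdp_version(payload: str):
    """Extract the Browser field from /json/version output, or None if absent."""
    try:
        return json.loads(payload).get("Browser")
    except (ValueError, AttributeError):
        return None

def cdp_ready(endpoint: str = "http://localhost:9222") -> bool:
    """True if Chrome is listening with remote debugging enabled."""
    try:
        with urllib.request.urlopen(f"{endpoint}/json/version", timeout=3) as r:
            return parse_cdp_version(r.read().decode()) is not None
    except OSError:
        return False
```

If this returns `False`, surface the same user-facing message as the `curl` check above.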

## Tool Selection Decision Tree

```
User intent
│
├─ "Search for information about X" (no specific URL)
│   └─→ WebSearch → pick top 3-5 results → fetch each (see next branches)
│
├─ "Read this public page" (static HTML, docs, news)
│   └─→ WebFetch(url) directly
│
├─ "Read this heavy-JS page" (SPA, React/Vue sites, Medium, etc.)
│   └─→ Bash: curl -sL "https://r.jina.ai/<original-url>"
│       (Jina Reader = default for JS-rendered content, saves tokens)
│
├─ "Read this login-gated page" (小红书/B站/微博/飞书/Twitter/知乎/公众号)
│   └─→ 1. Verify CDP ready (curl http://localhost:9222/json/version)
│       2. Bash: python3 script with playwright.connect_over_cdp()
│       3. Extract content → feed to Jina Reader for clean Markdown
│          (or use BeautifulSoup directly on the raw HTML)
│
├─ "API documentation / GitHub / npm package info"
│   └─→ Prefer official API endpoints over scraping HTML:
│       - GitHub: gh api repos/owner/name
│       - npm: curl https://registry.npmjs.org/<pkg>
│       - PyPI: curl https://pypi.org/pypi/<pkg>/json
│
└─ "Real-time interactive task" (click, fill form, scroll, screenshot)
    └─→ CDP + Playwright (see references/cdp-browser.md)
```

### Three-layer strategy summary

| Layer | Use case | Primary tool | Token cost |
|-------|----------|--------------|------------|
| L1 | Public, static | `WebFetch` | Low |
| L2 | JS-heavy, long articles, token savings | `Bash curl r.jina.ai` | **Lowest** (Markdown pre-cleaned) |
| L3 | Login-gated, interactive | `Bash + Python Playwright CDP` | Medium (raw HTML, then clean via Jina or BS4) |

**Default priority**: L1 for simple public pages → L2 for anything heavy → L3 only when login is required.

---
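
The priority rule can be sketched as a small classifier. The helper name and domain lists below are illustrative only (extend them per the Supported Sites Matrix), not part of the skill itself:

```python
from urllib.parse import urlparse

# Illustrative domain lists; extend per the Supported Sites Matrix.
LOGIN_GATED = {"xiaohongshu.com", "bilibili.com", "weibo.com",
               "zhihu.com", "feishu.cn", "x.com", "linkedin.com"}
JS_HEAVY = {"medium.com", "dev.to", "mp.weixin.qq.com"}

def classify_layer(url: str) -> str:
    """Map a URL to L1/L2/L3 per the default priority rule."""
    host = urlparse(url).netloc.lower()
    root = ".".join(host.split(".")[-2:])  # crude eTLD+1; fine for these domains
    if root in LOGIN_GATED:
        return "L3"
    if root in JS_HEAVY:
        return "L2"
    return "L1"
```

Anything unrecognized falls through to L1, matching the "start cheap, escalate on failure" default.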

## Supported Sites Matrix

| Site | Recommended Layer | Notes |
|------|-------------------|-------|
| Wikipedia, MDN, official docs | L1 WebFetch | Static, clean HTML |
| GitHub README, issues, PRs | `gh api` (best) → L1 WebFetch | Prefer API |
| Hacker News, Reddit | L1 WebFetch | Public content |
| Medium, Dev.to | L2 Jina Reader | JS-rendered, member gates |
| Twitter/X | L3 CDP (or L2 Jina with `x.com`) | Login required for full thread |
| 小红书 (xiaohongshu.com) | L3 CDP | Login enforced |
| B站 (bilibili.com) | L3 CDP | Video description/comments require login |
| 微博 (weibo.com) | L3 CDP | Long posts require login |
| 知乎 (zhihu.com) | L3 CDP | Long answers + comments require login |
| 飞书文档 (feishu.cn) | L3 CDP | Login required |
| 公众号 (mp.weixin.qq.com) | L2 Jina Reader | Usually public; Jina extracts it more cleanly |
| LinkedIn | L3 CDP | Login wall |

---

## Tool Reference

### Layer 1: WebSearch + WebFetch

**WebSearch** — discover URLs for an unknown topic:
```
WebSearch(query="latest typescript 5.5 features 2026", max_results=5)
```

Tips:
- Include the year for time-sensitive topics
- Use `allowed_domains` / `blocked_domains` to constrain results

**WebFetch** — extract clean Markdown from a known URL:
```
WebFetch(url="https://example.com/article")
```

Tips:
- Results cached for 15 min
- Returns cleaned Markdown with title + URL + body
- If body < 200 chars or looks garbled → escalate to Layer 2 (Jina) or Layer 3 (CDP)

### Layer 2: Jina Reader (default for heavy pages)

Jina Reader (`r.jina.ai`) is a free public proxy that renders pages server-side and returns clean Markdown. Use it as the **default** for any page where WebFetch produces garbled or truncated output, and as the **preferred** extractor for JS-heavy SPAs.

```bash
curl -sL "https://r.jina.ai/https://example.com/article"
```

Why Jina is the default token-saver:
- Strips nav/footer/ads automatically
- Handles JS-rendered SPAs
- Returns 50-80% fewer tokens than raw HTML
- No API key needed for basic use (~20 req/min)

See [references/jina-reader.md](references/jina-reader.md) for advanced endpoints and rate limits.
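
For scripted use, building the proxy request is just string concatenation. This sketch (hypothetical helper name) also picks up `JINA_API_KEY` from the environment when present:

```python
import os

def jina_request(url: str):
    """Return (proxy_url, headers) for a Jina Reader fetch.

    Uses JINA_API_KEY from the environment when present (higher rate limits);
    anonymous access works at roughly 20 req/min.
    """
    headers = {}
    key = os.environ.get("JINA_API_KEY")
    if key:
        headers["Authorization"] = f"Bearer {key}"
    return "https://r.jina.ai/" + url, headers
```

Note the original URL is appended verbatim, scheme and all, exactly as in the `curl` form above.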

### Layer 3: CDP Browser (login-gated access)

Use Python Playwright's `connect_over_cdp()` to attach to the user's running Chrome (which already has login cookies). **No re-login needed.**

**Minimal template**:
```bash
python3 << 'PY'
from playwright.sync_api import sync_playwright

TARGET_URL = "https://www.xiaohongshu.com/explore/..."

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    context = browser.contexts[0]  # reuse user's default context (has cookies)
    page = context.new_page()
    page.goto(TARGET_URL, wait_until="domcontentloaded")
    page.wait_for_timeout(2000)  # let lazy content load
    html = page.content()
    page.close()

# Print first 500 chars to verify
print(html[:500])
PY
```

**Extract text via BeautifulSoup** (no Jina round-trip):
```bash
python3 << 'PY'
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto("https://www.bilibili.com/video/BV...", wait_until="networkidle")
    html = page.content()
    page.close()

soup = BeautifulSoup(html, "html.parser")
title = soup.select_one("h1.video-title")
desc = soup.select_one(".video-desc")
print("Title:", title.get_text(strip=True) if title else "N/A")
print("Desc:", desc.get_text(strip=True) if desc else "N/A")
PY
```

See [references/cdp-browser.md](references/cdp-browser.md) for:
- Per-site selectors (小红书/B站/微博/知乎/飞书)
- Scrolling & lazy-load patterns
- Screenshot & form-fill recipes
- Troubleshooting connection issues

---

## Common Workflows

Read [references/workflows.md](references/workflows.md) for detailed templates:
- 技术文档查询 (Tech docs lookup)
- 竞品对比研究 (Competitor research)
- 新闻聚合与时间线 (News aggregation)
- API/库版本调查 (Library version investigation)

Read [references/cdp-browser.md](references/cdp-browser.md) for login-gated site recipes (小红书/B站/微博/知乎/飞书).

Read [references/jina-reader.md](references/jina-reader.md) for Jina Reader positioning, rate limits, and advanced endpoints.

---

## Quick Workflow: Multi-Source Research

```
1. WebSearch(query) → 5 candidate URLs
2. Skim titles + snippets → pick 3 most relevant
3. Classify each URL by layer (L1 / L2 / L3)
4. Fetch all in parallel (single message, multiple tool calls)
5. If any fetch returns < 200 chars or garbled → retry via next layer
6. Synthesize: contradictions? consensus? outliers?
7. Report with inline [source](url) citations + a Sources list at the end
```

---
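
Step 5 (retry via the next layer) can be sketched as a generic escalation loop. The fetcher callables stand in for the real L1/L2/L3 tools, and the garble heuristic mirrors the 200-char rule above; all names here are illustrative:

```python
def looks_garbled(text: str, min_chars: int = 200) -> bool:
    """Heuristic from the workflow: too short, or mostly non-printable."""
    if len(text) < min_chars:
        return True
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / max(len(text), 1) < 0.9

def fetch_with_escalation(url, fetchers):
    """Try each layer's fetcher in order until one returns usable content.

    `fetchers` is an ordered list of (layer_name, callable) pairs, e.g.
    [("L1", web_fetch), ("L2", jina_fetch), ("L3", cdp_fetch)].
    Returns (layer_name, content), or (None, None) if every layer fails.
    """
    for name, fetch in fetchers:
        try:
            content = fetch(url)
        except Exception:
            continue  # record the failure; report it to the user at the end
        if content and not looks_garbled(content):
            return name, content
    return None, None
```

A `(None, None)` result is exactly the case where the Output Rule requires telling the user which URL failed and what was tried.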

## Anti-Patterns (Avoid)

- ❌ **Using WebFetch on obviously heavy sites** — Medium, Twitter, 小红书 will waste tokens or fail. Jump straight to L2/L3.
- ❌ **Launching headless Chrome instead of CDP attach** — it loses the user's login state, triggers anti-bot checks, and has a slow cold start. Always use `connect_over_cdp()` to attach to the user's existing session.
- ❌ **Fetching one URL at a time when you need 5** — batch them in a single message.
- ❌ **Trusting a single source** — cross-check ≥ 2 sources for non-trivial claims.
- ❌ **Fetching the search result page itself** — WebSearch already returns snippets; fetch the actual articles.
- ❌ **Ignoring the cache** — WebFetch caches for 15 min; reuse freely.
- ❌ **Scraping when an API exists** — GitHub, npm, PyPI, Wikipedia all have JSON APIs.
- ❌ **Forgetting the year in time-sensitive queries** — "best AI models" returns 2023 results; "best AI models 2026" returns current ones.
- ❌ **Hardcoding login credentials in scripts** — always rely on the user's pre-logged-in CDP session.
- ❌ **Citing only after the fact** — collect URLs as you fetch, not from memory afterwards.

---

## Example Interaction

**User**: "帮我抓一下这条小红书笔记的内容:https://www.xiaohongshu.com/explore/abc123"

**Agent workflow**:
```
1. Recognize → 小红书 is an L3 login-gated site
2. Check CDP: curl -s http://localhost:9222/json/version
   ├─ Fails → ask the user to launch Chrome in debug mode, then stop
   └─ Succeeds → continue
3. Bash: python3 connect_over_cdp script → page.goto(url) → page.content()
4. BeautifulSoup extracts the h1 title, .note-content, .comments
5. When replying to the user:
   - Cite the original URL
   - If the content is long, clean it once through Jina to save tokens
6. Tell the user: 「已通过你的登录态抓取,原链接:[xhs](url)」
```

---

## Installation Note

CDP features require Python + Playwright installed:

```bash
pip3 install playwright beautifulsoup4
python3 -m playwright install chromium  # only needed for standalone automation; CDP attaches to the user's existing Chrome
```

If `playwright` is not installed when the user requests a login-gated site, run the install commands in Bash and explain you're setting up the browser automation dependency.
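
A quick preflight check before a CDP task might look like this sketch (note the import names: `bs4`, not `beautifulsoup4`; the helper name is hypothetical):

```python
import importlib.util

def missing_cdp_deps() -> list:
    """Return import names of missing CDP dependencies (empty list = ready)."""
    return [mod for mod in ("playwright", "bs4")
            if importlib.util.find_spec(mod) is None]

if __name__ == "__main__":
    missing = missing_cdp_deps()
    if missing:
        print("Run: pip3 install " + " ".join(missing))
    else:
        print("CDP dependencies ready")
```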

skills/web-access/references/cdp-browser.md (new file, 330 lines)

# CDP Browser Access — Login-Gated Sites Manual

Detailed recipes for accessing sites that require the user's login session, via Chrome DevTools Protocol (CDP) + Python Playwright.

**Precondition**: Chrome is already running with `--remote-debugging-port=9222` and the user has manually logged in to the target sites. See the main SKILL.md `Prerequisites` section for the launch command.

---

## Why CDP attach, not headless

| Approach | Login state | Anti-bot | Speed | Cost |
|----------|-------------|----------|-------|------|
| Headless Playwright (new context) | ❌ Empty cookies | ❌ Flagged as bot | Slow cold start | Re-login pain |
| `playwright.chromium.launch(headless=False)` | ❌ Fresh profile | ⚠ Sometimes flagged | Slow | Same |
| **CDP attach (`connect_over_cdp`)** | ✅ User's real cookies | ✅ Looks human | Instant | Zero friction |

**Rule**: For any login-gated site, always attach to the user's running Chrome.

---

## Core Template

Every CDP script follows this shape:

```python
from playwright.sync_api import sync_playwright

def fetch_with_cdp(url: str, wait_selector: str | None = None) -> str:
    """Attach to user's Chrome via CDP, fetch URL, return HTML."""
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp("http://localhost:9222")
        # browser.contexts[0] is the user's default context (with cookies)
        context = browser.contexts[0]
        page = context.new_page()
        try:
            page.goto(url, wait_until="domcontentloaded", timeout=30000)
            if wait_selector:
                page.wait_for_selector(wait_selector, timeout=10000)
            else:
                page.wait_for_timeout(2000)  # generic settle
            return page.content()
        finally:
            page.close()
            # DO NOT call browser.close() — that would close the user's Chrome!

if __name__ == "__main__":
    html = fetch_with_cdp("https://example.com")
    print(html[:1000])
```

**Critical**: Never call `browser.close()` when using CDP attach — you'd kill the user's Chrome. Only close the page you opened.

---

## Site Recipes

### 小红书 (xiaohongshu.com)

```python
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

NOTE_URL = "https://www.xiaohongshu.com/explore/XXXXXXXX"

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto(NOTE_URL, wait_until="domcontentloaded")
    page.wait_for_selector("#detail-title", timeout=10000)
    page.wait_for_timeout(1500)  # let images/comments load
    html = page.content()
    page.close()

soup = BeautifulSoup(html, "html.parser")

def text(selector, sep=" "):
    el = soup.select_one(selector)
    return el.get_text(sep, strip=True) if el else None

print("Title:", text("#detail-title"))
print("Author:", text(".author-wrapper .username"))
print("Desc:", text("#detail-desc"))
```

**Selectors** (may drift over time — update if they fail):
- Title: `#detail-title`
- Description: `#detail-desc`
- Author: `.author-wrapper .username`
- Images: `.swiper-slide img`
- Comments: `.parent-comment .content`

### B站 (bilibili.com)

```python
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

VIDEO_URL = "https://www.bilibili.com/video/BVxxxxxxxxx"

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto(VIDEO_URL, wait_until="networkidle")
    page.wait_for_timeout(2000)
    html = page.content()
    page.close()

soup = BeautifulSoup(html, "html.parser")

def text(selector, sep=" "):
    el = soup.select_one(selector)
    return el.get_text(sep, strip=True) if el else None

print("Title:", text("h1.video-title"))
print("UP:", text(".up-name"))
print("Desc:", text(".desc-info-text"))
```

**Tip**: For B站 videos, the [public API](https://api.bilibili.com/x/web-interface/view?bvid=XXXX) often returns JSON without needing CDP. Try it first:

```bash
curl -s "https://api.bilibili.com/x/web-interface/view?bvid=BVxxxxxxxxx" | python3 -m json.tool
```

### 微博 (weibo.com)

```python
WEIBO_URL = "https://weibo.com/u/1234567890"  # or /detail/xxx

# Same CDP template
# Selectors:
#   .Feed_body_3R0rO .detail_wbtext_4CRf9 — post text
#   .ALink_default_2ibt1 — user link
#   article[aria-label="微博"] — each feed item
```

**Note**: Weibo uses React + heavy obfuscation, so selectors change frequently. If they fail, fetch the same URL through Jina for clean Markdown instead (Jina re-fetches the URL itself, so this only works for publicly visible content):

```python
html = fetch_with_cdp(WEIBO_URL)
# If the selectors fail on this HTML, fall back to Jina:
import subprocess
result = subprocess.run(
    ["curl", "-sL", f"https://r.jina.ai/{WEIBO_URL}"],
    capture_output=True, text=True
)
print(result.stdout)
```

### 知乎 (zhihu.com)

```python
ANSWER_URL = "https://www.zhihu.com/question/123/answer/456"

# Selectors:
#   h1.QuestionHeader-title — question title
#   .RichContent-inner — answer body
#   .AuthorInfo-name — author
```

Zhihu works with CDP but often also renders enough metadata server-side for Jina to work:

```bash
curl -sL "https://r.jina.ai/https://www.zhihu.com/question/123/answer/456"
```

Try Jina first, and fall back to CDP if the content is truncated.

### 飞书文档 (feishu.cn / larksuite.com)

```python
# Feishu uses heavy virtualization — you must scroll to load all content.

from playwright.sync_api import sync_playwright

DOC_URL = "https://xxx.feishu.cn/docs/xxx"

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto(DOC_URL, wait_until="domcontentloaded")
    page.wait_for_selector(".docs-render-unit", timeout=15000)

    # Scroll to bottom repeatedly to load lazy content
    last_height = 0
    for _ in range(20):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(800)
        h = page.evaluate("document.body.scrollHeight")
        if h == last_height:
            break
        last_height = h

    # Extract text
    text = page.evaluate("() => document.body.innerText")
    page.close()

print(text)
```

### Twitter / X

```python
TWEET_URL = "https://x.com/username/status/1234567890"

# Selectors:
#   article[data-testid="tweet"] — tweet container
#   div[data-testid="tweetText"] — tweet text
#   div[data-testid="User-Name"] — author
#   a[href$="/analytics"] — view count anchor (next sibling has stats)
```

Twitter is aggressive with anti-bot measures. CDP attach usually works, but set generous waits:

```python
page.goto(url, wait_until="networkidle", timeout=45000)
page.wait_for_selector('article[data-testid="tweet"]', timeout=15000)
```

---

## Common Patterns

### Pattern 1: Scroll to load lazy content

```python
def scroll_to_bottom(page, max_steps=30, pause_ms=800):
    last = 0
    for _ in range(max_steps):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)
        h = page.evaluate("document.body.scrollHeight")
        if h == last:
            return
        last = h
```

### Pattern 2: Screenshot a specific element

```python
element = page.locator("article").first
element.screenshot(path="/tmp/article.png")
```

### Pattern 3: Extract structured data via JavaScript

```python
data = page.evaluate("""() => {
    const items = document.querySelectorAll('.list-item');
    return Array.from(items).map(el => ({
        title: el.querySelector('.title')?.innerText,
        url: el.querySelector('a')?.href,
    }));
}""")
print(data)
```

### Pattern 4: Fill a form and click

```python
page.fill("input[name=q]", "search query")
page.click("button[type=submit]")
page.wait_for_load_state("networkidle")
```

### Pattern 5: Clean HTML via Jina after extraction

When selectors are unreliable, let Jina do the cleaning. Note that Jina fetches the URL itself rather than consuming your extracted HTML, so this fallback only works for content that is visible without login:

```python
html = page.content()  # raw HTML in hand, in case you need to parse it yourself
# Fall back to Jina on the original URL:
import subprocess
clean_md = subprocess.run(
    ["curl", "-sL", f"https://r.jina.ai/{url}"],
    capture_output=True, text=True
).stdout
print(clean_md)
```

---

## Troubleshooting

### `connect_over_cdp` fails with `ECONNREFUSED`

Chrome is not running with remote debugging. Tell the user:
> "请先用下面的命令启动 Chrome:
> `/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir="$HOME/.desirecore/chrome-profile"`
> 然后手动登录需要抓取的网站,再让我继续。"

### `browser.contexts[0]` is empty

Chrome was launched but no windows are open. Ask the user to open at least one tab and navigate anywhere.

### Playwright not installed

```bash
pip3 install playwright beautifulsoup4
# No need for `playwright install` — we're attaching to existing Chrome, not downloading a new browser
```

### Site detects automation

Despite CDP attach, some sites (Cloudflare-protected, Instagram) may still detect automation. Options:
1. Use Jina Reader instead (`curl -sL https://r.jina.ai/<url>`) — it often succeeds where Playwright fails
2. Ask the user to manually copy the visible content
3. Use the site's public API if available

### Content is truncated

The page uses virtualization or lazy loading. Apply Pattern 1 (scroll to bottom) before calling `page.content()`.

### `page.wait_for_selector` times out

The selector is stale — the site updated its DOM. Dump `page.content()[:5000]` and inspect it manually, or fall back to Jina Reader.

---

## Security Notes

- **Never log or print cookies** from `context.cookies()`, even during debugging
- **Never extract and store** the user's session tokens to files
- **Never use the CDP session** to perform writes (post, comment, like) unless the user explicitly requested it
- The `~/.desirecore/chrome-profile` directory contains the user's credentials — treat it as sensitive
- If the user asks to "log in automatically", refuse and explain they must log in manually in the Chrome window; the skill only reads already-authenticated sessions

---

## When NOT to use CDP

- **Public static sites** → use L1 `WebFetch`; it's faster
- **Heavy SPAs without login walls** → use L2 Jina Reader; it's cheaper on tokens
- **You need thousands of pages** → CDP is not built for scale; look into proper scrapers

CDP is specifically the "right tool" for: **small number of pages + login required + human-like behavior needed**.

skills/web-access/references/jina-reader.md (new file, 122 lines)

# Jina Reader — Default Token-Optimization Layer

[Jina Reader](https://jina.ai/reader) is a free public service that renders any URL server-side and returns clean Markdown. In this skill's three-layer architecture, **Jina is Layer 2: the default extractor for heavy/JS-rendered pages**, not just a fallback.

---

## Positioning in the three-layer model

```
L1 WebFetch ── simple public static pages (docs, Wikipedia, HN)
      │
      │ WebFetch empty/truncated/garbled
      ▼
L2 Jina Reader ── DEFAULT for JS-heavy SPAs, long articles, Medium, Twitter
      │   Strips nav/ads automatically, saves 50-80% tokens
      │
      │ Login required, or Jina also fails
      ▼
L3 CDP Browser ── user's logged-in Chrome (小红书/B站/微博/飞书/Twitter)
```

**Key insight**: Don't wait for WebFetch to fail before trying Jina. For any URL you expect to be JS-heavy (any major SPA, Medium, Dev.to, long-form articles), go straight to Jina for the token savings.

---

## Basic Usage (no API key)

```bash
curl -sL "https://r.jina.ai/https://example.com/article"
```

The original URL goes after `r.jina.ai/`. The response is plain Markdown — pipe it to a file or read it directly.

---

## When to use each layer

| Scenario | Primary choice | Why |
|----------|---------------|-----|
| Wikipedia, MDN, official docs | L1 WebFetch | Static clean HTML, fastest |
| GitHub README (public) | L1 WebFetch | Simple markup |
| Medium articles | **L2 Jina** | Member walls + heavy JS |
| Dev.to, Hashnode | **L2 Jina** | JS-rendered |
| Substack, Ghost blogs | **L2 Jina** | Partial JS rendering |
| News sites with lazy-load | **L2 Jina** | Scroll-triggered content |
| Twitter/X public threads | **L2 Jina** first, L3 CDP if truncated | Sometimes works |
| 公众号 (mp.weixin.qq.com) | **L2 Jina** | Clean Markdown extraction |
| LinkedIn articles | L3 CDP | Hard login wall |
| 小红书, B站, 微博, 飞书 | L3 CDP | Login enforced |

---

## Token savings example

- Raw HTML of a long Medium article: ~150 KB, ~50,000 tokens
- Same article via Jina Reader: ~20 KB, ~7,000 tokens

**86% reduction**, with cleaner structure and no ads/nav cruft.

---

## Advanced Endpoints (optional)

If you need more than basic content extraction, Jina also offers:

- **Search**: `https://s.jina.ai/<query>` — returns top 5 results as Markdown
- **Embeddings**: `https://api.jina.ai/v1/embeddings` (requires free API key)
- **Reranker**: `https://api.jina.ai/v1/rerank` (requires free API key)

For DesireCore, prefer the built-in `WebSearch` tool over `s.jina.ai` for consistency.

---

## Rate Limits

- **Free tier**: ~20 requests/minute, no authentication needed
- **With free API key**: higher limits, fewer throttles
  ```bash
  curl -sL "https://r.jina.ai/https://example.com" \
    -H "Authorization: Bearer YOUR_KEY"
  ```
- Get a free key at [jina.ai](https://jina.ai) — read it from the `JINA_API_KEY` env var if available

---

## Usage tips

### Cache your own results

Jina itself doesn't cache for you. If you call the same URL repeatedly in a session, save the Markdown to a temp file:

```bash
curl -sL "https://r.jina.ai/$URL" > /tmp/jina-cache.md
```

### Handle very long articles

Jina returns the full article in one response. For articles > 50K chars, pipe the output through `head` or extract specific sections with Python/awk before feeding it back into the model context.
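
A minimal sketch of section-level trimming on Jina's Markdown output (hypothetical helpers; splits on `## ` headings and keeps only the sections whose heading matches a keyword):

```python
def split_sections(markdown: str):
    """Split Markdown into (heading, body) pairs on '## ' headings."""
    sections, heading, buf = [], "", []
    for line in markdown.splitlines():
        if line.startswith("## "):
            if heading or buf:
                sections.append((heading, "\n".join(buf)))
            heading, buf = line[3:].strip(), []
        else:
            buf.append(line)
    sections.append((heading, "\n".join(buf)))
    return sections

def keep_sections(markdown: str, keywords) -> str:
    """Keep only sections whose heading mentions any keyword (case-insensitive)."""
    kws = [k.lower() for k in keywords]
    kept = [f"## {h}\n{b}" for h, b in split_sections(markdown)
            if any(k in h.lower() for k in kws)]
    return "\n".join(kept)
```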

### Combine with CDP

When you use L3 CDP to fetch a login-gated page, you can pipe the resulting HTML through Jina for clean Markdown instead of parsing with BeautifulSoup:

```python
html = fetch_with_cdp(url)  # from references/cdp-browser.md
# Now convert via Jina (note: Jina fetches the URL itself, not your HTML)
# So this only works if the content is already visible without login:
import subprocess
md = subprocess.run(["curl", "-sL", f"https://r.jina.ai/{url}"],
                    capture_output=True, text=True).stdout
```

For truly login-gated content, you must parse the HTML directly (BeautifulSoup), since Jina can't log in on your behalf.

---

## Failure Mode

If Jina Reader returns garbage or an error:
1. **Hard login wall** → escalate to the L3 CDP browser
2. **Geographically restricted** → tell the user; suggest a VPN or manual access
3. **Cloudflare challenge** → try L3 CDP (the user's browser passes challenges naturally)
4. **404 / gone** → confirm the URL is correct

In all cases, tell the user explicitly which URL failed and what you tried.
||||
---

`skills/web-access/references/workflows.md` (new file, 136 lines)
# Common Research Workflows

Reusable templates for multi-step research tasks. Adapt the queries and URLs to the specific topic.

---
## 1. Technical Documentation Lookup

**Goal**: Find the authoritative answer to a "how do I X with library Y" question.

```
Step 1: WebSearch("<library> <feature> documentation site:<official-domain>")
        ↓ if no results, drop the site: filter
Step 2: WebFetch the top 1-2 official doc pages
Step 3: If example code is incomplete, also fetch the GitHub README or examples folder:
        Bash: gh api repos/<owner>/<repo>/contents/examples
Step 4: Synthesize a concise answer with one runnable code block
```

**Tip**: Always check that the doc version matches the user's installed version. Look for version selectors on the page.

---
## 2. Competitor / Product Comparison

**Goal**: Build a structured comparison of 2-N similar products.

```
Step 1: WebSearch("<product-A> vs <product-B> comparison <year>")
Step 2: WebSearch("<product-A> features pricing") ─┐ parallel
Step 3: WebSearch("<product-B> features pricing") ─┘
Step 4: WebFetch official pricing/features pages for each (parallel)
Step 5: WebFetch 1 third-party comparison article (parallel)
Step 6: Build a markdown table with consistent dimensions:
        | Dimension | Product A | Product B |
        |-----------|-----------|-----------|
        | Pricing   | ...       | ...       |
        | Features  | ...       | ...       |
        | License   | ...       | ...       |
Step 7: Add a "Recommendation" paragraph based on the user's stated needs
```

**Tip**: When dimensions differ between sources, prefer the official source over third-party ones.
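Once the facts are collected, the Step 6 table can be generated mechanically. A sketch with hypothetical product data:

```python
def comparison_table(dimensions, products):
    """Render a Markdown comparison table from {name: {dimension: value}} data."""
    names = list(products)
    header = "| Dimension | " + " | ".join(names) + " |"
    sep = "|" + "---|" * (len(names) + 1)
    rows = [
        "| " + dim + " | "
        + " | ".join(products[n].get(dim, "n/a") for n in names) + " |"
        for dim in dimensions
    ]
    return "\n".join([header, sep] + rows)

table = comparison_table(
    ["Pricing", "License"],
    {"Product A": {"Pricing": "$10/mo", "License": "MIT"},
     "Product B": {"Pricing": "Free", "License": "GPL"}},
)
print(table)
```

Missing dimensions render as `n/a`, which makes gaps in your research visible instead of silently dropping rows.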
---
## 3. News Aggregation & Timeline

**Goal**: Build a chronological summary of recent events on a topic.

```
Step 1: WebSearch("<topic> news <year>", max_results=10)
Step 2: Skim snippets, group by date
Step 3: WebFetch the 3-5 most substantive articles (parallel)
Step 4: Build the timeline:
        ## YYYY-MM-DD - Event headline
        - Key fact 1 [source](url)
        - Key fact 2 [source](url)
Step 5: End with a "Current State" paragraph
```

**Tip**: Use `allowed_domains` to constrain results to authoritative news sources if needed.
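The grouping in Steps 2 and 4 can be sketched as follows (the search results shown are made up):

```python
from collections import defaultdict

results = [  # hypothetical WebSearch snippets
    {"date": "2025-03-01", "title": "Launch announced", "url": "https://a.example"},
    {"date": "2025-03-01", "title": "Analyst reaction", "url": "https://b.example"},
    {"date": "2025-03-07", "title": "First benchmarks", "url": "https://c.example"},
]

# Group snippets by date (Step 2)
by_date = defaultdict(list)
for r in results:
    by_date[r["date"]].append(r)

# Emit the dated headings with cited bullets (Step 4)
lines = []
for date in sorted(by_date):
    lines.append(f"## {date}")
    lines += [f"- {r['title']} [source]({r['url']})" for r in by_date[date]]
timeline = "\n".join(lines)
print(timeline)
```

ISO dates sort chronologically as plain strings, which is why `sorted(by_date)` suffices here.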
---
## 4. Library Version Investigation

**Goal**: Find the latest version, breaking changes, and migration notes.

```
Step 1: Get the latest version via a registry API (faster than scraping):
        PyPI:      curl https://pypi.org/pypi/<package>/json | jq .info.version
        npm:       curl https://registry.npmjs.org/<package>/latest | jq .version
        crates.io: curl https://crates.io/api/v1/crates/<crate> | jq .crate.max_version

Step 2: Get the changelog:
        gh api repos/<owner>/<repo>/releases/latest

Step 3: If migration is needed, search:
        WebSearch("<package> migration guide v<old> to v<new>")
        WebFetch the official migration doc

Step 4: Summarize: latest version, breaking changes (bullet list), 1-2 code diffs
```
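The Step 1 PyPI lookup can also be done without `jq`, as in this Python sketch. It is split so the parsing step is testable offline; the `info.version` field is the documented shape of the PyPI JSON API:

```python
import json
from urllib.request import urlopen

PYPI_URL = "https://pypi.org/pypi/{}/json"

def parse_latest(payload: str) -> str:
    """Extract the latest version string from a PyPI JSON API payload."""
    return json.loads(payload)["info"]["version"]

def pypi_latest(package: str) -> str:
    """Fetch the latest release version of a package (network call)."""
    with urlopen(PYPI_URL.format(package)) as resp:
        return parse_latest(resp.read().decode())

# Offline demonstration of the parsing step:
print(parse_latest('{"info": {"version": "2.31.0"}}'))
```

The npm and crates.io endpoints differ only in URL and JSON path, so the same split applies.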
---
## 5. API Endpoint Discovery

**Goal**: Find a specific API endpoint and its parameters.

```
Step 1: WebSearch("<service> API <action> reference")
Step 2: WebFetch the official API reference page
Step 3: If the page includes a "Try it" / "Sandbox" link, mention it
Step 4: Extract:
        - Endpoint URL
        - HTTP method
        - Required headers (auth)
        - Request body schema
        - Response schema
        - Example curl command
Step 5: Format as a self-contained code block the user can copy-paste
```
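The fields extracted in Step 4 can be assembled into the Step 5 copy-pasteable command. A sketch with hypothetical endpoint metadata:

```python
import json

endpoint = {  # hypothetical values extracted in Step 4
    "url": "https://api.example.com/v1/messages",
    "method": "POST",
    "headers": {"Authorization": "Bearer $API_KEY",
                "Content-Type": "application/json"},
    "body": {"text": "hello"},
}

def to_curl(ep: dict) -> str:
    """Format extracted endpoint metadata as a copy-pasteable curl command."""
    parts = [f"curl -X {ep['method']} '{ep['url']}'"]
    parts += [f"  -H '{k}: {v}'" for k, v in ep["headers"].items()]
    if ep.get("body") is not None:
        parts.append(f"  -d '{json.dumps(ep['body'])}'")
    # Continuation backslashes keep the command one shell invocation
    return " \\\n".join(parts)

print(to_curl(endpoint))
```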
---
## 6. Quick Fact Check

**Goal**: Verify a single specific claim.

```
Step 1: WebSearch("<exact claim phrase>")
Step 2: If 2+ authoritative sources agree → confirmed
Step 3: If sources disagree → report both sides + which is more authoritative
Step 4: If no sources found → say "could not verify" — do NOT guess
```

**Tip**: For numeric facts, find the primary source (official report, paper) rather than secondary citations.
---
## Parallelization Cheat Sheet

When you need multiple independent fetches, **always batch them in a single message with multiple tool calls** rather than issuing them sequentially. Example:

```
✅ Single message with:
   - WebFetch(url1)
   - WebFetch(url2)
   - WebFetch(url3)

❌ Three separate messages, each with one WebFetch
```

This applies equally to WebSearch with different queries, and to mixed Search+Fetch when you have URLs from previous searches.
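The same batching principle, illustrated in plain Python with a thread pool (the `fetch` stub stands in for a real tool call):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # Stand-in for a real fetch; in the agent setting this is a WebFetch call.
    return f"<content of {url}>"

urls = ["https://a.example", "https://b.example", "https://c.example"]

# Independent fetches dispatched together, mirroring "one message, many tool calls"
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    pages = list(pool.map(fetch, urls))

print(len(pages))
```

`pool.map` preserves input order, so `pages[i]` always corresponds to `urls[i]` regardless of completion order.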