mirror of
https://git.openapi.site/https://github.com/desirecore/market.git
synced 2026-04-21 17:30:44 +08:00
feat: 新增 web-access 和 frontend-design 两个内置技能
根据 docs 推荐补齐 5 个内置技能中的 c) 和 e): web-access v1.1.0: - 三层架构:L1 WebSearch/WebFetch + L2 Jina Reader + L3 CDP Browser - 添加 Chrome CDP 前置条件(macOS/Linux/Windows 启动命令) - 支持登录态访问 小红书/B站/微博/知乎/飞书/Twitter/公众号 - Jina Reader 重新定位为默认 token 优化层(非兜底) - 新增 references/cdp-browser.md(Python Playwright 详细操作手册) - 触发词扩充:小红书、B站、微博、飞书、Twitter、推特、X、知乎、公众号 frontend-design v1.0.0: - 从 Claude Code 官方 frontend-design 技能适配 - 保留原版 bold aesthetic 设计理念 - 新增 Project Context Override 章节:在 DesireCore 主仓库内工作时 自动遵循 3+2 色彩体系(Green/Blue/Purple + Orange/Red) - 添加 Output Rule 要求告知用户文件路径 builtin-skills.json: 12 → 14 skills
This commit is contained in:
330
skills/web-access/references/cdp-browser.md
Normal file
330
skills/web-access/references/cdp-browser.md
Normal file
@@ -0,0 +1,330 @@
|
||||
# CDP Browser Access — Login-Gated Sites Manual
|
||||
|
||||
Detailed recipes for accessing sites that require the user's login session, via Chrome DevTools Protocol (CDP) + Python Playwright.
|
||||
|
||||
**Precondition**: Chrome is already running with `--remote-debugging-port=9222` and the user has manually logged in to the target sites. See the main SKILL.md `Prerequisites` section for the launch command.
|
||||
|
||||
---
|
||||
|
||||
## Why CDP attach, not headless
|
||||
|
||||
| Approach | Login state | Anti-bot | Speed | Cost |
|
||||
|----------|-------------|----------|-------|------|
|
||||
| Headless Playwright (new context) | ❌ Empty cookies | ❌ Flagged as bot | Slow cold start | Re-login pain |
|
||||
| `playwright.chromium.launch(headless=False)` | ❌ Fresh profile | ⚠ Sometimes flagged | Slow | Same |
|
||||
| **CDP attach (`connect_over_cdp`)** | ✅ User's real cookies | ✅ Looks human | Instant | Zero friction |
|
||||
|
||||
**Rule**: For any login-gated site, always attach to the user's running Chrome.
|
||||
|
||||
---
|
||||
|
||||
## Core Template
|
||||
|
||||
Every CDP script follows this shape:
|
||||
|
||||
```python
|
||||
from playwright.sync_api import sync_playwright
|
||||
|
||||
def fetch_with_cdp(url: str, wait_selector: str | None = None) -> str:
|
||||
"""Attach to user's Chrome via CDP, fetch URL, return HTML."""
|
||||
with sync_playwright() as p:
|
||||
browser = p.chromium.connect_over_cdp("http://localhost:9222")
|
||||
# browser.contexts[0] is the user's default context (with cookies)
|
||||
context = browser.contexts[0]
|
||||
page = context.new_page()
|
||||
try:
|
||||
page.goto(url, wait_until="domcontentloaded", timeout=30000)
|
||||
if wait_selector:
|
||||
page.wait_for_selector(wait_selector, timeout=10000)
|
||||
else:
|
||||
page.wait_for_timeout(2000) # generic settle
|
||||
return page.content()
|
||||
finally:
|
||||
page.close()
|
||||
# DO NOT call browser.close() — that would close the user's Chrome!
|
||||
|
||||
if __name__ == "__main__":
|
||||
html = fetch_with_cdp("https://example.com")
|
||||
print(html[:1000])
|
||||
```
|
||||
|
||||
**Critical**: Never call `browser.close()` when using CDP attach — you'd kill the user's Chrome. Only close the page you opened.
|
||||
|
||||
---
|
||||
|
||||
## Site Recipes
|
||||
|
||||
### 小红书 (xiaohongshu.com)
|
||||
|
||||
```python
|
||||
from playwright.sync_api import sync_playwright
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
NOTE_URL = "https://www.xiaohongshu.com/explore/XXXXXXXX"
|
||||
|
||||
with sync_playwright() as p:
|
||||
browser = p.chromium.connect_over_cdp("http://localhost:9222")
|
||||
page = browser.contexts[0].new_page()
|
||||
page.goto(NOTE_URL, wait_until="domcontentloaded")
|
||||
page.wait_for_selector("#detail-title", timeout=10000)
|
||||
page.wait_for_timeout(1500) # let images/comments load
|
||||
html = page.content()
|
||||
page.close()
|
||||
|
||||
soup = BeautifulSoup(html, "html.parser")
|
||||
title = (soup.select_one("#detail-title") or {}).get_text(strip=True) if soup.select_one("#detail-title") else None
|
||||
desc = (soup.select_one("#detail-desc") or {}).get_text(" ", strip=True) if soup.select_one("#detail-desc") else None
|
||||
author = soup.select_one(".author-wrapper .username")
|
||||
print("Title:", title)
|
||||
print("Author:", author.get_text(strip=True) if author else None)
|
||||
print("Desc:", desc)
|
||||
```
|
||||
|
||||
**Selectors** (may drift over time — update if they fail):
|
||||
- Title: `#detail-title`
|
||||
- Description: `#detail-desc`
|
||||
- Author: `.author-wrapper .username`
|
||||
- Images: `.swiper-slide img`
|
||||
- Comments: `.parent-comment .content`
|
||||
|
||||
### B站 (bilibili.com)
|
||||
|
||||
```python
|
||||
from playwright.sync_api import sync_playwright
|
||||
from bs4 import BeautifulSoup
|
||||
|
||||
VIDEO_URL = "https://www.bilibili.com/video/BVxxxxxxxxx"
|
||||
|
||||
with sync_playwright() as p:
|
||||
browser = p.chromium.connect_over_cdp("http://localhost:9222")
|
||||
page = browser.contexts[0].new_page()
|
||||
page.goto(VIDEO_URL, wait_until="networkidle")
|
||||
page.wait_for_timeout(2000)
|
||||
html = page.content()
|
||||
page.close()
|
||||
|
||||
soup = BeautifulSoup(html, "html.parser")
|
||||
print("Title:", soup.select_one("h1.video-title").get_text(strip=True) if soup.select_one("h1.video-title") else None)
|
||||
print("UP:", soup.select_one(".up-name").get_text(strip=True) if soup.select_one(".up-name") else None)
|
||||
print("Desc:", soup.select_one(".desc-info-text").get_text(" ", strip=True) if soup.select_one(".desc-info-text") else None)
|
||||
```
|
||||
|
||||
**Tip**: For B站 evaluations, the [公开 API](https://api.bilibili.com/x/web-interface/view?bvid=XXXX) often returns JSON without needing CDP. Try it first:
|
||||
|
||||
```bash
|
||||
curl -s "https://api.bilibili.com/x/web-interface/view?bvid=BVxxxxxxxxx" | python3 -m json.tool
|
||||
```
|
||||
|
||||
### 微博 (weibo.com)
|
||||
|
||||
```python
|
||||
WEIBO_URL = "https://weibo.com/u/1234567890" # or /detail/xxx
|
||||
|
||||
# Same CDP template
|
||||
# Selectors:
|
||||
# .Feed_body_3R0rO .detail_wbtext_4CRf9 — post text
|
||||
# .ALink_default_2ibt1 — user link
|
||||
# article[aria-label="微博"] — each feed item
|
||||
```
|
||||
|
||||
**Note**: Weibo uses React + heavy obfuscation. Selectors change frequently. If selectors fail, pipe the HTML through Jina for clean Markdown:
|
||||
|
||||
```python
|
||||
html = fetch_with_cdp(WEIBO_URL)
|
||||
# Save to temp file, then:
|
||||
import subprocess
|
||||
result = subprocess.run(
|
||||
["curl", "-sL", f"https://r.jina.ai/{WEIBO_URL}"],
|
||||
capture_output=True, text=True
|
||||
)
|
||||
print(result.stdout)
|
||||
```
|
||||
|
||||
### 知乎 (zhihu.com)
|
||||
|
||||
```python
|
||||
ANSWER_URL = "https://www.zhihu.com/question/123/answer/456"
|
||||
|
||||
# Selectors:
|
||||
# h1.QuestionHeader-title — question title
|
||||
# .RichContent-inner — answer body
|
||||
# .AuthorInfo-name — author
|
||||
```
|
||||
|
||||
Zhihu works with CDP but often also renders enough metadata server-side for Jina to work:
|
||||
|
||||
```bash
|
||||
curl -sL "https://r.jina.ai/https://www.zhihu.com/question/123/answer/456"
|
||||
```
|
||||
|
||||
Try Jina first, fall back to CDP if content is truncated.
|
||||
|
||||
### 飞书文档 (feishu.cn / larksuite.com)
|
||||
|
||||
```python
|
||||
DOC_URL = "https://xxx.feishu.cn/docs/xxx"
|
||||
|
||||
# Feishu uses heavy virtualization — must scroll to load all content.
|
||||
# Recipe:
|
||||
|
||||
from playwright.sync_api import sync_playwright
|
||||
|
||||
with sync_playwright() as p:
|
||||
browser = p.chromium.connect_over_cdp("http://localhost:9222")
|
||||
page = browser.contexts[0].new_page()
|
||||
page.goto(DOC_URL, wait_until="domcontentloaded")
|
||||
page.wait_for_selector(".docs-render-unit", timeout=15000)
|
||||
|
||||
# Scroll to bottom repeatedly to load lazy content
|
||||
last_height = 0
|
||||
for _ in range(20):
|
||||
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
page.wait_for_timeout(800)
|
||||
h = page.evaluate("document.body.scrollHeight")
|
||||
if h == last_height:
|
||||
break
|
||||
last_height = h
|
||||
|
||||
# Extract text
|
||||
text = page.evaluate("() => document.body.innerText")
|
||||
page.close()
|
||||
|
||||
print(text)
|
||||
```
|
||||
|
||||
### Twitter / X
|
||||
|
||||
```python
|
||||
TWEET_URL = "https://x.com/username/status/1234567890"
|
||||
|
||||
# Selectors:
|
||||
# article[data-testid="tweet"] — tweet container
|
||||
# div[data-testid="tweetText"] — tweet text
|
||||
# div[data-testid="User-Name"] — author
|
||||
# a[href$="/analytics"] — view count anchor (next sibling has stats)
|
||||
```
|
||||
|
||||
Twitter is aggressive with anti-bot. CDP attach usually works, but set a generous wait:
|
||||
|
||||
```python
|
||||
page.goto(url, wait_until="networkidle", timeout=45000)
|
||||
page.wait_for_selector('article[data-testid="tweet"]', timeout=15000)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### Pattern 1: Scroll to load lazy content
|
||||
|
||||
```python
|
||||
def scroll_to_bottom(page, max_steps=30, pause_ms=800):
|
||||
last = 0
|
||||
for _ in range(max_steps):
|
||||
page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
page.wait_for_timeout(pause_ms)
|
||||
h = page.evaluate("document.body.scrollHeight")
|
||||
if h == last:
|
||||
return
|
||||
last = h
|
||||
```
|
||||
|
||||
### Pattern 2: Screenshot a specific element
|
||||
|
||||
```python
|
||||
element = page.locator("article").first
|
||||
element.screenshot(path="/tmp/article.png")
|
||||
```
|
||||
|
||||
### Pattern 3: Extract structured data via JavaScript
|
||||
|
||||
```python
|
||||
data = page.evaluate("""() => {
|
||||
const items = document.querySelectorAll('.list-item');
|
||||
return Array.from(items).map(el => ({
|
||||
title: el.querySelector('.title')?.innerText,
|
||||
url: el.querySelector('a')?.href,
|
||||
}));
|
||||
}""")
|
||||
print(data)
|
||||
```
|
||||
|
||||
### Pattern 4: Fill a form and click
|
||||
|
||||
```python
|
||||
page.fill("input[name=q]", "search query")
|
||||
page.click("button[type=submit]")
|
||||
page.wait_for_load_state("networkidle")
|
||||
```
|
||||
|
||||
### Pattern 5: Clean HTML via Jina after extraction
|
||||
|
||||
When selectors are unreliable, dump the full page HTML and let Jina do the cleaning:
|
||||
|
||||
```python
|
||||
html = page.content()
|
||||
# Save to file, serve via local HTTP, or just pipe the original URL:
|
||||
import subprocess
|
||||
clean_md = subprocess.run(
|
||||
["curl", "-sL", f"https://r.jina.ai/{url}"],
|
||||
capture_output=True, text=True
|
||||
).stdout
|
||||
print(clean_md)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### `connect_over_cdp` fails with `ECONNREFUSED`
|
||||
|
||||
Chrome is not running with remote debugging. Tell the user:
|
||||
> "请先用下面的命令启动 Chrome:
|
||||
> `/Applications/Google\\ Chrome.app/Contents/MacOS/Google\\ Chrome --remote-debugging-port=9222 --user-data-dir=\"$HOME/.desirecore/chrome-profile\"`
|
||||
> 然后手动登录需要抓取的网站,再让我继续。"
|
||||
|
||||
### `browser.contexts[0]` is empty
|
||||
|
||||
Chrome was launched but no windows are open. Ask the user to open at least one tab and navigate anywhere.
|
||||
|
||||
### Playwright not installed
|
||||
|
||||
```bash
|
||||
pip3 install playwright beautifulsoup4
|
||||
# No need for `playwright install` — we're attaching to existing Chrome, not downloading a new browser
|
||||
```
|
||||
|
||||
### Site detects automation
|
||||
|
||||
Despite CDP attach, some sites (Cloudflare-protected, Instagram) may still detect automation. Options:
|
||||
1. Use Jina Reader instead (`curl -sL https://r.jina.ai/<url>`) — often succeeds where Playwright fails
|
||||
2. Ask the user to manually copy the visible content
|
||||
3. Use the site's public API if available
|
||||
|
||||
### Content is truncated
|
||||
|
||||
The page uses virtualization or lazy loading. Apply Pattern 1 (scroll to bottom) before calling `page.content()`.
|
||||
|
||||
### `page.wait_for_selector` times out
|
||||
|
||||
The selector is stale — the site updated its DOM. Dump `page.content()[:5000]` and inspect manually, or fall back to Jina Reader.
|
||||
|
||||
---
|
||||
|
||||
## Security Notes
|
||||
|
||||
- **Never log or print cookies** from `context.cookies()` even during debugging
|
||||
- **Never extract and store** the user's session tokens to files
|
||||
- **Never use the CDP session** to perform writes (post, comment, like) unless the user explicitly requested it
|
||||
- The `~/.desirecore/chrome-profile` directory contains the user's credentials — treat it as sensitive
|
||||
- If the user asks to "log in automatically", refuse and explain they must log in manually in the Chrome window; the skill only reads already-authenticated sessions
|
||||
|
||||
---
|
||||
|
||||
## When NOT to use CDP
|
||||
|
||||
- **Public static sites** → use L1 `WebFetch`, it's faster
|
||||
- **Heavy SPAs without login walls** → use L2 Jina Reader, it's cheaper on tokens
|
||||
- **You need thousands of pages** → CDP is not built for scale; look into proper scrapers
|
||||
|
||||
CDP is specifically the "right tool" for: **small number of pages + login required + human-like behavior needed**.
|
||||
Reference in New Issue
Block a user