market/skills/web-access/references/cdp-browser.md
张馨元 98322aa930 feat: add web-access and frontend-design built-in skills
Fills in (c) and (e) of the five built-in skills recommended in the docs:

web-access v1.1.0:
- Three-layer architecture: L1 WebSearch/WebFetch + L2 Jina Reader + L3 CDP Browser
- Added Chrome CDP prerequisites (macOS/Linux/Windows launch commands)
- Logged-in access to 小红书/B站/微博/知乎/飞书/Twitter/公众号
- Jina Reader repositioned as the default token-optimization layer (not a fallback)
- Added references/cdp-browser.md (detailed Python Playwright operations manual)
- Expanded trigger words: 小红书, B站, 微博, 飞书, Twitter, 推特, X, 知乎, 公众号

frontend-design v1.0.0:
- Adapted from the official Claude Code frontend-design skill
- Keeps the original bold-aesthetic design philosophy
- New Project Context Override section: when working inside the DesireCore main repo,
  automatically follow the 3+2 color system (Green/Blue/Purple + Orange/Red)
- Added an Output Rule requiring that the user be told output file paths

builtin-skills.json: 12 → 14 skills
2026-04-07 15:35:59 +08:00


CDP Browser Access — Login-Gated Sites Manual

Detailed recipes for accessing sites that require the user's login session, via Chrome DevTools Protocol (CDP) + Python Playwright.

Precondition: Chrome is already running with --remote-debugging-port=9222 and the user has manually logged in to the target sites. See the main SKILL.md Prerequisites section for the launch command.


Why CDP attach, not headless

Approach                                   | Login state         | Anti-bot            | Speed           | Cost
Headless Playwright (new context)          | Empty cookies       | Flagged as bot      | Slow cold start | Re-login pain
playwright.chromium.launch(headless=False) | Fresh profile       | ⚠ Sometimes flagged | Slow            | Same
CDP attach (connect_over_cdp)              | User's real cookies | Looks human         | Instant         | Zero friction

Rule: For any login-gated site, always attach to the user's running Chrome.
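Before attaching, it can be worth verifying that the debug port is actually up: Chrome exposes a JSON metadata endpoint at /json/version on the debugging port. A minimal preflight sketch (the helper name is illustrative):

```python
import json
import urllib.request

def cdp_available(port: int = 9222) -> bool:
    """Return True if a Chrome DevTools endpoint answers on the given port."""
    try:
        with urllib.request.urlopen(
            f"http://localhost:{port}/json/version", timeout=2
        ) as resp:
            info = json.load(resp)
        # A real CDP endpoint advertises its WebSocket debugger URL.
        return "webSocketDebuggerUrl" in info
    except OSError:
        return False
```

If this returns False, surface the launch command from the Troubleshooting section instead of attempting connect_over_cdp.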


Core Template

Every CDP script follows this shape:

from playwright.sync_api import sync_playwright

def fetch_with_cdp(url: str, wait_selector: str | None = None) -> str:
    """Attach to user's Chrome via CDP, fetch URL, return HTML."""
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp("http://localhost:9222")
        # browser.contexts[0] is the user's default context (with cookies)
        context = browser.contexts[0]
        page = context.new_page()
        try:
            page.goto(url, wait_until="domcontentloaded", timeout=30000)
            if wait_selector:
                page.wait_for_selector(wait_selector, timeout=10000)
            else:
                page.wait_for_timeout(2000)  # generic settle
            return page.content()
        finally:
            page.close()
            # DO NOT call browser.close() — that would close the user's Chrome!

if __name__ == "__main__":
    html = fetch_with_cdp("https://example.com")
    print(html[:1000])

Critical: Never call browser.close() when using CDP attach — you'd kill the user's Chrome. Only close the page you opened.


Site Recipes

小红书 (xiaohongshu.com)

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

NOTE_URL = "https://www.xiaohongshu.com/explore/XXXXXXXX"

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto(NOTE_URL, wait_until="domcontentloaded")
    page.wait_for_selector("#detail-title", timeout=10000)
    page.wait_for_timeout(1500)  # let images/comments load
    html = page.content()
    page.close()

soup = BeautifulSoup(html, "html.parser")
title_el = soup.select_one("#detail-title")
desc_el  = soup.select_one("#detail-desc")
title = title_el.get_text(strip=True) if title_el else None
desc  = desc_el.get_text(" ", strip=True) if desc_el else None
author = soup.select_one(".author-wrapper .username")
print("Title:",  title)
print("Author:", author.get_text(strip=True) if author else None)
print("Desc:",   desc)

Selectors (may drift over time — update if they fail):

  • Title: #detail-title
  • Description: #detail-desc
  • Author: .author-wrapper .username
  • Images: .swiper-slide img
  • Comments: .parent-comment .content

B站 (bilibili.com)

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

VIDEO_URL = "https://www.bilibili.com/video/BVxxxxxxxxx"

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto(VIDEO_URL, wait_until="networkidle")
    page.wait_for_timeout(2000)
    html = page.content()
    page.close()

soup = BeautifulSoup(html, "html.parser")
title_el = soup.select_one("h1.video-title")
up_el    = soup.select_one(".up-name")
desc_el  = soup.select_one(".desc-info-text")
print("Title:", title_el.get_text(strip=True) if title_el else None)
print("UP:",    up_el.get_text(strip=True) if up_el else None)
print("Desc:",  desc_el.get_text(" ", strip=True) if desc_el else None)

Tip: For B站 videos, the public web API often returns the metadata as JSON without needing CDP. Try it first:

curl -s "https://api.bilibili.com/x/web-interface/view?bvid=BVxxxxxxxxx" | python3 -m json.tool
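The same API call can be done from Python. The view API wraps its payload in a {code, message, data} envelope; the sketch below assumes the documented field names (title, owner.name, desc), and the helper names are mine. Note the API may reject requests without a browser-like User-Agent:

```python
import json
import urllib.request

API = "https://api.bilibili.com/x/web-interface/view?bvid={bvid}"

def parse_view_payload(payload: dict) -> dict:
    """Extract title / uploader / description from the view API envelope."""
    if payload.get("code") != 0:
        raise RuntimeError(f"API error: {payload.get('message')}")
    data = payload["data"]
    return {"title": data["title"], "up": data["owner"]["name"], "desc": data["desc"]}

def fetch_video_info(bvid: str) -> dict:
    # Some endpoints 412 on the default Python User-Agent, so send a browser-like one.
    req = urllib.request.Request(
        API.format(bvid=bvid), headers={"User-Agent": "Mozilla/5.0"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_view_payload(json.load(resp))
```

Fall back to the CDP recipe above only when the API response is missing something you need.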

微博 (weibo.com)

WEIBO_URL = "https://weibo.com/u/1234567890"  # or /detail/xxx

# Same CDP template
# Selectors:
#   .Feed_body_3R0rO .detail_wbtext_4CRf9    — post text
#   .ALink_default_2ibt1                      — user link
#   article[aria-label="微博"]                 — each feed item

Note: Weibo is a React app with hashed class names, so these selectors drift frequently. If they fail, skip local parsing and fetch the URL through Jina Reader for clean Markdown instead:

import subprocess

result = subprocess.run(
    ["curl", "-sL", f"https://r.jina.ai/{WEIBO_URL}"],
    capture_output=True, text=True,
)
print(result.stdout)

知乎 (zhihu.com)

ANSWER_URL = "https://www.zhihu.com/question/123/answer/456"

# Selectors:
#   h1.QuestionHeader-title      — question title
#   .RichContent-inner            — answer body
#   .AuthorInfo-name              — author

Zhihu works with CDP but often also renders enough metadata server-side for Jina to work:

curl -sL "https://r.jina.ai/https://www.zhihu.com/question/123/answer/456"

Try Jina first, fall back to CDP if content is truncated.
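That "Jina first, CDP fallback" order can be written once as a tiered fetcher. A sketch, assuming a rough length heuristic for "truncated" (the 500-char threshold and helper names are mine); pass the Core Template's fetch_with_cdp as the fallback callable:

```python
import subprocess
from typing import Callable

def should_fall_back(markdown: str, min_chars: int = 500) -> bool:
    """Heuristic: very short Jina output usually means a login wall or truncation."""
    return len(markdown.strip()) < min_chars

def fetch_tiered(url: str, cdp_fetch: Callable[[str], str]) -> str:
    """Try Jina Reader first; fall back to CDP attach if the result looks truncated."""
    jina = subprocess.run(
        ["curl", "-sL", f"https://r.jina.ai/{url}"],
        capture_output=True, text=True, timeout=60,
    )
    if jina.returncode == 0 and not should_fall_back(jina.stdout):
        return jina.stdout
    return cdp_fetch(url)
```

Usage: fetch_tiered(ANSWER_URL, fetch_with_cdp). Tune min_chars per site if short pages trigger false fallbacks.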

飞书文档 (feishu.cn / larksuite.com)

DOC_URL = "https://xxx.feishu.cn/docs/xxx"

# Feishu uses heavy virtualization — must scroll to load all content.
# Recipe:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto(DOC_URL, wait_until="domcontentloaded")
    page.wait_for_selector(".docs-render-unit", timeout=15000)

    # Scroll to bottom repeatedly to load lazy content
    last_height = 0
    for _ in range(20):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(800)
        h = page.evaluate("document.body.scrollHeight")
        if h == last_height:
            break
        last_height = h

    # Extract text
    text = page.evaluate("() => document.body.innerText")
    page.close()

print(text)

Twitter / X

TWEET_URL = "https://x.com/username/status/1234567890"

# Selectors:
#   article[data-testid="tweet"]         — tweet container
#   div[data-testid="tweetText"]          — tweet text
#   div[data-testid="User-Name"]          — author
#   a[href$="/analytics"]                 — view count anchor (next sibling has stats)

Twitter is aggressive with anti-bot. CDP attach usually works, but set a generous wait:

page.goto(url, wait_until="networkidle", timeout=45000)
page.wait_for_selector('article[data-testid="tweet"]', timeout=15000)
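Once the tweet articles are present, the data-testid hooks listed above can be applied to the dumped HTML with BeautifulSoup. A sketch, assuming those selectors still match X's current DOM (they drift):

```python
from bs4 import BeautifulSoup

def extract_tweets(html: str) -> list[dict]:
    """Pull text and author from each tweet article via data-testid hooks."""
    soup = BeautifulSoup(html, "html.parser")
    tweets = []
    for art in soup.select('article[data-testid="tweet"]'):
        text_el = art.select_one('div[data-testid="tweetText"]')
        name_el = art.select_one('div[data-testid="User-Name"]')
        tweets.append({
            "text": text_el.get_text(" ", strip=True) if text_el else None,
            "author": name_el.get_text(" ", strip=True) if name_el else None,
        })
    return tweets
```

Run it on page.content() after the wait_for_selector above succeeds; combine with Pattern 1 first if you need more than the initially visible tweets.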

Common Patterns

Pattern 1: Scroll to load lazy content

def scroll_to_bottom(page, max_steps=30, pause_ms=800):
    last = 0
    for _ in range(max_steps):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)
        h = page.evaluate("document.body.scrollHeight")
        if h == last:
            return
        last = h

Pattern 2: Screenshot a specific element

element = page.locator("article").first
element.screenshot(path="/tmp/article.png")

Pattern 3: Extract structured data via JavaScript

data = page.evaluate("""() => {
    const items = document.querySelectorAll('.list-item');
    return Array.from(items).map(el => ({
        title: el.querySelector('.title')?.innerText,
        url:   el.querySelector('a')?.href,
    }));
}""")
print(data)

Pattern 4: Fill a form and click

page.fill("input[name=q]", "search query")
page.click("button[type=submit]")
page.wait_for_load_state("networkidle")

Pattern 5: Clean Markdown via Jina instead of selectors

When selectors are unreliable, skip local parsing and let Jina Reader fetch and clean the page itself:

import subprocess

clean_md = subprocess.run(
    ["curl", "-sL", f"https://r.jina.ai/{url}"],
    capture_output=True, text=True,
).stdout
print(clean_md)

Troubleshooting

connect_over_cdp fails with ECONNREFUSED

Chrome is not running with remote debugging. Tell the user:

"Please start Chrome with this command: /Applications/Google\\ Chrome.app/Contents/MacOS/Google\\ Chrome --remote-debugging-port=9222 --user-data-dir=\"$HOME/.desirecore/chrome-profile\" Then log in manually to the sites you need scraped, and ask me to continue."

browser.contexts[0] is empty

Chrome was launched but no windows are open. Ask the user to open at least one tab and navigate anywhere.

Playwright not installed

pip3 install playwright beautifulsoup4
# No need for `playwright install` — we're attaching to existing Chrome, not downloading a new browser

Site detects automation

Despite CDP attach, some sites (Cloudflare-protected, Instagram) may still detect automation. Options:

  1. Use Jina Reader instead (curl -sL https://r.jina.ai/<url>) — often succeeds where Playwright fails
  2. Ask the user to manually copy the visible content
  3. Use the site's public API if available

Content is truncated

The page uses virtualization or lazy loading. Apply Pattern 1 (scroll to bottom) before calling page.content().

page.wait_for_selector times out

The selector is stale — the site updated its DOM. Dump page.content()[:5000] and inspect manually, or fall back to Jina Reader.


Security Notes

  • Never log or print cookies from context.cookies() even during debugging
  • Never extract and store the user's session tokens to files
  • Never use the CDP session to perform writes (post, comment, like) unless the user explicitly requested it
  • The ~/.desirecore/chrome-profile directory contains the user's credentials — treat it as sensitive
  • If the user asks to "log in automatically", refuse and explain they must log in manually in the Chrome window; the skill only reads already-authenticated sessions

When NOT to use CDP

  • Public static sites → use L1 WebFetch, it's faster
  • Heavy SPAs without login walls → use L2 Jina Reader, it's cheaper on tokens
  • You need thousands of pages → CDP is not built for scale; look into proper scrapers

CDP is specifically the "right tool" for: small number of pages + login required + human-like behavior needed.