market/skills/web-access/references/cdp-browser.md
张馨元 98322aa930 feat: add web-access and frontend-design built-in skills
Fills in (c) and (e) of the five built-in skills recommended in the docs:

web-access v1.1.0:
- Three-layer architecture: L1 WebSearch/WebFetch + L2 Jina Reader + L3 CDP Browser
- Added Chrome CDP prerequisites (macOS/Linux/Windows launch commands)
- Logged-in access to 小红书/B站/微博/知乎/飞书/Twitter/公众号
- Jina Reader repositioned as the default token-optimization layer (not a fallback)
- Added references/cdp-browser.md (detailed Python Playwright operations manual)
- Expanded trigger words: 小红书, B站, 微博, 飞书, Twitter, 推特, X, 知乎, 公众号

frontend-design v1.0.0:
- Adapted from the official Claude Code frontend-design skill
- Keeps the original bold-aesthetic design philosophy
- New Project Context Override section: when working inside the DesireCore main repo,
  automatically follow the 3+2 color system (Green/Blue/Purple + Orange/Red)
- Added an Output Rule requiring that the user be told output file paths

builtin-skills.json: 12 → 14 skills
2026-04-07 15:35:59 +08:00


CDP Browser Access — Login-Gated Sites Manual

Detailed recipes for accessing sites that require the user's login session, via Chrome DevTools Protocol (CDP) + Python Playwright.

Precondition: Chrome is already running with --remote-debugging-port=9222 and the user has manually logged in to the target sites. See the main SKILL.md Prerequisites section for the launch command.


Why CDP attach, not headless

Approach                                   | Login state         | Anti-bot            | Speed           | Cost
Headless Playwright (new context)          | Empty cookies       | Flagged as bot      | Slow cold start | Re-login pain
playwright.chromium.launch(headless=False) | Fresh profile       | ⚠ Sometimes flagged | Slow            | Same
CDP attach (connect_over_cdp)              | User's real cookies | Looks human         | Instant         | Zero friction

Rule: For any login-gated site, always attach to the user's running Chrome.
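Before attaching, it can be worth verifying that the debug port is actually up: Chrome exposes a JSON metadata endpoint at /json/version on the debugging port. A minimal preflight sketch (the helper name is illustrative):

```python
import json
import urllib.request

def cdp_available(port: int = 9222) -> bool:
    """Return True if a Chrome DevTools endpoint answers on the given port."""
    try:
        with urllib.request.urlopen(
            f"http://localhost:{port}/json/version", timeout=2
        ) as resp:
            info = json.load(resp)
        # A real CDP endpoint advertises its WebSocket debugger URL.
        return "webSocketDebuggerUrl" in info
    except OSError:
        return False
```

If this returns False, surface the launch command from the Troubleshooting section instead of attempting connect_over_cdp.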


Core Template

Every CDP script follows this shape:

from playwright.sync_api import sync_playwright

def fetch_with_cdp(url: str, wait_selector: str | None = None) -> str:
    """Attach to user's Chrome via CDP, fetch URL, return HTML."""
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp("http://localhost:9222")
        # browser.contexts[0] is the user's default context (with cookies)
        context = browser.contexts[0]
        page = context.new_page()
        try:
            page.goto(url, wait_until="domcontentloaded", timeout=30000)
            if wait_selector:
                page.wait_for_selector(wait_selector, timeout=10000)
            else:
                page.wait_for_timeout(2000)  # generic settle
            return page.content()
        finally:
            page.close()
            # DO NOT call browser.close() — that would close the user's Chrome!

if __name__ == "__main__":
    html = fetch_with_cdp("https://example.com")
    print(html[:1000])

Critical: Never call browser.close() when using CDP attach — you'd kill the user's Chrome. Only close the page you opened.


Site Recipes

小红书 (xiaohongshu.com)

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

NOTE_URL = "https://www.xiaohongshu.com/explore/XXXXXXXX"

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto(NOTE_URL, wait_until="domcontentloaded")
    page.wait_for_selector("#detail-title", timeout=10000)
    page.wait_for_timeout(1500)  # let images/comments load
    html = page.content()
    page.close()

soup = BeautifulSoup(html, "html.parser")
title_el = soup.select_one("#detail-title")
desc_el  = soup.select_one("#detail-desc")
title = title_el.get_text(strip=True) if title_el else None
desc  = desc_el.get_text(" ", strip=True) if desc_el else None
author = soup.select_one(".author-wrapper .username")
print("Title:",  title)
print("Author:", author.get_text(strip=True) if author else None)
print("Desc:",   desc)

Selectors (may drift over time — update if they fail):

  • Title: #detail-title
  • Description: #detail-desc
  • Author: .author-wrapper .username
  • Images: .swiper-slide img
  • Comments: .parent-comment .content

B站 (bilibili.com)

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

VIDEO_URL = "https://www.bilibili.com/video/BVxxxxxxxxx"

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto(VIDEO_URL, wait_until="networkidle")
    page.wait_for_timeout(2000)
    html = page.content()
    page.close()

soup = BeautifulSoup(html, "html.parser")
title_el = soup.select_one("h1.video-title")
up_el    = soup.select_one(".up-name")
desc_el  = soup.select_one(".desc-info-text")
print("Title:", title_el.get_text(strip=True) if title_el else None)
print("UP:",    up_el.get_text(strip=True) if up_el else None)
print("Desc:",  desc_el.get_text(" ", strip=True) if desc_el else None)

Tip: For B站 videos, the public web API often returns the metadata as JSON without needing CDP. Try it first:

curl -s "https://api.bilibili.com/x/web-interface/view?bvid=BVxxxxxxxxx" | python3 -m json.tool
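The same API call can be done from Python. The view API wraps its payload in a {code, message, data} envelope; the sketch below assumes the documented field names (title, owner.name, desc), and the helper names are mine. Note the API may reject requests without a browser-like User-Agent:

```python
import json
import urllib.request

API = "https://api.bilibili.com/x/web-interface/view?bvid={bvid}"

def parse_view_payload(payload: dict) -> dict:
    """Extract title / uploader / description from the view API envelope."""
    if payload.get("code") != 0:
        raise RuntimeError(f"API error: {payload.get('message')}")
    data = payload["data"]
    return {"title": data["title"], "up": data["owner"]["name"], "desc": data["desc"]}

def fetch_video_info(bvid: str) -> dict:
    # Some endpoints 412 on the default Python User-Agent, so send a browser-like one.
    req = urllib.request.Request(
        API.format(bvid=bvid), headers={"User-Agent": "Mozilla/5.0"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_view_payload(json.load(resp))
```

Fall back to the CDP recipe above only when the API response is missing something you need.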

微博 (weibo.com)

WEIBO_URL = "https://weibo.com/u/1234567890"  # or /detail/xxx

# Same CDP template
# Selectors:
#   .Feed_body_3R0rO .detail_wbtext_4CRf9    — post text
#   .ALink_default_2ibt1                      — user link
#   article[aria-label="微博"]                 — each feed item

Note: Weibo is a React app with hashed class names, so these selectors drift frequently. If they fail, skip local parsing and fetch the URL through Jina Reader for clean Markdown instead:

import subprocess

result = subprocess.run(
    ["curl", "-sL", f"https://r.jina.ai/{WEIBO_URL}"],
    capture_output=True, text=True,
)
print(result.stdout)

知乎 (zhihu.com)

ANSWER_URL = "https://www.zhihu.com/question/123/answer/456"

# Selectors:
#   h1.QuestionHeader-title      — question title
#   .RichContent-inner            — answer body
#   .AuthorInfo-name              — author

Zhihu works with CDP but often also renders enough metadata server-side for Jina to work:

curl -sL "https://r.jina.ai/https://www.zhihu.com/question/123/answer/456"

Try Jina first, fall back to CDP if content is truncated.
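That "Jina first, CDP fallback" order can be written once as a tiered fetcher. A sketch, assuming a rough length heuristic for "truncated" (the 500-char threshold and helper names are mine); pass the Core Template's fetch_with_cdp as the fallback callable:

```python
import subprocess
from typing import Callable

def should_fall_back(markdown: str, min_chars: int = 500) -> bool:
    """Heuristic: very short Jina output usually means a login wall or truncation."""
    return len(markdown.strip()) < min_chars

def fetch_tiered(url: str, cdp_fetch: Callable[[str], str]) -> str:
    """Try Jina Reader first; fall back to CDP attach if the result looks truncated."""
    jina = subprocess.run(
        ["curl", "-sL", f"https://r.jina.ai/{url}"],
        capture_output=True, text=True, timeout=60,
    )
    if jina.returncode == 0 and not should_fall_back(jina.stdout):
        return jina.stdout
    return cdp_fetch(url)
```

Usage: fetch_tiered(ANSWER_URL, fetch_with_cdp). Tune min_chars per site if short pages trigger false fallbacks.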

飞书文档 (feishu.cn / larksuite.com)

DOC_URL = "https://xxx.feishu.cn/docs/xxx"

# Feishu uses heavy virtualization — must scroll to load all content.
# Recipe:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp("http://localhost:9222")
    page = browser.contexts[0].new_page()
    page.goto(DOC_URL, wait_until="domcontentloaded")
    page.wait_for_selector(".docs-render-unit", timeout=15000)

    # Scroll to bottom repeatedly to load lazy content
    last_height = 0
    for _ in range(20):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(800)
        h = page.evaluate("document.body.scrollHeight")
        if h == last_height:
            break
        last_height = h

    # Extract text
    text = page.evaluate("() => document.body.innerText")
    page.close()

print(text)

Twitter / X

TWEET_URL = "https://x.com/username/status/1234567890"

# Selectors:
#   article[data-testid="tweet"]         — tweet container
#   div[data-testid="tweetText"]          — tweet text
#   div[data-testid="User-Name"]          — author
#   a[href$="/analytics"]                 — view count anchor (next sibling has stats)

Twitter is aggressive with anti-bot. CDP attach usually works, but set a generous wait:

page.goto(url, wait_until="networkidle", timeout=45000)
page.wait_for_selector('article[data-testid="tweet"]', timeout=15000)
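Once the tweet articles are present, the data-testid hooks listed above can be applied to the dumped HTML with BeautifulSoup. A sketch, assuming those selectors still match X's current DOM (they drift):

```python
from bs4 import BeautifulSoup

def extract_tweets(html: str) -> list[dict]:
    """Pull text and author from each tweet article via data-testid hooks."""
    soup = BeautifulSoup(html, "html.parser")
    tweets = []
    for art in soup.select('article[data-testid="tweet"]'):
        text_el = art.select_one('div[data-testid="tweetText"]')
        name_el = art.select_one('div[data-testid="User-Name"]')
        tweets.append({
            "text": text_el.get_text(" ", strip=True) if text_el else None,
            "author": name_el.get_text(" ", strip=True) if name_el else None,
        })
    return tweets
```

Run it on page.content() after the wait_for_selector above succeeds; combine with Pattern 1 first if you need more than the initially visible tweets.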

Common Patterns

Pattern 1: Scroll to load lazy content

def scroll_to_bottom(page, max_steps=30, pause_ms=800):
    last = 0
    for _ in range(max_steps):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)
        h = page.evaluate("document.body.scrollHeight")
        if h == last:
            return
        last = h

Pattern 2: Screenshot a specific element

element = page.locator("article").first
element.screenshot(path="/tmp/article.png")

Pattern 3: Extract structured data via JavaScript

data = page.evaluate("""() => {
    const items = document.querySelectorAll('.list-item');
    return Array.from(items).map(el => ({
        title: el.querySelector('.title')?.innerText,
        url:   el.querySelector('a')?.href,
    }));
}""")
print(data)

Pattern 4: Fill a form and click

page.fill("input[name=q]", "search query")
page.click("button[type=submit]")
page.wait_for_load_state("networkidle")

Pattern 5: Clean Markdown via Jina instead of selectors

When selectors are unreliable, skip local parsing and let Jina Reader fetch and clean the page itself:

import subprocess

clean_md = subprocess.run(
    ["curl", "-sL", f"https://r.jina.ai/{url}"],
    capture_output=True, text=True,
).stdout
print(clean_md)

Troubleshooting

connect_over_cdp fails with ECONNREFUSED

Chrome is not running with remote debugging. Tell the user:

"Please start Chrome with this command: /Applications/Google\\ Chrome.app/Contents/MacOS/Google\\ Chrome --remote-debugging-port=9222 --user-data-dir=\"$HOME/.desirecore/chrome-profile\" Then log in manually to the sites you need scraped, and ask me to continue."

browser.contexts[0] is empty

Chrome was launched but no windows are open. Ask the user to open at least one tab and navigate anywhere.

Playwright not installed

pip3 install playwright beautifulsoup4
# No need for `playwright install` — we're attaching to existing Chrome, not downloading a new browser

Site detects automation

Despite CDP attach, some sites (Cloudflare-protected, Instagram) may still detect automation. Options:

  1. Use Jina Reader instead (curl -sL https://r.jina.ai/<url>) — often succeeds where Playwright fails
  2. Ask the user to manually copy the visible content
  3. Use the site's public API if available

Content is truncated

The page uses virtualization or lazy loading. Apply Pattern 1 (scroll to bottom) before calling page.content().

page.wait_for_selector times out

The selector is stale — the site updated its DOM. Dump page.content()[:5000] and inspect manually, or fall back to Jina Reader.


Security Notes

  • Never log or print cookies from context.cookies() even during debugging
  • Never extract and store the user's session tokens to files
  • Never use the CDP session to perform writes (post, comment, like) unless the user explicitly requested it
  • The ~/.desirecore/chrome-profile directory contains the user's credentials — treat it as sensitive
  • If the user asks to "log in automatically", refuse and explain they must log in manually in the Chrome window; the skill only reads already-authenticated sessions

When NOT to use CDP

  • Public static sites → use L1 WebFetch, it's faster
  • Heavy SPAs without login walls → use L2 Jina Reader, it's cheaper on tokens
  • You need thousands of pages → CDP is not built for scale; look into proper scrapers

CDP is specifically the "right tool" for: small number of pages + login required + human-like behavior needed.