mirror of
https://git.openapi.site/https://github.com/desirecore/market.git
synced 2026-04-21 17:30:44 +08:00
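feat: add two built-in skills, web-access and frontend-design

Fills in items c) and e) of the five built-in skills recommended in the docs:

web-access v1.1.0:
- Three-layer architecture: L1 WebSearch/WebFetch + L2 Jina Reader + L3 CDP Browser
- Add Chrome CDP prerequisites (launch commands for macOS/Linux/Windows)
- Support logged-in access to 小红书/B站/微博/知乎/飞书/Twitter/公众号
- Reposition Jina Reader as the default token-optimization layer (not a last-resort fallback)
- Add references/cdp-browser.md (detailed Python Playwright operations manual)
- Expand trigger words: 小红书, B站, 微博, 飞书, Twitter, 推特, X, 知乎, 公众号

frontend-design v1.0.0:
- Adapted from the official Claude Code frontend-design skill
- Keep the original bold aesthetic design philosophy
- Add a Project Context Override section: when working inside the DesireCore main repository, automatically follow the 3+2 color system (Green/Blue/Purple + Orange/Red)
- Add an Output Rule requiring that the user be told the file path

builtin-skills.json: 12 → 14 skills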
122
skills/web-access/references/jina-reader.md
Normal file
@@ -0,0 +1,122 @@
# Jina Reader — Default Token-Optimization Layer
[Jina Reader](https://jina.ai/reader) is a free public service that renders any URL server-side and returns clean Markdown. In this skill's three-layer architecture, **Jina is Layer 2: the default extractor for heavy/JS-rendered pages**, not just a fallback.

---

## Positioning in the three-layer model
```
L1 WebFetch ── simple public static pages (docs, Wikipedia, HN)
     │
     │ WebFetch empty/truncated/garbled
     ▼
L2 Jina Reader ── DEFAULT for JS-heavy SPAs, long articles, Medium, Twitter
     │            Strips nav/ads automatically, saves 50-80% tokens
     │
     │ Login required, or Jina also fails
     ▼
L3 CDP Browser ── user's logged-in Chrome (小红书/B站/微博/飞书/Twitter)
```

**Key insight**: Don't wait for WebFetch to fail before trying Jina. For any URL you expect to be JS-heavy (any major SPA, Medium, Dev.to, long-form articles), go straight to Jina for the token savings.

---

## Basic Usage (no API key)
```bash
curl -sL "https://r.jina.ai/https://example.com/article"
```

Append the original URL, scheme included, after `r.jina.ai/`. The response is plain Markdown; pipe it to a file or read it directly.
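
The same request can be made from Python with only the standard library. This is a minimal sketch: the helper names `jina_url` and `fetch_markdown` are illustrative and not part of any Jina client, and the 30-second timeout is an arbitrary assumption.

```python
import urllib.request

JINA_READER = "https://r.jina.ai/"


def jina_url(target: str) -> str:
    """Prefix the full target URL (scheme included) with the Reader endpoint."""
    return JINA_READER + target


def fetch_markdown(target: str, timeout: float = 30.0) -> str:
    """Fetch `target` through Jina Reader and return the response body as text."""
    # urllib follows redirects by default, similar to `curl -L`.
    with urllib.request.urlopen(jina_url(target), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Usage (performs a network request):
#   markdown = fetch_markdown("https://example.com/article")
```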

---

## When to use each layer
| Scenario | Primary choice | Why |
|----------|---------------|-----|
| Wikipedia, MDN, official docs | L1 WebFetch | Static clean HTML, fastest |
| GitHub README (public) | L1 WebFetch | Simple markup |
| Medium articles | **L2 Jina** | Member walls + heavy JS |
| Dev.to, Hashnode | **L2 Jina** | JS-rendered |
| Substack, Ghost blogs | **L2 Jina** | Partial JS rendering |
| News sites with lazy-load | **L2 Jina** | Scroll-triggered content |
| Twitter/X public threads | **L2 Jina** first, L3 CDP if truncated | Jina works only sometimes |
| 公众号 (mp.weixin.qq.com) | **L2 Jina** | Clean Markdown extraction |
| LinkedIn articles | L3 CDP | Hard login wall |
| 小红书, B站, 微博, 飞书 | L3 CDP | Login required |
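
The table above can be approximated as a hostname lookup. The sketch below is one possible encoding, not the skill's actual routing logic: the domain spellings, the function name `choose_layer`, and the choice of L1 as the default for unknown hosts are all assumptions. Scenario rows with no single domain (Ghost blogs, news sites with lazy-load) are left out.

```python
from urllib.parse import urlsplit

# Hostname suffixes per layer, transcribed from the table above.
# The exact domains are assumptions; extend or correct them as needed.
L3_HOSTS = ("linkedin.com", "xiaohongshu.com", "bilibili.com", "weibo.com", "feishu.cn")
L2_HOSTS = ("medium.com", "dev.to", "hashnode.dev", "substack.com",
            "twitter.com", "x.com", "mp.weixin.qq.com")
L1_HOSTS = ("wikipedia.org", "developer.mozilla.org", "github.com")


def _matches(host: str, suffixes: tuple) -> bool:
    """True if host equals a suffix or is a subdomain of it."""
    return any(host == s or host.endswith("." + s) for s in suffixes)


def choose_layer(url: str, default: str = "L1 WebFetch") -> str:
    """Return the layer to try first for `url`."""
    host = (urlsplit(url).hostname or "").lower()
    if _matches(host, L3_HOSTS):
        return "L3 CDP"
    if _matches(host, L2_HOSTS):
        return "L2 Jina"
    if _matches(host, L1_HOSTS):
        return "L1 WebFetch"
    return default  # unknown host: assumed default, not stated in the table
```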

---

## Token savings example
- Raw HTML of a long Medium article: ~150 KB, ~50,000 tokens
- Same article via Jina Reader: ~20 KB, ~7,000 tokens

**86% fewer tokens**, with cleaner structure and no ads/nav cruft.
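
The percentage follows directly from the two token counts quoted above. A short check, with `reduction_pct` as an illustrative helper name:

```python
def reduction_pct(before: float, after: float) -> int:
    """Percentage saved when going from `before` to `after`, rounded to an integer."""
    return round((1 - after / before) * 100)


tokens_saved = reduction_pct(50_000, 7_000)  # token counts from the example
size_saved = reduction_pct(150, 20)          # KB figures from the example
print(tokens_saved, size_saved)
```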

---

## Advanced Endpoints (optional)
If you need more than basic content extraction, Jina also offers:

- **Search**: `https://s.jina.ai/<query>` returns the top 5 results as Markdown
- **Embeddings**: `https://api.jina.ai/v1/embeddings` (requires free API key)
- **Reranker**: `https://api.jina.ai/v1/rerank` (requires free API key)

For DesireCore, prefer the built-in `WebSearch` tool over `s.jina.ai` for consistency.

---

## Rate Limits
- **Free tier**: ~20 requests/minute, no authentication needed
- **With free API key**: higher limits, less throttling
  ```bash
  curl -sL "https://r.jina.ai/https://example.com" \
    -H "Authorization: Bearer YOUR_KEY"
  ```
- Get a free key at [jina.ai](https://jina.ai). If a key is available, it is stored in the env var `JINA_API_KEY`.
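
To stay under the free-tier figure quoted above, a client can space its calls out and attach the key only when it is set. The sketch below is an assumption-laden illustration: `auth_headers` and `Throttle` are hypothetical names, and the 20-per-minute default simply mirrors the approximate limit in the list.

```python
import os
import time


def auth_headers(env=os.environ) -> dict:
    """Return the Authorization header if JINA_API_KEY is set, else no headers."""
    key = env.get("JINA_API_KEY")
    return {"Authorization": f"Bearer {key}"} if key else {}


class Throttle:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute=20, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self.clock, self.sleep = clock, sleep
        self._next_ok = None  # earliest time the next call may start

    def wait(self) -> None:
        """Block until the next call is allowed, then reserve the slot."""
        now = self.clock()
        if self._next_ok is not None and now < self._next_ok:
            self.sleep(self._next_ok - now)
            now = self._next_ok
        self._next_ok = now + self.interval


# Usage: call throttle.wait() before each request to r.jina.ai and pass
# auth_headers() as the request headers.
```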

---

## Usage tips
### Cache your own results
Each call to `r.jina.ai` is a fresh request on your side, so if you need the same URL repeatedly in a session, save the Markdown to a temp file:

```bash
curl -sL "https://r.jina.ai/$URL" > /tmp/jina-cache.md
```
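
A single fixed file name is overwritten as soon as a second URL is fetched. One way around that, sketched here under assumptions, is to derive the file name from a hash of the URL. The names `cache_path` and `cached_fetch`, the `jina-cache` directory, and the `fetch` callback are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_path(url: str, cache_dir=None) -> Path:
    """Map a URL to a stable file name so different URLs never collide."""
    root = Path(cache_dir) if cache_dir else Path(tempfile.gettempdir()) / "jina-cache"
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return root / f"{digest}.md"


def cached_fetch(url: str, fetch, cache_dir=None) -> str:
    """Return cached Markdown for `url`, calling `fetch(url)` only on a miss."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(url)  # `fetch` is any callable that returns Markdown for a URL
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text
```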
### Handle very long articles
Jina returns the full article in one response. For articles longer than 50K characters, pipe the output through `head` or extract specific sections with Python/awk before feeding it back into the model context.
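
A Python version of both options might look like the sketch below. The function names are illustrative, the 50,000-character default mirrors the threshold above, and the heading detection is deliberately naive: it would misread a `# comment` line inside a fenced code block as a heading.

```python
def truncate_markdown(md: str, max_chars: int = 50_000) -> str:
    """Cut at the last paragraph break before `max_chars`, if there is one."""
    if len(md) <= max_chars:
        return md
    cut = md.rfind("\n\n", 0, max_chars)
    if cut <= 0:
        cut = max_chars  # no paragraph break found: hard cut
    return md[:cut]


def extract_section(md: str, heading: str) -> str:
    """Return the first section titled `heading`, up to the next heading
    of the same or a higher level. Returns "" if the heading is absent."""
    out, level = [], None
    for line in md.splitlines():
        stripped = line.lstrip("#")
        depth = len(line) - len(stripped)
        is_heading = depth > 0 and stripped.startswith(" ")
        if level is None:
            if is_heading and stripped.strip() == heading:
                level = depth
                out.append(line)
        elif is_heading and depth <= level:
            break
        else:
            out.append(line)
    return "\n".join(out)
```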
### Combine with CDP
After fetching a page with L3 CDP, you may want clean Markdown instead of parsing the HTML with BeautifulSoup. Note that `r.jina.ai/<url>` fetches the URL itself and never sees the HTML you captured, so requesting the page again through Jina only helps when the content is also visible without login:

```python
import subprocess

html = fetch_with_cdp(url)  # helper from references/cdp-browser.md

# Jina fetches the URL itself; it never receives the HTML captured above.
# This call therefore only works if the content is visible without login.
md = subprocess.run(
    ["curl", "-sL", f"https://r.jina.ai/{url}"],
    capture_output=True, text=True,
).stdout
```

For truly login-gated content, parse the HTML from CDP directly (for example with BeautifulSoup), since Jina cannot log in on your behalf.
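
As a dependency-free illustration of parsing the captured HTML directly, the sketch below swaps BeautifulSoup for the standard library's `html.parser`. It is a rough text extractor, not a Markdown converter, and the names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text fragments of `html`, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```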

---

## Failure Mode
If Jina Reader returns garbage or an error:

1. **Hard login wall** → escalate to L3 CDP browser
2. **Geographically restricted** → tell the user, suggest VPN or manual access
3. **Cloudflare challenge** → try L3 CDP (user's browser passes challenges naturally)
4. **404 / gone** → confirm the URL is correct

In all cases, tell the user explicitly which URL failed and what you tried.
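
The list above could be turned into a small dispatcher. Everything about detection in this sketch is an assumption: the status codes and text markers are heuristics chosen for illustration, not behaviour documented by Jina, and `next_step` and `failure_report` are hypothetical names.

```python
def next_step(status: int, body: str) -> str:
    """Map a failed Jina Reader response to one of the actions listed above."""
    text = body.lower()
    if status in (404, 410):
        return "confirm the URL is correct"
    if status == 451:  # assumed signal for a geographic or legal block
        return "tell the user; suggest VPN or manual access"
    if "just a moment" in text or "cloudflare" in text:
        return "try L3 CDP (Cloudflare challenge)"
    if status in (401, 403) or "sign in" in text or "log in" in text:
        return "escalate to L3 CDP (login wall)"
    return "report the failure to the user"


def failure_report(url: str, tried: list) -> str:
    """Build the explicit message: which URL failed and what was attempted."""
    return f"Could not read {url}; tried: {', '.join(tried)}"
```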