mirror of
https://git.openapi.site/https://github.com/desirecore/market.git
synced 2026-06-06 08:30:42 +08:00
feat: skills i18n 改造(schemaVersion 1.1,零向后兼容) (#1)
* feat: skills i18n 改造 — schemaVersion 1.1,零向后兼容
把 21 个 skills + 1 个 agent + manifest/categories 全量迁移到 schemaVersion 1.1
的 i18n 结构,配套 CI AI 翻译流水线(GitHub Models)与本地工具链。
## 关键变更
### 数据结构(破坏性,schemaVersion 1.0 → 1.1)
- SKILL.md: 顶层 name 改为 ASCII slug(== 目录名,符合 agentskills.io 规范);
中文显示名/short_desc/description 全部迁入 metadata.i18n.<locale>
- agents/<id>/agent.json: shortDesc/fullDesc/tags/persona.{role,traits} 迁入
i18n.<locale>;changelog[].changes 改为 { <locale>: string[] } 对象
- categories.json: 每个分类的 label/description 迁入 i18n.<locale>,顶层只剩
color/icon
- manifest.json: 加 supportedLocales / defaultLocale;顶层 description 迁入
i18n.<locale>
### Body 文件结构
- 根 SKILL.md = frontmatter + default_locale (en-US) body
- SKILL.<locale>.md = 各 locale 的 markdown body(首行 <!-- locale: xx --> 自校验)
### 工具链(scripts/i18n/)
- glossary.json: zh→en 术语表 + do_not_translate 白名单
- schema/skill-frontmatter.schema.json: i18n frontmatter JSON Schema
- validate-i18n.py: 8 条校验规则(name 合规 / locale 完整性 / hash 一致性等)
- translate.py: GitHub Models / Anthropic 双 backend,sha256 增量翻译
- migrate.py: 一次性迁移脚本(旧格式 → i18n 结构)
### CI(.github/workflows/)
- i18n-validate.yml: PR 触发跑 validate + translate --check
- i18n-translate.yml: PR 触发用 GitHub Models(默认 openai/gpt-5-mini)翻译缺失
locale,自动追加 commit;可切到 ANTHROPIC_API_KEY 走 Claude
### 文档
- docs/I18N.md: 作者贡献指南(schema 说明 / 提交流程 / 常见问题)
- README.md: 加多语言段落
## 验证
- uv run scripts/i18n/validate-i18n.py: OK,49 文件 0 错误
- uv run scripts/i18n/translate.py --check: 0 stale locale
- 21 skills 标题数 zh-CN == en-US 严格对齐(最大 66=66)
- skills-ref 规范校验:全部通过(顶层 name ASCII slug + description 单字段)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(i18n): 修复 PR #1 review 反馈的 6 项问题
- schema: translated_by 正则放宽为 ^(human|ai:[A-Za-z0-9._:/-]+)$,接受
'ai:github:openai/gpt-5-mini' 这类 backend:model 形式(CI 翻译输出格式)
- README + docs/I18N.md: 修正"CI 用 Claude API"误导描述,正确说明默认是
GitHub Models(openai/gpt-5-mini)+ GITHUB_TOKEN,可选切到 Anthropic
- skills/minimax-tts/SKILL.md & SKILL.zh-CN.md: 删除多余的 ``` 闭合,避免
Markdown 后续渲染错乱
- skills/docx/SKILL.md: 翻译时丢失的 • Unicode escape 示例已恢复,
与 zh-CN 版本对齐
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
370
skills/pdf/SKILL.zh-CN.md
Normal file
370
skills/pdf/SKILL.zh-CN.md
Normal file
@@ -0,0 +1,370 @@
|
||||
<!-- locale: zh-CN -->
|
||||
|
||||
# pdf 技能
|
||||
|
||||
## L0:一句话摘要
|
||||
|
||||
读取、创建、合并、拆分和填写 PDF 文档,支持 OCR 识别和命令行工具。
|
||||
|
||||
## L1:概述与使用场景
|
||||
|
||||
### 能力描述
|
||||
|
||||
pdf 是一个**流程型技能(Procedural Skill)**,提供 PDF 文档的完整处理能力。基于 Python 库(pypdf、pdfplumber、reportlab)和命令行工具(qpdf、pdftotext、pdftk),支持文本提取、表格提取、合并拆分、旋转、水印、加密、表单填写和 OCR 识别。
|
||||
|
||||
### 使用场景
|
||||
|
||||
- 用户需要从 PDF 中提取文本或表格数据
|
||||
- 用户需要合并多个 PDF 或拆分页面
|
||||
- 用户需要创建新的 PDF 文档
|
||||
- 用户需要填写 PDF 表单、添加水印或加密
|
||||
|
||||
## L2:详细规范
|
||||
|
||||
## Prerequisites
|
||||
|
||||
### Python 3(必需)
|
||||
|
||||
在执行任何 Python 操作之前,先检测 Python 是否可用:
|
||||
|
||||
```bash
|
||||
python3 --version 2>/dev/null || python --version 2>/dev/null
|
||||
```
|
||||
|
||||
如果命令失败(Python 不可用),**必须停止并告知用户安装 Python 3**:
|
||||
|
||||
- **macOS**: `brew install python3` 或从 https://www.python.org/downloads/ 下载
|
||||
- **Windows**: `winget install Python.Python.3` 或从 python.org 下载(安装时勾选 "Add Python to PATH")
|
||||
- **Linux (Debian/Ubuntu)**: `sudo apt install python3 python3-pip`
|
||||
- **Linux (Fedora/RHEL)**: `sudo dnf install python3 python3-pip`
|
||||
|
||||
如需更详细的环境配置帮助:Python 相关问题加载 `python-runtime` 技能;
|
||||
其他(系统工具如 poppler / tesseract、容器 / WSL)加载 `dev-environment-setup` 技能。
|
||||
|
||||
### Python 包依赖
|
||||
|
||||
本技能依赖以下 Python 包(按需检测):
|
||||
|
||||
- `pypdf` — PDF 基础操作(读取、合并、拆分、旋转)
|
||||
- `pdfplumber` — 表格提取、带布局的文本提取
|
||||
- `Pillow` — 图片处理(水印、验证图等)
|
||||
- `reportlab` — PDF 创建(可选,按需安装)
|
||||
- `pdf2image` — PDF 转图片(可选,需要 poppler)
|
||||
|
||||
核心包检测:
|
||||
```bash
|
||||
python3 -c "import pypdf; import pdfplumber; import PIL" 2>/dev/null || echo "MISSING"
|
||||
```
|
||||
|
||||
缺失时告知用户安装:`pip install pypdf pdfplumber Pillow`
|
||||
|
||||
## Output Rule
|
||||
|
||||
When you create or modify a .pdf file, you **MUST** tell the user the absolute path of the output file in your response. Example: "文件已保存到:`/path/to/output.pdf`"
|
||||
|
||||
## Overview
|
||||
|
||||
This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```python
|
||||
from pypdf import PdfReader, PdfWriter
|
||||
|
||||
# Read a PDF
|
||||
reader = PdfReader("document.pdf")
|
||||
print(f"Pages: {len(reader.pages)}")
|
||||
|
||||
# Extract text
|
||||
text = ""
|
||||
for page in reader.pages:
|
||||
text += page.extract_text()
|
||||
```
|
||||
|
||||
## Python Libraries
|
||||
|
||||
### pypdf - Basic Operations
|
||||
|
||||
#### Merge PDFs
|
||||
```python
|
||||
from pypdf import PdfWriter, PdfReader
|
||||
|
||||
writer = PdfWriter()
|
||||
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
|
||||
reader = PdfReader(pdf_file)
|
||||
for page in reader.pages:
|
||||
writer.add_page(page)
|
||||
|
||||
with open("merged.pdf", "wb") as output:
|
||||
writer.write(output)
|
||||
```
|
||||
|
||||
#### Split PDF
|
||||
```python
|
||||
reader = PdfReader("input.pdf")
|
||||
for i, page in enumerate(reader.pages):
|
||||
writer = PdfWriter()
|
||||
writer.add_page(page)
|
||||
with open(f"page_{i+1}.pdf", "wb") as output:
|
||||
writer.write(output)
|
||||
```
|
||||
|
||||
#### Extract Metadata
|
||||
```python
|
||||
reader = PdfReader("document.pdf")
|
||||
meta = reader.metadata
|
||||
print(f"Title: {meta.title}")
|
||||
print(f"Author: {meta.author}")
|
||||
print(f"Subject: {meta.subject}")
|
||||
print(f"Creator: {meta.creator}")
|
||||
```
|
||||
|
||||
#### Rotate Pages
|
||||
```python
|
||||
reader = PdfReader("input.pdf")
|
||||
writer = PdfWriter()
|
||||
|
||||
page = reader.pages[0]
|
||||
page.rotate(90) # Rotate 90 degrees clockwise
|
||||
writer.add_page(page)
|
||||
|
||||
with open("rotated.pdf", "wb") as output:
|
||||
writer.write(output)
|
||||
```
|
||||
|
||||
### pdfplumber - Text and Table Extraction
|
||||
|
||||
#### Extract Text with Layout
|
||||
```python
|
||||
import pdfplumber
|
||||
|
||||
with pdfplumber.open("document.pdf") as pdf:
|
||||
for page in pdf.pages:
|
||||
text = page.extract_text()
|
||||
print(text)
|
||||
```
|
||||
|
||||
#### Extract Tables
|
||||
```python
|
||||
with pdfplumber.open("document.pdf") as pdf:
|
||||
for i, page in enumerate(pdf.pages):
|
||||
tables = page.extract_tables()
|
||||
for j, table in enumerate(tables):
|
||||
print(f"Table {j+1} on page {i+1}:")
|
||||
for row in table:
|
||||
print(row)
|
||||
```
|
||||
|
||||
#### Advanced Table Extraction
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
with pdfplumber.open("document.pdf") as pdf:
|
||||
all_tables = []
|
||||
for page in pdf.pages:
|
||||
tables = page.extract_tables()
|
||||
for table in tables:
|
||||
if table: # Check if table is not empty
|
||||
df = pd.DataFrame(table[1:], columns=table[0])
|
||||
all_tables.append(df)
|
||||
|
||||
# Combine all tables
|
||||
if all_tables:
|
||||
combined_df = pd.concat(all_tables, ignore_index=True)
|
||||
combined_df.to_excel("extracted_tables.xlsx", index=False)
|
||||
```
|
||||
|
||||
### reportlab - Create PDFs
|
||||
|
||||
#### Basic PDF Creation
|
||||
```python
|
||||
from reportlab.lib.pagesizes import letter
|
||||
from reportlab.pdfgen import canvas
|
||||
|
||||
c = canvas.Canvas("hello.pdf", pagesize=letter)
|
||||
width, height = letter
|
||||
|
||||
# Add text
|
||||
c.drawString(100, height - 100, "Hello World!")
|
||||
c.drawString(100, height - 120, "This is a PDF created with reportlab")
|
||||
|
||||
# Add a line
|
||||
c.line(100, height - 140, 400, height - 140)
|
||||
|
||||
# Save
|
||||
c.save()
|
||||
```
|
||||
|
||||
#### Create PDF with Multiple Pages
|
||||
```python
|
||||
from reportlab.lib.pagesizes import letter
|
||||
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
|
||||
from reportlab.lib.styles import getSampleStyleSheet
|
||||
|
||||
doc = SimpleDocTemplate("report.pdf", pagesize=letter)
|
||||
styles = getSampleStyleSheet()
|
||||
story = []
|
||||
|
||||
# Add content
|
||||
title = Paragraph("Report Title", styles['Title'])
|
||||
story.append(title)
|
||||
story.append(Spacer(1, 12))
|
||||
|
||||
body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
|
||||
story.append(body)
|
||||
story.append(PageBreak())
|
||||
|
||||
# Page 2
|
||||
story.append(Paragraph("Page 2", styles['Heading1']))
|
||||
story.append(Paragraph("Content for page 2", styles['Normal']))
|
||||
|
||||
# Build PDF
|
||||
doc.build(story)
|
||||
```
|
||||
|
||||
#### Subscripts and Superscripts
|
||||
|
||||
**IMPORTANT**: Never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, causing them to render as solid black boxes.
|
||||
|
||||
Instead, use ReportLab's XML markup tags in Paragraph objects:
|
||||
```python
|
||||
from reportlab.platypus import Paragraph
|
||||
from reportlab.lib.styles import getSampleStyleSheet
|
||||
|
||||
styles = getSampleStyleSheet()
|
||||
|
||||
# Subscripts: use <sub> tag
|
||||
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])
|
||||
|
||||
# Superscripts: use <super> tag
|
||||
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
|
||||
```
|
||||
|
||||
For canvas-drawn text (not Paragraph objects), manually adjust font the size and position rather than using Unicode subscripts/superscripts.
|
||||
|
||||
## Command-Line Tools
|
||||
|
||||
### pdftotext (poppler-utils)
|
||||
```bash
|
||||
# Extract text
|
||||
pdftotext input.pdf output.txt
|
||||
|
||||
# Extract text preserving layout
|
||||
pdftotext -layout input.pdf output.txt
|
||||
|
||||
# Extract specific pages
|
||||
pdftotext -f 1 -l 5 input.pdf output.txt # Pages 1-5
|
||||
```
|
||||
|
||||
### qpdf
|
||||
```bash
|
||||
# Merge PDFs
|
||||
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
|
||||
|
||||
# Split pages
|
||||
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
|
||||
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
|
||||
|
||||
# Rotate pages
|
||||
qpdf input.pdf output.pdf --rotate=+90:1 # Rotate page 1 by 90 degrees
|
||||
|
||||
# Remove password
|
||||
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
|
||||
```
|
||||
|
||||
### pdftk (if available)
|
||||
```bash
|
||||
# Merge
|
||||
pdftk file1.pdf file2.pdf cat output merged.pdf
|
||||
|
||||
# Split
|
||||
pdftk input.pdf burst
|
||||
|
||||
# Rotate
|
||||
pdftk input.pdf rotate 1east output rotated.pdf
|
||||
```
|
||||
|
||||
## Common Tasks
|
||||
|
||||
### Extract Text from Scanned PDFs
|
||||
```python
|
||||
# Requires: pip install pytesseract pdf2image
|
||||
import pytesseract
|
||||
from pdf2image import convert_from_path
|
||||
|
||||
# Convert PDF to images
|
||||
images = convert_from_path('scanned.pdf')
|
||||
|
||||
# OCR each page
|
||||
text = ""
|
||||
for i, image in enumerate(images):
|
||||
text += f"Page {i+1}:\n"
|
||||
text += pytesseract.image_to_string(image)
|
||||
text += "\n\n"
|
||||
|
||||
print(text)
|
||||
```
|
||||
|
||||
### Add Watermark
|
||||
```python
|
||||
from pypdf import PdfReader, PdfWriter
|
||||
|
||||
# Create watermark (or load existing)
|
||||
watermark = PdfReader("watermark.pdf").pages[0]
|
||||
|
||||
# Apply to all pages
|
||||
reader = PdfReader("document.pdf")
|
||||
writer = PdfWriter()
|
||||
|
||||
for page in reader.pages:
|
||||
page.merge_page(watermark)
|
||||
writer.add_page(page)
|
||||
|
||||
with open("watermarked.pdf", "wb") as output:
|
||||
writer.write(output)
|
||||
```
|
||||
|
||||
### Extract Images
|
||||
```bash
|
||||
# Using pdfimages (poppler-utils)
|
||||
pdfimages -j input.pdf output_prefix
|
||||
|
||||
# This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, etc.
|
||||
```
|
||||
|
||||
### Password Protection
|
||||
```python
|
||||
from pypdf import PdfReader, PdfWriter
|
||||
|
||||
reader = PdfReader("input.pdf")
|
||||
writer = PdfWriter()
|
||||
|
||||
for page in reader.pages:
|
||||
writer.add_page(page)
|
||||
|
||||
# Add password
|
||||
writer.encrypt("userpassword", "ownerpassword")
|
||||
|
||||
with open("encrypted.pdf", "wb") as output:
|
||||
writer.write(output)
|
||||
```
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Task | Best Tool | Command/Code |
|
||||
|------|-----------|--------------|
|
||||
| Merge PDFs | pypdf | `writer.add_page(page)` |
|
||||
| Split PDFs | pypdf | One page per file |
|
||||
| Extract text | pdfplumber | `page.extract_text()` |
|
||||
| Extract tables | pdfplumber | `page.extract_tables()` |
|
||||
| Create PDFs | reportlab | Canvas or Platypus |
|
||||
| Command line merge | qpdf | `qpdf --empty --pages ...` |
|
||||
| OCR scanned PDFs | pytesseract | Convert to image first |
|
||||
| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md |
|
||||
|
||||
## Next Steps
|
||||
|
||||
- For advanced pypdfium2 usage, see REFERENCE.md
|
||||
- For JavaScript libraries (pdf-lib), see REFERENCE.md
|
||||
- If you need to fill out a PDF form, follow the instructions in FORMS.md
|
||||
- For troubleshooting guides, see REFERENCE.md
|
||||
Reference in New Issue
Block a user