## Background

#16 (4f7037a) bulk-replaced `~/.desirecore` with `${DESIRECORE_ROOT}` in 16 SKILL.md files but only bumped `manifest.json`, leaving every per-skill `version` untouched.

Clients sync by the per-skill `version` in each SKILL.md frontmatter, compared as semver. An unchanged version is treated as "no update" and skipped forever, so upgraded users' global skills stay frozen on the pre-replacement content and drift out of sync with what is published.

## Changes

- Patch-bump (+1) each of the **14 in-manifest skills** that #16 touched and that have still not been version-bumped
- `manifest.json` 1.2.2 → 1.2.3 (following #16's convention of bumping the manifest alongside content changes)
- Skip the retired skills `minimax-image-gen` / `minimax-tts` (not in builtin-skills.json, not distributed)
- The diff touches version lines only; no body text is changed
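The sync rule described in the background can be sketched as follows. This is an illustration only: the function name `needs_update` is hypothetical, it handles plain `MAJOR.MINOR.PATCH` versions, and the client's actual comparison logic is not shown in this PR.

```shell
# Hypothetical helper: exit 0 when the remote version is strictly newer
# than the local one, exit 1 otherwise (including when they are equal).
needs_update() {
  local l r i
  IFS=. read -r -a l <<< "$1"   # local version,  e.g. 1.0.0
  IFS=. read -r -a r <<< "$2"   # remote version, e.g. 1.0.1
  for i in 0 1 2; do
    if [ "${r[i]:-0}" -gt "${l[i]:-0}" ]; then return 0; fi
    if [ "${r[i]:-0}" -lt "${l[i]:-0}" ]; then return 1; fi
  done
  return 1  # equal versions: treated as "no update"
}
```

Under such a rule, `needs_update 1.0.0 1.0.0` fails even when the SKILL.md body differs, which is why each affected skill needs a patch bump for the new content to reach upgraded clients.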
| name | description | license | version | type | risk_level | status | disable-model-invocation | provider | tags | requires | metadata | market |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| xiaomi-tts | Use this skill when the user wants to convert text to speech using Xiaomi MiMo's TTS models (mimo-v2.5-tts). Uses OpenAI-compatible chat/completions API with audio response. Supports multiple preset voices and custom voice design. Use when the user mentions 语音合成、文字转语音、TTS、朗读、读出来、生成语音、生成音频、文本转音频、配音、念出来、小米语音、MiMo 语音、小米 TTS. | Complete terms in LICENSE.txt | 1.0.1 | procedural | low | enabled | false | xiaomi | | | | |
xiaomi-tts Skill
Mandatory Rules (violations cause failure)
- Must access agent-service over HTTPS: use `https://127.0.0.1:${PORT}` with `-k` to skip certificate verification
- Must upload to media-store via `/api/media/upload`: `/tmp` is only a transient download/decode location; never use a local path as the final output
- Must use the `dc-media://` protocol to display audio: it is the only form the frontend can render correctly
- Use Bash curl throughout: do not use the HttpRequest tool or Python
- Use the `/chat/completions` endpoint: Xiaomi MiMo TTS speaks OpenAI-compatible chat format
Model Selection
| Model | Characteristics | When to use |
|---|---|---|
| mimo-v2.5-tts | Standard TTS, multiple preset voices | Default, regular speech synthesis |
| mimo-v2.5-tts-voicedesign | Custom voice design | When you need a voice generated from a description |
| mimo-v2.5-tts-voiceclone | Voice cloning | When you need to clone a specific voice (reference audio required) |
Default rule: if the user does not specify a model, use mimo-v2.5-tts.
Voice Selection
Preset Voices
| voice_id | Name | Characteristics |
|---|---|---|
| default_zh | Default Chinese | General-purpose Chinese female voice |
| default_en | Default English | General-purpose English female voice |
| mimo_default | MiMo Default | MiMo's signature voice |
| Bingtang | Bingtang | Sweet female voice |
| Moli | Moli | Soft, gentle female voice |
| Suda | Suda | Young male voice |
| Baihua | Baihua | Mature male voice |
| Mia | Mia | English female voice |
| Chloe | Chloe | English female voice |
| Milo | Milo | English male voice |
| Dean | Dean | English male voice |
Default rule: use Bingtang for Chinese text and Mia for English text; if the user doesn't specify, pick automatically by content language.
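The automatic choice by content language could be sketched like this. The function name `pick_voice` is hypothetical, and the heuristic is deliberately crude: it treats any non-ASCII byte as a sign of Chinese text, so accented Latin text would also be routed to Bingtang.

```shell
# Hypothetical helper: print the default voice for a piece of text.
pick_voice() {
  local text="$1" non_ascii
  # Delete every ASCII byte; anything left belongs to a multi-byte character.
  non_ascii=$(printf '%s' "$text" | LC_ALL=C tr -d '\000-\177')
  if [ -n "$non_ascii" ]; then
    echo "Bingtang"   # default for Chinese text
  else
    echo "Mia"        # default for English text
  fi
}
```

An explicit voice from the user should always take precedence over this fallback.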
Full Execution Flow (strictly three steps)
Prerequisites
- The user has configured a Xiaomi MiMo provider in Resource Manager → Compute and filled in an API Key
- agent-service is running
Step 1: Call the TTS API
Generate speech via media-proxy's /chat/completions endpoint.
Important: messages must use the assistant role (not user); the text to synthesize goes in the assistant message's content.
PORT=$(cat "${DESIRECORE_ROOT}/agent-service.port")
curl -sk -X POST "https://127.0.0.1:${PORT}/api/media-proxy" \
-H "Content-Type: application/json" \
-d '{
"provider": "xiaomi",
"serviceType": "tts",
"endpoint": "/chat/completions",
"body": {
"model": "mimo-v2.5-tts",
"messages": [
{
"role": "assistant",
"content": "Replace this with the text to synthesize"
}
],
"voice": "Bingtang",
"audio": {"format": "mp3"}
},
"responseType": "json"
}'
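The hard-coded JSON above breaks as soon as the text contains double quotes, backslashes, or newlines. One way around that, sketched here under the assumption that `jq` is installed (Step 2 already relies on it), is to let `jq` build the body so the text is escaped correctly. The function name `build_tts_payload` is hypothetical.

```shell
# Hypothetical helper: print the media-proxy request body for one TTS call.
# Arguments: text, voice (default Bingtang), audio format (default mp3).
build_tts_payload() {
  local text="$1" voice="${2:-Bingtang}" fmt="${3:-mp3}"
  jq -n --arg text "$text" --arg voice "$voice" --arg fmt "$fmt" '{
    "provider": "xiaomi",
    "serviceType": "tts",
    "endpoint": "/chat/completions",
    "body": {
      "model": "mimo-v2.5-tts",
      "messages": [{"role": "assistant", "content": $text}],
      "voice": $voice,
      "audio": {"format": $fmt}
    },
    "responseType": "json"
  }'
}
```

The output can be piped straight into curl by replacing the inline `-d '{...}'` with `-d @-`, which makes curl read the request body from standard input.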
Example response:
{
"success": true,
"data": {
"id": "chatcmpl-...",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"audio": {
"data": "base64-encoded audio data...",
"format": "mp3"
}
},
"finish_reason": "stop"
}
]
},
"statusCode": 200
}
Pull the base64-encoded audio data from data.choices[0].message.audio.data.
Step 2: Decode and upload to media-store
The audio comes back as base64; decode it and save to the local media-store.
Recommended approach (write the full response to a file first to avoid overlong shell arguments):
PORT=$(cat "${DESIRECORE_ROOT}/agent-service.port")
# Send the request and save the full response to a file
curl -sk -X POST "https://127.0.0.1:${PORT}/api/media-proxy" \
-H "Content-Type: application/json" \
-d '{
"provider": "xiaomi",
"serviceType": "tts",
"endpoint": "/chat/completions",
"body": {
"model": "mimo-v2.5-tts",
"messages": [{"role": "assistant", "content": "Text to synthesize"}],
"voice": "Bingtang",
"audio": {"format": "mp3"}
},
"responseType": "json"
}' > /tmp/xiaomi-tts-response.json
# Extract and decode the base64 audio data
jq -r '.data.choices[0].message.audio.data' /tmp/xiaomi-tts-response.json | base64 -d > /tmp/xiaomi-tts.mp3
# Upload to media-store
curl -sk -X POST "https://127.0.0.1:${PORT}/api/media/upload" \
-F "file=@/tmp/xiaomi-tts.mp3;type=audio/mpeg"
Pick the mediaId field from the JSON response (format xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.mp3).
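The extract-and-decode line in Step 2 assumes the call succeeded. If the response carries `success: false`, `jq -r` prints `null` and the decoded file is garbage. A guarded variant might look like the sketch below; the function name `decode_tts_audio` is hypothetical, and the field names are taken from the example response in Step 1.

```shell
# Hypothetical helper: decode the audio from a saved media-proxy response.
# Arguments: response JSON file, output audio file.
decode_tts_audio() {
  local response_file="$1" out_file="$2"
  # jq -e exits non-zero when the expression evaluates to false or null.
  if ! jq -e '.success == true and (.data.choices[0].message.audio.data | type == "string")' \
      "$response_file" >/dev/null; then
    echo "TTS call failed: $(jq -r '.error // "no audio in response"' "$response_file")" >&2
    return 1
  fi
  jq -r '.data.choices[0].message.audio.data' "$response_file" | base64 -d > "$out_file"
}
```

Calling `decode_tts_audio /tmp/xiaomi-tts-response.json /tmp/xiaomi-tts.mp3` before the upload avoids sending an empty or invalid file to media-store.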
Step 3: Render the audio via the dc-media protocol
In your reply text, write Markdown syntax directly, referencing the mediaId from Step 2 through the `dc-media://` protocol.
The frontend detects the .mp3 extension and renders an audio player.
Parameter Mapping
Request body parameters (inside body)
| Parameter | Description | Default |
|---|---|---|
| `model` | Model name | `"mimo-v2.5-tts"` |
| `messages[0].role` | Must be `"assistant"` | `"assistant"` (fixed) |
| `messages[0].content` | Text to synthesize | required |
| `voice` | Voice ID | `"Bingtang"` (Chinese) / `"Mia"` (English) |
| `audio.format` | Audio format | `"mp3"` (also accepts `"wav"`) |
User intent mapping
| User intent | Parameter |
|---|---|
| Sweet / cute | voice: "Bingtang" |
| Gentle / refined | voice: "Moli" |
| Young male | voice: "Suda" |
| Mature male | voice: "Baihua" |
| English female | voice: "Mia" or "Chloe" |
| English male | voice: "Milo" or "Dean" |
| High fidelity / lossless | audio.format: "wav" |
Error Handling
- `success: false` + `error: "No matching provider"`: Xiaomi MiMo provider not configured or disabled
- `success: false` + `error: "API Key not configured"`: API Key missing
- `statusCode: 401`: API Key invalid or expired
- `statusCode: 429`: rate limited, retry later
- `statusCode: 400`: bad parameters (e.g. unknown voice, empty text)
- `statusCode: 403`: model not activated or insufficient permission
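The status codes above can be turned into user-facing messages with a small lookup. This is a sketch: the function names `explain_tts_status` and `tts_should_retry` are hypothetical, and treating only 429 as retryable is an assumption based on the "retry later" wording.

```shell
# Hypothetical helper: print a short explanation for a statusCode value.
explain_tts_status() {
  case "$1" in
    400) echo "bad parameters (e.g. unknown voice, empty text)" ;;
    401) echo "API Key invalid or expired" ;;
    403) echo "model not activated or insufficient permission" ;;
    429) echo "rate limited, retry later" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Hypothetical helper: exit 0 only for errors worth retrying after a pause.
tts_should_retry() {
  [ "$1" = "429" ]
}
```

The status code itself can be read from the saved response with `jq -r '.statusCode'`.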
Notes
- Calls are synchronous, typically 3–15 seconds depending on text length
- Audio is returned as base64, so URL expiry is not a concern, but watch shell argument length on long responses
- For long text, split into segments (no more than ~500 chars each), then upload and render each segment
- When the user doesn't specify, default to `mimo-v2.5-tts` + auto-selected voice by language + `mp3`
- Token Plan keys (prefix `tp-`) use the `https://token-plan-cn.xiaomimimo.com/v1` endpoint
- Pay-as-you-go keys use the `https://api.xiaomimimo.com/v1` endpoint
- media-proxy picks the correct endpoint based on configuration; the skill does not need to differentiate
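The long-text note above (segments of roughly 500 characters) could be implemented along these lines. The function name `split_text` and the `SEGMENTS` array are hypothetical, and the sketch requires bash. It cuts at fixed offsets rather than at sentence boundaries, and it counts characters rather than bytes only when the shell runs under a UTF-8 locale.

```shell
# Hypothetical helper: split text into chunks of at most $2 characters
# (default 500) and store them in the global array SEGMENTS.
split_text() {
  local text="$1" max="${2:-500}" offset=0
  SEGMENTS=()
  while [ "$offset" -lt "${#text}" ]; do
    SEGMENTS+=("${text:offset:max}")
    offset=$((offset + max))
  done
}
```

Each element of `SEGMENTS` can then go through Steps 1 to 3 in order, so the reply contains one audio player per segment.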