market/skills/xiaomi-tts/SKILL.md
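xyx 0cb3758669 fix: complete i18n CI validation for dashscope-image-gen and xiaomi-tts (#4)
## Changes

Fix i18n CI validation for dashscope-image-gen and xiaomi-tts, complete their English translations, and fix the source_hash drift in other stale skills along the way.

### dashscope-image-gen / xiaomi-tts (main line of this PR)
- `name` field changed from Chinese to the directory name (CI rule-1 requires lowercase ASCII + hyphens).
- Completed the `metadata.i18n` block: `locales`, `zh-CN` (with body pointing to SKILL.zh-CN.md), and `en-US` (with description / body=./SKILL.md).
- Added `SKILL.zh-CN.md` (the zh-CN body file).
- **Root SKILL.md rewritten as the English body** (matching the content of SKILL.zh-CN.md), translated by hand in this PR; `default_locale=en-US` and `source_locale=zh-CN`, consistent with the docs/I18N.md convention: root SKILL.md = default_locale body (en-US), SKILL.zh-CN.md = source_locale body (zh-CN).
- Both locales locked with `translated_by: human` + the correct `source_hash`.
- Content quality fixes: the flow heading "严格按此两步执行" ("strictly follow these two steps") changed to "严格按此三步执行" ("strictly follow these three steps"); the wording of Mandatory Rule 2 made precise (/tmp is only a transit location); `response_format` in the xiaomi-tts user intent mapping table changed to `audio.format` to match the request body parameter table; zh-CN.description changed to pure Chinese.
- Locale header fixed from the shell-escape residue `<\!--` to the standard `<!-- locale: zh-CN -->`.

### Collateral: 6 skills already stale on main (to avoid translate workflow failures)
- `manage-skills` / `minimax-music-gen` / `minimax-video-gen` / `skill-creator` / `web-access`: `en-US.source_hash` recomputed to the actual hash of the current zh-CN source; `translated_by` changed from `ai:claude-opus-4-7` to `human` to lock the existing translations against being overwritten by automatic retranslation.
- `markdown`: corrected `en-US.source_hash` (previously the placeholder `sha256:0000000000000000`).
- The `en-US` translation content of these skills is unchanged; only the metadata is corrected.

### scripts/i18n/translate.py robustness improvements
- No more retries on 413 Payload Too Large (the payload will not shrink, so retrying wastes time).
- The main loop catches RuntimeError, records a single skill's failure in `plan["errors"]`, and continues with the next skill, so one large file no longer fails the whole workflow.
- In `--check` mode, plans that contain errors also exit 1 (previously only needs_translation was checked, and a broad except swallowed exceptions, producing false passes).

## Test plan

- [x] `i18n-validate` passes
- [x] `i18n-translate --check` reports every skill as `up-to-date` or `human-locked, skipping`
- [x] `validate` / `translate` / `wait-for-copilot-review` are all green on CI
- [ ] All Copilot review conversations resolved
- [ ] Squash merge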

---------

Co-authored-by: yi-ge <a@wyr.me>
2026-05-13 12:57:25 +08:00


name: xiaomi-tts
description: >-
  Use this skill when the user wants to convert text to speech using Xiaomi MiMo's TTS models (mimo-v2.5-tts).
  Uses OpenAI-compatible chat/completions API with audio response. Supports multiple preset voices and custom voice design.
  Use when the user mentions 语音合成、文字转语音、TTS、朗读、读出来、生成语音、生成音频、文本转音频、配音、念出来、小米语音、MiMo 语音、小米 TTS。
license: Complete terms in LICENSE.txt
version: 1.0.0
type: procedural
risk_level: low
status: enabled
disable-model-invocation: false
provider: xiaomi
tags:
  - media
  - audio
  - tts
  - speech
  - xiaomi
  - mimo
requires:
  tools:
    - Bash
metadata:
  author: desirecore
  updated_at: 2026-05-08
  i18n:
    default_locale: en-US
    source_locale: zh-CN
    locales:
      - zh-CN
      - en-US
    zh-CN:
      name: 小米 MiMo 语音合成
      short_desc: 基于小米 MiMo 的文本转语音技能
      description: >-
        当用户希望使用小米 MiMo 的 TTS 模型（mimo-v2.5-tts）将文本转为语音时使用此技能。基于 OpenAI 兼容的 chat/completions API，响应中携带音频。支持多种预置音色和自定义音色设计。用户提到 语音合成、文字转语音、TTS、朗读、读出来、生成语音、生成音频、文本转音频、配音、念出来、小米语音、MiMo 语音、小米 TTS。
      body: ./SKILL.zh-CN.md
      source_hash: sha256:2dd06b13152349e5
      translated_by: human
    en-US:
      name: Xiaomi MiMo TTS
      short_desc: Text-to-speech synthesis using Xiaomi MiMo models
      description: >-
        Use this skill when the user wants to convert text to speech using Xiaomi MiMo's TTS models (mimo-v2.5-tts).
        Built on the OpenAI-compatible chat/completions API with audio response, supporting multiple preset voices and custom voice design.
        Trigger keywords: text-to-speech, TTS, read aloud, narrate, generate audio, voice synthesis, MiMo voice, Xiaomi TTS.
      body: ./SKILL.md
      source_hash: sha256:2dd06b13152349e5
      translated_by: human
market:
  icon: '<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none"><rect x="3" y="3" width="18" height="18" rx="3" stroke="#FF6900" stroke-width="1.5" fill="#FF6900" fill-opacity="0.1"/><path d="M8 9v6M11 7v10M14 10v4M17 8v8" stroke="#FF6900" stroke-width="2" stroke-linecap="round"/></svg>'
  short_desc: 基于小米 MiMo 的文本转语音技能
  category: media
  maintainer:
    name: DesireCore Official
    verified: true
  channel: latest

xiaomi-tts Skill

Mandatory Rules (violations cause failure)

  1. Must access agent-service over HTTPS — use https://127.0.0.1:${PORT} with -k to skip certificate verification
  2. Must upload to media-store via /api/media/upload; /tmp is only a transient download/decode location, so never use a local path as the final output
  3. Must use the dc-media:// protocol to display audio — the only form the frontend can render correctly
  4. Use Bash curl throughout — do not use the HttpRequest tool or Python
  5. Use the /chat/completions endpoint — Xiaomi MiMo TTS speaks OpenAI-compatible chat format

Model Selection

| Model | Characteristics | When to use |
| --- | --- | --- |
| mimo-v2.5-tts | Standard TTS, multiple preset voices | Default, regular speech synthesis |
| mimo-v2.5-tts-voicedesign | Custom voice design | When you need a voice generated from a description |
| mimo-v2.5-tts-voiceclone | Voice cloning | When you need to clone a specific voice (reference audio required) |

Default rule: if the user does not specify a model, use mimo-v2.5-tts.

Voice Selection

Preset Voices

| voice_id | Name | Characteristics |
| --- | --- | --- |
| default_zh | Default Chinese | General-purpose Chinese female voice |
| default_en | Default English | General-purpose English female voice |
| mimo_default | MiMo Default | MiMo's signature voice |
| Bingtang | Bingtang | Sweet female voice |
| Moli | Moli | Soft, gentle female voice |
| Suda | Suda | Young male voice |
| Baihua | Baihua | Mature male voice |
| Mia | Mia | English female voice |
| Chloe | Chloe | English female voice |
| Milo | Milo | English male voice |
| Dean | Dean | English male voice |

Default rule: use Bingtang for Chinese text and Mia for English text; if the user doesn't specify, pick automatically by content language.
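
The automatic choice by content language could be sketched as a small shell helper. This is only a heuristic and an assumption about how "content language" is detected: the function name pick_voice is hypothetical, and it treats any text containing Han characters (UTF-8 lead bytes 0xE4 to 0xE9) as Chinese.

```shell
# Hypothetical helper: choose a default voice from the text's language.
pick_voice() {
  # Delete every byte except 0xE4-0xE9 (octal 344-351), the UTF-8 lead bytes
  # of Han characters in U+4000..U+9FFF. Anything left means Chinese text.
  han=$(printf '%s' "$1" | LC_ALL=C tr -d '\000-\343\352-\377')
  if [ -n "$han" ]; then
    echo "Bingtang"   # default voice for Chinese text
  else
    echo "Mia"        # default voice for English text
  fi
}
```

For example, `pick_voice "你好，世界"` prints Bingtang, and `pick_voice "Hello world"` prints Mia. A voice the user names explicitly should always take precedence over this guess.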

Full Execution Flow (strictly three steps)

Prerequisites

  • The user has configured a Xiaomi MiMo provider in Resource Manager → Compute and filled in an API Key
  • agent-service is running

Step 1: Call the TTS API

Generate speech via media-proxy's /chat/completions endpoint.

Important: messages must use the assistant role (not user); the text to synthesize goes in the assistant message's content.

PORT=$(cat ~/.desirecore/agent-service.port)
curl -sk -X POST "https://127.0.0.1:${PORT}/api/media-proxy" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "xiaomi",
    "serviceType": "tts",
    "endpoint": "/chat/completions",
    "body": {
      "model": "mimo-v2.5-tts",
      "messages": [
        {
          "role": "assistant",
          "content": "Replace this with the text to synthesize"
        }
      ],
      "voice": "Bingtang",
      "audio": {"format": "mp3"}
    },
    "responseType": "json"
  }'

Example response:

{
  "success": true,
  "data": {
    "id": "chatcmpl-...",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "audio": {
            "data": "base64-encoded audio data...",
            "format": "mp3"
          }
        },
        "finish_reason": "stop"
      }
    ]
  },
  "statusCode": 200
}

Pull the base64-encoded audio data from data.choices[0].message.audio.data.
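
The request in Step 1 embeds the text directly in a single-quoted JSON literal, which breaks as soon as the text contains quotes or newlines. One possible way around this, assuming jq is available (Step 2 already relies on it), is to let jq build the request body. The function name build_tts_request is hypothetical; the field names are copied from the request above.

```shell
# Hypothetical helper: build the media-proxy request with proper JSON escaping.
build_tts_request() {
  # $1 = text to synthesize, $2 = voice ID, $3 = audio format (default mp3)
  jq -n --arg text "$1" --arg voice "$2" --arg fmt "${3:-mp3}" '{
    provider: "xiaomi",
    serviceType: "tts",
    endpoint: "/chat/completions",
    body: {
      model: "mimo-v2.5-tts",
      messages: [{role: "assistant", content: $text}],
      voice: $voice,
      audio: {format: $fmt}
    },
    responseType: "json"
  }'
}
```

A typical use would be to write the result to a file, for example `build_tts_request 'He said "hi"' Mia > /tmp/xiaomi-tts-request.json`, and pass it to curl with `-d @/tmp/xiaomi-tts-request.json` in place of the inline literal.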

Step 2: Decode and upload to media-store

The audio comes back as base64; decode it and save to the local media-store.

Recommended approach (write the full response to a file first to avoid overlong shell arguments):

PORT=$(cat ~/.desirecore/agent-service.port)
# Send the request and save the full response to a file
curl -sk -X POST "https://127.0.0.1:${PORT}/api/media-proxy" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "xiaomi",
    "serviceType": "tts",
    "endpoint": "/chat/completions",
    "body": {
      "model": "mimo-v2.5-tts",
      "messages": [{"role": "assistant", "content": "Text to synthesize"}],
      "voice": "Bingtang",
      "audio": {"format": "mp3"}
    },
    "responseType": "json"
  }' > /tmp/xiaomi-tts-response.json

# Extract and decode the base64 audio data
jq -r '.data.choices[0].message.audio.data' /tmp/xiaomi-tts-response.json | base64 -d > /tmp/xiaomi-tts.mp3

# Upload to media-store
curl -sk -X POST "https://127.0.0.1:${PORT}/api/media/upload" \
  -F "file=@/tmp/xiaomi-tts.mp3;type=audio/mpeg"

Pick the mediaId field from the JSON response (format xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.mp3).

Step 3: Render the audio via the dc-media protocol

In your reply text, write Markdown syntax directly:

![TTS result](dc-media://replace-with-mediaId)

For example: ![TTS: Hello world](dc-media://a1b2c3d4-e5f6-47a8-b9c0-d1e2f3a4b5c6.mp3)

The frontend detects the .mp3 extension and renders an audio player.
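
Building the final Markdown line from the upload result can be sketched as follows. The function name render_audio_markdown is hypothetical; it only formats a string and does not contact any service.

```shell
# Hypothetical helper: format the dc-media Markdown line for a given mediaId.
render_audio_markdown() {
  # $1 = mediaId returned by /api/media/upload, $2 = alt text (optional)
  printf '![%s](dc-media://%s)\n' "${2:-TTS result}" "$1"
}
```

For example, `render_audio_markdown a1b2c3d4-e5f6-47a8-b9c0-d1e2f3a4b5c6.mp3 "TTS: Hello world"` prints the example line shown above. If mediaId is a top-level field of the upload response (an assumption; check the actual JSON), it could be captured by piping the upload curl call through `jq -r '.mediaId'`.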

Parameter Mapping

Request body parameters (inside body)

| Parameter | Description | Default |
| --- | --- | --- |
| model | Model name | "mimo-v2.5-tts" |
| messages[0].role | Must be "assistant" | "assistant" (fixed) |
| messages[0].content | Text to synthesize | required |
| voice | Voice ID | "Bingtang" (Chinese) / "Mia" (English) |
| audio.format | Audio format | "mp3" (also accepts "wav") |

User intent mapping

| User intent | Parameter |
| --- | --- |
| Sweet / cute | voice: "Bingtang" |
| Gentle / refined | voice: "Moli" |
| Young male | voice: "Suda" |
| Mature male | voice: "Baihua" |
| English female | voice: "Mia" or "Chloe" |
| English male | voice: "Milo" or "Dean" |
| High fidelity / lossless | audio.format: "wav" |

Error Handling

  • success: false + error: "No matching provider": Xiaomi MiMo provider not configured or disabled
  • success: false + error: "API Key not configured": API Key missing
  • statusCode: 401: API Key invalid or expired
  • statusCode: 429: rate limited, retry later
  • statusCode: 400: bad parameters (e.g. unknown voice, empty text)
  • statusCode: 403: model not activated or insufficient permission
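
The status codes above can be turned into a readable message with a small lookup. This is a minimal sketch: the function name explain_status is hypothetical, and the messages simply restate the list.

```shell
# Hypothetical helper: map a media-proxy statusCode to the causes listed above.
explain_status() {
  case "$1" in
    200) echo "ok" ;;
    400) echo "bad parameters (e.g. unknown voice, empty text)" ;;
    401) echo "API Key invalid or expired" ;;
    403) echo "model not activated or insufficient permission" ;;
    429) echo "rate limited, retry later" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}
```

With the response saved as in Step 2, it could be called as `explain_status "$(jq -r '.statusCode' /tmp/xiaomi-tts-response.json)"` before attempting to decode the audio. The `success: false` cases carry their own `error` string and need no lookup.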

Notes

  • Calls are synchronous, typically 3 to 15 seconds depending on text length
  • Audio is returned as base64, so URL expiry is not a concern, but watch shell argument length on long responses
  • For long text, split into segments (no more than ~500 chars each), then upload and render each segment
  • When the user doesn't specify, default to mimo-v2.5-tts + auto-selected voice by language + mp3
  • Token Plan keys (prefix tp-) use the https://token-plan-cn.xiaomimimo.com/v1 endpoint
  • Pay-as-you-go keys use the https://api.xiaomimimo.com/v1 endpoint
  • media-proxy picks the correct endpoint based on configuration; the skill does not need to differentiate
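
The long-text note above could be sketched as a fixed-length splitter. This is a simplification: the function name split_text is hypothetical, it cuts at a character count rather than at sentence boundaries, and the count is only character-accurate when awk is multibyte-aware and runs in a UTF-8 locale (some awk implementations count bytes instead).

```shell
# Hypothetical helper: print the text as segments of at most N characters,
# one segment per output line.
split_text() {
  # $1 = text, $2 = maximum characters per segment (default 500)
  printf '%s\n' "$1" | awk -v n="${2:-500}" '
    {
      line = $0
      while (length(line) > n) {
        print substr(line, 1, n)     # emit one full segment
        line = substr(line, n + 1)   # keep the remainder
      }
      if (length(line) > 0) print line
    }'
}
```

Each output line would then go through Steps 1 to 3 in turn, for example with `split_text "$text" | while IFS= read -r segment; do ...; done`.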