Files
market/skills/xiaomi-tts/SKILL.md
xyx dfd0cb62cf fix(i18n): sync dashscope-image-gen & xiaomi-tts zh-CN body with local (#13)
## Summary

- 用本地 `defaults/global-skills/` 中的最新中文 body 内容更新远程 `SKILL.zh-CN.md`
- 更新 zh-CN `source_hash` 以反映最新内容
- 解锁 en-US `translated_by`,允许 CI 自动重新翻译英文 body

## 变更文件

- `skills/dashscope-image-gen/SKILL.zh-CN.md` — 中文 body 同步至本地最新版本
- `skills/dashscope-image-gen/SKILL.md` — 更新 source_hash + 解锁 en-US
- `skills/xiaomi-tts/SKILL.zh-CN.md` — 中文 body 同步至本地最新版本
- `skills/xiaomi-tts/SKILL.md` — 更新 source_hash + 解锁 en-US
2026-05-13 19:38:31 +08:00

9.1 KiB
Raw Blame History

name, description, license, version, type, risk_level, status, disable-model-invocation, provider, tags, requires, metadata, market
name description license version type risk_level status disable-model-invocation provider tags requires metadata market
xiaomi-tts Use this skill when the user wants to convert text to speech using Xiaomi MiMo's TTS models (mimo-v2.5-tts). Uses OpenAI-compatible chat/completions API with audio response. Supports multiple preset voices and custom voice design. Use when 用户提到 语音合成、文字转语音、TTS、朗读、读出来、生成语音、 生成音频、文本转音频、配音、念出来、小米语音、MiMo 语音、小米 TTS。 Complete terms in LICENSE.txt 1.0.0 procedural low enabled false xiaomi
media
audio
tts
speech
xiaomi
mimo
tools
Bash
author updated_at i18n
desirecore 2026-05-08
default_locale source_locale locales zh-CN en-US
en-US zh-CN
zh-CN
en-US
name short_desc description body source_hash translated_by
小米 MiMo 语音合成 基于小米 MiMo 的文本转语音技能 当用户希望使用小米 MiMo 的 TTS 模型mimo-v2.5-tts将文本转为语音时使用此技能。基于 OpenAI 兼容的 chat/completions API响应中携带音频。支持多种预置音色和自定义音色设计。用户提到 语音合成、文字转语音、TTS、朗读、读出来、生成语音、生成音频、文本转音频、配音、念出来、小米语音、MiMo 语音、小米 TTS。 ./SKILL.zh-CN.md sha256:afa1138c9b2cbd20 human
name short_desc description body source_hash translated_by
Xiaomi MiMo TTS Text-to-speech synthesis using Xiaomi MiMo models Use this skill when the user wants to convert text to speech using Xiaomi MiMo's TTS models (mimo-v2.5-tts). Built on the OpenAI-compatible chat/completions API with audio response, supporting multiple preset voices and custom voice design. Trigger keywords: text-to-speech, TTS, read aloud, narrate, generate audio, voice synthesis, MiMo voice, Xiaomi TTS. ./SKILL.md sha256:afa1138c9b2cbd20 human
icon short_desc category maintainer channel
<svg xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none"><rect x="3" y="3" width="18" height="18" rx="3" stroke="#FF6900" stroke-width="1.5" fill="#FF6900" fill-opacity="0.1"/><path d="M8 9v6M11 7v10M14 10v4M17 8v8" stroke="#FF6900" stroke-width="2" stroke-linecap="round"/></svg> 基于小米 MiMo 的文本转语音技能 media
name verified
DesireCore Official true
latest

xiaomi-tts Skill

Mandatory Rules (violations cause failure)

  1. Must access agent-service over HTTPS — use https://127.0.0.1:${PORT} with -k to skip certificate verification
  2. Must upload to media-store via /api/media/upload/tmp is only a transient download/decode location, never use a local path as the final output
  3. Must use the dc-media:// protocol to display audio — the only form the frontend can render correctly
  4. Use Bash curl throughout — do not use the HttpRequest tool or Python
  5. Use the /chat/completions endpoint — Xiaomi MiMo TTS speaks OpenAI-compatible chat format

Model Selection

Model Characteristics When to use
mimo-v2.5-tts Standard TTS, multiple preset voices Default, regular speech synthesis
mimo-v2.5-tts-voicedesign Custom voice design When you need a voice generated from a description
mimo-v2.5-tts-voiceclone Voice cloning When you need to clone a specific voice (reference audio required)

Default rule: if the user does not specify a model, use mimo-v2.5-tts.

Voice Selection

Preset Voices

voice_id Name Characteristics
default_zh Default Chinese General-purpose Chinese female voice
default_en Default English General-purpose English female voice
mimo_default MiMo Default MiMo's signature voice
Bingtang Bingtang Sweet female voice
Moli Moli Soft, gentle female voice
Suda Suda Young male voice
Baihua Baihua Mature male voice
Mia Mia English female voice
Chloe Chloe English female voice
Milo Milo English male voice
Dean Dean English male voice

Default rule: use Bingtang for Chinese text and Mia for English text; if the user doesn't specify, pick automatically by content language.

Full Execution Flow (strictly three steps)

Prerequisites

  • The user has configured a Xiaomi MiMo provider in Resource Manager → Compute and filled in an API Key
  • agent-service is running

Step 1: Call the TTS API

Generate speech via media-proxy's /chat/completions endpoint.

Important: messages must use the assistant role (not user); the text to synthesize goes in the assistant message's content.

PORT=$(cat ~/.desirecore/agent-service.port)
curl -sk -X POST "https://127.0.0.1:${PORT}/api/media-proxy" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "xiaomi",
    "serviceType": "tts",
    "endpoint": "/chat/completions",
    "body": {
      "model": "mimo-v2.5-tts",
      "messages": [
        {
          "role": "assistant",
          "content": "Replace this with the text to synthesize"
        }
      ],
      "voice": "Bingtang",
      "audio": {"format": "mp3"}
    },
    "responseType": "json"
  }'

Example response:

{
  "success": true,
  "data": {
    "id": "chatcmpl-...",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "audio": {
            "data": "base64-encoded audio data...",
            "format": "mp3"
          }
        },
        "finish_reason": "stop"
      }
    ]
  },
  "statusCode": 200
}

Pull the base64-encoded audio data from data.choices[0].message.audio.data.

Step 2: Decode and upload to media-store

The audio comes back as base64; decode it and save to the local media-store.

Recommended approach (write the full response to a file first to avoid overlong shell arguments):

PORT=$(cat ~/.desirecore/agent-service.port)
# Save the full request and response to a file
curl -sk -X POST "https://127.0.0.1:${PORT}/api/media-proxy" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "xiaomi",
    "serviceType": "tts",
    "endpoint": "/chat/completions",
    "body": {
      "model": "mimo-v2.5-tts",
      "messages": [{"role": "assistant", "content": "Text to synthesize"}],
      "voice": "Bingtang",
      "audio": {"format": "mp3"}
    },
    "responseType": "json"
  }' > /tmp/xiaomi-tts-response.json

# Extract and decode the base64 audio data
cat /tmp/xiaomi-tts-response.json | jq -r '.data.choices[0].message.audio.data' | base64 -d > /tmp/xiaomi-tts.mp3

# Upload to media-store
curl -sk -X POST "https://127.0.0.1:${PORT}/api/media/upload" \
  -F "file=@/tmp/xiaomi-tts.mp3;type=audio/mpeg"

Pick the mediaId field from the JSON response (format xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.mp3).

Step 3: Render the audio via the dc-media protocol

In your reply text, write Markdown syntax directly:

![TTS result](dc-media://replace-with-mediaId)

For example: ![TTS: Hello world](dc-media://a1b2c3d4-e5f6-47a8-b9c0-d1e2f3a4b5c6.mp3)

The frontend detects the .mp3 extension and renders an audio player.

Parameter Mapping

Request body parameters (inside body)

Parameter Description Default
model Model name "mimo-v2.5-tts"
messages[0].role Must be "assistant" "assistant" (fixed)
messages[0].content Text to synthesize required
voice Voice ID "Bingtang" (Chinese) / "Mia" (English)
audio.format Audio format "mp3" (also accepts "wav")

User intent mapping

User intent Parameter
Sweet / cute voice: "Bingtang"
Gentle / refined voice: "Moli"
Young male voice: "Suda"
Mature male voice: "Baihua"
English female voice: "Mia" or "Chloe"
English male voice: "Milo" or "Dean"
High fidelity / lossless audio.format: "wav"

Error Handling

  • success: false + error: "No matching provider": Xiaomi MiMo provider not configured or disabled
  • success: false + error: "API Key not configured": API Key missing
  • statusCode: 401: API Key invalid or expired
  • statusCode: 429: rate limited, retry later
  • statusCode: 400: bad parameters (e.g. unknown voice, empty text)
  • statusCode: 403: model not activated or insufficient permission

Notes

  • Calls are synchronous, typically 315 seconds depending on text length
  • Audio is returned as base64, so URL expiry is not a concern, but watch shell argument length on long responses
  • For long text, split into segments (no more than ~500 chars each), then upload and render each segment
  • When the user doesn't specify, default to mimo-v2.5-tts + auto-selected voice by language + mp3
  • Token Plan keys (prefix tp-) use the https://token-plan-cn.xiaomimimo.com/v1 endpoint
  • Pay-as-you-go keys use the https://api.xiaomimimo.com/v1 endpoint
  • media-proxy picks the correct endpoint based on configuration; the skill does not need to differentiate