Whisper
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
Getting Started
sudo docker run -it --rm --pull always --runtime=nvidia \
--network host ghcr.io/seeed-studio/vllm:latest-rk3588 \
vllm serve whisper
REST API
Use the REST API to run inference. Copy the commands below.
curl -X POST "http://127.0.0.1:8000/api/models/whisper/task" -F "file=@/home/user/audio/long_podcast.wav" -F "language=zh"
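import json

import requests

# Send a short clip to the synchronous endpoint; the file handle is closed on exit.
with open("/home/user/audio/test_en.wav", "rb") as audio:
    resp = requests.post(
        "http://127.0.0.1:8000/api/models/whisper/predict",
        files={"file": audio},
        data={"language": "en"},
        timeout=30,
    )
resp.raise_for_status()  # stop on HTTP errors instead of parsing an error body
result = resp.json()
print(json.dumps(result, indent=2, ensure_ascii=False))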
Model Details
Quick Start
1. Install Docker
Run the following commands on the development board to install Docker:
# Download installation script
curl -fsSL https://get.docker.com -o get-docker.sh
# Install using Aliyun mirror source
sudo sh get-docker.sh --mirror Aliyun
# Start Docker and enable auto-start on boot
sudo systemctl enable docker
sudo systemctl start docker
2. Run the Project (One command, dual-mode preview)
This project supports access via Web Browser. The program automatically serves a web interface for speech recognition.
Step A: Pull Images
sudo docker pull ghcr.io/seeed-projects/recomputer-rk-cv/rk3588-whisper:latest
sudo docker pull ghcr.io/seeed-projects/recomputer-rk-cv/rk3576-whisper:latest
Step B: Run with One Click
For RK3588:
sudo docker run --rm --privileged --net=host \
-e PYTHONUNBUFFERED=1 \
-e RKNN_LOG_LEVEL=0 \
-v /proc/device-tree/compatible:/proc/device-tree/compatible \
ghcr.io/seeed-projects/recomputer-rk-cv/rk3588-whisper:latest \
python3 web_service.py
Access via: http://<Board_IP>:8000
For RK3576:
sudo docker run --rm --privileged --net=host \
-e PYTHONUNBUFFERED=1 \
-e RKNN_LOG_LEVEL=0 \
-v /proc/device-tree/compatible:/proc/device-tree/compatible \
ghcr.io/seeed-projects/recomputer-rk-cv/rk3576-whisper:latest \
python3 web_service.py
Access via: http://<Board_IP>:8000
API Documentation
This project provides RESTful interfaces for ASR tasks, supporting synchronous and asynchronous transcription of audio files.
1. Synchronous Transcription Interface (Short Audio)
Endpoint: POST /api/models/whisper/predict
Suitable for audio files under 20 seconds.
Request Parameters (Multipart/Form-Data):
- file: (Required) Audio file to be transcribed (e.g., .wav, .mp3).
- language: (Optional) Target language code (e.g., en, zh). If it differs from the language of the currently loaded model, the service hot-swaps the tokenizer.
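The 20-second guidance for this endpoint can be checked on the client before uploading. The sketch below is one way to do that for uncompressed WAV input, using Python's standard `wave` module to measure duration and pick an endpoint path. The helper names `wav_duration_seconds` and `choose_endpoint` and the `SYNC_LIMIT_SECONDS` constant are illustrative and not part of the project; other formats such as .mp3 would need a different way to read the duration.

```python
import wave

# Endpoint paths as documented in this section.
SYNC_ENDPOINT = "/api/models/whisper/predict"
ASYNC_ENDPOINT = "/api/models/whisper/task"

# Assumption: treat the documented "under 20 seconds" guidance as a hard cutoff.
SYNC_LIMIT_SECONDS = 20.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def choose_endpoint(path: str) -> str:
    """Pick the synchronous endpoint for short clips, the task endpoint otherwise."""
    if wav_duration_seconds(path) < SYNC_LIMIT_SECONDS:
        return SYNC_ENDPOINT
    return ASYNC_ENDPOINT
```

Appending the returned path to the board's base URL (for example http://127.0.0.1:8000) gives the address to post the file to.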
Usage Examples:
curl -X POST "http://127.0.0.1:8000/api/models/whisper/predict" \
-F "file=@/home/user/audio/test_en.wav" \
-F "language=en"
Response Format (JSON):
{
"status": "success",
"data": {
"text": "Hello world, this is a test.",
"language": "en",
"duration": 3.5,
"inference_time": 0.8
}
}
2. Asynchronous Transcription Interface (Long Audio)
Endpoint: POST /api/models/whisper/task
Creates an asynchronous task for processing longer audio/video files.
Usage Examples:
curl -X POST "http://127.0.0.1:8000/api/models/whisper/task" \
-F "file=@/home/user/audio/long_podcast.wav" \
-F "language=zh"
Response Format (JSON):
{
"status": "success",
"data": {
"task_id": "29c7b932-a77f-480c-a18b-8a958c7911c3",
"message": "Task created successfully. Poll /api/models/whisper/task/{task_id} for status."
}
}
3. Task Status Polling
Endpoint: GET /api/models/whisper/task/{task_id}
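A client typically submits the file, reads `task_id` from the creation response, and then polls this endpoint until the task finishes. The status payload is not shown in this document, so the sketch below takes the completion check as a caller-supplied function rather than guessing field names. `task_status_url`, `poll_until`, `fetch`, and `is_done` are hypothetical names, and the interval and timeout defaults are arbitrary.

```python
import time
from typing import Any, Callable, Dict


def task_status_url(base_url: str, create_response: Dict[str, Any]) -> str:
    """Build the polling URL from the documented task-creation response."""
    task_id = create_response["data"]["task_id"]
    return f"{base_url.rstrip('/')}/api/models/whisper/task/{task_id}"


def poll_until(
    fetch: Callable[[], Dict[str, Any]],
    is_done: Callable[[Dict[str, Any]], bool],
    interval: float = 2.0,
    timeout: float = 600.0,
) -> Dict[str, Any]:
    """Call fetch() repeatedly until is_done(payload) is true or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch()
        if is_done(payload):
            return payload
        # Give up if the next sleep would run past the deadline.
        if time.monotonic() + interval > deadline:
            raise TimeoutError("task did not finish before the timeout")
        time.sleep(interval)
```

With `requests`, `fetch` could be `lambda: requests.get(url, timeout=10).json()`, while `is_done` depends on whatever status field the service actually returns.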
Usage Examples:
curl "http://127.0.0.1:8000/api/models/whisper/task/29c7b932-a77f-480c-a18b-8a958c7911c3"
4. System Configuration Interface (Config)
Used to dynamically switch models and languages.
Get Current System Status
- Endpoint: GET /api/system/status
- Response:
{"status": "success", "data": {"model_size": "base", "language": "en", "max_tokens": 12, "rknn_lite_available": true}}
Update System Configuration (Hot-Swap)
- Endpoint: POST /api/system/config
- Request Parameters (Form-Data): model_size=base, language=zh
- Response:
{"status": "success", "message": "Successfully loaded base model."}
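Because a configuration update reloads the model, a client may want to read the current status first and post a change only when something actually differs. The sketch below works on the documented status payload. `config_changes` is a hypothetical helper, and it assumes `model_size` and `language` are the only fields that /api/system/config accepts.

```python
from typing import Any, Dict


def config_changes(status: Dict[str, Any], model_size: str, language: str) -> Dict[str, str]:
    """Return the form fields to POST to /api/system/config, or {} if nothing differs."""
    current = status.get("data", {})
    desired = {"model_size": model_size, "language": language}
    if all(current.get(key) == value for key, value in desired.items()):
        return {}
    # Send both fields, mirroring the documented request (model_size=base, language=zh).
    return desired


# Status payload copied from the documented GET /api/system/status response.
status = {
    "status": "success",
    "data": {"model_size": "base", "language": "en", "max_tokens": 12, "rknn_lite_available": True},
}
print(config_changes(status, "base", "en"))  # {}
print(config_changes(status, "base", "zh"))  # {'model_size': 'base', 'language': 'zh'}
```

A non-empty result can be passed as the `data` argument of `requests.post` to send it as form fields.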
Developer Guide (Production Recommendations)
Code Description
web_service.py:
- Web API: Integrates FastAPI, supporting audio upload, async task queuing, and model hot-swapping.
- RKNN Inference: Encapsulates RKNN initialization for both Encoder and Decoder models. Implements autoregressive generation loop.
py_utils/whisper_utils.py:
- Audio Processing: Calculates Log-Mel spectrograms aligned with OpenAI's implementation.
- Tokenizer: Handles BPE tokenization and vocabulary management.
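To make the autoregressive generation loop concrete, here is a generic greedy-decoding sketch. It is not taken from web_service.py: the `step` callable stands in for a decoder inference call, the token IDs are toy values, and the default `max_tokens=12` simply mirrors the value shown in the documented status response.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step: Callable[[Sequence[int]], Sequence[float]],
    start_tokens: Sequence[int],
    eot_token: int,
    max_tokens: int = 12,
) -> List[int]:
    """Generate tokens one at a time, always taking the highest-scoring next token."""
    tokens = list(start_tokens)
    for _ in range(max_tokens):
        logits = step(tokens)  # scores over the vocabulary for the next position
        next_token = max(range(len(logits)), key=lambda i: logits[i])
        if next_token == eot_token:
            break  # the end-of-transcript token ends generation
        tokens.append(next_token)
    return tokens[len(start_tokens):]


def toy_step(tokens: Sequence[int]) -> List[float]:
    """Toy stand-in for the decoder: emit 1, then 2, then end-of-transcript (3)."""
    logits = [0.0, 0.0, 0.0, 0.0]
    logits[min(len(tokens), 3)] = 1.0
    return logits


print(greedy_decode(toy_step, start_tokens=[0], eot_token=3))  # [1, 2]
```

In a real pipeline each call to `step` would run the decoder on the encoder output plus the tokens generated so far, which is why the loop is bounded by `max_tokens`.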
Modifying Models
- Place the trained and converted .rknn encoder and decoder models into the model/ directory.
- The service automatically loads models based on the size parameter (e.g., whisper_encoder_base_20s.rknn). Ensure file naming conventions are maintained.
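Before starting the service it can help to confirm that the expected files are in place. The sketch below derives file names from the one example given (whisper_encoder_base_20s.rknn). The matching decoder name and the `_20s` suffix for other sizes are assumptions extrapolated from that example, so adjust the pattern to whatever the project actually ships. `expected_model_files` and `missing_model_files` are hypothetical helpers.

```python
from pathlib import Path
from typing import List


def expected_model_files(model_size: str, clip_seconds: int = 20) -> List[str]:
    """File names for one model size, following the whisper_encoder_base_20s.rknn pattern.

    Assumption: the decoder file uses the same pattern with "decoder" in place of "encoder".
    """
    return [
        f"whisper_{part}_{model_size}_{clip_seconds}s.rknn"
        for part in ("encoder", "decoder")
    ]


def missing_model_files(model_dir: str, model_size: str) -> List[str]:
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected_model_files(model_size) if not (root / name).is_file()]


print(expected_model_files("base"))
# ['whisper_encoder_base_20s.rknn', 'whisper_decoder_base_20s.rknn']
```

An empty list from `missing_model_files("model", "base")` would mean both files for the base size are present under that assumed naming.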