API reference

This section provides a reference for the Arduino API used in the examples. It includes details on the functions, classes, and methods available for use in your Arduino projects.

Keyword Spotting

class ApiKws

struct ApiKwsSetupConfig_t

struct ApiKwsSetupConfig_t {
    String kws                = "HELLO";
    String model              = "sherpa-onnx-kws-zipformer-gigaspeech-3.3M-2024-01-01";
    String response_format    = "kws.bool";
    std::vector<String> input = {"sys.pcm"};
    bool enoutput             = true;
    bool enaudio              = true;
};

kws: The wake-up keyword to listen for. Must be written in uppercase, like the default HELLO.
model: The model to use for keyword spotting. Default is sherpa-onnx-kws-zipformer-gigaspeech-3.3M-2024-01-01 for English. See the available models: sherpa-onnx-kws-zipformer.
response_format: The response format for the KWS module. Default is kws.bool, which returns a boolean indicating whether the keyword was detected.
input: The input format for the KWS module. Default is sys.pcm, which means it will process audio data from the onboard microphone.
enoutput: If true, the KWS module will return the boolean result in the response. Default is true.
enaudio: If true, the KWS module will play the wake-up audio. Default is true.
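Because the keyword has to be uppercase, it can help to normalize user-supplied text before assigning it to kws. The following is a minimal sketch, not part of the library: normalizeKeyword is a hypothetical helper, and it uses std::string instead of the Arduino String class so that it also compiles off-device.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of the library): returns an uppercase copy
// of the keyword, suitable for assigning to ApiKwsSetupConfig_t::kws.
std::string normalizeKeyword(const std::string& keyword) {
    std::string result;
    result.reserve(keyword.size());
    for (unsigned char c : keyword) {
        // std::toupper expects a value representable as unsigned char.
        result.push_back(static_cast<char>(std::toupper(c)));
    }
    return result;
}
```

In an Arduino sketch, calling toUpperCase() on the String itself has the same effect in place.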

function setup

String setup(ApiKwsSetupConfig_t config = ApiKwsSetupConfig_t(), String request_id = "kws_setup",
             String language = "en_US");

config: The configuration for the KWS module. You can use the ApiKwsSetupConfig_t struct to set the model name and other parameters.
request_id: The request ID for the setup. You can use any string as the request ID.
language: The language for the KWS module. You can use en_US for English.
return: The work ID for the KWS module setup. This ID is used for subscriptions in other modules.

Voice activity detection

class ApiVad

struct ApiVadSetupConfig_t

struct ApiVadSetupConfig_t {
    String model              = "silero-vad";
    String response_format    = "vad.bool";
    std::vector<String> input = {"sys.pcm"};
    bool enoutput             = true;
};

model: The model name for voice activity detection. Default is silero-vad. See the available models: silero-vad.
response_format: The response format for the VAD module. Default is vad.bool, which returns a boolean indicating whether speech is detected.
input: The input format for the VAD module. Default is sys.pcm, which means it will process audio data from the onboard microphone.
enoutput: If true, the VAD module will return the boolean result in the response. Default is true.

function setup

String setup(ApiVadSetupConfig_t config = ApiVadSetupConfig_t(), String request_id = "vad_setup");

config: The configuration for the VAD module. You can use the ApiVadSetupConfig_t struct to set the model name and other parameters.
request_id: The request ID for the setup. You can use any string as the request ID.
return: The work ID for the VAD module. This ID is used for subscriptions in other modules.

Automatic Speech Recognition

class ApiAsr

struct ApiAsrSetupConfig_t

struct ApiAsrSetupConfig_t {
    String model              = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17";
    String response_format    = "asr.utf-8.stream";
    std::vector<String> input = {"sys.pcm"};
    bool enoutput             = true;
    bool enkws                = true;
    float rule1               = 2.4;
    float rule2               = 1.2;
    float rule3               = 30.0;
};

model: The model name for automatic speech recognition. Default is sherpa-ncnn-streaming-zipformer-20M-2023-02-17 for English. See the available models: sherpa-ncnn-streaming-zipformer.
response_format: The response format for the ASR module. Default is asr.utf-8.stream, which returns the transcribed text in a streaming format.
input: The input format for the ASR module. Default is sys.pcm, which means it will process PCM audio data.
enoutput: If true, the ASR module will return the transcribed text in utf-8 format. Default is true.
enkws: This parameter has been deprecated.
rule1: Times out after 2.4 seconds of silence, even if we decoded nothing.
rule2: Times out after 1.2 seconds of silence after decoding something.
rule3: Times out after the utterance is 30 seconds long, regardless of anything else.
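The three rules combine as a logical OR: the utterance ends as soon as any one of them fires. The sketch below illustrates that reading of the descriptions above. It is not the module's implementation: EndpointRules and isEndpoint are hypothetical names, and the use of >= for the threshold comparison is an assumption.

```cpp
// Hypothetical stand-in for the three timeout fields of ApiAsrSetupConfig_t.
struct EndpointRules {
    float rule1 = 2.4f;   // seconds of silence, even if nothing was decoded
    float rule2 = 1.2f;   // seconds of silence after something was decoded
    float rule3 = 30.0f;  // maximum utterance length in seconds
};

// Returns true when any of the three rules says the utterance has ended.
// Assumption: thresholds are inclusive (>=).
bool isEndpoint(const EndpointRules& rules, float trailingSilenceSec,
                float utteranceSec, bool decodedSomething) {
    if (trailingSilenceSec >= rules.rule1) {
        return true;  // rule1: long silence, regardless of decoded text
    }
    if (decodedSomething && trailingSilenceSec >= rules.rule2) {
        return true;  // rule2: shorter silence once text has been decoded
    }
    if (utteranceSec >= rules.rule3) {
        return true;  // rule3: utterance is too long
    }
    return false;
}
```

With the defaults, 1.5 seconds of trailing silence ends the utterance only if some text has already been decoded; lowering rule2 makes the recognizer respond faster at the cost of cutting off slow speakers.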

function setup

String setup(ApiAsrSetupConfig_t config = ApiAsrSetupConfig_t(), String request_id = "asr_setup",
             String language = "en_US");

config: The configuration for the ASR module. You can use the ApiAsrSetupConfig_t struct to set the model name and other parameters.
request_id: The request ID for the setup. You can use any string as the request ID.
language: The language for the ASR module. You can use en_US for English or zh_CN for Chinese.
return: The work ID for the ASR module. This ID is used for subscriptions in other modules.
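The work IDs returned by setup are what connect one module to another. The sketch below shows one plausible way to express such a subscription, assuming that a module subscribes to an upstream module by listing its work ID in the input vector. That mechanism is an assumption, not something this reference states. AsrConfigSketch and subscribeTo are hypothetical stand-ins that use std::string so the example compiles off-device.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for ApiAsrSetupConfig_t. Only the fields needed
// for this illustration are mirrored.
struct AsrConfigSketch {
    std::string model              = "sherpa-ncnn-streaming-zipformer-20M-2023-02-17";
    std::vector<std::string> input = {"sys.pcm"};
};

// Assumption: a module subscribes to another module by appending that
// module's work ID to its own input list.
AsrConfigSketch subscribeTo(AsrConfigSketch config, const std::string& upstreamWorkId) {
    config.input.push_back(upstreamWorkId);
    return config;
}
```

Under this assumption, a sketch would pass the string returned by the KWS setup call as upstreamWorkId before calling the ASR setup function.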

Transcription

class ApiWhisper

struct ApiWhisperSetupConfig_t

struct ApiWhisperSetupConfig_t {
    String model              = "whisper-tiny";
    String response_format    = "asr.utf-8";
    String language           = "en";
    std::vector<String> input = {"sys.pcm"};
    bool enoutput             = true;
};

model: The model name. Default is whisper-tiny. You can use whisper-base or whisper-small for larger models. See the available models: whisper-tiny | whisper-base | whisper-small
response_format: The response format for the Whisper module. Default is asr.utf-8. Whisper only supports non-streaming responses.
input: The input format for the Whisper module. Default is sys.pcm, which means it will process audio data from the onboard microphone.
language: The language for the Whisper module. You can use en for English or ja for Japanese.
enoutput: If true, the Whisper module will return the transcribed text in utf-8 format.

function setup

String setup(ApiWhisperSetupConfig_t config = ApiWhisperSetupConfig_t(), String request_id = "asr_setup",
             String language = "en_US");

config: The configuration for the Whisper module. You can use the ApiWhisperSetupConfig_t struct to set the model name and other parameters.
request_id: The request ID for the setup. You can use any string as the request ID.
language: This parameter has been deprecated.
return: The work ID for the Whisper module. This ID is used for subscriptions in other modules.

Text-to-speech

class ApiMelotts

struct ApiMelottsSetupConfig_t

struct ApiMelottsSetupConfig_t {
    String model              = "melotts-en-us";
    String response_format    = "sys.pcm";
    std::vector<String> input = {"tts.utf-8.stream"};
    bool enoutput             = false;
    bool enaudio              = true;
};

model: The model name. Default is melotts-en-us. You can use melotts-en-default for English or melotts-ja-jp for Japanese. See the available models: English | Japanese | Chinese.
response_format: The response format for the TTS module. You can use sys.pcm for PCM audio data. The generated audio can be played through the onboard speakers.
input: The input format for the TTS module. You can use tts.utf-8.stream for UTF-8 encoded text streaming input.
enoutput: If true, the TTS module will return the base64-encoded PCM data in utf-8 format. Default is false.
enaudio: If true, the TTS module will play the synthesized audio. Default is true.

function setup

String setup(ApiMelottsSetupConfig_t config = ApiMelottsSetupConfig_t(),
             String request_id = "melotts_setup",
             String language = "en_US");

config: The configuration for the TTS module. You can use the ApiMelottsSetupConfig_t struct to set the model name and other parameters.
request_id: The request ID for the setup. You can use any string as the request ID.
language: This parameter has been deprecated.
return: The work ID for the TTS module. You need to use this work ID for the inference function.

function inference

int inference(String work_id, String input, uint32_t timeout = 0, String request_id = "tts_inference");

work_id: The work ID for the TTS module. You need to use the work ID returned by the setup function.
input: The text to be synthesized. You can use any string as the input.
timeout: How long to wait for a response. Default is 0, which means the call does not wait for a response.
request_id: The request ID for the inference. You can use any string as the request ID.

Large Language Model

class ApiLlm

struct ApiLlmSetupConfig_t

struct ApiLlmSetupConfig_t {
    String prompt;
    String model              = "qwen2.5-0.5B-prefill-20e";
    String response_format    = "llm.utf-8.stream";
    std::vector<String> input = {"llm.utf-8.stream"};
    bool enoutput             = true;
    bool enkws                = true;
    int max_token_len         = 127;
    // int max_token_len      = 512;
};

prompt: The prompt for the LLM model. The prompt is used to initialize the model and can be used to set the context for the model.
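model: The model name. You can use qwen2.5-0.5B-prefill-20e for the Qwen2.5 model. See the available models: Reasoning models | Flagship chat models
response_format: The response format for the LLM module. Default is llm.utf-8.stream, which returns the generated text in a streaming format.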
max_token_len: The maximum number of tokens to generate. The default is 127. You can set it to 512 for larger models.

function setup

String setup(ApiLlmSetupConfig_t config = ApiLlmSetupConfig_t(), String request_id = "llm_setup");

config: The configuration for the LLM model. You can use the ApiLlmSetupConfig_t struct to set the model name and other parameters.
request_id: The request ID for the setup. You can use any string as the request ID.
return: The work ID for the LLM model. You need to use this work ID for the inference function.

function inference

int inference(String work_id, String input, String request_id = "llm_inference");

work_id: The work ID for the LLM model. You need to use the work ID returned by the setup function.
input: The input text for the LLM model. You can use any string as the input.
request_id: The request ID for the inference. You can use any string as the request ID.

Visual Language Model

class ApiVlm

struct ApiVlmSetupConfig_t

struct ApiVlmSetupConfig_t {
    String prompt;
    String model              = "internvl2.5-1B-ax630c";
    String response_format    = "vlm.utf-8.stream";
    std::vector<String> input = {"vlm.utf-8.stream"};
    bool enoutput             = true;
    bool enkws                = true;
    // int max_token_len         = 127;
    int max_token_len = 255;
};

prompt: The prompt for the VLM model. The prompt is used to initialize the model and can be used to set the context for the model.
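model: The model name. You can use internvl2.5-1B-ax630c for the InternVL2.5 model. See the available models: Multimodal models
response_format: The response format for the VLM module. Default is vlm.utf-8.stream, which returns the generated text in a streaming format.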
max_token_len: The maximum number of tokens to generate. The default is 255. You can set it to 512 for larger models.

function setup

String setup(ApiVlmSetupConfig_t config = ApiVlmSetupConfig_t(), String request_id = "vlm_setup");

config: The configuration for the VLM model. You can use the ApiVlmSetupConfig_t struct to set the model name and other parameters.
request_id: The request ID for the setup. You can use any string as the request ID.
return: The work ID for the VLM model. You need to use this work ID for the inference function.

function inference

int inference(String work_id, String input, String request_id = "vlm_inference");

work_id: The work ID for the VLM model. You need to use the work ID returned by the setup function.
input: The input text for the VLM model. You can use any string as the input.
request_id: The request ID for the inference. You can use any string as the request ID.

Vision

class ApiDepthAnything

struct ApiDepthAnythingSetupConfig_t

struct ApiDepthAnythingSetupConfig_t {
    String model              = "depth-anything-ax630c";
    String response_format    = "jpeg.base64.stream";
    std::vector<String> input = {"depth_anything.jpeg.raw"};
    bool enoutput             = true;
};

model: The model name for depth estimation. Default is depth-anything-ax630c. See the available models: depth-anything-ax630c.
response_format: The response format for the depth estimation module. Default is jpeg.base64.stream, which returns the depth map as a JPEG image in base64 format.
input: The input format for the depth estimation module. Default is depth_anything.jpeg.raw, which means it will process raw JPEG images.
enoutput: If true, the depth estimation module will return the depth map in base64 format. Default is true.

function setup

String setup(ApiDepthAnythingSetupConfig_t config = ApiDepthAnythingSetupConfig_t(),
             String request_id = "depth_anything_setup");

config: The configuration for the depth estimation module. You can use the ApiDepthAnythingSetupConfig_t struct to set the model name and other parameters.
request_id: The request ID for the setup. You can use any string as the request ID.
return: The work ID for the depth estimation module. You need to use this work ID for the inference function.

function inference

int inference(String& work_id, uint8_t* input, size_t& raw_len, String request_id = "depth_anything_inference");

work_id: The work ID for the depth estimation module. You need to use the work ID returned by the setup function.
input: The input image for depth estimation, passed as a pointer to a buffer that holds the raw JPEG image data.
raw_len: The length of the raw image data.
request_id: The request ID for the inference. You can use any string as the request ID.

class ApiYolo

struct ApiYoloSetupConfig_t

struct ApiYoloSetupConfig_t {
    String model              = "yolo11n";
    String response_format    = "yolo.box.stream";
    std::vector<String> input = {"yolo.jpeg.base64"};
    bool enoutput             = true;
};

model: The model name for object detection. Default is yolo11n. See the available models: yolo11n.
response_format: The response format for the object detection module. Default is yolo.box.stream, which returns the detected bounding boxes in a streaming format.
input: The input format for the object detection module. Default is yolo.jpeg.base64, which means it will process JPEG images in base64 format.
enoutput: If true, the object detection module will return the detected bounding boxes in the response. Default is true.

function setup

String setup(ApiYoloSetupConfig_t config = ApiYoloSetupConfig_t(), String request_id = "yolo_setup");

config: The configuration for the object detection module. You can use the ApiYoloSetupConfig_t struct to set the model name and other parameters.
request_id: The request ID for the setup. You can use any string as the request ID.
return: The work ID for the object detection module. You need to use this work ID for the inference function.

function inference

int inference(String& work_id, uint8_t* input, size_t& raw_len, String request_id = "yolo_inference");

work_id: The work ID for the object detection module. You need to use the work ID returned by the setup function.
input: The input image for object detection, passed as a pointer to a buffer that holds the JPEG image data.
raw_len: The length of the raw image data.
request_id: The request ID for the inference. You can use any string as the request ID.