Server Side

The server side handles communication and data processing between the IAT/ASR, LLM/RAG, and TTS services, and sends the final audio result to the client so that the client can directly play the audio data.

The core of the server side is written in Node.js. It provides plugin development extensions for programming languages including but not limited to Node.js.

The server side mainly offers various configuration options; you only need to modify these configuration options.

Free Service

ESP-AI Developer Platform provides a free service. It is recommended to use the developer platform directly for visual configuration and service usage. URL: https://dev.espai.fun/

Installation and Environment

Details are omitted here. Please refer to the Quick Start section under: Server Side Installation

Basic Code

const espAi = require("esp-ai");
const config = {
    gen_client_config: () => ({
        // Specific configurations...
    })
};
espAi(config);

Configuration Options

gen_client_config

Description: Assigns a set of configurations to the client (hardware), including which LLM/TTS/IAT services to use, etc.
Default: -
Required: Yes
Usage Example

const config = {
    /**
     * You can request configurations from the library based on business needs using this method...
     * Client configuration generation, primarily generates IAT/LLM/TTS configurations. This function is executed when the client first connects or at some idle moment, as there is an internal automatic update strategy.
     * @param {object} params Parameters configured by the client, parsed into a literal object, developers can directly use the key to reference them.
     */
    gen_client_config: (params) => {
        return {
            iat_server: "xun_fei",
            // iat_server: "esp-ai-plugin-iat-example", // Plugin
            iat_config: {
                // Xunfei: <https://console.xfyun.cn/services/iat>. After opening the website, copy the three fields in the top right corner here.
                appid: "xxx",
                apiSecret: "xxx",
                apiKey: "xxx",
                // Silence duration, how long it takes to consider speech ended when no speech is detected, in milliseconds
                vad_eos: 1500,

                // For custom plugins:
                // Refer to the specific plugin documentation for available configurations...
            },

            llm_server: "xun_fei",
            // llm_server: "dashscope",   // Built-in Qwen
            // llm_server: "volcengine",  // Built-in VolcEngine
            // llm_server: "esp-ai-plugin-llm-example", // Plugin
            llm_config: {
                // Xunfei: <https://console.xfyun.cn/services/iat>. After opening the website, copy the three fields in the top right corner here.
                appid: "xxx",
                apiSecret: "xxx",
                apiKey: "xxx",
                llm: "v4.0",

                /****************/
                // Alibaba Cloud Qwen (Qwen, etc.): <https://dashscope.console.aliyun.com/apiKey>
                // apiKey: "sk-xxx",
                // // LLM version
                // llm: "qwen-turbo",


                /******* VolcEngine *********/
                // 1. Register: <https://console.volcengine.com/ark>
                // 1. Enable: <https://console.volcengine.com/ark/region:ark+cn-beijing/openManagement?LLM=%7B%7D&tab=LLM>
                // 2. Create Endpoint: <https://console.volcengine.com/ark/region:ark+cn-beijing/endpoint>
                // apiKey: "xxx",
                // epId: "ep-xxx", // Endpoint ID


                // For custom plugins:
                // Refer to the specific plugin documentation for available configurations...
            },

            tts_server: "xun_fei",
            // tts_server: "volcengine", // Built-in VolcEngine TTS service
            // tts_server: "esp-ai-plugin-tts-ttson",   // Plugin
            // tts_server: "esp-ai-plugin-tts-aliyun",  // Plugin
            tts_config: {
                // Xunfei: <https://console.xfyun.cn/services/iat>. After opening the website, copy the three fields in the top right corner here.
                // appid: "xxx",
                // apiSecret: "xxx",
                // apiKey: "xx",

                /******* VolcEngine *********/
                // 1. Register: <https://console.volcengine.com/speech/app>
                // 2. Voice Authorization: <https://console.volcengine.com/speech/service/8?AppID=6359932705>
                // 3. Authorization: xxx
                // Service interface authentication information
                // appid: "xxx",
                // accessToken: "xxx",
                // voice_type: "BV007_streaming", // Clear Female Voice
                // voice_type: "BV051_streaming", // Cute Child Voice


                /****************/
                // Dolphin TTS
                // url: "https://ht.ttson.cn:37284/flashsummary/tts",
                // token: "",


                /******* Alibaba Cloud TTS *********/
                // // Obtain from: <https://nls-portal.console.aliyun.com/applist>
                // appkey: "xxx",
                // // Obtain from: <https://ram.console.aliyun.com/manage/ak>
                // AccessKeyID: "xxx",
                // AccessKeySecret: "xxx",


                // For custom plugins:
                // Refer to the specific plugin documentation for available configurations...
            },

            /**
             * Initial prompt for LLM
             */
            llm_init_messages: [
                { role: 'system', content: 'You are Xiao Ming, an all-powerful intelligent assistant.' },
            ],

            /**
             * Intention Table: After the user wakes up Xiao Ming, Xiao Ming can perform the following tasks
             */
            intention: [
                {
                    // Keywords
                    key: ["Help me turn on the light", "Turn on the light", "Turn on the lights"],
                    // Instruction sent to the client
                    instruct: "device_open_001",
                    message: "Turned on! Is there anything else I can help with?"
                },
                {
                    // Keywords
                    key: ["Help me turn off the light", "Turn off the light", "Turn off the lights"],
                    // Instruction sent to the client
                    instruct: "device_close_001",
                    message: "Turned off! Is there anything else I can help with?"
                },
                {
                    // Keywords
                    key: ["Retire", "Step back"],
                    // Built-in sleep instruction
                    instruct: "__sleep__",
                    message: "I'll step back now, call me if you need anything."
                }, 
                {
                    /**
                     * Regular expression matching
                     * For example: Play the last stubborn song
                     * Returns the matched string as a successful match
                     */
                    key: async (text = "", llm_historys) => {
                        const regex = /^(Play music)(.*)$/;
                        const match = text.match(regex);
                        if (match) {
                            const songName = match[2];
                            console.log("Song Name:", songName);
                            return songName;
                        } else {
                            return false;
                        }
                    },
                    // Instruction sent to the client
                    instruct: "__play_music__",
                    message: "Sure!",
                    /**
                     * Used to return audio URLs and playback progress
                     * Currently only supports mp3, wav formats
                     * @param {String} name is the song name
                     * @return {number} seek Progress: (in seconds)
                     * @return {message} TTS when data cannot be found
                     */
                    music_server: async (name, { user_config }) => {
                        return {
                            url: "[https](https://xiaomingio.top/music.mp3)",
                            seek: 0,
                            message: message
                        };
                    },
                    /**
                     * Callback after audio ends
                     * @param {object} arg.break_second  The stopped progress, in seconds. That is, how many seconds the user has played (seek + play_time)
                     * @param {object} arg.play_time     Actual playback time of the audio, in seconds.
                     * @param {object} arg.seek          Audio start playback time, actually the seek value returned by the music_server function
                     * @param {object} arg.start_time    Start playback audio Unix timestamp in milliseconds
                     * @param {object} arg.end_time      End playback audio Unix timestamp in milliseconds
                     * @param {object} arg.event         End reason: "user_break" User interruption | play_end Playback complete | foo Unknown event
                     */
                    on_end: (arg) => {
                        // Request the business server to save the progress information...
                        console.log(arg);
                    }
                },
            ],
        }
    },
    
    plugins: []
}

port

Type: number
Description: Server port, default is 8080

Usage Example:

const config = {
  port: 3000 // Customize the server port to 3000
};

devLog

Type: number
Description: Log output mode:
- 0: No output (Production mode)
- 1: Normal output
- 2: Detailed output

Usage Example:

const config = {
  devLog: 1 // Set to normal log output mode
};

auth

Type: (params: Record<string, any>, scene: "connect" | "start_session") => Promise<{ success: boolean, message?: string }>
Description: Client authentication, called every time the client initially connects and starts a session.
Parameters:
- params - Parameters configured by the client, parsed into a literal object, developers can directly use the key to reference them.
- scene - Authentication scene:
  - "connect": When connecting
  - "start_session": When starting a session

Usage Example:

const config = {
  auth: async (params, scene) => {
    if (params.token === "valid_token") {
      return { success: true }; // Authentication successful
    } else {
      return { success: false, message: "Invalid authentication token" }; // Authentication failed
    }
  }
};

llm_params_set

Type: (params: Record<string, any>) => Record<string, any>
Description: LLM parameter control, can set temperature, etc.
Parameters:
- params - Default LLM parameters.

Usage Example:

const config = {
  llm_params_set: (params) => {
    // Modify default LLM parameters
    return { ...params, temperature: 0.8 };
  }
};

tts_params_set

Type: (params: Record<string, any>) => Record<string, any>
Description: TTS parameter control, can set speaker, volume, speed, etc.
Parameters:
- params - Default TTS parameters.

Usage Example:

const config = {
  tts_params_set: (params) => {
    // Modify default TTS parameters
    return { ...params, voice: "male", speed: 0.9 };
  }
};

onDeviceConnect

Type: (arg: { device_id: string, ws: WebSocket, client_version: string }) => void
Description: Callback when a new device connects to the service.
Parameters:
- device_id - Device ID.
- client_version - Client version.
- ws - Connection handle, can send data using ws.send().

Usage Example:

const config = {
  onDeviceConnect: ({ device_id, ws, client_version }) => {
    console.log(`Device ${device_id} connected, client version: ${client_version}`);
    ws.send("Welcome to the server!");
  }
};

onIAT

Type: (arg: { device_id: string, ws: WebSocket }) => void
Description: Callback before the user triggers an IAT service request.

Usage Example:

const config = {
  onIAT: ({ device_id, ws }) => {
    console.log(`Preparing for speech recognition service: ${device_id}`);
  }
};

onIATcb

Type: (arg: { device_id: string, text: string, ws: WebSocket }) => void
Description: IAT callback: Callback during speech recognition.
Parameters:
- device_id - Device ID.
- text - Speech-to-text.

Usage Example:

const config = {
  onIATcb: ({ device_id, text, ws }) => {
    console.log(`Speech recognition result for device ${device_id}: ${text}`);
  }
};

onIATEndcb

Type: (arg: { device_id: string, text: string, ws: WebSocket }) => void
Description: IAT callback: Callback after speech recognition is completed, used for sending the last frame to the speech recognition server, etc.
Parameters:
- device_id - Device ID.
- text - Speech-to-text.

Usage Example:

const config = {
  onIATEndcb: ({ device_id, text, ws }) => {
    console.log(`Speech recognition completed for device ${device_id}: ${text}`);
  }
};

onTTS

Type: (arg: { device_id: string, tts_task_id: string, text: string, ws: WebSocket }) => void
Description: Callback function executed each time the TTS service is invoked.

Usage Example:

const config = {
  onTTS: ({ device_id, tts_task_id, text, ws }) => {
    console.log(`TTS service started, task ID: ${tts_task_id}, text: ${text}`);
  }
};

onTTScb

Type: (arg: { device_id: string, is_over: boolean, audio: Buffer, ws: WebSocket }) => void
Description: TTS callback.
Parameters:
- device_id - Device ID.
- is_over - Whether completed.
- audio - Audio stream.

Usage Example:

const config = {
  onTTScb: ({ device_id, is_over, audio, ws }) => {
    console.log(`TTS conversion for device ${device_id} ${is_over ? "completed" : "in progress"}`);
  }
};

onLLM

Type: (arg: { device_id: string, text: string, ws: WebSocket }) => void
Description: Callback before invoking the LLM service.
Parameters:
- device_id - Device ID.
- text - Text segment generated by the large language model inference.

Usage Example:

const config = {
  onLLM: ({ device_id, text, ws }) => {
    console.log(`Device ${device_id} invokes LLM service, generated text: ${text}`);
  }
};

onLLMcb

Type: (arg: { device_id: string, text: string, is_over: boolean, llm_historys: Record<string, any>[], ws: WebSocket }) => void
Description: LLM callback.
Parameters:
- device_id - Device ID.
- text - Text segment generated by the large language model inference.
- is_over - Whether the response is complete.
- llm_historys - Conversation history.

Usage Example:

const config = {
  onLLMcb: ({ device_id, text, is_over, llm_historys, ws }) => {
    console.log(`LLM response for device ${device_id}: ${text}`);
  }
};

plugins

Type: { name: string; type: "LLM" | "TTS" | "IAT"; main: (arg: Record<string, any>) => void; }[]
Description: Plugin configuration.
Parameters:
- name - Plugin name.
- type - Plugin type.
- main - Main function of the plugin.

Usage Example:

const config = {
  plugins: [
    {
      name: "customPlugin",
      type: "LLM",
      main: (arg) => {
        console.log("Custom plugin execution", arg);
      }
    }
  ]
};

Service Stress Test

/**
 * The following test data uses Tencent Cloud as the service provider: CPU 2 cores | Memory 2GB | Bandwidth 4mb | SSD 50GB
 * During the testing process, server-side logging was enabled, so actual performance can be slightly higher.
 *
 * 1. Connection + Sending a Single Data Packet (Without considering whether reconnection is successful, only recording concurrent request situations)
 * --------------------------------------------------------------------------------
 *   Number of Connections   |  Successful Connections | Failed Connections  | Server Status During Instantaneous Concurrency             | Server Status After Connection  
 * ---------------------------------------------------------------------------------
 *   1000      |  1000  |   0   | CPU:100%,100%, MEM:1.5GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 *   2000      |  2000  |   0   | CPU:100%,100%, MEM:1.5GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 *   3000(peak) |  3000  |   0   | CPU:100%,100%, MEM:1.5GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 *   4000      |  3806  |194(5%)| CPU:100%,100%, MEM:1.5GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 *   5000      |  4685  |315(6.7%)| CPU:100%,100%, MEM:1.5GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 *   6000      |  4030  |1970(32%)| CPU:100%,100%, MEM:1.6GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 *   10000     |  52    |Service Crashed   | -                         | -
 * ---------------------------------------------------------------------------------
 *
 * 
 * 2. Connection + Necessary Flags + Sending Audio Stream (Test Situation): 10kb audio stream, sent in 2048-byte chunks. Each connection should send 6 messages.
 *  * --------------------------------------------------------------------------------
 *   Number of Connections   |Successful Connections| Messages to Send | Messages Sent  | Failed Messages  | Server Status During Instantaneous Concurrency             | Server Status After Connection  
 * ---------------------------------------------------------------------------------
 *   100      |  922   | 600     |600      |  0   | CPU:100%,100%, MEM:1.5GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 *   500      |  500   | 3000    | 3000    |   0   | CPU:100%,100%, MEM:1.5GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 *   1000     |  1000   | 6000    | 6000    |   0   | CPU:100%,100%, MEM:1.5GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 *   2000(peak)     |  2000   | 12000    | 12000    |   0   | CPU:100%,100%, MEM:1.5GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 *   3000     |  2982   | 18000    | 2982    |   0   | CPU:100%,100%, MEM:1.5GB  | CPU:4%, 3%, MEM:1.5GB 
 * ---------------------------------------------------------------------------------
 */