[RFC] Remote Model Inference Streaming #3630

Open · jngz-es opened this issue Mar 7, 2025 · 0 comments
Labels: enhancement (New feature or request)

jngz-es commented Mar 7, 2025

[RFC] Remote Model Inference Streaming

Overview

  • Popular LLMs can stream responses back to the client for a better customer experience, for example OpenAI's streaming API.
  • By supporting this feature, OpenSearch ML can be integrated as a component of a streaming system.

Scope

  • Support remote models only as a first step.
  • Provide a new REST streaming model prediction API.
  • Introduce the feature as experimental. The features it depends on, such as Arrow Flight, the REST streaming channel, and transport-reactor-netty4, are either still under development or experimental themselves.

Out of Scope

  • Support agent/tool streaming execution.

Challenges

  • REST streaming API. The default external HTTP-based communication module, transport-netty4, does not support streaming. We have to change the network settings to use another module, transport-reactor-netty4 (experimental), which supports a new REST channel, StreamingRestChannel. Accordingly, we need to implement the REST action in a reactive way (see the reactor sketch after this list).
  • Internal TCP-based communication between nodes. Currently OpenSearch does not have a network module that supports streaming between nodes. Fortunately, a new Arrow Flight feature is being released in 3.0. More details are in Search streams using Apache Arrow and Flight and Arrow Flight Server bootstrap logic. With this feature, we are able to stream LLM responses between the coordinator node and the ML node.
  • Support SSE between OpenSearch and LLMs. Popular LLMs such as OpenAI and Claude follow the SSE standard. Each vendor recommends using its own client as the easiest way to interact with its streaming API, but that does not work for us: 1. they usually do not provide Java clients, and 2. we need to support many LLMs and do not want to introduce a different client for every LLM.
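
The reactive flow mentioned above can be illustrated with a minimal reactor-core sketch. It is purely illustrative: the class name and the hard-coded chunks are assumptions, and in the real implementation the chunks would come from the LLM's SSE stream and be written to a StreamingRestChannel rather than printed.

import reactor.core.publisher.Flux;

import java.time.Duration;

public final class ReactiveChunkDemo {
    public static void main(String[] args) {
        // Simulated partial LLM results. In the real flow these chunks would
        // arrive from the SSE connection to the LLM and be pushed to a
        // StreamingRestChannel instead of printed to stdout.
        Flux<String> chunks = Flux.just("Hel", "lo", "!")
                .delayElements(Duration.ofMillis(100));

        chunks.doOnNext(System.out::print)
              .blockLast(); // blocking only to keep this standalone demo alive
    }
}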

System Architecture

[Figure: system architecture diagram]

Apache Arrow

OpenSearch is introducing Apache Arrow in the 3.0 release. There are some key takeaways from Arrow for our use cases.

  1. It is column-based, which suits OLAP workloads like OpenSearch better than row-based formats, which mostly target OLTP.
  2. It is an "on-the-wire" in-memory data format that does not need serialization/deserialization, which can be a bottleneck in large-scale data processing according to the paper Making Sense of Performance in Data Analytics Frameworks. Basically, the system can process data in a zero-copy way and improve data throughput over the network by avoiding serialization/deserialization overhead such as JSON encoding.
  3. It naturally supports streaming data in batches (see the sketch after this list).
  4. It has a client-server framework, Arrow Flight, built on top of gRPC.
  5. It is supported in Java.
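
As a hedged illustration of takeaway 3, the sketch below fills a single Arrow record batch with text chunks using a VarCharVector; in a streaming scenario the same VectorSchemaRoot would be refilled and shipped batch by batch. The field name "response" and the chunk values are assumptions for the example.

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;

import java.nio.charset.StandardCharsets;
import java.util.List;

public final class ArrowBatchDemo {
    public static void main(String[] args) {
        List<String> chunks = List.of("Hello", "!", " How can I help?");
        try (RootAllocator allocator = new RootAllocator();
             VarCharVector vector = new VarCharVector("response", allocator)) {
            vector.allocateNew(chunks.size());
            for (int i = 0; i < chunks.size(); i++) {
                vector.setSafe(i, chunks.get(i).getBytes(StandardCharsets.UTF_8));
            }
            vector.setValueCount(chunks.size());
            // VectorSchemaRoot is the unit that Flight streams over the wire.
            try (VectorSchemaRoot root = VectorSchemaRoot.of(vector)) {
                root.setRowCount(chunks.size());
                System.out.println(root.contentToTSVString());
            }
        }
    }
}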

Flight

It is a client-server framework built on top of gRPC. With this framework, developers can easily implement services that produce and consume data streams. The framework is optimized to reduce memory copies, and it skips the protobuf encoding/decoding steps for Arrow data (only metadata still goes through protobuf).
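
Below is a minimal sketch of what a Flight producer streaming LLM chunks could look like: it extends Arrow's NoOpFlightProducer and pushes one record batch per chunk through the ServerStreamListener. All names (DemoFlightProducer, port 8815, the "response" field) and the hard-coded chunks are illustrative assumptions, not the actual OpenSearch integration.

import org.apache.arrow.flight.FlightServer;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.NoOpFlightProducer;
import org.apache.arrow.flight.Ticket;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;

import java.nio.charset.StandardCharsets;
import java.util.List;

public final class DemoFlightProducer extends NoOpFlightProducer {
    private final BufferAllocator allocator;

    DemoFlightProducer(BufferAllocator allocator) {
        this.allocator = allocator;
    }

    @Override
    public void getStream(CallContext context, Ticket ticket, ServerStreamListener listener) {
        // Simulated partial LLM results; a real producer would receive them
        // from the remote model connection as they arrive.
        List<String> chunks = List.of("Hello", "!", " How can I help?");
        try (VarCharVector vector = new VarCharVector("response", allocator);
             VectorSchemaRoot root = VectorSchemaRoot.of(vector)) {
            listener.start(root);
            for (String chunk : chunks) {
                vector.allocateNew(1);
                vector.setSafe(0, chunk.getBytes(StandardCharsets.UTF_8));
                vector.setValueCount(1);
                root.setRowCount(1);
                listener.putNext();   // ships one record batch per chunk
            }
            listener.completed();
        }
    }

    public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator()) {
            Location location = Location.forGrpcInsecure("0.0.0.0", 8815);
            try (FlightServer server = FlightServer.builder(allocator, location,
                    new DemoFlightProducer(allocator)).build()) {
                server.start();
                server.awaitTermination();
            }
        }
    }
}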

[Figure: Arrow Flight client-server data flow]

Server-sent events

It is part of the HTML standard. Popular LLMs currently follow the SSE standard to provide streaming functionality, which lets users receive partial results as they are produced. Basically it is a series of events sent from the server over a single connection, saving the cost of setting up multiple connections.

A raw HTTP message example from Claude is shown below; the block is taken from https://docs.anthropic.com/en/api/messages-streaming. A sketch of parsing such a stream without a vendor client follows the example.

event: message_start
data: {"type": "message_start", "message": {"id": "msg_1nZdL29xx5MUA1yADyHTEsnR8uuvGzszyY", "type": "message", "role": "assistant", "content": [], "model": "claude-3-7-sonnet-20250219", "stop_reason": null, "stop_sequence": null, "usage": {"input_tokens": 25, "output_tokens": 1}}}

event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}}

event: ping
data: {"type": "ping"}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "!"}}

event: content_block_stop
data: {"type": "content_block_stop", "index": 0}

event: message_delta
data: {"type": "message_delta", "delta": {"stop_reason": "end_turn", "stop_sequence":null}, "usage": {"output_tokens": 15}}

event: message_stop
data: {"type": "message_stop"}


Model creation

There is no difference between non-streaming and streaming model creation: the ML plugin automatically appends the stream parameter to the request body sent to the LLM. Users therefore do not have to create separate models for the same endpoint for non-streaming and streaming. For more information on model creation, please refer to the remote inference blueprints. A streaming request example from Claude (https://docs.anthropic.com/en/api/messages-streaming) is shown below, followed by a sketch of a corresponding connector.

curl https://api.anthropic.com/v1/messages \
     --header "anthropic-version: 2023-06-01" \
     --header "content-type: application/json" \
     --header "x-api-key: $ANTHROPIC_API_KEY" \
     --data \
'{
  "model": "claude-3-7-sonnet-20250219",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 256,
  "stream": true
}'
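
For context, a connector for the endpoint above, created from the existing (non-streaming) remote inference blueprints, would look roughly like the sketch below. The field values are illustrative and loosely adapted from the Anthropic blueprint; the point is that no streaming-specific field is needed, because the plugin appends "stream": true itself. Please check the blueprints for the authoritative format.

POST /_plugins/_ml/connectors/_create
{
  "name": "Claude messages connector",
  "description": "Connector for the Anthropic messages API",
  "version": 1,
  "protocol": "http",
  "parameters": {
    "model": "claude-3-7-sonnet-20250219",
    "max_tokens": 256
  },
  "credential": {
    "anthropic_api_key": "<your API key>"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://api.anthropic.com/v1/messages",
      "headers": {
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
        "x-api-key": "${credential.anthropic_api_key}"
      },
      "request_body": "{ \"model\": \"${parameters.model}\", \"messages\": ${parameters.messages}, \"max_tokens\": ${parameters.max_tokens} }"
    }
  ]
}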

Endpoints

POST /_plugins/_ml/models/<model_id>/_predict/stream

Example request

POST /_plugins/_ml/models/rcormY8B8aiZvtEZIe89/_predict/stream
{
  "parameters": {
    "inputs": "What is the meaning of life?"
  }
}

Example response

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "response": "<chunk 1>"
          }
        }
      ],
      "status_code": 200
    }
  ]
}
{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "response": "<chunk 2>"
          }
        }
      ],
      "status_code": 200
    }
  ]
}

...

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "response": "<chunk n>"
          }
        }
      ],
      "status_code": 200
    }
  ]
}
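
For completeness, a hedged example of how a client might consume this endpoint from the command line; curl's -N flag disables output buffering so each chunk is printed as soon as it arrives. The host, port, and model ID are placeholders.

curl -N -X POST "http://localhost:9200/_plugins/_ml/models/<model_id>/_predict/stream" \
     -H 'Content-Type: application/json' \
     -d '{
  "parameters": {
    "inputs": "What is the meaning of life?"
  }
}'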