[RFC] Remote Model Inference Streaming #3630

Open · jngz-es opened this issue Mar 7, 2025 · 0 comments
Labels: enhancement (New feature or request)

jngz-es commented Mar 7, 2025

[RFC] Remote Model Inference Streaming

Overview

  • Popular LLMs can stream responses back to the client for a better customer experience, for example OpenAI's streaming API.
  • By supporting this feature, OpenSearch ML can be integrated as a component of a streaming system.

Scope

  • Support remote models only as a first step.
  • Provide a new REST streaming model prediction API.
  • Introduce the feature as experimental. The features it depends on, such as Arrow Flight, the REST streaming channel, and transport-reactor-netty4, are either still under development or experimental themselves.

Out of Scope

  • Support agent/tool streaming execution.

Challenges

  • REST streaming API. The default external HTTP-based communication module, transport-netty4, does not support streaming. We have to change the network settings to use another module, transport-reactor-netty4 (experimental), which supports a new REST channel, StreamingRestChannel. Accordingly, we need to implement the REST action in a reactive way (see the reactor sketch after this list).
  • Internal TCP-based communication between nodes. Currently OpenSearch does not have a network module that supports streaming between nodes. Fortunately, a new Arrow Flight feature is being released in 3.0. More details are in Search streams using Apache Arrow and Flight and Arrow Flight Server bootstrap logic. With this feature, we are able to stream LLM responses between the coordinator node and the ML node.
  • Support SSE between OpenSearch and LLMs. Popular LLMs such as OpenAI and Claude follow the SSE standard. Each vendor recommends using its own client as the easiest way to interact with its streaming API, but that does not work for us: 1. they usually do not provide Java clients, and 2. we need to support many LLMs and do not want to introduce a different client for every LLM.
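
The reactive flow mentioned above can be illustrated with a minimal reactor-core sketch. It is purely illustrative: the class name and the hard-coded chunks are assumptions, and in the real implementation the chunks would come from the LLM's SSE stream and be written to a StreamingRestChannel rather than printed.

import reactor.core.publisher.Flux;

import java.time.Duration;

public final class ReactiveChunkDemo {
    public static void main(String[] args) {
        // Simulated partial LLM results. In the real flow these chunks would
        // arrive from the SSE connection to the LLM and be pushed to a
        // StreamingRestChannel instead of printed to stdout.
        Flux<String> chunks = Flux.just("Hel", "lo", "!")
                .delayElements(Duration.ofMillis(100));

        chunks.doOnNext(System.out::print)
              .blockLast(); // blocking only to keep this standalone demo alive
    }
}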

System Architecture

[Figure: system architecture diagram]

Apache Arrow

OpenSearch is introducing Apache Arrow in the 3.0 release. There are some key takeaways from Arrow for our use cases.

  1. It is column-based, which suits OLAP workloads like OpenSearch better than row-based formats, which mostly target OLTP.
  2. It is an "on-the-wire" in-memory data format that does not need serialization/deserialization, which can be a bottleneck in large-scale data processing according to the paper Making Sense of Performance in Data Analytics Frameworks. Basically, the system can process data in a zero-copy way and improve data throughput over the network by avoiding serialization/deserialization overhead such as JSON encoding.
  3. It naturally supports streaming data in batches (see the sketch after this list).
  4. It has a client-server framework, Arrow Flight, built on top of gRPC.
  5. It is supported in Java.
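
As a hedged illustration of takeaway 3, the sketch below fills a single Arrow record batch with text chunks using a VarCharVector; in a streaming scenario the same VectorSchemaRoot would be refilled and shipped batch by batch. The field name "response" and the chunk values are assumptions for the example.

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;

import java.nio.charset.StandardCharsets;
import java.util.List;

public final class ArrowBatchDemo {
    public static void main(String[] args) {
        List<String> chunks = List.of("Hello", "!", " How can I help?");
        try (RootAllocator allocator = new RootAllocator();
             VarCharVector vector = new VarCharVector("response", allocator)) {
            vector.allocateNew(chunks.size());
            for (int i = 0; i < chunks.size(); i++) {
                vector.setSafe(i, chunks.get(i).getBytes(StandardCharsets.UTF_8));
            }
            vector.setValueCount(chunks.size());
            // VectorSchemaRoot is the unit that Flight streams over the wire.
            try (VectorSchemaRoot root = VectorSchemaRoot.of(vector)) {
                root.setRowCount(chunks.size());
                System.out.println(root.contentToTSVString());
            }
        }
    }
}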

Flight

It is a client-server framework built on top of gRPC. With this framework, developers can easily implement services that produce and consume data streams. The framework is optimized to reduce memory copies, and it skips the protobuf encoding/decoding steps for Arrow data (only metadata still goes through protobuf).
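
Below is a minimal sketch of what a Flight producer streaming LLM chunks could look like: it extends Arrow's NoOpFlightProducer and pushes one record batch per chunk through the ServerStreamListener. All names (DemoFlightProducer, port 8815, the "response" field) and the hard-coded chunks are illustrative assumptions, not the actual OpenSearch integration.

import org.apache.arrow.flight.FlightServer;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.NoOpFlightProducer;
import org.apache.arrow.flight.Ticket;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;

import java.nio.charset.StandardCharsets;
import java.util.List;

public final class DemoFlightProducer extends NoOpFlightProducer {
    private final BufferAllocator allocator;

    DemoFlightProducer(BufferAllocator allocator) {
        this.allocator = allocator;
    }

    @Override
    public void getStream(CallContext context, Ticket ticket, ServerStreamListener listener) {
        // Simulated partial LLM results; a real producer would receive them
        // from the remote model connection as they arrive.
        List<String> chunks = List.of("Hello", "!", " How can I help?");
        try (VarCharVector vector = new VarCharVector("response", allocator);
             VectorSchemaRoot root = VectorSchemaRoot.of(vector)) {
            listener.start(root);
            for (String chunk : chunks) {
                vector.allocateNew(1);
                vector.setSafe(0, chunk.getBytes(StandardCharsets.UTF_8));
                vector.setValueCount(1);
                root.setRowCount(1);
                listener.putNext();   // ships one record batch per chunk
            }
            listener.completed();
        }
    }

    public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator()) {
            Location location = Location.forGrpcInsecure("0.0.0.0", 8815);
            try (FlightServer server = FlightServer.builder(allocator, location,
                    new DemoFlightProducer(allocator)).build()) {
                server.start();
                server.awaitTermination();
            }
        }
    }
}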

[Figure: Arrow Flight client-server data flow]

Server-sent events

It is part of the HTML standard. Popular LLMs currently follow the SSE standard to provide streaming functionality, which lets users receive partial results as they are produced. Basically it is a series of events sent from the server over a single connection, saving the cost of setting up multiple connections.

A raw HTTP message example from Claude is shown below; the block is taken from https://docs.anthropic.com/en/api/messages-streaming. A sketch of parsing such a stream without a vendor client follows the example.

event: message_start
data: {"type": "message_start", "message": {"id": "msg_1nZdL29xx5MUA1yADyHTEsnR8uuvGzszyY", "type": "message", "role": "assistant", "content": [], "model": "claude-3-7-sonnet-20250219", "stop_reason": null, "stop_sequence": null, "usage": {"input_tokens": 25, "output_tokens": 1}}}

event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}}

event: ping
data: {"type": "ping"}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "!"}}

event: content_block_stop
data: {"type": "content_block_stop", "index": 0}

event: message_delta
data: {"type": "message_delta", "delta": {"stop_reason": "end_turn", "stop_sequence":null}, "usage": {"output_tokens": 15}}

event: message_stop
data: {"type": "message_stop"}


Model creation

There is no difference between non-streaming and streaming model creation: the ML plugin automatically appends the stream parameter to the request body sent to the LLM. Users therefore do not have to create separate models for the same endpoint for non-streaming and streaming. For more information on model creation, please refer to the remote inference blueprints. A streaming request example from Claude (https://docs.anthropic.com/en/api/messages-streaming) is shown below, followed by a sketch of a corresponding connector.

curl https://api.anthropic.com/v1/messages \
     --header "anthropic-version: 2023-06-01" \
     --header "content-type: application/json" \
     --header "x-api-key: $ANTHROPIC_API_KEY" \
     --data \
'{
  "model": "claude-3-7-sonnet-20250219",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 256,
  "stream": true
}'
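
For context, a connector for the endpoint above, created from the existing (non-streaming) remote inference blueprints, would look roughly like the sketch below. The field values are illustrative and loosely adapted from the Anthropic blueprint; the point is that no streaming-specific field is needed, because the plugin appends "stream": true itself. Please check the blueprints for the authoritative format.

POST /_plugins/_ml/connectors/_create
{
  "name": "Claude messages connector",
  "description": "Connector for the Anthropic messages API",
  "version": 1,
  "protocol": "http",
  "parameters": {
    "model": "claude-3-7-sonnet-20250219",
    "max_tokens": 256
  },
  "credential": {
    "anthropic_api_key": "<your API key>"
  },
  "actions": [
    {
      "action_type": "predict",
      "method": "POST",
      "url": "https://api.anthropic.com/v1/messages",
      "headers": {
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
        "x-api-key": "${credential.anthropic_api_key}"
      },
      "request_body": "{ \"model\": \"${parameters.model}\", \"messages\": ${parameters.messages}, \"max_tokens\": ${parameters.max_tokens} }"
    }
  ]
}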

Endpoints

POST /_plugins/_ml/models/<model_id>/_predict/stream

Example request

POST /_plugins/_ml/models/rcormY8B8aiZvtEZIe89/_predict/stream
{
  "parameters": {
    "inputs": "What is the meaning of life?"
  }
}

Example response

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "response": "<chunk 1>"
          }
        }
      ],
      "status_code": 200
    }
  ]
}
{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "response": "<chunk 2>"
          }
        }
      ],
      "status_code": 200
    }
  ]
}

...

{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "dataAsMap": {
            "response": "<chunk n>"
          }
        }
      ],
      "status_code": 200
    }
  ]
}
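
For completeness, a hedged example of how a client might consume this endpoint from the command line; curl's -N flag disables output buffering so each chunk is printed as soon as it arrives. The host, port, and model ID are placeholders.

curl -N -X POST "http://localhost:9200/_plugins/_ml/models/<model_id>/_predict/stream" \
     -H 'Content-Type: application/json' \
     -d '{
  "parameters": {
    "inputs": "What is the meaning of life?"
  }
}'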