[RFC] Remote Model Inference Streaming

Overview
Popular LLM providers support streaming responses back to the client for a better user experience (for example, OpenAI's streaming API). By supporting this feature, OpenSearch ML can participate in a streaming system as a component.
Scope
Support remote models only as a first step.
Provide a new streaming REST model prediction API.
Introduce the feature as experimental. The features it depends on (Arrow Flight, the REST streaming channel, reactor-netty4, etc.) are either still under development or experimental themselves.
Out of Scope
Support agent/tool streaming execution.
Challenges
REST streaming API. The default external HTTP-based communication module, transport-netty4, does not support streaming. We have to change the network settings to use another module, transport-reactor-netty4 (experimental), which provides a new REST channel, StreamingRestChannel. Accordingly, we need to implement the REST action in a reactive way.
Internal TCP-based communication between nodes. Currently OpenSearch doesn't have a network module that supports node-to-node streaming. Fortunately, a new Arrow Flight feature is being released in 3.0; see Search streams using Apache Arrow and Flight and Arrow Flight Server bootstrap logic for more details. With this feature, we are able to stream LLM responses between the coordinator node and the ML node.
Support SSE between OpenSearch and the LLM. Popular LLMs such as OpenAI and Claude follow the SSE standard. Each provider recommends using its own client as the easiest way to interact with its streaming API, but that doesn't work for us: 1. they usually don't provide Java clients; 2. we need to support many LLMs and don't want to introduce a different client for every one of them.
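Since provider SDKs are mostly unavailable in Java, one provider-agnostic option is to parse the SSE wire format directly. Below is a minimal sketch (the class `SseDataParser` is a hypothetical helper, not plugin code) that extracts the `data:` payloads from the lines of an SSE response body — the common denominator across OpenAI- and Claude-style streams.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal, provider-agnostic SSE parser sketch (hypothetical helper, not plugin code).
// An SSE stream is a sequence of events separated by blank lines; each event carries
// one or more "data:" fields whose values are joined with newlines.
public class SseDataParser {

    // Extracts the data payload of each event from the raw lines of an SSE response body.
    public static List<String> parseData(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {                      // blank line terminates an event
                if (current.length() > 0) {
                    events.add(current.toString());
                    current.setLength(0);
                }
            } else if (line.startsWith("data:")) {     // "data: payload" field
                if (current.length() > 0) {
                    current.append('\n');              // multi-line data fields join with \n
                }
                current.append(line.substring(5).trim());
            }                                          // other fields (event:, id:) ignored here
        }
        if (current.length() > 0) {                    // flush a trailing, unterminated event
            events.add(current.toString());
        }
        return events;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "event: message",
            "data: {\"delta\":\"Hel\"}",
            "",
            "data: {\"delta\":\"lo\"}",
            ""
        );
        System.out.println(parseData(sample)); // [{"delta":"Hel"}, {"delta":"lo"}]
    }
}
```

In practice the raw lines could come from `java.net.http.HttpClient` with `BodyHandlers.ofLines()`, which keeps the dependency footprint at the JDK.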
System Architecture
Apache Arrow
OpenSearch is introducing Apache Arrow in the 3.0 release. There are some key takeaways from Arrow for our use cases.
It’s column-based, which suits an OLAP-style system like OpenSearch better than row-based formats, which mostly target OLTP.
It’s an “on-the-wire” in-memory data format that needs no serialization/deserialization, which can be a bottleneck in large-scale data processing according to the paper Making Sense of Performance in Data Analytics Frameworks. Essentially, the system can process data in a zero-copy way, improving network throughput by avoiding costs such as JSON serialization/deserialization.
It naturally supports streaming data in batches.
It has a client-server framework, Arrow Flight, built on top of gRPC.
It’s supported in Java.
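The zero-copy batching idea above can be illustrated without the Arrow library (this is a simplified sketch, NOT Arrow's actual API): a column lives in one contiguous buffer, and each "record batch" is just a view over a range of that buffer, so producing a batch copies nothing.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Simplified illustration of columnar, zero-copy batching (not the Arrow API):
// one contiguous buffer holds a float32 column, and each "record batch" is a
// view (slice) over a range of that buffer -- no element is copied.
public class ZeroCopyBatches {

    // Returns a view of rows [from, to) of a float32 column; memory is shared.
    static ByteBuffer batchView(ByteBuffer column, int from, int to) {
        ByteBuffer view = column.duplicate().order(column.order());
        view.position(from * Float.BYTES);
        view.limit(to * Float.BYTES);
        return view.slice().order(column.order()); // shares memory with `column`
    }

    public static void main(String[] args) {
        ByteBuffer column = ByteBuffer.allocate(8 * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 8; i++) {
            column.putFloat(i * Float.BYTES, i);    // column values 0..7
        }
        ByteBuffer batch = batchView(column, 4, 8); // second batch of 4 rows
        System.out.println(batch.getFloat(0));      // 4.0 -- first row of the batch
        column.putFloat(4 * Float.BYTES, 99f);      // mutate the underlying column...
        System.out.println(batch.getFloat(0));      // 99.0 -- the view sees it: nothing was copied
    }
}
```

Arrow applies the same principle across the network: Flight ships the buffers as-is, so consumers read batches without a decode step.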
Flight
It’s a client-server framework built on top of gRPC. With it, developers can easily implement services that produce and consume data streams. The framework is optimized to reduce memory copies and to skip protobuf encoding/decoding for Arrow data (metadata excepted).
Server-sent events
It’s an HTML standard. Popular LLMs currently follow the SSE standard to provide streaming functionality, which lets users receive partial results as they are produced. Essentially, it is a series of events sent from the server over a single connection, saving the cost of setting up multiple connections.
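Schematically, an SSE response is a `text/event-stream` body made of `event:`/`data:` blocks separated by blank lines (the event names and payloads below are illustrative, not copied from any specific provider):

```text
HTTP/1.1 200 OK
Content-Type: text/event-stream

event: message_delta
data: {"text": "Hel"}

event: message_delta
data: {"text": "lo"}

event: done
data: [DONE]
```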
A raw HTTP message example from Claude can be found at https://docs.anthropic.com/en/api/messages-streaming.
Model creation
There is no difference between non-streaming and streaming model creation: the ML plugin automatically appends the stream parameter to the request body sent to the LLM, so users don’t have to create separate models on the same endpoint for non-streaming and streaming. For more information on model creation, please refer to the remote inference blueprints. A streaming request example from Claude: https://docs.anthropic.com/en/api/messages-streaming
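As a sketch of what the plugin would send upstream, here is a Claude-style messages request body with the stream flag appended (field values are illustrative; only the `stream` field is the part the plugin adds):

```json
{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 256,
  "messages": [
    { "role": "user", "content": "Hello" }
  ],
  "stream": true
}
```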
Endpoints
Example request
Example response