# Stateful Inference

A stateful model keeps persistent state across successive inference requests, so the outcome of each request depends on the requests that came before it. Online speech recognition systems, such as Long Short-Term Memory (LSTM) models, are typical examples. Serving a stateful model therefore requires the model server to preserve the order of inference requests within a sequence, so that each prediction builds on the previous results.

Within this context, TorchServe offers a mechanism called sequence continuous batching: it retrieves one inference request at a time from each sequence and combines requests from different sequences into a single batch. Each request carries a unique sequence ID, which can be extracted with the `get_sequence_id` function of `context.py`. Custom handlers use this `sequence ID` as the key to store and fetch values in the backend cache store, which keeps the management of stateful inference efficient. A client can also reuse the `sequence ID` when a connection resumes, as long as the sequence has not expired on the TorchServe side. In addition, continuous batching allows a new inference request of a sequence to be served while the previous one is still in response streaming mode.
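
As a minimal illustration of that idea (Step 1 below shows the full example handler), a custom handler can use the sequence ID of each batched request as its cache key. This is only a sketch; `self.cache` is assumed to be any dict-like store created in `initialize`, and `prev_state`/`new_state` are placeholders:

```python
# Minimal sketch only, not the full example handler: look up and update the
# per-sequence state using the sequence ID of each request in the batch.
def preprocess(self, data):
    results = []
    for idx, row in enumerate(data):
        sequence_id = self.context.get_sequence_id(idx)  # unique per client sequence
        prev_state = self.cache[sequence_id] if sequence_id in self.cache else None
        # ... derive new_state from prev_state and the request payload ...
        # self.cache[sequence_id] = new_state
        results.append(prev_state)
    return results
```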

The workflow of stateful inference is as follows. Each job group has a job queue that stores incoming inference requests from a stream; the maximum capacity of a job queue is defined by `maxSequenceJobQueueSize`. A sequence batch aggregator polls one inference request from each job group and sends the resulting batch of requests to the backend.
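
Purely as an illustration of this scheduling, here is a rough sketch (pseudocode, not the actual TorchServe frontend implementation; the names `poll_batch`, `job_groups`, and `group.queue` are made up for this sketch):

```python
# Illustrative sketch of sequence batching, NOT the real frontend code.
# Each job group buffers the requests of one sequence (bounded by
# maxSequenceJobQueueSize); the aggregator takes at most one request
# per sequence when it assembles a batch of up to batchSize requests.
def poll_batch(job_groups, batch_size):
    batch = []
    for group in job_groups:                 # one job group per active sequence
        if len(batch) == batch_size:
            break
        if not group.queue.empty():
            batch.append(group.queue.get())  # at most one request per sequence
    return batch                             # handed to a backend worker
```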
This example is a practical showcase of stateful inference via sequence batching and continuous batching. Under the hood, the backend uses an [LRU dictionary](https://github.com/amitdev/lru-dict) as a caching layer. Users can choose a different caching library in the handler implementation based on their own use cases.
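
For instance, a handler that does not want to depend on lru-dict could keep a small LRU cache built on `collections.OrderedDict`. The class below is only a sketch of a possible stand-in, not part of this example; the handler's `initialize` would then construct `SimpleLRU(capacity)` instead of `LRU(capacity)`:

```python
from collections import OrderedDict


class SimpleLRU(OrderedDict):
    """A minimal dict-like LRU cache; a possible stand-in for lru-dict's LRU."""

    def __init__(self, capacity: int):
        super().__init__()
        self.capacity = capacity

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self.move_to_end(key)              # mark the entry as most recently used
        return value

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.move_to_end(key)
        if len(self) > self.capacity:
            self.popitem(last=False)       # evict the least recently used entry
```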

### Step 1: Implement handler

stateful_handler.py is an example of a stateful handler. It creates the cache `self.cache` by calling [LRU](https://github.com/amitdev/lru-dict).

```python
    def initialize(self, ctx: Context):
        """
        Loads the model and Initializes the necessary artifacts
        """

        ctx.cache = {}
        if ctx.model_yaml_config["handler"] is not None:
            self.cache = LRU(
                int(
                    ctx.model_yaml_config["handler"]
                    .get("cache", {})
                    .get("capacity", StatefulHandler.DEFAULT_CAPACITY)
                )
            )
        self.initialized = True
```
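
For reference, with the model-config.yaml shown in Step 2 below, `ctx.model_yaml_config["handler"]` would resolve to roughly the dictionary sketched here (assuming the frontend passes the YAML through unchanged), so the LRU is created with a capacity of 4:

```python
# Approximate contents of ctx.model_yaml_config["handler"] for the
# model-config.yaml in Step 2 (assumption: the YAML is passed through as-is).
{
    "cache": {
        "capacity": 4,
    },
}
```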

The handler uses the sequence ID (i.e., `sequence_id = self.context.get_sequence_id(idx)`) as the key to store and fetch values from `self.cache`.

```python
    def preprocess(self, data):
        """
        Preprocess function to convert the request input to a tensor(Torchserve supported format).
        The user needs to override to customize the pre-processing

        Args :
            data (list): List of the data from the request input.

        Returns:
            tensor: Returns the tensor data of the input
        """

        results = []
        for idx, row in enumerate(data):
            sequence_id = self.context.get_sequence_id(idx)
            # SageMaker sticky router relies on response header to identify the sessions
            # The sequence_id from request headers must be set in response headers
            self.context.set_response_header(
                idx, self.context.header_key_sequence_id, sequence_id
            )

            # check if sequence_id exists
            if self.context.get_request_header(
                idx, self.context.header_key_sequence_start
            ):
                prev = int(0)
                self.context.cache[sequence_id] = {
                    "start": True,
                    "cancel": False,
                    "end": False,
                    "num_requests": 0,
                }
            elif self.cache.has_key(sequence_id):
                prev = int(self.cache[sequence_id])
            else:
                prev = None
                logger.error(
                    f"Not received sequence_start request for sequence_id:{sequence_id} before"
                )

            req_id = self.context.get_request_id(idx)
            # process a new request
            if req_id not in self.context.cache:
                logger.info(
                    f"received a new request sequence_id={sequence_id}, request_id={req_id}"
                )
                request = row.get("data") or row.get("body")
                if isinstance(request, (bytes, bytearray)):
                    request = request.decode("utf-8")

                self.context.cache[req_id] = {
                    "stopping_criteria": self._create_stopping_criteria(
                        req_id=req_id, seq_id=sequence_id
                    ),
                    "stream": True,
                }
                self.context.cache[sequence_id]["num_requests"] += 1

                if type(request) is dict and "input" in request:
                    request = request.get("input")

                # -1: cancel
                if int(request) == -1:
                    self.context.cache[sequence_id]["cancel"] = True
                    self.context.cache[req_id]["stream"] = False
                    results.append(int(request))
                elif prev is None:
                    logger.info(
                        f"Close the sequence:{sequence_id} without open session request"
                    )
                    self.context.cache[sequence_id]["end"] = True
                    self.context.cache[req_id]["stream"] = False
                    self.context.set_response_header(
                        idx, self.context.header_key_sequence_end, sequence_id
                    )
                    results.append(int(request))
                else:
                    val = prev + int(request)
                    self.cache[sequence_id] = val
                    # 0: end
                    if int(request) == 0:
                        self.context.cache[sequence_id]["end"] = True
                        self.context.cache[req_id]["stream"] = False
                        self.context.set_response_header(
                            idx, self.context.header_key_sequence_end, sequence_id
                        )
                    # non stream input:
                    elif int(request) % 2 == 0:
                        self.context.cache[req_id]["stream"] = False

                    results.append(val)
            else:
                # continue processing stream
                logger.info(
                    f"received continuous request sequence_id={sequence_id}, request_id={req_id}"
                )
                time.sleep(1)
                results.append(prev)

        return results
```
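
`_create_stopping_criteria`, referenced above, is part of the full stateful_handler.py and is not reproduced here. Purely as a hedged sketch of what such a helper could look like, the closure below only consults the flags that `preprocess` stores in `ctx.cache`; it is illustrative, not the example's actual code:

```python
# Illustrative sketch only; see stateful_handler.py for the real helper.
def _create_stopping_criteria(self, req_id, seq_id):
    ctx = self.context

    def should_stop() -> bool:
        # Stop streaming for this request once its sequence was cancelled or
        # closed, or the request itself was marked as non-streaming.
        seq_state = ctx.cache.get(seq_id, {})
        req_state = ctx.cache.get(req_id, {})
        return (
            seq_state.get("cancel", False)
            or seq_state.get("end", False)
            or not req_state.get("stream", True)
        )

    return should_stop
```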

### Step 2: Model configuration

Stateful inference adds two parameters. TorchServe can process up to (maxWorkers * batchSize) inference sequences of a model in parallel; with the example configuration below, that is 2 * 4 = 8 concurrent sequences.
* sequenceMaxIdleMSec: the maximum idle time in milliseconds of an inference sequence of this stateful model. The default value is 0 (i.e., the model is not treated as stateful). TorchServe stops processing new inference requests of a sequence once it has been idle longer than this timeout.
* maxSequenceJobQueueSize: the size of the job queue of an inference sequence of this stateful model. The default value is 1.

```yaml
#cat model-config.yaml

minWorkers: 2
maxWorkers: 2
batchSize: 4
sequenceMaxIdleMSec: 60000
maxSequenceJobQueueSize: 10
sequenceBatching: true
continuousBatching: true

handler:
  cache:
    capacity: 4
```

### Step 3: Generate mar or tgz file

```bash
torch-model-archiver --model-name stateful --version 1.0 --model-file model.py --serialized-file model_cnn.pt --handler stateful_handler.py -r ../requirements.txt --config-file model-config.yaml
```

### Step 4: Build GRPC Client
The details can be found [here](https://github.com/pytorch/serve/blob/master/docs/grpc_api.md).
* Install gRPC python dependencies
```bash
git submodule init
pip install -U grpcio protobuf grpcio-tools googleapis-common-protos
```

* Generate python gRPC client stub using the proto files
```bash
cd ../../..
python -m grpc_tools.protoc -I third_party/google/rpc --proto_path=frontend/server/src/main/resources/proto/ --python_out=ts_scripts --grpc_python_out=ts_scripts frontend/server/src/main/resources/proto/inference.proto frontend/server/src/main/resources/proto/management.proto
```
| 182 | + |
| 183 | +### Step 5: Run inference |
| 184 | +* Start TorchServe |
| 185 | + |
| 186 | +```bash |
| 187 | +torchserve --ncs --start --model-store models --model stateful.mar --ts-config examples/stateful/config.properties |
| 188 | +``` |
| 189 | + |
| 190 | +* Run sequence inference via GRPC client |
| 191 | +```bash |
| 192 | +python ts_scripts/torchserve_grpc_client.py infer_stream2 stateful seq_0 examples/stateful/sample/sample1.txt,examples/stateful/sample/sample2.txt,examples/stateful/sample/sample3.txt |
| 193 | +``` |
| 194 | + |
| 195 | +* Run sequence inference via HTTP |
| 196 | +```bash |
| 197 | +curl -H "ts_request_sequence_id: seq_0" http://localhost:8080/predictions/stateful -T examples/stateful/sample/sample1.txt |
| 198 | +``` |
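
The same HTTP call can also be made from Python. The sketch below mirrors the curl command above (`curl -T` issues a PUT), reusing the `ts_request_sequence_id` header and sample file from this example; the response-header lookup assumes the handler echoes the sequence ID back under the same header name, as shown in Step 1:

```python
import requests

# Mirrors the curl command above: one request of sequence "seq_0" via HTTP PUT.
with open("examples/stateful/sample/sample1.txt", "rb") as f:
    resp = requests.put(
        "http://localhost:8080/predictions/stateful",
        data=f,
        headers={"ts_request_sequence_id": "seq_0"},
    )

print(resp.status_code, resp.text)
# The handler sets the sequence ID in the response headers (see Step 1);
# assuming the same header name, it can be reused for the next request.
print(resp.headers.get("ts_request_sequence_id"))
```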