Commit 0905861

Added: MXNet local mode example
1 parent 1c46424 commit 0905861

3 files changed: +337 -0 lines changed

@@ -0,0 +1,10 @@
1+
2+
{
3+
"default-runtime": "nvidia",
4+
"runtimes": {
5+
"nvidia": {
6+
"path": "/usr/bin/nvidia-container-runtime",
7+
"runtimeArgs": []
8+
}
9+
}
10+
}
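
As a quick sanity check on this file, the sketch below parses a copy of the configuration and looks up the binary behind the default runtime. The helper name `default_runtime_path` is hypothetical; it only illustrates how the `default-runtime` key refers to an entry under `runtimes`, and it does not talk to Docker.

```python
import json

# Copy of the daemon.json content added in this commit.
DAEMON_JSON = """
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""


def default_runtime_path(config_text):
    """Return the binary path of the default runtime, or None if it is not declared."""
    config = json.loads(config_text)
    name = config.get("default-runtime")
    runtime = config.get("runtimes", {}).get(name)
    return runtime.get("path") if runtime else None


print(default_runtime_path(DAEMON_JSON))
```

A `None` result would mean the `default-runtime` name has no matching entry under `runtimes`, which is worth catching before copying the file to `/etc/docker/daemon.json`.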
@@ -0,0 +1,259 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"## Local MNIST Training with MXNet and Gluon\n",
8+
"\n",
9+
"### Pre-requisites\n",
10+
"\n",
11+
"This notebook shows how to use the SageMaker Python SDK to run your code in a local container before deploying to SageMaker's managed training or hosting environments. This can speed up iterative testing and debugging while using the same familiar Python SDK interface. Just change your estimator's `train_instance_type` to `local` (or `local_gpu` if you're using an ml.p2 or ml.p3 notebook instance).\n",
12+
"\n",
13+
"To use this feature, you need docker-compose installed (and nvidia-docker2 if training with a GPU). The `setup.sh` script run in the next cell installs both if they are missing.\n",
14+
"\n",
15+
"**Note: you can only run a single local notebook at a time.**"
16+
]
17+
},
18+
{
19+
"cell_type": "code",
20+
"execution_count": null,
21+
"metadata": {},
22+
"outputs": [],
23+
"source": [
24+
"!/bin/bash ./setup.sh"
25+
]
26+
},
27+
{
28+
"cell_type": "markdown",
29+
"metadata": {},
30+
"source": [
31+
"### Overview\n",
32+
"\n",
33+
"MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits. The dataset is split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train and test an MNIST model on SageMaker using MXNet and the Gluon API."
34+
]
35+
},
36+
{
37+
"cell_type": "code",
38+
"execution_count": null,
39+
"metadata": {},
40+
"outputs": [],
41+
"source": [
42+
"import os\n",
43+
"import subprocess\n",
44+
"import boto3\n",
45+
"import sagemaker\n",
46+
"from sagemaker.mxnet import MXNet\n",
47+
"from mxnet import gluon\n",
48+
"from sagemaker import get_execution_role\n",
49+
"\n",
50+
"sagemaker_session = sagemaker.Session()\n",
51+
"\n",
52+
"instance_type = 'local'\n",
53+
"\n",
54+
"if subprocess.call('nvidia-smi', shell=True) == 0:\n",
55+
"    # Set type to GPU if one is present\n",
56+
"    instance_type = 'local_gpu'\n",
57+
" \n",
58+
"print(\"Instance type = \" + instance_type)\n",
59+
"\n",
60+
"role = get_execution_role()"
61+
]
62+
},
63+
{
64+
"cell_type": "markdown",
65+
"metadata": {},
66+
"source": [
67+
"## Download training and test data"
68+
]
69+
},
70+
{
71+
"cell_type": "code",
72+
"execution_count": null,
73+
"metadata": {},
74+
"outputs": [],
75+
"source": [
76+
"gluon.data.vision.MNIST('./data/train', train=True)\n",
77+
"gluon.data.vision.MNIST('./data/test', train=False)"
78+
]
79+
},
80+
{
81+
"cell_type": "markdown",
82+
"metadata": {},
83+
"source": [
84+
"## Uploading the data\n",
85+
"\n",
86+
"We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies the location -- we will use this later when we start the training job."
87+
]
88+
},
89+
{
90+
"cell_type": "code",
91+
"execution_count": null,
92+
"metadata": {},
93+
"outputs": [],
94+
"source": [
95+
"inputs = sagemaker_session.upload_data(path='data', key_prefix='data/mnist')"
96+
]
97+
},
98+
{
99+
"cell_type": "markdown",
100+
"metadata": {},
101+
"source": [
102+
"## Implement the training function\n",
103+
"\n",
104+
"We need to provide a training script that can run on the SageMaker platform. The training script is essentially the same as one you would write for local training, except that you need to provide a `train` function. When SageMaker calls your function, it passes in arguments that describe the training environment. Check the script below to see how this works.\n",
105+
"\n",
106+
"The script here is an adaptation of the [Gluon MNIST example](https://github.com/apache/incubator-mxnet/blob/master/example/gluon/mnist.py) provided by the [Apache MXNet](https://mxnet.incubator.apache.org/) project. "
107+
]
108+
},
109+
{
110+
"cell_type": "code",
111+
"execution_count": null,
112+
"metadata": {},
113+
"outputs": [],
114+
"source": [
115+
"!cat 'mnist.py'"
116+
]
117+
},
118+
{
119+
"cell_type": "markdown",
120+
"metadata": {},
121+
"source": [
122+
"## Run the training script on SageMaker\n",
123+
"\n",
124+
"The `MXNet` class allows us to run our training function on SageMaker. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. The instance type is the only difference from [mnist_with_gluon.ipynb](./mnist_with_gluon.ipynb): instead of `train_instance_type='ml.c4.xlarge'`, we set `train_instance_type='local'`, or `'local_gpu'` for local training on a GPU. In this notebook, `instance_type` was set above based on whether a GPU is available."
125+
]
126+
},
127+
{
128+
"cell_type": "code",
129+
"execution_count": null,
130+
"metadata": {},
131+
"outputs": [],
132+
"source": [
133+
"m = MXNet(\"mnist.py\",\n",
134+
"          role=role,\n",
135+
"          train_instance_count=1,\n",
136+
"          train_instance_type=instance_type,\n",
137+
"          hyperparameters={'batch_size': 100,\n",
138+
"                           'epochs': 2,\n",
139+
"                           'learning_rate': 0.1,\n",
140+
"                           'momentum': 0.9,\n",
141+
"                           'log_interval': 100})"
142+
]
143+
},
144+
{
145+
"cell_type": "markdown",
146+
"metadata": {},
147+
"source": [
148+
"After we've constructed our `MXNet` object, we fit it using the data we uploaded to S3. Even though we're in local mode, using S3 as our data source makes sense because it maintains consistency with how SageMaker's distributed, managed training ingests data."
149+
]
150+
},
151+
{
152+
"cell_type": "code",
153+
"execution_count": null,
154+
"metadata": {
155+
"scrolled": true
156+
},
157+
"outputs": [],
158+
"source": [
159+
"m.fit(inputs)"
160+
]
161+
},
162+
{
163+
"cell_type": "markdown",
164+
"metadata": {},
165+
"source": [
166+
"After training, we use the MXNet object to deploy an MXNetPredictor object. This creates a SageMaker endpoint locally that we can use to perform inference. \n",
167+
"\n",
168+
"This allows us to perform inference on JSON-encoded multi-dimensional arrays."
169+
]
170+
},
171+
{
172+
"cell_type": "code",
173+
"execution_count": null,
174+
"metadata": {
175+
"scrolled": true
176+
},
177+
"outputs": [],
178+
"source": [
179+
"predictor = m.deploy(initial_instance_count=1, instance_type=instance_type)"
180+
]
181+
},
182+
{
183+
"cell_type": "markdown",
184+
"metadata": {},
185+
"source": [
186+
"We can now use this predictor to classify handwritten digits. Drawing into the image box loads the pixel data into a `data` variable in this notebook, which we can then pass to the MXNet predictor."
187+
]
188+
},
189+
{
190+
"cell_type": "code",
191+
"execution_count": null,
192+
"metadata": {},
193+
"outputs": [],
194+
"source": [
195+
"from IPython.display import HTML\n",
196+
"HTML(open(\"input.html\").read())"
197+
]
198+
},
199+
{
200+
"cell_type": "markdown",
201+
"metadata": {},
202+
"source": [
203+
"The predictor runs inference on our input data and returns the predicted digit (as a float value, so we convert to int for display)."
204+
]
205+
},
206+
{
207+
"cell_type": "code",
208+
"execution_count": null,
209+
"metadata": {
210+
"scrolled": true
211+
},
212+
"outputs": [],
213+
"source": [
214+
"response = predictor.predict(data)\n",
215+
"print(int(response))"
216+
]
217+
},
218+
{
219+
"cell_type": "markdown",
220+
"metadata": {},
221+
"source": [
222+
"## Clean-up\n",
223+
"\n",
224+
"Deleting the local endpoint when you're finished is important since you can only run one local endpoint at a time."
225+
]
226+
},
227+
{
228+
"cell_type": "code",
229+
"execution_count": null,
230+
"metadata": {},
231+
"outputs": [],
232+
"source": [
233+
"m.delete_endpoint()"
234+
]
235+
}
236+
],
237+
"metadata": {
238+
"kernelspec": {
239+
"display_name": "conda_mxnet_p27",
240+
"language": "python",
241+
"name": "conda_mxnet_p27"
242+
},
243+
"language_info": {
244+
"codemirror_mode": {
245+
"name": "ipython",
246+
"version": 2
247+
},
248+
"file_extension": ".py",
249+
"mimetype": "text/x-python",
250+
"name": "python",
251+
"nbconvert_exporter": "python",
252+
"pygments_lexer": "ipython2",
253+
"version": "2.7.14"
254+
},
255+
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
256+
},
257+
"nbformat": 4,
258+
"nbformat_minor": 2
259+
}
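
The notebook picks `local` or `local_gpu` by probing for `nvidia-smi`. A minimal sketch of that probe as a standalone helper follows. It assumes that a zero exit code from the probe command means a usable GPU, and it treats a missing binary as "no GPU" instead of raising an error. The function name `pick_instance_type` and its `probe` parameter are hypothetical and not part of the SageMaker SDK.

```python
import os
import subprocess


def pick_instance_type(probe=("nvidia-smi",)):
    """Return 'local_gpu' if the probe command exits with 0, else 'local'."""
    try:
        with open(os.devnull, "w") as devnull:
            exit_code = subprocess.call(list(probe), stdout=devnull, stderr=devnull)
    except OSError:
        # The probe binary is not installed (typical on CPU-only hosts).
        return "local"
    return "local_gpu" if exit_code == 0 else "local"


instance_type = pick_instance_type()
print("Instance type = " + instance_type)
```

The result can be passed as `train_instance_type` to the estimator and as `instance_type` to `deploy`, in the same way the notebook uses its own `instance_type` variable.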
@@ -0,0 +1,68 @@
1+
#!/bin/bash
2+
3+
# Do we have GPU support?
4+
nvidia-smi > /dev/null 2>&1
5+
if [ $? -eq 0 ]; then
6+
# check if we have nvidia-docker
7+
NVIDIA_DOCKER=`rpm -qa | grep -c nvidia-docker2`
8+
if [ $NVIDIA_DOCKER -eq 0 ]; then
9+
# Install nvidia-docker2
10+
#sudo pkill -SIGHUP dockerd
11+
sudo yum -y remove docker
12+
sudo yum -y install docker-17.09.1ce-1.111.amzn1
13+
14+
sudo /etc/init.d/docker start
15+
16+
curl -s -L https://nvidia.github.io/nvidia-docker/amzn1/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
17+
sudo yum install -y nvidia-docker2
18+
sudo cp daemon.json /etc/docker/daemon.json
19+
sudo pkill -SIGHUP dockerd
20+
echo "installed nvidia-docker2"
21+
else
22+
echo "nvidia-docker2 already installed. We are good to go!"
23+
fi
24+
fi
25+
26+
# This is common for both GPU and CPU instances
27+
28+
# check if we have docker-compose
29+
docker-compose version >/dev/null 2>&1
30+
if [ $? -ne 0 ]; then
31+
# install docker compose
32+
pip install docker-compose
33+
fi
34+
35+
# check if we need to configure our docker interface
36+
SAGEMAKER_NETWORK=`docker network ls | grep -c sagemaker-local`
37+
if [ $SAGEMAKER_NETWORK -eq 0 ]; then
38+
docker network create --driver bridge sagemaker-local
39+
fi
40+
41+
# Notebook instance Docker networking fixes
42+
RUNNING_ON_NOTEBOOK_INSTANCE=`sudo iptables -S OUTPUT -t nat | grep -c 169.254.0.2`
43+
44+
# Get the Docker Network CIDR and IP for the sagemaker-local docker interface.
45+
SAGEMAKER_INTERFACE=br-`docker network ls | grep sagemaker-local | cut -d' ' -f1`
46+
DOCKER_NET=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f1`
47+
DOCKER_IP=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f12`
48+
49+
# check if both IPTables and the Route Table are OK.
50+
IPTABLES_PATCHED=`sudo iptables -S PREROUTING -t nat | grep -c 169.254.0.2`
51+
ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE`
52+
53+
if [ $RUNNING_ON_NOTEBOOK_INSTANCE -gt 0 ]; then
54+
55+
if [ $ROUTE_TABLE_PATCHED -eq 0 ]; then
56+
# fix routing
57+
sudo ip route add $DOCKER_NET via $DOCKER_IP dev $SAGEMAKER_INTERFACE table agent
58+
else
59+
echo "SageMaker instance route table setup is ok. We are good to go."
60+
fi
61+
62+
if [ $IPTABLES_PATCHED -eq 0 ]; then
63+
sudo iptables -t nat -A PREROUTING -i $SAGEMAKER_INTERFACE -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.0.2:9081
64+
echo "iptables for Docker setup done"
65+
else
66+
echo "SageMaker instance routing for Docker is ok. We are good to go!"
67+
fi
68+
fi
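
The script reads the bridge address with `cut -d" " -f12`, which depends on the exact spacing of the `ip route` output on the host. As a hedged alternative, the sketch below looks the address up by the `src` keyword instead of by field position. The function name `parse_route_line` and the sample line are illustrative assumptions, not part of the commit.

```python
def parse_route_line(line):
    """Return (network_cidr, source_ip) from one line of `ip route` output."""
    tokens = line.split()  # split() with no argument collapses runs of spaces
    cidr = tokens[0]
    src_ip = tokens[tokens.index("src") + 1]
    return cidr, src_ip


# Made-up sample line; the interface name and addresses are illustrative only.
sample = "172.18.0.0/16 dev br-0123456789ab  proto kernel  scope link  src 172.18.0.1"
docker_net, docker_ip = parse_route_line(sample)
print(docker_net, docker_ip)
```

Because the lookup is keyed on `src`, it gives the same result whether the fields are separated by one space or two.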
