2. Hello World training example

Let's launch our first "Hello World" experiment. To follow this guide, ensure you have accessed your container as described in Section 1.

In this guide we're going to train a custom Convolutional Neural Network (CNN) on the FashionMNIST dataset using the Strong Compute Instant SuperComputer (ISC).

2.1 Create a Project on Control Plane

Visit the "Projects" page on Control Plane (https://cp.strongcompute.ai). Click on "New Project" and give your new project a name such as "Hello World". Make a note of the ID of your new project, which you will need later.

All experiments launched on the ISC must be associated with a project which is used for tracking compute consumption and cost control. To successfully launch experiments you will need the help of your organisation owner or admins to ensure your organisation has sufficient credits and that any applied cost controls permit experiments to be launched under your project.

2.2 Install requirements

If you created your container from the StrongCompute/isc-demos image (Step 1.5), you can skip this step: the image already includes all of the requirements necessary to run the Strong Compute Hello World example. In general, though, you will need to think about installing the dependencies necessary for your project.

2.4 Clone the Strong Compute ISC Demos GitHub repository

In your terminal, run the following commands to clone the ISC Demos repo.

cd ~
git clone --depth 1 https://github.com/StrongResearch/isc-demos.git
cd ~/isc-demos

The ISC Demos repo includes a project subdirectory for our FashionMNIST example. Navigate to that subdirectory and inspect the requirements.txt file - all of these dependencies have already been installed to a Python virtual environment in your container at /opt/venv.

cd ~/isc-demos/fashion_mnist
cat requirements.txt

Notice that in addition to PyTorch and other dependencies, the requirements include a package called cycling_utils, installed from another GitHub repository. This is a repository developed by Strong Compute that offers simple, helpful utilities for saving and resuming your training from checkpoints.
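cycling_utils' exact API is documented in its repository; purely as an illustration (not the cycling_utils API itself), the save-and-resume pattern such utilities support looks roughly like the following sketch, here using plain JSON state:

```python
import json
import os
import tempfile

def save_checkpoint(state, path):
    # Write to a temporary file in the same directory, then atomically
    # rename it into place, so an interrupted run never leaves a
    # half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0}
```

A real training script would store model and optimiser state (e.g. via torch.save) rather than JSON; the atomic-rename resume pattern is the important part.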

2.5 Update the experiment launch file

Experiments are launched on the ISC using a TOML file which communicates important details of your experiment to the ISC. This file can be named anything you like. We suggest using the file extension .isc to distinguish it from other files.

Open the fashion_mnist launch file for editing with the following command (or open it for editing in VSCode).

cd ~/isc-demos/fashion_mnist
nano fashion_mnist.isc

Update the fashion_mnist.isc file with the ID of the Project you created above.

isc_project_id = "<project-id>"
experiment_name = "fashion_mnist"
gpus = 16
compute_mode = "burst"
dataset_id_list = ["uds-decorous-field-baritone-250513"]
command = '''
source /opt/venv/bin/activate &&
torchrun --nnodes=$NNODES --nproc-per-node=$N_PROC \
--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --node_rank=$NODE_RANK \
/root/isc-demos/fashion_mnist/train.py \
--dataset-id uds-decorous-field-baritone-250513 \
--lr 0.001 --batch-size 16'''

  • isc_project_id is a required field that must be a string containing the ID of the Project you created in Section 2.1.

  • experiment_name is a required field that must be a string and can be anything you like.

  • gpus is a required field that must be an integer between 1 and 48 inclusive and describes the number of GPUs you want to use for your experiment.

  • compute_mode must be a string and must be either "cycle" (the default if not specified in the experiment launch file), "interruptible", or "burst". For an explanation of these options and general ISC dynamics, see the Compute mode heading of Experiments under Basic Concepts. The experiment launch file comes set with compute_mode = "burst". Users wishing to launch their experiments on a running cluster instead (e.g. the Sydney Strong Compute cluster) may change this to compute_mode = "cycle".

  • dataset_id_list is an optional field that must be a list of unique strings corresponding to the IDs for Datasets that you have access to in Control Plane. This example is based on the FashionMNIST Open Dataset. For more information about Datasets see the Datasets section under Basics.

  • command is a required field that must be a string and describes the sequence of operations you want each node to perform when it is started to run your experiment. In this example, we are activating the Python virtual environment at /opt/venv and calling torchrun to start the distributed training routine described in train.py. Note that the torchrun arguments include --nnodes=$NNODES and --nproc-per-node=$N_PROC. These environment variables are set by the ISC based on the requested gpus and the number of GPUs per node in the cluster.
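To make the relationship between gpus, $NNODES, and $N_PROC concrete, here is a sketch of how a node count and per-node process count could be derived from the requested GPU count. This is an illustration only, assuming a hypothetical 8 GPUs per node; the ISC's actual scheduling logic may differ:

```python
def torchrun_topology(gpus, gpus_per_node=8):
    # Derive torchrun's --nnodes and --nproc-per-node values from the
    # requested GPU count. gpus_per_node=8 is an assumed cluster shape.
    if gpus > gpus_per_node and gpus % gpus_per_node != 0:
        raise ValueError("multi-node jobs should use whole nodes")
    nnodes = max(1, gpus // gpus_per_node)
    n_proc = min(gpus, gpus_per_node)
    return nnodes, n_proc
```

With gpus = 16 as in the launch file above, this yields two nodes running eight processes each.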

2.6 Launch an experiment

Launch your experiment by running the following commands.

cd ~/isc-demos/fashion_mnist
isc train fashion_mnist.isc

You will receive the following response in your terminal.

Using credentials file /root/credentials.isc
[notice] Container has been paused while we prepare your experiment. Please wait...
[notice] Container has been unpaused

Note that your container is "paused" - locked for editing - while a copy of your container image is made for running your experiment. After the experiment image is created, your container is unpaused and you can once again make changes to files in your container. Your experiment image will now begin exporting to cloud storage.

Visit the Experiments page in Control Plane and click the "Burst" button for the experiment. The "Burst" button will be enabled after the experiment image has finished exporting to cloud storage - this can take several minutes depending on the size of your experiment image.

The ISC will now proceed to provision a suitable cluster somewhere in the world with enough GPUs to run your experiment (described in your experiment launch file). This process involves several steps, such as initialising the cluster, downloading your container and any datasets you require (in this case the FashionMNIST dataset) and running your experiment.

Please note that users launching experiments on a running cluster (e.g. the Sydney Strong Compute cluster) in compute_mode = "cycle" or compute_mode = "burst" will receive a different success message, and these experiments will launch without the need for further action in Control Plane. For more information about compute modes and their behaviour, see the Launching Experiments section of these docs.

Track the progress of these steps by clicking the "Details" button for the experiment and observe the Performance Logs on the right side of the page.

You can also track the status of your experiment from the container terminal by running the following command.

isc experiments

A report like the following will be displayed in your terminal.

                               ISC Experiments                                                                                                                          
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃  Experiment ID  ┃ Name          ┃ Created              ┃ GPUs ┃ Compute Mode   ┃ Cycle Count ┃ Status   ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ <experiment-id> │ fashion_mnist │ YYYY-MM-DD HH:MM:SS  │ 16   │ <compute-mode> │ ...         │ ...      │
└─────────────────┴───────────────┴──────────────────────┴──────┴────────────────┴─────────────┴──────────┘

Once this initialisation process is complete, the experiment Status will change to running.

2.7 Synchronising experiment artifacts

Outputs from your experiment such as print (standard out) logs, tensorboard logs, and checkpoints are saved in Artifacts. All experiments will generate "logs" type Artifacts, and may generate other Artifacts if the experiment is properly configured by the user. See the Artifacts section of these docs for more information on properly configuring experiments to generate Artifacts.

To download artifacts from your experiment to your workstation, visit the Experiments page on Control Plane https://cp.strongcompute.ai and click on the "Details" button for your experiment, then click "Sync to workstation" for each artifact you want to download. The three artifact types that are important for this experiment are as follows.

Logs

Logs artifacts contain text files for each node running the experiment (e.g. rank_N.txt) with anything printed to standard out or standard error. The logs artifact should be the first place to look for information to assist in debugging training code. Updates to the logs artifact are synchronised from running experiments every 10 seconds, and at the end of training (e.g. at the end of a 90 second cycle experiment).
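Once the logs artifact has synced into your container, a quick first pass when debugging is to scan the tail of each rank's log file. A minimal sketch (the helper name is our own, not part of any ISC tooling):

```python
import glob
import os

def tail_rank_logs(logs_dir, n=5):
    # Return the last n lines of each rank's log file (rank_N.txt).
    # Useful as a first look when a distributed run misbehaves.
    tails = {}
    for path in sorted(glob.glob(os.path.join(logs_dir, "rank_*.txt"))):
        with open(path) as f:
            tails[os.path.basename(path)] = f.readlines()[-n:]
    return tails
```

For example, pointing it at the synced logs directory under /shared/artifacts would show the last few lines printed by each node.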

Checkpoints

Checkpoint artifacts are intended to contain larger files such as model weights. Updates to checkpoint artifacts are synchronised from running experiments every 10 minutes, and at the end of training (e.g. at the end of a 90 second cycle experiment). Note that the ISC sets the CHECKPOINT_ARTIFACT_PATH environment variable on each experiment node, and train.py uses it as the path for saving model checkpoints.
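In a training script, a convenient way to pick up this path is to read the environment variable with a local fallback, so the same script also runs outside the ISC. A minimal sketch (the fallback directory is our own choice, not an ISC convention):

```python
import os

def checkpoint_dir(default="./checkpoints"):
    # On ISC experiment nodes, CHECKPOINT_ARTIFACT_PATH points at the
    # directory that is synchronised into the checkpoint artifact.
    # Outside the ISC the variable is unset, so fall back to a local dir.
    path = os.environ.get("CHECKPOINT_ARTIFACT_PATH", default)
    os.makedirs(path, exist_ok=True)
    return path
```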

Lossy

Lossy artifacts are intended to contain smaller files that are updated more frequently, such as tensorboard logs. Updates to lossy artifacts are synchronised from running experiments every 30 seconds, and at the end of training (e.g. at the end of a 90 second cycle experiment). Note that the ISC sets the LOSSY_ARTIFACT_PATH environment variable on each experiment node, and train.py uses it as the path for saving tensorboard logs.
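Tensorboard logging follows the same pattern: resolve LOSSY_ARTIFACT_PATH with a local fallback and hand the result to the writer. A sketch (the SummaryWriter usage shown in the comment assumes the tensorboard package is installed, as it is in the /opt/venv environment):

```python
import os

def lossy_log_dir(default="./runs"):
    # The ISC sets LOSSY_ARTIFACT_PATH on experiment nodes; fall back
    # to a local directory when running outside the ISC.
    return os.environ.get("LOSSY_ARTIFACT_PATH", default)

# Typical use with tensorboard:
#   from torch.utils.tensorboard import SummaryWriter
#   writer = SummaryWriter(log_dir=lossy_log_dir())
#   writer.add_scalar("loss/train", 0.25, global_step=100)
```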

Accessing artifacts in your container

After the artifact has downloaded to your workstation, the contents of the artifact will be available to retrieve from the following location inside your container.

/shared/artifacts/<experiment-id>/<type>

When you click "Sync to Workstation", the experiment artifacts are downloaded in their state as at that moment in time. If the experiment is still running, you will need to click "Sync to Workstation" again to update the artifacts with latest changes from your running experiment.

2.8 Launch tensorboard

To launch the tensorboard view of logs generated by your experiment, first download the "lossy" logs to your workstation, then run the following command in your container terminal.

tensorboard --logdir /shared/artifacts/<experiment-id>/lossy

Enter the following URL in your browser to view your tensorboard.

http://localhost:6006/

Your tensorboard will resemble the following.

Congratulations, you have successfully launched and tracked your first distributed training experiment on the ISC!
