# Launching Experiments

## Experiments page

You can view all experiments associated with your organisation on the **"Experiments"** page in Control Plane (<https://cp.strongcompute.ai>) and filter experiments by **mode** (compute mode) and **status**. From the Experiments page you can view the 100 most recently printed lines of the master node log file for each experiment, and cancel experiments.

Please note that, from the Experiments page, anyone in an organisation can view or cancel experiments launched by anyone else within the same organisation.

<figure><img src="/files/NTitoR8uLWMR7vYYwtwa" alt=""><figcaption></figcaption></figure>

## Experiment launch file

Experiments are launched using an **experiment launch file** in TOML format which communicates important details of your experiment to the ISC. This file can be named anything you like; by convention we suggest the file extension `.isc` for distinction. An example of such a file is shown below.

{% code overflow="wrap" lineNumbers="true" fullWidth="false" %}

```toml
isc_project_id = "<your-project-id>"
experiment_name = "foo_experiment"
gpus = 16
compute_mode = "burst"
command = '''
source /opt/venv/bin/activate && 
cd /root/isc-demos/fashion_mnist/ && 
torchrun --nnodes=$NNODES --nproc-per-node=$N_PROC \
--master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --node_rank=$NODE_RANK \
train.py --lr 0.001 --batch-size 16 --save-dir $OUTPUT_PATH --tboard-path $OUTPUT_PATH/tb'''
```

{% endcode %}

* **`isc_project_id`** is <mark style="color:green;">**required**</mark> and can be obtained from the Projects page in Control Plane.
* **`experiment_name`** is <mark style="color:green;">**required**</mark> and can be any string you like.
* **`gpus`** is <mark style="color:green;">**required**</mark> and must be an integer between 1 and 72 inclusive, describing the number of GPUs you want to use for your experiment.
* **`command`** is <mark style="color:green;">**required**</mark> and describes the operation that each node will execute when started, typically including, as a minimum, sourcing your virtual environment and launching a training script.
* **`compute_mode`** is <mark style="color:orange;">**optional**</mark> and must be "**cycle**" (default), "**interruptible**", or "**burst**". See below for more information about these compute modes.
* **`max_rapid_cycles`** is <mark style="color:orange;">**optional**</mark> and must be an integer describing, for experiments with `compute_mode="cycle"`, the number of times the experiment will cycle before completing. See below for more information about cycle mode.
* **`dataset_id_list`** is <mark style="color:orange;">**optional**</mark> and must be a list of Dataset IDs in quotation marks. E.g: `dataset_id_list = [ "dataset-id" ]`. These will be available within your container at runtime at `/data/<dataset-id>`.
* **`burst_shape_priority_list`** is <mark style="color:orange;">**optional**</mark> and must be a list of Burst Shape IDs in quotation marks. E.g: `burst_shape_priority_list = [ "gcp-desired-shape" ]`\
  This field should only be specified when `compute_mode="burst"`. See below for more information about this optional argument.
* **`input_artifact_id_list`** is <mark style="color:orange;">**optional**</mark> and must be a list of at most 3 artifact IDs in quotation marks. E.g. `input_artifact_id_list = [ "<artifact-id>", "<artifact-id>", "<artifact-id>" ]`.
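
Combined, a launch file using several of the optional fields might look like the following. All IDs shown are placeholders; substitute real values from Control Plane.

{% code overflow="wrap" %}

```toml
isc_project_id = "<your-project-id>"
experiment_name = "foo_experiment"
gpus = 16
compute_mode = "cycle"
max_rapid_cycles = 3
dataset_id_list = [ "<dataset-id>" ]
input_artifact_id_list = [ "<artifact-id>" ]
command = '''
source /opt/venv/bin/activate && 
cd /root/isc-demos/fashion_mnist/ && 
python train.py --save-dir $OUTPUT_PATH'''
```

{% endcode %}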

## Artifacts (Job Results)

For information about retrieving experiment artifacts see the [Artifacts (Job Results)](/basic-concepts/artifacts.md) page.&#x20;

## Compute modes

### Burst

Experiments launched with `compute_mode="burst"` will have a dedicated cluster provisioned on a commercial cloud and will run uninterrupted for as long as the user allows or until available credits are exhausted.

Commercial clouds offer a number of different types of compute node, equipped with different types and numbers of GPUs. These types of commercial cloud compute node are referred to as "**shapes**". Each shape of compute node is typically a unique combination of the following.

* Commercial cloud provider,
* Geographic region,
* Processor type,
* Type and number of GPUs,
* Provisioning model (spot / on-demand).

When an experiment is launched with `compute_mode="burst"` the ISC will search for a cluster on a commercial cloud with a suitable **shape**. Users can also specify their preferred shapes by including the following in their experiment launch file.&#x20;

{% code overflow="wrap" %}

```toml
burst_shape_priority_list = ["<shape-1-id>", "<shape-2-id>", ...]
```

{% endcode %}

Users can obtain the necessary `shape-id` values from the **Burst** page on Control Plane. If no valid `shape-id` can be parsed from the `burst_shape_priority_list` argument, or if the User does not include the `burst_shape_priority_list` argument in the experiment launch file, the ISC will try all shapes listed on the **Burst** page on Control Plane. Each shape is charged at a different rate.

**Note:** Burst experiments are currently limited to a **maximum of 48 GPUs**. Attempts to launch `burst` experiments with more than 48 GPUs will return an error in the terminal.

### Cycle (Sydney Strong Compute Cluster only)

Experiments with `compute_mode="cycle"` will interrupt any currently running experiment with `compute_mode="interruptible"` and take priority. Cycle-mode experiments alternate between `running` (for 90 seconds) and `paused` a number of times. Users can specify how many times their `cycle` mode experiment should cycle using the `max_rapid_cycles` argument in the experiment launch file, with a minimum of 1 and a maximum of 5 cycles.

If your experiment does not return an error during cycling and resuming, the experiment status will show "<mark style="color:green;">**completed**</mark>". If your experiment returns an error within that time your experiment status will show "<mark style="color:red;">**failed**</mark>". The purpose of this compute mode is to provide immediate developer feedback on code viability, and to verify that your experiment is able to successfully pause and resume.
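
Since `cycle` mode verifies that an experiment can pause and resume, training scripts should write checkpoints to `$CHECKPOINT_ARTIFACT_PATH` and reload them on start. Below is a minimal sketch of that pattern; the filename `isc_demo_checkpoint.pkl`, the use of `pickle`, and the local-directory fallback are illustrative assumptions, not part of the ISC API.

```python
import os
import pickle
import tempfile

# The fallback directory is an assumption so this sketch also runs outside
# the ISC; on a cluster, $CHECKPOINT_ARTIFACT_PATH resolves to /mnt/checkpoints.
CKPT_DIR = os.environ.get("CHECKPOINT_ARTIFACT_PATH", tempfile.gettempdir())
CKPT_FILE = os.path.join(CKPT_DIR, "isc_demo_checkpoint.pkl")

def save_checkpoint(state):
    """Write training state atomically so a pause never leaves a partial file."""
    tmp = CKPT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT_FILE)  # atomic rename on POSIX filesystems

def load_checkpoint():
    """Return previously saved state, or None on the first cycle."""
    if not os.path.exists(CKPT_FILE):
        return None
    with open(CKPT_FILE, "rb") as f:
        return pickle.load(f)

# Resume from the last saved epoch if a checkpoint exists, else start fresh.
state = load_checkpoint() or {"epoch": 0}
for epoch in range(state["epoch"], 3):
    # ... one epoch of training would go here ...
    save_checkpoint({"epoch": epoch + 1})
```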

When an experiment is launched with `compute_mode="cycle"`, any changes made inside the container since it was started or restarted are committed to a temporary fork of the container image, which the experiment will then run on. If large changes have been made inside the container - e.g. installing heavy dependencies like `torch` into a virtual environment - this commit can take a long time.

**The ISC will** <mark style="color:yellow;">reject</mark> **any experiment launched with `compute_mode="cycle"` which takes longer than 2 minutes to complete the container image commit.**&#x20;

It is therefore recommended that Users **restart** their container via the **Stop or Restart** window on [Control Plane](/basic-concepts/workstations-images-and-containers.md), or with the [ISC CLI](/basic-concepts/isc-commands-cli.md) command `isc container restart`, before launching experiments with `compute_mode="cycle"` to ensure short commit times.

### Interruptible (Sydney Strong Compute Cluster only)

Experiments with `compute_mode="interruptible"` launch into a queue which behaves as follows. Every 2 hours the ISC enumerates the active interruptible experiments and apportions the next 2-hour period to them in contiguous blocks. The interruptible experiments then run in order for their apportioned time unless interrupted by an experiment with `compute_mode="cycle"`. If interrupted, an interruptible experiment waits until there are no further cycle experiments enqueued before resuming and completing its apportioned time. Interruptible experiments cycle in this manner indefinitely until the experiment completes or an error is encountered.
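
As a rough illustration of the scheduling arithmetic, the sketch below divides the 2-hour period evenly between active experiments. Equal shares is an assumption of this sketch; the exact apportioning policy is determined by the ISC.

```python
# Illustrative only: assumes the 2-hour period is split evenly between
# the active interruptible experiments.
PERIOD_SECONDS = 2 * 60 * 60  # 7200

def apportion(num_experiments):
    """Return each experiment's contiguous block of run time, in seconds."""
    share = PERIOD_SECONDS // num_experiments
    return [share] * num_experiments

blocks = apportion(3)  # three active experiments -> 2400 s (40 min) each
```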

## Environment variables

When your experiment is launched to run on the ISC, a copy of your container is started on each node, and a number of environment variables are set which are helpful for coordinating distributed computing operations. Environment variables can be accessed from within training scripts during training, or from within experiment launch file arguments (as above).&#x20;
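
For example, a training script can read these variables with Python's standard library. The default values below are placeholders so the snippet also runs outside a cluster; on the ISC each variable is set for you.

```python
import os

# Set by the ISC on every node at launch; defaults are local placeholders.
nnodes = int(os.environ.get("NNODES", "1"))
n_proc = int(os.environ.get("N_PROC", "1"))
node_rank = int(os.environ.get("NODE_RANK", "0"))
master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
output_path = os.environ.get("OUTPUT_PATH", "/tmp")

world_size = nnodes * n_proc  # total processes across the cluster
print(f"node {node_rank} of {nnodes}, world size {world_size}, master at {master_addr}")
```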

#### **$OUTPUT\_PATH / $CHECKPOINT\_ARTIFACT\_PATH**

Resolves to `/mnt/checkpoints`.

Synchronises every 10 minutes.

Useful for saving large files such as model weights.

#### **$CRUD\_ARTIFACT\_PATH**

Resolves to `/mnt/crud`.

Synchronises every 10 minutes.

Useful for saving large files such as database images.

#### **$LOSSY\_ARTIFACT\_PATH**

Resolves to `/mnt/lossy`.

Synchronises every 30 seconds.

Useful for small files such as tensorboard logs.

#### **$LOG\_ARTIFACT\_PATH**

Resolves to `/mnt/logs`.

Synchronises every 10 seconds.

Only intended for use by Strong Compute processes, not for users.

#### $STRONG\_EXPERIMENT\_ID

The Experiment ID of the currently running experiment. This is automatically generated by the ISC when the experiment is submitted for launch.

#### $STRONG\_EXPERIMENT\_NAME

The user-provided name for the experiment, as given by the `experiment_name` argument in the experiment launch file.

#### $STRONG\_COMPUTE\_MODE

Compute mode of the currently running experiment, corresponding to the `compute_mode` argument in the experiment launch file.

#### $STRONG\_CYCLE\_COUNT

The number of times this experiment has been started or resumed. For example, an experiment with `compute_mode="interruptible"` will have its `STRONG_CYCLE_COUNT` incremented for each 2-hour period in which it is allowed to continue to cycle, and when resuming after being interrupted by an experiment with `compute_mode="cycle"`.

#### $STRONG\_CYCLE\_TIME\_MS

The amount of time allocated to the experiment to run as scheduled. For experiments with `compute_mode="cycle"` this will typically be 90 seconds (90,000 ms). For experiments with `compute_mode="interruptible"` this will be the share of the current 2-hour period apportioned to the experiment. Notwithstanding the allocated `$STRONG_CYCLE_TIME_MS`, experiments with `compute_mode="interruptible"` will still be interrupted by experiments with `compute_mode="cycle"`.

#### $IP\_TO\_RANK\_MAPPING

A dictionary mapping the IP addresses of each node in the cluster to the `$RANK` of that machine in the cluster.

#### $NODE\_RANK

An integer index for each node in the cluster.

#### $MASTER\_ADDR

The IP address for the node in the cluster with `$RANK=0`.

#### $MASTER\_PORT

The available port associated with the IP address of the node in the cluster with `$RANK=0`.

#### $NNODES

The number of nodes that are running this experiment.

Derived from the `gpus` argument in the experiment launch file and used in conjunction with the `$N_PROC` variable, `$NNODES` describes the number of **nodes** the experiment is launched on so that the **total number of processes** started on the cluster equals `gpus`. This is helpful for launching distributed training with popular utilities such as `torchrun` and HuggingFace `accelerate`.

#### $N\_PROC

Derived from the `gpus` argument in the experiment launch file and used in conjunction with the `$NNODES` variable, `$N_PROC` describes the number of **processes** started on each node so that the **total number of processes** started on the cluster equals `gpus`. This is helpful for launching distributed training with popular utilities such as `torchrun` (as above).
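
The invariant is simply that `$NNODES * $N_PROC` equals `gpus`. The sketch below illustrates one way such a split could work; the 8-GPUs-per-node figure and the `split` helper are hypothetical, since the ISC computes these values for you.

```python
def split(gpus, gpus_per_node=8):
    """Hypothetical split of `gpus` across nodes with `gpus_per_node` GPUs each."""
    nnodes = -(-gpus // gpus_per_node)  # ceiling division
    n_proc = gpus // nnodes             # exact when gpus divides evenly
    return nnodes, n_proc

nnodes, n_proc = split(16)  # 16 GPUs -> 2 nodes running 8 processes each
assert nnodes * n_proc == 16
```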


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.strongcompute.com/basic-concepts/launching-experiments.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
