
Datasets


Datasets belong to Organisations in Strong Compute, so it is important to have selected the appropriate Organisation from the menu at the top-right of the Control Plane page before navigating to the "Datasets" page. Datasets imported into Strong Compute will be accessible to all members of the Organisation to which that Dataset belongs.

Step 1: Adding a custom dataset

On the Datasets page, you can import your own dataset (up to 100GB) from an S3-compatible bucket by clicking on the "New Dataset" button and completing the form.

Note: if your dataset is not yet on S3 (e.g. a Hugging Face dataset), here's a friendly guide to convert it!
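
If you prefer to script the conversion yourself, here's a minimal sketch (our own illustration, not the linked guide) that materialises a Hugging Face dataset to local files using the datasets library, ready to upload to your S3-compatible bucket. The dataset name and output directory below are placeholders.

```python
from datasets import load_dataset

# Placeholder dataset name - substitute the Hugging Face dataset you want to import.
ds = load_dataset("imdb")

# Write the dataset splits to a local directory as Arrow files,
# ready to upload to your S3-compatible bucket.
ds.save_to_disk("./imdb_export")
```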

Example walkthrough with Cloudflare R2 (free for <10GB!)

You can use any S3-compatible provider - we've found R2 to be a decent value option!

  1. Sign up for a free Cloudflare account.

  2. Navigate to R2 Object Storage in the sidebar.

  3. Click on the "Create Bucket" button.

Give it a relevant name - this is how it'll show up on our platform.

You can leave all else default and hit "Create", unless you have a preference!

  4. Bucket Created! Time to upload files - click on the name you gave it.

Click on the "Select from Computer" button. (Or, use the API if you prefer).

Note: You can grab Hugging Face dataset files from `~/.cache` if you've used the dataset locally. Avoid doing this on Strong Compute, though, as exceeding your storage allowance (75GB) will break your account - which is likely with big datasets. To check your usage, run: df -h | grep 75G

  5. Open the Settings pane; note your S3 URL - we'll use this in a moment.

The S3 URL should be visible near the top of the page.

  6. Finally, create an API token for access.

Again, the defaults are fine - but it's good practice to set a TTL expiry.

  7. Success! Note your Access Key ID and Secret Access Key, then proceed with the steps below :D
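
If you prefer the API route mentioned in step 4 over uploading through the browser, here's a minimal sketch using boto3 against an R2 endpoint. The endpoint URL, bucket name, file paths, and credentials are all placeholders - substitute the values from your own bucket and token.

```python
import boto3

# Placeholder endpoint and credentials - use the S3 URL from your bucket's
# Settings pane and the Access Key ID / Secret Access Key from your API token.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<account-id>.r2.cloudflarestorage.com",
    aws_access_key_id="<access-key-id>",
    aws_secret_access_key="<secret-access-key>",
    region_name="auto",  # R2 typically uses "auto" as its region
)

# Upload a local file into the bucket you created above.
s3.upload_file("path/to/train.tar", "<your-bucket-name>", "train.tar")
```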

A new Dataset requires five parameters, plus an optional Name:

  • Name (optional): A useful descriptor for the Dataset.

  • Access Key ID: A valid access key ID.

  • Secret Access Key: A valid secret access key that matches the above ID.

  • Endpoint: The endpoint for the S3 host of your bucket, e.g. s3.amazonaws.com.

  • Region: The region in which your bucket is located e.g. us-east-1.

  • Bucket Name: The name of the S3 bucket without the leading protocol. For example, when importing a dataset from the bucket s3://hello-world, the input to this field is hello-world.

Your Access Key must have permission to read and access buckets and objects from S3, and the S3 bucket must be non-empty.
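
If you want to sanity-check your credentials before submitting the form, here's a minimal sketch with boto3 that lists a few objects using the same endpoint, region, bucket, and keys you plan to enter. All values shown are placeholders.

```python
import boto3

# Placeholder values - mirror exactly what you will enter in the New Dataset form.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.amazonaws.com",
    region_name="us-east-1",
    aws_access_key_id="<access-key-id>",
    aws_secret_access_key="<secret-access-key>",
)

# The Access Key must be able to read the bucket, and the bucket must be non-empty.
response = s3.list_objects_v2(Bucket="hello-world", MaxKeys=5)
print([obj["Key"] for obj in response.get("Contents", [])])
```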

Once your Dataset is created and validated, it will be automatically cached to the Strong Compute Global Silo. Your Dataset has finished caching to the Global Silo when the Global Silo Dataset Cache shows its State as stored.

Step 2: Downloading it to our Cluster

After your Dataset is cached to the Global Silo, it can be downloaded to a Constellation Cluster so that Users can access it in Containers and for training.

Once the Global Silo Dataset Cache for your Dataset shows its State as stored, select your dataset from the "User Dataset Name" menu and your destination Cluster from the "Constellation Cluster" menu and click "Cache Dataset". This will start the ISC downloading and creating a Constellation Cluster Dataset Cache of your Dataset.

Your Dataset will be ready to access in your Container and in training when the Constellation Cluster Dataset Cache of your Dataset shows its State is available.

User datasets will show Access as Private, indicating that Users can only access those Datasets from Containers associated with the Organisation that owns the Dataset. Users can also use any of the datasets cached on the Cluster which show Access as Public.

Step 3. Accessing datasets in development

To access your Organisation's datasets or any of the Public datasets during development, select the appropriate Dataset from the "Mounted Dataset" menu for the appropriate Container before you start your Container.

Your dataset will then be mounted to your development container at /data/<dataset-id> and available there to you during development. Once inside your Container, navigate to your dataset with cd /data/<dataset-id>.

Step 4. Accessing datasets in training

You can access any of your Organisation's datasets or any of the Public datasets during training, including multiple within the same training script. Include a dataset_id_list field in your experiment launch file with the dataset IDs as an array of unique strings, as follows.

dataset_id_list = ["<dataset1-id>","<dataset2-id>","<dataset3-id>"]

After your script has launched, your Datasets will be mounted to your Container at /data/<dataset-id> and available to your code during training.
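
As a rough illustration of how training code can use a mounted Dataset, here's a minimal sketch that walks the mount path. The dataset ID is a placeholder, and your actual loading logic will depend on the files in your bucket.

```python
from pathlib import Path

# Placeholder dataset ID - use the ID shown on the Datasets page, matching an
# entry in dataset_id_list from your experiment launch file.
DATASET_ID = "<dataset1-id>"
data_root = Path("/data") / DATASET_ID

# List the files cached from your S3 bucket so your training code can load them.
for path in sorted(data_root.rglob("*")):
    print(path)
```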
