Datasets

Datasets belong to Organisations in Strong Compute, so it is important to have selected the appropriate Organisation from the menu at the top-right of the Control Plane page before navigating to the "Datasets" page. Datasets imported into Strong Compute will be accessible to all members of the Organisation to which that Dataset belongs.

HuggingFace

HuggingFace models and datasets can be ingested as Datasets in Strong Compute.

This is strongly recommended instead of downloading large models and datasets directly to your container, which can cause your container to be slow to start and stop.

Step 1: Adding a custom dataset

Datasets up to 1TB can be imported into Strong Compute from either an S3-compatible bucket or from a Hugging Face repository (model or dataset).

To get started, click on the Datasets tab in Control Plane and click on the "New Dataset" button at the bottom of the User Datasets table.

For datasets in an S3-compatible bucket, click on the "S3" button. To create a Dataset from a Hugging Face repository, click on "Hugging Face".

Note: if your dataset is not yet on S3, here's a friendly guide on how to upload it to Cloudflare R2.

Example setup of S3 bucket on Cloudflare R2

You can use any S3-compatible provider - we've found R2 to be a decent value option!

  1. Sign up for a Cloudflare account - e.g., at this link.

  2. Navigate to R2 Object Storage in the sidebar (or click here).

  3. Click on the "Create Bucket" button.

Give it a relevant name - this is how it'll show up on our platform.

You can leave everything else as default and hit "Create", unless you have a preference!

  4. Bucket created! Time to upload files - click on the name you gave it.

Click on the "Select from Computer" button. (Or, use the API if you prefer).

Note: If you have used Hugging Face locally, you can grab the dataset files from `~/.cache`. Avoid downloading this way inside Strong Compute, as exceeding your storage allowance (75GB) - likely with big datasets - can break your account. To check usage, run: df -h | grep 75G

  5. Open the Settings pane; note your S3 URL - we'll use this in a moment.

The S3 URL should be visible near the top of the page.

  6. Finally - create an API token for access (link here).

Again, the defaults are fine - but it's good practice to set a TTL expiry.

  7. Success! Note your Access ID and Secret Key, then proceed with the steps below :D
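If you'd rather upload via the API than the browser, here's a minimal sketch using boto3. Everything below is a placeholder - swap in the endpoint from your bucket's Settings pane, your bucket name, and the Access ID / Secret Key from your API token.

import boto3

# Minimal sketch: upload a local file to an S3-compatible bucket (e.g. Cloudflare R2).
# All values are placeholders - substitute the endpoint, credentials, bucket and file
# paths you noted in the steps above.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<letters-and-numbers>.r2.cloudflarestorage.com",  # from the bucket's Settings pane
    aws_access_key_id="<access-id>",          # from the API token you created
    aws_secret_access_key="<secret-key>",
)

# Upload one file; the third argument is the object key (its path inside the bucket).
s3.upload_file("path/to/local/data.tar", "<bucket-name>", "data.tar")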

Complete the relevant form for your dataset, including access credentials, and click "Add Dataset".

Note: For datasets in Cloudflare R2 storage, your S3 endpoint will look as follows.

https://<letters-and-numbers>.r2.cloudflarestorage.com/<bucket-name>

The Endpoint extracted from this URL is <letters-and-numbers>.r2.cloudflarestorage.com (i.e. without the leading https:// and the trailing /<bucket-name>).

Note: If you are using a provider that does not support regions or that incorporates the region into the Endpoint URL (e.g. Cloudflare R2, OCI Storage), then leave the Region field blank.
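If it helps, you can pull both pieces out of the S3 URL programmatically - a quick sketch (the URL below is a made-up example in the R2 format):

from urllib.parse import urlparse

# Sketch: split an S3 URL into the Endpoint (host only) and the bucket name.
# The URL below is a made-up example.
s3_url = "https://abc123def456.r2.cloudflarestorage.com/my-bucket"
parsed = urlparse(s3_url)

endpoint = parsed.netloc          # abc123def456.r2.cloudflarestorage.com -> paste into the Endpoint field
bucket = parsed.path.lstrip("/")  # my-bucket -> your bucket name
print(endpoint, bucket)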

For datasets in S3 buckets, your credentials must have permission to access buckets and objects from S3, and the S3 bucket must be non-empty.

For datasets from Hugging Face, your Hugging Face Token must have permission to access the model or dataset.

The Hugging Face Repo ID can be found at the top of the page for the model or dataset.
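If you want to sanity-check the token before creating the Dataset, here is a minimal sketch using the huggingface_hub library (run it anywhere outside Strong Compute; the token and repo ID are placeholders):

from huggingface_hub import HfApi

# Sketch: confirm a Hugging Face token can access a repo before creating the Dataset.
# The token and repo ID are placeholders.
api = HfApi(token="<your-hf-token>")
info = api.model_info("<namespace>/<repo-name>")  # use api.dataset_info(...) for a dataset repo
print(info.id)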

Once your Dataset is created and validated, it will be automatically cached to the Strong Compute Global Silo. Your Dataset has finished caching to the Global Silo when the Global Silo Dataset Cache shows its State as stored.

Step 2: Downloading your Dataset to a Cluster

After your Dataset is cached to the Global Silo, it can be downloaded to a Constellation Cluster so that Users can access it in Containers and in training.

Once the Global Silo Dataset Cache for your Dataset shows its State as stored, select your dataset from the "User Dataset Name" menu and your destination Cluster from the "Constellation Cluster" menu, then click "Cache Dataset". This will start the ISC downloading your Dataset and creating a Constellation Cluster Dataset Cache for it.

Your Dataset will be ready to access in your Container and in training when the Constellation Cluster Dataset Cache of your Dataset shows its State is available.

User Datasets show Access as Private, indicating that Users can only access those Datasets from Containers associated with the Organisation that owns the Dataset. Users can also use any of the Datasets cached on the Cluster that show Access as Public.

Step 3: Accessing datasets in development

To access your Organisation's Datasets or any of the Public Datasets during development, select the appropriate Dataset from the "Mounted Dataset" menu for the appropriate Container before you start your Container.

Your dataset will then be mounted to your development container at /data/<dataset-id> and available there to you during development. Once inside your Container, navigate to your dataset with cd /data/<dataset-id>.
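For example, here is a quick sketch for exploring a mounted Dataset from Python (the dataset ID is a placeholder; whether the files can be loaded directly with a Hugging Face from_pretrained-style call depends on the layout of the ingested repository, so treat that part as an assumption):

from pathlib import Path

# Sketch: inspect a Dataset mounted into a development container.
# "<dataset-id>" is a placeholder - use the ID shown on the Datasets page.
data_dir = Path("/data/<dataset-id>")
for item in sorted(data_dir.iterdir()):
    print(item)

# If the Dataset was ingested from a Hugging Face model repo and preserves its layout,
# it may be loadable straight from the mount, e.g.:
#   from transformers import AutoModel
#   model = AutoModel.from_pretrained(str(data_dir))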

Step 4: Accessing datasets in training

You can access any of your Organisation's Datasets or any of the Public Datasets during training, including multiple Datasets within the same training script. Include a dataset_id_list field in your experiment launch file with the dataset IDs as an array of unique strings, as follows.

dataset_id_list = ["<dataset1-id>","<dataset2-id>","<dataset3-id>"]

After your script has launched, your Datasets will be mounted to your Container at /data/<dataset-id> and available to your code during training.
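Inside your training script you can then build the mount paths from the same IDs, for example (a sketch; the IDs are placeholders and must match dataset_id_list in your launch file):

import os

# Sketch: resolve mounted Dataset paths inside a training script.
# The IDs are placeholders and must match dataset_id_list in the experiment launch file.
dataset_ids = ["<dataset1-id>", "<dataset2-id>", "<dataset3-id>"]
dataset_paths = [os.path.join("/data", dataset_id) for dataset_id in dataset_ids]

for path in dataset_paths:
    assert os.path.isdir(path), f"Expected mounted Dataset at {path}"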

Clear your Hugging Face cache before launching experiments or stopping your container

The benefit of ingesting Hugging Face models and datasets as Datasets in Strong Compute is that these data do not contribute to the size of your container. This is desirable for keeping your container as small as possible, thereby allowing your container to start and stop as quickly as possible.

However, when models and datasets are loaded directly from Hugging Face they cache by default to a directory inside your container which is typically at ~/.cache/huggingface. We strongly recommend deleting this directory before either:

  • Launching experiments, or

  • Stopping your container.

Both of these actions will trigger a backup of your container to cloud storage, including the contents of ~/.cache/huggingface, thus committing it to your container. To avoid this, we recommend running the following command before launching experiments (with isc train) or stopping your container.

rm -rf ~/.cache/huggingface
