Datasets
Last updated
Datasets belong to Organisations in Strong Compute, so it is important to have selected the appropriate Organisation from the menu at the top-right of the Control Plane page before navigating to the "Datasets" page. Datasets imported into Strong Compute will be accessible to all members of the Organisation to which that Dataset belongs.
On the Datasets page Users can import their own datasets (up to 100GB) from AWS S3 buckets by clicking on the "New Dataset" button and completing the form.
A new Dataset requires the following parameters:
Name (optional): A useful descriptor for the Dataset.
Access Key ID: A valid access key ID.
Secret Access Key: A valid secret access key that matches the above ID.
Endpoint: The endpoint for the S3 host of your bucket, e.g. s3.amazonaws.com.
Region: The region in which your bucket is located, e.g. us-east-1.
Bucket Name: The name of the S3 bucket without the leading protocol. For example, when importing a dataset from the bucket s3://hello-world, the input to this field is hello-world.
Your Access Key must have permission to list and read buckets and objects in S3, and the S3 bucket must be non-empty.
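The bucket-name rule above can be sketched as a small helper. This function is purely illustrative (it is not part of Strong Compute's tooling): it strips a leading s3:// protocol so that only the bare bucket name is submitted in the form.

```python
def normalize_bucket_name(bucket: str) -> str:
    """Strip a leading s3:// protocol so only the bare bucket name remains."""
    prefix = "s3://"
    if bucket.startswith(prefix):
        bucket = bucket[len(prefix):]
    # Trim any trailing slash left over from copy-pasting a bucket URI.
    return bucket.rstrip("/")

# For example, both inputs yield the value expected by the "Bucket Name" field:
print(normalize_bucket_name("s3://hello-world"))  # hello-world
print(normalize_bucket_name("hello-world"))       # hello-world
```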
Once your Dataset is created and validated, it will be automatically cached to the Strong Compute Global Silo. Your Dataset is finished caching to the Global Silo when the Global Silo Dataset Cache State is stored.
After your Dataset is cached to the Global Silo, it can be downloaded to a Constellation Cluster so that Users can access it in Containers and use it for training.
Once the Global Silo Dataset Cache for your Dataset shows its State as stored, select your dataset from the "User Dataset Name" menu and your destination Cluster from the "Constellation Cluster" menu and click "Cache Dataset". This will start the ISC downloading and creating a Constellation Cluster Dataset Cache of your Dataset.
Your Dataset will be ready to access in your Container and in training when the Constellation Cluster Dataset Cache of your Dataset shows its State is available.
User datasets will show Access is Private, indicating that Users can only access those Datasets from Containers associated with the Organisation that owns the Dataset. Users can also use any of the datasets cached on the Cluster which show Access is Public.
To access your Organisation's datasets or any of the Public datasets during development, select the appropriate Dataset from the "Mounted Dataset" menu for the appropriate Container before you start your Container.
Your dataset will then be mounted to your development Container at /data and available to you there during development. Once inside your Container, navigate to your dataset with cd /data.
To access your Organisation's datasets or any of the Public datasets during training, include a dataset_id field with the corresponding Dataset ID in your experiment launch file as follows.
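A minimal sketch of what that field might look like in a launch file is shown below. Only the dataset_id field is described by this page; the surrounding field names and the placeholder ID are assumptions for illustration, so check your own experiment launch file for the exact schema.

```toml
# Hypothetical experiment launch file fragment — only dataset_id is
# confirmed by this page; other fields and values are placeholders.
experiment_name = "my-training-run"
dataset_id = "00000000-0000-0000-0000-000000000000"  # replace with your Dataset ID
```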
Your Dataset will then be mounted to your Container at /data and available there to your training script during training.