Training & Best Practices

This section outlines recommended practices for developing, training, and managing AI workloads on Rackrr compute instances.

Rackrr virtual machines should be treated as dedicated compute environments for your workload, similar to working on a local or on-premises server.

Working Inside Your VM

Once connected to your VM, you can:

  • Clone Git repositories
  • Create directories and manage files
  • Download datasets and dependencies
  • Install libraries and frameworks
  • Run long-running training jobs

For project-based work, we strongly recommend keeping your code under version control on a hosting service (e.g. GitHub or GitLab) so it can be easily cloned into the VM and reproduced across environments.
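A typical first step after connecting is to pull your project into the VM. The repository URL and directory name below are placeholders; substitute your own:

```shell
# Clone your project into the VM (URL is a placeholder)
git clone https://github.com/<your-org>/<your-project>.git
cd <your-project>

# On later sessions, pull the latest changes instead
git pull
```

This keeps the VM copy in sync with the version history on your hosting service, so a fresh VM can reproduce the same code state with a single clone.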

Managing Datasets

Small to Medium Datasets

Datasets can be downloaded directly onto the VM using a browser or command-line tools such as wget or curl.
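For example, a dataset archive can be fetched and unpacked from the VM terminal. The URL below is a placeholder for your dataset's actual location:

```shell
# Download a dataset archive directly onto the VM (placeholder URL)
wget https://example.com/dataset.tar.gz

# Or with curl, following redirects and keeping the remote filename
curl -LO https://example.com/dataset.tar.gz

# Extract into the current directory
tar -xzf dataset.tar.gz
```

Downloading directly onto the VM is usually much faster than routing the data through your local machine, since the transfer uses the datacenter's network.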

Transferring Data via SCP

If you have data locally and need to move it to your VM, you can use scp:

scp -P <ssh_port> ~/Downloads/dataset user@<vm_ip>:/home/user/destination_folder

Where:

  • <ssh_port> is the forwarded SSH port
  • <vm_ip> is the VM's public IP address
  • The first path is the local source
  • The second path is the destination on the VM
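For directories, add the recursive flag. The port and IP below are illustrative values; use the forwarded SSH port and public IP of your own VM:

```shell
# Copy a whole directory recursively (note: scp's port flag is uppercase -P)
scp -r -P 2222 ~/Downloads/dataset_dir user@203.0.113.10:/home/user/data

# For large or interrupted transfers, rsync can resume where it left off
rsync -avz --partial -e "ssh -p 2222" ~/Downloads/dataset_dir user@203.0.113.10:/home/user/data
```

rsync is generally preferable for multi-gigabyte datasets, since a dropped connection does not force you to restart the transfer from scratch.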

Environment Setup & Dependency Management

To avoid dependency conflicts, it is best practice to use a virtual environment.

Create a Python Virtual Environment

sudo apt install python3-venv
python3 -m venv myenv
source myenv/bin/activate

Once activated, install required packages inside the environment:

pip install package_name

This keeps project dependencies isolated and reproducible.
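To make the environment reproducible on another VM, pin the exact package versions after installing your dependencies:

```shell
# Record exact package versions installed in the active environment
python3 -m pip freeze > requirements.txt

# Later, on a fresh VM, rebuild the same environment with:
# python3 -m pip install -r requirements.txt
```

Committing requirements.txt alongside your code means a new VM only needs a clone, a venv, and one install command to match your working setup.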

Running Training Jobs

After setting up your environment and dependencies, you can begin training your model directly from the VM terminal.

Training is typically executed via a Python script or command-line interface.

Example Training Command

python train.py \
--name people \
--lr 4e-5 \
--batchSize 4 \
--gpu_ids 0 \
--dataset "/home/user/folder/dataset" \
--total_step 250000 \
--continue_train True \
--model_freq 1000 \
--load_pretrain "/home/user/folder/checkpoints/people" \
--checkpoints_dir "/home/user/folder/checkpoints" \
--which_epoch 150000

This example demonstrates a typical training workflow. Your parameters will vary based on model architecture and dataset size.
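Long-running jobs should survive SSH disconnects. One minimal approach, assuming train.py is your own training script, is to detach the process with nohup and log its output:

```shell
# Launch training detached from the SSH session (train.py is your own script)
nohup python train.py --name people > training.log 2>&1 &
echo $! > training.pid   # record the PID so you can check on or stop the job later

# Check progress at any time
tail -f training.log
```

Terminal multiplexers such as tmux or screen are an alternative that also lets you reattach to a live session after reconnecting.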

Checkpoints & Model Artifacts

Always configure your training jobs to save checkpoints at regular intervals.

Best practices:

  • Store checkpoints in a dedicated directory
  • Periodically copy important artifacts off the VM
  • Keep track of the latest stable checkpoint for recovery or deployment
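The practices above can be sketched in plain Python. This is a minimal illustration using JSON for the checkpoint payload; in practice you would use your framework's serializer (e.g. torch.save), and the directory layout and interval are assumptions to adapt to your project:

```python
import json
import os

CHECKPOINT_DIR = "checkpoints"  # dedicated checkpoint directory (illustrative)
SAVE_EVERY = 1000               # save interval in training steps (illustrative)

os.makedirs(CHECKPOINT_DIR, exist_ok=True)

def save_checkpoint(step, state):
    """Write a checkpoint and update a 'latest' pointer for easy recovery."""
    path = os.path.join(CHECKPOINT_DIR, f"step_{step}.json")
    with open(path, "w") as f:
        json.dump(state, f)
    # Track the newest stable checkpoint for recovery or deployment
    with open(os.path.join(CHECKPOINT_DIR, "latest.txt"), "w") as f:
        f.write(path)

# Inside the training loop:
for step in range(1, 3001):
    # ... run one training step ...
    if step % SAVE_EVERY == 0:
        save_checkpoint(step, {"step": step})
```

The "latest" pointer means a restarted job (or a deployment script) can find the most recent checkpoint without parsing filenames.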

Writing & Running Scripts

You can run commands directly in the terminal or write reusable scripts.

Create a Python Script

nano myscript.py

After writing your code:

  • Press Ctrl + O to save
  • Press Ctrl + X to exit

Run the script:

python myscript.py

This approach is recommended for repeatable training runs and automation.
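For repeatable training runs, the full invocation can be wrapped in a small shell script. The file name, environment name, and flags below are illustrative:

```shell
# Wrap the full training invocation in a script so runs are repeatable
cat > run_training.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
source myenv/bin/activate
python train.py --name people --checkpoints_dir "$HOME/checkpoints"
EOF
chmod +x run_training.sh

# Execute the run:
# ./run_training.sh
```

Keeping the exact command in a versioned script avoids retyping long argument lists and documents precisely how each run was launched.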

Operational Tips

  • Monitor GPU usage during training to ensure resources are fully utilized
  • Avoid running multiple heavy workloads on a single VM unless explicitly configured
  • Stop or terminate VMs when not in use to manage costs
  • Use reserved capacity for long-running or predictable workloads
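On NVIDIA-backed instances, GPU utilization can be checked with nvidia-smi (the two-second refresh and five-second logging intervals below are just examples):

```shell
# Refresh GPU utilization and memory usage every two seconds
watch -n 2 nvidia-smi

# Or log utilization over time for later inspection
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 5 >> gpu_usage.csv
```

Consistently low GPU utilization during training usually points to a data-loading or batch-size bottleneck rather than a compute limit.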