
Infrastructure Guide

| Item | Spec |
| --- | --- |
| GPU | NVIDIA H100 SXM5 × 8 |
| GPU Memory | 80GB HBM3 × 8 (640GB total) |
| CPU | Intel Xeon Platinum 8480C × 2 (112 cores) |
| System Memory | 2TB DDR5 |
| Storage | 7.68TB NVMe SSD |
| Network | 8 × InfiniBand 400Gb/s |

Each student is assigned one 1g.10gb MIG slice:

| Slice Type | GPU Memory | Max Instances | Suitable For |
| --- | --- | --- | --- |
| 1g.10gb | 10GB | 7 | Lightweight vLLM models, labs |
| 2g.20gb | 20GB | 3 | Medium-scale models |
| 3g.40gb | 40GB | 2 | Large-scale deployment |
| 7g.80gb | 80GB | 1 | Full GPU |
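To judge whether a model fits your slice, a rough rule of thumb is 2 bytes per parameter for fp16 weights. A minimal sketch, using a hypothetical 3B-parameter model (not an assigned model):

```shell
# Back-of-the-envelope fp16 memory estimate for model weights.
# params below is a hypothetical example, not a course-assigned model.
params=3000000000
bytes=$((params * 2))   # fp16: 2 bytes per parameter
echo "$((bytes / 1024 / 1024 / 1024)) GiB for weights (KV cache and activations come on top)"
```

On a 1g.10gb slice this leaves only a few GiB of headroom, which is why the smallest slice is suited to lightweight models.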

The DGX server is protected by Cloudflare Zero Trust. Before connecting via SSH, you must install and log in to the Cloudflare WARP client.

Download and install the client for your operating system from the Cloudflare WARP download page.

  1. Launch Cloudflare WARP.
  2. Click the gear (Settings) icon. (Windows: bottom-left / macOS: top-right menu bar)
  3. Navigate to Preferences → Account and click Login to Cloudflare Zero Trust.
  4. Enter the team name and log in with your university email.

The team name, server address, and other sensitive details will be provided separately during class.

With the WARP connection active, open a terminal and connect with the following command.

```shell
ssh {USER}@{SERVER_IP} -p {PORT}
```

| Item | Description |
| --- | --- |
| {USER} | Server account ID |
| {SERVER_IP} | DGX server address |
| {PORT} | SSH port |
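To avoid retyping these values, you can add a host alias to `~/.ssh/config`. The placeholders below stand for the values provided in class, and the alias name `dgx` is just an example:

```
# ~/.ssh/config — example host alias (fill in the values provided in class)
Host dgx
    HostName {SERVER_IP}
    User {USER}
    Port {PORT}
```

Afterwards, `ssh dgx` is equivalent to the full command above.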

Account information and server address will be provided individually during class.

```shell
# List GPUs and MIG devices visible to you
nvidia-smi -L

# List available MIG GPU instance profiles
nvidia-smi mig -lgip

# Monitor GPU utilization, sampling every 5 seconds
nvidia-smi dmon -s u -d 5

# Run Python on your assigned MIG slice
CUDA_VISIBLE_DEVICES=MIG-GPU-[UUID] python your_script.py
```
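The `[UUID]` placeholder comes from the `nvidia-smi -L` output. A sketch of pulling it out with `grep`; the sample line below is illustrative (your real UUID is assigned per student), and on the server you would pipe in `nvidia-smi -L` instead:

```shell
# Parse a MIG device UUID out of an `nvidia-smi -L`-style line.
# The sample line is made up; on the server use: nvidia-smi -L
sample='  MIG 1g.10gb Device 0: (UUID: MIG-a1b2c3d4-e5f6-7890-abcd-ef0123456789)'
uuid=$(printf '%s\n' "$sample" | grep -oE 'MIG-[0-9a-f-]+')
echo "$uuid"
```

The extracted value can then be used directly: `CUDA_VISIBLE_DEVICES="$uuid" python your_script.py`.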
```yaml
# job.yaml — batch job submission
apiVersion: batch/v1
kind: Job
metadata:
  name: [student-id]-experiment
  namespace: ai-systems
spec:
  template:
    spec:
      containers:
        - name: experiment
          image: pytorch/pytorch:2.5-cuda12-cudnn9-devel
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/mig-1g.10gb: "1"
              memory: "16Gi"
              cpu: "8"
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: [student-id]-pvc
      restartPolicy: Never
```
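The `[student-id]` placeholders in the manifest can be filled in one step with `sed`. A small sketch; the id `s2024001` is a made-up example, and the `printf` line stands in for the real `job.yaml`:

```shell
# Substitute your student id into the manifest (s2024001 is hypothetical).
STUDENT_ID=s2024001
printf 'name: [student-id]-experiment\n' | sed "s/\[student-id\]/${STUDENT_ID}/g"
# In practice: sed "s/\[student-id\]/${STUDENT_ID}/g" job.yaml > job-filled.yaml
```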
```shell
# Submit the job
kubectl apply -f job.yaml -n ai-systems

# Stream logs
kubectl logs -f job/[student-id]-experiment -n ai-systems

# Delete the job
kubectl delete job [student-id]-experiment -n ai-systems
```
| Path | Capacity | Purpose |
| --- | --- | --- |
| /home/[student-id] | 100GB | Home directory |
| /workspace/[student-id] | 500GB | Lab projects |
| /data/shared | 10TB | Shared datasets (read-only) |
| /models/cache | 5TB | Shared model cache (read-only) |
```shell
# Check disk usage
du -sh /workspace/[student-id]/*

# Check running processes
ps aux | grep python

# Check GPU processes
nvidia-smi

# List your queued Slurm jobs
squeue -u [student-id]
```
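A small sketch for keeping an eye on the 500GB workspace quota. The 80% threshold is an assumption, and `used_gb` is hard-coded here for illustration; on the server it would come from `du`:

```shell
# Warn when workspace usage crosses 80% of the 500GB quota.
# used_gb is a sample value; on the server:
#   used_gb=$(du -s --block-size=1G /workspace/[student-id] | cut -f1)
used_gb=420
quota_gb=500
if [ "$used_gb" -gt $((quota_gb * 80 / 100)) ]; then
  echo "warning: over 80% of quota (${used_gb}GB / ${quota_gb}GB)"
fi
```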
  1. Conserve compute resources: terminate your processes when you finish a lab session.
  2. Large files: for files over 1GB, request that they be shared via /data/shared.
  3. Model downloads: check /models/cache first; models already cached do not need to be re-downloaded.
  4. Overnight batch jobs: submit long-running experiments as Kubernetes Jobs during off-hours (22:00–06:00).

For technical issues, contact the AI Lab administrator (lab@chu.ac.kr) or open a GitHub Issue.