# Infrastructure Guide

## AI Lab Infrastructure

### DGX H100 Specifications
| Item | Spec |
|---|---|
| GPU | NVIDIA H100 SXM5 × 8 |
| GPU Memory | 80GB HBM3 × 8 (640GB total) |
| CPU | Intel Xeon Platinum 8480C × 2 (112 cores) |
| System Memory | 2TB DDR5 |
| Storage | 7.68TB NVMe SSD |
| Network | 8 × InfiniBand 400Gb/s |
### MIG Slice Allocation

Each student is assigned one 1g.10gb MIG slice:
| Slice Type | GPU Memory | Max Instances | Suitable For |
|---|---|---|---|
| 1g.10gb | 10GB | 7 | vLLM Lite models, labs |
| 2g.20gb | 20GB | 3 | Medium-scale models |
| 3g.40gb | 40GB | 2 | Large-scale deployment |
| 7g.80gb | 80GB | 1 | Full GPU |
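A quick way to sanity-check whether a model fits in your slice is a back-of-envelope weight-memory estimate. The sketch below uses illustrative numbers (2 bytes per parameter for fp16/bf16, a ~20% runtime-overhead factor); actual memory use also depends on activations and KV cache, so treat this as a rough lower bound, not a guarantee.

```python
# Rough estimate of whether a model's weights fit in a MIG slice.
# The 2-bytes-per-param and 20% overhead figures are assumptions
# for illustration, not measured values.

def weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB for a parameter count (fp16/bf16 = 2 bytes)."""
    return num_params * bytes_per_param / 1e9

def fits_in_slice(num_params: float, slice_gb: float = 10.0,
                  overhead: float = 1.2) -> bool:
    """True if weights plus ~20% runtime overhead fit in the slice."""
    return weights_gb(num_params) * overhead <= slice_gb

# A 3B-parameter model in fp16: 3e9 params * 2 bytes = 6 GB of weights.
print(weights_gb(3e9))     # 6.0
print(fits_in_slice(3e9))  # True  (6 GB * 1.2 = 7.2 GB <= 10 GB)
print(fits_in_slice(7e9))  # False (14 GB * 1.2 > 10 GB)
```

By this estimate, models up to roughly 4B parameters in fp16 are comfortable on a 1g.10gb slice; anything larger needs a bigger slice or quantization.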
### Server Access

The DGX server is protected by Cloudflare Zero Trust. Before connecting via SSH, you must install and log in to the Cloudflare WARP client.
#### 1. Install Cloudflare WARP

Download and install the client for your operating system from the Cloudflare WARP download page.
#### 2. Log in to Cloudflare Zero Trust

- Launch Cloudflare WARP.
- Click the gear (Settings) icon. (Windows: bottom-left / macOS: top-right menu bar)
- Navigate to Preferences → Account and click Login to Cloudflare Zero Trust.
- Enter the team name and log in with your university email.
The team name, server address, and other sensitive details will be provided separately during class.
#### 3. SSH Connection

With the WARP connection active, open a terminal and connect with the following command:

```shell
ssh {USER}@{SERVER_IP} -p {PORT}
```

| Item | Description |
|---|---|
| {USER} | Server account ID |
| {SERVER_IP} | DGX server address |
| {PORT} | SSH port |
Account information and server address will be provided individually during class.
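To avoid retyping the connection details every session, you can store them in `~/.ssh/config`. A minimal sketch, where the host alias `dgx` is an arbitrary choice and `{USER}`, `{SERVER_IP}`, `{PORT}` are the placeholders from the table above:

```
Host dgx
    HostName {SERVER_IP}
    User {USER}
    Port {PORT}
```

With this in place, `ssh dgx` connects directly (the WARP connection must still be active).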
#### 4. Check and Use GPU

```shell
# List MIG GPU instance profiles
nvidia-smi mig -lgip

# Monitor GPU utilization every 5 seconds
nvidia-smi dmon -s u -d 5

# Run Python on your assigned MIG slice
CUDA_VISIBLE_DEVICES=MIG-GPU-[UUID] python your_script.py
```
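The MIG UUID needed for `CUDA_VISIBLE_DEVICES` can be read out of `nvidia-smi -L`. A small sketch that extracts it; the sample output below is illustrative of the usual format, and the UUIDs on the lab server will of course differ:

```python
import re

# Illustrative sample of `nvidia-smi -L` output on a MIG-enabled H100;
# real UUIDs on the lab server will differ.
SAMPLE = """\
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 1g.10gb Device 0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
"""

def mig_uuids(nvidia_smi_output: str) -> list[str]:
    """Return every MIG device UUID found in `nvidia-smi -L` output."""
    return re.findall(r"UUID:\s*(MIG-[0-9a-fA-F-]+)", nvidia_smi_output)

print(mig_uuids(SAMPLE))  # ['MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee']
```

On the server you would feed it the real output, e.g. `mig_uuids(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)`, and export the result as `CUDA_VISIBLE_DEVICES`.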
### Running Kubernetes Workloads

```yaml
# job.yaml — batch job submission
apiVersion: batch/v1
kind: Job
metadata:
  name: [student-id]-experiment
  namespace: ai-systems
spec:
  template:
    spec:
      containers:
        - name: experiment
          image: pytorch/pytorch:2.5-cuda12-cudnn9-devel
          command: ["python", "train.py"]
          resources:
            limits:
              nvidia.com/mig-1g.10gb: "1"
              memory: "16Gi"
              cpu: "8"
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: [student-id]-pvc
      restartPolicy: Never
```

```shell
# Submit job
kubectl apply -f job.yaml -n ai-systems
```
```shell
# View logs
kubectl logs -f job/[student-id]-experiment -n ai-systems

# Delete job
kubectl delete job [student-id]-experiment -n ai-systems
```
### Storage

| Path | Capacity | Purpose |
|---|---|---|
| /home/[student-id] | 100GB | Home directory |
| /workspace/[student-id] | 500GB | Lab projects |
| /data/shared | 10TB | Shared datasets (read-only) |
| /models/cache | 5TB | Shared model cache (read-only) |
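Besides `du -sh`, Python's standard library can report how full a volume is, which is handy inside scripts that write checkpoints. A sketch; the `/workspace/[student-id]` path is the lab convention from the table above, and any accessible path works:

```python
import shutil

def usage_report(path: str) -> str:
    """Human-readable disk usage summary for the volume containing `path`."""
    total, used, free = shutil.disk_usage(path)
    gb = 1024 ** 3  # bytes per GiB
    return (f"{path}: {used / gb:.1f} GiB used of "
            f"{total / gb:.1f} GiB ({free / gb:.1f} GiB free)")

# On the server you would pass your own workspace, e.g.
# usage_report("/workspace/[student-id]")
print(usage_report("/"))
```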
### Useful Commands

```shell
# Check disk usage
du -sh /workspace/[student-id]/*

# Check running processes
ps aux | grep python

# Check GPU processes
nvidia-smi

# List Slurm jobs (queued jobs)
squeue -u [student-id]
```
### Important Notes

- Conserve compute resources: Terminate processes when you finish a lab session
- Large files: Request to share files over 1GB in /data/shared
- Model downloads: Models already in /models/cache do not need to be re-downloaded
- Overnight batch jobs: Submit long experiments as Kubernetes Jobs during off-hours (22:00–06:00)
### Contact

For technical issues, contact the AI Lab administrator (lab@chu.ac.kr) or open a GitHub Issue.