DesignSafe provides three places where computation happens. Each serves a different purpose, and most researchers move between them as a project evolves: develop interactively in JupyterHub, then submit production runs to HPC.
## JupyterHub
JupyterHub is where most day-to-day work happens. Each session gets a dedicated Kubernetes container at TACC with up to 8 CPU cores and 20 GB RAM. Sessions start immediately with no queue wait. The browser-based environment includes notebooks, a terminal, a file manager, and a text editor — all sharing the same filesystem.
Move to HPC when the workload needs more memory, more cores, multi-node parallelism (MPI), or longer runtimes than an interactive session allows.
For heavier interactive work, Jupyter HPC Native sessions run directly on Stampede3 CPU nodes or Vista GPU nodes (NVIDIA GH200 Grace Hopper) with full node resources. These go through the SLURM queue, so there may be a wait before the session starts. See the linked TACC user guides for Vista queue policies and hardware details.
## Virtual machines
Virtual machines (VMs) run interactive GUI applications without a queue wait. OpenSees Interactive, MATLAB, ADCIRC Interactive, STKO, and QGIS all run on shared VMs at TACC. STKO and QGIS provide a full graphical desktop through NICE DCV, which streams a remote desktop to the browser. VMs share hardware across users, so they work best for lightweight tasks and quick tests.
## HPC systems
HPC (High-Performance Computing) systems handle production-scale computation. These are clusters of interconnected machines (nodes), each with dozens of CPU cores and hundreds of gigabytes of memory. They are shared systems — thousands of researchers submit jobs to the same hardware, so SLURM manages access through job queues. Researchers using DesignSafe never interact with SLURM directly; Tapis generates SLURM scripts automatically. Long-running simulations, multi-core parallel analyses, and parametric sweeps with hundreds of runs all belong on HPC.
DesignSafe researchers have access to three TACC systems:
| System | Cores per Node | Memory per Node | Primary Use |
|---|---|---|---|
| Stampede3 | 48–112 (varies by node type) | 128 GB–4 TB | General-purpose, most DesignSafe jobs |
| Frontera | 56 | 192 GB | Large-scale parallel simulations |
| Lonestar6 | 128 | 256 GB | General-purpose, GPU nodes available |
### Nodes, cores, and memory
A node is a complete physical computer. Each node has multiple cores (CPUs) that execute work in parallel, sharing the same pool of RAM.
When submitting a job, specify `node_count`, `cores_per_node`, and `max_minutes`. Total cores = `node_count` × `cores_per_node`. For MPI jobs, each core runs one parallel process (rank). For PyLauncher sweeps, each core runs one independent task.
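For instance, a two-node SKX job that uses every core maps to the following parameters. This is a minimal arithmetic sketch; the dictionary keys simply mirror the parameter names above and are not a specific Tapis schema.

```python
# Illustrative only: keys mirror the parameter names above, not a Tapis schema.
job = {
    "node_count": 2,       # two SKX nodes
    "cores_per_node": 48,  # all 48 cores on each node
    "max_minutes": 120,    # 2-hour wall-clock limit
}

total_cores = job["node_count"] * job["cores_per_node"]
print(total_cores)  # 96 -> 96 MPI ranks, or 96 concurrent PyLauncher tasks
```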
All cores on a node share memory. If each process needs more memory, request fewer cores per node:
| Cores per Node (192 GB SKX) | Memory per Core |
|---|---|
| 48 | ~4 GB |
| 24 | ~8 GB |
| 12 | ~16 GB |
### SLURM and queues
SLURM is the job scheduler on all TACC systems. When a job is submitted, SLURM places it in a queue (also called a partition). Each queue groups nodes with similar hardware and enforces limits on node count and runtime. Researchers using DesignSafe never write SLURM scripts directly — Tapis generates them automatically from the job parameters.
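For context, a generated batch script is mostly a header of `#SBATCH` directives followed by the command to run. The sketch below only illustrates how the job parameters from the previous section map onto common SLURM directives; the actual scripts Tapis produces for DesignSafe apps will differ, and the job name and allocation shown are placeholders.

```python
# Rough illustration of a SLURM batch header built from job parameters.
# Not the script Tapis actually generates; values are examples only.
def slurm_header(job_name, queue, node_count, cores_per_node, max_minutes, allocation):
    hours, minutes = divmod(max_minutes, 60)
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH -J {job_name}",                     # job name
        f"#SBATCH -p {queue}",                        # queue (partition)
        f"#SBATCH -N {node_count}",                   # number of nodes
        f"#SBATCH -n {node_count * cores_per_node}",  # total MPI tasks
        f"#SBATCH -t {hours:02d}:{minutes:02d}:00",   # wall-clock limit
        f"#SBATCH -A {allocation}",                   # allocation to charge
    ])

print(slurm_header("opensees-run", "skx", 2, 48, 120, "MyAllocation-123"))
```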
Stampede3 queues (full policy in the Stampede3 User Guide):
| Queue | Node Type | Cores | Memory | Max Nodes | Max Duration | Charge Rate |
|---|---|---|---|---|---|---|
| skx | SKX (Skylake) | 48 | 192 GB | 256 | 48 hrs | 1 SU |
| skx-dev | SKX (Skylake) | 48 | 192 GB | 16 | 2 hrs | 1 SU |
| icx | ICX (Ice Lake) | 80 | 256 GB | 32 | 48 hrs | 1.5 SUs |
| spr | SPR (Sapphire Rapids) | 112 | 128 GB HBM | 32 | 48 hrs | 2 SUs |
| pvc | PVC (Ponte Vecchio) | 96 | 512 GB | 4 | 48 hrs | 3 SUs |
| nvdimm | NVDIMM (Large Memory) | 80 | 4 TB | 1 | 48 hrs | 4 SUs |
SKX nodes are the most numerous (1,060) and a good default. The skx-dev queue is designed for short test runs with low wait times. Always test there before submitting production jobs.
Frontera queues (56 cores, 192 GB per node; Frontera User Guide):
| Queue | Max Nodes | Max Duration | Notes |
|---|---|---|---|
| normal | 512 | 48 hrs | General production |
| development | 40 | 2 hrs | Testing and debugging |
| large | 2048 | 48 hrs | Requires special approval |
Lonestar6 queues (128 cores, 256 GB per node; Lonestar6 User Guide):
| Queue | Max Nodes | Max Duration | Notes |
|---|---|---|---|
| normal | 32 | 48 hrs | General production |
| development | 4 | 2 hrs | Testing and debugging |
| gpu-a100 | 16 | 48 hrs | NVIDIA A100 GPU nodes |
| gpu-a100-dev | 4 | 2 hrs | GPU development |
### Choosing a queue
1. **Estimate memory per process.** If each MPI rank needs 8 GB and the node has 192 GB, use at most 24 cores per node.
2. **Determine total cores needed.** A model decomposed into 96 subdomains needs 96 cores (e.g., 2 nodes × 48 cores). A short sketch combining steps 1 and 2 follows this list.
3. **Pick the queue that fits.** Use `skx-dev` or `development` for testing. Use production queues for real runs.
4. **Check system load.** Live queue status: Stampede3, Frontera, Lonestar6.
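Under the SKX figures from the tables above (48 cores and 192 GB per node, 16-node limit on `skx-dev`), the first two steps reduce to simple arithmetic. The helper below is hypothetical, not a DesignSafe or Tapis API.

```python
import math

# Hypothetical helper: combines the memory check (step 1) with the core
# count (step 2) for Stampede3 SKX nodes (48 cores, 192 GB per node).
def plan_skx_job(total_ranks, mem_per_rank_gb):
    cores_per_node = min(48, 192 // mem_per_rank_gb)       # step 1: leave enough memory per rank
    node_count = math.ceil(total_ranks / cores_per_node)   # step 2: enough cores for every rank
    return {
        "node_count": node_count,
        "cores_per_node": cores_per_node,
        "fits_skx_dev": node_count <= 16,  # skx-dev accepts at most 16 nodes for test runs
    }

print(plan_skx_job(total_ranks=96, mem_per_rank_gb=8))
# {'node_count': 4, 'cores_per_node': 24, 'fits_skx_dev': True}
```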
### Allocations and Service Units
A TACC allocation is a grant of computing time tied to a research project. Running jobs charges Service Units (SUs):
`SUs = nodes × hours × charge_rate`

A job on 4 SKX nodes for 2 hours at 1 SU/node-hour costs 8 SUs. The same job on SPR nodes at 2 SUs/node-hour costs 16 SUs. Nodes are billed in full regardless of how many cores are used, and every job incurs a minimum charge of 15 minutes.
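A quick sketch of the same formula, including the 15-minute minimum charge:

```python
# Estimate the SU cost of a job: billed per node-hour at the queue's
# charge rate, with a 15-minute minimum per job.
def estimate_sus(nodes, hours, charge_rate):
    billable_hours = max(hours, 0.25)  # every job is charged at least 15 minutes
    return nodes * billable_hours * charge_rate

print(estimate_sus(nodes=4, hours=2, charge_rate=1))  # 8 SUs on skx
print(estimate_sus(nodes=4, hours=2, charge_rate=2))  # 16 SUs on spr
```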
HPC-enabled tools (OpenSeesMP, OpenFOAM, ADCIRC) require an allocation. If you don’t have one, submit a ticket through the DesignSafe help desk. Remaining balance and allocation codes are on the TACC Dashboard.