
DesignSafe provides three places where computation happens. Each serves a different purpose, and most researchers move between them as a project evolves: develop interactively in JupyterHub, then submit production runs to HPC.

## JupyterHub

JupyterHub is where most day-to-day work happens. Each session gets a dedicated Kubernetes container at TACC with up to 8 CPU cores and 20 GB RAM. Sessions start immediately with no queue wait. The browser-based environment includes notebooks, a terminal, a file manager, and a text editor — all sharing the same filesystem.

Move to HPC when the workload needs more memory, more cores, multi-node parallelism (MPI), or longer runtimes than an interactive session allows.

For heavier interactive work, Jupyter HPC Native sessions run directly on Stampede3 CPU nodes or Vista GPU nodes (NVIDIA H200) with full node resources. These go through the SLURM queue, so there may be a wait before the session starts. See the linked TACC user guides for Vista queue policies and hardware details.

## Virtual machines

Virtual machines (VMs) run interactive GUI applications without a queue wait. OpenSees Interactive, MATLAB, ADCIRC Interactive, STKO, and QGIS all run on shared VMs at TACC. STKO and QGIS provide a full graphical desktop through NICE DCV, which streams a remote desktop to the browser. VMs share hardware across users, so they work best for lightweight tasks and quick tests.

## HPC systems

HPC (High-Performance Computing) systems handle production-scale computation. These are clusters of interconnected machines (nodes), each with dozens of CPU cores and hundreds of gigabytes of memory. They are shared systems — thousands of researchers submit jobs to the same hardware, so SLURM manages access through job queues. Researchers using DesignSafe never interact with SLURM directly; Tapis generates SLURM scripts automatically. Long-running simulations, multi-core parallel analyses, and parametric sweeps with hundreds of runs all belong on HPC.
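For context, a SLURM batch script like the ones Tapis generates behind the scenes looks roughly like this. This is an illustrative sketch, not a script Tapis actually emits; the job name, allocation code, and input file are placeholders:

```shell
#!/bin/bash
#SBATCH -J opensees-run        # job name (illustrative)
#SBATCH -p skx                 # queue (partition) on Stampede3
#SBATCH -N 2                   # number of nodes
#SBATCH -n 96                  # total MPI tasks (2 nodes x 48 cores)
#SBATCH -t 04:00:00            # max runtime (hh:mm:ss)
#SBATCH -A MY-ALLOCATION       # allocation to charge (placeholder)

# ibrun is TACC's MPI launcher; model.tcl is a placeholder input file
ibrun OpenSeesMP model.tcl
```

Tapis fills in these directives from the job parameters entered on DesignSafe, so the script above is only useful for understanding what the scheduler sees.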

DesignSafe researchers have access to three TACC systems:

| System | Cores per Node | Memory per Node | Primary Use |
|---|---|---|---|
| Stampede3 | 48–112 (varies by node type) | 128 GB–4 TB | General-purpose, most DesignSafe jobs |
| Frontera | 56 | 192 GB | Large-scale parallel simulations |
| Lonestar6 | 128 | 256 GB | General-purpose, GPU nodes available |

## Nodes, cores, and memory

A node is a complete physical computer. Each node has multiple CPU cores that execute work in parallel, all sharing the same pool of RAM.

When submitting a job, specify `node_count`, `cores_per_node`, and `max_minutes`. Total cores = `node_count` × `cores_per_node`. For MPI jobs, each core runs one parallel process (rank); for PyLauncher sweeps, each core runs one independent task.
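The arithmetic can be sketched in a few lines. The parameter names mirror the job parameters above; the function itself is illustrative, not part of any DesignSafe API:

```python
def total_cores(node_count: int, cores_per_node: int) -> int:
    """Total MPI ranks (or PyLauncher tasks) a job can run."""
    return node_count * cores_per_node

# A 2-node SKX job using all 48 cores per node:
print(total_cores(node_count=2, cores_per_node=48))  # → 96
```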

All cores on a node share memory. If each process needs more memory, request fewer cores per node:

| Cores per Node (192 GB SKX) | Memory per Core |
|---|---|
| 48 | ~4 GB |
| 24 | ~8 GB |
| 12 | ~16 GB |
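A small helper can capture this trade-off. This is a sketch under the assumptions in the table above (192 GB SKX nodes with 48 physical cores), not a DesignSafe utility:

```python
def max_cores_per_node(node_memory_gb: float,
                       mem_per_process_gb: float,
                       physical_cores: int) -> int:
    """Largest cores-per-node that still gives each process enough memory."""
    limited_by_memory = int(node_memory_gb // mem_per_process_gb)
    return min(limited_by_memory, physical_cores)

# On a 192 GB SKX node (48 cores), processes needing 8 GB each:
print(max_cores_per_node(192, 8, 48))  # → 24
```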

## SLURM and queues

SLURM is the job scheduler on all TACC systems. When a job is submitted, SLURM places it in a queue (also called a partition); each queue groups nodes with similar hardware and enforces limits on node count and runtime. Tapis generates the SLURM scripts automatically from the job parameters, so DesignSafe researchers never write them by hand.

Stampede3 queues (full policy in the Stampede3 User Guide):

| Queue | Node Type | Cores | Memory | Max Nodes | Max Duration | Charge Rate |
|---|---|---|---|---|---|---|
| skx | SKX (Skylake) | 48 | 192 GB | 256 | 48 hrs | 1 SU |
| skx-dev | SKX (Skylake) | 48 | 192 GB | 16 | 2 hrs | 1 SU |
| icx | ICX (Ice Lake) | 80 | 256 GB | 32 | 48 hrs | 1.5 SUs |
| spr | SPR (Sapphire Rapids) | 112 | 128 GB HBM | 32 | 48 hrs | 2 SUs |
| pvc | PVC (Ponte Vecchio) | 96 | 512 GB | 4 | 48 hrs | 3 SUs |
| nvdimm | NVDIMM (Large Memory) | 80 | 4 TB | 1 | 48 hrs | 4 SUs |

SKX nodes are the most numerous (1,060) and a good default. The skx-dev queue is designed for short test runs with low wait times. Always test there before submitting production jobs.

Frontera queues (56 cores, 192 GB per node; Frontera User Guide):

| Queue | Max Nodes | Max Duration | Notes |
|---|---|---|---|
| normal | 512 | 48 hrs | General production |
| development | 40 | 2 hrs | Testing and debugging |
| large | 2048 | 48 hrs | Requires special approval |

Lonestar6 queues (128 cores, 256 GB per node; Lonestar6 User Guide):

| Queue | Max Nodes | Max Duration | Notes |
|---|---|---|---|
| normal | 32 | 48 hrs | General production |
| development | 4 | 2 hrs | Testing and debugging |
| gpu-a100 | 16 | 48 hrs | NVIDIA A100 GPU nodes |
| gpu-a100-dev | 4 | 2 hrs | GPU development |

## Choosing a queue

  1. Estimate memory per process. If each MPI rank needs 8 GB and the node has 192 GB, use at most 24 cores per node.

  2. Determine total cores needed. A model decomposed into 96 subdomains needs 96 cores (e.g., 2 nodes x 48 cores).

  3. Pick the queue that fits. Use skx-dev or development for testing. Use production queues for real runs.

  4. Check system load. Live queue status: Stampede3, Frontera, Lonestar6.
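Steps 1 and 2 can be sketched as a single planning function. The default values assume 192 GB SKX nodes with 48 cores; the function and its name are illustrative, not part of any DesignSafe tooling:

```python
import math

def plan_job(total_ranks: int, mem_per_rank_gb: float,
             node_memory_gb: float = 192, physical_cores: int = 48):
    """Return (node_count, cores_per_node) for a given memory footprint."""
    # Step 1: cores per node limited by memory, capped at physical cores
    cores_per_node = min(physical_cores,
                         int(node_memory_gb // mem_per_rank_gb))
    # Step 2: enough nodes to host all ranks
    node_count = math.ceil(total_ranks / cores_per_node)
    return node_count, cores_per_node

# 96 MPI ranks needing 8 GB each on SKX nodes:
print(plan_job(96, 8))  # → (4, 24)
```

With ranks that fit in ~4 GB, the same 96-rank model would pack onto 2 full nodes instead, matching the example in step 2.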

## Allocations and Service Units

A TACC allocation is a grant of computing time tied to a research project. Running jobs charges Service Units (SUs):

SUs = nodes x hours x charge_rate

A job on 4 SKX nodes for 2 hours at 1 SU/node-hour costs 8 SUs; the same job on SPR nodes at 2 SUs/node-hour costs 16 SUs. A node is billed in full regardless of how many of its cores are used, and every job incurs a minimum charge of 15 minutes.
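The charging rule, including the 15-minute minimum, can be written out directly. This is a sketch of the formula above, not an official TACC accounting tool:

```python
def su_cost(nodes: int, hours: float, charge_rate: float) -> float:
    """SUs = nodes x hours x charge_rate, with a 15-minute minimum charge."""
    billed_hours = max(hours, 0.25)  # every job is charged at least 15 minutes
    return nodes * billed_hours * charge_rate

print(su_cost(4, 2, 1.0))  # → 8.0  (4 SKX nodes, 2 hours)
print(su_cost(4, 2, 2.0))  # → 16.0 (same job on SPR nodes)
```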

HPC-enabled tools (OpenSeesMP, OpenFOAM, ADCIRC) require an allocation. If you don’t have one, submit a ticket through the DesignSafe help desk. Remaining balance and allocation codes are on the TACC Dashboard.