
Sometimes a job finishes with status FAILED, sits in QUEUED for hours, or completes but produces results that look wrong. This page walks through how to diagnose and fix these problems.

Job states

Tapis tracks every job through a sequence of states from submission to completion. Knowing which state a job reached narrows the problem to a specific part of the pipeline.

| State | What It Means |
| --- | --- |
| PENDING | Submitted but not yet started. |
| STAGING_INPUTS | Tapis is copying input files to the execution system. |
| QUEUED | Waiting in the SLURM scheduler queue for compute resources. |
| RUNNING | Actively executing on compute nodes. |
| ARCHIVING | Tapis is copying output files back to the archive location on Corral. |
| FINISHED | Completed successfully and outputs were archived. |
| FAILED | Something went wrong during staging, execution, or archiving. |
| CANCELLED | Manually cancelled before completion. |
```python
# Check the current state
job.get_status()

# Poll until completion, printing each state transition
job.monitor()
```

Job stuck in QUEUED

A job that stays in QUEUED for a long time is not broken. It is waiting for SLURM to allocate the requested hardware. Common reasons for long waits:

- The queue is busy and many jobs are ahead of yours.
- The job requests many nodes, which are harder to schedule together.
- The job requests a long walltime, which leaves fewer scheduling windows.

To reduce wait time, try the development queue (skx-dev) for test runs, request fewer nodes, or request a shorter walltime.
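For a quick test run, those three levers can be expressed as job-request overrides. This is a sketch only: the field names follow the Tapis v3 jobs schema (`execSystemLogicalQueue`, `nodeCount`, `maxMinutes`), and the specific values are illustrative; check them against your app definition before submitting.

```python
# Illustrative job-request overrides for a short test run.
# Field names assume the Tapis v3 jobs schema; verify against your app.
test_overrides = {
    "execSystemLogicalQueue": "skx-dev",  # development queue for test runs
    "nodeCount": 1,                        # fewer nodes are easier to schedule
    "maxMinutes": 30,                      # short walltime fits backfill windows
}
```

Once the test run behaves as expected, switch back to the production queue and full resource request.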

Reading the output files

Every job produces two log files.

tapisjob.out is the standard output stream (stdout). Programs print their normal progress messages here, along with results and debugging statements. If a structural analysis prints iteration counts or convergence norms, they appear in this file.

tapisjob.err is the standard error stream (stderr). Programs report problems here. Runtime errors, missing file messages, MPI startup failures, and SLURM warnings all appear in this file. When something goes wrong, this is usually the first place to look.

| File | What to Look For |
| --- | --- |
| tapisjob.out | Progress messages, printed results, completion indicators |
| tapisjob.err | Syntax errors, missing files, failed module loads, MPI problems, permission errors |

Both files are archived with the job outputs. There are several ways to view them: open them from the Job Status page in the DesignSafe portal, read them programmatically with job.get_output_content(), or browse the archive location after the job finishes.

An empty .out file usually means the script failed before producing any output. The error will be in .err. Always check .err even when the job produced output, because it may reveal warnings or silent failures.
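That rule of thumb can be sketched as a small helper. The `diagnose_logs` function below is hypothetical (not part of dapi); it just encodes the logic above for log text you have already fetched.

```python
def diagnose_logs(out_text: str, err_text: str) -> str:
    """Apply the rule of thumb: an empty .out usually means the script
    failed before producing output, and .err is worth reading even when
    the job produced results."""
    if not out_text.strip():
        return "Empty .out: the script likely failed early. Check .err."
    if err_text.strip():
        return "Job produced output, but .err is not empty. Review warnings."
    return "No obvious problems in the logs."
```

For example, you might call it with the two archived logs: `diagnose_logs(job.get_output_content("tapisjob.out"), job.get_output_content("tapisjob.err"))`.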

Troubleshooting checklist

When a job does not behave as expected, work through these checks in order.

| Step | What to Check | Where to Look |
| --- | --- | --- |
| 1 | Did the job run at all? Look for start/stop messages. | .out |
| 2 | Is there a syntax or runtime error? | .err |
| 3 | Any missing input files or path typos? | .err |
| 4 | Are MPI core counts correct? (see Parallel jobs) | .err |
| 5 | Is the correct executable being used (OpenSees, OpenSeesSP, OpenSeesMP)? | .out / .err |
| 6 | Does output stop partway through, indicating a crash? | .out / .err |
| 7 | Does the script use absolute or relative paths correctly? | Both |
| 8 | Is there a SLURM-specific error, such as an exceeded time limit? | .err |
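Several of these checks can be automated by scanning the error log for known failure signatures. This is a sketch: the signature strings below come from the failure patterns described on this page and are not exhaustive.

```python
# Map common .err signatures (from the failure patterns on this page)
# to a short diagnosis. Illustrative, not exhaustive.
ERROR_SIGNATURES = {
    "Invalid account": "Allocation expired or invalid: check SUs on the TACC Dashboard.",
    "No such file or directory": "Missing input file: check paths in the script.",
    "DUE TO TIME LIMIT": "Walltime exceeded: resubmit with a longer max_minutes.",
    "ORTE was unable": "MPI launch failure: check node/core counts and use ibrun.",
    "module(s) are unknown": "Module not available: verify module loads in the app definition.",
}

def scan_err_log(err_text: str) -> list:
    """Return a diagnosis for every known signature found in the log."""
    return [hint for sig, hint in ERROR_SIGNATURES.items() if sig in err_text]
```

Run it on the fetched log, e.g. `scan_err_log(job.get_output_content("tapisjob.err"))`; an empty list means none of the known signatures matched and the log needs manual reading.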

Common failure patterns

Allocation expired or invalid. The job fails immediately at the QUEUED stage with a message like Unable to allocate resources: Invalid account or account/partition combination specified. Verify the allocation code on the TACC Dashboard and confirm it has remaining SUs.

Input files not found. Path typos are the most common cause. Tapis stages files into a working directory, so relative paths in the script must match the directory structure that was uploaded. Look in .err for messages like No such file or directory: 'inputs/motions/GM_01.txt'.

Walltime exceeded. SLURM kills jobs that exceed their max_minutes limit. The .err file will contain CANCELLED AT ... DUE TO TIME LIMIT. Resubmit with a longer walltime. A good rule of thumb is to add 50% margin beyond the expected runtime.
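The 50% rule of thumb is simple arithmetic; a sketch (the `walltime_with_margin` name is illustrative):

```python
import math

def walltime_with_margin(expected_minutes, margin=0.5):
    """Add a safety margin to the expected runtime (50% by default)."""
    return math.ceil(expected_minutes * (1 + margin))
```

So an analysis expected to take 120 minutes would be submitted with max_minutes of 180.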

MPI configuration wrong. On TACC systems, ibrun is the correct MPI launcher (see Running HPC Jobs). If .err shows ORTE was unable to reliably start one or more daemons or MPI hostfile errors, check that the node/core counts match the model’s decomposition.

Wrong number of MPI ranks. If an OpenSees MP model is partitioned into 96 subdomains but the job requests 48 cores, the simulation will fail or produce incorrect results. Total ranks (node_count x cores_per_node) must match the model setup.
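A quick pre-submission check catches this mismatch. The `ranks_match` helper is hypothetical; it just encodes the rule that total ranks must equal the subdomain count.

```python
def ranks_match(node_count, cores_per_node, num_partitions):
    """True when total MPI ranks equal the model's subdomain count."""
    return node_count * cores_per_node == num_partitions
```

For the example above, a 96-subdomain OpenSeesMP model needs 2 nodes at 48 cores each (2 x 48 = 96); requesting only 1 node fails the check.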

Ranks writing to the same file. If output files contain garbled data or results seem randomly wrong, check that each MPI rank writes to a unique filename. See Rank-aware file management.
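One common convention is to embed the rank number in every output filename. In an OpenSees MP Tcl script the rank comes from getPID; the sketch below shows the same naming pattern as a plain Python helper (the function name is illustrative).

```python
def rank_filename(base, rank, ext="out"):
    """Build a unique per-rank output filename, e.g. disp_rank0003.out.

    Zero-padding keeps the files sorted numerically in directory listings.
    """
    return f"{base}_rank{rank:04d}.{ext}"
```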

Module not available. TACC uses environment modules to manage software versions. If .err shows Lmod has detected the following error: The following module(s) are unknown, the compute node does not have the expected software. This often happens when a module name or version has changed. Verify the app definition includes the correct module loads.

Staging failure. If the job fails during STAGING_INPUTS, Tapis could not copy the input files to the execution system. Common causes include a mistyped input path or URI, files that were moved or deleted after submission, and missing read permissions on the source storage system.
Archiving failure. If the job ran successfully but fails during ARCHIVING, Tapis could not copy the output files back to DesignSafe storage. Common causes include an archive path that does not exist, missing write permissions on the archive directory, and an exhausted storage quota.
In both cases, job.get_status() will show FAILED. Check the Tapis job history for the specific error message, or contact DesignSafe support with the job UUID.

Reconnecting to a running job

A lost notebook session does not mean losing track of a job. Reconnect using the job UUID.

```python
from dapi import DSClient

ds = DSClient()
job = ds.jobs.job("your-job-uuid-here")
job.get_status()
```

The job UUID appears in the output from ds.jobs.submit() and on the DesignSafe Job Status page in the portal.

Quick debugging workflow

When a job fails, work through these steps:

```python
# 1. Check the final state
job.get_status()

# 2. List the output files
job.list_outputs()

# 3. Read the error log
print(job.get_output_content("tapisjob.err"))

# 4. Read the output log
print(job.get_output_content("tapisjob.out"))
```

Most problems become clear from the error log. If not, work through the Troubleshooting checklist above.