VS Code Remote SSH into an Apptainer Instance on HPC
If your cluster login environment is too old for the current VS Code server, but you are allowed to run Apptainer or Singularity on compute nodes, you can move the editor target into a newer container without touching the host OS. In my case, I specifically wanted a container with CUDA and a newer userspace, including a newer glibc, so GPU debugging and interactive inspection would work cleanly. This post shows a practical way to do that by running sshd inside the container and reaching it through the normal SSH jump path.
Background
This is useful on clusters where the host userspace is effectively frozen, for example older enterprise Linux images or nodes with outdated glibc. VS Code Remote SSH may fail on the host even though the compute environment itself is perfectly usable. Installing a newer runtime directly on the host is usually fragile, and VS Code tunnel may not be something you want to rely on in a shared HPC environment.
The workaround is simple: launch a container on the compute node, run sshd inside it, and let VS Code connect to that container instead of the host.
The Solution Stack
The connection path looks like this:
Laptop
↓
Login node
↓
Allocated compute node
↓
Apptainer/Singularity instance
↓
sshd inside the container
Prerequisites
- You can already SSH from your laptop to the cluster login node.
- You have an interactive allocation on a compute node.
- The cluster allows Apptainer or Singularity instances.
- Your home directory is visible inside the container so `~/.ssh/authorized_keys` can be reused.
- `openssh-server` is available in the container image.
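These prerequisites can be sanity-checked up front. A minimal helper sketch (the function name `check_prereqs` is my own, not part of any site tooling) that verifies the key file exists and that a container runtime is on PATH:

```shell
#!/bin/bash
# check_prereqs: verify an authorized_keys file exists and that either
# apptainer or singularity is available. Hypothetical helper, shown only
# to make the prerequisites above concrete.
check_prereqs() {
  local keyfile=${1:-$HOME/.ssh/authorized_keys}
  if [ ! -f "$keyfile" ]; then
    echo "missing authorized_keys: $keyfile" >&2
    return 1
  fi
  if ! command -v apptainer >/dev/null 2>&1 && \
     ! command -v singularity >/dev/null 2>&1; then
    echo "no apptainer/singularity in PATH" >&2
    return 1
  fi
  echo "prerequisites look OK"
}
```

Run it once on the compute node before building anything; it fails fast instead of leaving you debugging a silent sshd login failure later.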
Step 1: Build an image with sshd
Start from a Rocky 9 CUDA base and bake sshd into the image so the instance can accept SSH connections directly. The definition below is only an example, not a required template. The main point is to prepare an image that has the newer userspace you need for debugging, plus CUDA if you are debugging GPU workloads.
Bootstrap: docker
From: nvidia/cuda:12.2.0-devel-rockylinux9
%post
set -e
dnf -y update
dnf -y install curl --allowerasing 'dnf-command(config-manager)'
dnf config-manager --set-enabled crb
dnf -y install epel-release
dnf -y groupinstall "Development Tools"
dnf -y install gcc gcc-c++ make cmake wget curl which vim git tmux htop btop \
python3 python3-devel python3-pip openssh-server openssh-clients
python3 -m pip install --upgrade pip
ssh-keygen -A
dnf clean all && rm -rf /var/cache/dnf
nvcc --version
%environment
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export APPTAINER_SSHD_PORT=${APPTAINER_SSHD_PORT:-2222}
%startscript
/usr/sbin/sshd -D \
-p ${APPTAINER_SSHD_PORT:-2222} \
-o PidFile=/tmp/sshd.pid \
-o "AuthorizedKeysFile=$HOME/.ssh/authorized_keys" \
-o StrictModes=no \
-o UsePAM=no
%runscript
exec /bin/bash "$@"
In practice, prepare this definition file and build the image on your laptop or another machine where you have sudo access:
sudo apptainer build rocky9_ssh.sif rocky9_ssh.def
If your site still exposes singularity instead of apptainer, the definition file stays the same, and the build command becomes sudo singularity build rocky9_ssh.sif rocky9_ssh.def.
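Before copying the image to the cluster, it is worth a quick smoke test that sshd actually made it into the image. A small sketch (the helper name is mine; it simply shells into the image with whichever runtime you pass):

```shell
#!/bin/bash
# image_has_sshd: check that the sshd binary exists inside the image,
# using the given runtime ("apptainer" or "singularity"). Hypothetical
# helper; the path /usr/sbin/sshd matches the Rocky 9 openssh-server
# package used in the definition file above.
image_has_sshd() {
  local runtime=$1 image=$2
  "$runtime" exec "$image" sh -c 'test -x /usr/sbin/sshd'
}

# Typical call after the build:
#   image_has_sshd apptainer rocky9_ssh.sif && echo "sshd present"
```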
Step 2: Start the instance on the compute node
The image already knows how to start sshd, so the launcher only needs to start the instance, pass the port, and verify the listener.
#!/bin/bash
# start_rocky.sh - Start a Rocky Linux instance with baked-in SSHD
# Usage: ./start_rocky.sh [port] [instance_name] [image_path]
set -euo pipefail
INSTANCE_NAME=${2:-rockynv}
SSHD_PORT=${1:-2222}
SIF=${3:-${SIF_IMAGE:-$PWD/rocky9_ssh.sif}}
COMPUTE_NODE=$(hostname -s)
if command -v apptainer >/dev/null 2>&1; then
CTR=apptainer
elif command -v singularity >/dev/null 2>&1; then
CTR=singularity
else
echo "ERROR: neither apptainer nor singularity was found in PATH"
exit 1
fi
if [ ! -f "$SIF" ]; then
echo "ERROR: container image not found: $SIF"
echo "Pass the image path as arg 3 or set SIF_IMAGE before running this script."
exit 1
fi
find_lib() {
local name=$1
local path
path=$(ldconfig -p 2>/dev/null | grep "$name" | awk '{print $NF}' | head -1)
if [ -z "$path" ]; then
path=$(find /usr/lib64 /usr/lib /usr/local/lib64 -name "${name}*" 2>/dev/null | head -1)
fi
echo "$path"
}
LIBCUDA=$(find_lib "libcuda.so")
LIBNVML=$(find_lib "libnvidia-ml.so")
echo "==> Instance : $INSTANCE_NAME"
echo "==> Compute : $COMPUTE_NODE"
echo "==> Runtime : $CTR"
echo "==> SSHD port : $SSHD_PORT"
echo "==> Image : $SIF"
echo "==> libcuda : ${LIBCUDA:-not found}"
echo "==> libnvml : ${LIBNVML:-not found}"
if $CTR instance list 2>/dev/null | grep -q "^${INSTANCE_NAME} "; then
echo "==> Stopping existing instance: $INSTANCE_NAME"
$CTR instance stop "$INSTANCE_NAME"
sleep 1
fi
BIND_ARGS=""
if [ -n "$LIBCUDA" ] && [ -f "$LIBCUDA" ]; then
BIND_ARGS="$BIND_ARGS --bind ${LIBCUDA}"
fi
if [ -n "$LIBNVML" ] && [ -f "$LIBNVML" ]; then
BIND_ARGS="$BIND_ARGS --bind ${LIBNVML}"
fi
echo "==> Starting instance..."
$CTR instance start --nv $BIND_ARGS \
--env HOME="${HOME}" \
--env SCRATCH="${SCRATCH:-}" \
--env SHARE="${SHARE:-}" \
--env PROJECT="${PROJECT:-}" \
--env PSCRATCH="${PSCRATCH:-}" \
--env APPTAINER_SSHD_PORT="${SSHD_PORT}" \
"$SIF" "$INSTANCE_NAME"
sleep 2
if ss -tln 2>/dev/null | grep -q ":${SSHD_PORT} "; then
echo "==> sshd is listening on port $SSHD_PORT"
echo "==> Jump target: ${COMPUTE_NODE}"
else
echo "ERROR: sshd does not appear to be listening on port $SSHD_PORT"
echo "Check: $CTR exec instance://${INSTANCE_NAME} ps -ef | grep sshd"
exit 1
fi
At this point, the compute node is acting like a temporary SSH jump target for the container.
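On a shared node, another user may already occupy port 2222, and the script above would then report a false positive from their listener. A hedged helper (my own addition, not part of the launcher) that scans upward from a base port until it finds a free one:

```shell
#!/bin/bash
# pick_free_port: return the first TCP port at or above the base port
# that nothing on this node is listening on, according to `ss -tln`.
# If ss is unavailable, the base port is returned unchanged.
pick_free_port() {
  local port=${1:-2222}
  while ss -tln 2>/dev/null | grep -q ":${port} "; do
    port=$((port + 1))
  done
  echo "$port"
}

# Usage with the launcher:
#   ./start_rocky.sh "$(pick_free_port 2222)"
```

Remember to use the same port in the SSH command and config entry on the laptop side.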
Step 3: Connect from your laptop
Replace <login_node> with your cluster login host, <compute_node> with the allocated node name (printed by the startup script, from `hostname -s`), and 2222 with the port you passed to the startup script.
ssh -J <login_node>,<compute_node> -p 2222 YOUR_USERNAME@localhost
For VS Code Remote SSH, adding a dedicated host entry is more convenient:
Host rocky-container
HostName localhost
Port 2222
User YOUR_USERNAME
ProxyJump <login_node>,<compute_node>
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
Once that is in place, VS Code can attach to rocky-container as if it were any other SSH host.
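Remote SSH opens several connections, and each one pays for the full triple-hop handshake. If your OpenSSH supports connection multiplexing, adding it to the same Host block cuts that cost noticeably; this is an optional refinement, not required for the setup to work:

```
Host rocky-container
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h-%p
    ControlPersist 10m
```

With this, the first connection establishes a control socket and later ones reuse it until it has been idle for ten minutes.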
Why this is useful
This setup is worth keeping around if you hit any of these cases:
- The cluster host OS is too old for the current VS Code server.
- You need a newer compiler, Python, CUDA userspace, or glibc than the host provides.
- You want to keep your development stack isolated from the shared node environment.
- You want the convenience of Remote SSH without asking the cluster admins to change the base image.
It is not meant to replace batch jobs or a proper site-supported development workflow. It is mainly a pragmatic escape hatch for debugging, interactive editing, and one-off investigations.
Gotchas
- `StrictModes no` is often necessary on shared filesystems where home directory permissions do not look like a normal standalone Linux box.
- If your site rewrites `$HOME`, `$PROJECT`, or scratch paths in shell startup files, test the container environment with a plain SSH session first.
- Some clusters disable direct compute-node SSH except through the login node, so `ProxyJump` is the important part of the setup.
- If the image is shared broadly, baking host keys into `%post` is not ideal. For a personal development image, it is usually acceptable.
- If your site uses a nondefault SSH port, keep the script argument, the direct `ssh` command, and the SSH config entry in sync.
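Before falling back to `StrictModes no`, it can be worth tightening the modes sshd actually checks; on some HPC filesystems (ACLs, root-squashed NFS) even correct modes still fail the check, which is when the override earns its keep. A small sketch (the function name is mine):

```shell
#!/bin/bash
# fix_ssh_perms: set the directory and key-file modes that sshd's
# StrictModes verification expects. Hypothetical helper; run it on the
# cluster against your real ~/.ssh before relaxing StrictModes.
fix_ssh_perms() {
  local sshdir=${1:-$HOME/.ssh}
  chmod 700 "$sshdir"
  if [ -f "$sshdir/authorized_keys" ]; then
    chmod 600 "$sshdir/authorized_keys"
  fi
}
```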
Closing note
I would trust the %startscript design over a second ad hoc runtime wrapper. It is simpler, easier to reason about, and matches how container instances are supposed to be started. The only thing I still verify manually is whether the port is actually listening, because cluster networking and filesystem behavior vary more than the container itself.