Data preprocessing in the ANC
The Austrian NeuroCloud (ANC) repository is powered by its own hardware residing in the SCC cluster, offering users the ability to execute data processing directly on the cluster. This setup ensures that the data physically resides within the cluster, significantly reducing data transfer times and enhancing processing efficiency. The hardware infrastructure is funded by the Digital Neuroscience Initiative project.
Resources
To follow this tutorial, you need a basic knowledge of Bash. If you have not used Bash before, have a look at this Bash Scripting Tutorial.
Key resources for this documentation are linked in the relevant steps below.
1. Get an Account
During the pilot phase of the SCC cluster, access is provided on demand to a limited number of users. If you are an ANC user and wish to process your data on the SCC cluster, please send an email to anc@plus.ac.at to request an account.
2. Connect to the SCC
Please follow the SCC-documentation and connect to the login node.
Note that the login node should only be used for job management and basic file operations in the home directory. Computation jobs must not be executed on the login node, but on the compute nodes using the srun command.
ssh USERNAME@login01.scc-pilot.plus.ac.at
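Optionally, you can add a host entry to your local SSH configuration so that a short alias is enough to connect (a minimal sketch; the alias name scc is just an example):
# ~/.ssh/config on your local machine
Host scc
    HostName login01.scc-pilot.plus.ac.at
    User USERNAME
With this entry in place, ssh scc connects you to the login node.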
Type squeue to see the current job queue; once you have jobs running, it lists your username and the node each job runs on.
For tasks that require more working memory, you need to use the compute nodes and therefore prefix your commands with srun. Type the following commands, and you will notice that they execute on different nodes.
hostname
srun hostname
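The first command runs on the login node, while srun dispatches the second one to a compute node, so the printed hostnames differ. Illustratively (the actual node names depend on your allocation):
login01.scc-pilot.plus.ac.at    # output of hostname (login node)
node09.scc-pilot.plus.ac.at     # output of srun hostname (a compute node)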
3. Pull the data
To view your current working directory, type pwd. You should be in /home/USERNAME/.
Next, configure Git following the instructions to download your data from the ANC.
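The exact steps are in the linked instructions; typically they amount to registering an SSH key with the ANC data server and enabling Git LFS for your user. As a hedged sketch (not the authoritative ANC procedure):
git lfs install          # enable Git LFS, which the imaging files in the dataset require
ssh-keygen -t ed25519    # create a key pair and add the public key (~/.ssh/id_ed25519.pub) to your ANC account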
Clone the example SOCCER dataset with all large files (imaging files stored using Git LFS). Since cloning the data is a resource-intensive task, use srun to execute it as a SLURM job.
srun git clone git@data.anc.plus.ac.at:bids-datasets/neurocog/soccer.git
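Once the clone has finished, you can optionally verify that the large imaging files were materialized by Git LFS rather than left as small pointer files (git lfs pull is only needed if LFS content is missing):
cd soccer
git lfs ls-files | head    # list files tracked by Git LFS
srun git lfs pull          # fetch any LFS content that was not downloaded during the clone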
4. Get the required software
Apptainer is a container platform that allows you to package software dependencies in a portable and reproducible way. You can build a container with Apptainer on your laptop or use an existing one, and then run it in the ANC or SCC cluster. In this example, we will use the containerized neuroimaging pipeline, fMRIPrep. To start, check if Apptainer is installed (it should be available on the cluster).
apptainer --version
Next, create a directory where you want to save your Apptainer images. For building the image, use srun to allocate cluster resources for the process:
mkdir apptainer
cd apptainer
srun apptainer build fmri_prep_24.1.0.sif docker://nipreps/fmriprep:24.1.0
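After the build finishes, you can quickly check that the container works by asking it for its fMRIPrep version (a small sanity check; it should print the 24.1.0 release):
srun apptainer exec fmri_prep_24.1.0.sif fmriprep --version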
5. Test run
Now run fMRIPrep for one subject. Again, use srun to execute your job on the cluster.
Create directories soccer_prep and soccer_wrk to store the fMRIPrep outputs and intermediate working files. The following command is an example execution of fMRIPrep; use the fMRIPrep documentation to adjust it to your needs.
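For example, create them directly in your home directory so the paths match the bind mounts used below:
mkdir -p /home/USERNAME/soccer_prep /home/USERNAME/soccer_wrk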
The --bind /home/USERNAME/soccer:/data option in Apptainer maps the directory /home/USERNAME/soccer on the host system to the /data directory inside the container, so that files in /home/USERNAME/soccer are accessible within the container under /data.
# Bind mounts: /home/USERNAME/soccer -> /data (input data),
#              /home/USERNAME/soccer_prep -> /out (preprocessed output),
#              /home/USERNAME/soccer_wrk -> /work (working directory).
# fmri_prep_24.1.0.sif is the Apptainer image built above; free_license.txt is your FreeSurfer license file.
srun apptainer run --no-mount bind-paths --cleanenv \
    --bind /home/USERNAME/soccer:/data \
    --bind /home/USERNAME/soccer_prep:/out \
    --bind /home/USERNAME/soccer_wrk:/work \
    fmri_prep_24.1.0.sif \
    /data \
    /out \
    participant \
    --participant-label sub-0124 \
    --fs-license-file /home/USERNAME/apptainer/free_license.txt \
    --fs-no-reconall \
    -w /work
Execute the command, making sure your directories are correct.
Note that for the example (SOCCER) dataset this command takes around 10 minutes.
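When the job has finished, a quick look at the output directory shows whether fMRIPrep produced derivatives for the subject (the exact contents depend on your fMRIPrep version and options):
ls /home/USERNAME/soccer_prep
# expect a sub-0124/ directory with the preprocessed data and an HTML report (sub-0124.html)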
6. Run in parallel
The cluster hardware is designed to handle multiple resource-intensive jobs simultaneously. This setup allows you to run processes like fMRIPrep across multiple subjects at once, maximizing efficiency. Below is an example batch script that requests the necessary cluster resources to execute fMRIPrep for three subjects in parallel.
The comments in the script explain the specific SLURM parameters for each execution (here, per job and thus per subject). Save the script as prep_job.sh in /home/USERNAME/apptainer.
#!/bin/bash
#
#SBATCH --job-name=prep_job # Job Name
#SBATCH --time=00:10:00 # ADJUST!!! Set a limit on the total run time of the job allocation.
#SBATCH --cpus-per-task=4 # ADJUST!!! Number of CPU cores per fMRIPrep execution/subject
#SBATCH --mem-per-cpu=4G # ADJUST!!! Memory preallocated per CPU core
#SBATCH --array=0-2 # ADJUST!!! Job array indices (3 jobs/subjects) / e.g. for 80 subjects it would be 0-79
#SBATCH --output=log/%A/job_%a.out # Filename for log output per array task. %A is the job ID and %a the array index
# Array of subjects
subjects=("sub-0124" "sub-0426" "sub-0811") # ADJUST!!! add a list of your subjects here
# Get the current subject based on the Job array
SUBJECT=${subjects[$SLURM_ARRAY_TASK_ID]}
# The fMRIPrep command starts here.
# Bind mounts: /home/USERNAME/soccer -> /data (input data),
#              /home/USERNAME/soccer_prep -> /out (preprocessed output),
#              /home/USERNAME/soccer_wrk -> /work (working directory).
# The FreeSurfer license (free_license.txt) can be downloaded from https://surfer.nmr.mgh.harvard.edu/fswiki/License
srun apptainer run --no-mount bind-paths --cleanenv \
    --bind /home/USERNAME/soccer:/data \
    --bind /home/USERNAME/soccer_prep:/out \
    --bind /home/USERNAME/soccer_wrk:/work \
    fmri_prep_24.1.0.sif \
    /data \
    /out \
    participant \
    --participant-label "$SUBJECT" \
    --fs-license-file /home/USERNAME/apptainer/free_license.txt \
    --fs-no-reconall \
    -w /work
Note that for testing, we’ve set the time limit to 10 minutes, but if you want the script to complete, you'll need to adjust the runtime to your data.
To submit the jobs, type:
sbatch prep_job.sh
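If you first want to validate the batch script and get an estimate of when the jobs would start, without actually submitting anything, sbatch offers a dry-run mode:
sbatch --test-only prep_job.sh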
7. Monitor jobs
To view all the jobs you're running, type squeue -u USERNAME. In our example you should see three jobs running with the name fmri_pre, one per subject, probably distributed over different nodes.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
290491_0 base fmri_pre USERNAME R 0:05 1 node09.scc-pilot.plus.ac.at
290491_1 base fmri_pre USERNAME R 0:05 1 node10.scc-pilot.plus.ac.at
290491_2 base fmri_pre USERNAME R 0:05 1 node10.scc-pilot.plus.ac.at
If you want to cancel a job, type scancel <JOBID>. For example, to cancel the job with Job ID 290491, execute scancel 290491; to cancel multiple array tasks, execute scancel 290491_[0-2].
After your jobs have finished, you can check how much memory they used and specify it more precisely in the future. For example, to do this for the first subject of job 290491, type seff 290491_0.
Job ID: 290491
Array Job ID: 290491_0
Cluster: openhpc
User/Group: kbenz/slurm_sbdl
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:20:00
CPU Efficiency: 65.50% of 00:30:32 core-walltime
Job Wall-clock time: 00:07:38
Memory Utilized: 11.69 GB
Memory Efficiency: 73.09% of 16.00 GB
Job Wall-Clock Time: the total runtime of a job from start to finish.
Core-Wall Time: the cumulative time across all cores used during the job's execution (Core-wall time = Wall-clock time x Cores per node; here 00:07:38 x 4 = 00:30:32).
Memory Utilized: the total memory consumption across all cores during the job.
Memory Efficiency: the share of the allocated memory that was actually used (here 11.69 GB of 16.00 GB, i.e. about 73%). If you allocate too little memory, your job will run out of memory and fail; if you allocate far too much, it will most likely spend longer queued on the cluster. You therefore want to adjust the memory settings in your script to your data; we recommend targeting an efficiency between 65% and 85%.
Hints
After some piloting with fMRIPrep, we recommend 8 cpus-per-task and setting mem-per-cpu as small as possible.
The total memory per job is cpus-per-task multiplied by mem-per-cpu.
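As an illustrative sketch (the 2G value is only an example; choose mem-per-cpu based on your own seff results), the recommended settings could look like this:
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2G    # total memory per subject: 8 x 2G = 16G, close to the ~12 GB used in the test run above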
If you need to clone a big dataset, you can also do this inside a screen session, so the clone won't break if you disconnect from the SCC.
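A minimal screen workflow for a long-running clone looks like this (assuming screen is available on the login node):
screen -S clone        # start a named screen session
srun git clone git@data.anc.plus.ac.at:bids-datasets/neurocog/soccer.git
# detach with Ctrl-a d; the clone keeps running while you are disconnected
screen -r clone        # reattach later to check the progress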