Download data
ANC datasets are stored as Git repositories in the ANC GitLab.
Downloading a dataset to a local system is done by cloning the dataset repository from its ANC GitLab project. A clone is a local version of the dataset repository. It remains linked to the remote repository, so changes can be made locally and, if you have write permissions, pushed back to the repository. See our guide on how to work locally.
We use Git LFS for large files
If you already know Git and GitLab or GitHub, most of this page will be familiar. However, ANC dataset repositories contain large files that are managed with Git LFS. If you have not used Git LFS before, we recommend reading through the following guide.
LFS stands for Large File Storage. In the GitLab interface, large files such as .nii files may be marked with an LFS label.
The dataset structure, metadata, and small files are managed directly with Git, while large data files, such as imaging files, are handled with Git LFS (see the full list of file formats handled as LFS files in the ANC).
To work with ANC dataset repositories, Git LFS must be installed. Git LFS allows large files to be tracked by Git without storing their full contents directly in the Git repository. Instead, Git stores small pointer files, while the actual large file contents are stored separately in LFS storage.
Git LFS also makes it possible to clone a dataset repository without immediately downloading all large files. This is useful for large datasets, or when only a subset of the data, such as one subject, is needed.
Install Git and Git LFS
Git and Git LFS may already be installed on your system, especially if you are using a Linux operating system. You can check this by running the following commands in your terminal and verifying that version numbers are returned.
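The check can be done with the standard version commands of both tools. The exact version strings depend on your system; the fallback message in the second line is only an illustrative addition.

```shell
# Prints a line starting with "git version" if Git is installed
git --version

# Prints a line starting with "git-lfs/" if Git LFS is installed
git lfs version || echo "Git LFS does not seem to be installed"
```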
If Git or Git LFS is not installed on your system, follow the installation instructions for Git and Git LFS.
Note for Windows
The recommended Git distribution for Windows has Git LFS included.
Set up Git LFS
Open the command line interface and enter the following command:
Using the --skip-smudge flag ensures that LFS files are not downloaded automatically during cloning or checkout. Instead, large files appear as small pointer files until their contents are explicitly pulled.
We recommend installing Git LFS this way, regardless of whether you plan to download the large files immediately. This helps prevent accidentally downloading an entire dataset.
Smudge filter assumption
All ANC guides assume that Git LFS has been set up with the --skip-smudge flag.
Get the clone link
To clone a dataset repository, first open the dataset project in the ANC GitLab. How to find relevant dataset projects is described in Find relevant datasets.
In the dataset project, click the Code button and copy one of the available clone links:
- Clone with SSH, if your SSH key is configured in GitLab.
- Clone with HTTPS, if you want to authenticate with GitLab credentials.
The copied link is used in the git clone command below.
SSH and HTTPS authentication
ANC GitLab supports cloning with SSH and HTTPS. SSH is usually the recommended option once your SSH key has been added to your GitLab account.
However, depending on the current GitLab configuration, Git LFS may still ask for HTTPS credentials to pull LFS files, even when the repository itself was cloned with SSH.
This behavior may change in the future if ANC switches to an SSH-only workflow.
Cloning a repository without LFS files
You can clone a dataset repository without downloading the contents of the LFS files. In this case, each LFS file appears in your local clone as a pointer file: a small text file that refers to the actual large file in LFS storage.
The pointer files can be moved, renamed, or deleted like regular files. Once committed and pushed, such operations apply to the corresponding large files in the dataset repository. However, the actual contents of the large files are not available locally until they are pulled with Git LFS.
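For orientation, a pointer file follows the Git LFS pointer format and typically looks like the example below. The hash and size values are placeholders for illustration, not taken from an actual ANC dataset.

```text
version https://git-lfs.github.com/spec/v1
oid sha256:<64-character SHA-256 hash of the file content>
size <file size in bytes>
```

Opening a file in a text editor and seeing these three lines is a quick sign that its content has not been downloaded yet.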
Open the command line interface, navigate to the directory where the repository should be stored, and enter the following command:
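A sketch of the clone step, assuming Git LFS was set up with `--skip-smudge` as described above. `<clone-link>` is a placeholder for the SSH or HTTPS link copied from the Code button, and `<dataset-name>` stands for the directory that Git creates for the clone.

```shell
# Replace <clone-link> with the link copied from the ANC GitLab project
git clone <clone-link>

# Change into the new local repository
cd <dataset-name>
```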
Getting LFS files
If the contents of the LFS files are required, for example for analysis, they can be pulled after cloning. It is possible to pull all LFS files in a dataset, or only a subset.
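Before pulling anything, it can help to see which files are tracked by Git LFS and whether their contents are already present. One way to do this, run from inside the cloned repository, is the standard `git lfs ls-files` command.

```shell
# List all LFS-tracked files of the checked-out revision.
# After the object ID, "-" marks a pointer file and "*" marks a file
# whose full content is available locally.
git lfs ls-files
```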
Cloning the entire repository with all LFS files
Mind the resources and LFS cache
Depending on the size of the dataset, this process may require significant resources, including network bandwidth, local storage capacity, and time.
By default, a repository with all LFS files downloaded may occupy up to twice as much local disk space as the files themselves, because each downloaded file exists both in the working tree and in the local LFS cache.
Open the command line interface, navigate to the directory where the repository should be stored, and enter the following command:
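A sketch of the full clone, with `<clone-link>` again standing for the link copied from the Code button. The environment variable is set only for this one command, so the `--skip-smudge` setup stays in place for later work.

```shell
# Clone the repository and download the LFS file contents in the same step
GIT_LFS_SKIP_SMUDGE=0 git clone <clone-link>
```

If some files still appear as pointer files afterwards, running `git lfs pull` inside the clone downloads the contents for the checked-out revision.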
By setting GIT_LFS_SKIP_SMUDGE=0, Git LFS downloads the contents of the LFS files during cloning instead of leaving them as pointer files.
Pulling specific LFS files
It is also possible to download only specific LFS files by using filenames or filename patterns.
Open the command line interface and navigate to the cloned dataset repository. Then enter the following command with the relative paths to the files:
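A sketch using the `--include` option of `git lfs pull`. The paths are hypothetical and assume a BIDS-like layout; replace them with paths that exist in your dataset. Several paths can be combined in one comma-separated list.

```shell
# Pull the content of a single file (hypothetical path)
git lfs pull --include "sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz"

# Pull several files at once with a comma-separated list (hypothetical paths)
git lfs pull --include "sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz,sub-02/ses-01/anat/sub-02_ses-01_T1w.nii.gz"
```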
This is a resource-saving option if only a small fraction of the dataset is needed.
Remember to use --include
If you omit --include, git lfs pull downloads all LFS files of the currently checked-out revision.
You can also use a file pattern. For example, to download all runs of the Stroop task within the first session of subject sub-01, use:
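One possible pattern for this case is sketched below. It assumes that functional runs are stored under `sub-01/ses-01/func/` and that the filenames contain `task-stroop`; adjust both to the naming used in your dataset. The quotes keep the shell from expanding the wildcards before Git LFS sees them.

```shell
# Pull all Stroop runs of session 1 for sub-01 (hypothetical layout and task label)
git lfs pull --include "sub-01/ses-01/func/*task-stroop*"
```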
To download all files for one subject, use a subject-level pattern:
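A subject-level pattern could look like this, assuming the subject's data sits in a top-level `sub-01` directory.

```shell
# Pull every LFS file below the sub-01 directory
git lfs pull --include "sub-01/**"
```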
Removing downloaded LFS files from local storage
After downloading selected LFS files, for example one subject, you may later want to free local disk space again. This can be done by replacing the downloaded LFS file contents with pointer files and then pruning the local Git LFS cache.
Run the following commands from inside the cloned dataset repository:
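The sequence below is one way to carry out this cleanup, reconstructed as a sketch rather than a confirmed ANC recipe. It assumes Git LFS was set up with `--skip-smudge`, so that files are written back as pointer files on checkout, and that there are no uncommitted changes you want to keep.

```shell
# 1. Clear the index so that Git treats every file as needing a fresh checkout
git rm -r --cached -q .

# 2. Check out the current commit again; LFS files are rewritten as pointer files
git checkout -f HEAD

# 3. Remove cached LFS objects that Git LFS considers safe to delete
git lfs prune
```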
The first two commands replace locally checked-out LFS file contents with pointer files. The final command removes locally cached LFS objects where possible.
Local cleanup only
This only affects your local clone. It does not delete data from the ANC GitLab repository, the remote LFS storage, or the dataset history. The files can be downloaded again with git lfs pull.
Uncommitted changes
The command git checkout -f HEAD overwrites local changes in the working tree. Only use this cleanup workflow when you do not have uncommitted changes that you want to keep.