Simons Genome Diversity Project datasets
The Simons Foundation's Genome Diversity Project (SGDP) datasets are now available on UPPMAX. They comprise deep human genome sequence data, sampled to capture as much human diversity as possible.
There are currently approximately 14 TB of data, in the form of CRAM files with associated indices and summaries of the BAM files from which the CRAM files were derived.
Our current SGDP data are those aligned to human reference genome GRCh38DH found at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/simons_diversity_data/. The local UPPMAX directory for these data is /sw/data/SGDP/. The command used to collect the data was
echo "mirror data" | lftp ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/simons_diversity_data
As a result, the local UPPMAX archive is found at /sw/data/SGDP/data/. Within this directory are subdirectories for each of the populations included in the full dataset, with individual samples found within each population directory. For example,
rackham1: /sw/data/SGDP $ ls -l data/Greek
total 8
drwxr-s--- 3 douglas kgp 4096 Apr 29 14:03 SAMEA3302732
drwxr-s--- 3 douglas kgp 4096 Apr 29 14:03 SAMEA3302763
and one of these sample directories contains
rackham1: /sw/data/SGDP $ ls -l data/Greek/SAMEA3302732/alignment/
total 34529204
-rw-r----- 1 douglas kgp         635 Nov 30 2020 SAMEA3302732.alt_bwamem_GRCh38DH.20200922.Greek.simons.bam.bas
-rw-r----- 1 douglas kgp 35355769475 Nov 30 2020 SAMEA3302732.alt_bwamem_GRCh38DH.20200922.Greek.simons.cram
-rw-r----- 1 douglas kgp     2079029 Dec  1 2020 SAMEA3302732.alt_bwamem_GRCh38DH.20200922.Greek.simons.cram.crai
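The population/sample/alignment layout shown in these listings can be walked with a short shell function. This is only a minimal sketch: the function name list_sgdp_crams is hypothetical, and it assumes every sample keeps its CRAM file somewhere below its population directory, as the Greek example suggests.

```shell
# Hypothetical helper: print the CRAM files for one population, sorted.
# Assumed layout (from the listings above):
#   <root>/<population>/<sample>/alignment/<name>.cram
list_sgdp_crams() {
    root=$1
    population=$2
    # -name '*.cram' matches the CRAM files but not the .cram.crai indices
    find "$root/$population" -type f -name '*.cram' | sort
}
```

On UPPMAX this would be called as list_sgdp_crams /sw/data/SGDP/data Greek, which requires read access to the directory tree.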
To access these data, please request membership in the kgp group by emailing email@example.com. As with the 1000 Genomes Project, this is not to restrict access in any way, but rather to make it easier to inform UPPMAX users of the datasets about any relevant changes. Because the local copies of these datasets are hosted on UPPMAX systems, access is restricted to UPPMAX users; non-UPPMAX users will need to follow the procedures described on the SGDP website to download their own copies of the datasets.
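Since the files shown above are readable only by the kgp group, it can be useful to check whether membership is active in the current session before trying to read them. The following is a sketch under that assumption; the function name has_group is hypothetical, and it relies only on the standard id -nG command to list the current user's groups.

```shell
# Hypothetical helper: succeed if a group name appears in a space-separated list.
# With no list given, fall back to the current user's groups from `id -nG`.
has_group() {
    wanted=$1
    groups_list=${2-$(id -nG)}
    for g in $groups_list; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

if has_group kgp; then
    echo "kgp membership is active in this session"
else
    echo "not in kgp: request membership before accessing /sw/data/SGDP/data"
fi
```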