Files and Disk Areas

Ruddle has two main file systems, /ycga-ba and /ycga-gpfs, that contain all of the data on the cluster. They are shared across the cluster, so you can access them from the login nodes and from any compute node.  The four main areas that you’ll be using on a daily basis are (1) the home directories, (2) the lab disk areas (for those in larger labs, where lab members keep their working files), (3) the 60-day scratch space and (4) the sequencing run data.

Home Directories

The home directories currently reside at /ycga-ba/home, although there is a symbolic link to that location from /home (so you can refer to anyone’s home directory using /home/netId).  The home directories have a quota of 500 GB.  This is enough space to store the results from 50 to 80 exome analyses, if you are prudent about keeping only the final bam files and variant files.  (TIP: Don’t keep intermediate copies of bam files, and never store sam files on disk. Always compress them to bam.)
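
As a rough sketch of how you might follow that tip, the function below lists any uncompressed sam files left under a directory. The function name list_sam_files is made up for this example and is not a tool installed on ruddle.

```shell
# List uncompressed SAM files under a directory (default: your home directory).
# list_sam_files is a hypothetical helper name, not an installed ruddle command.
list_sam_files() {
    local dir="${1:-$HOME}"
    find "$dir" -type f -name '*.sam'
}

# Each file found can then be compressed with samtools, for example:
#   samtools view -b -o sample.bam sample.sam
# (sample.sam is a placeholder; remove the sam file only after checking the bam)
```

Running “list_sam_files” with no argument scans your home directory, which can take a while on a large directory tree.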

Running out of disk space in your home directory is one of the more common reasons for weird crashes of analysis pipelines, so check your disk space if a pipeline crashes with an odd error message (or if your result files are empty or incomplete). The myquota.sh program is not yet implemented on ruddle, so if you want to check your home directory usage and have access to bulldogn, log into bulldogn and use the myquota.sh command there.
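
If you cannot use myquota.sh, a rough estimate is possible with the standard du command. The sketch below is an assumption-laden stand-in, not a replacement for the real quota tool: check_usage is a made-up name, the 500 GB default comes from the quota stated above, and the number du reports may not match what the file system’s quota accounting sees.

```shell
# Estimate the space used under a directory and compare it with a quota in GB.
# check_usage is a hypothetical helper; the default quota of 500 GB is the
# home directory quota quoted in this page.
check_usage() {
    local dir="${1:-$HOME}"
    local quota_gb="${2:-500}"
    local used_kb used_gb
    # du -sk prints the total in 1 KiB blocks, a tab, then the path
    used_kb=$(du -sk "$dir" 2>/dev/null | cut -f1)
    used_kb=${used_kb:-0}
    used_gb=$(( used_kb / 1048576 ))
    echo "$dir: ${used_gb} GB used of ${quota_gb} GB"
    # succeed only while usage is below 90% of the quota
    [ $(( used_gb * 100 )) -lt $(( quota_gb * 90 )) ]
}
```

For example, “check_usage /home/netId || echo "nearly full"” prints a warning once the estimate passes 90% of the quota. Note that du walks every file, so it can be slow on a large home directory.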

Scratch Space

If you need additional short-term space to perform analyses, there is 60-day scratch space available in /ycga-gpfs/scratch60, which uses the same “division/PI/netId” structure as /ycga-gpfs/project. If you have a lab folder there and wish to begin using it, contact Rob Bjornson for access. Please remember that all files older than 60 days will be automatically deleted from that disk area, so it should be used only for performing your analyses, and the final result files should be copied to your home directory or lab space.
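
To avoid losing results to the automatic deletion, you could periodically list the scratch files that are getting old. The sketch below assumes the age is judged by modification time, which this page does not confirm, and list_aging_files is a made-up name.

```shell
# List files under a directory that were last modified more than N days ago
# (default 50), i.e. files that may be close to the 60-day deletion.
# Assumption: age is measured by modification time.
list_aging_files() {
    local dir="$1"
    local days="${2:-50}"
    find "$dir" -type f -mtime +"$days"
}
```

You would point it at your own folder under /ycga-gpfs/scratch60, then copy anything you still need to your home directory or lab space.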

Lab “Project” Areas

For each lab that does regular sequencing at the YCGA, there is an extra “project area” in /ycga-gpfs/project for labs and their members.  The project area is organized into a “division/PI/netId” directory structure, where the current divisions are “fas”, “mdi”, “ysm” and “ycga” (for the Faculty of Arts and Sciences, Microbial Diversity Institute, Yale School of Medicine and Yale Center for Genome Analysis).  Inside each division’s directory are directories with the last name of each PI in that division.  Within the PI’s directories are directories for each netId that has requested sequencing.
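
To make the directory layout concrete, here is a small sketch that builds the path for a given division, PI and netId. The function name project_dir is made up for this example. It only assembles the path string and does not check that the directory exists.

```shell
# Build the project-area path from division, PI last name and netId.
# project_dir is a hypothetical helper; the division list is the one given above.
project_dir() {
    local division="$1" pi="$2" netid="$3"
    case "$division" in
        fas|mdi|ysm|ycga) ;;
        *) echo "unknown division: $division" >&2; return 1 ;;
    esac
    echo "/ycga-gpfs/project/$division/$pi/$netid"
}
```

For example, “project_dir ycga mane jk2269” prints /ycga-gpfs/project/ycga/mane/jk2269.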

This is long-term space available to labs to perform their analysis work on the sequencing data generated at the YCGA, and it has a quota proportional to the amount of sequence data generated for the lab.  The quota is sufficient to generate an additional alignment file (such as a BAM file) for the sequencing data, plus any downstream analysis files, but is likely not sufficient for storing many large intermediate files or large datasets not generated by the YCGA.  Please follow best practices for using this space efficiently while working in the project area (see Using Disk Space Efficiently for some of those best practices).

If you are in a lab that is expanding beyond the home directories of your members and/or is expanding beyond the space allocated in the lab’s project area, you can contact Rob Bjornson about the possibility of adding lab disk space.

One thing you should note is that the default Unix permissions for the directories in this area are set to maximize the privacy of the lab’s data, but may not be ideal for your lab’s work.  If you cd into your lab’s directory and run “ls -la”, you will likely see lines like the following (my division is “ycga”, my PI is “mane” and my netId is “jk2269”):

[ruddle@ruddle2 ~]$ cd /ycga-gpfs/project/ycga/mane
[ruddle@ruddle2 mane]$ ls -la
total 0
drwxrws--- 5 root   mane 2048 Mar  2 14:47 ./
drwx--S--- 4 jk2269 mane 2048 Mar  2 23:01 jk2269/

(If you are not familiar with Unix permissions, this tutorial page may help.)  So, the default permissions for the lab directory give full access to anyone in the PI’s group (my group is “mane”), but no access to anyone outside the group.  And the default permissions for each person’s directory (my netId is jk2269) give full access to the user, but no access to the group or others.

[Note: The “s” and “S” letters in the group permissions area denote the “setgid” bit, which means that if you create any files or sub-directories inside a directory with these permissions, those new files will inherit the group name of the directory. What that means is that all of the files and directories in each lab area will always have the PI’s group (so that you have group access to all of the files, even if there are users from other labs who have access to your group). And, if you did not understand that, don’t worry about it. You don’t really need to know it.]
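
If you would like to see the group inheritance for yourself, the following sketch reproduces it in a throwaway temporary directory (nothing here touches the lab area). The directory names are made up, and the behaviour shown is what a standard Linux file system does.

```shell
# Demonstrate group inheritance in a scratch directory created by mktemp.
demo=$(mktemp -d)
mkdir "$demo/lab"
chmod 2770 "$demo/lab"     # the leading 2 turns on the "s" in the group permissions
mkdir "$demo/lab/sub"      # created inside, so it takes the group of "lab"
touch "$demo/lab/file.txt" # same for regular files
ls -la "$demo/lab"
```

The listing should show drwxrws--- for the “lab” directory itself, with “sub” and “file.txt” carrying the same group name.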

If you want to allow more group-based access to files, then you may need to either change the permissions or work outside the defined structure.  For example, if you would like the files that everyone in the lab creates to be accessible by the group, then you can do one of two things. You can create new sub-directories in the lab directory, which you have the permission to do: you are not required to stay within the netId folders, and the lab directory is your lab’s area to use and organize as you see fit. Or you can change the group permissions to allow group access, by having each person run “chmod g+rwx netId” on their netId folder in the lab area.
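
Note that “chmod g+rwx netId” changes only the folder itself. If the folder already contains files, a recursive variant may be what you want. The sketch below is one way to do it, with open_to_group being a made-up name. The capital X adds execute permission only to directories and to files that are already executable.

```shell
# Give the group read/write access to a directory and everything inside it.
# open_to_group is a hypothetical helper name.
open_to_group() {
    local dir="$1"
    # g+rw: group read and write; X: execute on directories (and on files
    # that are already executable) so the group can enter sub-directories
    chmod -R g+rwX "$dir"
}
```

Each person would run it on their own netId folder, since only the owner (or root) can change a folder’s permissions.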

If you would like everyone in the group to have write access to the files and directories created in the area, then you should set the “umask” permissions to allow that, by adding the line “umask 0002” in everyone’s .bashrc file.
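
Here is a small sketch of both halves of that: adding the line to a startup file without duplicating it, and seeing what the setting does to newly created files. The name ensure_umask is made up, and the demonstration uses a temporary directory rather than the lab area.

```shell
# Append "umask 0002" to a startup file unless the line is already there.
# ensure_umask is a hypothetical helper name.
ensure_umask() {
    local rcfile="${1:-$HOME/.bashrc}"
    grep -qxF 'umask 0002' "$rcfile" 2>/dev/null || echo 'umask 0002' >> "$rcfile"
}

# Effect of the setting in the current shell:
umask 0002
work=$(mktemp -d)
touch "$work/results.txt"   # created as rw-rw-r--
mkdir "$work/run1"          # created as rwxrwxr-x
ls -l "$work"
```

The change in .bashrc takes effect in new login sessions. Files created before the change keep their old permissions.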

If you would like to allow others to have some or all access to lab files, you can request that the permissions be changed on the lab directory (your PI will have to ask Rob Bjornson or Jason Ignatius for the permission change).  To allow general read access to the lab area for all ruddle users, request that “world read+execute” permission be set on the lab directory. If you want to allow more limited access, you might be able to request that a “public” folder be exposed where you can place files you would like to make publicly accessible. I do not know that this is supported, but my suggestion would be for you to create a “public” folder in your lab directory, set “world read+execute” permissions on it, then request that Rob/Jason set “world execute” permissions on the lab directory and create a symbolic link to the lab public folder in the parent directory. For example, to do that for the /ycga-gpfs/project/ycga/mane lab space, it would involve the following commands run by me:

mkdir /ycga-gpfs/project/ycga/mane/public
chmod o+rx /ycga-gpfs/project/ycga/mane/public

and then the following commands run by Rob or Jason:

chmod o+x /ycga-gpfs/project/ycga/mane
ln -s /ycga-gpfs/project/ycga/mane/public /ycga-gpfs/project/ycga/mane.public

Sequencing Run Data

All of the Illumina and PacBio sequencers at YCGA are directly connected to the ruddle file systems and write their data to those disks.  The run output files, namely the FASTQ files for newer runs, are maintained on the server, and we do not plan to remove them unless we start running out of disk space. So, our recommendation is that you always use symbolic links to the sequencing run FASTQ files, instead of making a copy of them on ruddle. The ycgaFastq – Accessing YCGA Sequencing Run Data page describes an easy tool to access your samples’ reads and create those symbolic links. If you do need more detailed access to the sequencing run directories, please contact Jim Knight or Rob Bjornson for more information about the data organization.
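
If you already know the directory holding your FASTQ files, linking by hand looks roughly like the sketch below. This is not how ycgaFastq works internally. The function name link_fastqs and the “.fastq.gz” file suffix are assumptions made for this example.

```shell
# Create symbolic links in a working directory for every FASTQ file in a
# source directory. link_fastqs is a hypothetical helper; the *.fastq.gz
# suffix is an assumption about how the files are named.
link_fastqs() {
    local src="$1" dest="$2" f
    mkdir -p "$dest"
    for f in "$src"/*.fastq.gz; do
        [ -e "$f" ] || continue               # pattern matched nothing
        ln -sf "$f" "$dest/$(basename "$f")"  # link, do not copy
    done
}
```

Give the source directory as an absolute path so the links still resolve when you use them from another directory.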
