ycgaFastq – Accessing YCGA Sequencing Run Data

The ycgaFastq command makes it easy to access and organize the FASTQ files from YCGA sequencing runs. You can describe the location of the sequencing data in a number of different ways; the program then goes into the sequencing run folders, finds the samples and their FASTQ files, and organizes symbolic links to the files on a per-sample basis.

The simplest way to use ycgaFastq is to first create an “analysis” directory, where you will keep the links to the FASTQ files and run the analysis, and then run ycgaFastq with the information about the sequencing data. As an example, if you get an email from the YCGA saying that your sequencing data is ready, and it contains a URL for a sequencing run that looks like this:

http://sysg1.cs.yale.edu:2011/showrun?run=ba_sequencers6/sequencerU/runs/150223_SN880_0282_AB5B5FACXX

you can create your analysis directory, cd into that directory, and then run

ycgaFastq http://sysg1.cs.yale.edu:2011/showrun?run=ba_sequencers6/sequencerU/runs/150223_SN880_0282_AB5B5FACXX

ycgaFastq will figure out the samples and FASTQ files associated with that URL (and your netID), report how many samples it has found, and then ask whether you want it to create the symbolic links to the files. If you say yes, it creates a directory for each sample (using the sample name as the directory name), then creates “Aligned” and “Unaligned” sub-directories, where symbolic links to the Export files (if any) and FASTQ files (if any) are set up.
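The resulting layout can be pictured with a small Python sketch. This is only an illustration of the directory structure described above, not ycgaFastq's own code; the function name link_sample, the sample name S1 and the FASTQ file name are all hypothetical.

```python
import os
import tempfile

def link_sample(analysis_dir, sample_name, fastq_paths):
    """Symlink FASTQ files into <analysis_dir>/<sample_name>/Unaligned."""
    unaligned = os.path.join(analysis_dir, sample_name, "Unaligned")
    os.makedirs(unaligned, exist_ok=True)
    links = []
    for src in fastq_paths:
        dest = os.path.join(unaligned, os.path.basename(src))
        if not os.path.lexists(dest):      # leave an existing link alone
            os.symlink(os.path.abspath(src), dest)
        links.append(dest)
    return links

# Demo: a throwaway directory stands in for the sequencing run area.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = os.path.join(tmp, "run", "Sample_S1")
    os.makedirs(run_dir)
    fastq = os.path.join(run_dir, "S1_ACGTAC_L001_R1_001.fastq.gz")
    open(fastq, "w").close()               # empty placeholder file

    analysis = os.path.join(tmp, "analysis")
    links = link_sample(analysis, "S1", [fastq])
    demo_relpaths = [os.path.relpath(p, analysis) for p in links]
    demo_all_links = all(os.path.islink(p) for p in links)
    print(demo_relpaths, demo_all_links)
```

Because the entries are symbolic links, the analysis directory stays small and each link points back to the original file in the run area.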

ycgaFastq accepts a number of different formats for describing the location of the sequencing data in the YCGA sequencing run area. If you know the path to your sample’s data, you can give that as the command-line argument. It also accepts FC_layout.txt files, Flowcell_info.info files, files containing lines of the form “sampleName pathToSampleDir”, and other formats, and works out the samples and the FASTQ files from them. Run “ycgaFastq” without any arguments for full details on how to run it (or view the same help text below).
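To make the “sampleName pathToSampleDir” file format concrete, here is a minimal Python sketch that reads such a file. It assumes the two fields are separated by whitespace and that an optional first line is a header (as with the -h option); the function name read_sample_list and the example paths are hypothetical, and this is not how ycgaFastq itself is implemented.

```python
import os
import tempfile

def read_sample_list(path, has_header=False):
    """Return (sampleName, sampleDir) pairs from a whitespace-separated file."""
    pairs = []
    with open(path) as handle:
        for index, line in enumerate(handle):
            if has_header and index == 0:
                continue                   # skip the header line
            fields = line.split()
            if not fields:
                continue                   # ignore blank lines
            if len(fields) != 2:
                raise ValueError(
                    f"line {index + 1}: expected 2 fields, got {len(fields)}")
            pairs.append((fields[0], fields[1]))
    return pairs

# Demo: write a two-sample list (hypothetical paths) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    list_file = os.path.join(tmp, "samples.txt")
    with open(list_file, "w") as out:
        out.write("sampleName pathToSampleDir\n")
        out.write("S1 /path/to/Project_abc12/Sample_S1\n")
        out.write("S2 /path/to/Project_abc12/Sample_S2\n")
    pairs = read_sample_list(list_file, has_header=True)
print(pairs)
```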

If you have pre-CASAVA-1.8 runs (where the sequencing runs produced only Export files), ycgaFastq will write a simplequeue and PBS file that can be submitted to convert the Export files to FASTQ.

The tool tries to handle as many as possible of the formats that have been used on bulldogn and ruddle to describe the samples and file locations of the sequencing data, but if you have a format that doesn’t easily fit one of the supported ones, please let me know.

Help Text

Usage:
   ycgaFastq [-t | -f] [-o outputDir] URL
   ycgaFastq [-t | -f] [-o outputDir] pathToSequencingRunProjectOrSample
   ycgaFastq [-t | -f] [-o outputDir] netId URL
   ycgaFastq [-t | -f] [-o outputDir] netId flowcell
   ycgaFastq [-t | -f] [-o outputDir] projectName URL
   ycgaFastq [-t | -f] [-o outputDir] sampleName fastqFile...
   ycgaFastq [-t | -f] [-o outputDir] sampleName directoryPath
   ycgaFastq [-t | -f | -h] [-o outputDir] file

The ycgaFastq program searches for the FASTQ files generated by a YCGA sequencing run, and creates sample directories containing symbolic links to the FASTQ files (or can output a tab-delimited file listing the FASTQ files). The command-line options are:

  • -t – Output a tab-delimited file of the FASTQ files (instead of creating sample directories)
  • -f – Force the creation of sample directories and disable interactive output
  • -h – The file contains a header line which should be skipped
  • -o outputDir – Create the sample directories in “outputDir”

The arguments to ycgaFastq can be in one of several forms (in order to support the common forms of identifying run information):

  • The arguments can be “URL” or “projectName URL”, where the URL is a URL found in the emails sent by the YCGA to announce a sequencing run result, and projectName is the netId of the project containing the data. If no netId is given, the netId of the current user is used. All sample directories matching “URLdir/Data/Intensities/BaseCalls/Unaligned*/Project_projectName/Sample_*” are identified as FASTQ directories, and the sampleName values are extracted from the path.
  • The arguments can be a full path into the sequencing run directories, down to the level of a specific “Project_” or “Sample_” sub-directory. If the path ends at the Project directory, all samples in the Project directory will be included.
  • The arguments can be a netId and flowcell string (like AHJNHMADXX). In this case, the appropriate runpath is identified, and then all of the samples in the “Project_netId” sub-directory are included.
    (Note: If there are multiple Unaligned directories for this run, an error will be reported, and the specific path to the Project directory must be given.)
  • The arguments can be a sampleName and list of FASTQ files or Export files.
  • The arguments can be “sampleName directoryPath”, where the directoryPath is a path to one of the following types of directories:
    • A directory containing FASTQ files or Export files
    • A path to an “Unaligned” sequencing run directory, where the path must be specified down to the “Project_…” or “Sample_…” level (if only specified down to the “Project_…” level, there must be a sub-directory matching “Sample_sampleName*” inside the project directory)
    • A path to an “Aligned” sequencing run directory, specified down to the “Project_…” or “Sample_…” level.
  • A single FC_layout.txt file, as used by Murim’s pipeline.
  • A single Flowcell_info.info file, as used by Murim’s pipeline.
  • A single file containing three columns, “Sample Lane GERALD_Directory”, for any older sequencing runs (pre-1.8).
    Note: For older runs (pre-1.8), the only acceptable formats are the Flowcell_info.info file and this three-column format. All other argument forms look only for the CASAVA 1.8 run folder structure (or treat the given directory as containing the FASTQ files or Export files).

Finally, a single filename can be given, which contains multiple lines in any of the above formats, so you can submit many of these all at once.
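The first argument form above (a URL, optionally preceded by a project name) can be sketched in Python as follows. This is a hedged illustration of the matching rule quoted in the help text, not the program's actual code: the functions run_path_from_url and find_samples, the project name abc12, and the temporary directory standing in for the root of the sequencing run area are all assumptions made for the example.

```python
import glob
import os
import tempfile
from urllib.parse import urlparse, parse_qs

def run_path_from_url(url):
    """Return the value of the 'run' query parameter of a run URL."""
    query = parse_qs(urlparse(url).query)
    return query["run"][0]

def find_samples(run_dir, project_name):
    """Map sampleName -> Sample_ directory under Unaligned*/Project_<name>."""
    pattern = os.path.join(run_dir, "Data", "Intensities", "BaseCalls",
                           "Unaligned*", "Project_" + project_name, "Sample_*")
    samples = {}
    for path in sorted(glob.glob(pattern)):
        name = os.path.basename(path)[len("Sample_"):]
        samples[name] = path
    return samples

url = ("http://sysg1.cs.yale.edu:2011/showrun"
       "?run=ba_sequencers6/sequencerU/runs/150223_SN880_0282_AB5B5FACXX")
run_path = run_path_from_url(url)

# Demo: build a fake run folder under a temporary (hypothetical) root.
with tempfile.TemporaryDirectory() as seq_root:
    run_dir = os.path.join(seq_root, *run_path.split("/"))
    for sample in ("S1", "S2"):
        os.makedirs(os.path.join(run_dir, "Data", "Intensities", "BaseCalls",
                                 "Unaligned", "Project_abc12",
                                 "Sample_" + sample))
    sample_names = sorted(find_samples(run_dir, "abc12"))
print(run_path, sample_names)
```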

Some sequencing runs have multiple “Aligned” or “Unaligned” directories because of run processing issues. In those cases, it may not be possible to determine automatically which Unaligned directory contains the correct FASTQ files, and the program will stop processing at that point and report the situation. The full path to each sample’s Unaligned directory is then required as input in order to identify the FASTQ files correctly. You may need to contact the YCGA personnel to help determine the correct directory for the sequencing run.
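If you want to check a run folder for this situation yourself, a minimal Python sketch is shown below. It assumes the CASAVA 1.8 layout named in the help text (Data/Intensities/BaseCalls/Unaligned*); the function name single_unaligned_dir and the directory name Unaligned_rerun are hypothetical, and the error message is not the one ycgaFastq prints.

```python
import glob
import os
import tempfile

def single_unaligned_dir(run_dir):
    """Return the only Unaligned* directory of a run, or raise ValueError."""
    pattern = os.path.join(run_dir, "Data", "Intensities", "BaseCalls",
                           "Unaligned*")
    candidates = sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
    if len(candidates) != 1:
        raise ValueError(
            "expected exactly one Unaligned directory, found %d: %s"
            % (len(candidates), ", ".join(candidates)))
    return candidates[0]

# Demo: one Unaligned directory is fine, a second one makes the run ambiguous.
with tempfile.TemporaryDirectory() as run_dir:
    basecalls = os.path.join(run_dir, "Data", "Intensities", "BaseCalls")
    os.makedirs(os.path.join(basecalls, "Unaligned"))
    only_one = os.path.basename(single_unaligned_dir(run_dir))

    os.makedirs(os.path.join(basecalls, "Unaligned_rerun"))  # hypothetical name
    try:
        single_unaligned_dir(run_dir)
        ambiguous = False
    except ValueError as err:
        ambiguous = True
        print(err)
```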

By default, the program will create a sample directory for each sample, using the sample name, and then create sub-directories “Unaligned” and/or “Aligned” to hold the symbolic links to FASTQ files and/or export files found in the search. (This provides reliable pointers back into the sequencing data files.) If the sequencing run path information is found as part of the search, the FASTQ files will be given the name “Sample_Flowcell_Lane_R#_###.fastq.gz”, to ensure that the files are uniquely identified. If an arbitrary directory is given as input, the current names of the FASTQ files will be retained in the symbolic links.
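The renaming rule can be illustrated with a short Python sketch. It assumes the standard CASAVA 1.8 FASTQ naming (sample, barcode, lane, read and a three-digit set number), takes the flowcell string to be the last underscore-separated field of the run folder name, and guesses that the lane is kept in its “L003” form; the exact fields ycgaFastq writes may differ, and the function names are hypothetical.

```python
import re

# CASAVA 1.8 FASTQ names look like <sample>_<barcode>_L<lane>_R<read>_<set>.fastq.gz
CASAVA_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_(?P<lane>L\d{3})"
    r"_(?P<read>R[12])_(?P<chunk>\d{3})\.fastq\.gz$"
)

def flowcell_from_run_folder(run_folder):
    """Return the last '_'-separated field of a run folder name."""
    return run_folder.rstrip("/").split("_")[-1]

def unique_link_name(run_folder, fastq_name):
    """Build a Sample_Flowcell_Lane_R#_###.fastq.gz style link name."""
    match = CASAVA_NAME.match(fastq_name)
    if match is None:
        return fastq_name                  # unrecognized name: keep it as-is
    return "{sample}_{flowcell}_{lane}_{read}_{chunk}.fastq.gz".format(
        flowcell=flowcell_from_run_folder(run_folder),
        sample=match.group("sample"),
        lane=match.group("lane"),
        read=match.group("read"),
        chunk=match.group("chunk"),
    )

renamed = unique_link_name("150223_SN880_0282_AB5B5FACXX",
                           "S1_ACGTAC_L003_R1_001.fastq.gz")
kept = unique_link_name("150223_SN880_0282_AB5B5FACXX", "reads.fastq.gz")
print(renamed, kept)
```

Including the flowcell in the name keeps links distinct when the same sample was sequenced on more than one run.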
