Normalisation of paired-end data

Expression of genes varies considerably, and reads corresponding to highly expressed genes are over-represented in RNA-Seq datasets. The excessive reads do not improve transcript assembly and some sort of a digital normalisation can reduce memory requirements and decrease the assembly time. Trinity Insilico Normalization is a part of Trinity package (link) and it is available on Galaxy-qld.

Normalisation of single-end data is fairly straightforward, but processing of paired-end reads with default settings produces different number of forward and reverse reads. To avoid this, set “process paired reads by averaging stats between pairs and retaining linking info” option to Yes. It is good to run FastQC on the output files and check number of sequences in file produced by Trinity Read Normalization tool – see “normalisation of paired-end reads” history published on Galaxy-qld.

 

Advertisements

FASTQ Groomer and fancy FASTQ format

FASTQ Groomer is a very popular tool on Galaxy-qld. It is also misused quite often, when people use it for fastq datatype with Sanger (+33) offset. As explained in Galaxy-qld FAQ, Galaxy assigns fastq datatype for uploaded FASTQ files unless other datatype, such as fastqsanger, is specified during upload. For reads with Sanger (+33) offset FASTQ Groomer only changes the datatype, but in doing so it also creates a new file, an exact copy of the uploaded file. The datatype can be changed from fastq to fastqsanger through Edit Attributes > Datatype without creation of a new file.

FASTQ Groomer is sensitive to FASTQ specification. On Galaxy-qld FASTQ Groomer fails on FASTQ data with different names for sequence and quality score. FASTQ format uses four lines per read. The first line contains the read name. It always starts with @ character. The third line starts with +, and it may contain the read name or just ‘+’ character. FASTQ Groomer works on FASTQ data with identical names in both lines, or reads with just ‘+’ in the third line.

FASTQ data with different names in the first and third lines can be groomed to a standard FASTQ by convertion to tabular format followed by tabular to fastq transformation. Do not forget to change the data type to either fastqsanger or fastqillumina through Edit Attributes.

Counting paired-end reads with HTSeq

HTSeq is a popular tool for read counting in annotated features (detailed instructions for HTSeq count). The usage is straightforward for single-end reads, while paired-end alignments have to be sorted by queryname and converted to SAM. Just in case: I am writing about HTSeq count available on Galaxy-qld.

HTSeq count works with alignments sorted by name or position, wth former being the default option (link). The HTSeq count wrapper does not provide option for selection of read order (-r) in Galaxy. This is an important point, as HTSeq counts units of evidence, or fragments for paired-end data. With the default option HTSeq expects paired reads to be next to each other in SAM/BAM files. Paired reads in non-adjacent might be reported as reads with missing mate in the HTSeq log file#, and non-adjacent mate reads can be count as independent. Basically, incorrect settings may easily produce  ~double fragment count on a coordinate-sorted paired-end alignment (and make p-values very convincing 😉 ). There are some other implications, too, e.g in situations when two mate / paired reads are mapped to different features (genes). If these reads are queryname-sorted, the fragment is considered as ambiguous. If the reads are separated in a position-sorted alignment, HTSeq may assign these reads to two features.

BAM files can be sorted by queryname using SAMtools sort or Picard SortSam command. However, HTSeq produces an empty output with queryname-sorted BAM files. To make it work, convert BAM to SAM with SAMtools BAM-to-SAM.

The recommended procedure is illustrated in history Counting reads with HTSeq for paired-end data published on Galaxy-qld. To illustrate the difference, I’ve merged the count tables from the coordinate-sorted alignments (default tophat2 outputs) and the queryname-sorted SAM files into dataset 23. Beside the obvious difference in the read count, several genes with very low read count (HRA1, YAL016C-A ect) are present only in coordinate-sorted alignments. Note that HTSeq produced an empty dataset with queryname-sorted BAMs (file 19). While the dataset 19 is marked in green as a completed job, the error message is available in info (i) > stderr:
*** Error in `python’: double free or corruption (!prev): 0x000000000265e310 ***

# HTSeq log files are accessible through info (i) > stdout.

Variant calling with MPileup and bcftools

The bcftools package, a popular choice for variant calling, had a major update. The variant calling command bcftools view is now replaced with bcftools call that has a new multiallelic calling model. The old variant calling model is also available in bcftools view with Consensus Caller option. Confusingly, bcftools subset option is now renamed as bcftools view.

Galaxy-qld now has the latest versions of bcftools call and bcftools view. Note that new bcftools view tool is used for manipulation of VCF or BCF files (view, subset, filter) and bcftools call should be used for variant calling. A history ‘Variant calling with MPileup and bcftools’ with the latest bcftools call tool is available in Shared Data > History. In this history the bcftools call tool is used with Consensus Caller option, as a demo for the variant calling model used in old bcftools view. Note that the new multiallelic calling model is recommended for most tasks.

Data backup

Galaxy-qld does not have an external backup. The server is provided for data analysis, not for file storage. We recommend users save their results to an external storage as soon as convenient and delete unneeded datasets from the server.

Galaxy provides several options for data download. All datasets can be downloaded to a users computer by clicking on the Save icon. Send Data > GenomeSpace Exporter tool can be used to save the results to external storage. GenomeSpace can manage online storage such as Dropbox, so files can be sent directly to Dropbox.

Users can export an entire history to a file and download the archive. This option comes with several caveats. Archiving of big histories can take a long time. Some archiving jobs run for days. The resulting files can be very big, in range of 10s or 100s GB. Hint: delete and purge all unneeded files before archiving. The histories can be uploaded to Galaxy only via a link. Try this option on a small history, to check the usability of this approach.

Individual datasets can be downloaded to a server on the command line with wget or curl commands (for details check GVL_FAQ). Note that BAM files require a special approach: left mouse click on the Save icon provides an access to alignment and index files, where links can be copied with a right mouse click.

Moving data between Galaxy instances

Data can be copied directly from one Galaxy server to another. Users have several options. For individual datasets:

  1. Click on name of file to expand the info.
  2. Right click on Save icon > Copy Link. Note that for BAM files the Save (Download) icon provides access to both alignment (BAM) and index (BAI) files. For BAM files use left mouse click to access BAM and index files. Move the cursor over the BAM file and use the right click to copy the link.
  3. On another Galaxy server: go to upload menu > Paste / Fetch Data and paste the link. Select attributes, such as genome assembly, if required. Hit the Start button.

Archived histories can be copied between Galaxy servers, with some caveats. First, there is no obvious notification for users about completion of the archiving. Hint for Galaxy-qld users: email Igor Makunin to get a notification if you export a big history to a file. Second, it takes long time to archive big histories on Galaxy-qld. Generally the archiving tool adds ~2 GB on per hour for histories exported as a file. Third, the size of archive might be an issue, especially for users with big disk allocation.

The procedure for histories:

  1. Keep only datasets you want to save. Delete and purge all unneeded files. Alternatively, copy valuable datasets into a new history
  2. History menu > Export to File
  3. Copy the link to the file from the middle (working) window. It will be used later.
  4. Make the history Accessible by clicking on the corresponding link in the middle window.
  5. After completion of archiving go to another Galaxy server. History menu > Import from File. The new history will be created. It takes some time, especially for big histories.
  6. To access the imported history, in History menu > Saved Histories and select the new history imported from archive: name.

User data can be moved through an external storage with GenomeSpace if both Galaxy instances are connected to the same GenomeSpace server, and you have a sufficient storage allocation linked to GenomeSpace. In this situation users also get benefits of a data backup.