RNA_STAR index for TAIR10 assembly

At the request of our users, we added an RNA_STAR index for the ‘TAIR10’ assembly (see this post for an explanation) to Galaxy-qld.

We noticed that with the default settings RNA_STAR cannot map the majority of reads in some RNA-Seq datasets from Arabidopsis. Here is an extract from the log file:

           Uniquely mapped reads % | 3.27%
% of reads mapped to multiple loci | 10.06%
    % of reads unmapped: too short | 86.64%

Read a detailed explanation of the ‘too short’ classification from Alexander Dobin.

The proportion of mapped reads can be increased by modifying the alignment settings. Note that the procedure is described for RNA STAR Gapped-read mapper for RNA-seq data (Galaxy Version 2.4.0d-2). For additional information, check the relevant RNA_STAR threads, such as this one.
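The percentages in a log extract like the one above can be pulled out with a few lines of Python. This is only a minimal sketch: it assumes every statistic sits on its own line in the form `name | value%`, as in the extract, and the function name `parse_star_log` is ours, not part of RNA_STAR or Galaxy.

```python
def parse_star_log(text):
    """Return {statistic name: percentage as float} for lines like 'name | 3.27%'."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip headers and blank lines
        name, _, value = line.partition("|")
        value = value.strip()
        if value.endswith("%"):
            stats[name.strip()] = float(value.rstrip("%"))
    return stats


# The three lines quoted from the log file in this post.
LOG_EXTRACT = """\
           Uniquely mapped reads % | 3.27%
% of reads mapped to multiple loci | 10.06%
    % of reads unmapped: too short | 86.64%
"""

stats = parse_star_log(LOG_EXTRACT)
mapped = stats["Uniquely mapped reads %"] + stats["% of reads mapped to multiple loci"]
too_short = stats["% of reads unmapped: too short"]
print(f"mapped: {mapped:.2f}%, too short: {too_short:.2f}%")
```

For the extract above, this reports 13.33% mapped reads against 86.64% classified as ‘too short’.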

Set “Would you like to set output parameters (formatting and filtering)?” to Yes.

Set “Would you like to set additional output parameters (formatting and filtering)?” to Yes.

Reduce the default value of 0.66 for the following filter options (the value can be as low as 0):
Minimum alignment score, normalized to read length
Minimum number of matched bases, normalized to read length (--outFilterMatchNminOverLread)

Set “Other parameters (seed, alignment, and chimeric alignment)” to Yes.

Set “Would you like to set alignment parameters?” to Yes.

Reduce the value of “Minimum mapped length for a read mate that is spliced, normalized to mate length” (--alignSplicedMateMapLminOverLmate) from the default 0.66 to something smaller.

Inspect the alignment to make sure you are happy with the mapping.
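For readers who run STAR outside Galaxy, the settings above correspond roughly to the command-line options assembled below. Treat this as a sketch under stated assumptions: we assume the Galaxy option “Minimum alignment score, normalized to read length” maps to `--outFilterScoreMinOverLread`, the value 0.3 is purely illustrative and not a recommendation, and the index directory and read file names are hypothetical. The code only builds and prints the command; it does not run STAR.

```python
import shlex

STAR_DEFAULT = 0.66  # default quoted in this post for all three options


def relaxed_star_args(filter_fraction, spliced_mate_fraction):
    """Build STAR options with thresholds reduced below the 0.66 default."""
    for value in (filter_fraction, spliced_mate_fraction):
        if not 0 <= value < STAR_DEFAULT:
            raise ValueError(f"expected a value in [0, {STAR_DEFAULT}), got {value}")
    return [
        # Assumed counterpart of "Minimum alignment score, normalized to read length".
        "--outFilterScoreMinOverLread", f"{filter_fraction:g}",
        "--outFilterMatchNminOverLread", f"{filter_fraction:g}",
        "--alignSplicedMateMapLminOverLmate", f"{spliced_mate_fraction:g}",
    ]


# Hypothetical index directory and read files; 0.3 is an illustrative value only.
command = [
    "STAR",
    "--genomeDir", "tair10_index",
    "--readFilesIn", "reads_1.fastq", "reads_2.fastq",
] + relaxed_star_args(0.3, 0.3)

print(shlex.join(command))
```

Keeping the filter thresholds and the spliced-mate threshold as separate arguments mirrors the Galaxy form, where they are set in different sections.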


QFAB workshop: Variant detection using Galaxy

Another Galaxy workshop from QFAB Bioinformatics is scheduled for October 11-12, 2017.

Title: Variant detection using Galaxy

Venue: Room 3.141, Queensland Bioscience Precinct, The University of Queensland, St Lucia
Starts Wed, 11/10/2017, 09:00; ends Thu, 12/10/2017, 12:30.

Registration is essential.

Presentations from the workshop (pdf):
Introduction to Galaxy
Variant calling using Galaxy platform
Galaxy workflows

Instructions for the Galaxy workflow exercise: pdf file or Word document.

Disruption to Galaxy-qld service on Monday, September 11

Our IT provider requested a temporary shutdown of Galaxy-qld on Monday, September 11, 2017, around 9 am Brisbane time, to fix a hardware fault. We understand the repair may take hours, but the exact duration is unknown. Updates on the situation are available through the GVL-Qld Twitter account @GVL_QLD.

Galaxy-qld will not accept new jobs from September 9.

The event will not affect user data.

UPDATE. September 11, 3:05 pm. Galaxy-qld is back online. Initial tests indicate the server is fully functional.

Arabidopsis thaliana resources on Galaxy-qld

A new annotation of the Arabidopsis thaliana genes, Araport11, was recently added to the Arabidopsis thaliana gene annotations data library on Galaxy-qld. The dataset was imported from ARAPORT and modified for compatibility with the existing Arabidopsis assemblies. This post provides an overview of A. thaliana resources on Galaxy-qld.

The very first version of Galaxy-qld had the TAIR9 assembly represented by five chromosomes, with the following contig names: chr1, chr2, chr3, chr4 and chr5. It does not include the mitochondrial or chloroplast genomes.

Later, at the request of our users, we added the TAIR10 gene annotation to the Arabidopsis thaliana gene annotations data library. This annotation includes genes from Mt and Pt. It uses just numbers (1, 2, 3, 4, 5) for chromosome names. The TAIR10 genomic sequence is identical to TAIR9 (link). To provide our users with greater flexibility, we added TAIR10 aligner indices to Galaxy-qld. The TAIR10 assembly contains the following contigs: 1, 2, 3, 4, 5, Mt and Pt.

The Araport11 gene annotation is based on the TAIR10 genome assembly (link), which is identical to the TAIR9 assembly. The original annotation comes with the following contig names: Chr1, Chr2, Chr3, Chr4, Chr5, ChrC, ChrM. [no comments on standard nomenclature here] To make the Araport11 annotation compatible with the TAIR9 assembly available on Galaxy-qld, we replaced ‘Chr’ with ‘chr’. To make it compatible with TAIR10, we removed ‘Chr’ from the contig names, replaced ChrC and ChrM with Pt and Mt, respectively, and sorted the records in the same order as in the TAIR10 assembly: 1, 2, 3, 4, 5, Mt, Pt. The modified annotation is available in the Arabidopsis thaliana gene annotations data library under the name Araport11_GFF3_genes_transposons.201606.modified.gtf.
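The renaming described above can be sketched in Python as follows. This illustrates the rules stated in this post and is not the script we actually used. It assumes a tab-separated annotation with the contig name in the first column, and the function names and example records are made up for the illustration.

```python
TAIR10_ORDER = ["1", "2", "3", "4", "5", "Mt", "Pt"]
ORGANELLES = {"ChrC": "Pt", "ChrM": "Mt"}


def to_tair9(name):
    """Chr1 -> chr1: replace the 'Chr' prefix with 'chr'."""
    return "chr" + name[3:] if name.startswith("Chr") else name


def to_tair10(name):
    """Chr1 -> 1, ChrC -> Pt, ChrM -> Mt."""
    if name in ORGANELLES:
        return ORGANELLES[name]
    return name[3:] if name.startswith("Chr") else name


def rename_contigs(lines, rename):
    """Apply `rename` to the first column of every record; keep comments as they are."""
    renamed = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            renamed.append(line)
            continue
        fields = line.split("\t")
        fields[0] = rename(fields[0])
        renamed.append("\t".join(fields))
    return renamed


def sort_like_tair10(lines):
    """Order records by contig as in TAIR10; the stable sort keeps order within a contig."""
    records = [l for l in lines if l.strip() and not l.startswith("#")]
    return sorted(records, key=lambda l: TAIR10_ORDER.index(l.split("\t")[0]))


# Made-up records for illustration only (not taken from Araport11).
example = [
    "ChrC\tsrc\tgene\t1\t100\t.\t+\t.\tID=a",
    "ChrM\tsrc\tgene\t1\t100\t.\t+\t.\tID=b",
    "Chr2\tsrc\tgene\t1\t100\t.\t+\t.\tID=c",
    "Chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=d",
]
tair10_records = sort_like_tair10(rename_contigs(example, to_tair10))
print([record.split("\t")[0] for record in tair10_records])
```

A contig name that is not in `TAIR10_ORDER` makes `sort_like_tair10` raise a `ValueError`, which is a convenient way to notice unexpected names before using the file.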

Normalisation of paired-end data

Expression of genes varies considerably, and reads corresponding to highly expressed genes are over-represented in RNA-Seq datasets. The excess reads do not improve transcript assembly, so some form of digital normalisation can reduce memory requirements and decrease the assembly time. Trinity Insilico Normalization is part of the Trinity package (link) and is available on Galaxy-qld.

Normalisation of single-end data is fairly straightforward, but processing paired-end reads with the default settings produces different numbers of forward and reverse reads. To avoid this, set the “process paired reads by averaging stats between pairs and retaining linking info” option to Yes. It is a good idea to run FastQC on the output files and check the number of sequences in the files produced by the Trinity Read Normalization tool; see the “normalisation of paired-end reads” history published on Galaxy-qld.
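If the normalised files are downloaded from Galaxy, the same check can be done without FastQC by counting records in both files. This is a minimal sketch that assumes plain four-line FASTQ records (optionally gzip-compressed); the function names and file names are hypothetical.

```python
import gzip


def count_fastq_records(path):
    """Count reads in a FASTQ file, assuming exactly four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def pairs_are_balanced(forward_path, reverse_path):
    """True when the forward and reverse files hold the same number of reads."""
    return count_fastq_records(forward_path) == count_fastq_records(reverse_path)


# Usage with hypothetical file names:
#   pairs_are_balanced("normalised_1.fastq", "normalised_2.fastq")
```

Equal counts do not prove that the mates are still in matching order, so this complements the inspection of the output rather than replacing it.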


Galaxy workshop at Winter School – 2017

On July 5 we are running the workshop An Introduction to Galaxy with the Genomics Virtual Lab for participants of the Winter School in Mathematical and Computational Biology. Galaxy workshops are very popular, and registration is essential. The participants will analyse differential gene expression using high-throughput sequencing data. The workshop is based on the RNA-Seq Basic Galaxy tutorial from the Genomics Virtual Lab.

The workshop is aimed at biologists and does not require IT skills. The presentation is available for download (pdf).