Gene expression levels vary considerably, and reads from highly expressed genes are over-represented in RNA-Seq datasets. These excess reads do not improve transcript assembly, and digital normalisation can reduce memory requirements and shorten assembly time. Trinity Insilico Normalization is part of the Trinity package (link) and is available on Galaxy-qld.
Normalisation of single-end data is fairly straightforward, but processing paired-end reads with default settings produces different numbers of forward and reverse reads. To avoid this, set the “process paired reads by averaging stats between pairs and retaining linking info” option to Yes. It is a good idea to run FastQC on the output files and check the number of sequences in each file produced by the Trinity Read Normalization tool; see the “normalisation of paired-end reads” history published on Galaxy-qld.
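For files downloaded from Galaxy, the read-count check can also be scripted. The following is a minimal sketch, not part of Trinity or Galaxy; it assumes the common FASTQ layout of exactly four lines per read, and the function names are made up for this illustration.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_reads(lines):
    """Count records in an iterable of FASTQ lines.

    Assumes four lines per record (header, sequence, '+', qualities).
    """
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


def count_reads_in_file(path):
    """Count the reads in one FASTQ file."""
    with open_fastq(path) as handle:
        return count_reads(handle)
```

With hypothetical file names, the forward and reverse outputs are consistent when `count_reads_in_file("forward.fastq") == count_reads_in_file("reverse.fastq")`.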
A Galaxy workshop, Genome assembly using Galaxy, is scheduled for August 2-3, 2017. The event is organised by QFAB. The training will be delivered by QFAB staff and members of the GVL-Qld team. Registration is essential.
Venue: Room 3.141, IMB/QBP, UQ, St Lucia
Start: 9:00 AM, August 2, 2017
End: 12:30 PM, August 3, 2017
Presentations for the workshops are available in pdf format:
Introduction to Galaxy
Genome assembly with Galaxy
On July 5 we will run the workshop An Introduction to Galaxy with the Genomics Virtual Lab for participants of the Winter School in Mathematical and Computational Biology. Galaxy workshops are very popular, and registration is essential. The participants will analyse differential gene expression using high-throughput sequencing data. The workshop is based on the RNA-Seq Basic Galaxy tutorial from the Genomics Virtual Lab.
The workshop is aimed at biologists and does not require IT skills. The presentation is available for download (pdf).
Users have kept Galaxy-qld very busy over the last couple of weeks, with many queued jobs, especially during working hours when user activity is high. The graph below shows user activity on Galaxy-qld on May 25.
Times are in GMT/UTC; add 10 hours for Australian Eastern Standard Time (AEST). User activity goes up in the morning, around 10 AM AEST, and declines after midnight. During the day, users submit a new job every minute.
The majority of submitted jobs complete in less than two hours, as shown in the next graph. However, a small number of jobs run for a long time, sometimes for several days.
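The conversion from UTC to AEST can be sketched in a few lines of Python. AEST is a fixed UTC+10 offset, and Queensland does not observe daylight saving, so a fixed-offset timezone is enough here. The year 2017 in the example is an assumption; only the date May 25 comes from the text above.

```python
from datetime import datetime, timedelta, timezone

# AEST is a fixed offset of UTC+10.
AEST = timezone(timedelta(hours=10), name="AEST")


def utc_to_aest(utc_time):
    """Convert a timezone-aware UTC datetime to AEST."""
    return utc_time.astimezone(AEST)


# Midnight UTC on May 25 (year assumed) is 10:00 AEST on the same day.
utc_midnight = datetime(2017, 5, 25, 0, 0, tzinfo=timezone.utc)
print(utc_to_aest(utc_midnight).strftime("%Y-%m-%d %H:%M %Z"))
# prints: 2017-05-25 10:00 AEST
```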
Cufflinks and Cuffdiff jobs often run for a very long time. Galaxy-qld offers StringTie, a faster alternative to Cufflinks. Tophat jobs can also run for a long time, especially with big datasets. Galaxy-qld provides HiSAT2 with precomputed indices for several genomes, and HiSAT2 is very fast. Assembly jobs usually require significant time. SPAdes and VelvetOptimiser sometimes get stuck and have to be terminated manually. Aligner problems are very often caused by bad data, such as different numbers of reads in paired datasets, a different order of reads in paired data, or an excess of low-quality bases. We recommend Trimmomatic for read trimming, as it preserves proper read pairing in the output files.
To ease the constraints, we increased the number of worker nodes on Galaxy-qld and changed the CPU allocation for some jobs, as well as the number of concurrent jobs per user. The user policy will be modified to provide a better experience for all users, but we can do better with cooperation from our users:
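Paired datasets can be checked for matching read order before they are sent to an aligner. The following is a rough sketch under stated assumptions: four lines per FASTQ record, and read names that are either identical in both files or differ only by a trailing /1 or /2. The function names are invented for this example and do not belong to any Galaxy tool.

```python
from itertools import zip_longest


def read_name(header):
    """Return the read name from a FASTQ header, without the mate suffix."""
    fields = header.split()            # drop any comment after the first space
    name = fields[0] if fields else ""
    if name.startswith("@"):
        name = name[1:]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def headers(lines):
    """Yield every fourth line (the header) from an iterable of FASTQ lines."""
    for index, line in enumerate(lines):
        if index % 4 == 0:
            yield line


def first_pairing_problem(forward_lines, reverse_lines):
    """Describe the first mismatch, or return None if the reads pair up."""
    pairs = zip_longest(headers(forward_lines), headers(reverse_lines))
    for record, (fwd, rev) in enumerate(pairs, start=1):
        if fwd is None or rev is None:
            return f"record {record}: one file has more reads than the other"
        if read_name(fwd) != read_name(rev):
            return f"record {record}: {read_name(fwd)} != {read_name(rev)}"
    return None
```

Both arguments can be open file handles, so the check reads the two files in step and stops at the first problem.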
– do not run a new analysis on many genome-scale datasets. Develop your analysis with a small dataset, if possible.
– do not delete and resubmit queued jobs. Usually jobs are queued because there is no spare capacity on the server, or because the user has exceeded the limit on the number of concurrent jobs. The submitted jobs will be completed eventually.
– use faster tools, such as HiSAT2 and StringTie.
Today we had a Galaxy workshop for postgraduate students at IMB. The participants used the data and protocol from the Basic RNA-Seq Galaxy tutorial provided by the Genomics Virtual Lab. The second half of the workshop was dedicated to Galaxy workflows. The participants created, modified and used a Galaxy workflow.
The workshop was run on Galaxy-tut. The server was updated recently and coped well with the load during the workshop.
The workshop presentation is available for download as a pdf file.
A talk about Galaxy-qld at Griffith University, Gold Coast campus, is scheduled for May 22, 2017, from 12:00 to 13:00. Location: building G40, room 4.111. Seminar title: Galaxy-qld, a local server for genomic research. The seminar is aimed at a broad range of biologists interested in the analysis of high-throughput sequencing data and does not require knowledge of genomics or programming. The event is open to everyone; no registration is required. Presenter: Igor Makunin.
User support for Galaxy-qld will be limited on Monday, May 22, because of the seminar.
UPDATE. Slides are available for download (link to pdf).
Galaxy-qld is getting popular, with over 900 registered users. The server is crunching data 24/7. Demand for computing resources is rising rapidly, and at times Galaxy-qld runs at full capacity, with no spare compute resources. User activity goes up around noon (Brisbane time) and remains high until late evening. We have maxed out the available resources, both storage and CPUs, and introduced changes to the server to improve performance. As of today, registered users can run up to 12 concurrent jobs, including up to four concurrent jobs that require 5 CPUs (most aligners, and some tools from the Cufflinks and GATK2 packages).
Server performance can also be improved by better user practice. If submitted jobs are queued for a long time, check the @GVL_QLD updates on Twitter. The updates are accessible to anyone through the link; you do not need a Twitter account to see the feed.
Do not delete and resubmit queued jobs. Submitted jobs will be completed eventually, usually overnight. It is OK to submit other jobs while you have jobs in the queue.
Not all jobs are identical. All assembly jobs run on 16 CPUs and need a whole worker node, so these jobs tend to stay in the queue for a long time. Simple jobs such as filtering, FastQC or FASTQ Groomer use a single CPU and can slip into any available slot. Most aligners use five CPUs and also tend to queue for some time.
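A toy model shows why large jobs wait longer. This is only an illustration, not the scheduler that Galaxy-qld actually uses: it assumes a simple first-fit rule, and the snapshot of free CPUs is hypothetical. The CPU counts per job type come from the paragraph above.

```python
# CPUs required per job type, as described in the text.
CPUS_PER_JOB = {"assembly": 16, "aligner": 5, "simple": 1}


def find_node(free_cpus, job_type):
    """Return the index of the first node with enough free CPUs, or None."""
    needed = CPUS_PER_JOB[job_type]
    for index, free in enumerate(free_cpus):
        if free >= needed:
            return index
    return None


# Hypothetical snapshot: free CPUs on three partly busy 16-CPU worker nodes.
free_cpus = [1, 6, 3]
for job_type in ("simple", "aligner", "assembly"):
    node = find_node(free_cpus, job_type)
    status = f"can start on node {node}" if node is not None else "stays queued"
    print(f"{job_type}: {status}")
```

In this snapshot ten CPUs are free in total, yet the assembly job stays queued because no single node has all 16 CPUs available, while the single-CPU job starts immediately.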