A major upgrade for Galaxy Australia server

May 22, 2018, 8:00 AM. We are carrying out a major upgrade of the Galaxy Australia server. Among other things, the upgrade will bring a new version of Galaxy and expand the range of tools. Some old and obsolete tools will be replaced with new wrappers; for example, the server will have new Trinity and SPAdes packages. Users will have to modify their workflows and replace the discarded tools with the new versions. The re-run option will not work on datasets created with the discarded tools.

Some tools will have new versions available alongside the older versions. In this case existing workflows and the re-run option should continue to work.

Some known issues.

Metadata has been lost on some old datasets. These jobs do not display the correct settings with the “Run this job again” option. The issue has been known for some time and is not related to the upgrade.

HiSAT2 and RNA_STAR aligners do not assign a database (genome assembly) to BAM files. At the moment users have to assign the correct database/genome assembly manually or through a Galaxy workflow. Note that BAM files without an assigned genome assembly cannot be displayed in the UCSC Genome Browser and might be visualised against the wrong assembly in IGV.

Htseq-count (Galaxy Version 0.9.1) fails with the default settings. It works when “Additional BAM Output” is set to “Output additional BAM file” in the advanced settings.

On a separate note, we are ceasing all activity on this blog. A new platform for technical updates will be determined by our management at some point in the future.

Galaxy talk at Centre for Digital Scholarship, UQ

We are giving a talk about Galaxy on Wednesday 9 May 2018, 12:00pm – 1:00pm, at the Centre for Digital Scholarship, UQ library. The venue: room 525, Level 5, Duhig tower (Building 2), St Lucia (above Merlo Cafe). Registration is required; see the announcement at the Centre for Digital Scholarship for additional information. It is an introductory talk aimed at people who are new to the Galaxy platform and interested in analysis of high-throughput sequencing data. The seminar includes a Galaxy demo of RNA-Seq analysis and the creation of a Galaxy workflow.

Genome assembly with SPAdes

SPAdes is a popular choice for genome assembly on Galaxy Australia. SPAdes was recently upgraded to Galaxy Version 3.11.1. On Galaxy Australia every SPAdes job uses 16 CPUs and ~60 GB of RAM. Each user can have only one active 16-CPU job at a time. Users can submit multiple assembly jobs, but the additional jobs stay queued until the active assembly job completes.

Here we provide general recommendations for genome assembly with SPAdes on Galaxy Australia. The recommendations are based on an analysis of failed SPAdes jobs.

Make sure the reads are properly paired: both FASTQ files must contain the same number of reads, and the two reads of each pair must occupy the same position in their respective files. Use Trimmomatic to trim paired reads. Check the number of reads in paired-end data before the assembly.
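The pairing check described above can be sketched in a few lines of Python. This is only an illustration, not a tool available on Galaxy Australia: the function names are hypothetical, and the code assumes well-formed four-line FASTQ records whose mate IDs differ only by an optional /1 or /2 suffix or by a comment after the first whitespace.

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def read_ids(lines):
    """Yield the read ID from every 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 != 0 or not line.strip():
            continue  # only header lines carry IDs; ignore a trailing blank line
        name = line[1:].split()[0]  # drop '@' and any comment after whitespace
        if name.endswith(("/1", "/2")):
            name = name[:-2]  # drop the mate suffix
        yield name


def check_pairing(forward, reverse):
    """Return (number of matching pairs seen, description of first problem or None)."""
    n = 0
    for fwd, rev in zip_longest(read_ids(forward), read_ids(reverse)):
        if fwd is None or rev is None:
            return n, "files have different numbers of reads"
        if fwd != rev:
            return n, f"pair {n + 1} is out of sync: {fwd} vs {rev}"
        n += 1
    return n, None
```

A typical call would pass two open handles, e.g. check_pairing(open_fastq("R1.fastq.gz"), open_fastq("R2.fastq.gz")), where the file names are placeholders. A result of (n, None) means all n pairs line up.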

SPAdes uses de Bruijn graphs for assembly. Beyond a certain level of read coverage there is no obvious improvement in the assembly. According to The Microbial Underground blog, assembly of a small bacterial genome does not improve beyond 27x coverage. Moreover, extremely deep coverage (~500x) can have a negative effect on genome assembly. Higher coverage also results in longer assembly times and a higher failure rate. Subsample your reads.
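As a rough guide to how much to subsample, coverage can be estimated as the total number of sequenced bases divided by the expected genome size. The sketch below is an illustration only: the function names are hypothetical, the default target of 30x follows the recommendation in this post, and the genome size is something you have to supply yourself.

```python
import random


def estimate_coverage(total_bases, genome_size):
    """Approximate fold coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def subsample_fraction(total_bases, genome_size, target_coverage=30.0):
    """Fraction of read pairs to keep in order to reach the target coverage."""
    current = estimate_coverage(total_bases, genome_size)
    return min(1.0, target_coverage / current)


def subsample_pairs(pairs, fraction, seed=0):
    """Keep each read pair with the given probability.

    Mates are kept or dropped together, so the two output files stay in sync.
    A fixed seed makes the selection reproducible.
    """
    rng = random.Random(seed)
    return [pair for pair in pairs if rng.random() < fraction]
```

For example, 2,000,000 pairs of 150 bp reads give 600 Mb of sequence. For a 5 Mb genome that is 120x coverage, so subsample_fraction(600_000_000, 5_000_000) suggests keeping a quarter of the pairs.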

High-quality reads produce a better assembly, and low-quality data may cause a job to fail. Trim the primers/adapters and remove low-quality and short reads. Reads shorter than the k-mer length are not used for assembly. All you need for a start is ~30x coverage. An example of read trimming is available in the GVL Microbial Assembly tutorial.

Overlapping paired-end reads can be merged with Pear. This may improve the assembly. According to some reports, SPAdes does not handle overlapping paired-end reads well.

Run QC on your reads before the assembly. Galaxy Australia provides the FastQC tool.

Assembly of a microbial genome usually takes a few hours on Galaxy Australia. If your SPAdes job runs for more than 12 hours, you may want to check your data against the recommendations above.

SPAdes is designed for assembly of microbial genomes. People also use SPAdes on Galaxy Australia to assemble small eukaryotic genomes, with mixed results; for example, SPAdes produced a 500 Mb assembly for a genome expected to be 200 Mb. We observe a high error rate for assemblies of eukaryotic genomes and for SPAdes jobs with big FASTQ files. The following settings reduce the error rate for SPAdes jobs with big datasets:
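To make the length and quality criteria concrete, here is a minimal Python sketch of a per-read filter. It is not a replacement for Trimmomatic and does no adapter trimming. The function names and the mean-quality threshold of 20 are assumptions chosen for illustration, and qualities are assumed to be Phred+33 encoded.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 by default)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(sequence, quality, k, min_mean_quality=20.0):
    """Decide whether a read is worth passing to the assembler.

    A read shorter than k cannot contribute a single k-mer, so it is dropped.
    The quality threshold is a hypothetical cut-off, not a SPAdes setting.
    """
    if len(sequence) < k:
        return False
    return mean_phred(quality) >= min_mean_quality
```

When filtering paired-end data this way, drop both mates if either one fails, otherwise the two files fall out of sync.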
Run only assembly? (without read error correction) – Yes
Careful correction? – No
Automatically choose k-mer values – Yes
Coverage Cutoff – Auto
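For readers who also run SPAdes outside Galaxy, the settings above roughly correspond to the spades.py command line built below. The mapping between the Galaxy form fields and the command-line flags is our assumption and has not been checked against the Galaxy wrapper. The function name and file names are hypothetical, and the thread and memory defaults simply mirror the 16 CPUs and ~60 GB quoted in this post.

```python
def build_spades_command(forward, reverse, outdir, threads=16, memory_gb=60):
    """Build a spades.py argument list for a large paired-end dataset."""
    return [
        "spades.py",
        "-1", forward,             # forward reads
        "-2", reverse,             # reverse reads
        "-o", outdir,              # output directory
        "-t", str(threads),        # number of threads
        "-m", str(memory_gb),      # memory limit in GB
        "--only-assembler",        # skip read error correction
        "--cov-cutoff", "auto",    # automatic coverage cutoff
        # "--careful" is deliberately left out (Careful correction: No)
        # "-k" is deliberately left out so SPAdes chooses k-mer values itself
    ]
```

The list can be passed to subprocess.run(cmd, check=True) on a machine where SPAdes is installed.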

Galaxy workshops in Brisbane

QFAB announced several Galaxy workshops in Brisbane:
RNA-Seq analysis using Galaxy (1-2 May 2018)
Variant detection using Galaxy (15-16 May 2018)
Genome assembly using Galaxy (29-30 May 2018)

Venue: Room 3.146, Queensland Bioscience Precinct (Building 80), The University of Queensland, St Lucia.

The workshops are open to anyone. Registration is essential. The workshops incur a fee of $25 AUD.

Galaxy seminar at the Ecosciences Precinct on February 6

On February 6, 2018 we are running a Galaxy seminar at the Ecosciences Precinct for people from the Department of Agriculture and Fisheries. The talk provides a broad overview of the Galaxy platform and the Galaxy-qld server and should be of interest to anyone working on analysis of high-throughput sequencing data. It will be followed by a Galaxy demo session focussed on data upload using FTP and genome assembly.

Slides for the seminar are available for download from Dropbox (pdf).

Long-term data storage on Galaxy-qld

Galaxy-qld was designed for data analysis and has modest storage. Unfortunately, some users store their files for a long time. The graph below shows the distribution of big datasets by creation date.

[Figure: big_data]

The big datasets represent about half of all data stored on the server. The graph shows that Galaxy-qld users have been storing terabytes of data for over a year. In some cases people upload data from public repositories and keep the datasets on the server for months or even years without doing any analysis. Long-term data storage has a negative impact on the server, as we have to turn down requests for extra quota from active users.

We are going to clean up accounts that have been inactive for six months or more. Data stored without analysis will be deleted without warning. Inactive users will be asked to move their results to external storage before a specified date. On that date the data in their accounts will be deleted.

Citing Galaxy-qld

We compiled a CiteULike library with 14 papers citing Galaxy-qld. Google Scholar was queried with “galaxy-qld” and the results were manually filtered to exclude incorrectly assigned papers. PhD theses were also excluded from the library.

Some Galaxy-qld users cited the GVL paper (Afgan et al. 2015) without a link to Galaxy-qld. These papers are available in the Genomics Virtual Lab library under the tag “galaxy-qld”.

If you publish results obtained on Galaxy-qld, please cite the Afgan et al. 2015 paper and provide a link to Galaxy-qld.