Galaxy workshop: assembly of microbial genomes

A Galaxy workshop, Genome assembly using Galaxy, is scheduled for August 2-3, 2017. The event is organised by QFAB. The training will be delivered by QFAB staff and members of the GVL-Qld team. Registration is essential.

Venue: Room 3.141, IMB/QBP, UQ, St Lucia

Start: 9:00 AM, August 2, 2017
End: 12:30 PM, August 3, 2017

Presentations for the workshop are available in PDF format:
Introduction to Galaxy
Genome assembly with Galaxy

Galaxy workshop at Winter School – 2017

On July 5 we are running the workshop An Introduction to Galaxy with the Genomics Virtual Lab for participants of the Winter School in Mathematical and Computational Biology. Galaxy workshops are very popular, and registration is essential. The participants will analyse differential gene expression using high-throughput sequencing data. The workshop is based on the RNA-Seq Basic Galaxy tutorial from the Genomics Virtual Lab.

The workshop is aimed at biologists and does not require IT skills. The presentation is available for download (PDF).

Deciphering user activity on Galaxy-qld

Users of Galaxy-qld have kept the server very busy over the last couple of weeks, with many queued jobs, especially during working hours when user activity is high. The graph below shows user activity on Galaxy-qld on May 25.

[Graph: jobs per day, 2017-05-25]

Times are in GMT / UTC; add 10 hours for Australian Eastern Standard Time (AEST). User activity goes up in the morning, around 10 AM AEST, and declines after midnight. During the daytime, users submit a new job every minute.
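For anyone reading the graph timestamps, the conversion can be sketched in a few lines of Python. This is only an illustration: the function name `utc_to_aest` is ours, and it uses the fixed UTC+10 offset because Queensland does not observe daylight saving.

```python
from datetime import datetime, timedelta, timezone

# AEST is a fixed UTC+10 offset (Queensland has no daylight saving).
AEST = timezone(timedelta(hours=10), name="AEST")

def utc_to_aest(utc_time: datetime) -> datetime:
    """Convert a UTC timestamp to AEST. Naive values are treated as UTC."""
    if utc_time.tzinfo is None:
        utc_time = utc_time.replace(tzinfo=timezone.utc)
    return utc_time.astimezone(AEST)

# Midnight UTC on the graph corresponds to 10 AM in Brisbane.
print(utc_to_aest(datetime(2017, 5, 25, 0, 0)).strftime("%Y-%m-%d %H:%M %Z"))
# prints "2017-05-25 10:00 AEST"
```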

The majority of submitted jobs complete in less than two hours, as shown in the next graph. However, a small number of jobs run for a long time, sometimes for several days.

[Graph: job duration, 2017-05-25]

Cufflinks and Cuffdiff jobs often run for a very long time. Galaxy-qld offers StringTie, a faster alternative to Cufflinks. TopHat jobs can also run for a long time, especially with big datasets. Galaxy-qld provides HiSAT2 and precomputed indices for several genomes. HiSAT2 is incredibly fast. Assembly jobs usually require significant time. SPAdes and VelvetOptimiser sometimes get stuck and have to be terminated manually. Aligner problems are very often caused by bad data, such as a different number of reads in paired datasets, a different order of reads in paired data, or an excess of low-quality nucleotides. We recommend Trimmomatic for read trimming, as it preserves proper read pairing in the output files.
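The two pairing problems mentioned above (a different number of reads, or a different read order) can be checked before submitting an alignment. Below is a minimal sketch in Python, not a Galaxy tool. It assumes uncompressed FASTQ with exactly four lines per record and read names that match between the two files once an optional /1 or /2 suffix is removed. The function names are ours.

```python
import io
from itertools import zip_longest

def read_ids(handle):
    """Yield the read ID from each 4-line FASTQ record in an open text handle."""
    for i, line in enumerate(handle):
        if i % 4 != 0:          # only the first line of each record is a header
            continue
        fields = line[1:].split()
        if not fields:          # tolerate a trailing blank line
            continue
        read_id = fields[0]     # drop '@' and any description after whitespace
        if read_id.endswith(("/1", "/2")):
            read_id = read_id[:-2]
        yield read_id

def check_pairing(forward, reverse):
    """Return None if the reads are properly paired, else a message for the first problem."""
    pairs = zip_longest(read_ids(forward), read_ids(reverse))
    for n, (a, b) in enumerate(pairs, start=1):
        if a is None or b is None:
            return f"different number of reads (mismatch at record {n})"
        if a != b:
            return f"read order differs at record {n}: {a} vs {b}"
    return None

# Tiny in-memory example; with real data pass open("reads_1.fastq") etc.
fwd = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n")
rev = io.StringIO("@r1/2\nTTGA\n+\nIIII\n")
print(check_pairing(fwd, rev))
```

In this example the reverse file is one read short, so the check reports a different number of reads at record 2.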

To ease the constraints, we increased the number of worker nodes on Galaxy-qld and changed the CPU allocation for some jobs, as well as the number of concurrent jobs per user. The user policy will be modified to provide a better experience for all users, but we can do better with cooperation from our users:
– do not run a new analysis on many genome-scale datasets. Develop your analysis with a small dataset, if possible.
– do not delete and resubmit queued jobs. Usually jobs are queued because there is no spare capacity on the server, or because the user has exceeded the limit on the number of concurrent jobs. The submitted jobs will be completed eventually.
– use faster tools, such as HiSAT2 and StringTie.
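One simple way to get a small dataset for developing an analysis is to take the first reads from a FASTQ file before uploading it. The sketch below is an illustration only. It assumes uncompressed FASTQ with four lines per record, the function names are ours, and the first reads of a run are not a random sample.

```python
from itertools import islice

def write_head(src_path, dst_path, n_reads):
    """Copy the first n_reads FASTQ records (4 lines each) from src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(islice(src, 4 * n_reads))

def subsample_pair(fwd, rev, out_fwd, out_rev, n_reads=100_000):
    """Take the same number of leading records from both files so read pairing is kept."""
    write_head(fwd, out_fwd, n_reads)
    write_head(rev, out_rev, n_reads)
```

For example, `subsample_pair("sample_R1.fastq", "sample_R2.fastq", "small_R1.fastq", "small_R2.fastq")` would write the first 100,000 read pairs to two new files (the file names here are hypothetical).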

 

Presentation from RNA-Seq workshop at IMB

Today we held a Galaxy workshop for postgraduate students at IMB. The participants used the data and protocol from the Basic RNA-Seq Galaxy tutorial provided by the Genomics Virtual Lab. The second half of the workshop was dedicated to Galaxy workflows. The participants created, modified and used a Galaxy workflow.

The workshop was run on Galaxy-tut. The server was updated recently, and it coped well with the load during the workshop.

The workshop presentation is available for download as a PDF file.

Seminar at Griffith Uni, Gold Coast campus

A talk about Galaxy-qld at Griffith University, Gold Coast campus, is scheduled for May 22, 2017, from 12:00 to 13:00. Location: building G40, room 4.111. Title of the seminar: Galaxy-qld, a local server for genomic research. The seminar is aimed at a broad range of biologists interested in the analysis of high-throughput sequencing data and does not require knowledge of genomics or programming. The event is open to everyone, and no registration is required. The presenter: Igor Makunin.

There will be limited user support for Galaxy-qld users on Monday, May 22, because of the seminar.

UPDATE: Slides are available for download (link to PDF).

Changes to Galaxy-qld

Galaxy-qld is getting popular, with over 900 registered users. The server is crunching data 24/7. Demand for computing resources is rising rapidly, and at times Galaxy-qld runs at full capacity, without any spare compute resources. User activity goes up around noon (Brisbane time) and remains high until late evening. We maxed out the available resources, both in storage and CPUs, and introduced changes to the server to improve performance. As of today, registered users can run up to 12 concurrent jobs, including up to four concurrent jobs that require 5 CPUs each (most aligners, some tools from the Cufflinks and GATK2 packages).
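To make the new limits concrete, here is a toy model of the per-user check in Python. It is not the server's actual job configuration: the function name and the way running jobs are represented are our assumptions. Only the numbers (12 jobs in total, four 5-CPU jobs) come from the policy above.

```python
MAX_JOBS = 12       # concurrent jobs per registered user
MAX_BIG_JOBS = 4    # concurrent 5-CPU jobs per registered user
BIG_JOB_CPUS = 5

def can_start(running_cpus, new_job_cpus):
    """Decide whether a user's new job may start.

    running_cpus is a list with the CPU count of each job the user is
    already running; new_job_cpus is the CPU count of the new job.
    """
    if len(running_cpus) >= MAX_JOBS:
        return False
    if new_job_cpus == BIG_JOB_CPUS:
        big_running = sum(1 for cpus in running_cpus if cpus == BIG_JOB_CPUS)
        return big_running < MAX_BIG_JOBS
    return True

print(can_start([5, 5, 5, 5], 5))  # False: a fifth aligner job has to wait
print(can_start([5, 5, 5, 5], 1))  # True: a single-CPU job can still start
```

A job that fails this check is not rejected. It stays in the queue until one of the user's running jobs finishes.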

Performance of the server can also be improved by better user practice. If submitted jobs are queued for a long time, check @GVL_QLD updates on Twitter. The updates are accessible to anyone through the link. There is no need to have an account on Twitter to see the feed.

Do not delete and resubmit queued jobs. The submitted jobs will be completed eventually, usually overnight. It is OK to submit other jobs while you have queued jobs.

Not all jobs are identical. All assembly jobs run on 16 CPUs and need a whole worker node. These jobs tend to stay in the queue for a long time. Simple jobs such as filtering, FastQC or FASTQ Groomer use a single CPU and sneak into any available slot. Most aligners use five CPUs and also tend to queue for some time.
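The effect of job size on queueing can be illustrated with a small first-fit sketch in Python. The CPU counts per job type come from the paragraph above. The free-CPU figures and the first-fit rule are made up for the example and do not describe how the scheduler on Galaxy-qld actually works.

```python
# CPUs requested by each kind of job (from the post)
JOB_CPUS = {"assembly": 16, "aligner": 5, "simple": 1}

def first_fit(free_cpus, job_type):
    """Return the index of the first node with enough free CPUs, or None if the job must queue."""
    need = JOB_CPUS[job_type]
    for node, free in enumerate(free_cpus):
        if free >= need:
            return node
    return None

# Hypothetical free CPUs on three busy 16-CPU worker nodes
free = [3, 6, 11]
for job in ("simple", "aligner", "assembly"):
    print(job, first_fit(free, job))
# prints:
# simple 0
# aligner 1
# assembly None
```

With 20 CPUs free in total, the assembly job still cannot start, because no single node has all 16 CPUs available.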

High demand for storage on Galaxy-qld

The number of users on Galaxy-qld is growing rapidly, with over 130 new people registered in March. As a result, the server hosts more jobs, and we see a high demand for storage. At the start of the week the storage utilisation was close to 90%, and we asked our users to delete old and unneeded datasets. We had a very good response from active users, but people with dormant / inactive accounts are less cooperative.

Galaxy-qld does not have the capacity to store user data for a long time. The server is designed for data analysis.

Please download the results as soon as convenient and delete files on Galaxy-qld.

Do not store temporary files or SAM files.

We see users storing both the original FASTQ files and the reads produced by FASTQ Groomer, which differ only in metadata (fastq vs fastqsanger datatype), as well as files with reads trimmed separately by quality and by length, and so on. Many aligners keep all reads in the alignment output, both aligned and unaligned, so users end up with multiple copies of the same data in different formats.