Genome assembly with SPAdes

SPAdes is a popular choice for genome assembly on Galaxy Australia. SPAdes was recently upgraded to Galaxy Version 3.11.1. On Galaxy Australia every SPAdes job uses 16 CPUs and ~60 GB RAM. Each user can have only one active 16-CPU job at a time. Users can still submit multiple assembly jobs, but the additional jobs stay queued until the active assembly job completes.

Here we provide general recommendations for genome assembly with SPAdes on Galaxy Australia. The recommendations are based on analysis of failed SPAdes jobs.

Make sure the reads are properly paired: both FASTQ files must contain the same number of reads, and the two reads of each pair must be at the same position in their respective files. Use Trimmomatic to trim paired reads. Check the number of reads in paired-end data before the assembly.
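The pairing check described above can be sketched in a few lines of Python. This is a minimal illustration, not a Galaxy tool: it assumes standard 4-line FASTQ records, and the function names and the handling of trailing /1 and /2 suffixes are our own choices.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the read name from each record of a 4-line FASTQ file (assumed layout)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0 and line.strip():
                fields = line.strip()[1:].split()  # drop '@' and any comment
                name = fields[0] if fields else ""
                if name.endswith(("/1", "/2")):    # old-style pair suffix
                    name = name[:-2]
                yield name


def check_pairing(forward_path, reverse_path):
    """Return (pairs_checked, problem); problem is None when the files agree."""
    count = 0
    pairs = zip_longest(read_ids(forward_path), read_ids(reverse_path))
    for fwd, rev in pairs:
        if fwd is None or rev is None:
            return count, "files contain different numbers of reads"
        if fwd != rev:
            return count, f"read {count + 1} differs: {fwd} vs {rev}"
        count += 1
    return count, None
```

Calling check_pairing("reads_R1.fastq", "reads_R2.fastq") (hypothetical file names) returns the number of matching pairs and a description of the first problem found, if any.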

SPAdes uses de Bruijn graphs for assembly. Beyond a certain level of read coverage there is no obvious improvement in the assembly. According to The Microbial Underground blog, assembly of a small bacterial genome does not improve beyond 27x coverage. Moreover, extremely deep coverage (~500x) can have a negative effect on genome assembly. Higher coverage also results in longer assembly times and a higher failure rate. Subsample your reads if the coverage is much higher than needed.
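As a rough guide, coverage is the total number of sequenced bases divided by the genome size. The sketch below estimates coverage and the fraction of read pairs to keep for a chosen target. The function names, the example numbers, and the 30x target are illustrative assumptions, and the in-memory subsampling is only meant to show that both reads of a pair must be kept or dropped together.

```python
import random


def estimated_coverage(n_reads, mean_read_length, genome_size):
    """Approximate coverage: total sequenced bases / genome size."""
    return n_reads * mean_read_length / genome_size


def subsample_fraction(current_coverage, target_coverage):
    """Fraction of read pairs to keep to reach the target (never above 1)."""
    return min(1.0, target_coverage / current_coverage)


def subsample_pairs(pairs, fraction, seed=42):
    """Keep each (forward, reverse) pair with the given probability."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    return [pair for pair in pairs if rng.random() < fraction]


# Hypothetical dataset: 5 million pairs of 150 bp reads, 5 Mb genome.
coverage = estimated_coverage(2 * 5_000_000, 150, 5_000_000)
fraction = subsample_fraction(coverage, target_coverage=30)
print(f"coverage ~{coverage:.0f}x, keep ~{fraction:.0%} of read pairs")
```

For real datasets, a dedicated subsampling tool that streams the FASTQ files is preferable to loading all pairs into memory.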

High-quality reads produce a better assembly, and low-quality data may cause a job failure. Trim primers and adapters, and remove low-quality reads and short reads. Reads shorter than the k-mer length are not used for assembly. About 30x coverage is all you need to start. An example of read trimming is available in the GVL Microbial Assembly tutorial.
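To see why short reads are of little use, note that a read contains at least one k-mer only if its length is at least k. The sketch below reports the fraction of reads that can contribute at each k. The read lengths and the k values are made-up examples, not SPAdes output.

```python
def usable_fraction(read_lengths, k):
    """Fraction of reads long enough to contain at least one k-mer."""
    if not read_lengths:
        return 0.0
    usable = sum(1 for length in read_lengths if length >= k)
    return usable / len(read_lengths)


# Hypothetical read lengths after trimming.
read_lengths = [150, 150, 120, 60, 30, 18]

# Example k values only; the values SPAdes picks depend on the read length.
for k in (21, 33, 55, 77):
    print(f"k={k}: {usable_fraction(read_lengths, k):.0%} of reads usable")
```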

Overlapping paired-end reads can be merged with PEAR, which may improve the assembly. According to some reports, SPAdes does not handle overlapping paired-end reads well.

Run quality control (QC) on your reads before the assembly. Galaxy Australia provides the FastQC tool.

Assembly of a microbial genome usually takes a few hours on Galaxy Australia. If your SPAdes job runs for more than 12 hours, you may want to check your data; see the recommendations above.

SPAdes is designed for assembly of microbial genomes. Some people use SPAdes on Galaxy Australia to assemble small eukaryotic genomes, with mixed results; for example, SPAdes produced a 500 Mb assembly for a genome with an expected size of 200 Mb. We observe a high rate of job errors for assemblies of eukaryotic genomes and for SPAdes jobs with big FASTQ files. The following settings reduce the error rate for SPAdes jobs with big datasets:
Run only assembly? (without read error correction) – Yes
Careful correction? – No
Automatically choose k-mer values – Yes
Coverage Cutoff – Auto
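
For readers who also run SPAdes from the command line, the settings above correspond roughly to the invocation built below. This is a sketch under assumptions: the mapping from the Galaxy form fields to command-line options is our reading rather than something taken from the Galaxy wrapper, the file and directory names are placeholders, and the thread and memory values simply mirror the 16 CPUs and ~60 GB mentioned earlier.

```python
import shlex


def build_spades_command(forward, reverse, outdir, threads=16, memory_gb=60):
    """Build (but do not run) a spades.py command line as a list of arguments."""
    return [
        "spades.py",
        "--only-assembler",      # skip read error correction
        "--cov-cutoff", "auto",  # automatic coverage cutoff
        # --careful is deliberately left out (Careful correction: No)
        # -k is left out so that SPAdes chooses the k-mer values itself
        "-t", str(threads),
        "-m", str(memory_gb),    # memory limit in GB
        "-1", forward,
        "-2", reverse,
        "-o", outdir,
    ]


cmd = build_spades_command("reads_R1.fastq", "reads_R2.fastq", "spades_out")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run if you want to launch the job from Python.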