Polishing PacBio assemblies with Arrow and Pilon

My current project involves assembling sea anemone genomes from PacBio long reads. Along the way I’ve encountered a lot of bugs, errors, and other problems that have taken anywhere from minutes to days to resolve. Polishing assemblies is one of those things that took a few days all up. The result is worth it, however.

While it has been a while, I will try to go through the things I did to get the polishing software working in a computing environment which is, at times, somewhat limiting – the QUT HPC cluster. It is a fantastic resource on the whole, but permission limitations, among other things, can make installing new programs annoying, and it often takes too long for the staff to do it for you, so you’re left with the challenge of doing it yourself. But if you can convince yourself you enjoy the challenge, you’re halfway towards solving the problem.

BLASR installation

The first step in polishing with Arrow is to map your reads. I tried mapping with other programs like Minimap2, but Arrow rejected those .BAM files because it couldn’t find BLASR-formatted features in the header tags. You could probably hack the Minimap2 output to have BLASR-like headers and trick Arrow, but I wasn’t game to try that.

After various attempts, I ended up getting a working BLASR installation with the Pitchfork repository. You can download this with a simple git clone call.

git clone https://github.com/PacificBiosciences/pitchfork.git

It’s been some time since I installed this originally (my git log says I installed the September 20, 2017 commit). Attempting a fresh installation now, I immediately encounter the error below.

[ERROR] gcc version needs to be newer than or equal to 5.3

(Kind of) simple solution on the QUT HPC:

module unload gcc/4.9.3-2.25
module unload gcccore/4.9.3
module unload binutils/2.25-gcccore-4.9.3
module load gcc/5.4.0-2.26

It’s annoying that you have to manually unload prerequisites. If you’re installing on a computer where this module system isn’t available, you’ll need to install an appropriate version of GCC yourself. There are plenty of tutorials online for this.
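
For what it’s worth, here’s a rough sketch of what building GCC 5.4.0 from source into your home directory might look like (the install prefix is illustrative, and the build itself can take an hour or more):

# download and unpack a GCC release satisfying the >= 5.3 requirement
wget https://ftp.gnu.org/gnu/gcc/gcc-5.4.0/gcc-5.4.0.tar.gz
tar -xzf gcc-5.4.0.tar.gz
cd gcc-5.4.0

# fetch GMP/MPFR/MPC so configure can find them
./contrib/download_prerequisites

# GCC prefers to be built outside its source tree
mkdir build && cd build
../configure --prefix=$HOME/gcc-5.4.0 --enable-languages=c,c++ --disable-multilib
make -j4
make install

# put the new compiler first in your PATH
export PATH=$HOME/gcc-5.4.0/bin:$PATH
gcc --version   # should now report 5.4.0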

After this, you can run:

make PREFIX=/home/n8942188/various_programs/BLOG_tests/pitchfork/blasrbuild blasr

where PREFIX is the directory you want the program installed into. As you can see, I’ve chosen to install the BLASR files inside a directory called “blasrbuild” within the pitchfork directory created by “git clone”.

After this you can expect to wait a pretty long time.

And after this, if you encounter the same issue I did, you’ll see the following.

git clone git://github.com/PacificBiosciences/pbbam /lustre/home-lustre/n8942188/various_programs/BLOG_tests/pitchfork/workspace/pbbam
Cloning into '/lustre/home-lustre/n8942188/various_programs/BLOG_tests/pitchfork/workspace/pbbam'...
fatal: unable to connect to github.com:
github.com[0: 192.30.255.112]: errno=Connection timed out
github.com[1: 192.30.255.113]: errno=Connection timed out

The problem here is that git://github.com links don’t work on the QUT HPC – I’m not sure whether this style of git link works anywhere now. Once the above error occurs, run this instead.

git clone https://github.com/PacificBiosciences/pbbam.git /lustre/home-lustre/n8942188/various_programs/BLOG_tests/pitchfork/workspace/pbbam

where, of course, you change the output directory accordingly. All I’ve actually done here is replace “git:” with “https:”. After this, re-run the make command from above and you should see CXX objects being built.
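
As an aside, git can rewrite these URLs for you automatically, which should save you from fixing each clone by hand as pitchfork reaches new repositories (I haven’t tested this with pitchfork, but it’s a standard git feature):

# rewrite any git:// GitHub URL to https:// for all future clones
git config --global url."https://github.com/".insteadOf git://github.com/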

After a short while, you’ll encounter a similar problem.

git clone git://github.com/PacificBiosciences/blasr_libcpp /lustre/home-lustre/n8942188/various_programs/BLOG_tests/pitchfork/workspace/blasr_libcpp
Cloning into '/lustre/home-lustre/n8942188/various_programs/BLOG_tests/pitchfork/workspace/blasr_libcpp'...

I was impatient, so rather than wait for the eventual “fatal: unable to connect to github.com” error, I just used a keyboard interrupt (Ctrl+C) and ran the following.

git clone https://github.com/PacificBiosciences/blasr_libcpp /lustre/home-lustre/n8942188/various_programs/BLOG_tests/pitchfork/workspace/blasr_libcpp

Then I re-ran the make command. At this point, I received a new error.

fatal error: htslib/sam.h: No such file or directory
compilation terminated.

This tells me that I need htslib installed and discoverable by the compiler. I can do this on the QUT HPC like so.

module load htslib/1.3.1-foss-2016a

If this isn’t available to you, you’ll need to install htslib manually and make sure the sam.h header is discoverable on your compiler’s include path (via C_INCLUDE_PATH, for example, rather than PATH). After this, I re-run the make command again. With a bit more patience, you’ll be rewarded with another incorrectly formatted git clone call.
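
For the manual route, a minimal sketch of an htslib install, assuming the same 1.3.1 release I loaded via modules (the install prefix is illustrative):

wget https://github.com/samtools/htslib/releases/download/1.3.1/htslib-1.3.1.tar.bz2
tar -xjf htslib-1.3.1.tar.bz2
cd htslib-1.3.1
make
make prefix=$HOME/htslib-1.3.1 install

# let the compiler and linker find htslib/sam.h and libhts
export C_INCLUDE_PATH=$HOME/htslib-1.3.1/include:$C_INCLUDE_PATH
export LIBRARY_PATH=$HOME/htslib-1.3.1/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=$HOME/htslib-1.3.1/lib:$LD_LIBRARY_PATH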

git clone git://github.com/PacificBiosciences/blasr /lustre/home-lustre/n8942188/various_programs/BLOG_tests/pitchfork/workspace/blasr

Keyboard interrupt this, change git:// to https://, clone the repository, and run the make command again. After a short while, you’ll finally be done!

Before running BLASR, look inside your blasrbuild/ directory (or equivalent). There will be a file called “setup-env.sh”, which you activate by calling “source setup-env.sh”. Inside your blasrbuild/bin/ directory there is now a blasr executable (called “blasr”). Try calling it (./blasr) and you should see this.

blasr - a program to map reads to a genome
 usage: blasr reads genome
 Run with -h for a list of commands
          --help for verbose discussion of how to run blasr.

In release v5.1 of BLASR, command-line options will use the
single dash/double dash convention:
Character options are preceded by a single dash. (Example: -v)
Word options are preceded by a double dash. (Example: --verbose)
Please modify your scripts accordingly when BLASR v5.1 is released.

And that means it’s worked! Remember that every time you want to run BLASR you need to source the setup-env.sh file first.

Running BLASR

Since BLASR is incredibly slow, my goal was to map each sequencing lane’s subreads.BAM file to the unpolished genome individually and then combine the results. I had eight such files for my genomes. The benefit of mapping files individually is that, on a PBS queue submission system, you can submit an array job which lets you use far more CPUs (and hence threads) in total than a single job would allow.

To cut a long story short, this is the script I used for queue submission.

#!/bin/bash -l

#PBS -N blasr_smrt
#PBS -l ncpus=8
#PBS -l walltime=100:00:00
#PBS -l mem=10G
#PBS -j oe
#PBS -J 1-8

cd $PBS_O_WORKDIR

GENDIR=/home/stewarz3/genome_assembly/smartdenovo/arrow
GENNAME=species_smrtden_2500.arrow3.fix.fasta
BLASRDIR=/home/stewarz3/various_programs/pitchfork/blasrbuild2/bin
OUTPREFIX=species_smrtden_2500_blasr_iter1
CPUS=8

source /home/stewarz3/various_programs/pitchfork/blasrbuild2/setup-env.sh

$BLASRDIR/blasr $(cat subread_loc.txt | head -n ${PBS_ARRAY_INDEX} | tail -n 1) $GENDIR/$GENNAME --out ${OUTPREFIX}.${PBS_ARRAY_INDEX}.bam --bam --bestn 10 --minMatch 12 --maxMatch 30 --nproc $CPUS --minSubreadLength 50 --minAlnLength 50 --minPctSimilarity 70 --minPctAccuracy 70 --hitPolicy randombest --randomSeed 1

There are a few things to note here.

Firstly: the walltime is a little bit overkill for my files. You’ll need to trial this for your own data to see what works.

Secondly: the memory usage can vary. My SMARTdenovo assemblies used very little memory (often less than 7G), but WTDBG was very greedy and wanted about 20G for unknown reasons.

Thirdly: the parameters I used are, to my understanding, the PacBio defaults. At the time I did this analysis, these were the default parameters used by the pbalign program (which I tried, but failed to get running).

Lastly: I made a text file called “subread_loc.txt” listing the locations of my subreads files. Its contents look like this.

/home/stewarz3/genome_assembly/subreads_dir/1_A01/1.subreads.bam
/home/stewarz3/genome_assembly/subreads_dir/1_B01/2.subreads.bam
/home/stewarz3/genome_assembly/subreads_dir/2_C01/3.subreads.bam
/home/stewarz3/genome_assembly/subreads_dir/3_D01/4.subreads.bam
/home/stewarz3/genome_assembly/subreads_dir/4_E01/5.subreads.bam
/home/stewarz3/genome_assembly/subreads_dir/5_F01/6.subreads.bam
/home/stewarz3/genome_assembly/subreads_dir/6_G01/7.subreads.bam
/home/stewarz3/genome_assembly/subreads_dir/7_H01/8.subreads.bam

I’ve changed the file names a bit since I don’t know how much information they contain or whether I should be sharing that publicly (I don’t really mind, but I didn’t pay for the sequencing, so I suppose I should maintain some secrecy). The queue submission script iterates through the lines in this file so that each job in the array uses a different .BAM file as input.
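
If your subreads files follow a consistent layout like mine, something along these lines should generate the list for you (the directory here is the one from the listing above):

# collect every subreads.bam under the sequencing output directory, one path per line
find /home/stewarz3/genome_assembly/subreads_dir -name "*.subreads.bam" | sort > subread_loc.txt

# sanity check: the line count should match the -J 1-8 range in the PBS script
wc -l subread_loc.txt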

You should change the indicated values, such as BLASRDIR=, as appropriate. You can also modify CPUS= and the matching PBS ncpus request at your discretion; I found 8 worked well for each job.

After this job is complete, you need to sort your output .BAM files and then merge them. My samtools sort script looked like this.

#!/bin/bash -l

#PBS -N samsortb
#PBS -l ncpus=4
#PBS -l walltime=00:45:00
#PBS -l mem=30G
#PBS -j oe
#PBS -J 1-8

cd $PBS_O_WORKDIR

CPUS=4
MAXMEM=23
declare -i THREADMEM
THREADMEM=$MAXMEM/$CPUS

PREFIX=species_smrtden_2500_blasr_iter1

samtools sort -m ${THREADMEM}G -@ $CPUS -o ${PREFIX}.${PBS_ARRAY_INDEX}sort.bam -O bam ${PREFIX}.${PBS_ARRAY_INDEX}.bam

This is an array job like before. You shouldn’t need to modify anything other than PREFIX, though you may want to change the requested CPUs/memory. Also note that samtools is already in my PATH, so I don’t need to specify the full path to the executable.
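
Before merging, it’s worth quickly checking that each sorted .BAM is intact. A sketch using samtools quickcheck (available in samtools 1.3 and later; adjust the prefix to match yours):

for i in $(seq 1 8); do
    samtools quickcheck species_smrtden_2500_blasr_iter1.${i}sort.bam || echo "BAM ${i} is broken"
done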

The final step was to merge the sorted .BAM files. I did it with this script.

#!/bin/bash -l

#PBS -N mergeBamsm
#PBS -l ncpus=1
#PBS -l walltime=03:00:00
#PBS -l mem=10G

cd $PBS_O_WORKDIR

BAMPREFIX=species_smrtden_2500_blasr_iter1

/home/stewarz3/various_programs/bamUtil/bin/bam mergeBam -i ${BAMPREFIX}.1sort.bam -i ${BAMPREFIX}.2sort.bam -i ${BAMPREFIX}.3sort.bam -i ${BAMPREFIX}.4sort.bam -i ${BAMPREFIX}.5sort.bam -i ${BAMPREFIX}.6sort.bam -i ${BAMPREFIX}.7sort.bam -i ${BAMPREFIX}.8sort.bam -o ${BAMPREFIX}.sort.bam

/home/stewarz3/various_programs/pitchfork/blasrbuild2/bin/pbindex ${BAMPREFIX}.sort.bam

You’ll note that I needed to install bamUtil; I don’t recall this being difficult. You’ll also note that I ran pbindex at the same time. pbindex is installed automatically alongside BLASR and should be in the same bin/ directory as the blasr executable.
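
If you’d rather not install bamUtil, samtools merge should do the same job; I used bamUtil myself, so treat this as an untested alternative:

# merge the eight sorted BAMs into one (the first argument is the output)
samtools merge ${BAMPREFIX}.sort.bam \
    ${BAMPREFIX}.1sort.bam ${BAMPREFIX}.2sort.bam ${BAMPREFIX}.3sort.bam ${BAMPREFIX}.4sort.bam \
    ${BAMPREFIX}.5sort.bam ${BAMPREFIX}.6sort.bam ${BAMPREFIX}.7sort.bam ${BAMPREFIX}.8sort.bam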

This process should produce a .sort.bam file, a .sort.bam.pbi index from pbindex, and a .sort.bam.log file. Now you’re ready to run Arrow.

Arrow installation

Arrow was a pain to install but, surprisingly, not as bad as BLASR in my opinion – primarily due to my familiarity with Python. First things first, we need to download the repository.

git clone https://github.com/PacificBiosciences/GenomicConsensus.git

Next, we need to make sure Python 2 is the default version. Normally, Python 3 is first in my PATH, so here I do this.

export PATH="/home/n8942188/anaconda2/bin:$PATH"

Anaconda is by far the easiest way to work with Python on an HPC, so if you don’t have it, install Anaconda 2 and put something similar in your .bashrc file (assuming you didn’t already do this during the Anaconda 2 installation process).
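
A quick way to confirm the right interpreter is first in your PATH before building:

which python       # should point into your anaconda2/bin directory
python --version   # should report Python 2.7.x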

With Python 2 as the default, enter the GenomicConsensus directory (cd GenomicConsensus) and type “make”. This should automatically install all the Python prerequisites for Arrow. It’s possible that, in re-doing all of this while writing this blog post, I’ve already fixed some minor thing that would normally cause errors – hopefully not. If you encounter an error here, let me know.

If successful, you’ll see this.

Finished processing dependencies for GenomicConsensus==2.2.2

The next thing to do is install ConsensusCore, another prerequisite for Arrow in Python. As always, download the repository.

git clone https://github.com/PacificBiosciences/ConsensusCore.git

Next, enter this directory and run this.

python setup.py install

You’ll see a bunch of g++ lines floating by. If successful, it’ll complete with this line.

Finished processing dependencies for ConsensusCore==1.0.2

There’s one more thing to do before you can run Arrow. Weirdly, I don’t recall facing this problem when I did this a few months ago. If I try to run Arrow at this point, I receive this error.

/home/n8942188/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "/home/n8942188/anaconda2/bin/variantCaller", line 6, in 
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/lustre/home-lustre/n8942188/various_programs/BLOG_tests/GenomicConsensus/bin/variantCaller", line 5, in 
    sys.exit(main())
  File "/lustre/home-lustre/n8942188/various_programs/BLOG_tests/GenomicConsensus/GenomicConsensus/main.py", line 395, in main
    parser=get_parser(),
  File "/lustre/home-lustre/n8942188/various_programs/BLOG_tests/GenomicConsensus/GenomicConsensus/options.py", line 103, in get_parser
    tcp.add_choice_str(
AttributeError: 'ToolContractParser' object has no attribute 'add_choice_str'

I found a helpful link on this topic: https://www.biostars.org/p/259433/.

I followed its instructions by first downloading the git repository.

git clone https://github.com/PacificBiosciences/pbcommand.git

I entered the directory and ran “make”, making sure once again that the Anaconda-distributed Python 2 was my default. Now, when I call Arrow (./arrow inside the GenomicConsensus/bin/ directory), I see the following text.

/home/n8942188/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
usage: variantCaller [-h] [--version] [--emit-tool-contract]
                     [--resolved-tool-contract RESOLVED_TOOL_CONTRACT]
                     [--log-file LOG_FILE]
                     [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL} | --debug | --quiet | -v]
                     --referenceFilename REFERENCEFILENAME -o OUTPUTFILENAMES
                     [-j NUMWORKERS] [--minConfidence MINCONFIDENCE]
                     [--minCoverage MINCOVERAGE]
                     [--noEvidenceConsensusCall {nocall,reference,lowercasereference}]
                     [--coverage COVERAGE] [--minMapQV MINMAPQV]
                     [--referenceWindow REFERENCEWINDOWSASSTRING]
                     [--alignmentSetRefWindows]
                     [--referenceWindowsFile REFERENCEWINDOWSASSTRING]
                     [--barcode _BARCODE] [--readStratum READSTRATUM]
                     [--minReadScore MINREADSCORE] [--minSnr MINHQREGIONSNR]
                     [--minZScore MINZSCORE] [--minAccuracy MINACCURACY]
                     [--algorithm {quiver,arrow,plurality,poa,best}]
                     [--parametersFile PARAMETERSFILE]
                     [--parametersSpec PARAMETERSSPEC]
                     [--maskRadius MASKRADIUS] [--maskErrorRate MASKERRORRATE]
                     [--pdb] [--notrace] [--pdbAtStartup] [--profile]
                     [--dumpEvidence [{variants,all,outliers}]]
                     [--evidenceDirectory EVIDENCEDIRECTORY] [--annotateGFF]
                     [--reportEffectiveCoverage] [--diploid]
                     [--queueSize QUEUESIZE] [--threaded]
                     [--referenceChunkSize REFERENCECHUNKSIZE]
                     [--fancyChunking] [--simpleChunking]
                     [--referenceChunkOverlap REFERENCECHUNKOVERLAP]
                     [--autoDisableHdf5ChunkCache AUTODISABLEHDF5CHUNKCACHE]
                     [--aligner {affine,simple}] [--refineDinucleotideRepeats]
                     [--noRefineDinucleotideRepeats] [--fast]
                     [--skipUnrecognizedContigs]
                     inputFilename
variantCaller: error: too few arguments

Which is perfect! The FutureWarning at the top is fine; it shouldn’t affect program behaviour. The error at the end just means the program wants at least one argument. You can call ./arrow -h for a more detailed help message.

Running Arrow

Now that the hard part is out of the way, we just need to run our Arrow job. We’ve already got our sorted and indexed BLASR .BAM files from above, so we can submit this script to PBS.

#!/bin/bash -l

#PBS -N arrow_smrt
#PBS -l ncpus=14
#PBS -l walltime=100:00:00
#PBS -l mem=30G

cd $PBS_O_WORKDIR

ARROWDIR=/home/stewarz3/various_programs/GenomicConsensus/bin

GENDIR=/home/stewarz3/genome_assembly/smartdenovo/arrow
GENNAME=species_smrtden.fasta
BAMFILE=species_smrtden_blasr_iter1.sort.bam

THREADS=14
PREFIX=species_smrtden
ITERNUM=1

$ARROWDIR/arrow $BAMFILE -j $THREADS --referenceFilename $GENDIR/$GENNAME -o ${PREFIX}.arrow${ITERNUM}.fasta -o ${PREFIX}.arrow${ITERNUM}.gff -o ${PREFIX}.arrow${ITERNUM}.fastq

There are a few things to unpack here.

Firstly, you need to edit the script and specify your file locations appropriately.

Secondly, I’m using ITERNUM to keep track of my Arrow iterations – you’ll generally want to run about three of them. Each iteration, you change OUTPREFIX in the BLASR scripts, point GENNAME here at the previous iteration’s output, update BAMFILE (_iter2.sort.bam, and so on), and increment ITERNUM; there’s a sketch of this bookkeeping after the tip below.

One tip here – you’ll get an output .gff file. On your first iteration it will be large because a lot of changes are being made; on the second and third iterations it shrinks as fewer changes are made. Beyond the third iteration the .gff file doesn’t decrease much in size due to diminishing returns – Arrow isn’t deterministic, so you could run it 100 times and it would always find some number of things to change.
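
To make the iteration bookkeeping concrete, here’s a rough sketch of how the three rounds chain together. In practice each mapping and polishing step was its own PBS job, so treat this as an outline of the file flow (it assumes arrow is on your PATH and that each iteration’s merged .sort.bam already exists):

GEN=/home/stewarz3/genome_assembly/smartdenovo/arrow/species_smrtden.fasta

for ITER in 1 2 3; do
    # the BLASR mapping/sorting/merging jobs above must have produced this
    # file against the current $GEN before this step runs
    BAM=species_smrtden_blasr_iter${ITER}.sort.bam

    arrow $BAM -j 14 --referenceFilename $GEN \
        -o species_smrtden.arrow${ITER}.fasta \
        -o species_smrtden.arrow${ITER}.gff

    # the next round of mapping and polishing uses the new consensus
    GEN=$(pwd)/species_smrtden.arrow${ITER}.fasta
done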

Pilon installation

After going through the hassle of dealing with PacBio’s software, you’ll be relieved to find that some developers know how to keep things simple. Pilon is really easy to install: you just download the latest release (https://github.com/broadinstitute/pilon/releases) and put the single .jar file wherever you want it.
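
For example, grabbing the v1.22 release I used (the URL follows the naming on the releases page; swap in whatever version is current):

wget https://github.com/broadinstitute/pilon/releases/download/v1.22/pilon-1.22.jar
java -jar pilon-1.22.jar --version   # confirms Java can run it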

Running Pilon

Again, this is easy compared to what you’ve had to do. I’ll show the queue submission script then discuss it a bit.

#!/bin/bash -l

#PBS -N pilsmrt
#PBS -l ncpus=12
#PBS -l walltime=100:00:00
#PBS -l mem=180G

cd $PBS_O_WORKDIR

BWADIR=/home/stewarz3/various_programs/bwa
PILONDIR=/home/stewarz3/various_programs/pilon

READDIR=/home/stewarz3/species_assembly/illumina_reads
READPREF=species_illum_R
READSUFF=.fastq

GENDIR=/home/stewarz3/species_assembly/smartdenovo/pilon
GENNAME=species_smrtden.ar3.fasta
OUTPREF=species_smrtden.ar3.pil1

THREADS=8
MAXMEM=60
declare -i THREADMEM
THREADMEM=$MAXMEM/$THREADS

$BWADIR/bwa index $GENDIR/$GENNAME
$BWADIR/bwa mem -t $THREADS -o ${OUTPREF}.bwamem.sam $GENDIR/$GENNAME $READDIR/${READPREF}1${READSUFF} $READDIR/${READPREF}2${READSUFF}
samtools sort -m ${THREADMEM}G -@ $THREADS -o ${OUTPREF}.bwamem.sorted.bam -O bam ${OUTPREF}.bwamem.sam
samtools index ${OUTPREF}.bwamem.sorted.bam

java -Xmx160G -jar $PILONDIR/pilon-1.22.jar --genome $GENDIR/$GENNAME --bam ${OUTPREF}.bwamem.sorted.bam --changes --vcf --diploid --threads $THREADS --output $OUTPREF

Firstly: you’ll want to use a few CPUs and a decent amount of memory. Jobs generally took 40-60 hours with these settings, but allowing some leeway is a good idea.

Secondly: you need a read mapper installed. Pilon’s developers recommend BWA, and from benchmarks of BWA versus Bowtie2, BWA appears to handle indels better. You can expect to see a fair few indels in your PacBio assembly even after three iterations of Arrow polishing, so I think BWA is the best choice of read mapper here. How you install it is up to you; BWA is quite easy, from what I recall. As detailed on the GitHub page, you just clone the repository (“git clone https://github.com/lh3/bwa“) and type “make” in that directory.
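
To spell that out (this is the recipe from the BWA README):

git clone https://github.com/lh3/bwa
cd bwa
make
./bwa    # prints the usage message if the build worked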

Finally: you’ll need to change the file name/directory values to suit your setup. If your Illumina reads aren’t named like mine (species_illum_R1.fastq and species_illum_R2.fastq), you can either change the program call or rename your files to match.

I found that two iterations of Pilon were sufficient. A third wouldn’t hurt, and if you have the time I’d recommend it, but I don’t think you get much benefit beyond that. I was getting impatient with this process, so I stuck with two iterations.
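
One way to judge whether another iteration is worthwhile is to compare the .changes files Pilon writes when given the --changes flag; if the number of changes has mostly levelled off between rounds, you’re probably done. Assuming the pil1/pil2 output prefixes from the script above:

wc -l species_smrtden.ar3.pil1.changes species_smrtden.ar3.pil2.changes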

Overall

This process was a bit of an arduous journey. It seems like every piece of PacBio-made software is unnecessarily difficult to install, and I don’t know why. The result is a very high-quality assembly, though.

The biggest issue, other than the difficulty of getting the programs running, is the actual running time of the whole process. With three Arrow iterations and two Pilon iterations, you might expect it to take close to a month, assuming ~100X coverage of a genome about 300Mbp in size.

 

6 thoughts on “Polishing PacBio assemblies with Arrow and Pilon”

  1. Nice report. Though, in many cases, blasr and quiver/arrow can be installed much more easily (at least on Ubuntu 16.04 this worked straight away):

    # install miniconda2
    wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh
    bash Miniconda2-latest-Linux-x86_64.sh

    # create an environment with blasr and quiver/arrow
    export PATH=$HOME/miniconda2/bin:$PATH # assuming that miniconda is in your home directory, needs to be done each time you want to activate your environment
    conda config --add channels defaults
    conda config --add channels bioconda
    conda config --add channels conda-forge
    conda create -c bioconda -n pacbio python=2.7 blasr genomicconsensus

    # enter the environment
    source activate pacbio

    # run blasr and quiver/arrow
    ./myScript.sh

    # leave the environment
    source deactivate

    [I hope I haven’t forgotten anything; otherwise, you will also find these instructions on the PacBio GitHub repos]

    Best regards

  2. A way to speed up the polishing (if you have the resources) is to split the polishing step into polishing per contig. In my case that cut the time down from about a month to about a day.
    But I have access to the university cluster, so this might not be an option for others.
    I did check whether the results were the same and, unsurprisingly, they were.

    The mapping step could also be parallelised by splitting up the input read files and merging the BAM files afterwards, but as that took only half a day in my case, I didn’t investigate it further.

    1. Hi Jan,

      Good suggestions to speed up the arrow polish.

      I am currently doing per-contig polishing on a cluster and want to know: do you have a script to do this? What tools do you use to merge the resulting fasta, fastq, and gff files?

      Bests,

      Tao
