Running fastStructure and associated difficulties

Recently, I have been providing a bit of bioinformatic assistance to another researcher at QUT. As part of this, I have had to learn how to format data inputs, among other things, to get fastStructure to work. fastStructure is a program used for population genetics analysis (a fair bit outside of what I am currently doing). Below are the main things I recall finding a bit tricky. Most of this difficulty is simply because the program is not extensively documented, and you need to search around online and use some rational deduction to figure out what exactly you must do.

Program installation

This part of the process was not altogether difficult to me, but I can see it being the worst part for anyone not familiar with using Linux.

There are a few requirements detailed on the official github. Specifically, this is a distribution of Python 2.7 with the Numpy and Scipy packages, as well as the GNU Scientific Library (GSL). For Python, I think installing Anaconda2 is the best way to handle this. You can download the installer from https://www.anaconda.com/download/, making sure to grab the correct version for your computer (Linux, then 32/64 bits). On the HPC environment available at QUT, the GSL package can be loaded by running this.

module load gsl/2.1-foss-2016a

Which, importantly, does this equivalent of this.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/pkg/suse12/software/gsl/2.1-foss-2016a/lib

The other two things you need to specify for installation include (directories are my perspective, yours will likely differ).

export CFLAGS="-I/pkg/suse12/software/gsl/2.1-foss-2016a/include"
export LDFLAGS="-L/pkg/suse12/software/gsl/2.1-foss-2016a/lib"

If something like this isn’t available then you’ll need to install GSL yourself. Hopefully it isn’t too difficult. Otherwise, the instructions on the fastStructure github should be reasonably easy to follow just by copy-pasting what’s instructed.

Data inputs

This was the most difficult part of using the fastStructure program to me. In the file distribution package an example .bed file is provided for testing, but it gives little assistance since it is encoded as a binary file and thus isn’t altogether accessible to figure out how to recreate this format. The program does accept .str format, however. In this context, .str refers to STRUCTURE format which itself isn’t clearly and succinctly described either. This format was, luckily, already mostly formatted by the researcher. I did encounter a few errors initially trying to get the program to read this in, which I’ll detail below.

Firstly, the program adds the “.str” suffix to your input file name. Thus, if you specify –input=example_file.str, it will attempt to open “example_file.str.str” and likely result in this error:

Traceback (most recent call last):
File "structure.py", line 172, in
G = parse_str.load(params['inputfile'])
File "parse_str.pyx", line 7, in parse_str.load
handle = open(file+'.str','r')
IOError: [Errno 2] No such file or directory: 'example_file.str.str'

The key bit is the handle = open(file+’.str’,’r’) part. It artificially adds “.str” to the end of your input file name (file). Thus, you need to make sure the file is called “example_file.str”, not “example_file.txt” or something similar, and you need to specify –input=example_file and not –input=example_file.str. This is a little bit annoying, and reminds me of what I did when I first learned coding and wasn’t used to writing file extensions. Perhaps the intention was precisely to make it “easier” for people unfamiliar with command-line to use the program, but I think this backfired.

The next problem looked like this.

Traceback (most recent call last):
  File "structure.py", line 172, in 
    G = parse_str.load(params['inputfile'])
  File "parse_str.pyx", line 10, in parse_str.load
    L = loci.shape[1]
IndexError: tuple index out of range

This was caused by having two blank headers at the end of the file, resulting from conversion of another format to the one the researcher was using. Thus, make sure to look through the table and check that the headers are all present, in their correct position, and there are no empty cells when viewing the file in Excel.

A final thing that could be of help: this is an example of the input file’s format.

#	#	#	#	Smpl_ID	Pop_ID	Locus_1	Locus_2	Locus_3	Locus_4	Locus_5	Locus_6	Locus_7	Locus_8	Locus_9	Locus_10
#	#	#	#	Sample1	1	1	-9	0	1	0	1	1	1	1	0
#	#	#	#	Sample1	1	1	-9	1	0	1	0	0	0	0	1
#	#	#	#	Sample2	2	0	0	1	1	1	0	1	1	0	1
#	#	#	#	Sample2	2	1	0	1	0	1	1	0	1	0	1

This was a little difficult to figure out. I ended up looking through the code and found that the first 6 columns are ignored by default when reading in the .str file. This is further detailed in this somewhat helpful Google forum discussion with one of the program’s authors fastSTRUCTURE running but how to build .str or .bed file. Since the file I was given only had two columns with details (Smpl_ID and Pop_ID) I just added #’s to fill in empty cells within Excel and saved the file as a tab-delimited file (remembering to make sure the extension is .str afterwards).

For this, the Smpl_ID column doesn’t seem to matter. What does is your Pop_ID column, which should be a simple system to define your sampling groups such as a 1-10(00) system. Following this are the SNP values for your loci. -9 means absent data, 0 means no SNP, 1 means SNP is present. There is meant to be two rows per sample (as detailed in the above Google forum link) which I assume relates to the paternal/maternal strands, but I’m not actually sure…

I also don’t think the headers matter at all, so long as you have a header row. I can’t find in the code where it handles this header row (presumably by ignoring it?) but the author in the above Google forum link suggests that having a blank header row is correct so I must assume this is correct. No errors get raised so I guess it is?.

Overall, this was unnecessarily difficult to get right. It would be nice if the author of the code detailed the expected file format. It’d also help to make file input more sensible with additional checks in place to make sure everything is properly formatted rather than just raising obscure errors or, worse, cutting out columns with actual data in them without the user having any clue this is happening! Most of these problems could be easily fixed with a few hours adding extra checks and conditions throughout the code (even I could do it) which I think should be essential for any code going out to wide use in the scientific community.

chooseK

The next thing which was a bit confusing to me was running the chooseK.py script. I was initially unsure why I was getting results when running chooseK.py on the test data but not on the “real” data. The error looked like this.

Traceback (most recent call last):
  File "chooseK.py", line 111, in 
    print "Model complexity that maximizes marginal likelihood = %d"%Ks[np.argmax(marginal_likelihoods)]
  File "/home/n8942188/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 1004, in argmax
    return _wrapfunc(a, 'argmax', axis=axis, out=out)
  File "/home/n8942188/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 62, in _wrapfunc
    return _wrapit(obj, method, *args, **kwds)
  File "/home/n8942188/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 42, in _wrapit
    result = getattr(asarray(obj), method)(*args, **kwds)
ValueError: attempt to get argmax of an empty sequence

Admittedly, on re-reading the github page, I see that the problem is mentioned in passing. Before running chooseK you need to generate results from the structure.py script using the simple prior and different K values. Then your –input value to chooseK.py is the same as your –output value from the structure.py script. From running the structure.py script with simple prior and different K values, you should have a number of files in the directory that look like OUTPUTPREFIX.1.log, OUTPUTPREFIX.2.log, etc., which chooseK will read.

Distruct plot

The final hurdle was getting the distruct.py script to produce an admixture plot. When running this with just the –input and –output arguments, I received the below error.

Traceback (most recent call last):
  File "distruct.py", line 170, in 
    figure = plot_admixture(admixture, population_indices, population_labels, title)
  File "distruct.py", line 16, in plot_admixture
    figure = plot.figure(figsize=(5,3))
  File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 539, in figure
    **kwargs)
  File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/backend_bases.py", line 171, in new_figure_manager
    return cls.new_figure_manager_given_figure(num, fig)
  File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/backend_bases.py", line 177, in new_figure_manager_given_figure
    canvas = cls.FigureCanvas(figure)
  File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/backends/backend_qt5agg.py", line 35, in __init__
    super(FigureCanvasQTAggBase, self).__init__(figure=figure)
  File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/backends/backend_qt5.py", line 235, in __init__
    _create_qApp()
  File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/backends/backend_qt5.py", line 122, in _create_qApp
    raise RuntimeError('Invalid DISPLAY variable')
RuntimeError: Invalid DISPLAY variable

Luckily I have already encountered similar errors from using matplotlib in Python. What fixed this was manually editing the distruct.py script to add this line  “plot.switch_backend(‘agg’)” in such that the first few lines of this file looked like this.

import numpy as np
import matplotlib.pyplot as plot
plot.switch_backend('agg')
import colorsys
import getopt
import sys, pdb

This let me produce output!

Overall

This program was very frustrating to use, and if it weren’t for the fact that I decided to write a detailed blog post about it, I may not have discovered that I was cutting out the first four loci columns from my input file. The fact that I needed to manually edit a Python script to get output is also not a good thing – I can imagine a lot of people getting stuck here and having no idea what to do.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s