Recently, I have been providing a bit of bioinformatic assistance to another researcher at QUT. As part of this, I have had to learn how to format data inputs, among other things, to get fastStructure to work. fastStructure is a program used for population genetics analysis (a fair bit outside of what I am currently doing). Below are the main things I recall finding a bit tricky. Most of this difficulty is simply because the program is not extensively documented, and you need to search around online and use some rational deduction to figure out what exactly you must do.
This part of the process was not altogether difficult to me, but I can see it being the worst part for anyone not familiar with using Linux.
There are a few requirements detailed on the official github. Specifically, this is a distribution of Python 2.7 with the Numpy and Scipy packages, as well as the GNU Scientific Library (GSL). For Python, I think installing Anaconda2 is the best way to handle this. You can download the installer from https://www.anaconda.com/download/, making sure to grab the correct version for your computer (Linux, then 32/64 bits). On the HPC environment available at QUT, the GSL package can be loaded by running this.
module load gsl/2.1-foss-2016a
Which, importantly, does this equivalent of this.
The other two things you need to specify for installation include (directories are my perspective, yours will likely differ).
export CFLAGS="-I/pkg/suse12/software/gsl/2.1-foss-2016a/include" export LDFLAGS="-L/pkg/suse12/software/gsl/2.1-foss-2016a/lib"
If something like this isn’t available then you’ll need to install GSL yourself. Hopefully it isn’t too difficult. Otherwise, the instructions on the fastStructure github should be reasonably easy to follow just by copy-pasting what’s instructed.
This was the most difficult part of using the fastStructure program to me. In the file distribution package an example .bed file is provided for testing, but it gives little assistance since it is encoded as a binary file and thus isn’t altogether accessible to figure out how to recreate this format. The program does accept .str format, however. In this context, .str refers to STRUCTURE format which itself isn’t clearly and succinctly described either. This format was, luckily, already mostly formatted by the researcher. I did encounter a few errors initially trying to get the program to read this in, which I’ll detail below.
Firstly, the program adds the “.str” suffix to your input file name. Thus, if you specify –input=example_file.str, it will attempt to open “example_file.str.str” and likely result in this error:
Traceback (most recent call last): File "structure.py", line 172, in G = parse_str.load(params['inputfile']) File "parse_str.pyx", line 7, in parse_str.load handle = open(file+'.str','r') IOError: [Errno 2] No such file or directory: 'example_file.str.str'
The key bit is the handle = open(file+’.str’,’r’) part. It artificially adds “.str” to the end of your input file name (file). Thus, you need to make sure the file is called “example_file.str”, not “example_file.txt” or something similar, and you need to specify –input=example_file and not –input=example_file.str. This is a little bit annoying, and reminds me of what I did when I first learned coding and wasn’t used to writing file extensions. Perhaps the intention was precisely to make it “easier” for people unfamiliar with command-line to use the program, but I think this backfired.
The next problem looked like this.
Traceback (most recent call last): File "structure.py", line 172, in G = parse_str.load(params['inputfile']) File "parse_str.pyx", line 10, in parse_str.load L = loci.shape IndexError: tuple index out of range
This was caused by having two blank headers at the end of the file, resulting from conversion of another format to the one the researcher was using. Thus, make sure to look through the table and check that the headers are all present, in their correct position, and there are no empty cells when viewing the file in Excel.
A final thing that could be of help: this is an example of the input file’s format.
# # # # Smpl_ID Pop_ID Locus_1 Locus_2 Locus_3 Locus_4 Locus_5 Locus_6 Locus_7 Locus_8 Locus_9 Locus_10 # # # # Sample1 1 1 -9 0 1 0 1 1 1 1 0 # # # # Sample1 1 1 -9 1 0 1 0 0 0 0 1 # # # # Sample2 2 0 0 1 1 1 0 1 1 0 1 # # # # Sample2 2 1 0 1 0 1 1 0 1 0 1
This was a little difficult to figure out. I ended up looking through the code and found that the first 6 columns are ignored by default when reading in the .str file. This is further detailed in this somewhat helpful Google forum discussion with one of the program’s authors fastSTRUCTURE running but how to build .str or .bed file. Since the file I was given only had two columns with details (Smpl_ID and Pop_ID) I just added #’s to fill in empty cells within Excel and saved the file as a tab-delimited file (remembering to make sure the extension is .str afterwards).
For this, the Smpl_ID column doesn’t seem to matter. What does is your Pop_ID column, which should be a simple system to define your sampling groups such as a 1-10(00) system. Following this are the SNP values for your loci. -9 means absent data, 0 means no SNP, 1 means SNP is present. There is meant to be two rows per sample (as detailed in the above Google forum link) which I assume relates to the paternal/maternal strands, but I’m not actually sure…
I also don’t think the headers matter at all, so long as you have a header row. I can’t find in the code where it handles this header row (presumably by ignoring it?) but the author in the above Google forum link suggests that having a blank header row is correct so I must assume this is correct. No errors get raised so I guess it is?.
Overall, this was unnecessarily difficult to get right. It would be nice if the author of the code detailed the expected file format. It’d also help to make file input more sensible with additional checks in place to make sure everything is properly formatted rather than just raising obscure errors or, worse, cutting out columns with actual data in them without the user having any clue this is happening! Most of these problems could be easily fixed with a few hours adding extra checks and conditions throughout the code (even I could do it) which I think should be essential for any code going out to wide use in the scientific community.
The next thing which was a bit confusing to me was running the chooseK.py script. I was initially unsure why I was getting results when running chooseK.py on the test data but not on the “real” data. The error looked like this.
Traceback (most recent call last): File "chooseK.py", line 111, in print "Model complexity that maximizes marginal likelihood = %d"%Ks[np.argmax(marginal_likelihoods)] File "/home/n8942188/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 1004, in argmax return _wrapfunc(a, 'argmax', axis=axis, out=out) File "/home/n8942188/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 62, in _wrapfunc return _wrapit(obj, method, *args, **kwds) File "/home/n8942188/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 42, in _wrapit result = getattr(asarray(obj), method)(*args, **kwds) ValueError: attempt to get argmax of an empty sequence
Admittedly, on re-reading the github page, I see that the problem is mentioned in passing. Before running chooseK you need to generate results from the structure.py script using the simple prior and different K values. Then your –input value to chooseK.py is the same as your –output value from the structure.py script. From running the structure.py script with simple prior and different K values, you should have a number of files in the directory that look like OUTPUTPREFIX.1.log, OUTPUTPREFIX.2.log, etc., which chooseK will read.
The final hurdle was getting the distruct.py script to produce an admixture plot. When running this with just the –input and –output arguments, I received the below error.
Traceback (most recent call last): File "distruct.py", line 170, in figure = plot_admixture(admixture, population_indices, population_labels, title) File "distruct.py", line 16, in plot_admixture figure = plot.figure(figsize=(5,3)) File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/pyplot.py", line 539, in figure **kwargs) File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/backend_bases.py", line 171, in new_figure_manager return cls.new_figure_manager_given_figure(num, fig) File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/backend_bases.py", line 177, in new_figure_manager_given_figure canvas = cls.FigureCanvas(figure) File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/backends/backend_qt5agg.py", line 35, in __init__ super(FigureCanvasQTAggBase, self).__init__(figure=figure) File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/backends/backend_qt5.py", line 235, in __init__ _create_qApp() File "/home/n8942188/anaconda2/lib/python2.7/site-packages/matplotlib/backends/backend_qt5.py", line 122, in _create_qApp raise RuntimeError('Invalid DISPLAY variable') RuntimeError: Invalid DISPLAY variable
Luckily I have already encountered similar errors from using matplotlib in Python. What fixed this was manually editing the distruct.py script to add this line “plot.switch_backend(‘agg’)” in such that the first few lines of this file looked like this.
import numpy as np import matplotlib.pyplot as plot plot.switch_backend('agg') import colorsys import getopt import sys, pdb
This let me produce output!
This program was very frustrating to use, and if it weren’t for the fact that I decided to write a detailed blog post about it, I may not have discovered that I was cutting out the first four loci columns from my input file. The fact that I needed to manually edit a Python script to get output is also not a good thing – I can imagine a lot of people getting stuck here and having no idea what to do.