Introducing fastGEAR

This month I had a paper accepted on a project I had done in collaboration with Pekka Marttinen at Aalto University in Finland, and several other colleagues. The paper introduces a new method for the analysis of bacterial genomes called fastGEAR [1]. There are so many new methods being published these days. That's why I thought it would be useful to write a short article about what fastGEAR is, how it works, and how to use it. Ok, here we go!

What is fastGEAR and why did we create it?

Recombination, broadly defined as exchange of genetic information between two lineages, is overwhelmingly common in nature. Its ubiquity has been underappreciated for decades. Only when genetic sequencing became cheaper did we start realising that almost every organism on this planet can exchange DNA in one form or another. And bacteria are no exception.

Having said that, bacteria are remarkably different and diverse, and the impact of recombination on their evolutionary history can vary [2]. Genomic data these days are used not only by evolutionary biologists, but more and more often by microbiologists, clinicians and public health representatives to infer the epidemiological process which drives the spread of microbial pathogens. As most models used by researchers assume no recombination, a good understanding of how genetic exchange affects genomic data is crucial for the correct interpretation of the data.

Many commonly used methods for detecting bacterial recombination specialise in detecting imports from external sources and are designed to be applied to a single lineage at a time [3-5]. These methods thus have limited ability to detect between-lineage or between-species recombinations. Given the growing size of genomic databases, approaches which can tackle this problem are of increasing interest.

This is where fastGEAR comes in. It's an approach which identifies the population genetic structure of the alignment in question, and detects recombinations between the inferred lineages as well as from external origins. Let me say a little more about how it works.

How does fastGEAR work?

Before I explain, let's define two terms I'm going to use here: recent recombinations and ancestral recombinations. Let's look at the figure below.

On the left we see a phylogenetic tree, which gives rise to two populations (or lineages), marked by two colours. Green arrows show the flow of three recombinations. On the right hand side you can see the genetic alignments with colours representing the ancestral patterns. As you can see, the three recombinations give rise to three admixture blocks. Recombination 1 affects all isolates in the lineage, and represents a kind of event I will term ‘ancestral recombination’. Recombinations 2 and 3 affect some but not all isolates in the lineage, and I will call them ‘recent recombinations’. It’s important to realise that these definitions are technical: ‘ancestral’ need not be very old, while ‘recent’ need not be very recent; they are recent/ancestral with respect to lineages we define.

Given an alignment, fastGEAR performs four steps:

identification of lineages,
identification of recent recombinations,
identification of ancestral recombinations,
test of significance.

The first three steps are done using a hidden Markov model (HMM). To start, we identify genetic clusters using BAPS [6]. Put simply, BAPS is a genetic-data-clustering approach which proposes the “best” number of clusters. One major limitation of BAPS is that it does not account for recombination, and thus with two clonal populations A and B and a mosaic population A/B, BAPS would detect three different populations. Here we want to go a step further and distinguish the mosaics from the non-mosaics.

To this end, we start with BAPS clusters and then detect ‘lineages’. Here’s how the first step works. Given the inferred clusters, fastGEAR compares each pair of clusters site-by-site using an HMM. Have a look at the figure. The observations are nucleotide frequencies at polymorphic sites (i, i+1, etc.) in each cluster. The latent variable (i.e., the true characteristic we want to infer) is whether the two clusters are the same or different at each site. Thus, for each pair of clusters, fastGEAR tells us at which sites the two clusters are the same and at which they are different. Then, fastGEAR collapses the two clusters into a single lineage if it finds them to be the same at 50% or more of the sites. This is the difference between ‘cluster’ and ‘lineage’: clusters are groups which are genetically distinct, while lineages are groups which are genetically distinct in at least 50% of the alignment. (Because if they are very similar in, say, 60% of the alignment and different in 40%, they probably differ due to recombination.)
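To make the collapsing rule concrete, here is a minimal Python sketch of a two-state HMM that labels each site as ‘same’ or ‘different’ for a pair of clusters and then applies the 50% rule. It is a toy illustration, not the fastGEAR implementation: the emission scores (based on the total variation distance between nucleotide frequencies), the `stay` probability, the use of all sites rather than only polymorphic ones, and every function name are assumptions made for this example.

```python
import math
from collections import Counter

NUCS = "ACGT"

def site_frequencies(seqs, site):
    """Nucleotide frequencies of one cluster at one site (assumes no gaps)."""
    counts = Counter(s[site] for s in seqs)
    total = sum(counts[n] for n in NUCS)
    return [counts[n] / total for n in NUCS]

def viterbi_two_state(log_emissions, stay=0.95):
    """Most likely path over states 0 ('same') and 1 ('different')."""
    log_stay, log_switch = math.log(stay), math.log(1 - stay)
    score = [math.log(0.5) + e for e in log_emissions[0]]
    back = []
    for em in log_emissions[1:]:
        new_score, pointers = [], []
        for k in (0, 1):
            cands = [score[j] + (log_stay if j == k else log_switch) for j in (0, 1)]
            best = max((0, 1), key=lambda j: cands[j])
            pointers.append(best)
            new_score.append(cands[best] + em[k])
        back.append(pointers)
        score = new_score
    state = max((0, 1), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def compare_clusters(cluster_a, cluster_b, eps=0.05):
    """Return the per-site path and whether the clusters should be collapsed."""
    n_sites = len(cluster_a[0])
    log_em = []
    for i in range(n_sites):
        fa, fb = site_frequencies(cluster_a, i), site_frequencies(cluster_b, i)
        d = 0.5 * sum(abs(x - y) for x, y in zip(fa, fb))  # total variation distance
        # Assumed emissions: 'same' favours small d, 'different' favours large d.
        log_em.append((math.log(1 - d + eps), math.log(d + eps)))
    path = viterbi_two_state(log_em)
    frac_same = path.count(0) / n_sites
    return path, frac_same >= 0.5  # collapse if same at 50% or more of the sites

path, collapse = compare_clusters(["AAAAAAAAAA"] * 3, ["AAAAAACCCC"] * 3)
print(path, collapse)
```

In this toy input the two clusters agree at six of ten sites, so they would be collapsed into one lineage; shrinking the shared part below half flips the decision.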

The second step is to find recent recombinations, as defined earlier. We do this by applying another HMM approach, this time sequence-by-sequence. Within each lineage, we now look for recombinations by comparing a sequence in question (the target sequence) to all remaining lineages. Basically, the HMM compares every site in the sequence in lineage X to everything else and asks: is it more similar to something else than to other strains in the same lineage X (see on the right)? If there is a fragment which originated in another lineage, then the HMM chain will detect this signal and assign the origin to that lineage. If the analysed fragment didn’t originate in another lineage but is highly different from other isolates in the lineage, then the HMM chain will detect this signal and assign its origin as external. Everything in the sequence which comes from outside the lineage is a potential recombination.
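The sequence-by-sequence idea can be sketched in a few lines of Python. Again, this is only an illustration under stated assumptions, not the model implemented in fastGEAR: each lineage is a hidden state whose emission is the (pseudocount-smoothed) frequency of the target's nucleotide in that lineage, the ‘external’ state emits every nucleotide with a flat probability `external_p`, and the names `viterbi` and `paint_sequence`, as well as all parameter values, are hypothetical. The target sequence is assumed to have been removed from its own lineage beforehand.

```python
import math

def viterbi(log_em, stay=0.98):
    """Most likely state path; log_em[i][k] is the log emission of site i in state k."""
    n_states = len(log_em[0])
    log_stay = math.log(stay)
    log_switch = math.log((1 - stay) / (n_states - 1))
    score = list(log_em[0])  # uniform prior omitted: it is a shared constant
    back = []
    for em in log_em[1:]:
        pointers, new_score = [], []
        for k in range(n_states):
            cands = [score[j] + (log_stay if j == k else log_switch)
                     for j in range(n_states)]
            best = max(range(n_states), key=cands.__getitem__)
            pointers.append(best)
            new_score.append(cands[best] + em[k])
        back.append(pointers)
        score = new_score
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def paint_sequence(target, lineages, pseudo=0.1, external_p=0.25):
    """Label each site of `target` with a lineage name or 'external'."""
    names = list(lineages) + ["external"]
    log_em = []
    for i, base in enumerate(target):
        row = []
        for seqs in lineages.values():
            matches = sum(s[i] == base for s in seqs)
            row.append(math.log((matches + pseudo) / (len(seqs) + 4 * pseudo)))
        row.append(math.log(external_p))  # assumed flat emission for unknown origin
        log_em.append(row)
    return [names[k] for k in viterbi(log_em)]

lineages = {"X": ["A" * 18] * 3, "Y": ["C" * 18] * 3}
labels = paint_sequence("AAAAAACCCCCCGGGGGG", lineages)
print(labels)
```

For a target that belongs to lineage X, every site labelled with anything other than "X" is a candidate recent recombination: here the middle block is assigned to lineage Y and the last block, which matches neither lineage, to an external origin.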

The third step is to look for ancestral recombinations, as defined earlier. This is done in almost the same way as the first step, except that this time we’re comparing lineages (not clusters), and with recent recombinations removed. In this way if two lineages are considered the same in part of the alignment by the HMM, this points to an ancestral recombination between the two lineages. In contrast to step two, fastGEAR doesn’t detect ancestral recombinations from external sources, which is important to keep in mind (see below).
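The two ingredients of this step, removing recent recombinations and reading ancestral recombinations off the ‘same’ stretches, can be illustrated with two small Python helpers. Both are hypothetical bookkeeping functions written for this post (names, the 0-based end-exclusive coordinates and the use of "N" as a missing-data symbol are assumptions); they stand in for, and do not reproduce, what fastGEAR does internally.

```python
def mask_recent(seq, recent_blocks, missing="N"):
    """Replace recent recombination blocks (0-based, end-exclusive) with missing data."""
    chars = list(seq)
    for start, end in recent_blocks:
        for i in range(start, min(end, len(chars))):
            chars[i] = missing
    return "".join(chars)

def shared_segments(same_flags):
    """Return (start, end) runs, end-exclusive, where two lineages were called 'same'."""
    segments, start = [], None
    for i, flag in enumerate(same_flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(same_flags)))
    return segments

print(mask_recent("ACGTACGT", [(2, 5)]))
print(shared_segments([False, True, True, False, False, True]))
```

Each segment returned by `shared_segments` for a pair of lineages would correspond to one candidate ancestral recombination between them.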

The fourth and final step is to remove false-positive recombinations. Such errors would appear with low detection power: in the case of step two in outlier strains, or in the case of step three in regions which happen to be similar but not because of recombination (e.g., selection). This is done by a diversity test: is the diversity of the fragment in question different from that of its background?
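To give a feel for what such a diversity test asks, here is a toy Python version for a recent recombination. It is a stand-in and should not be read as the test implemented in fastGEAR: I assume a per-site list of mismatches between the target and its own lineage, estimate a smoothed background mismatch rate outside the fragment, and use an exact binomial tail probability with an arbitrary threshold `alpha`. All names and values are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

def fragment_looks_unusual(mismatch, start, end, alpha=0.05):
    """mismatch[i] is True where the target differs from its own lineage.

    Returns (keep, p_value): keep is True when the fragment [start, end) is
    more diverged than expected from the rest of the sequence.
    """
    inside = mismatch[start:end]
    outside = mismatch[:start] + mismatch[end:]
    background = (sum(outside) + 1) / (len(outside) + 2)  # smoothed background rate
    p_value = binomial_tail(sum(inside), len(inside), background)
    return p_value < alpha, p_value

mismatch = [False] * 40 + [True] * 8 + [False] * 52
print(fragment_looks_unusual(mismatch, 40, 50))
```

A putative recombination in an outlier strain, whose whole sequence is diverged from its lineage, would have a high background rate and would therefore tend to fail a test of this kind.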

Basically, the approach of fastGEAR can be thought of as the opposite of ClonalFrame-like approaches. These approaches look for genetic outliers in clusters of similar data. In contrast, fastGEAR gains power from diversity and seeks recombinations by looking for similar segments between diverse clusters of data.

What fastGEAR isn’t and what it doesn’t do

Before one starts to use fastGEAR, it’s very important to be aware of its limitations. Here are three major things to keep in mind.

First, fastGEAR usually won’t give you a list of all recombinations present in the data. This is because the results are conditional on the clusters detected, and thus depend on the population structure. Recombinations (both recent and ancestral) which are reported are those which occur between detected lineages, which means that within-lineage recombinations will not be detected. At the moment, fastGEAR does not have the option of exploring the hierarchical population structure. However, there is an option for users to define their own clusters. Just please note that this option is used at the user’s own risk: if the clusters don’t make sense, the results probably won’t make sense either.

Second, fastGEAR can’t detect the direction of ancestral recombinations. This is kind of obvious when you think about it: ancestral recombinations are essentially segments in the alignment which are hypothesised to share an ancestry, but there’s no way of knowing which one was the origin, or if any of them were. One could make assumptions of course, for example that the lineage with fewer strains is the recipient, but fastGEAR can’t and won’t truly detect which one is.

Third, fastGEAR can’t detect external ancestral recombinations. Consider the following scenario. On the left we have the truth: three lineages (red, blue, orange) and the blue lineage receives an ancestral recombination from the orange lineage. Consider now that we only have the red and blue lineage, and that we analyse the segment in the black box with fastGEAR. What will happen is that fastGEAR will detect two lineages, and then it will detect an ancestral recombination marked by the green shared ancestry. However, in reality the purple segment is an external recombination which cannot be detected. Thus it’s not always possible to determine which part of the sequence represents clonal inheritance. Therefore, to avoid over-interpretations we recommend interpreting the results in the phylogenetic context.

How to run fastGEAR?

The code has been written in MATLAB, and the source code is available here. However, the easiest way to use the method is to install the freely available MATLAB Compiler Runtime (MCR), download the precompiled version and use it. At the moment fastGEAR is available for Windows and Linux. The details of running the software can be found in the manual. For Linux users, I have included an example bash script, which can be used to launch fastGEAR in a console (having specified the paths first). The files, including MCR, are available here.

Examples of use

It’s best to understand how to use fastGEAR by looking at some practical applications. I’m going to describe three examples here, and they’re all based on a dataset of 616 whole genomes of Streptococcus pneumoniae from Massachusetts [7]. S. pneumoniae (or the pneumococcus) is a bacterial pathogen which causes pneumonia, and we analysed a study where samples from children in primary care were taken over a period of seven years and then genetically sequenced. What’s interesting about the pneumococcus is that it’s highly recombinogenic. Thus, we wanted to use fastGEAR to better understand the impact of recombination on the population genetic structure of different genes.

Example 1: Impact of recombination on most conserved genes

In the first example, we wanted to investigate the impact of recombination on the bacterium’s genome-wide population structure. To this end, we focused on the 96 most conserved genes across species and ran fastGEAR independently on each of them. The results can be seen below.

On the left you can see the maximum-likelihood phylogeny based on the core genome (1,194 genes singly present in all isolates). Coloured clades correspond to 15 monophyletic BAPS clusters. On the right you can see a mosaic of colours, which is the result of running fastGEAR on these 96 ‘housekeeping’ genes. The colours were reordered to ease visual comparison of the population structure of different genes. One would expect that different BAPS clusters would align very well with different fastGEAR lineages in individual genes. What we see, however, is that the distribution of detected lineages varies a lot between genes, and the lineages kind of “average out” to form clusters. This is because of the impact of recombination: homologous recombination shuffles alleles at different genes over time, and this indeed is what we have known for quite some time [8]. However, what’s interesting is that now for the first time we can visualise such variation, and compare the degree of mosaicism at different genes (here visually).

Example 2: Comparing degree of mosaicism across genes

We next ran fastGEAR separately on each homology group (COG) and asked how many recombinations were detected (recombinations with the same start/end-points were counted singly). We have done so for this dataset and here are the results. The two axes show the number of recent and ancestral recombinations detected. The genes which score the highest are known recombination hotspots: they include surface antigens like pspC, the capsule synthesis sugar gene rmlA, and the antibiotic resistance gene pbp2X. Even though we already knew that these genes are hotspots, there are two exciting things about this discovery. First, the knowledge that these genes are recombination hotspots was gained using much more time-consuming, lineage-by-lineage analyses, whereas here we simply pooled all these genes and looked at the “recombination score”. Second, here we combined data from all lineages in the dataset, which confirms that these recombinations have been exchanged between different lineages. Thus, fastGEAR can yield quick insight into the mosaicism of different bacterial genes, which comes in particularly useful when dealing with pan-genome datasets.
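The counting rule used above (recombinations with the same start/end-points counted singly) is easy to reproduce once the detected events are in a table. The Python sketch below assumes a hypothetical record layout of (gene, recipient, start, end, kind) tuples; this is not the fastGEAR output format, just a convenient shape for the example.

```python
from collections import Counter

def count_unique_events(events):
    """Count recombinations per (gene, kind), collapsing identical start/end-points.

    events: iterable of (gene, recipient, start, end, kind) tuples, where kind
    is 'recent' or 'ancestral'. The layout is an assumption for this example.
    """
    unique = {(gene, start, end, kind) for gene, _, start, end, kind in events}
    return Counter((gene, kind) for gene, _, _, kind in unique)

events = [
    ("pbp2X", "strain1", 100, 400, "recent"),
    ("pbp2X", "strain2", 100, 400, "recent"),   # same breakpoints: counted once
    ("pbp2X", "strain3", 900, 1200, "recent"),
    ("pbp2X", "lineage1", 50, 300, "ancestral"),
    ("rmlA", "strain1", 10, 90, "recent"),
]
counts = count_unique_events(events)
print(counts[("pbp2X", "recent")], counts[("pbp2X", "ancestral")])
```

Plotting the ‘recent’ count against the ‘ancestral’ count for every gene gives the kind of two-axis “recombination score” view described above.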

Example 3: High-resolution insight into between-species recombinations

What’s perhaps most exciting about fastGEAR is its ability to analyse between-species data. Here’s the output of an analysis of the penicillin-binding protein encoding gene, pbp2X. For the analysis, we used 132 pbp2X sequences from closely related streptococci (S. mitis, S. oralis, S. infantis, S. pseudopneumoniae, S. dysgalactiae and others). In this analysis we excluded S. pneumoniae sequences, but results including those can be found in the original publication (see figure 5 in [1]).

On the left we have a fastGEAR output. On the right we have the same analysis using BratNextGen [4]. Since BratNextGen was designed to detect recombinations within a single lineage, inclusion of such diverse sequences results in a very low-resolution view of the mosaicism at this gene. In contrast, fastGEAR does a lot better in inferring such a mosaic structure. Thanks to computer simulations we know that fastGEAR does better at decoding such structures (we can believe what it shows!), and this figure shows how much better it does that than other commonly used methods for the analysis of bacterial recombination.

Summary

To summarise, fastGEAR is a tool to detect mosaicism in bacterial genomes. It can be used on individual genes as well as concatenations of genes. However, as we argued in the original publication it’s probably best used on a gene-by-gene basis as it then deals best with varying levels of diversity across different genes. For this reason, fastGEAR can work very well as a tool to analyse outputs of standard bacterial pan-genome production pipelines. It is exceptionally well suited for analyses of genes originating from different species. Its limitations are important to keep in mind when using the software, and these limitations can be best summarised by calling fastGEAR by what it really is: a tool to analyse patterns of shared ancestry within homologous segments of bacterial genes.

References

[1] Mostowy, R., Croucher, N. J., Andam, C. P., Corander, J., Hanage, W. P., & Marttinen, P. (2017). Efficient inference of recent and ancestral recombination within bacterial populations. Molecular biology and evolution.

[2] Hanage, W. P. (2016). Not so simple after all: bacteria, their population genetics, and recombination. Cold Spring Harbor perspectives in biology, 8(7), a018069.

[3] Didelot, X., & Falush, D. (2007). Inference of bacterial microevolution using multilocus sequence data. Genetics, 175(3), 1251-1266.

[4] Marttinen, P., Hanage, W. P., Croucher, N. J., Connor, T. R., Harris, S. R., Bentley, S. D., & Corander, J. (2012). Detection of recombination events in bacterial genomes from large population samples. Nucleic acids research, 40(1), e6-e6.

[5] Croucher, N. J., Page, A. J., Connor, T. R., Delaney, A. J., Keane, J. A., Bentley, S. D., et al. (2014). Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic acids research, gku1196.

[6] Corander, J., Waldmann, P., & Sillanpää, M. J. (2003). Bayesian analysis of genetic differentiation between populations. Genetics, 163(1), 367-374.

[7] Croucher, N. J., Finkelstein, J. A., Pelton, S. I., Mitchell, P. K., Lee, et al. (2013). Population genomics of post-vaccine changes in pneumococcal epidemiology. Nature genetics, 45(6), 656-663.

[8] Feil, E. J., & Spratt, B. G. (2001). Recombination and the population structures of bacterial pathogens. Annual Reviews in Microbiology, 55(1), 561-590.