I remember it was Andrew Read who first sparked my interest in Streptococcus pneumoniae. During my visit at Penn State in 2011, he told me that Marc Lipsitch had been working on 'serotype switching', whereby pneumococcal bacteria using recombination were swapping surface structures – capsules – thus escaping vaccines. Eventually this persuaded me to come to Imperial College London to work on S. pneumoniae with Christophe Fraser.
Not long after I'd arrived, I had a discussion with Christophe about pneumococcal capsules. He told me about the fascinating problem of capsule diversity. There are around 100 different pneumococcal serotypes, and they are generated by different combinations of genes – a bit like Lego bricks. Christophe suggested that they could be evolving to form new serotypes, and that this would be interesting from a medical point of view. He then said: "Could we build a tree of all these serotypes to reconstruct how they evolved? If I were you, I'd take scissors, glue and get to work."
Now, over four years later, the results of this complex work have been published in Molecular Biology and Evolution. It turns out that scissors and glue didn't quite help, and I used a number of other tools instead. So here's a summary of what capsules are, what I found and why it's important.
The making of capsules
Streptococcus pneumoniae, like many other pathogenic bacteria, is surrounded by a polysaccharide capsule. You can see a picture below on the left.
This "capsule" is essentially a dense layer of polysaccharides (chains of monosaccharides). Polysaccharides are present on the surfaces of all bacteria, but sometimes they are so dense and thick that they form capsules. Such a capsule is synthesised by a group of genes found in the polysaccharide synthesis locus, cps.
In a nutshell, the capsule is made by:
synthesising a repeat unit (monosaccharide),
transporting the unit via the membrane to the surface,
repeating this many times, stacking the units together to create a polysaccharide chain.
The function of the enzymes encoded by the genes in the cps locus is to perform these tasks. Of particular importance are the serotype-specific genes (blue and orange genes above), which are specialised to attach one sugar to another. Their specific combination determines the polymer, and thus the surface antigen. These genes are by far the most diverse set of genes in the pneumococcus, forming hundreds of unrelated gene families (homology groups).
Visualising genetic diversity of cps
The goal of this study was to reconstruct how this enormous diversity in the cps emerged. To do so, I used sequences of 4,469 pneumococcal isolates, most of which came from two carriage collections: Mae La in Thailand and Massachusetts in the US. From these data I extracted 3,813 full sequences of cps. I then used the serotype-specific genes to create a serotype network based on genetic similarity. The network below visualises the genetic diversity of the cps locus, with serotypes shown as nodes and edges reflecting genetic similarity; the size of a node shows the number of isolates of that serotype in the dataset, and the colour shows the genetic diversity of the serotype.
In particular, red edges are defined as at least 58% of gene families shared between the pair, and black edges as at least 36% of gene families shared. In fact, this graph shows that it's impossible to reconstruct a reliable evolutionary tree of all serotypes: most serotypes have so little in common with other serotypes that there is not enough shared genetic diversity to relate them via a phylogenetic model. Thus, I decided to focus on closely related groups which share a large fraction of genetic diversity (clusters with red edges and red labels), and for which phylogenetic reconstruction is actually possible.
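To make the edge definition concrete, here is a minimal sketch of how such a network could be assembled. It assumes that "shared gene families" is measured as the Jaccard index between two serotypes' sets of gene families; the post does not state the exact formula, so that choice, the function names and the toy data are all illustrative assumptions rather than the method used in the paper.

```python
from itertools import combinations

# Thresholds quoted in the text above.
RED_THRESHOLD = 0.58
BLACK_THRESHOLD = 0.36


def shared_fraction(families_a, families_b):
    """Fraction of gene families shared by two serotypes.

    ASSUMPTION: measured as the Jaccard index (intersection / union);
    the original study may define the fraction differently.
    """
    union = families_a | families_b
    if not union:
        return 0.0
    return len(families_a & families_b) / len(union)


def classify_edge(fraction):
    """Return 'red', 'black' or None (no edge) for a shared fraction."""
    if fraction >= RED_THRESHOLD:
        return "red"
    if fraction >= BLACK_THRESHOLD:
        return "black"
    return None


def build_network(gene_families):
    """Map each connected serotype pair to its edge colour."""
    edges = {}
    for s1, s2 in combinations(sorted(gene_families), 2):
        colour = classify_edge(shared_fraction(gene_families[s1], gene_families[s2]))
        if colour is not None:
            edges[(s1, s2)] = colour
    return edges


# Toy, made-up serotypes and gene families (not real data).
toy = {
    "A": {"f1", "f2", "f3", "f4", "f5"},
    "B": {"f1", "f2", "f3", "f4", "f6"},
    "C": {"f1", "f2", "f3", "f7"},
    "D": {"f9", "f10"},
}
network = build_network(toy)
```

In this toy example, A and B end up joined by a red edge, C is attached to both by black edges, and D stays isolated, which mirrors the situation described above: most serotypes share too little with the rest to be placed on a common tree.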
Recombination drives emergence of new serotypes
"Possible" doesn't mean "easy", and the main problem was the mosaicism of these cps loci. Due to the amount of recombination, a basic phylogeny was simply unreliable. To resolve this, I applied a new approach that combines different methods for recombination detection (briefly, based on both Structure and Gubbins), and I'd encourage the interested to read the original publication. In the end I could infer the clonal phylogeny and map recombination events on the tree. Here's a summary for the four major serogroups: 6, 19, 23 and 14/15.
Red branches are those where recombinations occurred, and one can see that the emergence of new serotypes is typically associated with a recombination event. In the case of serogroup 19 we found that the emergence of 19B/19C was likely the result of recombination between 19F and a strain of S. mitis.
Origin of recombination events
Where are the recombinations coming from? We can answer this question because the recombination sequences can be BLASTed against the sequenced diversity. The results can be visualised as a network, where '?' essentially means "no strong similarity to anything known at the time of the publication".
Diversification is thus driven by recombination with other serotypes, and occasionally other species. The pattern of exchanges is also quite interesting: in the case of serogroups 6, 9, 10, 11, 18 and 23 we found the distribution of recombinations to be significantly different from a random distribution. We're not entirely sure why, but the most likely explanation is that some serotypes are much more or less likely to co-colonise with other serotypes.
Increased molecular clock rate of cps
Next, I wanted to know whether the molecular clock of the cps is faster than the average clock rate in the rest of the genome. To this end, I used the three main lineages in the study, PMEN1, PMEN2 and PMEN14, which are highly epidemiologically successful multi-drug resistant lineages. Recombinations were removed with Gubbins. I then used BEAST2 to compare the cps clock with the rest of the genome ("background"). As you can see on the left below, the clock rate of the cps is higher than the mean clock rate in the genome. On the right, the clock rate of the cps locus is compared to the 'null distribution' of clock rates from the genome. These results show that the cps is one of the fastest evolving regions in the pneumococcal genome.
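The comparison against a 'null distribution' can be illustrated with a small sketch. It assumes the null is simply a list of clock-rate estimates for other regions of the genome, and asks what fraction of them are at least as fast as the cps. The function name and the numbers are hypothetical, and this is not the BEAST2 analysis itself, only the final "where does the cps sit in the null" step.

```python
def empirical_upper_tail(observed, null_values):
    """Fraction of null values that are >= the observed value.

    Uses the common +1 correction so the estimate is never exactly zero
    when the null sample is finite.
    """
    if not null_values:
        raise ValueError("null_values must not be empty")
    n_at_least = sum(1 for value in null_values if value >= observed)
    return (n_at_least + 1) / (len(null_values) + 1)


# Made-up clock rates in arbitrary units (not results from the paper).
background_rates = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 1.4, 1.1]
cps_rate = 2.5

tail_fraction = empirical_upper_tail(cps_rate, background_rates)
```

A small value means that few or no background regions evolve as fast as the cps, which is the sense in which a locus can be called one of the fastest evolving in the genome.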
Diversifying selection at wzd/wze
While testing the hypothesis that the increased clock rate is driven by selection acting on the cps, I stumbled upon an interesting finding. While the cps showed evidence of increased dN/dS compared to the rest of the genome, the signature of selection was highly focused on the two regulatory genes located upstream within the cps: wzd and wze. Such a high dN/dS (note the log scale!) suggests rather strong diversifying selection. We are not sure what explains this, but these genes have previously been shown to affect the expression of the capsule, so it might have something to do with selective pressures acting on the bacterium to downregulate or upregulate capsule production during colonisation of the host.
Previous studies found elevated recombination rates at the cps in particular lineages [4,6], but one genomic analysis of multiple lineages found that not all of them have undergone 'serotype switching'. It turns out that when you pool them together, the cps undergoes recombination around 2.5 times more frequently than the rest of the genome.
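For readers unfamiliar with dN/dS: it is the ratio of the nonsynonymous substitution rate (changes that alter the amino acid, per nonsynonymous site) to the synonymous rate (silent changes, per synonymous site). Values well above 1 point towards diversifying selection and values well below 1 towards purifying selection. The sketch below shows a simple counting-style estimate with a Jukes-Cantor correction, in the spirit of the Nei-Gojobori approach. The post does not say which estimator was used in the study, so treat this purely as an illustration; the input counts are assumed to have been obtained elsewhere.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Counting-style dN/dS estimate between two aligned coding sequences.

    The four inputs (numbers of differences and of sites in each class)
    are assumed to have been counted beforehand.
    """
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    if d_s == 0:
        raise ValueError("no synonymous differences: dN/dS is undefined")
    return d_n / d_s


# Made-up counts for illustration only.
neutral_like = dn_ds(10, 100, 10, 100)     # equal rates
diversifying_like = dn_ds(20, 100, 5, 100)  # excess of amino-acid changes
purifying_like = dn_ds(2, 100, 10, 100)    # amino-acid changes are rare
```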
Mosaic production factory?
Finally, two mosaics were found in the Mae La cohort. One of them was termed 10X due to its strong similarity to serogroup 10 (and in fact had been reported earlier in another study), and the second one was termed 39X due to its similarity to serogroup 39. Putative serotype 39X was sent to Statens Serum Institut in Copenhagen and tested positive for the presence of a capsule, but it had a previously unseen serological profile. It is thus quite plausible that it is a new serotype. Notably, both of them seem to be mosaics of other, previously known serotypes. It is also interesting that these mosaics were found for the first time in the collection where the mean frequency of recombination is generally higher than anywhere else, suggesting that pneumococci might be constantly producing novel recombinants which do not rise in frequency.
So in summary, here's what we found:
the cps locus is highly mosaic, with a complex genotype/phenotype map,
recombination underlies the emergence of most diversity we observe today,
these recombinations originate in other serogroups and other species of streptococci,
the cps locus has both an increased molecular clock rate and an increased recombination rate compared to the rest of the genome,
the increased clock is a result of both purifying and diversifying selection acting on the capsule,
in densely sampled populations one finds mosaic serotypes, which have not been observed in other locations.
It thus seems that the cps locus is an evolutionary hotspot in the pneumococcal genome, almost as if it were "designed" to adapt to rapidly changing selective pressures.
Importance of these findings
Are these findings merely an evolutionary curiosity? Not at all. The capsule is the target of all licensed pneumococcal conjugate vaccines. These vaccines target 7, 10 or 13 of ~100 serotypes. It has been argued that broadening these vaccines might eventually lead to the eradication of pneumococcal disease. However, if the capsule is able to evolve rapidly in response to changing selective pressures, it remains an open question what this means for the future of pneumococcal vaccines. One potential danger is that new serotypes are constantly generated in the background, and the vaccines, by clearing the dominant serotypes, will promote new variants which haven't been able to rise in frequency previously. These are only speculations, but one important conclusion is that infectious diseases evolve quickly, and to understand the long-term consequences of medical interventions we must fully appreciate the adaptive potential of pathogenic bacteria.
The second important conclusion is that closely related commensal bacteria, considered by many to be 'medically non-interesting', can play an important role in the evolution of their pathogenic cousins: they can supply new genes (including antibiotic resistance genes), thereby accelerating their evolution. Our results thus underline the importance of sequencing commensal bacteria which coexist with pathogens, and of perceiving them as part of the entire ecosystem.
All of this is really good news. The declining cost of genetic sequencing and the development of big-data approaches to studying growing genomic databases both mean that soon we might be able to predict the "genetic future" and stay one step ahead of lethal bacteria.
 Mostowy, R. J., Croucher, N. J., De Maio, N., Chewapreecha, C., Salter, S. J., Turner, P., et al. (2017). Pneumococcal capsule synthesis locus cps as evolutionary hotspot with potential to generate novel serotypes by recombination. Molecular Biology and Evolution. http://doi.org/10.1093/molbev/msx173
 Hammerschmidt, S., Wolff, S., Hocke, A., Rosseau, S., Müller, E., & Rohde, M. (2005). Illustration of pneumococcal polysaccharide capsule during adherence and invasion of epithelial cells. Infection and Immunity, 73(8), 4653–4667. http://doi.org/10.1128/IAI.73.8.4653-4667.2005
 Geno, K. A., Gilbert, G. L., Song, J. Y., Skovsted, I. C., Klugman, K. P., Jones, C., et al. (2015). Pneumococcal Capsules and Their Types: Past, Present, and Future. Clinical Microbiology Reviews, 28(3), 871–899. http://doi.org/10.1128/CMR.00024-15
 Chewapreecha, C., Harris, S. R., Croucher, N. J., Turner, C., Marttinen, P., Cheng, L., et al. (2014). Dense genomic sampling identifies highways of pneumococcal recombination. Nature Genetics, 46(3), 305–309. http://doi.org/10.1038/ng.2895
 Croucher, N. J., Finkelstein, J. A., Pelton, S. I., Mitchell, P. K., Lee, G. M., Parkhill, J., et al. (2013). Population genomics of post-vaccine changes in pneumococcal epidemiology. Nature Genetics. http://doi.org/10.1038/ng.2625
 Croucher, N. J., Harris, S. R., Fraser, C., Quail, M. A., Burton, J., van der Linden, M., et al. (2011). Rapid pneumococcal evolution in response to clinical interventions. Science, 331(6016), 430–434. http://doi.org/10.1126/science.1198545
 Croucher, N. J., Hanage, W. P., Harris, S. R., McGee, L., van der Linden, M., de Lencastre, H., et al. (2014). Variable recombination dynamics during the emergence, transmission and “disarming” of a multidrug-resistant pneumococcal clone. BMC Biology, 12, 49. http://doi.org/10.1186/1741-7007-12-49
 Croucher, N. J., Chewapreecha, C., Hanage, W. P., Harris, S. R., McGee, L., van der Linden, M., et al. (2014). Evidence for soft selective sweeps in the evolution of pneumococcal multidrug resistance and vaccine escape. Genome Biology and Evolution, 6(7), 1589–1602. http://doi.org/10.1093/gbe/evu120
 Croucher, N. J., Page, A. J., Connor, T. R., Delaney, A. J., Keane, J. A., Bentley, S. D., et al. (2014). Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Research. http://doi.org/10.1093/nar/gku1196
 Kadioglu, A., Weiser, J. N., Paton, J. C., & Andrew, P. W. (2008). The role of Streptococcus pneumoniae virulence factors in host respiratory colonization and disease. Nature Reviews Microbiology, 6(4), 288–301. http://doi.org/10.1038/nrmicro1871
 van Tonder, A. J., Bray, J. E., Quirk, S. J., Haraldsson, G., Jolley, K. A., Maiden, M. C. J., et al. (2016). Putatively novel serotypes and the potential for reduced vaccine effectiveness: capsular locus diversity revealed among 5405 pneumococcal genomes. Microbial Genomics, 2(10), 000090. http://doi.org/10.1099/mgen.0.000090