Infrastructure development, databases, and approach

 

The scale of the endeavour

The prokaryotes, bacteria and archaea, evolved some 3.5 billion years ago and exploit every environment on the planet. With the exception of their viruses, the bacteriophages, bacteria and archaea are the most abundant biological replicators on Earth and account for a large share of its biomass. With the emergence of eukaryotes (about 2 billion years ago), and especially multicellular organisms (approximately a billion years ago), numerous bacteria, but intriguingly no archaea, evolved pathogenic lifestyles. These pathogens have arisen in many parts of the bacterial domain and, consequently, the study of bacterial pathogens must accommodate an extraordinary range of biological diversity. The development of sequencing technology, which generates data at ever-increasing speed and in extremely high volumes, gives us unprecedented opportunities to exploit bacterial diversity in understanding biological processes, but simultaneously presents challenges in managing, storing, and interpreting the data that are such a rich source for biological inference. For nearly 30 years we have tackled these issues by generating structured data sets and developing publicly available databases using custom-built software. This provides us with an infrastructure that underpins our research and its translation.

Population annotation

Understanding the genetic variation that underlies phenotypic differences at the population level requires a comprehensive catalogue of the ‘pan genome’, that is: (i) a definition of the complete genetic repertoire available to each member of the population in question; (ii) knowledge of how the members of this repertoire are distributed within the population; and (iii) an understanding of the extent of this variation and the mechanisms by which it arises. We refer to the assembly of this information as ‘population annotation’. The pan genome of the entire bacterial domain is vast, given the number of taxa combined with the number of genes and their variants, but indexing it is an inherently scalable task that can be broken down into manageable parts. By limiting investigations to populations of defined size, it is possible to define the pan genome of, for example, a single group of closely related bacteria. The endeavour is also cumulative: once a particular gene and its variants have been defined there is no need to repeat the analysis, and an ever-enlarging catalogue of variation is gradually assembled.
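Parts (i) and (ii) of this definition can be illustrated with a minimal sketch. It assumes that each isolate has already been reduced to the set of genes found in its genome; the isolate and gene names, and the functions pan_genome and gene_frequencies, are hypothetical and are not part of any of our software.

```python
from collections import Counter

# Hypothetical toy data: isolate id -> set of genes detected in its genome.
isolates = {
    "iso1": {"geneA", "geneB", "geneC"},
    "iso2": {"geneA", "geneB", "geneD"},
    "iso3": {"geneA", "geneC", "geneD", "geneE"},
}

def pan_genome(gene_sets):
    """Return the union of every gene seen in any member of the population."""
    pan = set()
    for genes in gene_sets.values():
        pan |= genes
    return pan

def gene_frequencies(gene_sets):
    """Return the fraction of population members that carry each gene."""
    counts = Counter(gene for genes in gene_sets.values() for gene in genes)
    n = len(gene_sets)
    return {gene: count / n for gene, count in counts.items()}

pan = pan_genome(isolates)
freqs = gene_frequencies(isolates)
```

Because the catalogue is a union, adding a further isolate can only extend it, which is the sense in which the endeavour is cumulative.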

An obvious starting point for an exploration of genetic variation at the population level is the ‘core genome’: that subset of the pan genome which is universally, or nearly universally, present in the members of the population. Like the pan genome, the membership of a particular core genome depends on the definition of the population under consideration. For a single clone, virtually all the genes are core, but a core genome can also be defined for a subspecies grouping, a species, a genus and so on, up to the whole domain or even across all three domains. Within the domain Bacteria there is a definable core genome, comprising those genes that are universally present, and we have exploited a subset of these, the genes encoding the 50 or so ribosomal protein subunits, in our universal typing scheme, ribosomal multilocus sequence typing (rMLST).
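The dependence of the core genome on the population chosen can be sketched as follows. This is an illustration only: the function core_genome, its min_fraction parameter (which stands in for "universally, or nearly universally, present") and the toy gene sets are assumptions made for the example.

```python
def core_genome(gene_sets, min_fraction=1.0):
    """Return genes present in at least min_fraction of the population members."""
    if not gene_sets:
        return set()
    n = len(gene_sets)
    counts = {}
    for genes in gene_sets.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, count in counts.items() if count / n >= min_fraction}

# Hypothetical toy populations: two isolates of one clone, then the same
# isolates plus a more distantly related member of the species.
clone = {
    "iso1": {"rpsA", "rplB", "geneX", "geneY"},
    "iso2": {"rpsA", "rplB", "geneX", "geneY"},
}
species = dict(clone, iso3={"rpsA", "rplB", "geneZ"})

core_clone = core_genome(clone)              # every gene is core within the clone
core_species = core_genome(species)          # only the shared genes remain
relaxed_core = core_genome(species, 0.6)     # "nearly universal" threshold
```

Widening the population from the clone to the species shrinks the core to the genes shared by all three isolates, while relaxing the threshold readmits genes that are common but not universal.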

BIGSdb, PubMLST and population annotation

Population annotation requires de novo assembled genome sequences, as it is necessary not only to index variation in known genes but also to identify novel genes. These assembled data can then be explored with annotation software and catalogues of known genes. Our Bacterial Isolate Genome Sequence database (BIGSdb) was developed to realise this gene-by-gene approach to the analysis of bacterial sequences at the population level. Based on the concepts developed in the MLSTdbNet software that we built to store and analyse MLST data, BIGSdb comprises isolate provenance and phenotype data (‘metadata’), catalogues of known genes and their variants, and a ‘sequence bin’ that holds the genome sequence data, which need not be a complete closed genome. Annotation tools within the web interface of the database identify the presence of known genes, mark them in the sequence for future reference, and identify or define the variant present. This can be linked to automated gene-finding approaches to search for genes in untagged regions.
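The gene-by-gene idea can be sketched in a few lines. This is not BIGSdb code: the functions tag_locus and define_allele, the data layout and the sequences are hypothetical, and exact forward-strand string matching stands in for whatever sequence search a real implementation would use.

```python
def tag_locus(contigs, alleles):
    """Return (allele_id, contig, start, end) for the first known allele found, else None."""
    for contig_name, sequence in contigs.items():
        for allele_id, allele_seq in alleles.items():
            start = sequence.find(allele_seq)
            if start != -1:
                return allele_id, contig_name, start, start + len(allele_seq)
    return None

def define_allele(alleles, new_seq):
    """Return the id of new_seq, adding it under the next free integer if it is novel."""
    for allele_id, allele_seq in alleles.items():
        if allele_seq == new_seq:
            return allele_id
    new_id = max(alleles, default=0) + 1
    alleles[new_id] = new_seq
    return new_id

# Hypothetical catalogue of known variants: locus -> {allele id -> sequence}.
catalogue = {"locus1": {1: "ATGAAACCC", 2: "ATGAAGCCC"}}
# Hypothetical sequence bin: contig name -> assembled sequence (need not be closed).
sequence_bin = {"contig1": "GGGATGAAGCCCTTT", "contig2": "CCCC"}

# Mark each known locus in the sequence bin and record which variant is present.
tags = {locus: tag_locus(sequence_bin, alleles) for locus, alleles in catalogue.items()}

# A variant not yet in the catalogue is defined once and reused thereafter.
novel_id = define_allele(catalogue["locus1"], "ATGAAATCC")
```

Storing the coordinates alongside the allele identifier is what allows a tagged gene to be found again without repeating the search.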

Currently, BIGSdb integrates whole genome sequence (WGS) data with single-locus and multilocus sequence typing methods within the PubMLST.org website, which hosts more than 80 conventional MLST schemes and includes the rMLST database that indexes and defines the rMLST loci and sequence types. We are defining rMLST types for the whole bacterial domain and are working on core genome MLST (cgMLST) schemes, and increasingly pan genome schemes, for Neisseria and Campylobacter.
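How a scheme turns allele designations into a sequence type can be sketched as a lookup from allelic profile to an integer identifier. The function assign_sequence_type, the three-locus scheme and the profiles below are hypothetical and much smaller than any real MLST, rMLST or cgMLST scheme.

```python
def assign_sequence_type(profile, scheme_loci, known_profiles):
    """Return the sequence type for a profile, defining a new one if it is unseen."""
    # Order the allele numbers by the scheme's locus list so profiles compare consistently.
    key = tuple(profile[locus] for locus in scheme_loci)
    if key not in known_profiles:
        known_profiles[key] = max(known_profiles.values(), default=0) + 1
    return known_profiles[key]

# Hypothetical scheme definition and table of already defined profiles.
scheme_loci = ["locusA", "locusB", "locusC"]
known_profiles = {(1, 1, 1): 1, (1, 2, 1): 2}

existing_st = assign_sequence_type(
    {"locusA": 1, "locusB": 2, "locusC": 1}, scheme_loci, known_profiles
)
new_st = assign_sequence_type(
    {"locusA": 2, "locusB": 2, "locusC": 1}, scheme_loci, known_profiles
)
```

The same lookup applies whatever the number of loci in the scheme; only the length of the profile changes.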

Exploring population structure within diverse recombining bacteria

The ability to exchange DNA is an important feature of many bacteria, including our principal organisms of interest, members of the genera Neisseria and Campylobacter. A number of mechanisms can mediate this horizontal genetic transfer (HGT), with transformation a principal mechanism in these two genera. An important feature of HGT is that it disrupts clonal signal within populations, which makes studying the population structure much more difficult, but also much more interesting! For example, both Neisseria meningitidis and Campylobacter jejuni have essentially non-clonal population structures, with clonal signal disrupted by extensive HGT over the longer term, but within these populations distinct genotypes (lineages that can be recognised as clonal complexes by MLST analyses) can be identified. These lineages are surprisingly stable over long periods of time and during global spread, and are associated with interesting phenotypes such as the ability to cause invasive disease in meningococci and host association in C. jejuni. Understanding this population structure, and exploiting it to explore the phenotypes of these organisms, is a major theme that spans our interests in the various bacteria on which we work.