by Christoph Weigel
Summer's almost gone. Imagine you're strolling along the shore of a lake, enjoying nature's colors during sunset. A sparkle catches your eye where the lake languidly laps against the shore. You start pondering whether microbes — and if so, which ones, and how many different ones — cause these glistening, somewhat slimy foam flakes at the shore. Sure enough, you take samples! A bit of mud, a few spoonfuls of water, and, separately, a little of that slimy foam. Home again, your microscope reveals the splendid plenitude of small things, but considering that only a negligible percentage is cultivable, you pass the samples to the sequencing guys of your department. They have just unboxed their next-generation sequencing machines and avidly get them running with your samples.
A couple of days later you get notified by the sequencers that they have collected ~80 gigabytes of data for you, representing roughly 10,000,000 contigs with an average length of 4 kb, assembled from the sequence reads. During sample preparation, they had first filtered out all larger protists and subsequently removed the smaller virus-sized particles. So far, they have only processed your 'mud sample', but this pile of sequences could account — by a back-of-the-envelope calculation — for 2,000 bacterial genomes, each with 10-fold coverage, though certainly with a heavy over-representation of the more abundant organisms. Whack!
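The back-of-the-envelope calculation itself is quickly checked. It assumes roughly one byte per base and an average bacterial genome of ~4 Mb — both assumptions for illustration, not numbers from the sample:

```python
# Sanity check of the sequencers' estimate.
# Assumptions (not from the data): ~1 byte per base,
# average bacterial genome ~4 Mb.
total_bases = 80e9   # ~80 GB of sequence data
genome_size = 4e6    # average bacterial genome, ~4 Mb
coverage = 10        # 10-fold coverage per genome

genomes = total_bases / (genome_size * coverage)
print(genomes)  # → 2000.0
```

With 10-fold coverage and 4 Mb genomes, 80 gigabases indeed works out to about 2,000 genomes — before accounting for the uneven abundances that skew real samples.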
Now 'metagenomic binning' enters the stage. Obviously, analyzing and sorting your bulky data set calls for highly efficient software (= algorithms + parameter settings + data input + 'start' button) that does not paralyze the computing facilities of your institute for days. And you'd better forget about your laptop. Metagenomics is, according to a definition by Chen & Pachter, "the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species." The initial step in analyzing the sequence data to obtain the taxonomic diversity profile of your environmental sample is referred to as 'binning'. As a computational process, binning is conceptually analogous to established machine learning techniques, as it involves classifying and/or clustering the sequence contigs into specific bins.
Established binning methods can be classified as either 'guided' or 'unguided', which refers to whether or not they are 'trained' by a pre-assembled set of known sequences. Such 'training sets' can be collections of rRNA sequences — presently considered the 'gold standard' for assessing microbial diversity in an environmental sample — or genes representing particular metabolic pathways. Binning methods can also be guided by reference data representing the various bacterial taxa ('taxonomy-dependent') or be deliberately 'taxonomy-independent', which is particularly appealing for not always "rounding up the usual suspects." Finally, existing binning methods are — with respect to the algorithms applied — either alignment-based or composition-based, or they combine both approaches. Compositional properties of DNA sequences like GC percentage and oligonucleotide usage patterns — reflecting codon usage in the mostly compact bacterial genomes — can be mathematically vectorized. Results from alignment-free binning procedures are best described as vectors in a higher-dimensional space that can be compared to known sequences or models present in reference databases. Alignment-free methods are faster and require fewer computational resources than alignment-based methods, but they require longer input sequences — the ~4 kb contigs of your 'mud sample' are fine here.
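To make "mathematically vectorized" concrete, here is a minimal sketch of one common compositional signature, the tetranucleotide (4-mer) frequency vector. It is an illustration only — real binners additionally merge reverse complements, handle ambiguous bases, and work on much longer sequences:

```python
from itertools import product
from collections import Counter

def tetranucleotide_vector(seq):
    """Normalized tetranucleotide (4-mer) frequency vector of a DNA
    sequence -- one of the compositional properties alignment-free
    binning exploits. A sketch, not a production implementation."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]  # 4^4 = 256 dimensions
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = max(sum(counts[k] for k in kmers), 1)  # avoid division by zero
    return [counts[k] / total for k in kmers]

vec = tetranucleotide_vector("ATGCGATCGTAGCTAGCATCGAT" * 100)
print(len(vec))  # → 256
```

Each contig thus becomes a point in a 256-dimensional space, and contigs from the same genome tend to cluster there because oligonucleotide usage is a reasonably stable, genome-wide signature.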
Now, supposing you chose an alignment-free binning procedure, how can you represent the results as a graphical image? For your own previous work, you probably relied on already 'classical' methods for alignment-based similarity searches, for example BLAST. Briefly, pairwise comparisons of a set of aligned sequences are used to construct a distance matrix, for example by ClustalW. To generate the now familiar and intuitively interpretable graphic image, the distance matrix can then be converted by cluster analysis — grouping the most closely related pairs of sequences — into a bifurcating, rooted or unrooted phylogenetic tree. Since results from alignment-free binning cannot be reduced to a bifurcating tree, other routes are necessary to obtain a two-dimensional picture that is easy for humans to 'read'. By borrowing from informaticians working on artificial neural networks, bioinformaticians came up with an algorithmically guided procedure that 'reduces' the higher-dimensional vector space to a two-dimensional Emergent Self-Organizing Map (ESOM). Note in the example ESOM shown (Fig. 1) the different qualities of the 'border regions' of the various patches of grouped scaffolds (= grouped overlapping contigs) that could be assigned to different bacteria: sharp in most cases but at times rather fuzzy. This demonstrates a quality of these maps regretfully missing from conventional trees, where it is almost impossible to indicate quantitatively nested patches of sequence similarities due to shared genes among more distantly related organisms and divergent sequence stretches between closely related ones. As an analytical tool, such ESOMs can be used, for example, to aid in supervising the reconstruction of various complete genomes from your highly complex environmental 'mud sample'.
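The core of the ESOM idea — map nodes competing for input vectors and dragging their grid neighbors along — fits in a few lines. What follows is a toy self-organizing map in NumPy, for illustration only; real ESOM tools use far larger, borderless (toroidal) maps and refinements this sketch omits:

```python
import numpy as np

def train_som(data, grid=(20, 20), epochs=2000, seed=0):
    """Toy self-organizing map: projects high-dimensional composition
    vectors (e.g. tetranucleotide frequencies) onto a 2-D grid.
    A sketch of the principle, not a real ESOM implementation."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # 2-D grid coordinates of every map node, shape (rows*cols, 2)
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    weights = rng.random((rows * cols, data.shape[1]))
    sigma0, lr0 = max(grid) / 2.0, 0.5
    for t in range(epochs):
        frac = t / epochs
        sigma = sigma0 * (1 - frac) + 1e-9   # shrinking neighborhood radius
        lr = lr0 * (1 - frac)                # decaying learning rate
        x = data[rng.integers(len(data))]    # pick a random input vector
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
        dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-dist2 / (2 * sigma ** 2))              # neighborhood kernel
        weights += lr * h[:, None] * (x - weights)         # pull BMU & neighbors
    return weights, coords

def map_position(x, weights, coords):
    """2-D grid position a composition vector lands on."""
    return coords[np.argmin(((weights - x) ** 2).sum(axis=1))]
```

Feeding in the composition vectors of all contigs and plotting each one at its `map_position` yields the kind of patchwork seen in published ESOMs: contigs with similar composition — and hence, mostly, from the same genome — land in the same region of the map.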
You may come up with the uncomfortable notion that handing over your samples to the sequencing guys was actually a poor trade: swapping a complex microbial community for a data set just as complex. You start wondering whether the novel technique of single-cell sequencing could overcome the problem of choosing the appropriate binning method for your data set. Not so! This fascinating technique is exceedingly prone to all sorts of DNA contamination, and preparing single cells is — probably an understatement — demanding (more on single-cell sequencing here and here). Also, genome reconstructions by single-cell sequencing require ~100 "identical" cells (in quotes because it depends on the threshold you set for identity) to achieve full coverage. Hard to come by in most cases, or even impossible as in the case of two favorites of this blog, the mealybug endosymbionts: specialized abdominal cells (bacteriocytes) of the insect host, Planococcus citri, harbor cells of the β‑proteobacterium Tremblaya princeps, which in turn harbor cells of the γ‑proteobacterium Moranella endobia. Sure enough, both bacterial genomes were sequenced by the shotgun-sequencing approach from partially purified insect tissue. Binning of the sequence contigs was thus not overly complicated here, but for more complex microbial consortia the single-cell sequencing approach is less attractive. We can safely expect, however, that single-cell sequencing and shotgun sequencing will neatly complement each other in the long term.
All the above did not answer your question of which microbes — and how many different ones — caused those glistening, somewhat slimy foam flakes at the lakeshore. Hopefully, though, you have gained an initial, albeit superficial, understanding of the powerful tools bioinformaticians are developing not only to uncover but finally to quantitatively understand the big picture: the plenitude and diversity of the small things.
Christoph Weigel is a lecturer at the Life Science Engineering faculty of HTW, Berlin's University of Applied Sciences, and an Associate Blogger of STC.