Once our scientists have received samples and extracted, amplified, cleaned, and sequenced the DNA, our work is only about half done. The bioinformatic processing that follows is likely the most difficult part of the process: millions of sequences need to be collapsed and filtered to provide our clients with a trustworthy, concise data set.
In the past, we collapsed the data into operational taxonomic units (OTUs) based on similarity. For example, if an amplicon is 100 bp long, any sequence within 3 bp of a given sequence (that is, at least 97% similar) gets lumped into the same OTU. This approach is conservative in many ways, but it cannot separate closely related sequences, and it generates a long tail of OTUs that likely represents mostly noise.
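To make the lumping concrete, here is a minimal sketch of similarity-based clustering. It is not our production pipeline: it assumes equal-length sequences, counts mismatches position by position (no alignment), and uses a simple greedy, abundance-ordered strategy. The names `hamming`, `cluster_otus`, and `max_mismatches` are hypothetical and exist only for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_otus(reads, max_mismatches=3):
    """Greedy clustering sketch (illustrative only).

    Unique sequences are visited from most to least abundant. Each one
    joins the first existing centroid within `max_mismatches`; otherwise
    it becomes the centroid of a new OTU.
    """
    counts = Counter(reads)
    otus = {}  # centroid sequence -> total reads assigned to that OTU
    for seq, n in counts.most_common():
        for centroid in otus:
            same_length = len(centroid) == len(seq)
            if same_length and hamming(centroid, seq) <= max_mismatches:
                otus[centroid] += n  # lumped into an existing OTU
                break
        else:
            otus[seq] = n  # no centroid close enough: start a new OTU
    return otus
```

With a 100 bp amplicon and `max_mismatches=3`, a variant that differs from an abundant sequence at a single base is absorbed into that sequence's OTU, which is exactly why this approach cannot tell closely related sequences apart.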
Here at Jonah, we are working to upgrade our bioinformatic processing away from OTUs. Instead of clustering data based on similarity, we analyze the sequence data to determine which variation appears to be real and which appears to be noise. Starting with a particular abundant sequence, all other sequences are compared to it. If variation around that sequence appears to be random, it is considered noise and tossed. If the variation is consistent, even at just one base, the variant is considered a unique, real sequence.
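The idea can be sketched in a few lines of code. This is a deliberately simplified stand-in, not the denoising algorithm we run: real denoisers use statistical error models, whereas this sketch assumes a plain abundance-ratio rule. A sequence that sits within one mismatch of an accepted sequence and is rare relative to it is treated as sequencing noise and discarded; anything else is kept as real. The function name `denoise` and the parameters `max_noise_ratio` and `max_mismatches` are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def denoise(reads, max_noise_ratio=0.05, max_mismatches=1):
    """Toy denoiser (illustrative assumption, not a production method).

    Unique sequences are visited from most to least abundant. A sequence
    is called noise if it lies within `max_mismatches` of an already
    accepted sequence AND its read count is at most `max_noise_ratio`
    times that sequence's count. Noise reads are discarded.
    """
    counts = Counter(reads)
    esvs = {}  # accepted sequence -> its own read count
    for seq, n in counts.most_common():
        is_noise = any(
            len(real) == len(seq)
            and hamming(real, seq) <= max_mismatches
            and n <= max_noise_ratio * counts[real]
            for real in esvs
        )
        if not is_noise:
            esvs[seq] = n  # consistent, abundant variation: keep as real
    return esvs
```

Under these assumptions, a one-base variant seen 300 times next to a parent seen 1,000 times is kept as its own sequence, while a one-base variant seen only 8 times is dropped as noise.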
This denoising approach generates what are called “exact sequence variants,” or ESVs. Empirically, we find that it has two benefits over OTUs. First, we are able to distinguish similar sequences instead of lumping them together. Second, the long tail of low-abundance sequences largely disappears. We generally end up with 60% fewer ESVs than OTUs while reducing the number of reads by only 20%. This 60% reduction comes even though many closely related sequences are now counted as unique.*
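A quick back-of-the-envelope calculation shows what those two percentages imply together. The starting figures below (1,000 OTUs and 1,000,000 reads) are made-up round numbers chosen only for illustration, and `summarize` is a hypothetical helper; only the 60% and 20% figures come from our own observations.

```python
def summarize(n_otus, n_reads, pct_fewer_units=60, pct_fewer_reads=20):
    """Apply the typical reductions to a hypothetical OTU data set."""
    # Integer percentages keep the arithmetic exact.
    n_esvs = n_otus * (100 - pct_fewer_units) // 100
    reads_kept = n_reads * (100 - pct_fewer_reads) // 100
    return {
        "esvs": n_esvs,
        "reads_kept": reads_kept,
        "mean_reads_per_otu": n_reads / n_otus,
        "mean_reads_per_esv": reads_kept / n_esvs,
    }


# Hypothetical input: 1,000 OTUs built from 1,000,000 reads.
result = summarize(1000, 1_000_000)
# result["esvs"] == 400 and result["reads_kept"] == 800000
# Mean reads per unit rises from 1000.0 (OTUs) to 2000.0 (ESVs).
```

In other words, keeping 80% of the reads in 40% as many units means each ESV is supported, on average, by about twice as many reads as each OTU was, which is consistent with the low-abundance tail being what disappears.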
We are investing in rapidly moving all of our bioinformatic pipelines over to ESVs. For each primer pair, we run a number of tests to optimize the intensity of denoising, and then we reprocess all of our past data to provide continuity for repeat clients. Functionally, ESVs and OTUs generate the same data format, but clients should notice greater accuracy, less uncertainty in taxonomic assignments, and more condensed result tables.