HIGHLIGHTED PROJECTS
CompMutTB
Mixed infections in genotypic drug-resistant Mycobacterium tuberculosis (GMM4TB)
Malaria-Profiler
TB-Profiler
PAST PROJECTS
COVID-profiler
Using deep learning to identify recent positive selection in malaria parasite sequence data
Geographical classification of malaria parasites through applying machine learning
MTBrowse
malaria-genomaps
Poly-TB
CompMutTB
We have developed the CompMut-TB framework, which can assist with identifying compensatory mutations. This is important for more precise genome-based profiling of drug-resistant TB strains and for furthering our understanding of the evolutionary mechanisms that underpin drug resistance.
People who worked on this project:
nbillows
jphelan
tgclark
Mixed infections in genotypic drug-resistant Mycobacterium tuberculosis (GMM4TB)
Tuberculosis disease (TB), caused by Mycobacterium tuberculosis, is a major global public health problem, resulting in more than 1 million deaths each year. Drug resistance (DR), including multi-drug resistance (MDR-TB), is making TB control difficult and accounts for 16% of new and 48% of previously treated cases. To further complicate treatment decision-making, many clinical studies have reported patients harbouring multiple distinct strains of M. tuberculosis across the main lineages (L1 to L4). The extent to which drug-resistant strains can be deconvoluted within mixed strain infection samples is understudied. Here, we analysed M. tuberculosis isolates with whole genome sequencing data (n = 50,723), which covered the main lineages (L1 9.1%, L2 27.6%, L3 11.8%, L4 48.3%), with genotypic resistance to isoniazid (HR-TB; n = 9546 (29.2%)), rifampicin (RR-TB; n = 7974 (24.4%)), and at least MDR-TB (n = 5385 (16.5%)). TB-Profiler software revealed 531 (1.0%) isolates with potential mixed sub-lineage infections, including some with DR mutations (RR-TB 21/531; HR-TB 59/531; at least MDR-TB 173/531). To assist with the deconvolution of such mixtures, we adopted and evaluated a statistical Gaussian mixture model (GMM) approach. By simulating 240 artificial mixtures of different ratios from empirical data across L1 to L4, we found that a GMM approach was able to accurately estimate the DR profile of each lineage, with a low error rate for the estimated mixing proportions (mean squared error 0.012) and high accuracy for the DR predictions (93.5%). Application of the GMM to the clinical mixtures (n = 531) found that 33.3% (188/531) of samples consisted of DR and sensitive lineages, 20.2% (114/531) consisted of lineages with only DR mutations, and 40.6% (229/531) consisted of lineages with genotypic pan-susceptibility. Overall, our work demonstrates the utility of combined whole genome sequencing data and GMM statistical analysis approaches for providing insights into mono and mixed M. tuberculosis infections, thereby potentially assisting diagnosis, treatment decision-making, drug resistance and transmission mapping for infection control.
The runnable programme can be accessed from the author’s GitHub: https://github.com/linfeng-wang
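As a rough illustration of the mixture idea, and not the authors' programme, the sketch below fits a two-component one-dimensional Gaussian mixture by expectation-maximisation to alternate-allele read fractions. It assumes a two-strain sample in which sites private to the minor strain show a read fraction near the minor proportion and sites private to the major strain show a fraction near the major proportion. The function name `fit_two_component_gmm`, the starting values, and the simulated 30:70 mixture are all hypothetical choices made for this example.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(values, n_iter=200, min_sd=1e-3):
    """Fit a 1-D two-component Gaussian mixture by EM.

    Returns a list of (weight, mean, sd) tuples sorted by mean.
    """
    lo, hi = min(values), max(values)
    # Hypothetical initialisation: one component towards each end of the data.
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    sds = [max((hi - lo) / 4.0, min_sd)] * 2
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300  # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: update weight, mean and standard deviation per component.
        for k in range(2):
            rk = sum(r[k] for r in resp)
            if rk < 1e-12:
                continue  # component is empty; keep its previous parameters
            weights[k] = rk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / rk
            var = sum(r[k] * (x - means[k]) ** 2
                      for r, x in zip(resp, values)) / rk
            sds[k] = max(math.sqrt(var), min_sd)
    return sorted(zip(weights, means, sds), key=lambda c: c[1])


# Simulated read fractions for an assumed 30:70 mixture of two strains.
rng = random.Random(0)
fractions = [rng.gauss(0.30, 0.03) for _ in range(200)]
fractions += [rng.gauss(0.70, 0.03) for _ in range(200)]

components = fit_two_component_gmm(fractions)
for weight, mean, sd in components:
    print(f"weight={weight:.2f} mean={mean:.2f} sd={sd:.3f}")
```

In this toy setting the two fitted means serve as estimates of the mixing proportions, and each site can then be assigned to the strain whose component gives it the higher responsibility.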
People who worked on this project:
lwang
jphelan
tgclark
scampino
Malaria-Profiler
Malaria-Profiler is a pipeline that allows users to analyse Plasmodium malaria whole genome sequencing data to predict species and potential drug resistance. Follow the instructions below to upload a new sample or view analysed runs. The pipeline searches for small variants (SNPs and indels) in genes associated with drug resistance. It also reports the species and geographical location. By default, it uses Trimmomatic to trim the reads, BWA (or minimap2 for nanopore data) to align them to the reference genome, and freebayes to call variants.
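The default trim, align and call steps described above could be chained roughly as follows. This is a minimal sketch that only builds the command lines and does not run them. The function name `build_pipeline_commands`, the output file names, the Trimmomatic trimming settings, and the choice to skip trimming for nanopore reads are assumptions for illustration; the parameters Malaria-Profiler actually uses are not stated here.

```python
import shlex


def build_pipeline_commands(sample, reference, reads,
                            platform="illumina", threads=4):
    """Return the shell commands for one sample as a list of strings.

    Nothing is executed. `reads` holds two FASTQ paths for paired-end
    Illumina data or one FASTQ path for nanopore data. The reference is
    assumed to be indexed already (for example with `bwa index`).
    """
    sam, bam, vcf = f"{sample}.sam", f"{sample}.bam", f"{sample}.vcf"
    commands = []
    if platform == "nanopore":
        # Assumption: nanopore reads go straight to minimap2, untrimmed.
        commands.append(["minimap2", "-t", str(threads), "-ax", "map-ont",
                         "-o", sam, reference, reads[0]])
    else:
        if len(reads) != 2:
            raise ValueError("paired-end Illumina data needs two FASTQ files")
        paired = [f"{sample}_1P.fq.gz", f"{sample}_2P.fq.gz"]
        unpaired = [f"{sample}_1U.fq.gz", f"{sample}_2U.fq.gz"]
        # Illustrative trimming steps, not the pipeline's real settings.
        commands.append(["trimmomatic", "PE", "-threads", str(threads),
                         "-phred33", reads[0], reads[1],
                         paired[0], unpaired[0], paired[1], unpaired[1],
                         "LEADING:3", "TRAILING:3",
                         "SLIDINGWINDOW:4:20", "MINLEN:36"])
        commands.append(["bwa", "mem", "-t", str(threads),
                         "-o", sam, reference, paired[0], paired[1]])
    # Sort and index the alignment, then call variants with freebayes.
    commands.append(["samtools", "sort", "-@", str(threads), "-o", bam, sam])
    commands.append(["samtools", "index", bam])
    commands.append(["freebayes", "-f", reference, "-v", vcf, bam])
    return [shlex.join(cmd) for cmd in commands]


for line in build_pipeline_commands(
        "sample1", "reference.fa", ["sample1_1.fq.gz", "sample1_2.fq.gz"]):
    print(line)
```

Returning the commands as strings keeps the sketch free of side effects, so the steps can be inspected or written to a script before anything is run.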
People who worked on this project:
jthorpe
emanko
aturkiewicz
jphelan
TB-Profiler
Drug resistance in Mycobacterium tuberculosis is primarily due to small polymorphisms within the genome. Current molecular tests examine a limited number of loci. Whole genome sequencing has the potential to overcome such problems, but the complexity of data interpretation has, thus far, restricted its application. We have compiled a library of more than 1,300 mutations predictive of resistance for 15 anti-tuberculosis drugs (isoniazid, rifampicin, ethambutol, streptomycin, pyrazinamide, ethionamide, moxifloxacin, ofloxacin, amikacin, capreomycin, kanamycin, para-aminosalicylic acid, linezolid, clofazimine and bedaquiline) (Drug Resistance Mutations Library). To progress sequencing towards real-time management of patients, we have developed TB-Profiler, which rapidly analyses raw sequence data and predicts resistance and lineage. In addition to identifying known drug resistance conferring mutations, the tool also identifies other mutations in the candidate regions. TB-Profiler processes FASTQ files at a rate of 80,000 sequence reads per second. Access to rapid, accurate and user-friendly analytical tools like TB-Profiler is required to accelerate the uptake of next generation sequencing for detecting drug resistance, thereby improving access to effective therapy in clinical settings.
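The library lookup described above can be pictured with a small sketch. It assumes variants have already been called and annotated as (gene, amino acid change) pairs. The three-entry `RESISTANCE_LIBRARY` is a hypothetical excerpt built from well-known resistance mutations, and `predict_resistance` is an illustrative name, not TB-Profiler's own interface.

```python
# Hypothetical three-entry excerpt; the real library holds more than
# 1,300 mutations across 15 drugs.
RESISTANCE_LIBRARY = {
    ("rpoB", "p.Ser450Leu"): ["rifampicin"],
    ("katG", "p.Ser315Thr"): ["isoniazid"],
    ("embB", "p.Met306Val"): ["ethambutol"],
}

# Genes that appear in the library are treated as candidate regions.
CANDIDATE_GENES = {gene for gene, _ in RESISTANCE_LIBRARY}


def predict_resistance(variants):
    """Split variants into library hits and other candidate-gene variants.

    `variants` is an iterable of (gene, change) pairs. Returns a dict
    mapping each drug to its supporting mutations, plus a list of
    variants in candidate genes that are not in the library.
    """
    resistant_to = {}
    other_variants = []
    for gene, change in variants:
        drugs = RESISTANCE_LIBRARY.get((gene, change))
        if drugs:
            for drug in drugs:
                resistant_to.setdefault(drug, []).append(f"{gene} {change}")
        elif gene in CANDIDATE_GENES:
            other_variants.append(f"{gene} {change}")
    return resistant_to, other_variants


called = [
    ("rpoB", "p.Ser450Leu"),
    ("katG", "p.Ser315Thr"),
    ("katG", "p.Arg463Leu"),  # in a candidate gene, absent from the excerpt
    ("gyrA", "p.Glu21Gln"),   # gene not covered by this excerpt
]
profile, others = predict_resistance(called)
print(profile)
print(others)
```

Keeping the unmatched candidate-gene variants in a separate list mirrors the description above, where mutations outside the library are still reported for the candidate regions.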
People who worked on this project:
gnapier
jphelan
COVID-profiler
SARS-CoV-2 virus sequencing has been applied to track the spread of the COVID-19 pandemic and to assist the development of PCR-based diagnostics, serological assays, and vaccines. With sequencing becoming routine globally, bioinformatic tools are needed to assist in the robust processing of the resulting genomic data. We developed a web-based bioinformatic pipeline (“COVID-Profiler”) that inputs raw or assembled sequencing data, displays raw alignments for quality control, annotates the mutations found, and performs phylogenetic analysis. The pipeline software can be applied to other (re-)emerging pathogens.
People who worked on this project:
jphelan
dward
tgclark
scampino
Using deep learning to identify recent positive selection in malaria parasite sequence data
Using simulated genomic data, the deep learning approach DeepSweep could detect recent sweeps with high predictive accuracy (area under the ROC curve > 0.95). DeepSweep was applied to Plasmodium falciparum (n = 1125; genome size 23 Mbp) and Plasmodium vivax (n = 368; genome size 29 Mbp) WGS data, and the genes identified overlapped with those from two established extended haplotype homozygosity methods (within-population iHS, across-population Rsb) (~ 60–75% overlap of hits at P < 0.0001). DeepSweep hits included regions proximal to known drug resistance loci for both P. falciparum (e.g. pfcrt, pfdhps and pfmdr1) and P. vivax (e.g. pvmrp1).
People who worked on this project:
emanko
jphelan
tgclark
scampino
Geographical classification of malaria parasites through applying machine learning
Malaria, caused by Plasmodium parasites, is a major global health challenge. Whole genome sequencing (WGS) of Plasmodium falciparum and Plasmodium vivax genomes is providing insights into parasite genetic diversity and transmission patterns, and can inform decision making for clinical and surveillance purposes. Advances in sequencing technologies are helping to generate large genomic datasets in a timely manner, with the prospect of applying Artificial Intelligence analytical techniques (e.g., machine learning) to support programmatic malaria control and elimination. Here, we assess the potential of applying deep learning convolutional neural network approaches to predict the geographic origin of infections (continents, countries, GPS locations) using WGS data of P. falciparum (n = 5957; 27 countries) and P. vivax (n = 659; 13 countries) isolates. Using identified high-quality genome-wide single nucleotide polymorphisms (SNPs) (P. falciparum: 750 k, P. vivax: 588 k), an analysis of population structure and ancestry revealed clustering at the country level. When predicting locations for both species, classification (compared to regression) methods had the lowest distance errors, and > 90% accuracy at a country level. Our work demonstrates the utility of machine learning approaches for geo-classification of malaria parasites. With timelier WGS data generation across more malaria-affected regions, the performance of machine learning approaches for geo-classification will improve, thereby supporting disease control activities.
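One way to picture the distance error mentioned above is as the great-circle distance between a predicted location and the true sampling location. How the error was actually computed is not stated here, so the haversine formula, the 6371 km Earth radius, and the function names below are assumptions for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2)
    # min() guards against rounding pushing the value just above 1.
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def mean_distance_error_km(predicted, actual):
    """Mean great-circle distance between paired predicted and true points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    distances = [haversine_km(p[0], p[1], t[0], t[1])
                 for p, t in zip(predicted, actual)]
    return sum(distances) / len(distances)
```

Under this reading, a classifier that predicts a country would be scored by the distance from a representative point of the predicted country (for example its centroid, again an assumption) to the true location, which lets classification and regression outputs be compared on the same scale.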
People who worked on this project:
emanko
jphelan
tgclark
scampino
MTBrowse
MTBrowse is an analysis webpage for Mycobacterium tuberculosis RNA sequencing data. Sourcing data directly from the National Center for Biotechnology Information (NCBI), our platform provides an analytical toolset for dissecting, visualising, and understanding RNA-Seq data. With a wide array of features ranging from detailed plots and graphs to intuitive analytics, MTBrowse is your key to the future of RNA sequencing analysis. The web-based tool offers numerous visualisation options, including the ability to view differential expression or discern differences between lineages or drug resistance profiles via box plots and scatter plots. Join us on this journey to the cutting edge of bioinformatics today.
People who worked on this project:
jthorpe
nbillows
malaria-genomaps
PlasmoMaps is a web-based tool that can be used to explore genomic sequence data of the non-falciparum malaria parasites, including P. malariae (n>200; XX SNPs; X countries), P. vivax (n>800; XX SNPs), P. ovale (n>50; XX SNPs), and P. knowlesi (n>200; XX SNPs). The tool integrates the presentation and summarisation (e.g., allele frequencies) of genomic variants with geographical location on maps, and shows extra variant information in the Integrative Genomics Viewer (IGV). The tool comes with two separate maps, for searching genomic variants either by chromosome or by a selected gene.
People who worked on this project:
jthorpe
Poly-TB
Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), is the second major cause of death from an infectious disease worldwide. Recent advances in DNA sequencing are leading to the ability to generate whole genome information in clinical isolates of M. tuberculosis complex (MTBC). The identification of informative genetic variants such as phylogenetic markers and those associated with drug resistance or virulence will help barcode Mtb in the context of epidemiological, diagnostic and clinical studies. Mtb genomic datasets are increasingly available as raw sequences, which are potentially difficult and computationally intensive to process and to compare across studies. Here we have processed the raw sequence data (>1500 isolates, eight studies) to compile a catalogue of SNPs (n = 74,039, 63% non-synonymous, 51.1% in more than one isolate, i.e. non-private), small indels (n = 4810) and larger structural variants (n = 800). We have developed the PolyTB web-based tool (http://pathogenseq.lshtm.ac.uk/polytb) to visualise the resulting variation and important meta-data (e.g. in silico inferred strain-types, location) within geographical map and phylogenetic views. This resource will allow researchers to identify polymorphisms within candidate genes of interest, as well as examine the genomic diversity and distribution of strains. PolyTB source code is freely available to researchers wishing to develop similar tools for their pathogen of interest.
People who worked on this project:
fcoll