HIGHLIGHTED PROJECTS
CompMutTB
Mixed infections in genotypic drug-resistant Mycobacterium tuberculosis (GMM4TB)
Multi-platform whole genome sequencing for tuberculosis clinical and surveillance applications
TB-Profiler
PAST PROJECTS
Global Drug resistance study
Kampala Study
Karonga Diversity Study
MTB Sequencing Variability
MTBrowse
PE/PPE Genes
Poly-TB
Philippines TB Drug Resistance Study
TB-ML
CompMutTB
We have developed the CompMut-TB framework, which helps identify compensatory mutations. Identifying these mutations is important for more precise genome-based profiling of drug-resistant TB strains and for furthering understanding of the evolutionary mechanisms that underpin drug resistance.
People who worked on this project:
nbillows
jphelan
tgclark
Mixed infections in genotypic drug-resistant Mycobacterium tuberculosis (GMM4TB)
Tuberculosis disease (TB), caused by Mycobacterium tuberculosis, is a major global public health problem, resulting in more than 1 million deaths each year. Drug resistance (DR), including multi-drug resistance (MDR-TB), is making TB control difficult and accounts for 16% of new and 48% of previously treated cases. To further complicate treatment decision-making, many clinical studies have reported patients harbouring multiple distinct strains of M. tuberculosis across the main lineages (L1 to L4). The extent to which drug-resistant strains can be deconvoluted within mixed strain infection samples is understudied. Here, we analysed M. tuberculosis isolates with whole genome sequencing data (n = 50,723), which covered the main lineages (L1 9.1%, L2 27.6%, L3 11.8%, L4 48.3%), with genotypic resistance to isoniazid (HR-TB; n = 9546 (29.2%)), rifampicin (RR-TB; n = 7974 (24.4%)), and at least MDR-TB (n = 5385 (16.5%)). TB-Profiler software revealed 531 (1.0%) isolates with potential mixed sub-lineage infections, including some with DR mutations (RR-TB 21/531; HR-TB 59/531; at least MDR-TB 173/531). To assist with the deconvolution of such mixtures, we adopted and evaluated a statistical Gaussian mixture model (GMM) approach. By simulating 240 artificial mixtures of different ratios from empirical data across L1 to L4, a GMM approach was able to accurately estimate the DR profile of each lineage, with a low error rate for the estimated mixing proportions (mean squared error 0.012) and high accuracy for the DR predictions (93.5%). Application of the GMM to the clinical mixtures (n = 531) found that 33.3% (188/531) of samples consisted of DR and sensitive lineages, 20.2% (114/531) consisted of lineages with only DR mutations, and 40.6% (229/531) consisted of lineages with genotypic pan-susceptibility. Overall, our work demonstrates the utility of combined whole genome sequencing data and GMM statistical analysis approaches for providing insights into mono and mixed M. tuberculosis infections, thereby potentially assisting diagnosis, treatment decision-making, drug resistance and transmission mapping for infection control.
The runnable programme is available from the author's GitHub: https://github.com/linfeng-wang
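The deconvolution idea described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that, in a two-strain mixture, strain-specific variants have read allele frequencies clustering around the two mixing proportions, and it fits a one-dimensional, two-component Gaussian mixture by expectation-maximisation. The function names, starting values and toy frequencies are all hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_component_gmm(freqs, n_iter=100, min_var=1e-4):
    """Fit a 1-D two-component Gaussian mixture to allele frequencies by EM."""
    # Deterministic start (an assumption): one component at each extreme.
    means = [min(freqs), max(freqs)]
    variances = [0.01, 0.01]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frequency.
        resp = []
        for x in freqs:
            dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: update weights, means and variances from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            means[k] = sum(r[k] * x for r, x in zip(resp, freqs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, freqs)) / nk
            variances[k] = max(var, min_var)  # floor keeps the density finite
            weights[k] = nk / len(freqs)
    return means, variances, weights

def assign_component(x, means, variances, weights):
    """Return the index (0 or 1) of the component most likely to have produced x."""
    dens = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
    return 0 if dens[0] >= dens[1] else 1

# Toy allele frequencies for a hypothetical 30:70 mixture.
toy_freqs = [0.28, 0.31, 0.30, 0.29, 0.32, 0.70, 0.69, 0.72, 0.71, 0.68]
means, variances, weights = fit_two_component_gmm(toy_freqs)

# A resistance mutation observed at frequency 0.29 would be attributed to the
# minor strain (component 0), and one at 0.71 to the major strain (component 1).
minor_call = assign_component(0.29, means, variances, weights)
major_call = assign_component(0.71, means, variances, weights)
```

In this sketch the fitted component means play the role of the estimated mixing proportions, and each drug-resistance mutation is attributed to whichever component best explains its allele frequency.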
People who worked on this project:
lwang
jphelan
tgclark
scampino
Multi-platform whole genome sequencing for tuberculosis clinical and surveillance applications
Whole genome sequencing (WGS) of Mycobacterium tuberculosis offers valuable insights for tuberculosis (TB) control. High-throughput platforms like Illumina and Oxford Nanopore Technologies (ONT) are increasingly used globally, although ONT is known for higher error rates and is less established for genomic studies. Here we present a study comparing the sequencing outputs of both Illumina and ONT platforms, analysing DNA from 59 clinical isolates in highly endemic TB regions of Thailand. The resulting sequence data were used to profile the M. tuberculosis pairs for their lineage, drug resistance and presence in transmission chains, and were compared to publicly available WGS data from Thailand (n = 1456). Our results revealed isolates that are predominantly from lineages 1 and 2, with consistent drug resistance profiles, including six multidrug-resistant strains; however, analysis of ONT data showed longer phylogenetic branches, emphasising the technology's higher error rate. An analysis incorporating the larger dataset identified fifteen of our samples within six potential transmission clusters, including a significant clade of 41 multidrug-resistant isolates. ONT's longer reads also revealed strain-specific structural variants in pe/ppe genes (e.g. ppe50), which are candidate loci for vaccine development. Despite some limitations, our results show that ONT sequencing is a promising approach for TB genomic research, supporting precision medicine and decision-making in areas with less developed infrastructure, which is crucial for tackling the disease's significant regional burden.
People who worked on this project:
jthorpe
TB-Profiler
Drug resistance in Mycobacterium tuberculosis is primarily due to small polymorphisms within the genome. Current molecular tests examine a limited number of loci. Whole genome sequencing has the potential to overcome such problems, but the complexity of data interpretation has, thus far, restricted its application. We have compiled a library of more than 1,300 mutations predictive of resistance for 15 anti-tuberculosis drugs (isoniazid, rifampicin, ethambutol, streptomycin, pyrazinamide, ethionamide, moxifloxacin, ofloxacin, amikacin, capreomycin, kanamycin, para-aminosalicylic acid, linezolid, clofazimine and bedaquiline); see the Drug Resistance Mutations Library. To progress sequencing towards real-time management of patients, we have developed TB-Profiler, which rapidly analyses raw sequence data and predicts resistance and lineage. In addition to identifying known drug resistance-conferring mutations, the tool also identifies other mutations in the candidate regions. TB-Profiler processes fastq files at a rate of 80,000 sequence reads per second. Access to rapid, accurate and user-friendly analytical tools like TB-Profiler is required to accelerate the uptake of next generation sequencing for detecting drug resistance and to improve access to effective therapy in clinical settings.
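The library-lookup step described above can be sketched in a few lines. This is an illustration only, not TB-Profiler's code or data format: the three library entries are well-known resistance mutations chosen as examples, the variant katG p.Ala110Val is a hypothetical placeholder for an uncatalogued mutation, and the function and variable names are assumptions.

```python
# Toy mutation library: (gene, amino-acid change) -> drugs it predicts resistance to.
# The real library holds more than 1,300 mutations across 15 drugs.
TOY_LIBRARY = {
    ("rpoB", "p.Ser450Leu"): ["rifampicin"],
    ("katG", "p.Ser315Thr"): ["isoniazid"],
    ("gyrA", "p.Asp94Gly"): ["moxifloxacin", "ofloxacin"],
}

def profile_variants(variants, library=TOY_LIBRARY):
    """Split a sample's variants into library hits and other candidate-gene variants."""
    candidate_genes = {gene for gene, _ in library}
    resistance = {}          # drug -> list of supporting mutations
    other_candidate = []     # variants in candidate genes that are not in the library
    for gene, change in variants:
        drugs = library.get((gene, change))
        if drugs:
            for drug in drugs:
                resistance.setdefault(drug, []).append(f"{gene} {change}")
        elif gene in candidate_genes:
            other_candidate.append(f"{gene} {change}")
    return resistance, other_candidate

def is_mdr(resistance):
    """MDR-TB is resistance to at least isoniazid and rifampicin."""
    return "isoniazid" in resistance and "rifampicin" in resistance

# Hypothetical sample: two library hits, one uncatalogued candidate-gene variant,
# and one variant outside the candidate genes (ignored by this sketch).
sample = [
    ("rpoB", "p.Ser450Leu"),
    ("katG", "p.Ser315Thr"),
    ("katG", "p.Ala110Val"),
    ("dnaA", "p.Gly2Asp"),
]
resistance, other_candidate = profile_variants(sample)
mdr = is_mdr(resistance)
```

Keeping the uncatalogued candidate-gene variants in a separate list mirrors the text's point that the tool reports other mutations in the candidate regions alongside the known resistance-conferring ones.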
People who worked on this project:
gnapier
jphelan
Global Drug resistance study
This study is an academic collaboration, bringing together expertise and knowledge from 18 laboratories around the world to form the GDRS partnership. The study is coordinated by Taane Clark at the London School of Hygiene & Tropical Medicine (LSHTM). The GDRS is performing whole genome sequencing on a large collection of Mycobacterium tuberculosis isolates and analysing the resulting sequence data alongside drug susceptibility testing data. Sequencing of the first 1500 isolates is complete and includes strains that are drug susceptible, mono-resistant, MDR and XDR. Data on more than 6,000 isolates in the public domain and from other collaborations have also been analysed.
People who worked on this project:
jphelan
tgclark
Kampala Study
In collaboration with the Wellcome Trust Sanger Institute, Makerere University Medical School and the TB lab at the Joint Clinical Research Centre, Uganda, we have analysed the genomes of 51 clinical isolates of M. tuberculosis. The sample includes drug-resistant strains and longitudinal samples from individual patients. The isolates were collected during the project "Strategies for the management of multi-drug resistant tuberculosis in Kampala, Uganda", funded by the Wellcome Trust. It was a four-year project and a collaboration of scientists from Uganda, the UK and the USA.
People who worked on this project:
tgclark
Karonga Diversity Study
There are few population-based studies in high prevalence areas that can apply long-term, large-scale whole genome sequencing. It is more challenging to interpret transmission networks when there are many possible sources of infection, yet understanding transmission in these high prevalence areas would have the greatest public health benefit. The Karonga Prevention Study (KPS) in northern Malawi has been conducting research on mycobacterial infections in the region since the 1980s, where the incidence of new smear-positive TB is around 100/100,000 and HIV prevalence is 10%. Currently, we have over 2,000 Mycobacterium tuberculosis whole genome sequences and substantial metadata available, including HIV status, household membership, contact histories and GPS data, which can be used to explicitly model important questions associated with transmission, including the role of HIV and of M. tuberculosis (sub)lineage on transmissibility. We have published initial analyses from this area (Guerra-Assuncão, et al. 2015) showing decreasing transmission over time and variation between M. tuberculosis lineages 1–4, which are not confounded by host differences. By applying more advanced computational techniques to time-labelled phylogeny and transmission chain construction it will be possible to gain greater insights into molecular evolution and multiple outbreaks, including the estimation of between-patient mutation rate and ages of mutations, as well as identifying potential significant variants associated with transmission.
People who worked on this project:
tgclark
MTB Sequencing Variability
To assess the analytical robustness of individual Mycobacterium tuberculosis DNA samples, we sequenced replicates from the well-characterised reference strain (H37Rv) and clinical isolates with resistance to up to 13 drugs. Sequencing was performed using the Illumina MiSeq and Ion Torrent PGM™ sequencing platforms. Two in-silico resistance calling pipelines were used to generate profiles. Results were compared to phenotypic drug susceptibility testing. The MiSeq and Ion PGM systems accurately predicted drug resistance profiles and there was high reproducibility between biological and technical sample replicates. MiSeq provided superior coverage in GC-rich regions, which translated into incremental detection of putative genotypic drug-specific resistance, including for resistance to para-aminosalicylic acid (PAS) and pyrazinamide. The TB-Profiler bioinformatics pipeline was concordant with reported phenotypic susceptibility for all drugs tested except pyrazinamide and PAS, with an overall concordance of 95.3%. In summary, we have demonstrated high comparative reproducibility of two sequencing platforms. However, platform-specific variability in coverage of some genome regions may have implications for predicting resistance to specific drugs. These findings may have implications for future clinical practice and thus deserve further scrutiny, set within larger studies and using updated mutation libraries.
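As a rough illustration of how concordance between genotypic predictions and phenotypic drug susceptibility testing might be tallied, the sketch below counts agreeing calls per drug and overall. The function name, input layout and toy calls are assumptions made for this example and are not the study's data or code.

```python
from collections import defaultdict

def concordance(calls):
    """Return (per-drug concordance, overall concordance).

    calls: iterable of (drug, genotypic_resistant, phenotypic_resistant) tuples,
    where the last two items are booleans.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for drug, genotypic, phenotypic in calls:
        total[drug] += 1
        agree[drug] += int(genotypic == phenotypic)
    if not total:
        raise ValueError("no calls supplied")
    per_drug = {drug: agree[drug] / total[drug] for drug in total}
    overall = sum(agree.values()) / sum(total.values())
    return per_drug, overall

# Toy calls (invented for illustration): one discordant pyrazinamide result.
toy_calls = [
    ("isoniazid", True, True),
    ("isoniazid", False, False),
    ("rifampicin", True, True),
    ("rifampicin", False, False),
    ("pyrazinamide", False, True),
    ("pyrazinamide", True, True),
]
per_drug, overall = concordance(toy_calls)
```

Reporting the per-drug values alongside the overall figure makes it easy to see when one or two drugs account for most of the discordance.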
People who worked on this project:
jphelan
MTBrowse
MTBrowse is an analysis webpage for Mycobacterium tuberculosis RNA sequencing data. Sourcing data directly from the National Center for Biotechnology Information (NCBI), our platform provides an analytical toolset for dissecting, visualising, and understanding RNA-Seq data, with features ranging from detailed plots and graphs to intuitive analytics. The web-based tool offers numerous visualisation options, including the ability to view differential expression or discern differences between lineages or drug resistance profiles via box plots and scatter plots.
People who worked on this project:
jthorpe
nbillows
PE/PPE Genes
Approximately 10% of the Mycobacterium tuberculosis genome is made up of two families of genes that are poorly characterized due to their high GC content and highly repetitive nature. The PE and PPE families are typified by their highly conserved N-terminal domains that incorporate proline-glutamate (PE) and proline-proline-glutamate (PPE) signature motifs. They are hypothesised to be important virulence factors involved with host-pathogen interactions, but their high genetic variability and complexity of analysis means they are typically disregarded in genome studies. To elucidate the structure of these genes, 518 genomes from a diverse international collection of clinical isolates were de novo assembled. A further 21 reference M. tuberculosis complex genomes and long read sequence data were used to validate the approach. SNP analysis revealed that variation in the majority of the 168 pe/ppe genes studied was consistent with lineage. Several recombination hotspots were identified, notably pe_pgrs3 and pe_pgrs17. Evidence of positive selection was revealed in 65 pe/ppe genes, including epitopes potentially binding to major histocompatibility complex molecules. This, the first comprehensive study of the pe and ppe genes, provides important insight into M. tuberculosis diversity and has significant implications for vaccine development.
People who worked on this project:
jphelan
Poly-TB
Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), is the second major cause of death from an infectious disease worldwide. Recent advances in DNA sequencing are leading to the ability to generate whole genome information in clinical isolates of M. tuberculosis complex (MTBC). The identification of informative genetic variants such as phylogenetic markers and those associated with drug resistance or virulence will help barcode Mtb in the context of epidemiological, diagnostic and clinical studies. Mtb genomic datasets are increasingly available as raw sequences, which are potentially difficult and computationally intensive to process and to compare across studies. Here we have processed the raw sequence data (>1500 isolates, eight studies) to compile a catalogue of SNPs (n = 74,039, 63% non-synonymous, 51.1% in more than one isolate, i.e. non-private), small indels (n = 4810) and larger structural variants (n = 800). We have developed the PolyTB web-based tool (http://pathogenseq.lshtm.ac.uk/polytb) to visualise the resulting variation and important metadata (e.g. in silico inferred strain-types, location) within geographical map and phylogenetic views. This resource will allow researchers to identify polymorphisms within candidate genes of interest, as well as examine the genomic diversity and distribution of strains. PolyTB source code is freely available to researchers wishing to develop similar tools for their pathogen of interest.
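The private versus non-private distinction used in the catalogue summary can be made concrete with a short sketch: a SNP is private if it is seen in exactly one isolate and non-private if it is seen in more than one. The function name, isolate labels and genome positions below are hypothetical and do not reflect PolyTB's data model.

```python
from collections import Counter

def classify_snps(isolate_snps):
    """Split SNPs into private (one isolate) and non-private (more than one).

    isolate_snps: dict mapping isolate ID -> iterable of SNP identifiers
    (here, genome positions).
    """
    counts = Counter()
    for snps in isolate_snps.values():
        counts.update(set(snps))  # count each SNP at most once per isolate
    private = {snp for snp, n in counts.items() if n == 1}
    non_private = {snp for snp, n in counts.items() if n > 1}
    return private, non_private

# Toy input: three hypothetical isolates with made-up SNP positions.
toy_isolates = {
    "isolate_A": [1200, 5400, 9100],
    "isolate_B": [1200, 7700],
    "isolate_C": [1200, 5400],
}
private, non_private = classify_snps(toy_isolates)
non_private_fraction = len(non_private) / (len(private) + len(non_private))
```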
People who worked on this project:
fcoll
Philippines TB Drug Resistance Study
The Philippines is a high-burden country for both TB and multidrug-resistant TB (MDR-TB) (WHO, 2015). Whole genome sequencing of M. tuberculosis can be used to characterise strain-types linked to virulence, identify drug resistance mutations, assist clinical treatment decision-making, and establish who may have transmitted to whom, allowing resources to be targeted at hotspot areas to reduce transmission. The UK-Philippine collaboration proposes to characterise M. tuberculosis samples from subjects from the recently completed Philippine-wide 2nd DR Survey (n > 1,000). M. tuberculosis strains at the Research Institute for Tropical Medicine's National Tuberculosis Reference Laboratory (RITM-NTRL, Philippines) will be sequenced with the assistance of the Philippines Genome Centre (PGC) using cutting-edge whole genome sequencing technology. This project aims to determine the genomic diversity of M. tuberculosis in the Philippines.
People who worked on this project:
tgclark
TB-ML
Motivation: Machine learning (ML) has shown impressive performance in predicting antimicrobial resistance (AMR) from sequence data, including for Mycobacterium tuberculosis, the causative agent of tuberculosis. However, current ML development and publication practices make it difficult for researchers and clinicians to use, test or reproduce published models.
Results: We packaged a number of published and unpublished ML models for predicting AMR of M. tuberculosis into Docker containers. Similarly, the pipelines required for pre-processing genomic data into the formats required by the models were also packaged into separate containers. By following a minimal container I/O standard, we ensured as much interoperability as possible. We also created a command-line application, TB-ML, which can be used to easily combine pre-processing and prediction containers into complete pipelines ready for predicting resistance from novel, raw data with a single command. As long as there is adherence to this minimal standard for the container interface, containers produced by researchers holding new models can likewise be included in these pipelines, making benchmark comparisons of different models simple and facilitating faster uptake in the clinic.
People who worked on this project:
jphelan
lwang