Cancer is caused by the accumulation of somatic mutations, some of which drive the disease’s progression (drivers) while others are functionally neutral (passengers). Although several methods have been developed to distinguish between the two classes of mutations, very few have used the neighbourhood nucleotide sequences as potential discriminating features. In this study, we use Natural Language Processing and AI techniques to show that the sequence neighbourhood of driver mutations differs significantly from that of passengers. We further develop a novel machine learning tool, NBDriver, which is highly efficient at identifying pathogenic variants in multiple independent test datasets. Efficient and accurate identification of novel pathogenic variants from sequenced cancer genomes would help facilitate more effective therapies tailored to patients’ mutational profiles. To cater to a wider audience, I will also discuss how one can use AI and machine learning to analyze large unstructured biological datasets, with a specific focus on some of the omics-based computational projects currently under way in our lab.
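For readers unfamiliar with the idea of treating a mutation's sequence neighbourhood as text, here is a minimal sketch of one way it could be done: take a fixed window of bases on either side of the mutated position and count overlapping k-mers as if they were words. This illustrates the general idea only and is not NBDriver's actual feature pipeline. The function names, the toy reference sequence, the window size and the value of k are all assumptions made for the example.

```python
from collections import Counter


def neighbourhood(sequence, position, window):
    """Return the bases flanking `position` (0-based), excluding the mutated base.

    Hypothetical helper: the window is truncated at the ends of the sequence.
    """
    start = max(0, position - window)
    left = sequence[start:position]
    right = sequence[position + 1:position + 1 + window]
    return left, right


def kmer_counts(fragment, k):
    """Count overlapping k-mers in `fragment`, treating each k-mer as a 'word'."""
    return Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))


# Toy reference sequence (made up for illustration, not real genomic data).
reference = "ACGTACGTTAGC"

# Assume a mutation at 0-based position 5 and a window of 4 bases per side.
left, right = neighbourhood(reference, position=5, window=4)

# Bag-of-k-mers feature vector for the neighbourhood (k = 2 here).
features = kmer_counts(left, 2) + kmer_counts(right, 2)
print(left, right, dict(features))
```

Count vectors of this kind, computed for known drivers and passengers, could then be passed to any standard classifier. The choice of window size and k would have to be tuned on real data.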
I recently completed my master's in computational biology from IIT Madras. I have always been fascinated by how one can use computational modelling of genomic data to understand the underlying biology of human diseases. I am also a huge fan of community-driven open-source bioinformatics software and have participated in several hackathons and open-source summer projects (such as the Google Summer of Code 2019).