publications | Mihir Bafna

Periodically updated from Google Scholar.

(†=equal contribution)

2026

ICLR
Learning residue level protein dynamics with multiscale Gaussians

Mihir Bafna, Bowen Jing, and Bonnie Berger

International Conference on Learning Representations 2026

Many methods have been developed to predict static protein structures; however, understanding the dynamics of protein structure is essential for elucidating biological function. While molecular dynamics (MD) simulations remain the in silico gold standard, their high computational cost limits scalability. We present DynaProt, a lightweight, SE(3)-invariant framework that predicts rich descriptors of protein dynamics directly from static structures. By casting the problem through the lens of multivariate Gaussians, DynaProt estimates dynamics at two complementary scales: (1) per-residue marginal anisotropy as 3×3 covariance matrices capturing local flexibility, and (2) joint scalar covariances encoding pairwise dynamic coupling across residues. From these dynamics outputs, DynaProt achieves high accuracy in predicting residue-level flexibility (RMSF) and, remarkably, enables reasonable reconstruction of the full covariance matrix for fast ensemble generation. Notably, it does so using orders of magnitude fewer parameters than prior methods. Our results highlight the potential of direct protein dynamics prediction as a scalable alternative to existing methods.
@article{bafna2025learningresiduelevelprotein,
  title = {Learning residue level protein dynamics with multiscale Gaussians},
  author = {Bafna, Mihir and Jing, Bowen and Berger, Bonnie},
  year = {2026},
  eprint = {2509.01038},
  archiveprefix = {arXiv},
  primaryclass = {q-bio.BM},
  journal = {International Conference on Learning Representations},
  short = {ICLR},
  shorttitle = {ICLR}
}

2025

MLSB

Oral

SwitchCraft: Programmatic Design of State-Switching Proteins

Bowen Jing†, Mihir Bafna†, Adam Klivans, and Bonnie Berger

Machine Learning for Structural Biology Dec 2025

Complex multistate functional mechanisms are observed ubiquitously in natural proteins, yet the path towards systematic de novo rational design of such mechanisms remains unclear, despite significant advancements in protein language models and structure diffusion models. We introduce SwitchCraft, a versatile and programmatic framework for state-switching proteins based on backpropagation through compositional design constraints parameterized by structure prediction models. Our in silico evaluations demonstrate success on a wide range of state-switching design specifications, from allosteric regulation of functional motifs to discrimination of bound ligand identities. Notably, one design exhibits a 3.8 Å conformational change upon oxygenation of heme, mimicking mechanisms of cooperativity in hemoglobin. These results position SwitchCraft at the inception of a powerful paradigm for higher-order functional protein design.
biorxiv
Generating functional and multistate proteins with a multimodal diffusion transformer

Bowen Jing†, Anna Sappington†, Mihir Bafna, Ravi Shah, Adrina Tang, Rohith Krishna, Adam Klivans, Daniel J Diaz, and Bonnie Berger

bioRxiv Sep 2025

Generating proteins with the full diversity and complexity of functions found in nature is a grand challenge in protein design. Here, we present ProDiT, a multimodal diffusion model that unifies sequence and structure modeling paradigms to enable the design of functional proteins at scale. Trained on sequences, 3D structures, and annotations for 214M proteins across the evolutionary landscape, ProDiT generates diverse, novel proteins that preserve known active and binding site motifs and can be successfully conditioned on a wide range of molecular functions, spanning 465 Gene Ontology terms. We introduce a diffusion sampling protocol to design proteins with multiple functional states, and demonstrate this protocol by scaffolding active sites from enzymes such as carbonic anhydrase and lysozyme to be allosterically deactivated by a calcium effector. Our results showcase ProDiT's unique capacity to satisfy design specifications inaccessible to existing generative models, thereby expanding the protein design toolkit.
@article{jing2025generating,
  title = {Generating functional and multistate proteins with a multimodal diffusion transformer},
  author = {Jing, Bowen and Sappington, Anna and Bafna, Mihir and Shah, Ravi and Tang, Adrina and Krishna, Rohith and Klivans, Adam and Diaz, Daniel J and Berger, Bonnie},
  equal = {Jing, Bowen; Sappington, Anna},
  journal = {bioRxiv},
  year = {2025},
  month = sep,
  publisher = {Cold Spring Harbor Laboratory},
  short = {biorxiv},
  shorttitle = {biorxiv},
  biorxiv = {10.1101/2025.09.03.672144v2}
}
PNAS
Sparse autoencoders uncover biologically interpretable features in protein language model representations

Onkar Gujral, Mihir Bafna, Eric Alm, and Bonnie Berger

Proceedings of the National Academy of Sciences Sep 2025

Foundation models in biology—particularly protein language models (PLMs)—have enabled ground-breaking predictions in protein structure, function, and beyond. However, the “black-box” nature of these representations limits transparency and explainability, posing challenges for human–AI collaboration and leaving open questions about their human-interpretable features. Here, we leverage sparse autoencoders (SAEs) and a variant, transcoders, from natural language processing to extract, in a completely unsupervised fashion, interpretable sparse features present in both protein-level and amino acid (AA)-level representations from ESM2, a popular PLM. Unlike other approaches such as training probes for features, the extraction of features by the SAE is performed without any supervision. We find that many sparse features extracted from SAEs trained on protein-level representations are tightly associated with Gene Ontology (GO) terms across all levels of the GO hierarchy. We also use Anthropic’s Claude to automate the interpretation of sparse features for both protein-level and AA-level representations and find that many of these features correspond to specific protein families and functions such as the NAD Kinase, IUNH, and the PTH family, as well as proteins involved in methyltransferase activity and in olfactory and gustatory sensory perception. We show that sparse features are more interpretable than ESM2 neurons across all our trained SAEs and transcoders. These findings demonstrate that SAEs offer a promising unsupervised approach for disentangling biologically relevant information present in PLM representations, thus aiding interpretability. This work opens the door to safety, trust, and explainability of PLMs and their applications, and paves the way to extracting meaningful biological insights across increasingly powerful models in the life sciences.
@article{gujral2025sparse,
  title = {Sparse autoencoders uncover biologically interpretable features in protein language model representations},
  author = {Gujral, Onkar and Bafna, Mihir and Alm, Eric and Berger, Bonnie},
  journal = {Proceedings of the National Academy of Sciences},
  volume = {122},
  number = {34},
  pages = {e2506316122},
  year = {2025},
  publisher = {National Academy of Sciences},
  short = {PNAS},
  shorttitle = {PNAS}
}
biorxiv
DANGO: Predicting Higher-Order Genetic Interactions

Ruochi Zhang, Mihir Bafna, Jianzhu Ma, and Jian Ma

bioRxiv Jan 2025

Higher-order genetic interactions, which have profound implications for understanding the molecular mechanisms of phenotypic variation, remain poorly characterized. Most studies to date have focused on pairwise interactions, as designing high-throughput experimental screenings for the vast combinatorial search space of higher-order molecular interactions is dauntingly challenging. Here, we develop DANGO, a computational method based on a self-attention hypergraph neural network, designed to effectively predict higher-order genetic interaction among groups of genes. As a proof-of-concept, we provide comprehensive predictions for over 400 million trigenic interactions in the yeast S. cerevisiae, significantly expanding the quantitative characterization of such interactions. Our results demonstrate that DANGO accurately predicts trigenic interactions, uncovering both known and novel biological functions related to cell growth. We further incorporate protein embeddings and model uncertainty scoring to enhance the biological relevance and interpretability of the predicted interactions. Moreover, the predicted interactions can serve as powerful genetic markers for growth response under diverse conditions. Together, DANGO enables a more complete map of complex genetic interactions that impinge upon phenotypic diversity. Competing Interest Statement: The authors have declared no competing interest.
@article{Zhang2020.11.26.400739,
  title = {DANGO: Predicting Higher-Order Genetic Interactions},
  author = {Zhang, Ruochi and Bafna, Mihir and Ma, Jianzhu and Ma, Jian},
  month = jan,
  year = {2025},
  journal = {biorxiv},
  biorxiv = {10.1101/2020.11.26.400739v2},
  publisher = {bioRxiv},
  doi = {10.1101/2020.11.26.400739},
  elocation-id = {2020.11.26.400739},
  short = {biorxiv},
  shorttitle = {biorxiv}
}

2023

MLSB

DiffRNAFold: Generating RNA Tertiary Structures with Latent Space Diffusion

Mihir Bafna, Vikranth Keerthipati, Subhash Kanaparthi, and Ruochi Zhang

NeurIPS Workshop on Machine Learning for Structural Biology Dec 2023

RNA molecules provide an exciting frontier for novel therapeutics. Accurate determination of RNA structure could accelerate development of therapeutics through an improved understanding of function. However, the extremely large conformation space has kept the RNA 3D structure space largely unresolved. Using recent advances in generative modeling, we propose DiffRNAFold, a latent space diffusion model for RNA tertiary structure design. Our preliminary results suggest that DiffRNAFold-generated molecules are similar in 3D space to true RNA molecules, providing an important first step towards accurate structure and function prediction in vivo.
ISMB
CLARIFY: Cell-cell interaction and gene regulatory network refinement from spatially resolved transcriptomics

Mihir Bafna, Hechen Li, and Xiuwei Zhang

ISMB and Bioinformatics Jun 2023

Motivation: Gene regulatory networks (GRNs) in a cell provide the tight feedback needed to synchronize cell actions. However, genes in a cell also take input from, and provide signals to, other neighboring cells. These cell-cell interactions (CCIs) and the GRNs deeply influence each other. Many computational methods have been developed for GRN inference in cells. More recently, methods were proposed to infer CCIs using single cell gene expression data with or without cell spatial location information. However, in reality, the two processes do not exist in isolation and are subject to spatial constraints. Despite this rationale, no methods currently exist to infer GRNs and CCIs using the same model. Results: We propose CLARIFY, a tool that takes GRNs as input and uses them together with spatially resolved gene expression data to infer CCIs, while simultaneously outputting refined cell-specific GRNs. CLARIFY uses a novel multi-level graph autoencoder, which mimics cellular networks at a higher level and cell-specific GRNs at a deeper level. We applied CLARIFY to two real spatial transcriptomic datasets, one using seqFISH and the other using MERFISH, and also tested on simulated datasets from scMultiSim. We compared the quality of predicted GRNs and CCIs with state-of-the-art baseline methods that inferred either only GRNs or only CCIs. The results show that CLARIFY consistently outperforms the baseline in terms of commonly used evaluation metrics. Our results point to the importance of co-inference of CCIs and GRNs and to the use of layered graph neural networks as an inference tool for biological networks.
@article{Clarify,
  author = {Bafna, Mihir and Li, Hechen and Zhang, Xiuwei},
  title = {CLARIFY: Cell-cell interaction and gene regulatory network refinement from spatially resolved transcriptomics},
  booktitle = {31st Conference on Intelligent Systems for Molecular Biology},
  short = {ISMB},
  year = {2023},
  journal = {Bioinformatics},
  volume = {39},
  number = {Supplement_1},
  pages = {i484-i493},
  month = jun,
  press = {https://www.cc.gatech.edu/news/award-winning-computer-models-propel-research-cellular-differentiation},
  publisher = {Oxford University Press},
  appear = {true}
}

2022

NSUR

Benchmarking and Refining Cell-Cell Interactions with Spatial Transcriptomics and Deep Learning

Mihir Bafna and Xiuwei Zhang

St. Jude National Symposium for Undergraduate Research Oct 2022

The (mal-)functioning of human tissues can be attributed to genes that are active (expressed) or repressed relative to expectations. New genomic technologies allow measurements not only at single cell (sc) resolution, but also retain information on the spatial location of the cell. These Spatial Transcriptomics (ST) technologies could revolutionize human health. For example, they offer an unprecedented look at the tumor microenvironment, revealing the infiltrating immune cells and their interactions with their cancerous counterparts. Many computational methods now analyze the complex, high dimensional ST data for inferring these cell-cell interactions (CCIs). However, the ST community lacks a centralized ground truth to holistically evaluate these tools. Here, (a) we systematically benchmark existing methods and (b) suggest a deep learning method for refining ST-CCI prediction. We evaluated 7 methods, including CellPhoneDB and DeepLinc, on 10 simulated datasets at 5 noise levels as well as 2 real datasets generated using SeqFISH and MerFISH. CellPhoneDB achieved an average precision/recall of 0.79/0.75 respectively, with the recall reducing to 0.65 for certain datasets. DeepLinc only achieved 0.68 prediction accuracy on SeqFISH data after being trained with labeled data. Additionally, the ROC curves were surprisingly linear, suggesting that increasing true positive rate comes only with increasing false positive rate. All other methods resulted in similar performance and the same pitfalls: failing to properly utilize either the spatial information or downstream gene regulatory interactions (GRN), thus increasing the false positive interactions. Our work provides, for the first time, a curated data resource for future tool comparisons and a systematic analysis of the shortcomings of existing methods. 
Lastly, to address these drawbacks, we deployed a preliminary version of a subgraph neural network, where we represent each cell by a subgraph of its underlying GRN, obtaining lower-dimensional representations of each cell with GRN information embedded. These embeddings incorporate gene expression, spatial information, and GRN activity thereby allowing us to refine ST-CCI inference. This has myriad applications like revealing interactions between co-located immune cells and tumor cells, addressing the central biological problem of why certain tumors are “immunologically hot” and respond better to immuno-oncotherapy.
ACM-BCB
DeepViFi: Detecting Oncoviral Infections in Cancer Genomes Using Transformers

Utkrisht Rajkumar, Sara Javadzadeh, Mihir Bafna, Dongxia Wu, Rose Yu, Jingbo Shang, and Vineet Bafna

Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics Oct 2022

We consider the problem of identifying viral reads in human host genome data. We pose the problem as open-set classification as reads can originate from unknown sources such as bacterial and fungal genomes. Sequence-matching methods have low sensitivity in recognizing viral reads when the viral family is highly diverged. Hidden Markov models have higher sensitivity but require domain-specific training and are difficult to repurpose for identifying different viral families. Supervised learning methods can be trained with little domain-specific knowledge but have reduced sensitivity in open-set scenarios. We present DeepViFi, a transformer-based pipeline, to detect viral reads in short-read whole genome sequence data. At 90% precision, DeepViFi achieves 90% recall compared to 15% for other deep learning methods. DeepViFi provides a semi-supervised framework to learn representations of viral families without domain-specific knowledge, and rapidly and accurately identify target sequences in open-set settings.
@inproceedings{DeepVifi,
  author = {Rajkumar, Utkrisht and Javadzadeh, Sara and Bafna, Mihir and Wu, Dongxia and Yu, Rose and Shang, Jingbo and Bafna, Vineet},
  title = {DeepViFi: Detecting Oncoviral Infections in Cancer Genomes Using Transformers},
  year = {2022},
  isbn = {9781450393867},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3535508.3545551},
  doi = {10.1145/3535508.3545551},
  booktitle = {Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics},
  shorttitle = {ACM-BCB},
  short = {ACM-BCB},
  articleno = {2},
  numpages = {8},
  keywords = {neural networks, viral detection, natural language processing, open-set classification},
  location = {Northbrook, Illinois},
  series = {BCB '22}
}

2021

Patent

Computer-implemented methods for quantitation of features of interest in whole slide imaging

Nam Nguyen, Lorena Mora-Blanco, Kristen Turner, Julie Weise, Jason Christiansen, and Mihir Bafna

Provisional patent. PCT/US2021/022308 Mar 2021

@patent{MetaDetect,
  title = {Computer-implemented methods for quantitation of features of interest in
  whole slide imaging},
  author = {Nguyen, Nam and Mora-Blanco, Lorena and Turner, Kristen and Weise, Julie and Christiansen, Jason and Bafna, Mihir},
  year = {2021},
  number = {022308},
  filing = {Provisional patent. PCT/US2021/022308},
  month = mar,
  notes = {filed March 15, 2021},
  abs = {},
  short = {Patent}
}