
Confusion matrices comparing cell-type classification performance across four models on human immune cell data. GREmLN (top left) shows the darkest diagonal, indicating superior accuracy in correctly identifying cell types compared to scGPT, Geneformer and scFoundation. For more, see the full bioRxiv paper.
A new foundation model called GREmLN, from a Columbia and Chan Zuckerberg Biohub team, delivers superior cell-type classification with only 10.3 million parameters, outpacing rivals like the 100-million-parameter scFoundation. Released July 9 on bioRxiv, it taps gene regulatory networks to achieve a 0.929 macro F1 score on immune cell data.
“Instead of using large language models, which are based on sequential data, we had to solve some very complicated math to extend the concept to what we call a large graph model,” explained Andrea Califano, Ph.D., president of the Chan Zuckerberg Biohub New York and the paper’s senior author. “In a cell, there is no sequence,” he said. “Gene number one and gene number two are not related in any inherent order. The order is created by the graph-like structure of how gene products regulate each other.”

Andrea Califano, Ph.D.
The GREmLN paper reports superior performance relative to established foundation models in cell-type classification. For human immune cells, GREmLN achieved a macro F1 score of 0.929, outperforming scGPT, the influential 33-million-parameter model from the lab of Bo Wang, Ph.D., published in Nature Methods in 2024 (0.924±0.002); Geneformer, the 30-million-parameter transfer learning model from Christina Theodoris, M.D., Ph.D. et al. published in Nature in 2023 (0.792); and scFoundation, the 100-million-parameter model published in Nature Methods in 2024 (0.879).
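For readers unfamiliar with the metric, macro F1 is the unweighted average of per-class F1 scores, so rare cell types count as much as common ones. The sketch below shows how such a score could be derived from a confusion matrix like those in the figure. The function name `macro_f1` and the toy two-class matrix are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def macro_f1(confusion):
    """Unweighted mean of per-class F1 scores.

    Assumes rows are true labels and columns are predicted labels.
    """
    cm = np.asarray(confusion, dtype=float)
    tp = np.diag(cm)                 # correct calls per class
    fp = cm.sum(axis=0) - tp         # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp         # members of the class that were missed
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); classes with no samples and no calls score 0
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(f1.mean())

# Toy matrix (made up for illustration): 6 cells of type A, 6 of type B
toy = [[5, 1],
       [2, 4]]
score = macro_f1(toy)
```

Because every class contributes equally to the average, a model cannot reach a high macro F1 by excelling only on the most abundant cell types.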
The model also demonstrated a knack for reconstructing gene expression, with R² scores clocking in at 0.883 on immune cells and 0.861 on cancer-infiltrating myeloid cells. “This is critical because otherwise you need a huge amount of data and you need a huge amount of computational resources to train the model,” Califano said.
GREmLN’s approach builds on decades of research into “master regulators”: hub proteins that integrate the effects of diverse mutations. “For the last 20 years, the mantra in oncology has been to target specific mutations, an approach that has not worked very well. Only about 11% of cancer patients benefit, and often the benefit is short-lived,” he noted. “The reason is that every cell in a tumor has a different set of mutations. If you target one mutation, you only kill the cells that depend on it.”
Targeting the hub, not the spokes
His solution involves finding the cellular equivalent of a telephone exchange: “Think of it like a telephone exchange where all calls go through the same hub. Instead of figuring out which individual conversation overloaded the system, we find the hub and fix the problem there. Those hubs are the master regulators: the proteins that integrate the effects of all the mutations. By targeting a small number of these proteins, typically about ten, we can address a cancer problem caused by countless different mutational patterns.”
Under the hood: The GREmLN blueprint
The model’s power comes from incorporating gene regulatory networks generated by the ARACNe algorithm. ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) analyzes patterns in gene expression data to identify which genes control others. It does this by measuring how strongly genes are co-expressed (mutual information), using statistical sampling (bootstrapping) to ensure reliability, and removing spurious connections. These networks have been effective in shedding light on master regulator proteins and determining their sensitivity to small molecules, including use in clinical trials.
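The three steps described above (mutual information, bootstrapping, pruning) can be illustrated with a toy sketch. This is not the ARACNe implementation: it assumes a simple histogram estimate of mutual information, a fixed threshold in place of a statistical null model, and the data processing inequality as the pruning rule. All function names and parameter values are hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of mutual information, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def mi_matrix(expr, bins=8):
    """Pairwise mutual information for a cells x genes matrix."""
    n_genes = expr.shape[1]
    mi = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            mi[i, j] = mi[j, i] = mutual_information(expr[:, i], expr[:, j], bins)
    return mi

def dpi_prune(mi, adj):
    """In each fully connected gene triplet, drop the weakest edge."""
    keep = adj.copy()
    n = mi.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if not adj[i, j]:
                continue
            for k in range(n):
                if k in (i, j) or not (adj[i, k] and adj[j, k]):
                    continue
                if mi[i, j] < min(mi[i, k], mi[j, k]):
                    keep[i, j] = keep[j, i] = False
                    break
    return keep

def sketch_network(expr, mi_threshold=0.1, n_bootstraps=20,
                   min_support=0.8, seed=0):
    """Keep edges that survive pruning in most bootstrap resamples of cells."""
    rng = np.random.default_rng(seed)
    n_cells, n_genes = expr.shape
    support = np.zeros((n_genes, n_genes))
    for _ in range(n_bootstraps):
        sample = expr[rng.integers(0, n_cells, size=n_cells)]
        mi = mi_matrix(sample)
        adj = mi > mi_threshold          # assumed fixed cutoff, not a null model
        np.fill_diagonal(adj, False)
        support += dpi_prune(mi, adj)
    return support / n_bootstraps >= min_support
```

On synthetic data where gene A drives B and B drives C, the indirect A-to-C link is the weakest edge of the triangle, so the pruning step is designed to remove it while keeping the two direct links. Calling `sketch_network` on a cells-by-genes NumPy array returns a boolean adjacency matrix.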
Training GREmLN required just one full epoch on eight Nvidia H100 80GB GPUs running in parallel. Although that hardware is powerful, the training run is modest compared with typical foundation models, which often require weeks of training. The pre-training dataset consisted of 11 million scRNA-seq (single-cell RNA sequencing) profiles spanning 19,000 genes from healthy human cells, sourced from the CELLxGENE dataset and covering 162 cell types from various tissues. Each profile represents a snapshot of which genes are active in an individual cell, providing a large atlas of cellular states across the human body.
Looking ahead, Califano’s team plans to enhance GREmLN with sizable perturbational datasets unique to CZ Biohub NY. Those will include profiles where individual regulatory genes have been systematically silenced across millions of cells. “For this first iteration, we used the same data as other models for a fair comparison,” Califano noted. “But moving forward, we will use massive amounts of perturbational data we are generating in-house.” This approach could refine identification of master regulators, building on Califano’s prior work that has already informed successful cancer trials.
The Chan Zuckerberg Initiative has set an ambitious goal: to cure, prevent or manage all diseases by the end of this century. Califano’s vision for supporting that goal aligns with CZI’s framework-based approach. “The way I interpret ‘curing all diseases’ is not that we will develop a specific drug for every one of the 20,000+ rare genetic diseases. Instead, we will create the framework that allows us to solve these problems.” This framework approach represents what he calls “bucketization,” finding universal foundational elements rather than treating each disease as unique.
This work also dovetails with the Chan Zuckerberg Initiative’s work to build AI-powered virtual cells that can predict cellular behavior. Current AI models treat genes in cells like words in a sentence, but genes don’t have a natural order; they’re more like a network of interconnected components. “The cell is literally like a computer,” Califano explained. “If you can figure out its logic, the network of molecular interactions determining its behavior, you can predict what it will do in response to a perturbation with dramatic accuracy.” With GREmLN now available on CZI’s virtual cell platform, researchers can begin to decode these cellular circuits, moving beyond simply reading the genetic code to understanding its logic.



