ProteinBindingGNN (44M Parameters)
ProteinBindingGNN is a 44-million-parameter hybrid Graph Neural Network (GNN) and Transformer architecture designed for node-level protein binding site prediction. It processes both the 3D geometric structure of a protein and its biochemical sequence context (via ESM-2 embeddings) to predict the probability that each amino acid residue belongs to a binding pocket.
The model combines the local geometric awareness of Equivariant Graph Neural Networks (EGNNs) with the global, pairwise reasoning capabilities of Evoformer blocks (inspired by AlphaFold 2), executed through a multi-step recycling mechanism.
Usage
You can find everything related to this model at GraphBind
🧬 Model Architecture
The architecture is explicitly designed to balance rich feature preservation (high hidden dimensionality) with controlled reasoning depth to prevent over-smoothing on 3D protein graphs.
- Parameters: ~44.1 Million
- Local Geometry: 3 EGNN Layers (kept shallow to limit message-passing over-smoothing; each residue aggregates only its 3-hop micro-environment).
- Global Reasoning: 8 Evoformer Blocks with Coordinate-Aware Edge Updates.
- Recycling: 5 iterations (the representations and coordinates are iteratively refined).
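To make the "3-hop micro-environment" point concrete, here is a minimal sketch of how the number of message-passing layers bounds what a residue can see. The toy chain graph and the `receptive_field` helper are hypothetical and for illustration only; how the model actually builds its residue graph is not specified in this card.

```python
from collections import deque

def receptive_field(adjacency, source, num_layers):
    """Return the nodes whose features can influence `source` after
    `num_layers` rounds of message passing (nodes within that many hops)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == num_layers:
            continue  # deeper neighbors are out of reach for this many layers
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Hypothetical toy graph: a chain of 8 residues linked to their sequence neighbors.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
field = receptive_field(chain, source=0, num_layers=3)
print(sorted(field))  # [0, 1, 2, 3]
```

With three local layers, anything farther than three hops away has to be reached through the global Evoformer blocks instead.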
Full Configuration
CONFIG = {
"node_input_dim": 1305, # Designed for pre-trained ESM-2 embeddings
"edge_input_dim": 4, # e.g., distances, bond types, angles
"hidden_dim": 512, # Preserves rich biochemical context
"num_egnn_layers": 3, # Local 3D message passing
"num_evoformer_blocks": 8, # Global pairwise attention
"num_heads": 16, # Attention heads (d_k = 32)
"dropout": 0.3, # High regularization for generalization
"update_coords": True, # Dynamic coordinate refinement
"num_recycles": 5, # AlphaFold 2 style recycling loop
"alpha": 0.3, # Edge update mixing coefficient
}
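Two of the values above can be sanity-checked with a short sketch. The per-head width follows directly from `hidden_dim` and `num_heads`. The `mix_edges` helper is an assumption: it shows one plausible reading of `alpha` as a convex blend between old and updated edge features, and the actual update rule may differ.

```python
# Subset of the configuration above, copied so this snippet runs on its own.
config = {"hidden_dim": 512, "num_heads": 16, "alpha": 0.3}

def head_dim(hidden_dim, num_heads):
    """Per-head width d_k; multi-head attention needs an even split."""
    if hidden_dim % num_heads != 0:
        raise ValueError("hidden_dim must be divisible by num_heads")
    return hidden_dim // num_heads

def mix_edges(old, update, alpha):
    """ASSUMPTION: convex blend of old and updated edge features.
    The real edge update rule is not given in this model card."""
    return [(1.0 - alpha) * o + alpha * u for o, u in zip(old, update)]

d_k = head_dim(config["hidden_dim"], config["num_heads"])
blended = mix_edges([1.0, 0.0], [2.0, 1.0], config["alpha"])
print(d_k)  # 32
```

Under this reading, a small `alpha` keeps most of the previous edge representation at each step, which would damp abrupt changes across recycling iterations.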
Performance & Training Details
This model was trained on a curated dataset of 6,700 protein complexes. The full training run and details are available on Weights & Biases.
Class Imbalance Handling
Protein binding datasets suffer from extreme class imbalance (typically ~5% of residues are true binders, 95% are non-binders). The model is optimized for high ranking capability, achieving:
- Validation AUROC: 0.8898
- Validation F1 (Standard Threshold): 0.3875. F1 is highly sensitive to the large negative class, so threshold tuning is recommended for downstream applications.
- Validation Recall: ~0.50 to 0.57
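The threshold tuning recommended above can be sketched as a simple grid search over probability cutoffs. This is a minimal, dependency-free illustration and not the project's evaluation code; the labels and scores are made-up values, and `f1_at_threshold` and `best_threshold` are hypothetical helper names.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 for the positive (binding) class at a given probability cutoff."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = prob >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, y_prob, candidates):
    """Return the (threshold, f1) pair with the highest F1 among candidates."""
    scored = ((t, f1_at_threshold(y_true, y_prob, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])

# Hypothetical validation residues: 2 binders among 10, scores for illustration only.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.55, 0.45, 0.80]
grid = [i / 20 for i in range(1, 20)]  # cutoffs 0.05, 0.10, ..., 0.95
threshold, f1 = best_threshold(labels, scores, grid)
print(threshold, f1)
```

On this toy data a cutoff below 0.5 recovers both binders at the cost of one false positive, which beats the F1 at 0.5. In practice the cutoff should be chosen on validation data and then held fixed when reporting test results.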
Future work
Due to limited GPU availability, the model was not trained long enough to reach the expected results. With future sponsorship, the plan is to continue training and fine-tuning.