ProteinBindingGNN (44M Parameters)

ProteinBindingGNN is a 44-million-parameter hybrid Graph Neural Network (GNN) and Transformer architecture for node-level protein binding site prediction. It processes both the 3D geometric structure of a protein and its biochemical sequence context (via ESM-2 embeddings) to predict, for each amino acid residue, the probability that it belongs to a binding pocket.

The model combines the local geometric awareness of Equivariant Graph Neural Networks (EGNNs) with the global, pairwise reasoning capabilities of Evoformer blocks (inspired by AlphaFold 2), executed through a multi-step recycling mechanism.

Usage

Everything related to this model, including how to use it, is available at GraphBind.

🧬 Model Architecture

The architecture is explicitly designed to balance rich feature preservation (high hidden dimensionality) with controlled reasoning depth to prevent over-smoothing on 3D protein graphs.

  • Parameters: ~44.1 Million
  • Local Geometry: 3 EGNN Layers (a shallow stack that limits message-passing over-smoothing and restricts each residue's receptive field to its 3-hop micro-environment).
  • Global Reasoning: 8 Evoformer Blocks with Coordinate-Aware Edge Updates.
  • Recycling: 5 iterations (the representations and coordinates are iteratively refined).
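
The recycling step can be illustrated with a minimal sketch. It assumes that recycling simply feeds the node representations and coordinates produced by one pass back in as the inputs of the next pass. The names `run_with_recycling` and `toy_step` are hypothetical, and `toy_step` is only a stand-in for the real EGNN and Evoformer stack, whose internals are not described here.

```python
import numpy as np

def run_with_recycling(step_fn, node_feats, coords, num_recycles=5):
    """Apply step_fn repeatedly, feeding each pass's outputs into the next."""
    for _ in range(num_recycles):
        node_feats, coords = step_fn(node_feats, coords)
    return node_feats, coords

def toy_step(node_feats, coords):
    # Hypothetical stand-in for one pass through the model:
    # move every coordinate halfway toward the centroid, keep features as-is.
    centroid = coords.mean(axis=0, keepdims=True)
    return node_feats, coords + 0.5 * (centroid - coords)

# Four toy residues with 512-dimensional features and 3D coordinates.
feats = np.zeros((4, 512))
coords = np.array([
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [2.0, 2.0, 0.0],
])

refined_feats, refined_coords = run_with_recycling(
    toy_step, feats, coords, num_recycles=5
)
```

Whether the actual model detaches gradients between iterations or normalizes the recycled inputs is not stated in this card, so the sketch leaves both out.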

Full Configuration

CONFIG = {
    "node_input_dim":        1305,   # Designed for pre-trained ESM-2 embeddings
    "edge_input_dim":        4,      # e.g., distances, bond types, angles
    "hidden_dim":            512,    # Preserves rich biochemical context
    "num_egnn_layers":       3,      # Local 3D message passing
    "num_evoformer_blocks":  8,      # Global pairwise attention
    "num_heads":             16,     # Attention heads (d_k = 32)
    "dropout":               0.3,    # High regularization for generalization
    "update_coords":         True,   # Dynamic coordinate refinement
    "num_recycles":          5,      # AlphaFold 2 style recycling loop
    "alpha":                 0.3,    # Edge update mixing coefficient
}
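
The `d_k = 32` noted next to `num_heads` follows from splitting `hidden_dim` evenly across the attention heads. The check below is a small sketch of that arithmetic, assuming the standard multi-head layout; `config_subset` and `head_dim` are illustrative names, not part of the model's code.

```python
# Only the two values needed for the check, copied from the configuration.
config_subset = {"hidden_dim": 512, "num_heads": 16}

def head_dim(hidden_dim, num_heads):
    """Return the per-head dimension, assuming an even split across heads."""
    if hidden_dim % num_heads != 0:
        raise ValueError("hidden_dim must be divisible by num_heads")
    return hidden_dim // num_heads

d_k = head_dim(config_subset["hidden_dim"], config_subset["num_heads"])
print(d_k)  # 32
```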

Performance & Training Details

This model was trained on a curated dataset of 6,700 protein complexes. The full training run and its details are available on Weights & Biases.

Class Imbalance Handling

Protein binding datasets suffer from extreme class imbalance (typically ~5% of residues are true binders and ~95% are non-binders). The model is therefore optimized for ranking quality, that is, for scoring binding residues above non-binding ones, and achieves:

  • Validation AUROC: 0.8898

  • Validation F1 (Standard Threshold): 0.3875 (Note: with so few positive residues, F1 depends strongly on the decision threshold; threshold tuning is recommended for downstream applications).

  • Validation Recall: ~0.50 to 0.57
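
The threshold tuning recommended above could look like the following sketch. It is not the procedure used to produce the reported metrics: the function names, the candidate grid, and the toy labels and probabilities are all assumptions made for illustration. In practice the threshold would be selected on a held-out validation split.

```python
import numpy as np

def f1_at_threshold(y_true, y_prob, threshold):
    """F1 for the positive (binder) class at a given decision threshold."""
    y_pred = y_prob >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else float(2 * tp / denom)

def best_f1_threshold(y_true, y_prob, candidates=None):
    """Scan candidate thresholds and return (threshold, F1) with the highest F1."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)  # illustrative grid
    scores = [f1_at_threshold(y_true, y_prob, t) for t in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), scores[best]

# Toy per-residue labels (1 = binder) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.02, 0.11, 0.08, 0.21, 0.16, 0.12, 0.28, 0.03, 0.37, 0.46])

f1_default = f1_at_threshold(y_true, y_prob, 0.5)
tuned_threshold, f1_tuned = best_f1_threshold(y_true, y_prob)
```

On this toy data a threshold of 0.5 predicts no binders at all, while a lower threshold recovers both, which is the kind of gap that can appear when positives are rare and predicted probabilities stay low.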

Future work

Because of limited GPU availability, the model was not trained long enough to reach the expected results. With sponsorship, the plan is to continue training and fine-tuning.
