LLM2Vec is a simple recipe for converting decoder-only LLMs into text encoders. It consists of three steps: 1) enabling bidirectional attention, 2) masked next token prediction (MNTP), and 3) unsupervised contrastive learning. The model can be further fine-tuned to achieve state-of-the-art performance.
- Repository: https://github.com/McGill-NLP/llm2vec
- Paper: https://arxiv.org/abs/2404.05961
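The first step of the recipe, enabling bidirectional attention, amounts to replacing the decoder's causal attention mask with an all-ones mask so every token can attend to every other token, as in an encoder. A minimal NumPy sketch of the two masks (illustration only, not the library's internals):

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    # Standard decoder-only mask: token i may attend only to tokens j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n: int) -> np.ndarray:
    # LLM2Vec step 1: every token may attend to every other token.
    return np.ones((n, n), dtype=bool)

print(causal_mask(4).astype(int))
print(bidirectional_mask(4).astype(int))
```

Steps 2 and 3 (MNTP and contrastive learning) then adapt the model to make use of this new attention pattern.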
Usage
```python
import torch
from llm2vec import LLM2Vec

model = LLM2Vec.from_pretrained(
    "standardmodelbio/smb-mntp-llama-3.1-8b-v1",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    max_length=4096,
    attn_implementation="flash_attention_2",
)

text = ["StandardModel"] * 8
embeddings = model.encode(text)
print(embeddings)
print(embeddings.shape)
"""
tensor([[ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691],
        [ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691],
        [ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691],
        ...,
        [ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691],
        [ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691],
        [ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691]])
torch.Size([8, 4096])
"""
```
License
This model is proprietary and for internal use only.
Training Data
We train on a comprehensive dataset of proprietary real-world EHR records spanning fifteen distinct clinical indications, with a strong emphasis on oncology. This collection includes over 1.2M patients and approximately 200M clinical events, providing rich longitudinal data for training our models. The dataset's diversity enables evaluation across 10 distinct predictive tasks, allowing thorough assessment of temporal reasoning capabilities across varied clinical scenarios. See https://arxiv.org/abs/2509.25591 for more details.
Model tree for standardmodelbio/smb-mntp-llama-3.1-8b-v1
- Base model: meta-llama/Llama-3.1-8B