LLM2Vec is a simple recipe for converting decoder-only LLMs into text encoders. It consists of three steps: 1) enabling bidirectional attention, 2) masked next token prediction (MNTP), and 3) unsupervised contrastive learning. The model can be further fine-tuned to achieve state-of-the-art performance.
- Repository: https://github.com/McGill-NLP/llm2vec
- Paper: https://arxiv.org/abs/2404.05961
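The first step of the recipe, enabling bidirectional attention, amounts to replacing the decoder's causal attention mask with an all-ones mask so every token can attend to every other token, as in an encoder. A minimal NumPy sketch of the two masks (illustration only, not the library's internals):

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    # Standard decoder-only mask: token i may attend only to tokens j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def bidirectional_mask(n: int) -> np.ndarray:
    # LLM2Vec step 1: every token may attend to every other token.
    return np.ones((n, n), dtype=bool)

print(causal_mask(4).astype(int))
print(bidirectional_mask(4).astype(int))
```

Steps 2 and 3 (MNTP and contrastive learning) then adapt the model to make use of this new attention pattern.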
Usage
```python
import torch
from llm2vec import LLM2Vec

model = LLM2Vec.from_pretrained(
    "standardmodelbio/smb-mntp-llama-3.1-8b-v1",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    max_length=4096,
    attn_implementation="flash_attention_2",
)

text = ["StandardModel"] * 8
embeddings = model.encode(text)
print(embeddings)
print(embeddings.shape)
"""
tensor([[ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691],
        [ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691],
        [ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691],
        ...,
        [ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691],
        [ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691],
        [ 1.1250,  0.7070, -0.1475,  ...,  0.8320,  0.2852, -0.3691]])
torch.Size([8, 4096])
"""
```
License
This model is proprietary and for internal use only.
Training Data
We train on a comprehensive dataset of proprietary real-world EHR records spanning fifteen distinct clinical indications, with a strong emphasis on oncology. This collection includes over 1.2M patients and approximately 200M clinical events, providing rich longitudinal data for training our models. The dataset's diversity enables evaluation across 10 distinct predictive tasks, allowing thorough assessment of temporal reasoning capabilities across varied clinical scenarios. See https://arxiv.org/abs/2509.25591 for more details.
Model tree for standardmodelbio/smb-mntp-llama-3.1-8b-v1
- Base model: meta-llama/Llama-3.1-8B