---
license: apache-2.0
language:
- en
library_name: transformers
---

# Model Card for Indus (indus-sde-v0.2)

This model was further pre-trained with a Masked Language Modelling (MLM) objective on the full Science Discovery Engine (SDE) website data, starting from [nasa-smd-ibm-v0.1](https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1) after extending its context size.

## Model Details

- **Base Model**: nasa-impact/nasa-smd-ibm-v0.1
- **Tokenizer**: nasa-impact/nasa-smd-ibm-v0.1
- **Parameters**: 125M
- **Pretraining Strategy**: Masked Language Modeling (MLM)

## Training Data

- Full Science Discovery Engine (SDE) website data

## Training Procedure

- **transformers Version**: 4.48.3
- **Strategy**: Masked Language Modeling (MLM)
- **Stage 1 Training**: Increase the context size from 512 to 1024 tokens and train only the position-embedding layer, at a low learning rate, for 1 epoch. (This is done to retain the representations learned by the original upstream Indus model, which was trained on a large scientific corpus.) A sketch of this stage is given at the end of this card.
- **Stage 2 Training**: Full training for 5 epochs with a cosine learning-rate scheduler with warmup.
- **Masking Strategy** (a sketch of a matching collator is given at the end of this card):
  - Weighted dynamic masking based on keyword importance (YAKE) combined with random masking.
  - The idea behind masking important keywords is to force the model to generalize over the "science" keywords that carry a strong signal about the document.
  - Masked language model probability: 30%
- Batch Size: 6
- Learning rate: 5e-5
- Warmup ratio: 0.1

## Dataset

- Total Data Size: 545,717
- Validation Data Size: 10% of total size
- Test Data Size: 10% of total size

## Evaluation

Top-k test mask accuracy, i.e. the fraction of masked positions whose true token appears among the model's k highest-scoring predictions (see the sketch at the end of this card):

| Metric | Accuracy |
|--------|----------|
| Top-1  | 0.7814   |
| Top-2  | 0.8319   |
| Top-3  | 0.8548   |

![image](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F63f0e7de9cf89c9ed1bf92a2%2FChCQKcyDNu62_Nrvgp2Fh.png)
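
## How to Use

A minimal usage sketch with the `transformers` fill-mask pipeline. The repository id below is a placeholder for this model's path on the Hub; replace it with the actual id if it differs.

```python
from transformers import pipeline

# Placeholder repository id -- replace with the actual Hub path of this model.
model_id = "nasa-impact/indus-sde-v0.2"

fill_mask = pipeline("fill-mask", model=model_id)

# Predict the masked token in a science-style sentence.
sentence = f"The Science Discovery Engine indexes open NASA {fill_mask.tokenizer.mask_token} data."
for prediction in fill_mask(sentence):
    print(prediction["token_str"], round(prediction["score"], 4))
```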
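
## Training and Evaluation Sketches

The snippets below are illustrative sketches, not the exact training code used for this model. The first one shows one way to carry out Stage 1, assuming a RoBERTa-style encoder whose position embeddings live at `model.roberta.embeddings.position_embeddings`; attribute paths, the position-index offset, and the buffer handling may differ for other architectures or transformers versions.

```python
import torch
from transformers import AutoModelForMaskedLM

base = "nasa-impact/nasa-smd-ibm-v0.1"
model = AutoModelForMaskedLM.from_pretrained(base)

# --- Extend the position-embedding table so the model accepts 1024 tokens ---
# RoBERTa-style models reserve the first rows for padding, hence the offset.
embeddings = model.roberta.embeddings
old_pos = embeddings.position_embeddings
offset = embeddings.padding_idx + 1
new_size = 1024 + offset
new_pos = torch.nn.Embedding(new_size, old_pos.embedding_dim,
                             padding_idx=old_pos.padding_idx)
with torch.no_grad():
    # Keep the learned rows; positions beyond the old length start from random init.
    new_pos.weight[: old_pos.num_embeddings] = old_pos.weight
embeddings.position_embeddings = new_pos
model.config.max_position_embeddings = new_size

# Re-register the length-dependent buffers for the new maximum length.
embeddings.register_buffer(
    "position_ids", torch.arange(new_size).expand((1, -1)), persistent=False
)
embeddings.register_buffer(
    "token_type_ids", torch.zeros((1, new_size), dtype=torch.long), persistent=False
)

# --- Stage 1: train only the position embeddings ---
# Freezing everything else preserves the representations learned by the
# upstream Indus model on its scientific corpus.
for name, param in model.named_parameters():
    param.requires_grad = "position_embeddings" in name
```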
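
The card does not publish the exact masking code, so the following is only one way to realize "weighted dynamic masking based on keyword importance (YAKE) plus random masking": keyword tokens are masked at a boosted probability, all other tokens at the base 30% rate. The `yake` package settings, the `keyword_boost` factor, and the function name are assumptions for illustration.

```python
import random

import yake
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nasa-impact/nasa-smd-ibm-v0.1")
kw_extractor = yake.KeywordExtractor(top=20)  # number of keywords is an assumption


def weighted_dynamic_mask(text, mlm_probability=0.30, keyword_boost=2.0):
    """Mask keyword tokens at a boosted rate and all other tokens at the base rate."""
    # YAKE returns (phrase, score) pairs; lower score means more important.
    phrases = kw_extractor.extract_keywords(text)
    keyword_words = {word.lower() for phrase, _ in phrases for word in phrase.split()}

    encoding = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")
    input_ids = encoding["input_ids"][0]
    labels = input_ids.clone()

    for i, token in enumerate(tokenizer.convert_ids_to_tokens(input_ids.tolist())):
        if token in tokenizer.all_special_tokens:
            labels[i] = -100                        # never mask special tokens
            continue
        word = token.lstrip("Ġ").lower()            # strip the BPE word-boundary marker
        p = mlm_probability * (keyword_boost if word in keyword_words else 1.0)
        if random.random() < min(p, 1.0):
            input_ids[i] = tokenizer.mask_token_id  # masked position: predict original id
        else:
            labels[i] = -100                        # unmasked positions are ignored in the loss
    return {"input_ids": input_ids.unsqueeze(0),
            "attention_mask": encoding["attention_mask"],
            "labels": labels.unsqueeze(0)}
```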
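
Stage 2 then unfreezes the full model and trains with the hyper-parameters listed above. A sketch with the Hugging Face `Trainer` is shown below; `model`, the datasets, and `data_collator` are assumed to come from the previous steps.

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="indus-sde-stage2",     # output path is an assumption
    num_train_epochs=5,
    per_device_train_batch_size=6,
    learning_rate=5e-5,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",        # cosine schedule with linear warmup
    eval_strategy="epoch",
)

# Unfreeze everything for the full Stage 2 pre-training run.
for param in model.parameters():
    param.requires_grad = True

trainer = Trainer(
    model=model,                       # the context-extended model from Stage 1
    args=training_args,
    train_dataset=train_dataset,       # assumed: tokenized SDE training split
    eval_dataset=validation_dataset,   # assumed: the 10% validation split
    data_collator=data_collator,       # assumed: the weighted-masking collator
)
trainer.train()
```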
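
The reported top-k test mask accuracy can be computed along the lines of the sketch below (the function name is illustrative): for each masked position, a hit is counted when the original token id is among the k highest-scoring vocabulary entries.

```python
import torch


def topk_mask_accuracy(logits, labels, ks=(1, 2, 3)):
    """Top-k accuracy over masked positions.

    logits: (batch, seq_len, vocab_size) model outputs.
    labels: (batch, seq_len) with -100 everywhere except the masked positions,
            which hold the original token ids (the usual MLM convention).
    """
    masked = labels != -100
    true_ids = labels[masked]                                  # (num_masked,)
    topk_ids = logits[masked].topk(max(ks), dim=-1).indices    # (num_masked, max_k)
    hits = topk_ids == true_ids.unsqueeze(-1)
    return {f"top{k}": hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
```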