---
license: apache-2.0
language:
- en
library_name: transformers
---

# Model Card for Indus (indus-sde-v0.2)

This model was further pre-trained with a Masked Language Modelling (MLM) objective on the full Science Discovery Engine (SDE) website data, starting from [nasa-smd-ibm-v0.1](https://huggingface.co/nasa-impact/nasa-smd-ibm-v0.1) after extending its context size.

## Model Details

- **Base Model**: nasa-impact/nasa-smd-ibm-v0.1
- **Tokenizer**: nasa-impact/nasa-smd-ibm-v0.1
- **Parameters**: 125M
- **Pretraining Strategy**: Masked Language Modeling (MLM)

## Training Data

- Full Science Discovery Engine (SDE) website data

## Training Procedure

- **transformers Version**: 4.48.3
- **Strategy**: Masked Language Modeling (MLM)
- **Stage 1 Training**: Increase the context size from 512 to 1024 tokens and train only the position-embedding layer, at a low learning rate, for 1 epoch. (This is done to retain the representations learned by the original upstream Indus model, which was trained on a large scientific corpus.) A sketch of this stage is given at the end of this card.
- **Stage 2 Training**: Full training for 5 epochs with a cosine learning-rate scheduler with warmup.
- **Masking Strategy** (a sketch of a matching collator is given at the end of this card):
  - Weighted dynamic masking based on keyword importance (YAKE) combined with random masking.
  - The idea behind masking important keywords is to force the model to generalize over the "science" keywords that carry a strong signal about the document.
  - Masked language model probability: 30%
- Batch Size: 6
- Learning rate: 5e-5
- Warmup ratio: 0.1

## Dataset

- Total Data Size: 545,717
- Validation Data Size: 10% of total size
- Test Data Size: 10% of total size

## Evaluation

Top-k test mask accuracy, i.e. the fraction of masked positions whose true token appears among the model's k highest-scoring predictions (see the sketch at the end of this card):

| Metric | Accuracy |
|--------|----------|
| Top-1  | 0.7814   |
| Top-2  | 0.8319   |
| Top-3  | 0.8548   |

![image](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F63f0e7de9cf89c9ed1bf92a2%2FChCQKcyDNu62_Nrvgp2Fh.png)
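
## How to Use

A minimal usage sketch with the `transformers` fill-mask pipeline. The repository id below is a placeholder for this model's path on the Hub; replace it with the actual id if it differs.

```python
from transformers import pipeline

# Placeholder repository id -- replace with the actual Hub path of this model.
model_id = "nasa-impact/indus-sde-v0.2"

fill_mask = pipeline("fill-mask", model=model_id)

# Predict the masked token in a science-style sentence.
sentence = f"The Science Discovery Engine indexes open NASA {fill_mask.tokenizer.mask_token} data."
for prediction in fill_mask(sentence):
    print(prediction["token_str"], round(prediction["score"], 4))
```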
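
## Training and Evaluation Sketches

The snippets below are illustrative sketches, not the exact training code used for this model. The first one shows one way to carry out Stage 1, assuming a RoBERTa-style encoder whose position embeddings live at `model.roberta.embeddings.position_embeddings`; attribute paths, the position-index offset, and the buffer handling may differ for other architectures or transformers versions.

```python
import torch
from transformers import AutoModelForMaskedLM

base = "nasa-impact/nasa-smd-ibm-v0.1"
model = AutoModelForMaskedLM.from_pretrained(base)

# --- Extend the position-embedding table so the model accepts 1024 tokens ---
# RoBERTa-style models reserve the first rows for padding, hence the offset.
embeddings = model.roberta.embeddings
old_pos = embeddings.position_embeddings
offset = embeddings.padding_idx + 1
new_size = 1024 + offset
new_pos = torch.nn.Embedding(new_size, old_pos.embedding_dim,
                             padding_idx=old_pos.padding_idx)
with torch.no_grad():
    # Keep the learned rows; positions beyond the old length start from random init.
    new_pos.weight[: old_pos.num_embeddings] = old_pos.weight
embeddings.position_embeddings = new_pos
model.config.max_position_embeddings = new_size

# Re-register the length-dependent buffers for the new maximum length.
embeddings.register_buffer(
    "position_ids", torch.arange(new_size).expand((1, -1)), persistent=False
)
embeddings.register_buffer(
    "token_type_ids", torch.zeros((1, new_size), dtype=torch.long), persistent=False
)

# --- Stage 1: train only the position embeddings ---
# Freezing everything else preserves the representations learned by the
# upstream Indus model on its scientific corpus.
for name, param in model.named_parameters():
    param.requires_grad = "position_embeddings" in name
```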
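
The card does not publish the exact masking code, so the following is only one way to realize "weighted dynamic masking based on keyword importance (YAKE) plus random masking": keyword tokens are masked at a boosted probability, all other tokens at the base 30% rate. The `yake` package settings, the `keyword_boost` factor, and the function name are assumptions for illustration.

```python
import random

import yake
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nasa-impact/nasa-smd-ibm-v0.1")
kw_extractor = yake.KeywordExtractor(top=20)  # number of keywords is an assumption


def weighted_dynamic_mask(text, mlm_probability=0.30, keyword_boost=2.0):
    """Mask keyword tokens at a boosted rate and all other tokens at the base rate."""
    # YAKE returns (phrase, score) pairs; lower score means more important.
    phrases = kw_extractor.extract_keywords(text)
    keyword_words = {word.lower() for phrase, _ in phrases for word in phrase.split()}

    encoding = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")
    input_ids = encoding["input_ids"][0]
    labels = input_ids.clone()

    for i, token in enumerate(tokenizer.convert_ids_to_tokens(input_ids.tolist())):
        if token in tokenizer.all_special_tokens:
            labels[i] = -100                        # never mask special tokens
            continue
        word = token.lstrip("Ġ").lower()            # strip the BPE word-boundary marker
        p = mlm_probability * (keyword_boost if word in keyword_words else 1.0)
        if random.random() < min(p, 1.0):
            input_ids[i] = tokenizer.mask_token_id  # masked position: predict original id
        else:
            labels[i] = -100                        # unmasked positions are ignored in the loss
    return {"input_ids": input_ids.unsqueeze(0),
            "attention_mask": encoding["attention_mask"],
            "labels": labels.unsqueeze(0)}
```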
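
Stage 2 then unfreezes the full model and trains with the hyper-parameters listed above. A sketch with the Hugging Face `Trainer` is shown below; `model`, the datasets, and `data_collator` are assumed to come from the previous steps.

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="indus-sde-stage2",     # output path is an assumption
    num_train_epochs=5,
    per_device_train_batch_size=6,
    learning_rate=5e-5,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",        # cosine schedule with linear warmup
    eval_strategy="epoch",
)

# Unfreeze everything for the full Stage 2 pre-training run.
for param in model.parameters():
    param.requires_grad = True

trainer = Trainer(
    model=model,                       # the context-extended model from Stage 1
    args=training_args,
    train_dataset=train_dataset,       # assumed: tokenized SDE training split
    eval_dataset=validation_dataset,   # assumed: the 10% validation split
    data_collator=data_collator,       # assumed: the weighted-masking collator
)
trainer.train()
```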
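
The reported top-k test mask accuracy can be computed along the lines of the sketch below (the function name is illustrative): for each masked position, a hit is counted when the original token id is among the k highest-scoring vocabulary entries.

```python
import torch


def topk_mask_accuracy(logits, labels, ks=(1, 2, 3)):
    """Top-k accuracy over masked positions.

    logits: (batch, seq_len, vocab_size) model outputs.
    labels: (batch, seq_len) with -100 everywhere except the masked positions,
            which hold the original token ids (the usual MLM convention).
    """
    masked = labels != -100
    true_ids = labels[masked]                                  # (num_masked,)
    topk_ids = logits[masked].topk(max(ks), dim=-1).indices    # (num_masked, max_k)
    hits = topk_ids == true_ids.unsqueeze(-1)
    return {f"top{k}": hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
```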