Loaded loader_megatron_core as the loader.
Loaded saver_llama2_hf_bf as the saver.
Starting saver...
Starting loader...
fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/import_utils.py:31: UserWarning: Failed to import apex plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/import_utils.py:31: UserWarning: Failed to import huggingface plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/utils/import_utils.py:31: UserWarning: Failed to import megatron plugin due to: AttributeError("module 'transformers.modeling_utils' has no attribute 'Conv1D'"). You may ignore this warning if you do not need this plugin.
  warnings.warn(
Setting num_layers to 28 from checkpoint
Setting hidden_size to 5120 from checkpoint
Setting ffn_hidden_size to 27648 from checkpoint
Setting seq_length to 131072 from checkpoint
Setting num_attention_heads to 40 from checkpoint
Setting num_query_groups to 8 from checkpoint
Setting group_query_attention to True from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 131072 from checkpoint
Setting position_embedding_type to rope from checkpoint
Setting add_position_embedding to True from checkpoint
Setting use_rotary_position_embeddings to True from checkpoint
Setting rotary_base to 500000 from checkpoint
Setting rotary_percent to 1.0 from checkpoint
Setting rotary_interleaved to False from checkpoint
Setting add_bias_linear to False from checkpoint
Setting add_qkv_bias to False from checkpoint
Setting squared_relu to False from checkpoint
Setting swiglu to True from checkpoint
Setting untie_embeddings_and_output_weights to True from checkpoint
Setting apply_layernorm_1p to False from checkpoint
Setting normalization to RMSNorm from checkpoint
Setting apply_query_key_layer_scaling to False from checkpoint
Setting attention_dropout to 0.0 from checkpoint
Setting hidden_dropout to 0.0 from checkpoint
Checkpoint did not provide arguments hybrid_override_pattern
Checkpoint did not provide arguments spec
Setting hybrid_attention_ratio to 0.0 from checkpoint
Setting hybrid_mlp_ratio to 0.0 from checkpoint
Checkpoint did not provide arguments num_experts
Setting moe_layer_freq to 1 from checkpoint
Setting moe_router_topk to 2 from checkpoint
Setting moe_router_pre_softmax to False from checkpoint
Setting moe_grouped_gemm to False from checkpoint
Checkpoint did not provide arguments moe_shared_expert_intermediate_size
Setting mamba_state_dim to 128 from checkpoint
Setting mamba_head_dim to 64 from checkpoint
Setting mamba_num_groups to 8 from checkpoint
Checkpoint did not provide arguments mamba_num_heads
Setting is_hybrid_model to False from checkpoint
Checkpoint did not provide arguments heterogeneous_layers_config_path
Checkpoint did not provide arguments heterogeneous_layers_config_encoded_json
Setting tokenizer_type to SFTTokenizer from checkpoint
Setting tokenizer_model to /cpfs01/users/wzhang/iquest-coder-v1.1/RepoData-Ucoder-32B-128k-from2.5.2/97.09B_instruct_iquest-coder from checkpoint
Checkpoint did not provide arguments tiktoken_pattern
Setting padded_vocab_size to 76800 from checkpoint
INFO:megatron.core.num_microbatches_calculator:setting number of microbatches to constant 1
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
building GPT model ...
(TP, PP) mismatch after resume ((1, 1) vs (8, 1) from checkpoint): RNG state will be ignored
sharded_state_dict metadata loaded from the checkpoint: {'distrib_optim_sharding_type': 'dp_reshardable', 'singleton_local_shards': False, 'chained_optim_avoid_prefix': True}
Job sharding has changed: Rerun state will be ignored
loading distributed checkpoint from /tmp/megatron_convert_iter1616_node0_pid360_42a53cb4 at iteration 1616
/volume/pt-train/users/wzhang/wjj-workspace/code-sft/src/training/Megatron-LM/megatron/core/dist_checkpointing/strategies/torch.py:956: FutureWarning: `load_state_dict` is deprecated and will be removed in future versions. Please use `load` instead.
  checkpoint.load_state_dict(
/usr/local/lib/python3.12/dist-packages/torch/distributed/checkpoint/planner_helpers.py:406: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  device = getattr(value, "device", None)
/usr/local/lib/python3.12/dist-packages/torch/distributed/checkpoint/default_planner.py:454: FutureWarning: Please use DTensor instead and we are deprecating ShardedTensor.
  and md.size != obj.size()
checkpoint version 3.0
successfully loaded checkpoint from /tmp/megatron_convert_iter1616_node0_pid360_42a53cb4 [ t 1/1, p 1/1 ] at iteration 1616
sending embeddings
sending transformer layer 0
sending transformer layer 1
sending transformer layer 2
sending transformer layer 3
sending transformer layer 4
sending transformer layer 5
sending transformer layer 6
sending transformer layer 7
sending transformer layer 8
sending transformer layer 9
sending transformer layer 10
sending transformer layer 11
sending transformer layer 12
sending transformer layer 13
sending transformer layer 14
sending transformer layer 15
sending transformer layer 16
sending transformer layer 17
sending transformer layer 18
sending transformer layer 19
sending transformer layer 20
sending transformer layer 21
sending transformer layer 22
sending transformer layer 23
sending transformer layer 24
sending transformer layer 25
sending transformer layer 26
sending transformer layer 27
sending final norm
sending output layer
Waiting for saver to complete...
fused_indices_to_multihot has reached end of life. Please migrate to a non-experimental function.
received embeddings
received transformer layer 0
received transformer layer 1
received transformer layer 2
received transformer layer 3
received transformer layer 4
received transformer layer 5
received transformer layer 6
received transformer layer 7
received transformer layer 8
received transformer layer 9
received transformer layer 10
received transformer layer 11
received transformer layer 12
received transformer layer 13
received transformer layer 14
received transformer layer 15
received transformer layer 16
received transformer layer 17
received transformer layer 18
received transformer layer 19
received transformer layer 20
received transformer layer 21
received transformer layer 22
received transformer layer 23
received transformer layer 24
received transformer layer 25
received transformer layer 26
received transformer layer 27
received final norm
received output layer
Saving model to disk ...
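
For reference, output of this shape is what Megatron-LM's tools/checkpoint/convert.py prints when it pairs a loader plugin with a saver plugin (the "Loaded ... as the loader/saver" lines at the top). A minimal sketch of the kind of invocation that could have produced this run is shown below; the directory paths are placeholders, and the loader/saver names are only inferred from the log (convert.py prefixes them with "loader_"/"saver_"), so treat this as an assumption rather than the exact command used:

  python tools/checkpoint/convert.py \
      --model-type GPT \
      --loader megatron_core \
      --saver llama2_hf_bf \
      --load-dir /path/to/megatron/checkpoint \
      --save-dir /path/to/hf/output

The loader and saver run as separate processes, which is why the "sending ..." lines from the loader are interleaved with the saver's "received ..." lines above.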