LLM Fine-Tuning Bug Log
1. CUDA error: CUBLAS_STATUS_INTERNAL_ERROR
Problem log
```
CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
```
Cause: the system CUDA toolkit version does not match the CUDA version that the installed PyTorch build was compiled against.
Troubleshooting
Check the system CUDA version:
```bash
nvcc -V
```
Then start a Python interpreter and check the CUDA version PyTorch was built with:
```python
import torch
print(torch.version.cuda)
```
If the two results differ (e.g., one reports 12.1 and the other 12.4), the versions are mismatched.
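For convenience, both the PyTorch-side version and basic GPU availability can be printed in one place. This is a minimal diagnostic sketch; the device index 0 is an assumption for multi-GPU machines:
```python
# Diagnostic sketch: print the CUDA version PyTorch was built against
# and basic GPU availability. Compare torch.version.cuda with the
# release reported by `nvcc -V`.
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Device index 0 is an assumption; adjust on multi-GPU machines.
    print("GPU 0:", torch.cuda.get_device_name(0))
```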
Solution
Go to the PyTorch website and find the install command that matches your CUDA version. For example, the command for CUDA 12.1 is:
```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
If PyTorch is already installed, append `--upgrade` to the command to overwrite the existing installation.
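To confirm the reinstall actually resolves this bug, a bf16 batched matmul on the GPU exercises the same `cublasGemmStridedBatchedEx` path seen in the log above. A minimal smoke-test sketch; the tensor shapes are arbitrary:
```python
# Smoke-test sketch: a bf16 batched matmul on the GPU goes through the
# cuBLAS strided-batched GEMM path that raised the error above.
import torch

a = torch.randn(4, 32, 64, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4, 64, 16, device="cuda", dtype=torch.bfloat16)
c = torch.bmm(a, b)  # raises CUBLAS_STATUS_INTERNAL_ERROR if still broken
print("OK:", c.shape)
```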
2. tokenizer.chat_template is not set and no template argument was passed
Problem log
```
Error in applying chat template from request: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
```
Cause: the model's `tokenizer_config.json` does not contain a `chat_template` field.
Troubleshooting
Go to the model directory, open the `tokenizer_config.json` file, and check whether it contains a `chat_template` field, like this:
```json
{
  "chat_template": "{{ '<|begin_of_text|>' }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% endif %}{% if system_message is defined %}{{ '<|start_header_id|>system<|end_header_id|>\n\n' + system_message + '<|eot_id|>' }}{% endif %}{% for message in loop_messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|start_header_id|>user<|end_header_id|>\n\n' + content + '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|eot_id|>' }}{% endif %}{% endfor %}"
}
```
If the field is missing, the model has no chat template, and any call to `tokenizer.apply_chat_template` will raise the error above. For example, the Llama-3.1-8B base model has no chat_template, whereas Llama-3.1-8B-Instruct does; see the sketch below for a programmatic check.
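The check can also be done in code, and as a stopgap the template can be assigned manually, which is what the error message itself suggests. This is a minimal sketch: the model path is a placeholder, and the fallback Jinja template is a toy illustration, not the official Llama template:
```python
# Sketch: check whether a tokenizer ships a chat template, and assign
# one manually if it does not. Path and template are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/your/model")  # hypothetical path

if tokenizer.chat_template is None:
    # Setting tokenizer.chat_template is the fix the error message points
    # to; this toy Jinja template is only an illustration.
    tokenizer.chat_template = (
        "{% for message in messages %}"
        "{{ message['role'] }}: {{ message['content'] }}\n"
        "{% endfor %}"
    )

messages = [{"role": "user", "content": "Hello"}]
print(tokenizer.apply_chat_template(messages, tokenize=False))
```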
Solution
- TBD
3. NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1"` or use `accelerate launch` which will do this automatically.
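The error message itself names the fix: disable NCCL's P2P and InfiniBand transports. A minimal sketch of doing this from Python, where the variables must be set before any distributed initialization (ideally at the very top of the training script; the rest of the script is assumed):
```python
# Sketch: apply the fix the error message suggests by disabling NCCL
# P2P and InfiniBand. Must run before distributed setup begins.
import os

os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"

# ... the rest of the training script (model, data, trainer) goes here.
```
Alternatively, as the message notes, launching the script with `accelerate launch` sets these variables automatically on affected GPUs.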