LLM Fine-Tuning Bug Log
1. CUDA error: CUBLAS_STATUS_INTERNAL_ERROR
Problem log
```
CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasGemmStridedBatchedEx(handle, opa, opb, (int)m, (int)n, (int)k, (void*)&falpha, a, CUDA_R_16BF, (int)lda, stridea, b, CUDA_R_16BF, (int)ldb, strideb, (void*)&fbeta, c, CUDA_R_16BF, (int)ldc, stridec, (int)num_batches, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
```
Cause: the system CUDA toolkit version does not match the CUDA version that the installed PyTorch build was compiled against.
Troubleshooting
Check the system CUDA version:
```bash
nvcc -V
```
Then start a Python interpreter and check the CUDA version PyTorch was built with:
```python
import torch
print(torch.version.cuda)
```
If the two results differ (e.g., one reports 12.1 and the other 12.4), the versions are mismatched.
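For convenience, both the PyTorch-side version and basic GPU availability can be printed in one place. This is a minimal diagnostic sketch; the device index 0 is an assumption for multi-GPU machines:
```python
# Diagnostic sketch: print the CUDA version PyTorch was built against
# and basic GPU availability. Compare torch.version.cuda with the
# release reported by `nvcc -V`.
import torch

print("PyTorch built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Device index 0 is an assumption; adjust on multi-GPU machines.
    print("GPU 0:", torch.cuda.get_device_name(0))
```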
Solution
Go to the PyTorch website and find the install command that matches your CUDA version. For example, the command for CUDA 12.1 is:
```bash
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
If PyTorch is already installed, append `--upgrade` to the command to overwrite the existing installation.
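To confirm the reinstall actually resolves this bug, a bf16 batched matmul on the GPU exercises the same `cublasGemmStridedBatchedEx` path seen in the log above. A minimal smoke-test sketch; the tensor shapes are arbitrary:
```python
# Smoke-test sketch: a bf16 batched matmul on the GPU goes through the
# cuBLAS strided-batched GEMM path that raised the error above.
import torch

a = torch.randn(4, 32, 64, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4, 64, 16, device="cuda", dtype=torch.bfloat16)
c = torch.bmm(a, b)  # raises CUBLAS_STATUS_INTERNAL_ERROR if still broken
print("OK:", c.shape)
```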
2. tokenizer.chat_template is not set and no template argument was passed
Problem log
```
Error in applying chat template from request: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
```
Cause: the model's `tokenizer_config.json` does not contain a `chat_template` field.
Troubleshooting
Go to the model directory, open the `tokenizer_config.json` file, and check whether it contains a `chat_template` field, like this:
```json
{
  "chat_template": "{{ '<|begin_of_text|>' }}{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% endif %}{% if system_message is defined %}{{ '<|start_header_id|>system<|end_header_id|>\n\n' + system_message + '<|eot_id|>' }}{% endif %}{% for message in loop_messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|start_header_id|>user<|end_header_id|>\n\n' + content + '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|eot_id|>' }}{% endif %}{% endfor %}"
}
```
If the field is missing, the model has no chat template, and any call to `tokenizer.apply_chat_template` will raise the error above. For example, the Llama-3.1-8B base model has no chat_template, whereas Llama-3.1-8B-Instruct does; see the sketch below for a programmatic check.
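The check can also be done in code, and as a stopgap the template can be assigned manually, which is what the error message itself suggests. This is a minimal sketch: the model path is a placeholder, and the fallback Jinja template is a toy illustration, not the official Llama template:
```python
# Sketch: check whether a tokenizer ships a chat template, and assign
# one manually if it does not. Path and template are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/your/model")  # hypothetical path

if tokenizer.chat_template is None:
    # Setting tokenizer.chat_template is the fix the error message points
    # to; this toy Jinja template is only an illustration.
    tokenizer.chat_template = (
        "{% for message in messages %}"
        "{{ message['role'] }}: {{ message['content'] }}\n"
        "{% endfor %}"
    )

messages = [{"role": "user", "content": "Hello"}]
print(tokenizer.apply_chat_template(messages, tokenize=False))
```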
Solution
- TBD
3. NotImplementedError: Using RTX 4000 series doesn't support faster communication broadband via P2P or IB. Please set `NCCL_P2P_DISABLE="1"` and `NCCL_IB_DISABLE="1"` or use `accelerate launch` which will do this automatically.
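The error message itself names the fix: disable NCCL's P2P and InfiniBand transports. A minimal sketch of doing this from Python, where the variables must be set before any distributed initialization (ideally at the very top of the training script; the rest of the script is assumed):
```python
# Sketch: apply the fix the error message suggests by disabling NCCL
# P2P and InfiniBand. Must run before distributed setup begins.
import os

os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"

# ... the rest of the training script (model, data, trainer) goes here.
```
Alternatively, as the message notes, launching the script with `accelerate launch` sets these variables automatically on affected GPUs.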