LLMBox
LLMBox Study Notes
These are my reading notes on the LLMBox framework. The material draws mainly on the LLMBox documentation (two sources), a DeepSpeed blog post, and the official DeepSpeed documentation (two sources).
Why learn a framework for training large models?
Training a large model is extremely time-consuming, so accelerating training and saving intermediate training state are critical: if a run fails, training can resume from a checkpoint, and the model's performance can be evaluated at different stages of training. This allows more flexible hyperparameter tuning as well as fine-tuning for downstream tasks. At inference time, large models can also be accelerated with techniques such as KV cache. All of these complex moving parts create the need for a unified framework for LLM training and inference. LLMBox unifies training, evaluation, and inference for generative LLMs, so mastering LLMBox goes a long way toward using LLMs in research. For encoder models such as BERT, the Hugging Face Trainer is sufficient.
Loading datasets and models
When downloading Hugging Face datasets, the host may be blocked and unreachable, for example:
https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl
In that case, replace the prefix https://huggingface.co with https://hf-mirror.com and download from the domestic mirror. You can use huggingface-cli to download Hugging Face models and datasets into a specified directory:
# environment variable
export HF_ENDPOINT=https://hf-mirror.com
# model
huggingface-cli download --resume-download gpt2 --local-dir gpt2
# dataset
huggingface-cli download --repo-type dataset --resume-download wikitext --local-dir wikitext
# log in to download gated repos
huggingface-cli download --token hf_*** --resume-download meta-llama/Llama-2-7b-hf --local-dir Llama-2-7b-hf
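The same downloads can also be scripted through the huggingface_hub Python API (a minimal sketch; note that HF_ENDPOINT must be set before huggingface_hub is imported for the mirror to take effect):
import os

# Set the mirror before importing huggingface_hub (or export it in the shell).
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

snapshot_download(repo_id="gpt2", local_dir="gpt2")
snapshot_download(repo_id="wikitext", repo_type="dataset", local_dir="wikitext")
snapshot_download(repo_id="meta-llama/Llama-2-7b-hf", token="hf_***", local_dir="Llama-2-7b-hf")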
LLMBox: Download
Installing LLMBox's dependencies on a server is a bit fiddly. Taking python=3.11 with module load cuda/12.1.1 as an example:
$ pip install -r requirements.txt
...
ModuleNotFoundError: No module named 'packaging'
The following steps resolve it:
- First install everything in requirements.txt except vllm and flash_attention
- pip install --upgrade pip
- pip install packaging wheel setuptools
- pip install vllm flash-attn --no-build-isolation
Things go badly wrong if cuda/12.1.1 is not loaded first, so it is best to put module load cuda/12.1.1 in your .bashrc.
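A quick way to check that the environment came out sane after installation (a sketch; the exact version numbers will differ):
import torch
import flash_attn
import vllm

# The CUDA version PyTorch was built against should match the loaded module.
print(torch.version.cuda)        # expected: 12.1
print(flash_attn.__version__)
print(vllm.__version__)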
LLMBox: Training
To train a large model, LLMBox provides several training modes (sft, pt, ppo, and dpo), low-resource PEFT methods such as LoRA and QLoRA, and training accelerations such as Flash Attention and DeepSpeed.
Trainer
LLMBox ships datasets for sft, pt, ppo, and dpo. A custom dataset can follow the format of those datasets; jsonl is the preferred format.
A typical script for training with DeepSpeed plus the trainer follows; to use LoRA or QLoRA, just pass --lora True or --qlora True.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export WANDB_MODE=disabled
OUTPUT_DIR=./output/alpaca-7b
# --dataset: a .txt file is treated as a pretraining corpus. Several files can be
# passed at once; each filename must contain a registered dataset name such as
# "alpaca". To add a new dataset, extend the registration dict in
# training/dataset/sft_dataset/__init__.py and add a dataset .py file in that
# directory to process it.
torchrun --nproc_per_node=8 train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/ \
    --dataset alpaca_data_1k.json \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 2 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --save_strategy "epoch" \
    --save_steps 2 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --lr_scheduler_type "constant" \
    --logging_steps 1 \
    --deepspeed configs/ds_z3_bf16.json
The DeepSpeed config file is shown below. train_batch_size is the global batch size: the per-device batch size times the number of gradient accumulation steps times the number of GPUs (the arithmetic is spot-checked in the snippet after the config).
ZeRO optimization is a trade-off. ZeRO-0 is essentially DDP: the fastest option, but with the largest memory footprint; ZeRO-3 is the slowest but uses the least memory. On A800s, ZeRO-1 is usually sufficient.
{
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
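As a quick sanity check, here is the batch-size arithmetic for the script above (plain Python; the values come from the torchrun flags):
# Values from the training script above.
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_gpus = 8  # --nproc_per_node=8

# With the HF Trainer integration, "train_batch_size": "auto" resolves to this product.
train_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(train_batch_size)  # 128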
LLMBox: Utilization
For inference, LLMBox provides KV-cache acceleration, enhancement methods such as ICL and CoT, inference accelerations such as vLLM and Flash Attention, and quantization methods such as BnB and GPTQ.
Dataset
The official documentation covers a number of common datasets, but to evaluate on your own dataset you need to subclass GenerationDataset or MultipleChoiceDataset. Taking GenerationDataset as an example, you can customize generation hyperparameters (temperature, stop strings, and so on) and metrics (BLEU, EM, F1, GPTEval, Pass@K, etc.). A generic single-turn dialogue Dataset template looks like this:
from functools import cached_property

# GenerationDataset, Em, and F1 come from LLMBox's utilization package
# (import paths omitted here); GAOKAO_TASKS below is dataset-specific.
from .utils import load_raw_dataset_from_file, get_raw_dataset_loader

INSTRUCTION = """Answer the following question.
{question}
Answer:"""

NO_INSTRUCTION = "{question}"

class MyDataset(GenerationDataset):
    instruction = None  # a class attribute, defined below; updated at initialization in init_arguments
    metrics = [Em(), F1()]
    evaluation_set = ...  # the dataset split to load
    load_args = ...  # args for HF load_dataset; may be left undefined if load_raw_dataset below loads a local dataset
    ...

    def init_arguments(self):
        if self.model_type == "base":
            self.instruction = NO_INSTRUCTION
        else:
            self.instruction = INSTRUCTION
        self.extra_model_args["max_tokens"] = 4096  # a generation argument

    def format_instance(self, instance):
        # Fill the keys of `instruction` with fields of the instance; in this
        # example only `question` is needed. Also return the expected output
        # of this sample as `target`.
        return dict(
            question=instance["question"],
            target=instance["answer"],
        )

    @cached_property
    def references(self):
        references = []
        for instance in self.evaluation_data:
            references.append([{
                "answer": instance["answer"],
                "task": GAOKAO_TASKS[self.subset_name],
                "score": instance["score"],
            }])
        return references

    def load_raw_dataset(self, dataset_path, subset_name, evaluation_set, example_set):
        self.evaluation_data = get_raw_dataset_loader(...)("test")
        self.example_data = load_raw_dataset_from_file("examples.json")  # example_data is optional
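To see how the pieces fit together, here is a purely illustrative sketch (not LLMBox internals): the framework fills the class's instruction template with the dict returned by format_instance and keeps target as the reference answer.
INSTRUCTION = """Answer the following question.
{question}
Answer:"""

instance = {"question": "What is the capital of France?", "answer": "Paris"}

# What format_instance would return for this instance.
fields = dict(question=instance["question"], target=instance["answer"])

# str.format ignores unused keyword arguments, so `target` simply stays out of the prompt.
prompt = INSTRUCTION.format(**fields)
print(prompt)
# Answer the following question.
# What is the capital of France?
# Answer: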
Usage
python inference.py
Available arguments:
--model_name_or_path MODEL_NAME_OR_PATH, --model MODEL_NAME_OR_PATH, -m MODEL_NAME_OR_PATH
                        The model name or path, e.g., davinci-002, meta-llama/Llama-2-7b-hf, ./mymodel (default: None)
--model_type {base,instruction}
                        The type of the model, which can be chosen from `base` or `instruction`. (default: base)
--model_backend {anthropic,dashscope,huggingface,openai,qianfan,vllm}
                        The model backend
--device_map DEVICE_MAP
                        The device map for model and data (default: auto)
--vllm [VLLM]           Whether to use vllm (default: False)
--flash_attention [FLASH_ATTENTION]
                        Whether to use flash attention (default: True)
--no_flash_attention    Disable flash attention (default: False)
--tokenizer_name_or_path TOKENIZER_NAME_OR_PATH, --tokenizer TOKENIZER_NAME_OR_PATH
                        The tokenizer name or path, e.g., cl100k_base, meta-llama/Llama-2-7b-hf, ./mymodel
To evaluate API-based models, pass the corresponding secret key:
--openai_api_key OPENAI_API_KEY
                        The OpenAI API key (default: None)
--anthropic_api_key ANTHROPIC_API_KEY
                        The Anthropic API key (default: None)
--dashscope_api_key DASHSCOPE_API_KEY
                        The Dashscope API key (default: None)
--qianfan_access_key QIANFAN_ACCESS_KEY
                        The Qianfan access key (default: None)
--qianfan_secret_key QIANFAN_SECRET_KEY
                        The Qianfan secret key (default: None)
Other generation parameters and hyperparameters such as model quantization:
--max_tokens MAX_TOKENS
                        The maximum number of tokens for output generation (default: None)
--max_length MAX_LENGTH
                        The maximum number of tokens of model input sequence (default: None)
--temperature TEMPERATURE
                        The temperature for models (default: None)
--top_p TOP_P           The model considers the results of the tokens with top_p probability mass. (default: None)
--top_k TOP_K           The model considers the token with top_k probability. (default: None)
--frequency_penalty FREQUENCY_PENALTY
                        Positive values penalize new tokens based on their existing frequency in the generated text, vice versa. (default: None)
--repetition_penalty REPETITION_PENALTY
                        Values>1 penalize new tokens based on their existing frequency in the prompt and generated text, vice versa. (default: None)
--presence_penalty PRESENCE_PENALTY
                        Positive values penalize new tokens based on whether they appear in the generated text, vice versa. (default: None)
--stop STOP [STOP ...]
                        List of strings that stop the generation when they are generated. E.g. --stop 'stop' 'sequence' (default: None)
--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE
                        All ngrams of that size can only occur once. (default: None)
--best_of BEST_OF, --num_beams BEST_OF
                        The beam size for beam search (default: None)
--length_penalty LENGTH_PENALTY
                        Positive values encourage longer sequences, vice versa. Used in beam search. (default: None)
--early_stopping [EARLY_STOPPING]
                        Whether to stop beam search once enough finished candidates are found. Used in beam search. (default: None)
--system_prompt SYSTEM_PROMPT, -sys SYSTEM_PROMPT
                        The system prompt for chat-based models
--chat_template CHAT_TEMPLATE
                        The chat template for local chat-based models. Supports the model's default chat template (choose from 'base', 'llama2', 'chatml', 'zephyr', 'phi3', 'llama3', ...) or a standard HuggingFace tokenizers chat template
--bnb_config BNB_CONFIG
                        JSON string for BitsAndBytesConfig parameters.
--load_in_8bit [LOAD_IN_8BIT]
                        Whether to use bnb's 8-bit quantization to load the model. (default: False)
--load_in_4bit [LOAD_IN_4BIT]
                        Whether to use bnb's 4-bit quantization to load the model. (default: False)
--gptq [GPTQ]           Whether the model is a gptq quantized model. (default: False)
--vllm_gpu_memory_utilization VLLM_GPU_MEMORY_UTILIZATION
                        The maximum gpu memory utilization of vllm. (default: None)
--torch_dtype {float16,bfloat16,float32}
                        The torch dtype for model input and output
Arguments related to the evaluation datasets:
--dataset_names DATASET [DATASET ...], -d DATASET [DATASET ...], --dataset DATASET [DATASET ...]
                        Space-separated dataset names. If only one dataset is specified, it can be followed by subset names or category names. Format: 'dataset1 dataset2', 'dataset:subset1,subset2', or 'dataset:[cat1],[cat2]', e.g., 'copa race', 'race:high', 'wmt16:en-ro,en-fr', or 'mmlu:[stem],[humanities]'. (default: None)
--batch_size BATCH_SIZE, -bsz BATCH_SIZE, -b BATCH_SIZE
                        The evaluation batch size. Specify an integer (e.g., '10') to use a fixed batch size for all iterations. Alternatively, append ':auto' (e.g., '10:auto') to start with the specified batch size and automatically adjust it in subsequent iterations to maintain constant CUDA memory usage (default: 1)
--dataset_path DATASET_PATH
                        The path of dataset if loading from local. Supports repository cloned from huggingface, dataset saved by `save_to_disk`, or a template string e.g. 'mmlu/{split}/{subset}_{split}.csv'. (default: None)
--evaluation_set EVALUATION_SET
                        The set name for evaluation, supporting slice, e.g., validation, test, validation[:10] (default: None)
--example_set EXAMPLE_SET
                        The set name for demonstration, supporting slice, e.g., train, dev, train[:10] (default: None)
--instruction INSTRUCTION
                        The format to format the instruction for each instance. Either f-string or jinja2 format is supported. E.g., 'Answer the following question: {question}\nAnswer:'
--num_shots NUM_SHOTS, -shots NUM_SHOTS
                        The few-shot number for demonstration (default: 0)
--max_example_tokens MAX_EXAMPLE_TOKENS
                        The maximum token number of demonstration (default: 1024)
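Putting these together, a typical local evaluation run could look like the following (an illustrative command assembled from the flags above; the model, dataset, and values are only examples):
python inference.py -m meta-llama/Llama-2-7b-hf -d mmlu:[stem] --num_shots 5 -b 16:auto --vllm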
Tracking model training with the SwanLab dashboard
During training there are too many hyperparameters and settings to track by hand, so it is worth using SwanLab's service; this section briefly shows how. SwanLab is a substitute for wandb: wandb is developed by a team abroad, so it may be blocked and data uploads may fail.
First, install SwanLab in the virtual environment, register, and log in with your SwanLab API key:
pip install swanlab
swanlab login
Use swanlab login --relogin to log in again. What to record falls into the following parts (a short sketch follows this list):
- swanlab.init(): project selects which SwanLab project the run is logged under; experiment_name names the experiment; description describes it; config is a dict of hyperparameters (it can later be modified as swanlab.config.learning_rate = 0.1, or config can be set directly to the path of a JSON file holding the experiment settings, e.g. config="swanlab-init-config.json"). The most convenient approach is to pass in argparse results: args = parser.parse_args(); config=args.
- swanlab.log(): records per-step outputs such as loss and acc as a dict and automatically renders charts. Dict values must be int, float, or a SwanLab multimedia class. On each call, every key's value is appended to that key's history. A key like "val/loss" groups the loss chart under val. The step parameter selects which step the dict is recorded at. Multimedia values can be images, audio, or text; for text, wrap with swanlab.Text("example", caption=f"{i}") and pass it as a dict value to swanlab.log().
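A minimal sketch tying these together (the project, experiment, and metric names are made up):
import argparse

import swanlab

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=1e-4)
args = parser.parse_args()

# Hand the parsed arguments straight to config so every flag is tracked.
swanlab.init(
    project="llmbox-notes",        # hypothetical project name
    experiment_name="sft-demo",    # hypothetical experiment name
    description="demo run for these notes",
    config=args,
)

for step in range(100):
    loss = 1.0 / (step + 1)  # dummy value standing in for the real training loss
    # The "val/" prefix groups this chart under the val section of the dashboard.
    swanlab.log({"val/loss": loss}, step=step)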
Tricks for improving efficiency:
- Limit the number of distinct metrics: if too many kinds of metrics are logged (>10k), chart rendering may slow down. For each metric, keep the number of logged points below 50k.
accelerate+swanlab
from swanlab.integration.accelerate import SwanLabTracker
from accelerate import Accelerator
...
# Create the SwanLab tracker
tracker = SwanLabTracker("YOUR_SMART_PROJECT_NAME")
# Pass it to the Accelerator
accelerator = Accelerator(log_with=tracker)
# Initialize all trackers
accelerator.init_trackers("YOUR_SMART_PROJECT_NAME", config=config)
# Log during training
accelerator.log({"training_loss": loss, "epoch_num": ep})
trainer+swanlab
from swanlab.integration.huggingface import SwanLabCallback
from transformers import Trainer, TrainingArguments
...
# Instantiate SwanLabCallback; only the project name is needed
swanlab_callback = SwanLabCallback(project="hf-visualization")
trainer = Trainer(
    ...
    # Pass it in via the callbacks argument
    callbacks=[swanlab_callback],
)
trainer.train()
A concrete example:
import evaluate
import numpy as np
import swanlab
from swanlab.integration.huggingface import SwanLabCallback
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenized_datasets = dataset.map(tokenize_function, batched=True)
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
metric = evaluate.load("accuracy")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

training_args = TrainingArguments(
    output_dir="test_trainer",
    # Set report_to to "none" if SwanLab should be the only experiment tracker
    report_to="none",
    num_train_epochs=3,
    logging_steps=50,
)

# Instantiate the SwanLabCallback
swanlab_callback = SwanLabCallback(experiment_name="TransformersTest")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
    # Pass it in via the callbacks argument
    callbacks=[swanlab_callback],
)

trainer.train()
You can customize lifecycle hooks by subclassing the callback, for example to run inference on a test set and log the results at the end of each epoch:
class NLPSwanLabCallback(SwanLabCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        # A small test set
        test_text_list = ["example1", "example2"]
        log_text_list = []
        for text in test_text_list:
            result = model(text)  # `model` is whatever inference callable is in scope
            log_text_list.append(swanlab.Text(result))
        # At the step where this epoch ends, log the predictions on the test set
        swanlab.log({"Prediction": log_text_list}, step=state.global_step)