LLMBox


LLMBox Study Notes

This post records my reading notes on the LLMBox framework. The main references are the LLMBox documentation, the DeepSpeed blog, and the official DeepSpeed documentation.

Why learn a large-model training framework?

Training a large model is extremely time-consuming, so accelerating training and saving intermediate training states is critical: if a run fails, training can resume from a checkpoint, and the model's performance can be evaluated at different stages of training. This makes it easier to tune hyperparameters flexibly and to fine-tune for downstream tasks. At inference time, large models can also be accelerated with techniques such as the KV cache. These complex requirements create demand for a unified framework for large-model training and inference. LLMBox unifies training, evaluation, and inference for generative large models, so mastering LLMBox means being able to put large models to work in research. For encoder models such as BERT, the huggingface Trainer framework can be used for training instead.

Dataset and model loading

When downloading HuggingFace datasets, the target address may be blocked and the download can fail, for example:

https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl

In that case, replace the prefix https://huggingface.co with https://hf-mirror.com to download from a mirror inside China. huggingface-cli can download HuggingFace models and datasets to a specified directory:

# environment variable
export HF_ENDPOINT=https://hf-mirror.com
# model
huggingface-cli download --resume-download gpt2 --local-dir gpt2
# dataset
huggingface-cli download --repo-type dataset --resume-download wikitext --local-dir wikitext
# log in to download gated repos
huggingface-cli download --token hf_*** --resume-download meta-llama/Llama-2-7b-hf --local-dir Llama-2-7b-hf
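
The same downloads can also be scripted with huggingface_hub. A minimal sketch, assuming HF_ENDPOINT has already been exported as above so the mirror is picked up when the library is imported:

from huggingface_hub import snapshot_download

# model: resumable download into ./gpt2
snapshot_download(repo_id="gpt2", local_dir="gpt2")
# dataset
snapshot_download(repo_id="wikitext", repo_type="dataset", local_dir="wikitext")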

LLMBox: Download

Installing the dependencies LLMBox needs on a server is somewhat painful. Taking python=3.11 with module load cuda/12.1.1 as an example:

$ pip install -r requirements.txt
...
ModuleNotFoundError: No module named 'packaging'

Resolve it with the following steps:

  1. Skip vllm and flash_attention at first and install the other packages in requirements.txt
  2. pip install --upgrade pip
  3. pip install packaging wheel setuptools
  4. pip install vllm flash-attn --no-build-isolation

If you have not run module load cuda/12.1.1 beforehand, things go badly wrong, so it is advisable to put that line in your bashrc.
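
For example (assuming your cluster's module system; adjust the module name to your environment):

echo 'module load cuda/12.1.1' >> ~/.bashrc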

LLMBox: Training

To train a large model, llmbox provides several training modes such as sft, pt, ppo, and dpo, low-resource PEFT methods such as LoRA and QLoRA, and training acceleration methods such as Flash Attention and Deepspeed.

Trainer

LLMBox ships datasets for sft, pt, ppo, and dpo. A custom dataset can follow the format of those datasets; jsonl is the preferred format (see the sketch below).
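
For reference, a single line of an alpaca-style jsonl file might look like this (field names follow the public alpaca release; treat this as illustrative rather than the exact schema LLMBox expects):

{"instruction": "Give three tips for staying healthy.", "input": "", "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."}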

A typical script for training with deepspeed plus the trainer follows; to use LoRA or QLoRA, just pass --lora True or --qlora True.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export WANDB_MODE=disabled
OUTPUT_DIR=./output/alpaca-7b
# --dataset: a .txt file is treated as a pretraining corpus. Multiple files can be
# specified; each file name must contain a registered dataset name such as alpaca.
# To add a new dataset, extend the registration dict in
# training/dataset/sft_dataset/__init__.py and add a dataset .py file in that
# directory to process it.
torchrun --nproc_per_node=8 train.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path data/ \
    --dataset alpaca_data_1k.json \
    --output_dir $OUTPUT_DIR \
    --num_train_epochs 2 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --save_strategy "epoch" \
    --save_steps 2 \
    --save_total_limit 2 \
    --learning_rate 1e-5 \
    --lr_scheduler_type "constant" \
    --logging_steps 1 \
    --deepspeed configs/ds_z3_bf16.json

The deepspeed configuration file is shown below. train_batch_size is the total batch size, obtained by multiplying the per-device batch size, the number of gradient accumulation steps, and the number of GPUs; for the script above that is 8 × 2 × 8 = 128.

The efficiency of ZeRO optimization is a trade-off. ZeRO-0 amounts to plain DDP, the fastest option but with the highest memory usage, while ZeRO-3 is the slowest but uses the least memory. On A800 GPUs, ZeRO-1 is a workable choice (set "stage": 1 in the config below).

{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

LLMBox: Utilization

For large-model inference, llmbox provides KV-cache acceleration, enhancement methods such as ICL and CoT, inference acceleration methods such as vLLM and Flash Attention, and quantization methods such as BnB/GPTQ.

Dataset

The official documentation lists some common datasets, but to test on your own dataset you need to subclass GenerationDataset or MultipleChoiceDataset. Taking GenerationDataset as an example, you can customize the generation hyperparameters (temperature, stop strings, etc.) and the metrics (BLEU, EM, F1, GPTEval, Pass@K, etc.). A generic single-turn dialogue Dataset template:

from functools import cached_property

from .utils import load_raw_dataset_from_file, get_raw_dataset_loader

INSTRUCTION = """Answer the following question.

{question}
Answer:"""
NO_INSTRUCTION = "{question}"

class MyDataset(GenerationDataset):

    instruction = None  # set in init_arguments below (note: instruction is a class attribute, updated at initialization)
    metrics = [Em(), F1()]
    evaluation_set = ...   # the dataset split to load for evaluation
    load_args = ...  # args for hf load_dataset; may be omitted if load_raw_dataset below loads a local dataset instead
    ...

    def init_arguments(self):
        if self.model_type == "base":
            self.instruction = NO_INSTRUCTION
        else:
            self.instruction = INSTRUCTION
        self.extra_model_args['max_tokens'] = 4096  # generation parameter

    def format_instance(self, instance):
        # Fill the keys of `instruction` with fields of the instance (here: question),
        # and also return the expected output of this sample as `target`.
        return dict(
            question=instance["question"],
            target=instance["answer"],
        )

    @cached_property
    def references(self):
        references = []
        for instance in self.evaluation_data:
            references.append([{
                "answer": instance["answer"],
                "task": GAOKAO_TASKS[self.subset_name],
                "score": instance["score"]
            }])
        return references

    def load_raw_dataset(self, dataset_path, subset_name, evaluation_set, example_set):
        self.evaluation_data = get_raw_dataset_loader(...)("test")
        self.example_data = load_raw_dataset_from_file("examples.json")  # example_data is optional
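
Once the class is registered with LLMBox, evaluation can be launched through inference.py. A hedged example, assuming the dataset was registered under the name mydataset:

python inference.py -m meta-llama/Llama-2-7b-hf -d mydataset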

Usage

python inference.py

Available arguments:

--model_name_or_path MODEL_NAME_OR_PATH, --model MODEL_NAME_OR_PATH, -m MODEL_NAME_OR_PATH
                      The model name or path, e.g., davinci-002, meta-
                      llama/Llama-2-7b-hf, ./mymodel (default: None)
--model_type {base,instruction}
                      The type of the model, which can be chosen from `base`
                      or `instruction`. (default: base)
--model_backend {anthropic,dashscope,huggingface,openai,qianfan,vllm}
                      The model backend
--device_map DEVICE_MAP
                      The device map for model and data (default: auto)
--vllm [VLLM]         Whether to use vllm (default: False)
--flash_attention [FLASH_ATTENTION]
                      Whether to use flash attention (default: True)
--no_flash_attention  Whether to use flash attention (default: False)
--tokenizer_name_or_path TOKENIZER_NAME_OR_PATH, --tokenizer TOKENIZER_NAME_OR_PATH
                      The tokenizer name or path, e.g., cl100k_base, meta-llama/Llama-2-7b-hf, ./mymodel

To evaluate API-based models, pass the corresponding secret key:

--openai_api_key OPENAI_API_KEY
                      The OpenAI API key (default: None)
--anthropic_api_key ANTHROPIC_API_KEY
                      The Anthropic API key (default: None)
--dashscope_api_key DASHSCOPE_API_KEY
                      The Dashscope API key (default: None)
--qianfan_access_key QIANFAN_ACCESS_KEY
                      The Qianfan access key (default: None)
--qianfan_secret_key QIANFAN_SECRET_KEY
                      The Qianfan secret key (default: None)

Remaining generation parameters and hyperparameters such as model quantization:

--max_tokens MAX_TOKENS
                      The maximum number of tokens for output generation
                      (default: None)
--max_length MAX_LENGTH
                      The maximum number of tokens of model input sequence
                      (default: None)
--temperature TEMPERATURE
                      The temperature for models (default: None)
--top_p TOP_P         The model considers the results of the tokens with
                      top_p probability mass. (default: None)
--top_k TOP_K         The model considers the token with top_k probability.
                      (default: None)
--frequency_penalty FREQUENCY_PENALTY
                      Positive values penalize new tokens based on their
                      existing frequency in the generated text, vice versa.
                      (default: None)
--repetition_penalty REPETITION_PENALTY
                      Values>1 penalize new tokens based on their existing
                      frequency in the prompt and generated text, vice
                      versa. (default: None)
--presence_penalty PRESENCE_PENALTY
                      Positive values penalize new tokens based on whether
                      they appear in the generated text, vice versa.
                      (default: None)
--stop STOP [STOP ...]
                      List of strings that stop the generation when they are
                      generated. E.g. --stop 'stop' 'sequence' (default:
                      None)
--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE
                      All ngrams of that size can only occur once. (default:
                      None)
--best_of BEST_OF, --num_beams BEST_OF
                      The beam size for beam search (default: None)
--length_penalty LENGTH_PENALTY
                      Positive values encourage longer sequences, vice
                      versa. Used in beam search. (default: None)
--early_stopping [EARLY_STOPPING]
                      Positive values encourage longer sequences, vice
                      versa. Used in beam search. (default: None)
--system_prompt SYSTEM_PROMPT, -sys SYSTEM_PROMPT
                      The system prompt for chat-based models
--chat_template CHAT_TEMPLATE
                      The chat template for local chat-based models. Supports the model's default chat template (choose from 'base', 'llama2', 'chatml', 'zephyr', 'phi3', 'llama3', ...) or a standard HuggingFace tokenizers chat template
--bnb_config BNB_CONFIG
                      JSON string for BitsAndBytesConfig parameters.
--load_in_8bit [LOAD_IN_8BIT]
                      Whether to use bnb's 8-bit quantization to load the
                      model. (default: False)
--load_in_4bit [LOAD_IN_4BIT]
                      Whether to use bnb's 4-bit quantization to load the
                      model. (default: False)
--gptq [GPTQ]         Whether the model is a gptq quantized model. (default:
                      False)
--vllm_gpu_memory_utilization VLLM_GPU_MEMORY_UTILIZATION
                      The maximum gpu memory utilization of vllm. (default:
                      None)
--torch_dtype {float16,bfloat16,float32}
                      The torch dtype for model input and output

Arguments related to the evaluation dataset:

--dataset_names DATASET [DATASET ...], -d DATASET [DATASET ...], --dataset DATASET [DATASET ...]
                      Space splitted dataset names. If only one dataset is specified, it can be followed by
                      subset names or category names. Format: 'dataset1 dataset2', 'dataset:subset1,subset2', or
                      'dataset:[cat1],[cat2]', e.g., 'copa race', 'race:high', 'wmt16:en-ro,en-fr', or
                      'mmlu:[stem],[humanities]'. (default: None)
--batch_size BATCH_SIZE, -bsz BATCH_SIZE, -b BATCH_SIZE
                      The evaluation batch size. Specify an integer (e.g., '10') to use a fixed batch size for
                      all iterations. Alternatively, append ':auto' (e.g., '10:auto') to start with the specified
                      batch size and automatically adjust it in subsequent iterations to maintain constant CUDA
                      memory usage (default: 1)
--dataset_path DATASET_PATH
                      The path of dataset if loading from local. Supports
                      repository cloned from huggingface, dataset saved by
                      `save_to_disk`, or a template string e.g.
                      'mmlu/{split}/{subset}_{split}.csv'. (default: None)
--evaluation_set EVALUATION_SET
                      The set name for evaluation, supporting slice, e.g.,
                      validation, test, validation[:10] (default: None)
--example_set EXAMPLE_SET
                      The set name for demonstration, supporting slice,
                      e.g., train, dev, train[:10] (default: None)
--instruction INSTRUCTION
                      The format to format the instruction for each instance. Either f-string or jinja2 format is supported. E.g., 'Answer the following question: {question}\nAnswer:'"
--num_shots NUM_SHOTS, -shots NUM_SHOTS
                      The few-shot number for demonstration (default: 0)
--max_example_tokens MAX_EXAMPLE_TOKENS
                      The maximum token number of demonstration (default:
                      1024)
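
Putting these arguments together, a typical invocation might look like the following (model, dataset, and values are illustrative):

python inference.py -m meta-llama/Llama-2-7b-hf --model_type instruction -d mmlu:[stem] -shots 5 --max_tokens 512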

Combining model training with the swanlab dashboard

Training involves far too many hyperparameters and settings to track by hand, so we can lean on the swanlab service; here is a quick introduction. swanlab is a substitute for wandb: wandb is developed by a team abroad, so it may be blocked and uploads may fail.

First, install swanlab in the virtual environment and log in (register an account and use your swanlab API key):

pip install swanlab
swanlab login

You can re-log in with swanlab login --relogin. What needs to be recorded falls into the following parts (see the sketch after this list):

  1. swanlab.init(): project specifies which swanlab project the logs are filed under; experiment_name is the experiment name; description describes the experiment; config is a dict of hyperparameters (it can be modified later in the form swanlab.config.learning_rate=0.1, or config can be set directly to the path of a JSON file containing the experiment settings, e.g. config="swanlab-init-config.json"). The most convenient approach is to pass in argparse results: args = parser.parse_args(); config=args.
  2. swanlab.log(): records per-step outputs such as loss and acc as a dict and automatically generates charts. The dict values must be int, float, or a swanlab multimedia class. On every call, each key is appended to that key's history. A key of the form "val/loss" files loss under the val group. The step parameter specifies which step the dict is recorded at. Multimedia classes cover images, audio, and text; for text, convert with swanlab.Text("example", caption=f'{i}') and then log it as a dict value via swanlab.log().
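
A minimal sketch of both calls (project name, metric names, and the loop are illustrative):

import argparse
import swanlab

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=1e-5)
args = parser.parse_args()

# the most convenient config: pass the parsed argparse namespace directly
swanlab.init(project="llmbox-notes", experiment_name="sft-run-1",
             description="demo run", config=args)

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder value standing in for the training loss
    # "train/loss" files the metric under the "train" group; step fixes its x position
    swanlab.log({"train/loss": loss}, step=step)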

Tricks for better efficiency:

  1. Reduce the number of metric types: if too many distinct metrics are logged (>10k), rendering can slow down. For each metric, it is recommended to log fewer than 50k points.

accelerate+swanlab

from swanlab.integration.accelerate import SwanLabTracker
from accelerate import Accelerator

...

# create the SwanLab tracker
tracker = SwanLabTracker("YOUR_SMART_PROJECT_NAME")
# pass it to the Accelerator
accelerator = Accelerator(log_with=tracker)

# initialize all trackers
accelerator.init_trackers("YOUR_SMART_PROJECT_NAME", config=config)

# log during training
accelerator.log({"training_loss": loss, "epoch_num": ep})

trainer+swanlab

from swanlab.integration.huggingface import SwanLabCallback
from transformers import Trainer, TrainingArguments

...

# instantiate SwanLabCallback; only the project name is needed
swanlab_callback = SwanLabCallback(project="hf-visualization")

trainer = Trainer(
    ...
    # pass it via the callbacks argument
    callbacks=[swanlab_callback],
)

trainer.train()

A complete example:

import evaluate
import numpy as np
import swanlab
from swanlab.integration.huggingface import SwanLabCallback
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


dataset = load_dataset("yelp_review_full")

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

metric = evaluate.load("accuracy")

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

training_args = TrainingArguments(
    output_dir="test_trainer",
    # if you only need SwanLab to track the experiment, set report_to to "none"
    report_to="none",
    num_train_epochs=3,
    logging_steps=50,
)

# instantiate SwanLabCallback
swanlab_callback = SwanLabCallback(experiment_name="TransformersTest")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
    # pass it via the callbacks argument
    callbacks=[swanlab_callback],
)

trainer.train()

You can subclass the callback to customize lifecycle hooks, for example to run inference on a test set and compute metrics at the end of each epoch:

class NLPSwanLabCallback(SwanLabCallback):
    def on_epoch_end(self, args, state, control, **kwargs):
        # test set
        test_text_list = ["example1", "example2"]
        log_text_list = []
        for text in test_text_list:
            result = model(text)
            log_text_list.append(swanlab.Text(result))
        # at the step where this epoch ends, log the inference results on the test set
        swanlab.log({"Prediction": log_text_list}, step=state.global_step)