Fine-tuning a Pre-trained Model

Note

Before reading this section, make sure you have prepared the Ascend (昇腾) environment and installed transformers by following the Installation Guide!

Fine-tuning a large model is, in essence, further training an already pre-trained model on a dataset from a specific domain. The aim is to optimize the model's performance on a particular task so that it adapts to, and performs better on, tasks in that domain. Building on the transformers library, this tutorial selects a dataset and a pre-trained model, then fine-tunes the model with hyperparameter tuning.

Prerequisites

Install the Required Libraries

pip install transformers datasets evaluate accelerate scikit-learn
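
Optionally, you can verify that PyTorch can see the NPU before going further. This is a minimal check, assuming torch and torch_npu were installed per the Installation Guide:

import torch
import torch_npu  # registers the NPU backend with PyTorch

# confirm that at least one NPU device is visible
print(torch.npu.is_available())   # True if an NPU is usable
print(torch.npu.device_count())   # number of visible NPU devices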

Load the Dataset

Model training requires a dataset. Here we use the Yelp Reviews dataset:

from datasets import load_dataset

# load_dataset downloads the dataset and caches it locally
dataset = load_dataset("yelp_review_full")
# print the 100th example of the training split
dataset["train"][100]

The output is as follows:

{'label': 0, 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\n
The cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the
person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after
me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \"serving off their orders\" when they didn\'t have
their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\nThe
manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that
I felt I was getting poor service.\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect
bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone
in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}
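
You can also inspect the dataset's overall structure. The following sketch uses the standard datasets API; yelp_review_full contains 650,000 training and 50,000 test reviews, each labeled with one of five star ratings (0-4):

# show the splits and their sizes
print(dataset)
# "label" is a ClassLabel feature with 5 classes, "text" is a plain string
print(dataset["train"].features)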

Preprocess the Dataset

Preprocessing the dataset requires AutoTokenizer, which automatically fetches the tokenizer that matches the model. The tokenizer splits text into tokens according to its rules and converts them into tensors that serve as model input. The Meta-Llama-3-8B-Instruct model is used below; to download it, see the 模型获取 (model download) page. An example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# tokenize a piece of text
encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)

The output is as follows:

{'input_ids': [128000, 5519, 539, 1812, 91485, 304, 279, 22747, 315, 89263, 11, 369, 814, 527, 27545, 323, 4062, 311, 19788, 13],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
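
As a sanity check, the input_ids can be decoded back into text (the leading token 128000 is Llama 3's <|begin_of_text|> marker):

# map the token ids back to a string; it should reproduce the sentence,
# prefixed with the special <|begin_of_text|> token
print(tokenizer.decode(encoded_input["input_ids"]))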

Next, preprocess the whole dataset with the dataset.map method:

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

The first preprocessing run takes some time and prints output like the following:

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Map: 100%|████████████████████████████████████████████████████████████████████████| 650000/650000 [03:27<00:00, 3139.47 examples/s]
Map: 100%|██████████████████████████████████████████████████████████████████████████| 50000/50000 [00:15<00:00, 3156.92 examples/s]
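
The two warnings appear because the Llama 3 tokenizer defines neither a padding token nor a model maximum length. If you want genuinely fixed-length, padded batches, a common workaround (an assumption here, not a required step) is to reuse the EOS token for padding and pass an explicit max_length:

# Llama 3 ships without a pad token; reusing EOS for padding is a common convention
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # cap sequences at 512 tokens; adjust to fit your memory budget
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)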

Training on the full dataset takes considerably longer, so it is common to split off smaller training and evaluation subsets to speed things up:

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# to use the full training and evaluation sets instead:
# full_train_dataset = tokenized_datasets["train"]
# full_eval_dataset = tokenized_datasets["test"]
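
Trainer automatically drops dataset columns that the model's forward method does not accept, but if you plan to feed batches to the model yourself, you may want to remove the raw text column explicitly, for example:

# the model only needs input_ids / attention_mask / labels, not the raw text
small_train_dataset = small_train_dataset.remove_columns(["text"])
small_eval_dataset = small_eval_dataset.remove_columns(["text"])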

Training

Load the Model

The AutoModelForCausalLM class loads the model automatically:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
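
Note that yelp_review_full is a five-class classification task. This tutorial fine-tunes the causal LM directly, but transformers also provides AutoModelForSequenceClassification, which attaches a dedicated classification head; a hedged alternative sketch, not the route taken below:

from transformers import AutoModelForSequenceClassification

# attach a randomly initialized 5-way classification head on top of the base model
model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    num_labels=5,
)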

Hyperparameter Tuning

Hyperparameters are the settings and flags that activate different training options; they define higher-level properties of the model, such as its complexity or learning capacity. They are loaded through the TrainingArguments class:

from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch")
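
TrainingArguments exposes many more hyperparameters. A slightly fuller configuration might look like the following; the values are illustrative starting points, not tuned recommendations:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",           # evaluate at the end of every epoch
    learning_rate=2e-5,              # initial learning rate for the AdamW optimizer
    per_device_train_batch_size=8,   # batch size per NPU/GPU
    num_train_epochs=3,              # total number of training epochs
    weight_decay=0.01,               # weight decay applied by the optimizer
)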

Model Evaluation

Model evaluation measures how well a model performs on a given dataset, using metrics such as accuracy, exact match, and mean Intersection over Union (mIoU). Usage is as follows:

import numpy as np
import sklearn
import evaluate

metric = evaluate.load("accuracy")

# compute the accuracy of the predictions before passing them to compute
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
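
To see what compute_metrics returns, you can call it on dummy data; this toy check is purely illustrative:

import numpy as np

# fake logits for two examples over 5 classes, with ground-truth labels [1, 0]
fake_logits = np.array([[0.1, 2.0, 0.3, 0.1, 0.2],
                        [1.5, 0.2, 0.1, 0.1, 0.1]])
fake_labels = np.array([1, 0])
print(compute_metrics((fake_logits, fake_labels)))  # {'accuracy': 1.0}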

Trainer

Create a Trainer object from the loaded model, the training arguments, the training and evaluation datasets, and the evaluation function, then call trainer.train() to fine-tune the model:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
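
After training you will usually want a final evaluation pass and a saved copy of the fine-tuned weights; both are single calls on the Trainer (the output path is illustrative):

# run a final evaluation over the eval dataset
print(trainer.evaluate())

# persist the fine-tuned model and its tokenizer for later use
trainer.save_model("test_trainer/final")
tokenizer.save_pretrained("test_trainer/final")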

The Complete Fine-tuning Workflow

import torch
import torch_npu
import numpy as np
import sklearn
import evaluate
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
device = "npu:0" if torch.npu.is_available() else "cpu"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 halves memory use versus float32
).to(device)  # place the model explicitly; avoid mixing device_map="auto" with .to()

dataset = load_dataset("yelp_review_full")

# tokenization function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# load the evaluation metric
metric = evaluate.load("accuracy")

# define how the evaluation metric is computed
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

When training completes, you get results like the following:

|█████████████████████████████████| [375/375 06:21, Epoch 3/3]

=====  =============  ===============  ========
Epoch  Training Loss  Validation Loss  Accuracy
=====  =============  ===============  ========
1      No log         1.155628         0.499000
2      No log         0.994618         0.574000
3      No log         1.026123         0.590000
=====  =============  ===============  ========

TrainOutput(global_step=375, training_loss=1.0557311197916666, metrics={'train_runtime': 384.55, 'train_samples_per_second': 7.801,
'train_steps_per_second': 0.975, 'total_flos': 789354427392000.0, 'train_loss': 1.0557311197916666, 'epoch': 3.0})
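
As a quick sanity check of the fine-tuned model, you can run inference over the evaluation subset with trainer.predict; a minimal sketch:

# predictions.predictions holds the logits, predictions.metrics the computed metrics
predictions = trainer.predict(small_eval_dataset)
print(predictions.metrics)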