LLaMA
Paper
https://arxiv.org/pdf/2302.13971.pdf
Model Architecture
The LLaMA network is based on the Transformer architecture, incorporating various improvements that were proposed later and used in different models such as PaLM. The main differences from the original architecture are the following (a minimal sketch of all three changes appears after this list):

Pre-normalization. To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.

SwiGLU activation function [PaLM]. The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, using a hidden dimension of (2/3)·4d rather than the 4d used in PaLM.

Rotary embeddings. Absolute positional embeddings are removed; instead, rotary positional embeddings (RoPE) are added at every layer of the network.
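For illustration, a minimal NumPy sketch of these three components (shapes and function names are ours, not the FasterTransformer implementation):

```python
# Minimal sketch of the three LLaMA changes described above; illustrative only.
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Pre-normalization: RMSNorm scales by the root mean square of the
    # features (no mean subtraction, no bias), applied to sub-layer inputs.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: silu(x @ W_gate) * (x @ W_up), projected back down.
    # Hidden width is ~(2/3)*4d rather than 4d (11008 for d=4096, after rounding).
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def apply_rope(x, base=10000.0):
    # Rotary embeddings: rotate each consecutive feature pair by a
    # position-dependent angle; applied to queries/keys at every layer.
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    angles = np.outer(np.arange(seq_len), inv_freq)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```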
Algorithm Overview
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens, showing that state-of-the-art models can be trained using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.
環(huán)境配置
A Docker image for inference can be pulled from the SourceFind (光源) registry:
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:fastertransformer-dtk23.04-latest
# Mount your model directory into the container via -v (the placeholder below is hypothetical)
docker run -it --name llama --shm-size=32G --device=/dev/kfd --device=/dev/dri/ \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 \
    --ipc=host --network host --group-add video \
    -v <host_path>:<container_path> \
    image.sourcefind.cn:5000/dcu/admin/base/custom:fastertransformer-dtk23.04-latest
Image version dependencies:
DTK driver: dtk23.04
PyTorch: 1.10
Python: 3.8
Activate the environment inside the image: source /opt/dtk-23.04/env.sh
數(shù)據(jù)集
無
Inference
Build
mkdir build
cd build
cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON -DCMAKE_CXX_COMPILER=nvcc ..
make -j12
Model Download
llama 7B
llama 13B
llama 30B
llama 65B
Model Conversion
python ../examples/cpp/llama/huggingface_llama_convert.py \
-saved_dir=/data/models/llama-7b-infer/ \
-in_file=/data/models/llama-7b-hf/ \
-infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b
Take the llama-7b conversion above as an example: -in_file is the model input path, -saved_dir the model output path, -infer_gpu_num the tensor-parallel (TP) size for inference, -weight_data_type the inference data type, and -model_name the model name. For other models, adjust the paths and -model_name accordingly.
運(yùn)行 LLama-7b
Generate the gemm_config.in file
data_type = 0 (FP32) or 1 (FP16)
./bin/gpt_gemm 1 1 20 32 128 11008 32000 1 1
The arguments above correspond, in order, to:
./bin/gpt_gemm <batch_size> <beam_width> <max_input_len> <head_num> <size_per_head> <inter_size> <vocab_size> <data_type> <tensor_para_size>
Configure ../examples/cpp/llama/llama_config.ini
When gpt_gemm was run with data_type = 1, set data_type = fp16; when it was run with data_type = 0, set data_type = fp32. tensor_para_size must match the TP size used during model conversion. Set model_name = llama_7B and model_dir to the path of the converted model weights. request_batch_size is the inference batch size and request_output_len is the output length. The starting input token ids can be modified in ../examples/cpp/llama/start_ids.csv. A sketch of these settings follows.
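As a sketch only, the same settings written from Python. The key names are the ones described above; the section names follow the FasterTransformer GPT example and are an assumption, so check the llama_config.ini shipped under examples/cpp/llama/ for the authoritative layout:

```python
# Illustrative only: writing llama_config.ini via configparser.
import configparser

cfg = configparser.ConfigParser()
cfg["ft_instance_hyperparameter"] = {           # section name assumed
    "data_type": "fp16",         # fp16 if gpt_gemm used data_type=1, else fp32
    "tensor_para_size": "1",     # must equal -infer_gpu_num used at conversion
    "model_name": "llama_7B",
    "model_dir": "/data/models/llama-7b-infer/1-gpu",  # hypothetical path
}
cfg["request"] = {                               # section name assumed
    "request_batch_size": "1",   # inference batch size
    "request_output_len": "256", # number of tokens to generate
}

with open("../examples/cpp/llama/llama_config.ini", "w") as f:
    cfg.write(f)
```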
運(yùn)行
./bin/llama_example
This program reads the ids in ../examples/cpp/llama/start_ids.csv as the input tokens; the generated results are saved to the out file.
運(yùn)行 LLama-13b
./bin/gpt_gemm 1 1 20 40 128 13824 32000 1 1
./bin/llama_example
運(yùn)行 LLama-33b
./bin/gpt_gemm 1 1 20 52 128 17920 32000 1 2
mpirun --allow-run-as-root -np 2 ./bin/llama_example
運(yùn)行 LLama-65b
./bin/gpt_gemm 1 1 20 64 128 22016 32000 1 8
mpirun --allow-run-as-root -np 8 ./bin/llama_example
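For reference, a consolidated view of the gpt_gemm arguments used in the four run commands above (values copied verbatim; argument order as reconstructed under "Running LLaMA-7B"):

```python
# gpt_gemm arguments per model, copied from the commands in this section.
GPT_GEMM_ARGS = {
    #            batch beam max_in head  size  inter   vocab  dtype tp
    "llama_7b":  (1,   1,   20,    32,   128,  11008,  32000, 1,    1),
    "llama_13b": (1,   1,   20,    40,   128,  13824,  32000, 1,    1),
    "llama_33b": (1,   1,   20,    52,   128,  17920,  32000, 1,    2),
    "llama_65b": (1,   1,   20,    64,   128,  22016,  32000, 1,    8),
}

# e.g. print the 33b command line:
print("./bin/gpt_gemm " + " ".join(map(str, GPT_GEMM_ARGS["llama_33b"])))
```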
Parameter Configuration Notes
The llama-33b model needs 2 cards (32 GB) for fp16 inference, and the llama-65b model needs 8 cards (32 GB). After downloading a LLaMA model from huggingface, you can inspect its config.json file. In the mapping below, the left-hand side is the FasterTransformer parameter and the right-hand side is the corresponding value in config.json (a small helper that derives these values appears after the list).
head_num=num_attention_heads
size_per_head=hidden_size / num_attention_heads
inter_size=intermediate_size
num_layer=num_hidden_layers
rotary_embedding=size_per_head
layernorm_eps=rms_norm_eps
vocab_size=vocab_size
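As a convenience sketch, a small helper that derives these FasterTransformer parameters from a checkpoint's config.json (field names as listed above):

```python
# Derive FasterTransformer parameters from a HuggingFace config.json.
import json

def ft_params(config_path):
    with open(config_path) as f:
        hf = json.load(f)
    size_per_head = hf["hidden_size"] // hf["num_attention_heads"]
    return {
        "head_num": hf["num_attention_heads"],
        "size_per_head": size_per_head,
        "inter_size": hf["intermediate_size"],
        "num_layer": hf["num_hidden_layers"],
        "rotary_embedding": size_per_head,
        "layernorm_eps": hf["rms_norm_eps"],
        "vocab_size": hf["vocab_size"],
    }

# e.g. ft_params("/data/models/llama-7b-hf/config.json") should give
# head_num=32, size_per_head=128, inter_size=11008, vocab_size=32000.
```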
Results
The generated token ids are written to the out file under build/.
執(zhí)行一下命令可以解析out結(jié)果:
pip install sentencepiece
python ../examples/cpp/llama/llama_tokenizer.py
where `tokenizer` is the path to the original model.
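For illustration, the decode step can also be done directly with sentencepiece; the tokenizer.model path below is hypothetical:

```python
# Decode output ids back to text with the original model's tokenizer.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="/data/models/llama-7b-hf/tokenizer.model")
ids = [306, 4658, 278, 6593, 310, 2834, 338]   # token ids from the test below
print(sp.decode(ids))  # -> "I believe the meaning of life is"
```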
測(cè)試數(shù)據(jù):"I believe the meaning of life is" (token id: 306, 4658, 278, 6593, 310, 2834, 338),使用的加速卡:1張 DCU-Z100L-32G
數(shù)據(jù)類型batch sizetemperateinput lenoutput lenfp16107256
結(jié)果如下:
306 4658 278 6593 310 2834 338 304 5735 372 304 278 2989 342 29889 306 4658 393 591 526 599 1244 363 263 2769 322 393 591 526 599 1244 304 1371 1269 916 29889 306 4658 393 591 526 599 1244 304 5110 322 6548 322 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244
The decoded output is as follows:
I believe the meaning of life is to live it to the fullest. I believe that we are all here for a reason and that we are all here to help each other. I believe that we are all here to learn and grow and that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here
Accuracy
None.
應(yīng)用場(chǎng)景
算法類別
對(duì)話問答
熱點(diǎn)應(yīng)用行業(yè)
金融,科研,教育
Source Repository and Issue Feedback
ModelZoo / LLama_fastertransformer · GitLab
References
GitHub - NVIDIA/FasterTransformer: Transformer related optimization, including BERT, GPT