LLaMA
Paper
https://arxiv.org/pdf/2302.13971.pdf
Model Architecture
The LLaMA network is based on the Transformer architecture, incorporating various improvements that were proposed later and used in different models such as PaLM. The main differences from the original architecture are the following (a minimal sketch of all three changes appears after this list):

Pre-normalization. To improve training stability, the input of each Transformer sub-layer is normalized instead of the output, using the RMSNorm normalization function.

SwiGLU activation function [PaLM]. The ReLU non-linearity is replaced with the SwiGLU activation function to improve performance, using a hidden dimension of (2/3)·4d rather than the 4d used in PaLM.

Rotary embeddings. Absolute positional embeddings are removed; instead, rotary positional embeddings (RoPE) are added at every layer of the network.
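For illustration, a minimal NumPy sketch of these three components (shapes and function names are ours, not the FasterTransformer implementation):

```python
# Minimal sketch of the three LLaMA changes described above; illustrative only.
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Pre-normalization: RMSNorm scales by the root mean square of the
    # features (no mean subtraction, no bias), applied to sub-layer inputs.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: silu(x @ W_gate) * (x @ W_up), projected back down.
    # Hidden width is ~(2/3)*4d rather than 4d (11008 for d=4096, after rounding).
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def apply_rope(x, base=10000.0):
    # Rotary embeddings: rotate each consecutive feature pair by a
    # position-dependent angle; applied to queries/keys at every layer.
    seq_len, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    angles = np.outer(np.arange(seq_len), inv_freq)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```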
Algorithm Overview
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters. The models are trained on trillions of tokens, showing that state-of-the-art models can be trained using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.
環(huán)境配置
A Docker image for inference can be pulled from the SourceFind (光源) registry:
docker pull image.sourcefind.cn:5000/dcu/admin/base/custom:fastertransformer-dtk23.04-latest
# Mount your model directory into the container via -v (the placeholder below is hypothetical)
docker run -it --name llama --shm-size=32G --device=/dev/kfd --device=/dev/dri/ \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --ulimit memlock=-1:-1 \
    --ipc=host --network host --group-add video \
    -v <host_path>:<container_path> \
    image.sourcefind.cn:5000/dcu/admin/base/custom:fastertransformer-dtk23.04-latest
Image version dependencies:
DTK driver: dtk23.04
PyTorch: 1.10
Python: 3.8
Activate the environment inside the image: source /opt/dtk-23.04/env.sh
數(shù)據(jù)集
無
Inference
Build
mkdir build
cd build
cmake -DSM=70 -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON -DCMAKE_CXX_COMPILER=nvcc ..
make -j12
Model Download
llama 7B
llama 13B
llama 30B
llama 65B
Model Conversion
python ../examples/cpp/llama/huggingface_llama_convert.py \
-saved_dir=/data/models/llama-7b-infer/ \
-in_file=/data/models/llama-7b-hf/ \
-infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b
Take the llama-7b conversion above as an example: -in_file is the model input path, -saved_dir the model output path, -infer_gpu_num the tensor-parallel (TP) size for inference, -weight_data_type the inference data type, and -model_name the model name. For other models, adjust the paths and -model_name accordingly.
運(yùn)行 LLama-7b
Generate the gemm_config.in file
data_type = 0 (FP32) or 1 (FP16)
./bin/gpt_gemm 1 1 20 32 128 11008 32000 1 1
The arguments above correspond, in order, to:
./bin/gpt_gemm <batch_size> <beam_width> <max_input_len> <head_num> <size_per_head> <inter_size> <vocab_size> <data_type> <tensor_para_size>
Configure ../examples/cpp/llama/llama_config.ini
When gpt_gemm was run with data_type = 1, set data_type = fp16; when it was run with data_type = 0, set data_type = fp32. tensor_para_size must match the TP size used during model conversion. Set model_name = llama_7B and model_dir to the path of the converted model weights. request_batch_size is the inference batch size and request_output_len is the output length. The starting input token ids can be modified in ../examples/cpp/llama/start_ids.csv. A sketch of these settings follows.
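As a sketch only, the same settings written from Python. The key names are the ones described above; the section names follow the FasterTransformer GPT example and are an assumption, so check the llama_config.ini shipped under examples/cpp/llama/ for the authoritative layout:

```python
# Illustrative only: writing llama_config.ini via configparser.
import configparser

cfg = configparser.ConfigParser()
cfg["ft_instance_hyperparameter"] = {           # section name assumed
    "data_type": "fp16",         # fp16 if gpt_gemm used data_type=1, else fp32
    "tensor_para_size": "1",     # must equal -infer_gpu_num used at conversion
    "model_name": "llama_7B",
    "model_dir": "/data/models/llama-7b-infer/1-gpu",  # hypothetical path
}
cfg["request"] = {                               # section name assumed
    "request_batch_size": "1",   # inference batch size
    "request_output_len": "256", # number of tokens to generate
}

with open("../examples/cpp/llama/llama_config.ini", "w") as f:
    cfg.write(f)
```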
運(yùn)行
./bin/llama_example
This program reads the ids in ../examples/cpp/llama/start_ids.csv as the input tokens; the generated results are saved to the out file.
運(yùn)行 LLama-13b
./bin/gpt_gemm 1 1 20 40 128 13824 32000 1 1
./bin/llama_example
運(yùn)行 LLama-33b
./bin/gpt_gemm 1 1 20 52 128 17920 32000 1 2
mpirun --allow-run-as-root -np 2 ./bin/llama_example
運(yùn)行 LLama-65b
./bin/gpt_gemm 1 1 20 64 128 22016 32000 1 8
mpirun --allow-run-as-root -np 8 ./bin/llama_example
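For reference, a consolidated view of the gpt_gemm arguments used in the four run commands above (values copied verbatim; argument order as reconstructed under "Running LLaMA-7B"):

```python
# gpt_gemm arguments per model, copied from the commands in this section.
GPT_GEMM_ARGS = {
    #            batch beam max_in head  size  inter   vocab  dtype tp
    "llama_7b":  (1,   1,   20,    32,   128,  11008,  32000, 1,    1),
    "llama_13b": (1,   1,   20,    40,   128,  13824,  32000, 1,    1),
    "llama_33b": (1,   1,   20,    52,   128,  17920,  32000, 1,    2),
    "llama_65b": (1,   1,   20,    64,   128,  22016,  32000, 1,    8),
}

# e.g. print the 33b command line:
print("./bin/gpt_gemm " + " ".join(map(str, GPT_GEMM_ARGS["llama_33b"])))
```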
Parameter Configuration Notes
The llama-33b model needs 2 cards (32 GB) for fp16 inference, and the llama-65b model needs 8 cards (32 GB). After downloading a LLaMA model from huggingface, you can inspect its config.json file. In the mapping below, the left-hand side is the FasterTransformer parameter and the right-hand side is the corresponding value in config.json (a small helper that derives these values appears after the list).
head_num=num_attention_heads
size_per_head=hidden_size / num_attention_heads
inter_size=intermediate_size
num_layer=num_hidden_layers
rotary_embedding=size_per_head
layernorm_eps=rms_norm_eps
vocab_size=vocab_size
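As a convenience sketch, a small helper that derives these FasterTransformer parameters from a checkpoint's config.json (field names as listed above):

```python
# Derive FasterTransformer parameters from a HuggingFace config.json.
import json

def ft_params(config_path):
    with open(config_path) as f:
        hf = json.load(f)
    size_per_head = hf["hidden_size"] // hf["num_attention_heads"]
    return {
        "head_num": hf["num_attention_heads"],
        "size_per_head": size_per_head,
        "inter_size": hf["intermediate_size"],
        "num_layer": hf["num_hidden_layers"],
        "rotary_embedding": size_per_head,
        "layernorm_eps": hf["rms_norm_eps"],
        "vocab_size": hf["vocab_size"],
    }

# e.g. ft_params("/data/models/llama-7b-hf/config.json") should give
# head_num=32, size_per_head=128, inter_size=11008, vocab_size=32000.
```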
Results
The generated token ids are written to the out file under build/.
執(zhí)行一下命令可以解析out結(jié)果:
pip install sentencepiece
python ../examples/cpp/llama/llama_tokenizer.py
where `tokenizer` is the path to the original model.
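For illustration, the decode step can also be done directly with sentencepiece; the tokenizer.model path below is hypothetical:

```python
# Decode output ids back to text with the original model's tokenizer.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="/data/models/llama-7b-hf/tokenizer.model")
ids = [306, 4658, 278, 6593, 310, 2834, 338]   # token ids from the test below
print(sp.decode(ids))  # -> "I believe the meaning of life is"
```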
測(cè)試數(shù)據(jù):"I believe the meaning of life is" (token id: 306, 4658, 278, 6593, 310, 2834, 338),使用的加速卡:1張 DCU-Z100L-32G
數(shù)據(jù)類型batch sizetemperateinput lenoutput lenfp16107256
結(jié)果如下:
306 4658 278 6593 310 2834 338 304 5735 372 304 278 2989 342 29889 306 4658 393 591 526 599 1244 363 263 2769 322 393 591 526 599 1244 304 1371 1269 916 29889 306 4658 393 591 526 599 1244 304 5110 322 6548 322 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244 304 1371 1269 916 5110 322 6548 29889 306 4658 393 591 526 599 1244
The decoded output is as follows:
I believe the meaning of life is to live it to the fullest. I believe that we are all here for a reason and that we are all here to help each other. I believe that we are all here to learn and grow and that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here to help each other learn and grow. I believe that we are all here
Accuracy
None.
應(yīng)用場(chǎng)景
算法類別
對(duì)話問答
熱點(diǎn)應(yīng)用行業(yè)
金融,科研,教育
Source Repository and Issue Feedback
ModelZoo / LLama_fastertransformer · GitLab
References
GitHub - NVIDIA/FasterTransformer: Transformer related optimization, including BERT, GPT