Hive Syntax and Functions: Study Notes

Shopee海購坊 / General / 2025-05-27


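1. Sorting

1.1 Sorting with cluster by

In Hive, order by produces a total ordering of the result, and to do so it sends every row through a single reducer. When a very large volume of data has to be sorted, a "bucket first, then sort" strategy is more efficient, and that is what the cluster by syntax provides. cluster by distributes the rows into buckets by the given column and then sorts each bucket by that same column in ascending order. Put simply: one column, bucketed first and then sorted. The buckets are virtual (one per reducer) and Hive handles them automatically, so the output is sorted within each bucket rather than across the whole result.

Syntax of cluster by:

select * from table_name cluster by column_name; -- ascending order

-- set the number of reducers (buckets) for the session
set mapreduce.job.reduces = number_of_buckets;
-- show the current number of reducers
set mapreduce.job.reduces;

Of course, when the data volume is small (say, below the terabyte range), Hive has no advantage here.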

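-- show the current number of reducers
set mapreduce.job.reduces; -- the default is -1
set mapreduce.job.reduces = -1;

-- order by
select *
from tb_student
order by score; -- small data: efficient, no bucketing involved

-- cluster by
select *
from tb_student
cluster by score; -- very large data: sorting is more efficient

-- Compare the run times:
-- 1. Run order by and cluster by as they are: with a single reducer the sorted output is the same.
-- 2. Set the number of buckets, then compare the run times again.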

Use Hive's cluster by when you need to bucket first and then sort. Note that cluster by sorts the column in ascending order only.

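1.2 Sorting with distribute by + sort by

Use this form to group first and then sort:

select * from table_name distribute by column_name sort by column_name;

Notes:
(1) distribute by groups the rows by the given column first, so rows with the same value go to the same reducer;
(2) sort by sorts by a column within each group;
(3) if and only if distribute by and sort by name the same column (and the sort is ascending), the result is equivalent to cluster by.

Creating a bucketed table with a sort column: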

create [external] table table_name(
    column_name column_type [comment 'comment'],
    ...
)
[clustered by (column_name) sorted by (column_name) into number_of_buckets buckets]
[row format delimited
fields terminated by 'delimiter'];

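Summary of the sorting options:
(1) order by: ordinary global sort;
(2) over(order by ...): sorting inside a window function;
(3) cluster by: bucket first, then sort;
(4) distribute by + sort by: distribute first, then sort;
(5) clustered by + sorted by: create a bucketed table whose buckets are sorted.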

-- distribute by + sort by
select *
from tb_student
distribute by gender
sort by score;

-- create a bucketed table with a sort column
create table tb_bucket_student(
    id int,
    name string,
    gender string,
    score double
)
clustered by (gender) sorted by (score) into 3 buckets
row format delimited
fields terminated by ",";

show tables;

-- load the data from HDFS
load data inpath "/itheima/student_data.txt" into table tb_bucket_student;

select * from tb_bucket_student;

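(1) Used together, distribute by + sort by express the idea of distributing first and sorting afterwards; (2) note: to speed up access to very large tables, you would normally partition or bucket the table.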
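One thing distribute by + sort by can do that cluster by cannot is sort in descending order, or sort by a different column from the one used for distribution. The following is a minimal sketch, assuming the tb_student table from the earlier examples; the reducer count of 2 is an arbitrary value chosen only for illustration.

```sql
-- Assumption: 2 reducers, picked only for illustration.
set mapreduce.job.reduces = 2;

-- Rows with the same gender go to the same reducer;
-- within each reducer, rows are sorted by score from high to low.
select *
from tb_student
distribute by gender
sort by score desc;

-- Restore the default so that Hive chooses the reducer count itself.
set mapreduce.job.reduces = -1;
```

Because the sort happens inside each reducer, the result is ordered per group, not globally.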

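2. Regular expressions

Typical use: when a website registers a new user, the user name, phone number and similar fields are validated with regular expressions. In Hive, RLIKE performs regular-expression matching:

select * | column1, column2, ... from table_name where column_name rlike "regex";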

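-- orders from Guangdong Province (廣東省) with a total above 5000
select *
from tb_orders
where
userAddress rlike ".*廣東省.*"
and
totalMoney > 5000;

-- addresses of the form province (省), city (市), district (區)
select *
from tb_orders
where userAddress rlike ".*省 .*市 .*區.*";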

A regular expression is just a specially formed string; its syntax rules only become second nature with plenty of practice and thought.
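To make the matching behaviour concrete, here is a small sketch that runs rlike on string literals. The phone-number rule (11 digits, starting with 1, second digit 3 to 9) is a hypothetical rule made up for illustration, not one taken from these notes.

```sql
-- rlike is true when any substring matches, so ".*" padding is optional.
select "hello hive" rlike "hive";    -- true
select "hello hive" rlike "^hive";   -- false: the string does not start with "hive"

-- Hypothetical phone-number rule (an assumption for illustration):
-- 11 digits, first digit 1, second digit 3 to 9.
select "13812345678" rlike "^1[3-9][0-9]{9}$";   -- true
select "1381234567"  rlike "^1[3-9][0-9]{9}$";   -- false: only 10 digits
```

Anchoring with ^ and $ is what turns a "contains" check into a whole-value validation.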

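3. union and CTE syntax

3.1 union

A join query merges several tables horizontally, side by side. Result sets can also be merged vertically, that is, appended to one another, and that is the job of union: it combines the result sets of several SELECT statements into a single result set. The syntax: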

select statement_1
union [ all | distinct ]
select statement_2
[ union [ all | distinct ] select statement ...];

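Notes:
(1) union all does not remove duplicate rows;
(2) union distinct removes duplicates; since Hive 1.2.0 distinct is the default, so a bare union behaves like union distinct and the keyword is optional;
(3) every select statement must return the same number of columns with the same names, otherwise a schema error is raised.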

-- show all rows, duplicates included
select * from tb_course1
union all
select * from tb_course2;

select * from tb_course1 union select * from tb_course2; -- removes duplicates by default

-- remove duplicates explicitly
select * from tb_course1
union distinct
select * from tb_course2;

-- union first, then filter
select *
from
(select * from tb_course1
union all
select * from tb_course2) temp_course
where name in ("周杰輪", "王力鴻");
-- where name="周杰輪" or name="王力鴻";

(1) union merges several SELECT result sets, but their schemas (column names, types and so on) must agree; (2) a bare union already removes duplicate rows, and writing union distinct simply makes that intent explicit.
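The difference between union all and a bare union can be checked without any table. This sketch assumes Hive 1.2.0 or later (where union distinct is supported and is the default) and uses FROM-less selects on literal values.

```sql
-- union all keeps duplicates: returns two rows, 1 and 1
select 1 as id
union all
select 1 as id;

-- a bare union removes duplicates: returns one row, 1
select 1 as id
union
select 1 as id;
```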

3.2 CTE syntax

A CTE (Common Table Expression) is a temporary named result set defined inside a query, which can then be used in the from clause. The syntax:

with alias as
(select_query)
[, alias as (select_query), ...]
select_query;

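Notes: (1) each CTE is defined once and can be referenced any number of times, but it ceases to exist as soon as the statement finishes; (2) a CTE is defined, and given its alias, only within the scope of a single statement. [See also Hive's from-first query form.]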

with stu as (
    select * from tb_student
)
select * from stu;

-- define the alias, reference it, then filter
with stu as (
    select * from tb_student
)
select * from stu where stu.gender="男"; -- qualified column name, easier to follow

with stu as (
    select * from tb_student
)
select * from stu where gender="男";

A with clause can be combined with union. A handy way to think about union: the union of several tables can be treated as one complete table.
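As a sketch of with and union working together, the query below wraps the union of tb_course1 and tb_course2 in one CTE and filters it in a second one. The CTE names all_course and picked are made up for illustration, and the name column is assumed from the earlier filter example.

```sql
with all_course as (
    -- treat the union of both tables as one complete table
    select * from tb_course1
    union all
    select * from tb_course2
),
picked as (
    -- a later CTE may reference an earlier one
    select * from all_course where name in ("周杰輪", "王力鴻")
)
select name, count(*) as cnt
from picked
group by name;
```

Naming each step keeps the final select short and avoids nesting subqueries.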

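4. Sampling and virtual columns

4.1 Sampling with tablesample

The problem it solves: when a data set is so large that processing all of it is impractical, being able to pull out just part of it becomes essential. In real enterprise big-data environments, very large tables are common, for example tables at TB or PB scale. On such a table even a simple SELECT * is very slow, and even a LIMIT 10 to look at ten rows may end up running a MapReduce job. That kind of wait is long and unreasonable.

Hive supports sampling through the tablesample syntax: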

select * from table_name tablesample (bucket x out of y [on colname | rand()]);

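Notes:
(1) y is the number of buckets; a value of 5, for example, means five buckets;
(2) x is the number of the bucket to sample, counted from 1, and colname is the column to sample on (the column the rows are bucketed by);
(3) rand() samples on the entire row rather than on a single column;
(4) read it as: split the rows into y buckets by colname and take bucket x.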

select *
from tb_orders
-- tablesample ( bucket 1 out of 6 on userName); -- data skew
tablesample ( bucket 2 out of 6 on userName); -- data skew

select *
from tb_orders
tablesample ( bucket 4 out of 5 on orderNo);

select *
from tb_orders
tablesample ( bucket 2 out of 10 on rand());

(1) To sample part of a very large table quickly, use the tablesample clause; (2) working on a sample makes it much faster to fetch a slice of the data, which is handy when debugging programs that run over huge data sets.
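The "data skew" remarks in the examples can be checked by counting how many rows land in each bucket. This is only a sketch, assuming the tb_orders table and its userName column from the examples; bucket_no and cnt are illustrative aliases.

```sql
-- Compare the sizes of two of the six buckets built on userName.
-- Very different counts suggest that userName distributes rows unevenly.
select 1 as bucket_no, count(*) as cnt
from tb_orders tablesample (bucket 1 out of 6 on userName)
union all
select 2 as bucket_no, count(*) as cnt
from tb_orders tablesample (bucket 2 out of 6 on userName);
```

If the counts differ widely, sampling on rand() instead of a column usually gives more even buckets.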

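4.2 Virtual columns

A virtual column is a column that does not physically exist in the table's data; in a partitioned table, the partition column is one example. Partitioning a Hive table is very useful for storing data that grows massively every day. To keep HiveQL running efficiently, it is strongly recommended to restrict queries on the partition column in the where clause. [Web logs are a typical example.]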

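Hive provides three virtual columns:

(1) INPUT__FILE__NAME
The file that the row comes from.

(2) BLOCK__OFFSET__INSIDE__FILE
The offset of the row within that file.

(3) ROW__OFFSET__INSIDE__BLOCK
The row's offset within its HDFS block. It is not offered by auto-completion and is disabled by default, so a parameter has to be set first.

Here, offset means the position of the read pointer when the data is fetched.

ROW__OFFSET__INSIDE__BLOCK requires the following parameter: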

-- check whether row offsets within the HDFS block are enabled
set hive.exec.rowoffset;
-- enable
set hive.exec.rowoffset=true;
-- to disable again, set it back to false
set hive.exec.rowoffset=false;

use sz41db_bucket;
show tables;

select
    INPUT__FILE__NAME,
    BLOCK__OFFSET__INSIDE__FILE
from bucket_id_course;

(1) Put simply, virtual columns are a few special markers that Hive builds into queries and that can be selected directly; (2) to show the data file name in a query result, use the INPUT__FILE__NAME virtual column.
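A common use of INPUT__FILE__NAME is to see how rows are spread across the underlying files, for example to confirm that a bucketed table really has one file per bucket. A minimal sketch, assuming the bucket_id_course table from the example; file_name and row_cnt are illustrative aliases.

```sql
-- Count the rows stored in each data file of the table.
select
    INPUT__FILE__NAME as file_name,
    count(*)          as row_cnt
from bucket_id_course
group by INPUT__FILE__NAME;
```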

5. Basic Hive functions

How are Hive functions classified? Some functions can be called directly, current_database() for instance. Hive functions fall into two broad categories:

(1) Built-in functions
    math functions
    date functions
    string functions
    conditional functions
    type conversion functions
    data masking functions

(2) User-defined functions
    UDF (User Defined Functions): ordinary user-defined functions
    UDAF (User Defined Aggregate Functions): user-defined aggregate functions
    UDTF (User Defined Table-generating Functions): user-defined table-generating functions

Built-in functions are the basic Hive functions; user-defined functions are the advanced ones.

-- list all available functions
show functions;
-- show how a function is used
desc function extended function_name;

To see how a particular function is used, run desc function extended function_name to read its help text.

In Hive, a function is called as: select function_name(xx);

5.1 Math functions

rand(): returns a random number in the range 0 to 1. Return type: double.
round(x [, y]): rounds to an integer, or to y decimal places (half up). Return type: double.

select round(3.141592654,2);
select round(3.141592654);

select rand()*100;
select round(rand()*100);

Use round() to keep a given number of decimal places. In general, math functions are for processing numeric values.
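Combining rand() with floor() is a common way to get a random integer in a fixed range. The sketch below uses only built-in functions; the range 1 to 100 and the seed 42 are arbitrary values picked for illustration.

```sql
-- Random integer from 1 to 100:
-- rand() is in [0, 1), so floor(rand() * 100) is 0..99, plus 1 gives 1..100.
select floor(rand() * 100) + 1;

-- rand(seed) produces a repeatable sequence for the same seed (42 is arbitrary).
select rand(42);

-- Keep four decimal places.
select round(3.141592654, 4);   -- 3.1416
```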

5.2 Date functions

select current_date();

desc function extended year; -- useful
select year(current_date());
select year(current_timestamp());
select year("2023-11-14");

-- desc function extended month;
select month(current_date());
select day(current_date());

desc function extended hour;
select hour(current_timestamp());
select minute(current_timestamp());
select second(current_timestamp());

Whenever dates and times need processing, reach for these common Hive date functions.
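Beyond extracting parts of a date, Hive's built-in date functions also cover simple date arithmetic. A short sketch on literal dates (the dates themselves are arbitrary sample values):

```sql
select date_add("2023-11-14", 7);              -- 2023-11-21
select date_sub("2023-11-14", 14);             -- 2023-10-31
select datediff("2023-11-14", "2023-11-01");   -- 13 (days between the two dates)
select date_format("2023-11-14", "yyyy/MM");   -- 2023/11
```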

5.3 String functions

Commonly used string functions in Hive:

-- concatenating and splitting
select concat("hello","WORLD");
select concat_ws("=","hello","WORLD");
-- 1-10-100-20
select split("1-10-100-20","-");
select split("1-10-100-20","-")[0];

-- length and case
-- Hello Heima
select length("Hello Heima");
select lower("Hello Heima");
select upper("Hello Heima");

-- extracting the year from 2022-08-22 17:28:01
-- with the date function year()
select year("2022-08-22 17:28:01");
-- with substr
select substr("2022-08-22 17:28:01",0,3); -- returns "202": the third argument is a length, not an end position
select substr("2022-08-22 17:28:01",0,4);
-- select substring()
-- with split
select split("2022-08-22 17:28:01","-")[0];

String functions are typically used on string-typed data such as string and varchar.
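Since substr takes a start position and a length, and positions are counted from 1, a few calls on the same timestamp string make the rules easier to remember. This is a small sketch using only the built-in substr:

```sql
-- substr(string, start, length): start is 1-based, the third argument is a length.
select substr("2022-08-22 17:28:01", 1, 4);   -- 2022
select substr("2022-08-22 17:28:01", 6, 2);   -- 08

-- Without a length, everything from the start position onwards is returned.
select substr("2022-08-22 17:28:01", 12);     -- 17:28:01

-- A negative start counts from the end of the string.
select substr("2022-08-22 17:28:01", -8);     -- 17:28:01
```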

5.4 Conditional functions and type conversion

The type conversion function is cast(expr as <type>), which forces the value of expr into the given type. For example, cast('1' as int) converts the string '1' to an integer.

select current_database();

-- if
select if(1=1,"男","女");
select if(1=0,"男","女"); -- equality is a single = here; many programming languages use == instead

-- isnull
select isnull(null);
select isnull("hello"); -- accepts a value of any type

-- isnotnull
select isnotnull(null);
select isnotnull("hello");

-- nvl
select nvl(null,18); -- no age given, so default to 18
select nvl(20,18);

-- cast
select cast("100" as int);
select cast(12.14 as string); -- double
select cast("hello" as int); -- returns NULL
-- 1700096276154
select cast(1700096276154/1000 as int); -- 1700096276: seconds (10 digits)

A cast in Hive does not always succeed; when it fails, the result is null.
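Because a failed cast yields NULL rather than an error, it is often paired with nvl or coalesce to supply a fallback value. A minimal sketch on literals; the fallback value 0 is an arbitrary choice for illustration.

```sql
select cast("hello" as int);            -- NULL: "hello" is not a number
select nvl(cast("hello" as int), 0);    -- 0: fallback used
select nvl(cast("100" as int), 0);      -- 100: cast succeeded, fallback ignored

-- coalesce returns the first non-NULL argument.
select coalesce(cast("x" as int), cast("7" as int), 0);   -- 7
```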

5.5 Data masking functions

When data is stored in MySQL, its sensitive parts (ID numbers, phone numbers and so on) must be masked before the data is handed to users. Put simply, the values are masked or encrypted.

select mask_hash("123ABC");
select mask("123ABC");
select mask("AB12aa"); -- XXnnxx

select mask_first_n("AA11nn8989AAAAAAA",4);
select mask_last_n("AA11nn8989AAAAAAA",4);
select mask_show_first_n("it66ABCDE",3);
select mask_show_last_n("it66ABCDE",3);

To mask data, type mask in DataGrip, look at the completion suggestions, and pick the function that fits.
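A typical masking task is hiding most of a phone number while leaving a few digits readable. The sketch below applies the built-in masking functions to a made-up 11-digit number (a hypothetical value used only for illustration); by default digits are replaced with n.

```sql
-- Hypothetical 11-digit phone number, used only for illustration.
select mask_show_last_n("13812345678", 4);    -- nnnnnnn5678
select mask_show_first_n("13812345678", 3);   -- 138nnnnnnnn
```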

5.6 Other functions

select hash("123456"); -- hash algorithm: returns a hash code
select md5("123456"); -- e10adc3949ba59abbe56e057f20f883e (32 characters, irreversible)
select sha1("123456"); -- 7c4a8d09ca3762af61e59520943dc26494f8941b

select length("e10adc3949ba59abbe56e057f20f883e");
select length("7c4a8d09ca3762af61e59520943dc26494f8941b");

-- convert the millisecond timestamp 1700096276154 to a year-month-day date
desc function date_format;
desc function from_unixtime;
-- a. convert milliseconds to seconds, as int
select cast(1700096276154/1000 as int);
-- b. then apply the function
select from_unixtime(cast(1700096276154/1000 as int),"yyyy-MM-dd");
select year(from_unixtime(cast(1700096276154/1000 as int),"yyyy-MM-dd"));

If a new requirement comes up while working with Hive functions, the Hive function documentation is the place to look.

6. Advanced Hive functions

User-defined functions come in three kinds:

(1) UDF (User Defined Functions): ordinary user-defined functions
(2) UDTF (User Defined Table-generating Functions): user-defined table-generating functions
(3) UDAF (User Defined Aggregate Functions): user-defined aggregate functions

Notes: (1) originally, UDF, UDAF and UDTF were categories for user-defined functions only; (2) today the same classification is applied to all Hive functions, built-in and user-defined alike.

(1) UDF: think of it as an ordinary function. One row in, one row out. A common example is the split() function.

select split("10-20-30-40","-");
-- result: ["10","20","30","40"]

(2) UDTF: a table-generating function. One row in, several rows out. A common example is explode().

(3) UDAF: an aggregate function. Several rows in, one row out.
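The three shapes (one in, one out; one in, many out; many in, one out) can be seen side by side with built-in functions on a literal string. This is only a sketch; the aliases num, cnt, biggest and t are illustrative names.

```sql
-- UDF shape: one row in, one row out
select upper("hive");                                 -- HIVE

-- UDTF shape: one row in, several rows out (four rows: 10, 20, 30, 40)
select explode(split("10-20-30-40", "-")) as num;

-- UDAF shape: several rows in, one row out
select count(*) as cnt, max(cast(num as int)) as biggest
from (
    select explode(split("10-20-30-40", "-")) as num
) t;                                                  -- cnt = 4, biggest = 40
```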

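6.1 Window functions

select column_name, ... window_function() over([partition by xx order by xx [asc | desc]]) from table_name;

Notes:
(1) the window function can be an aggregate function such as sum(), count() or avg(), or an analytic function;
(2) the aggregate functions are count(), sum(), avg(), min() and max();
(3) the analytic functions include row_number, rank and dense_rank;
(4) partition by groups the rows and order by sorts them.

When you want to add a computed column alongside the existing rows of a table, use a window function with the over() keyword.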
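As a sketch of the syntax in use, the query below ranks students by score within each gender and adds the group average as an extra column. It assumes that tb_student has the columns name, gender and score (as in the bucketed student table defined earlier); the aliases rn, rk, drk and avg_score are illustrative.

```sql
select
    name,
    gender,
    score,
    -- 1, 2, 3, ... with no ties
    row_number() over (partition by gender order by score desc) as rn,
    -- ties share a rank and leave a gap afterwards (1, 1, 3)
    rank()       over (partition by gender order by score desc) as rk,
    -- ties share a rank without a gap (1, 1, 2)
    dense_rank() over (partition by gender order by score desc) as drk,
    -- aggregate as a window function: every row keeps its group's average
    avg(score)   over (partition by gender) as avg_score
from tb_student;
```

Unlike group by, the window version keeps every input row and simply appends the computed columns.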

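6.2 Processing JSON data

JSON stands for JavaScript Object Notation. JSON is very common for data exchange in many development scenarios. (1) An array is written with square brackets [ ]; (2) an object is written with curly braces { }. Note: Hive has no JSON type; JSON is normally held in a string column and called a JSON string.

get_json_object(json_txt, path) parses a JSON string. The path argument selects the content to extract from the JSON, in the form "$.key".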

select
    get_json_object(data,"$.device")
from json_device;

select
    get_json_object(data,"$.device") device,
    get_json_object(data,"$.deviceType") device_type,
    get_json_object(data,"$.signal") signal,
    get_json_object(data,"$.time") int_time
from json_device;

-- extract the year from the millisecond timestamp in $.time
select
    split(from_unixtime(cast(get_json_object(data,"$.time")/1000 as int),"yyyy/MM/dd"),"/")[0] year
from json_device;
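The path argument can also reach into nested objects and arrays. The following sketch runs get_json_object on a literal JSON string; the JSON document itself is made up for illustration and is not the layout of the json_device table.

```sql
-- Hypothetical JSON literal, used only to illustrate path syntax.
select get_json_object('{"device":{"id":7,"tags":["a","b"]}}', "$.device.id");        -- 7
select get_json_object('{"device":{"id":7,"tags":["a","b"]}}', "$.device.tags[1]");   -- b

-- A key that does not exist yields NULL rather than an error.
select get_json_object('{"device":{"id":7}}', "$.missing");                           -- NULL
```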

6.3 The explode function

explode() is a table-generating function: one row in, several rows out. Put simply, explode() "blows apart" a collection. explode(array | map) splits the collection into separate rows. explode is usually used together with a lateral view. A lateral view works as follows: (1) the UDTF result is built into a view-like table; (2) each row of the original table is then joined with the rows the UDTF produced from it, yielding a new virtual table. Syntax of lateral view:

select ... from table_A alias
lateral view
UDTF(xxx) alias as column1, column2, column3, ...;

create table tb_nba(
    team_name string,
    champion_year array<string>
) row format delimited
fields terminated by ','
collection items terminated by '|';

select * from tb_nba;

-- a. get the champion years on their own
select
    explode(champion_year)
from tb_nba;

-- b. show them next to the team name?
select
    team_name,
    explode(champion_year) -- fails: a UDTF cannot be selected together with other columns
from tb_nba;

-- sort by year in ascending order
select *
from
(select
    a.team_name,
    b.year
from tb_nba a
lateral view
explode(champion_year) b as year) temp_nba
order by temp_nba.year;

-- same query, sorting year as a number instead of as a string
select *
from
(select
    a.team_name,
    b.year
from tb_nba a
lateral view
explode(champion_year) b as year) temp_nba
order by cast(temp_nba.year as int);

Once explode has split the data, pair it with a lateral view if you run into problems in further processing.
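Once the array has been exploded through a lateral view, the resulting rows can be aggregated like any other table. A minimal sketch, assuming the tb_nba table from the example with team_name and an array column champion_year; the aliases titles and first_title are illustrative.

```sql
-- Number of championships and earliest championship year per team.
select
    a.team_name,
    count(*)                 as titles,
    min(cast(b.year as int)) as first_title
from tb_nba a
lateral view explode(champion_year) b as year
group by a.team_name;
```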
