柚子快報(bào)激活碼778899分享:數(shù)據(jù)倉庫哈哈
柚子快報(bào)激活碼778899分享:數(shù)據(jù)倉庫哈哈
數(shù)據(jù)倉庫
基本概念數(shù)據(jù)庫(database)和數(shù)據(jù)倉庫(Data Warehouse)的異同
整體架構(gòu)分層架構(gòu)方法論ER模型(建模理論)維度模型
何為分層第一層:數(shù)據(jù)源(ODS ER模型)設(shè)計(jì)要點(diǎn)日志表業(yè)務(wù)表1活動(dòng)信息表(全量表)2活動(dòng)規(guī)則表(全量表)3一級(jí)品類表(全量表)4二級(jí)品類表(全量表)5三級(jí)品類表(全量表)6編碼字典表(全量表)7省份表(全量表)8地區(qū)表(全量表)9品牌表(全量表)10購(gòu)物車表(全量表)11優(yōu)惠券信息表(全量表)12商品平臺(tái)屬性表(全量表)13商品表(全量表)14商品銷售屬性值表(全量表)15SPU表(全量表)16營(yíng)銷坑位表(全量表)17營(yíng)銷渠道表(全量表)18購(gòu)物車表(增量表)19評(píng)論表(增量表)20優(yōu)惠券領(lǐng)用表(增量表)21收藏表(增量表)22訂單明細(xì)表(增量表)23訂單明細(xì)活動(dòng)關(guān)聯(lián)表(增量表)24訂單明細(xì)優(yōu)惠券關(guān)聯(lián)表(增量表)25訂單表(增量表)26退單表(增量表)27訂單狀態(tài)流水表(增量表)28支付表(增量表)29退款表(增量表)30用戶表(增量表)31數(shù)據(jù)裝載腳本
第二層:數(shù)據(jù)加工(DWD data warehouse detail)事實(shí)表設(shè)計(jì)(事務(wù)型事實(shí)表)
事務(wù)的原子性事實(shí)表設(shè)計(jì)(周期型快照事實(shí)表)
從當(dāng)前表中取數(shù)據(jù)后再放回去需考慮去重問題,增加retry的容錯(cuò)性事實(shí)表設(shè)計(jì)(累積型快照事實(shí)表)分區(qū)策略
第三層:數(shù)據(jù)統(tǒng)計(jì)(DWS data warehouse summary 提高性能的關(guān)鍵層)第四層:數(shù)據(jù)分析(ADS application data service)表的設(shè)計(jì)(要點(diǎn))優(yōu)化(假)1流量主題1.1各渠道流量統(tǒng)計(jì)
第五層:共通層(DIM dimension)設(shè)計(jì)要點(diǎn)是否創(chuàng)建表維度表設(shè)計(jì)1商品維度表2優(yōu)惠券維度表3活動(dòng)(規(guī)則)維度表4地區(qū)維度表5營(yíng)銷坑位維度表6營(yíng)銷渠道維度表7日期維度表8用戶維度表(拉鏈(壓縮)表)
CTE : 共通表表達(dá)式拉鏈表設(shè)計(jì)任務(wù)調(diào)度器
基本概念
本質(zhì)是對(duì)數(shù)據(jù)進(jìn)行加工處理后對(duì)外提供數(shù)據(jù)服務(wù)
數(shù)據(jù)庫(database)和數(shù)據(jù)倉庫(Data Warehouse)的異同
數(shù)據(jù)庫用于存儲(chǔ)企業(yè)基礎(chǔ),核心的業(yè)務(wù)數(shù)據(jù)從數(shù)據(jù)來源進(jìn)行區(qū)分
數(shù)據(jù)庫:企業(yè)的業(yè)務(wù)系統(tǒng)數(shù)據(jù)倉庫:數(shù)據(jù)庫(后臺(tái)的后臺(tái)) 從數(shù)據(jù)存儲(chǔ)進(jìn)行區(qū)分
數(shù)據(jù)庫:存儲(chǔ)的目的為了可以快速進(jìn)行數(shù)據(jù)查詢操作 索引 : SQL 存儲(chǔ)方式:行式存儲(chǔ) 數(shù)據(jù)量:不能存儲(chǔ)海量數(shù)據(jù)數(shù)據(jù)倉庫:存儲(chǔ)的目的為了可以快速進(jìn)行統(tǒng)計(jì)分析 索引 : 沒有索引(k-v) 存儲(chǔ)方式:列式存儲(chǔ) 數(shù)據(jù)量:必須存儲(chǔ)海量數(shù)據(jù) 從數(shù)據(jù)價(jià)值進(jìn)行區(qū)分
數(shù)據(jù)庫 :保障企業(yè)業(yè)務(wù)系統(tǒng)的執(zhí)行 事務(wù)(回滾)數(shù)據(jù)倉庫 :統(tǒng)計(jì)分析的結(jié)果可以為企業(yè)的經(jīng)營(yíng)決策提供數(shù)據(jù)依據(jù) 沒有事務(wù) 數(shù)據(jù)倉庫不是數(shù)據(jù)流轉(zhuǎn)的終點(diǎn) :可視化才是數(shù)據(jù)的終點(diǎn)
整體架構(gòu)
Spark : 數(shù)據(jù)的統(tǒng)計(jì)分析 數(shù)據(jù)倉庫:數(shù)據(jù)的統(tǒng)計(jì)分析 數(shù)據(jù)倉庫不能直接對(duì)接MySQL數(shù)據(jù)庫作為數(shù)據(jù)源!
數(shù)據(jù)庫不是為了數(shù)據(jù)倉庫服務(wù)的。數(shù)據(jù)倉庫如果直接對(duì)象數(shù)據(jù)庫,會(huì)導(dǎo)致數(shù)據(jù)庫的性能降低數(shù)據(jù)庫不能存儲(chǔ)海量數(shù)據(jù)。數(shù)據(jù)倉庫必須獲取海量數(shù)據(jù)數(shù)據(jù)庫采用行式存儲(chǔ)。數(shù)據(jù)倉庫為了提高統(tǒng)計(jì)分析效率,所以需要列式存儲(chǔ)
數(shù)據(jù)倉庫應(yīng)該增加自己的數(shù)據(jù)源
數(shù)據(jù)倉庫的數(shù)據(jù)源中的數(shù)據(jù)應(yīng)該和MySQL數(shù)據(jù)庫中的數(shù)據(jù)保持一致 數(shù)據(jù)倉庫的數(shù)據(jù)源應(yīng)該不斷融合(匯總)MySQL數(shù)據(jù)庫中的數(shù)據(jù) 將數(shù)據(jù)庫的數(shù)據(jù)匯總的到數(shù)據(jù)倉庫數(shù)據(jù)源的過程,一般稱之為數(shù)據(jù)同步,也稱之為數(shù)據(jù)采集
分層架構(gòu)
數(shù)據(jù)倉庫計(jì)算周期為1天:1天統(tǒng)計(jì)一回?cái)?shù)據(jù)結(jié)果
方法論
ER模型(建模理論)
ER(Entity Relationship)(實(shí)體關(guān)系)模型 采用面向?qū)ο蟮姆绞皆O(shè)計(jì)表(和Java一樣)
將對(duì)象理解為表將對(duì)象之間的關(guān)系理解為表之間的關(guān)系
超詳細(xì)內(nèi)容(帶圖)看這里
維度模型
事實(shí) :行為所產(chǎn)生的事情(數(shù)據(jù)) 維度:分析數(shù)據(jù)的角度(狀態(tài))
超詳細(xì)內(nèi)容(帶圖)看這里
何為分層
Spark中的方法可能會(huì)含有shuffle功能, shuffle操作會(huì)將完整的計(jì)算流程一分為二,會(huì)分為2個(gè)階段(Stage),前面一個(gè)階段稱之為Map階段,后面的階段稱之為Reduce階段, shuffle中前一個(gè)階段的任務(wù)不執(zhí)行完,后面的階段的任務(wù)不允許執(zhí)行的, Task Pool(任務(wù)池) - 任務(wù)調(diào)度(FIFO, FAIR)。
數(shù)據(jù)倉庫也存在同樣的問題,將整個(gè)計(jì)算流程分為了4段, 在數(shù)據(jù)倉庫中不稱之為段,一般稱之為層,每一層有特殊的含義和特殊的功能 前面一層的數(shù)據(jù)沒有處理完,后面一層的數(shù)據(jù)沒有辦法處理
第一層:數(shù)據(jù)源(ODS ER模型)
功能:
為整個(gè)數(shù)據(jù)倉庫作為數(shù)據(jù)來源 不斷匯總業(yè)務(wù)數(shù)據(jù)和日志數(shù)據(jù) 數(shù)據(jù)量非常大:海量數(shù)據(jù) -> 考慮資源問題:使用最少的資源存儲(chǔ)最多的資源(考慮使用壓縮算法gzip、lzo、snappy);考慮網(wǎng)絡(luò)資源:考慮傳輸方式,數(shù)據(jù)盡可能不變(格式、壓縮方式、存儲(chǔ)方式) 統(tǒng)計(jì)本質(zhì)上就是對(duì)行為數(shù)據(jù)進(jìn)行統(tǒng)計(jì) 分析本質(zhì)上就是站在什么角度對(duì)統(tǒng)計(jì)結(jié)果進(jìn)行分析
-- ODS
-- 1. ODS層表建模方式:ER模型
-- 2. 數(shù)據(jù)格式不變,數(shù)據(jù)壓縮方式 gzip
-- 3. 表名
-- 分層標(biāo)記(ods_) + 同步數(shù)據(jù)的表名 + 增量/全量(inc/full)
-- 增量,全量
設(shè)計(jì)要點(diǎn)
(1)ODS層的表結(jié)構(gòu)設(shè)計(jì)依托于從業(yè)務(wù)系統(tǒng)同步過來的數(shù)據(jù)結(jié)構(gòu)。 (2)ODS層要保存全部歷史數(shù)據(jù),故其壓縮格式應(yīng)選擇壓縮比較高的,此處選擇gzip。 (3)ODS層表名的命名規(guī)范為:ods_表名_單分區(qū)增量全量標(biāo)識(shí)(inc/full)。
日志表
1)建表語句
DROP TABLE IF EXISTS ods_log_inc;
CREATE EXTERNAL TABLE ods_log_inc
(
`common` STRUCT ba :STRING, ch :STRING, is_new :STRING, md :STRING, mid :STRING, os :STRING, sid :STRING, uid :STRING, vc :STRING> COMMENT '公共信息', `page` STRUCT item :STRING, item_type :STRING, last_page_id :STRING, page_id :STRING, from_pos_id :STRING, from_pos_seq :STRING, refer_id :STRING> COMMENT '頁面信息', `actions` ARRAY item:STRING, item_type:STRING, ts:BIGINT>> COMMENT '動(dòng)作信息', `displays` ARRAY item :STRING, item_type :STRING, `pos_seq` :STRING, pos_id :STRING>> COMMENT '曝光信息', `start` STRUCT first_open :BIGINT, loading_time :BIGINT, open_ad_id :BIGINT, open_ad_ms :BIGINT, open_ad_skip_ms :BIGINT> COMMENT '啟動(dòng)信息', `err` STRUCT msg:STRING> COMMENT '錯(cuò)誤信息', `ts` BIGINT COMMENT '時(shí)間戳' ) COMMENT '活動(dòng)信息表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_log_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 2)數(shù)據(jù)裝載 load data inpath '/origin_data/gmall/log/topic_log/2022-06-08' into table ods_log_inc partition(dt='2022-06-08'); 3)每日數(shù)據(jù)裝載腳本 (1)在hadoop102的/home/atguigu/bin目錄下創(chuàng)建hdfs_to_ods_log.sh vim hdfs_to_ods_log.sh (2)編寫如下內(nèi)容 #!/bin/bash # 定義變量方便修改 APP=gmall # 如果是輸入的日期按照取輸入日期;如果沒輸入日期取當(dāng)前時(shí)間的前一天 if [ -n "$1" ] ;then do_date=$1 else do_date=`date -d "-1 day" +%F` fi echo ================== 日志日期為 $do_date ================== sql=" load data inpath '/origin_data/$APP/log/topic_log/$do_date' into table ${APP}.ods_log_inc partition(dt='$do_date'); " hive -e "$sql" (3)增加腳本執(zhí)行權(quán)限 chmod +x hdfs_to_ods_log.sh (4)腳本用法 hdfs_to_ods_log.sh 2022-06-08 業(yè)務(wù)表 1活動(dòng)信息表(全量表) DROP TABLE IF EXISTS ods_activity_info_full; CREATE EXTERNAL TABLE ods_activity_info_full ( `id` STRING COMMENT '活動(dòng)id', `activity_name` STRING COMMENT '活動(dòng)名稱', `activity_type` STRING COMMENT '活動(dòng)類型', `activity_desc` STRING COMMENT '活動(dòng)描述', `start_time` STRING COMMENT '開始時(shí)間', `end_time` STRING COMMENT '結(jié)束時(shí)間', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '活動(dòng)信息表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_activity_info_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 2活動(dòng)規(guī)則表(全量表) DROP TABLE IF EXISTS ods_activity_rule_full; CREATE EXTERNAL TABLE ods_activity_rule_full ( `id` STRING COMMENT '編號(hào)', `activity_id` STRING COMMENT '活動(dòng)ID', `activity_type` STRING COMMENT '活動(dòng)類型', `condition_amount` DECIMAL(16, 2) COMMENT '滿減金額', `condition_num` BIGINT COMMENT '滿減件數(shù)', `benefit_amount` DECIMAL(16, 2) COMMENT '優(yōu)惠金額', `benefit_discount` DECIMAL(16, 2) COMMENT '優(yōu)惠折扣', `benefit_level` STRING COMMENT '優(yōu)惠級(jí)別', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '活動(dòng)規(guī)則表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_activity_rule_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 3一級(jí)品類表(全量表) DROP TABLE IF EXISTS ods_base_category1_full; CREATE EXTERNAL TABLE ods_base_category1_full ( `id` STRING COMMENT '編號(hào)', `name` STRING COMMENT '分類名稱', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '一級(jí)品類表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_base_category1_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 4二級(jí)品類表(全量表) DROP TABLE IF EXISTS ods_base_category2_full; CREATE EXTERNAL TABLE ods_base_category2_full ( `id` STRING COMMENT '編號(hào)', `name` STRING COMMENT '二級(jí)分類名稱', `category1_id` STRING COMMENT '一級(jí)分類編號(hào)', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '二級(jí)品類表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_base_category2_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 5三級(jí)品類表(全量表) DROP TABLE IF EXISTS ods_base_category3_full; CREATE EXTERNAL TABLE ods_base_category3_full ( `id` STRING COMMENT '編號(hào)', `name` STRING COMMENT '三級(jí)分類名稱', `category2_id` STRING COMMENT '二級(jí)分類編號(hào)', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '三級(jí)品類表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_base_category3_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 6編碼字典表(全量表) DROP TABLE IF EXISTS ods_base_dic_full; CREATE EXTERNAL TABLE ods_base_dic_full ( `dic_code` STRING COMMENT '編號(hào)', `dic_name` STRING COMMENT '編碼名稱', `parent_code` STRING COMMENT '父編號(hào)', `create_time` STRING COMMENT '創(chuàng)建日期', `operate_time` STRING COMMENT '修改日期' ) COMMENT '編碼字典表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_base_dic_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 7省份表(全量表) DROP TABLE IF EXISTS ods_base_province_full; CREATE EXTERNAL TABLE ods_base_province_full ( `id` STRING COMMENT '編號(hào)', `name` STRING COMMENT '省份名稱', `region_id` STRING COMMENT '地區(qū)ID', `area_code` STRING COMMENT '地區(qū)編碼', `iso_code` STRING COMMENT '舊版國(guó)際標(biāo)準(zhǔn)地區(qū)編碼,供可視化使用', `iso_3166_2` STRING COMMENT '新版國(guó)際標(biāo)準(zhǔn)地區(qū)編碼,供可視化使用', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '省份表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_base_province_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 8地區(qū)表(全量表) DROP TABLE IF EXISTS ods_base_region_full; CREATE EXTERNAL TABLE ods_base_region_full ( `id` STRING COMMENT '地區(qū)ID', `region_name` STRING COMMENT '地區(qū)名稱', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '地區(qū)表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_base_region_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 9品牌表(全量表) DROP TABLE IF EXISTS ods_base_trademark_full; CREATE EXTERNAL TABLE ods_base_trademark_full ( `id` STRING COMMENT '編號(hào)', `tm_name` STRING COMMENT '品牌名稱', `logo_url` STRING COMMENT '品牌LOGO的圖片路徑', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '品牌表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_base_trademark_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 10購(gòu)物車表(全量表) DROP TABLE IF EXISTS ods_cart_info_full; CREATE EXTERNAL TABLE ods_cart_info_full ( `id` STRING COMMENT '編號(hào)', `user_id` STRING COMMENT '用戶ID', `sku_id` STRING COMMENT 'SKU_ID', `cart_price` DECIMAL(16, 2) COMMENT '放入購(gòu)物車時(shí)價(jià)格', `sku_num` BIGINT COMMENT '數(shù)量', `img_url` BIGINT COMMENT '商品圖片地址', `sku_name` STRING COMMENT 'SKU名稱 (冗余)', `is_checked` STRING COMMENT '是否被選中', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間', `is_ordered` STRING COMMENT '是否已經(jīng)下單', `order_time` STRING COMMENT '下單時(shí)間' ) COMMENT '購(gòu)物車全量表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_cart_info_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 11優(yōu)惠券信息表(全量表) DROP TABLE IF EXISTS ods_coupon_info_full; CREATE EXTERNAL TABLE ods_coupon_info_full ( `id` STRING COMMENT '購(gòu)物券編號(hào)', `coupon_name` STRING COMMENT '購(gòu)物券名稱', `coupon_type` STRING COMMENT '購(gòu)物券類型 1 現(xiàn)金券 2 折扣券 3 滿減券 4 滿件打折券', `condition_amount` DECIMAL(16, 2) COMMENT '滿額數(shù)', `condition_num` BIGINT COMMENT '滿件數(shù)', `activity_id` STRING COMMENT '活動(dòng)編號(hào)', `benefit_amount` DECIMAL(16, 2) COMMENT '減免金額', `benefit_discount` DECIMAL(16, 2) COMMENT '折扣', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `range_type` STRING COMMENT '范圍類型 1、商品(SPUID) 2、品類(三級(jí)品類id) 3、品牌', `limit_num` BIGINT COMMENT '最多領(lǐng)用次數(shù)', `taken_count` BIGINT COMMENT '已領(lǐng)用次數(shù)', `start_time` STRING COMMENT '可以領(lǐng)取的開始時(shí)間', `end_time` STRING COMMENT '可以領(lǐng)取的結(jié)束時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間', `expire_time` STRING COMMENT '過期時(shí)間', `range_desc` STRING COMMENT '范圍描述' ) COMMENT '優(yōu)惠券信息表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_coupon_info_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 12商品平臺(tái)屬性表(全量表) DROP TABLE IF EXISTS ods_sku_attr_value_full; CREATE EXTERNAL TABLE ods_sku_attr_value_full ( `id` STRING COMMENT '編號(hào)', `attr_id` STRING COMMENT '平臺(tái)屬性ID', `value_id` STRING COMMENT '平臺(tái)屬性值ID', `sku_id` STRING COMMENT 'SKU_ID', `attr_name` STRING COMMENT '平臺(tái)屬性名稱', `value_name` STRING COMMENT '平臺(tái)屬性值名稱', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '商品平臺(tái)屬性表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_sku_attr_value_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 13商品表(全量表) DROP TABLE IF EXISTS ods_sku_info_full; CREATE EXTERNAL TABLE ods_sku_info_full ( `id` STRING COMMENT 'SKU_ID', `spu_id` STRING COMMENT 'SPU_ID', `price` DECIMAL(16, 2) COMMENT '價(jià)格', `sku_name` STRING COMMENT 'SKU名稱', `sku_desc` STRING COMMENT 'SKU規(guī)格描述', `weight` DECIMAL(16, 2) COMMENT '重量', `tm_id` STRING COMMENT '品牌ID', `category3_id` STRING COMMENT '三級(jí)品類ID', `sku_default_img` STRING COMMENT '默認(rèn)顯示圖片地址', `is_sale` STRING COMMENT '是否在售', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '商品表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_sku_info_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 14商品銷售屬性值表(全量表) DROP TABLE IF EXISTS ods_sku_sale_attr_value_full; CREATE EXTERNAL TABLE ods_sku_sale_attr_value_full ( `id` STRING COMMENT '編號(hào)', `sku_id` STRING COMMENT 'SKU_ID', `spu_id` STRING COMMENT 'SPU_ID', `sale_attr_value_id` STRING COMMENT '銷售屬性值ID', `sale_attr_id` STRING COMMENT '銷售屬性ID', `sale_attr_name` STRING COMMENT '銷售屬性名稱', `sale_attr_value_name` STRING COMMENT '銷售屬性值名稱', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '商品銷售屬性值表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_sku_sale_attr_value_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 15SPU表(全量表) DROP TABLE IF EXISTS ods_spu_info_full; CREATE EXTERNAL TABLE ods_spu_info_full ( `id` STRING COMMENT 'SPU_ID', `spu_name` STRING COMMENT 'SPU名稱', `description` STRING COMMENT '描述信息', `category3_id` STRING COMMENT '三級(jí)品類ID', `tm_id` STRING COMMENT '品牌ID', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT 'SPU表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_spu_info_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 16營(yíng)銷坑位表(全量表) DROP TABLE IF EXISTS ods_promotion_pos_full; CREATE EXTERNAL TABLE ods_promotion_pos_full ( `id` STRING COMMENT '營(yíng)銷坑位ID', `pos_location` STRING COMMENT '營(yíng)銷坑位位置', `pos_type` STRING COMMENT '營(yíng)銷坑位類型:banner,宮格,列表,瀑布', `promotion_type` STRING COMMENT '營(yíng)銷類型:算法、固定、搜索', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '營(yíng)銷坑位表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_promotion_pos_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 17營(yíng)銷渠道表(全量表) DROP TABLE IF EXISTS ods_promotion_refer_full; CREATE EXTERNAL TABLE ods_promotion_refer_full ( `id` STRING COMMENT '外部營(yíng)銷渠道ID', `refer_name` STRING COMMENT '外部營(yíng)銷渠道名稱', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '營(yíng)銷渠道表' PARTITIONED BY (`dt` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' NULL DEFINED AS '' LOCATION '/warehouse/gmall/ods/ods_promotion_refer_full/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 18購(gòu)物車表(增量表) DROP TABLE IF EXISTS ods_cart_info_inc; CREATE EXTERNAL TABLE ods_cart_info_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT user_id :STRING, sku_id :STRING, cart_price :DECIMAL(16, 2), sku_num :BIGINT, img_url :STRING, sku_name :STRING, is_checked :STRING, create_time :STRING, operate_time :STRING, is_ordered :STRING, order_time:STRING> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '購(gòu)物車增量表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_cart_info_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 19評(píng)論表(增量表) DROP TABLE IF EXISTS ods_comment_info_inc; CREATE EXTERNAL TABLE ods_comment_info_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT user_id :STRING, nick_name :STRING, head_img :STRING, sku_id :STRING, spu_id :STRING, order_id :STRING, appraise :STRING, comment_txt :STRING, create_time :STRING, operate_time :STRING> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '評(píng)論表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_comment_info_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 20優(yōu)惠券領(lǐng)用表(增量表) DROP TABLE IF EXISTS ods_coupon_use_inc; CREATE EXTERNAL TABLE ods_coupon_use_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT coupon_id :STRING, user_id :STRING, order_id :STRING, coupon_status :STRING, get_time :STRING, using_time:STRING, used_time :STRING,expire_time :STRING, create_time :STRING, operate_time :STRING> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '優(yōu)惠券領(lǐng)用表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_coupon_use_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 21收藏表(增量表) DROP TABLE IF EXISTS ods_favor_info_inc; CREATE EXTERNAL TABLE ods_favor_info_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT user_id :STRING, sku_id :STRING, spu_id :STRING, is_cancel :STRING, create_time :STRING, operate_time:STRING> COMMENT '數(shù)據(jù)', `old` MAP STRING> COMMENT '舊值' ) COMMENT '收藏表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_favor_info_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 22訂單明細(xì)表(增量表) DROP TABLE IF EXISTS ods_order_detail_inc; CREATE EXTERNAL TABLE ods_order_detail_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT order_id :STRING, sku_id :STRING, sku_name :STRING, img_url :STRING, order_price:DECIMAL(16, 2), sku_num :BIGINT, create_time :STRING, source_type :STRING, source_id :STRING, split_total_amount:DECIMAL(16, 2), split_activity_amount :DECIMAL(16, 2), split_coupon_amount:DECIMAL(16, 2), operate_time :STRING> COMMENT '數(shù)據(jù)', `old` MAP STRING> COMMENT '舊值' ) COMMENT '訂單明細(xì)表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_order_detail_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 23訂單明細(xì)活動(dòng)關(guān)聯(lián)表(增量表) DROP TABLE IF EXISTS ods_order_detail_activity_inc; CREATE EXTERNAL TABLE ods_order_detail_activity_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT order_id :STRING, order_detail_id :STRING, activity_id :STRING, activity_rule_id :STRING, sku_id:STRING, create_time :STRING, operate_time :STRING> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '訂單明細(xì)活動(dòng)關(guān)聯(lián)表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_order_detail_activity_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 24訂單明細(xì)優(yōu)惠券關(guān)聯(lián)表(增量表) DROP TABLE IF EXISTS ods_order_detail_coupon_inc; CREATE EXTERNAL TABLE ods_order_detail_coupon_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT order_id :STRING, order_detail_id :STRING, coupon_id :STRING, coupon_use_id :STRING, sku_id:STRING, create_time :STRING, operate_time :STRING> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '訂單明細(xì)優(yōu)惠券關(guān)聯(lián)表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_order_detail_coupon_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 25訂單表(增量表) DROP TABLE IF EXISTS ods_order_info_inc; CREATE EXTERNAL TABLE ods_order_info_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT consignee :STRING, consignee_tel :STRING, total_amount :DECIMAL(16, 2), order_status :STRING, user_id:STRING, payment_way :STRING, delivery_address :STRING, order_comment :STRING, out_trade_no :STRING, trade_body:STRING, create_time :STRING, operate_time :STRING, expire_time :STRING, process_status :STRING, tracking_no:STRING, parent_order_id :STRING, img_url :STRING, province_id :STRING, activity_reduce_amount:DECIMAL(16, 2), coupon_reduce_amount :DECIMAL(16, 2), original_total_amount :DECIMAL(16, 2), freight_fee:DECIMAL(16, 2), freight_fee_reduce :DECIMAL(16, 2), refundable_time :DECIMAL(16, 2)> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '訂單表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_order_info_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 26退單表(增量表) DROP TABLE IF EXISTS ods_order_refund_info_inc; CREATE EXTERNAL TABLE ods_order_refund_info_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT user_id :STRING, order_id :STRING, sku_id :STRING, refund_type :STRING, refund_num :BIGINT, refund_amount:DECIMAL(16, 2), refund_reason_type :STRING, refund_reason_txt :STRING, refund_status :STRING, create_time:STRING, operate_time :STRING> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '退單表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_order_refund_info_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 27訂單狀態(tài)流水表(增量表) DROP TABLE IF EXISTS ods_order_status_log_inc; CREATE EXTERNAL TABLE ods_order_status_log_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT order_id :STRING, order_status :STRING, create_time :STRING, operate_time :STRING> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '訂單狀態(tài)流水表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_order_status_log_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 28支付表(增量表) DROP TABLE IF EXISTS ods_payment_info_inc; CREATE EXTERNAL TABLE ods_payment_info_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT out_trade_no :STRING, order_id :STRING, user_id :STRING, payment_type :STRING, trade_no:STRING, total_amount :DECIMAL(16, 2), subject :STRING, payment_status :STRING, create_time :STRING, callback_time:STRING, callback_content :STRING, operate_time :STRING> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '支付表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_payment_info_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 29退款表(增量表) DROP TABLE IF EXISTS ods_refund_payment_inc; CREATE EXTERNAL TABLE ods_refund_payment_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT out_trade_no :STRING, order_id :STRING, sku_id :STRING, payment_type :STRING, trade_no :STRING, total_amount:DECIMAL(16, 2), subject :STRING, refund_status :STRING, create_time :STRING, callback_time :STRING, callback_content:STRING, operate_time :STRING> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '退款表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_refund_payment_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 30用戶表(增量表) DROP TABLE IF EXISTS ods_user_info_inc; CREATE EXTERNAL TABLE ods_user_info_inc ( `type` STRING COMMENT '變動(dòng)類型', `ts` BIGINT COMMENT '變動(dòng)時(shí)間', `data` STRUCT login_name :STRING, nick_name :STRING, passwd :STRING, name :STRING, phone_num :STRING, email:STRING, head_img :STRING, user_level :STRING, birthday :STRING, gender :STRING, create_time :STRING, operate_time:STRING, status :STRING> COMMENT '數(shù)據(jù)', `old` MAP ) COMMENT '用戶表' PARTITIONED BY (`dt` STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe' LOCATION '/warehouse/gmall/ods/ods_user_info_inc/' TBLPROPERTIES ('compression.codec'='org.apache.hadoop.io.compress.GzipCodec'); 31數(shù)據(jù)裝載腳本 (1)在hadoop102的/home/atguigu/bin目錄下創(chuàng)建hdfs_to_ods_db.sh vim hdfs_to_ods_db.sh (2)編寫如下內(nèi)容 #!/bin/bash APP=gmall if [ -n "$2" ] ;then do_date=$2 else do_date=`date -d '-1 day' +%F` fi load_data(){ sql="" for i in $*; do #判斷路徑是否存在 hadoop fs -test -e /origin_data/$APP/db/${i:4}/$do_date #路徑存在方可裝載數(shù)據(jù) if [[ $? = 0 ]]; then sql=$sql"load data inpath '/origin_data/$APP/db/${i:4}/$do_date' OVERWRITE into table ${APP}.$i partition(dt='$do_date');" fi done hive -e "$sql" } case $1 in "ods_activity_info_full") load_data "ods_activity_info_full" ;; "ods_activity_rule_full") load_data "ods_activity_rule_full" ;; "ods_base_category1_full") load_data "ods_base_category1_full" ;; "ods_base_category2_full") load_data "ods_base_category2_full" ;; "ods_base_category3_full") load_data "ods_base_category3_full" ;; "ods_base_dic_full") load_data "ods_base_dic_full" ;; "ods_base_province_full") load_data "ods_base_province_full" ;; "ods_base_region_full") load_data "ods_base_region_full" ;; "ods_base_trademark_full") load_data "ods_base_trademark_full" ;; "ods_cart_info_full") load_data "ods_cart_info_full" ;; "ods_coupon_info_full") load_data "ods_coupon_info_full" ;; "ods_sku_attr_value_full") load_data "ods_sku_attr_value_full" ;; "ods_sku_info_full") load_data "ods_sku_info_full" ;; "ods_sku_sale_attr_value_full") load_data "ods_sku_sale_attr_value_full" ;; "ods_spu_info_full") load_data "ods_spu_info_full" ;; "ods_promotion_pos_full") load_data "ods_promotion_pos_full" ;; "ods_promotion_refer_full") load_data "ods_promotion_refer_full" ;; "ods_cart_info_inc") load_data "ods_cart_info_inc" ;; "ods_comment_info_inc") load_data "ods_comment_info_inc" ;; "ods_coupon_use_inc") load_data "ods_coupon_use_inc" ;; "ods_favor_info_inc") load_data "ods_favor_info_inc" ;; "ods_order_detail_inc") load_data "ods_order_detail_inc" ;; "ods_order_detail_activity_inc") load_data "ods_order_detail_activity_inc" ;; "ods_order_detail_coupon_inc") load_data "ods_order_detail_coupon_inc" ;; "ods_order_info_inc") load_data "ods_order_info_inc" ;; "ods_order_refund_info_inc") load_data "ods_order_refund_info_inc" ;; "ods_order_status_log_inc") load_data "ods_order_status_log_inc" ;; "ods_payment_info_inc") load_data "ods_payment_info_inc" ;; "ods_refund_payment_inc") load_data "ods_refund_payment_inc" ;; "ods_user_info_inc") load_data "ods_user_info_inc" ;; "all") load_data "ods_activity_info_full" "ods_activity_rule_full" "ods_base_category1_full" "ods_base_category2_full" "ods_base_category3_full" "ods_base_dic_full" "ods_base_province_full" "ods_base_region_full" "ods_base_trademark_full" "ods_cart_info_full" "ods_coupon_info_full" "ods_sku_attr_value_full" "ods_sku_info_full" "ods_sku_sale_attr_value_full" "ods_spu_info_full" "ods_promotion_pos_full" "ods_promotion_refer_full" "ods_cart_info_inc" "ods_comment_info_inc" "ods_coupon_use_inc" "ods_favor_info_inc" "ods_order_detail_inc" "ods_order_detail_activity_inc" "ods_order_detail_coupon_inc" "ods_order_info_inc" "ods_order_refund_info_inc" "ods_order_status_log_inc" "ods_payment_info_inc" "ods_refund_payment_inc" "ods_user_info_inc" ;; esac (3)增加腳本執(zhí)行權(quán)限 chmod +x hdfs_to_ods_db.sh (4)腳本用法 hdfs_to_ods_db.sh all 2022-06-08 第二層:數(shù)據(jù)加工(DWD data warehouse detail) 功能:將數(shù)據(jù)源中的數(shù)據(jù)進(jìn)行加工處理(判空、無效) 為了后續(xù)統(tǒng)計(jì)分析做數(shù)據(jù)準(zhǔn)備 數(shù)據(jù)量非常大,所以分離出了DIM層將數(shù)據(jù)整合 壓縮方式:snappy 事實(shí)表設(shè)計(jì)(事務(wù)型事實(shí)表) -- DWD -- Data Warehouse Detail -- detail : 詳細(xì),明細(xì) -- DWD層表主要設(shè)計(jì)的目的為了統(tǒng)計(jì)分析做準(zhǔn)備 -- 表中主要保存的是行為數(shù)據(jù) -- 多個(gè)行為數(shù)據(jù)中如果存在共通性的內(nèi)容,那么可以提煉出來形成DIM層維度表的數(shù)據(jù) -- 表的設(shè)計(jì)要點(diǎn) -- 表的設(shè)計(jì)要依據(jù)維度建模理論中的事實(shí)表 -- 表設(shè)計(jì)時(shí)需要orc列式存儲(chǔ)以及snappy壓縮 -- 命名規(guī)范: -- 分層標(biāo)記(dwd_) + 數(shù)據(jù)域(分類) + 原子性行為名稱 + 增量/全量(inc/full) -- 絕大多數(shù)的行為數(shù)據(jù)都是增量數(shù)據(jù)采集 -- 特殊情況例外,可以采用全量方式實(shí)現(xiàn)。 -- dwd_user_login_success_inc -- 事實(shí)表 -- 維度引用 + 度量值(行為產(chǎn)生時(shí)可以用于統(tǒng)計(jì)分析的數(shù)值:金額,數(shù)量,個(gè)數(shù)) -- 事實(shí)表會(huì)根據(jù)場(chǎng)景分為3大類: -- 1. 事務(wù)型事實(shí)表 -- 行為是原子性 -- 用戶登錄(非原子) -- 用戶登錄成功(原子) -- 用戶登錄失?。ㄔ樱?/p> -- 粒度:描述一行數(shù)據(jù)的詳細(xì)程度 -- 描述的越詳細(xì)(維度越多),粒度越細(xì) -- 描述的越簡(jiǎn)單(維度越多),粒度越粗 -- 設(shè)計(jì)步驟: -- 1. 選擇業(yè)務(wù)過程 :確定表 -- 2. 聲明粒度:確定行 -- 3. 確認(rèn)維度:確定列 -- 4. 確認(rèn)事實(shí):確定度量值 -- 2. 周期快照事實(shí)表 -- 3. 累積快照事實(shí)表 -- 交易域加購(gòu)事務(wù)事實(shí)表 -- 交易域 : trade -- 加購(gòu) : 行為 -- 將商品加入到購(gòu)物車中的行為 -- 購(gòu)物車中沒有這個(gè)商品,往購(gòu)物車中增加商品 -- 購(gòu)物車中有這個(gè)商品,繼續(xù)往購(gòu)物車中增加該商品 -- 事務(wù)事實(shí)表 -- 原子性 -- 時(shí)間(行為時(shí)間) + 用戶 + 商品 + 數(shù)量 -- 表的字段結(jié)構(gòu):必要的維度屬性 + 度量值 + 可選的維度屬性 -- 建表語句 -- 分區(qū)策略:哪一天的行為數(shù)據(jù)存放到哪一天分區(qū) 事務(wù)的原子性 登錄成功(OK) 登錄失敗(OK) 下單成功(OK) 下單失敗(非正常業(yè)務(wù)行為,不需要再創(chuàng)建一張表) 支付成功(OK) 支付失?。∣K) 事實(shí)表設(shè)計(jì)(周期型快照事實(shí)表) 全量 -- 事務(wù)性事實(shí)表局限性 -- 事實(shí)表只針對(duì)于當(dāng)前行為進(jìn)行的統(tǒng)計(jì)分析時(shí),性能可以得到保障。 -- 當(dāng)前行為事實(shí)表和其他行為數(shù)據(jù)進(jìn)行關(guān)聯(lián)時(shí),數(shù)據(jù)量會(huì)幾何爆炸性增長(zhǎng),性能會(huì)急劇下降。 -- 存量性統(tǒng)計(jì)指標(biāo)使用事務(wù)性事實(shí)表效率太低,所以一般會(huì)采用其他事實(shí)表的設(shè)計(jì)方式 -- 2. 周期型快照事實(shí)表 -- 交易域購(gòu)物車周期快照事實(shí)表 -- 交易域 -- 購(gòu)物車 : cart_info -- 周期快照事實(shí)表 從當(dāng)前表中取數(shù)據(jù)后再放回去需考慮去重問題,增加retry的容錯(cuò)性 事實(shí)表設(shè)計(jì)(累積型快照事實(shí)表) -- 多行為統(tǒng)計(jì)指標(biāo)使用事務(wù)性事實(shí)表效率太低,所以一般會(huì)采用其他事實(shí)表的設(shè)計(jì)方式 -- 3. 累積型快照事實(shí)表 -- 使用一張表保存多個(gè)行為的狀態(tài)數(shù)據(jù) -- 交易域交易流程累積快照事實(shí)表 -- 交易域 -- 交易流程 : 以訂單為基礎(chǔ)的交易流程 -- 累積快照事實(shí)表 分區(qū)策略 -- 事務(wù)性事實(shí)表:哪一天的行為數(shù)據(jù)存放到哪一天的分區(qū) -- 周期性事實(shí)表:每一天存儲(chǔ)一份數(shù)據(jù) -- 累積快照事實(shí)表:從業(yè)務(wù)流程中獲取最后一個(gè)業(yè)務(wù)行為時(shí)間作為分區(qū)字段 -- 下單時(shí)間 (X) -- 支付時(shí)間 (X) -- 收貨時(shí)間 (OK) 第三層:數(shù)據(jù)統(tǒng)計(jì)(DWS data warehouse summary 提高性能的關(guān)鍵層) 功能:將加工后的數(shù)據(jù)進(jìn)行統(tǒng)計(jì) 數(shù)據(jù)量非常大 壓縮方式:snappy 第四層:數(shù)據(jù)分析(ADS application data service) 功能:將統(tǒng)計(jì)結(jié)果進(jìn)行分析,為用戶提供經(jīng)營(yíng)決策 壓縮方式:gzip 數(shù)據(jù)格式:tsv 表的設(shè)計(jì)(要點(diǎn)) -- ADS層中存儲(chǔ)的是統(tǒng)計(jì)分析的最終結(jié)果 -- 數(shù)據(jù)量不多 -- 表不需要分區(qū) -- 無需做進(jìn)一步聚合 -- 無需orc列式存儲(chǔ)和snappy壓縮 -- 行式存儲(chǔ) + gzip -- 結(jié)果還需要向后流轉(zhuǎn)(可視化) -- tsv -- 表的結(jié)構(gòu)不能太復(fù)雜(滿足客戶的需求即可) 優(yōu)化(假) Spark: reduceByKey(函數(shù)內(nèi)部combine減少落盤數(shù)據(jù)量)和groupByKeycache、persist和checkpointDWS 1流量主題 1.1各渠道流量統(tǒng)計(jì) 1)建表語句 DROP TABLE IF EXISTS ads_traffic_stats_by_channel; CREATE EXTERNAL TABLE ads_traffic_stats_by_channel ( `dt` STRING COMMENT '統(tǒng)計(jì)日期', `recent_days` BIGINT COMMENT '最近天數(shù),1:最近1天,7:最近7天,30:最近30天', `channel` STRING COMMENT '渠道', `uv_count` BIGINT COMMENT '訪客人數(shù)', `avg_duration_sec` BIGINT COMMENT '會(huì)話平均停留時(shí)長(zhǎng),單位為秒', `avg_page_count` BIGINT COMMENT '會(huì)話平均瀏覽頁面數(shù)', `sv_count` BIGINT COMMENT '會(huì)話數(shù)', `bounce_rate` DECIMAL(16, 2) COMMENT '跳出率' ) COMMENT '各渠道流量統(tǒng)計(jì)' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/warehouse/gmall/ads/ads_traffic_stats_by_channel/'; 2)數(shù)據(jù)裝載 第五層:共通層(DIM dimension) 功能:將共同的數(shù)據(jù)放在共通的表中,可在多個(gè)統(tǒng)計(jì)需求中使用 dimension:維度,分析數(shù)據(jù)的角度 該層不需要一開始就設(shè)計(jì),可以等DWD層設(shè)計(jì)的差不多了,或是寫著寫著發(fā)現(xiàn)DWD中有好多表都用到了共通的字段,有大量冗余數(shù)據(jù),那么就可以將這部分共通的數(shù)據(jù)提取成一個(gè)表 設(shè)計(jì)要點(diǎn) (1)DIM層的設(shè)計(jì)依據(jù)是維度建模理論,該層存儲(chǔ)維度模型的維度表。 (2)DIM層的數(shù)據(jù)存儲(chǔ)格式為orc列式存儲(chǔ)+snappy壓縮。 (3)DIM層表名的命名規(guī)范為dim_表名_全量表或者拉鏈表標(biāo)識(shí)(full/zip)。 絕大多數(shù)的維度表都是全量表 是否創(chuàng)建表 數(shù)據(jù)量少,應(yīng)用面窄 :無需創(chuàng)建表(放用得到的表里即可,即維度退化) 數(shù)據(jù)量少,應(yīng)用面廣 :無需創(chuàng)建獨(dú)立表,一般和其他的數(shù)放置在一張表中(數(shù)據(jù)字典表(編碼表->是樹形表)) 樹形(有上下級(jí))數(shù)據(jù)保存時(shí)一般會(huì)采用 parent - child節(jié)點(diǎn)的設(shè)計(jì)方式 一般情況下,會(huì)采用一張表來設(shè)計(jì)上下級(jí)結(jié)構(gòu):部門(depart) – 表中的列:下級(jí)部門(主鍵)(N), 上級(jí)部門(外鍵)(1)字典表也是樹形表 維度表設(shè)計(jì) 確定維度表:確定維度的表是否該創(chuàng)建 原則上來講,每一個(gè)分析數(shù)據(jù)的角度(維度)都應(yīng)該創(chuàng)建一張表 案例:統(tǒng)計(jì)各個(gè)省份,各個(gè)品牌的訂單總銷量 – 訂單屬于事實(shí)(行為)表,省份和品牌就是維度表案例:統(tǒng)計(jì)各個(gè)性別不同年齡段的訂單總銷量 – 訂單屬于事實(shí)(行為)表, 性別和年齡就是維度表 如果多個(gè)維度存在關(guān)聯(lián),那么一般就會(huì)只創(chuàng)建一張表,表中包含了多個(gè)關(guān)聯(lián)的維度如果分析數(shù)據(jù)的角度應(yīng)用場(chǎng)景少,而且數(shù)據(jù)量小,不需要?jiǎng)?chuàng)建專門的維度表 案例:支付方式(微信支付,支付寶支付) 確定主維表和相關(guān)維表(用于分析維度表的列) 確定表中的列 案例:省份維度表 – 列:名稱數(shù)據(jù)倉庫的數(shù)據(jù)都來自于MySQL業(yè)務(wù)數(shù)據(jù), – 維度表的列的聲明可以參考業(yè)務(wù)數(shù)據(jù)庫表的字段MySQL業(yè)務(wù)數(shù)據(jù)庫中具有唯一性字段的那個(gè)業(yè)務(wù)表稱之為主維表 – 其他的表稱之為相關(guān)維表。 確定表中的列 盡可能豐富(多)編碼和文字共存沉淀通用屬性 :tel, xxx – 計(jì)算或轉(zhuǎn)換 1商品維度表 -- 商品維度表 :dim_sku_full -- 確定維度表 -- 主維表和相關(guān)維表 -- 主維表和相關(guān)維表都是MySQL業(yè)務(wù)表 -- 主要用于分析列的表稱之主維表(主鍵) -- sku_info -- 其他用于分析列的表稱之相關(guān)維表 -- sku_attr_value -- sku_sale_attr_value -- 確定表的列 -- 建表語句 DROP TABLE IF EXISTS dim_sku_full; CREATE EXTERNAL TABLE dim_sku_full ( `id` STRING COMMENT 'SKU_ID', `price` DECIMAL(16, 2) COMMENT '商品價(jià)格', `sku_name` STRING COMMENT '商品名稱', `sku_desc` STRING COMMENT '商品描述', `weight` DECIMAL(16, 2) COMMENT '重量', `is_sale` BOOLEAN COMMENT '是否在售', `spu_id` STRING COMMENT 'SPU編號(hào)', `spu_name` STRING COMMENT 'SPU名稱', `category3_id` STRING COMMENT '三級(jí)品類ID', `category3_name` STRING COMMENT '三級(jí)品類名稱', `category2_id` STRING COMMENT '二級(jí)品類id', `category2_name` STRING COMMENT '二級(jí)品類名稱', `category1_id` STRING COMMENT '一級(jí)品類ID', `category1_name` STRING COMMENT '一級(jí)品類名稱', `tm_id` STRING COMMENT '品牌ID', `tm_name` STRING COMMENT '品牌名稱', `sku_attr_values` ARRAY value_id :STRING, attr_name :STRING, value_name:STRING>> COMMENT '平臺(tái)屬性', `sku_sale_attr_values` ARRAY sale_attr_value_id :STRING, sale_attr_name :STRING, sale_attr_value_name:STRING>> COMMENT '銷售屬性', `create_time` STRING COMMENT '創(chuàng)建時(shí)間' ) COMMENT '商品維度表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_sku_full/' TBLPROPERTIES ('orc.compress' = 'snappy'); -- 數(shù)據(jù)裝載 -- load -- 數(shù)據(jù)源一定是ODS層 -- save -- 分區(qū)字段其實(shí)也是表的字段,但是我們一般稱之為虛擬字段 -- 數(shù)據(jù)字段(列):存儲(chǔ)在數(shù)據(jù)文件中 -- 分區(qū)字段:存儲(chǔ)在路徑中 -- 分區(qū) -- dt : date(日期) -- 策略:將每天采集的數(shù)據(jù)存放到ODS層的每天分區(qū)中 -- 將ODS層每天的數(shù)據(jù)關(guān)聯(lián)后存放到DIM層的每天分區(qū)中 -- 分區(qū)存儲(chǔ)應(yīng)該采用overwrite而不是into set hive.vectorized.execution.enabled=false; insert overwrite table dim_sku_full partition (dt='2022-06-08') select sku.`id` ,--STRING COMMENT 'SKU_ID', `price` ,--DECIMAL(16, 2) COMMENT '商品價(jià)格', `sku_name` ,--STRING COMMENT '商品名稱', `sku_desc` ,--STRING COMMENT '商品描述', `weight` ,--DECIMAL(16, 2) COMMENT '重量', `is_sale` ,--BOOLEAN COMMENT '是否在售', `spu_id` ,--STRING COMMENT 'SPU編號(hào)', `spu_name` ,--STRING COMMENT 'SPU名稱', `category3_id` ,--STRING COMMENT '三級(jí)品類ID', `category3_name` ,--STRING COMMENT '三級(jí)品類名稱', `category2_id` ,--STRING COMMENT '二級(jí)品類id', `category2_name` ,--STRING COMMENT '二級(jí)品類名稱', `category1_id` ,--STRING COMMENT '一級(jí)品類ID', `category1_name` ,--STRING COMMENT '一級(jí)品類名稱', `tm_id` ,-- STRING COMMENT '品牌ID', `tm_name` ,-- STRING COMMENT '品牌名稱', sku_attr_values, sku_sale_attr_values, `create_time` --STRING COMMENT '創(chuàng)建時(shí)間' from ( select `id` ,--STRING COMMENT 'SKU_ID', `price` ,--DECIMAL(16, 2) COMMENT '商品價(jià)格', `sku_name` ,--STRING COMMENT '商品名稱', `sku_desc` ,--STRING COMMENT '商品描述', `weight` ,--DECIMAL(16, 2) COMMENT '重量', `is_sale` ,--BOOLEAN COMMENT '是否在售', `spu_id` ,--STRING COMMENT 'SPU編號(hào)', `category3_id` ,--STRING COMMENT '三級(jí)品類ID', `tm_id` ,-- STRING COMMENT '品牌ID', create_time from ods_sku_info_full where dt = '2022-06-08' ) sku left join ( select id, spu_name from ods_spu_info_full where dt = '2022-06-08' ) spu on sku.spu_id = spu.id left join ( select id, tm_name from ods_base_trademark_full where dt = '2022-06-08' ) tm on sku.tm_id = tm.id left join ( select id, name category3_name, category2_id from ods_base_category3_full where dt = '2022-06-08' ) c3 on sku.category3_id = c3.id left join ( select id, name category2_name, category1_id from ods_base_category2_full where dt = '2022-06-08' ) c2 on c3.category2_id = c2.id left join ( select id, name category1_name from ods_base_category1_full where dt = '2022-06-08' ) c1 on c2.category1_id = c1.id left join ( -- 將查詢結(jié)果轉(zhuǎn)換為結(jié)構(gòu)體后,形成Array -- 將多個(gè)結(jié)構(gòu)體的數(shù)據(jù)聚合成一個(gè)數(shù)組類型的數(shù)據(jù)(聚合操作) select sku_id, collect_list(named_struct("attr_id", attr_id, "value_id", value_id, "attr_name", attr_name, "value_name", value_name)) sku_attr_values from ods_sku_attr_value_full where dt = '2022-06-08' group by sku_id ) sav on sku.id = sav.sku_id left join ( select sku_id, collect_list(named_struct("sale_attr_id", sale_attr_id, "sale_attr_value_id", sale_attr_value_id, "sale_attr_name", sale_attr_name, "sale_attr_value_name", sale_attr_value_name)) sku_sale_attr_values from ods_sku_sale_attr_value_full where dt = '2022-06-08' group by sku_id ) ssav on sku.id = ssav.sku_id; -- join & left join -- join 要求2張表的數(shù)據(jù)同時(shí)滿足條件才能作為結(jié)果返回 -- left join 要求2張表的數(shù)據(jù)左邊的表不滿足條件也能作為結(jié)果返回 -- 使用left join 替換 join,必須保證,替換后不影響結(jié)果 2優(yōu)惠券維度表 1)建表語句 -- 創(chuàng)建表 -- 分析表中的列 -- 主維表 -- coupon_info -- base_dic -- 相關(guān)維表 -- 建表語句 -- 表名 DROP TABLE IF EXISTS dim_coupon_full; CREATE EXTERNAL TABLE dim_coupon_full ( `id` STRING COMMENT '優(yōu)惠券編號(hào)', `coupon_name` STRING COMMENT '優(yōu)惠券名稱', `coupon_type_code` STRING COMMENT '優(yōu)惠券類型編碼', `coupon_type_name` STRING COMMENT '優(yōu)惠券類型名稱', `condition_amount` DECIMAL(16, 2) COMMENT '滿額數(shù)', `condition_num` BIGINT COMMENT '滿件數(shù)', `activity_id` STRING COMMENT '活動(dòng)編號(hào)', `benefit_amount` DECIMAL(16, 2) COMMENT '減免金額', `benefit_discount` DECIMAL(16, 2) COMMENT '折扣', `benefit_rule` STRING COMMENT '優(yōu)惠規(guī)則:滿元*減*元,滿*件打*折', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `range_type_code` STRING COMMENT '優(yōu)惠范圍類型編碼', `range_type_name` STRING COMMENT '優(yōu)惠范圍類型名稱', `limit_num` BIGINT COMMENT '最多領(lǐng)取次數(shù)', `taken_count` BIGINT COMMENT '已領(lǐng)取次數(shù)', `start_time` STRING COMMENT '可以領(lǐng)取的開始時(shí)間', `end_time` STRING COMMENT '可以領(lǐng)取的結(jié)束時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間', `expire_time` STRING COMMENT '過期時(shí)間' ) COMMENT '優(yōu)惠券維度表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_coupon_full/' TBLPROPERTIES ('orc.compress' = 'snappy'); 2)數(shù)據(jù)裝載 insert overwrite table dim_coupon_full partition(dt='2022-06-08') select id, coupon_name, coupon_type, coupon_dic.dic_name, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, case coupon_type when '3201' then concat('滿',condition_amount,'元減',benefit_amount,'元') when '3202' then concat('滿',condition_num,'件打', benefit_discount,' 折') when '3203' then concat('減',benefit_amount,'元') end benefit_rule, create_time, range_type, range_dic.dic_name, limit_num, taken_count, start_time, end_time, operate_time, expire_time from ( select id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, limit_num, taken_count, start_time, end_time, operate_time, expire_time from ods_coupon_info_full where dt='2022-06-08' )ci left join ( select dic_code, dic_name from ods_base_dic_full where dt='2022-06-08' and parent_code='32' )coupon_dic on ci.coupon_type=coupon_dic.dic_code left join ( select dic_code, dic_name from ods_base_dic_full where dt='2022-06-08' and parent_code='33' )range_dic on ci.range_type=range_dic.dic_code; 3活動(dòng)(規(guī)則)維度表 1)建表語句 -- 活動(dòng)(規(guī)則)維度表 -- activity_info -- activity_rule DROP TABLE IF EXISTS dim_activity_full; CREATE EXTERNAL TABLE dim_activity_full ( `activity_rule_id` STRING COMMENT '活動(dòng)規(guī)則ID', `activity_id` STRING COMMENT '活動(dòng)ID', `activity_name` STRING COMMENT '活動(dòng)名稱', `activity_type_code` STRING COMMENT '活動(dòng)類型編碼', `activity_type_name` STRING COMMENT '活動(dòng)類型名稱', `activity_desc` STRING COMMENT '活動(dòng)描述', `start_time` STRING COMMENT '開始時(shí)間', `end_time` STRING COMMENT '結(jié)束時(shí)間', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `condition_amount` DECIMAL(16, 2) COMMENT '滿減金額', `condition_num` BIGINT COMMENT '滿減件數(shù)', `benefit_amount` DECIMAL(16, 2) COMMENT '優(yōu)惠金額', `benefit_discount` DECIMAL(16, 2) COMMENT '優(yōu)惠折扣', `benefit_rule` STRING COMMENT '優(yōu)惠規(guī)則', `benefit_level` STRING COMMENT '優(yōu)惠級(jí)別' ) COMMENT '活動(dòng)維度表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_activity_full/' TBLPROPERTIES ('orc.compress' = 'snappy'); 2)數(shù)據(jù)裝載 insert overwrite table dim_activity_full partition(dt='2022-06-08') select rule.id, info.id, activity_name, rule.activity_type, dic.dic_name, activity_desc, start_time, end_time, create_time, condition_amount, condition_num, benefit_amount, benefit_discount, case rule.activity_type when '3101' then concat('滿',condition_amount,'元減',benefit_amount,'元') when '3102' then concat('滿',condition_num,'件打', benefit_discount,' 折') when '3103' then concat('打', benefit_discount,'折') end benefit_rule, benefit_level from ( select id, activity_id, activity_type, condition_amount, condition_num, benefit_amount, benefit_discount, benefit_level from ods_activity_rule_full where dt='2022-06-08' )rule left join ( select id, activity_name, activity_type, activity_desc, start_time, end_time, create_time from ods_activity_info_full where dt='2022-06-08' )info on rule.activity_id=info.id left join ( select dic_code, dic_name from ods_base_dic_full where dt='2022-06-08' and parent_code='31' )dic on rule.activity_type=dic.dic_code; 4地區(qū)維度表 1)建表語句 DROP TABLE IF EXISTS dim_province_full; CREATE EXTERNAL TABLE dim_province_full ( `id` STRING COMMENT '省份ID', `province_name` STRING COMMENT '省份名稱', `area_code` STRING COMMENT '地區(qū)編碼', `iso_code` STRING COMMENT '舊版國(guó)際標(biāo)準(zhǔn)地區(qū)編碼,供可視化使用', `iso_3166_2` STRING COMMENT '新版國(guó)際標(biāo)準(zhǔn)地區(qū)編碼,供可視化使用', `region_id` STRING COMMENT '地區(qū)ID', `region_name` STRING COMMENT '地區(qū)名稱' ) COMMENT '地區(qū)維度表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_province_full/' TBLPROPERTIES ('orc.compress' = 'snappy'); 2)數(shù)據(jù)裝載 insert overwrite table dim_province_full partition(dt='2022-06-08') select province.id, province.name, province.area_code, province.iso_code, province.iso_3166_2, region_id, region_name from ( select id, name, region_id, area_code, iso_code, iso_3166_2 from ods_base_province_full where dt='2022-06-08' )province left join ( select id, region_name from ods_base_region_full where dt='2022-06-08' )region on province.region_id=region.id; 5營(yíng)銷坑位維度表 1)建表語句 DROP TABLE IF EXISTS dim_promotion_pos_full; CREATE EXTERNAL TABLE dim_promotion_pos_full ( `id` STRING COMMENT '營(yíng)銷坑位ID', `pos_location` STRING COMMENT '營(yíng)銷坑位位置', `pos_type` STRING COMMENT '營(yíng)銷坑位類型 ', `promotion_type` STRING COMMENT '營(yíng)銷類型', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '營(yíng)銷坑位維度表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_promotion_pos_full/' TBLPROPERTIES ('orc.compress' = 'snappy'); 2)數(shù)據(jù)裝載 insert overwrite table dim_promotion_pos_full partition(dt='2022-06-08') select `id`, `pos_location`, `pos_type`, `promotion_type`, `create_time`, `operate_time` from ods_promotion_pos_full where dt='2022-06-08'; 6營(yíng)銷渠道維度表 1)建表語句 DROP TABLE IF EXISTS dim_promotion_refer_full; CREATE EXTERNAL TABLE dim_promotion_refer_full ( `id` STRING COMMENT '營(yíng)銷渠道ID', `refer_name` STRING COMMENT '營(yíng)銷渠道名稱', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '修改時(shí)間' ) COMMENT '營(yíng)銷渠道維度表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_promotion_refer_full/' TBLPROPERTIES ('orc.compress' = 'snappy'); 2)數(shù)據(jù)裝載 insert overwrite table dim_promotion_refer_full partition(dt='2022-06-08') select `id`, `refer_name`, `create_time`, `operate_time` from ods_promotion_refer_full where dt='2022-06-08'; 7日期維度表 日期維度表不需要從MySQL中導(dǎo),而是從文件中另行導(dǎo)入,也不需要每天導(dǎo)入,每年導(dǎo)入一次即可 1)建表語句 DROP TABLE IF EXISTS dim_date; CREATE EXTERNAL TABLE dim_date ( `date_id` STRING COMMENT '日期ID', `week_id` STRING COMMENT '周ID,一年中的第幾周', `week_day` STRING COMMENT '周幾', `day` STRING COMMENT '每月的第幾天', `month` STRING COMMENT '一年中的第幾月', `quarter` STRING COMMENT '一年中的第幾季度', `year` STRING COMMENT '年份', `is_workday` STRING COMMENT '是否是工作日', `holiday_id` STRING COMMENT '節(jié)假日' ) COMMENT '日期維度表' STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_date/' TBLPROPERTIES ('orc.compress' = 'snappy'); 2)數(shù)據(jù)裝載 (1)創(chuàng)建臨時(shí)表 DROP TABLE IF EXISTS tmp_dim_date_info; CREATE EXTERNAL TABLE tmp_dim_date_info ( `date_id` STRING COMMENT '日', `week_id` STRING COMMENT '周ID', `week_day` STRING COMMENT '周幾', `day` STRING COMMENT '每月的第幾天', `month` STRING COMMENT '第幾月', `quarter` STRING COMMENT '第幾季度', `year` STRING COMMENT '年', `is_workday` STRING COMMENT '是否是工作日', `holiday_id` STRING COMMENT '節(jié)假日' ) COMMENT '時(shí)間維度表' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/warehouse/gmall/tmp/tmp_dim_date_info/'; (2)將數(shù)據(jù)文件上傳到HFDS上臨時(shí)表路徑/warehouse/gmall/tmp/tmp_dim_date_info TSV格式的數(shù)據(jù) -- 日期數(shù)據(jù)不是由我們自己提供的 -- TSV -> ORC -- ODS(全量) -> DIM(ORC) insert overwrite table dim_date select * from tmp_dim_date_info; 8用戶維度表(拉鏈(壓縮)表) 將大量數(shù)據(jù)的表進(jìn)行特殊的設(shè)計(jì)進(jìn)行改善,讓數(shù)據(jù)減少,并且不影響業(yè)務(wù)邏輯 – 將數(shù)據(jù)狀態(tài)進(jìn)行時(shí)間標(biāo)記:開始 + 結(jié)束設(shè)計(jì)拉鏈表時(shí),需要在基本表的設(shè)計(jì)基礎(chǔ)上,增加2個(gè)額外字段,用于表示狀態(tài)的范圍(開始,結(jié)束)拉鏈表的數(shù)據(jù),每一個(gè)狀態(tài)的變化會(huì)保存一條數(shù)據(jù),如果狀態(tài)沒有任何的變化,那么數(shù)據(jù)只有一條 1)建表語句 -- 表名:dim_user_full -- 表中列 -- 主維表 : user_info -- 相關(guān)維表 : user_address DROP TABLE IF EXISTS dim_user_zip; CREATE EXTERNAL TABLE dim_user_zip ( `id` STRING COMMENT '用戶ID', `name` STRING COMMENT '用戶姓名', `phone_num` STRING COMMENT '手機(jī)號(hào)碼', `email` STRING COMMENT '郵箱', `user_level` STRING COMMENT '用戶等級(jí)', `birthday` STRING COMMENT '生日', `gender` STRING COMMENT '性別', `create_time` STRING COMMENT '創(chuàng)建時(shí)間', `operate_time` STRING COMMENT '操作時(shí)間', `start_date` STRING COMMENT '開始日期', `end_date` STRING COMMENT '結(jié)束日期' ) COMMENT '用戶維度表' PARTITIONED BY (`dt` STRING) STORED AS ORC LOCATION '/warehouse/gmall/dim/dim_user_zip/' TBLPROPERTIES ('orc.compress' = 'snappy'); 2)數(shù)據(jù)裝載 ① 首日裝載 -- 全量表 -- DataX -- TSV -- 增量表 -- Maxwell -- JSON -- 首日(全量-select) : bootstrap (3種類型) -- 每日(增量-binlog) : insert, update, delete insert overwrite table dim_user_zip partition (dt = '9999-12-31') select data.id, concat(substr(data.name, 1, 1), '*') name, if(data.phone_num regexp '^(13[0-9]|14[01456879]|15[0-35-9]|16[2567]|17[0-8]|18[0-9]|19[0-35-9])\\d{8}$', concat(substr(data.phone_num, 1, 3), '*'), null) phone_num, if(data.email regexp '^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\\.[a-zA-Z0-9_-]+)+$', concat('*@', split(data.email, '@')[1]), null) email, data.user_level, data.birthday, data.gender, data.create_time, data.operate_time, '2022-06-08' start_date, '9999-12-31' end_date from ods_user_info_inc where dt = '2022-06-08' and type = 'bootstrap-insert'; ② 每日裝載 set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table dim_user_zip partition (dt) select id, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, if(rn = 2, date_sub('2022-06-09', 1), end_date) end_date, if(rn = 1, '9999-12-31', date_sub('2022-06-09', 1)) dt from ( select id, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date, row_number() over (partition by id order by start_date desc) rn from ( select id, name, phone_num, email, user_level, birthday, gender, create_time, operate_time, start_date, end_date from dim_user_zip where dt = '9999-12-31' union select id, concat(substr(name, 1, 1), '*') name, if(phone_num regexp '^(13[0-9]|14[01456879]|15[0-35-9]|16[2567]|17[0-8]|18[0-9]|19[0-35-9])\\d{8}$', concat(substr(phone_num, 1, 3), '*'), null) phone_num, if(email regexp '^[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(\\.[a-zA-Z0-9_-]+)+$', concat('*@', split(email, '@')[1]), null) email, user_level, birthday, gender, create_time, operate_time, '2022-06-09' start_date, '9999-12-31' end_date from ( select data.id, data.name, data.phone_num, data.email, data.user_level, data.birthday, data.gender, data.create_time, data.operate_time, row_number() over (partition by data.id order by ts desc) rn from ods_user_info_inc where dt = '2022-06-09' ) t1 where rn = 1 ) t2 ) t3; CTE : 共通表表達(dá)式 with aa as ( select * from test_table_part ), ab as (select * from aa) select * from aa join ab; select * from ( select * from test_table_part ) aa join ( select * from test_table_part ) bb; select * from dim_sku_full 拉鏈表設(shè)計(jì) -- 數(shù)據(jù)裝載 -- load -- save -- 增量表得數(shù)據(jù)操作一般都會(huì)寫2個(gè) -- 首日數(shù)據(jù)裝載 -- 每日數(shù)據(jù)裝載 -- 首日數(shù)據(jù)裝載 -- 同步方式:maxwell - 全量 - bootstrap - select * from user_info -- MySQL不保存行為數(shù)據(jù),也就意味著不保存歷史行為數(shù)據(jù) -- 拉鏈表會(huì)在當(dāng)前表得字段得基礎(chǔ)上,額外添加兩個(gè)字段(start, end),用于標(biāo)記狀態(tài)得有效范圍 -- start : 無法判斷開始范圍 -- end : 無法判斷 -- 折中地考慮 -- 從當(dāng)天開始,結(jié)束時(shí)間取時(shí)間極大值(避免數(shù)據(jù)頻繁修改) -- 分區(qū)策略 -- 絕大多數(shù)得維度表得分區(qū)策略都是以天為單位 -- 分區(qū)不能采用開始日期作為分區(qū)字段 -- 無法判斷數(shù)據(jù)是否為歷史狀態(tài)還是最新狀態(tài) -- 好得方式是使用結(jié)束時(shí)間為分區(qū)字段 任務(wù)調(diào)度器 保證每一層的SQL跑完再跑下一層 柚子快報(bào)激活碼778899分享:數(shù)據(jù)倉庫哈哈 相關(guān)鏈接
本文內(nèi)容根據(jù)網(wǎng)絡(luò)資料整理,出于傳遞更多信息之目的,不代表金鑰匙跨境贊同其觀點(diǎn)和立場(chǎng)。
轉(zhuǎn)載請(qǐng)注明,如有侵權(quán),聯(lián)系刪除。