Introduction to Hive Partition and Bucket Operations

Published: 2021-08-05 22:59:41  Source: 億速云 (Yisu Cloud)  Views: 151  Author: chen  Category: Cloud Computing

This article explains partition and bucket operations in Hive and walks through a worked example of each.

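Partition Operations

Hive partitioning is declared with PARTITIONED BY when a table is created. The partitioning dimension is not a column stored in the actual data; the value that identifies each partition is supplied when data is inserted. To query a particular partition, filter on the partition key in a WHERE clause, for example "WHERE tablename.partition_key > a". The syntax for creating a partitioned table is as follows.

CREATE TABLE table_name(...) PARTITIONED BY (dt STRING, country STRING)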

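1. Creating Partitions

Creating a partitioned table in Hive involves none of the complex partition types (range, list, hash, or composite partitioning). A partition column is also not an actual field of the table but one or more pseudo-columns, which means the table's data files do not store the partition column's values.

Create a simple partitioned table.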

hive> create table partition_test(member_id string,name string) partitioned by (stat_date string,province string) row format delimited fields terminated by ',';

This example declares stat_date and province as the partition columns. A partition can be created explicitly before it is used, for example:

hive> alter table partition_test add partition (stat_date='2015-01-18',province='beijing');

This creates one partition. Hive creates a corresponding directory in HDFS storage.

$ hadoop fs -ls /user/hive/warehouse/partition_test/stat_date=2015-01-18
/user/hive/warehouse/partition_test/stat_date=2015-01-18/province=beijing    <- the partition just created

Every partition has its own directory. In the example above, stat_date is the top level and province is the second level.
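The directory layout described above can be modeled in a few lines of Python. This is only a sketch of the naming convention, not Hive's implementation: the helper names `partition_path` and `partition_values` are hypothetical, and the escaping Hive applies to special characters in partition values is ignored.

```python
def partition_path(table_location, spec):
    """Build a partition directory from an ordered list of (column, value) pairs."""
    segments = [f"{column}={value}" for column, value in spec]
    return "/".join([table_location.rstrip("/")] + segments)


def partition_values(table_location, path):
    """Recover the partition column values from a partition directory path."""
    relative = path[len(table_location.rstrip("/")) + 1:]
    return dict(segment.split("=", 1) for segment in relative.split("/"))


location = "/user/hive/warehouse/partition_test"
path = partition_path(location, [("stat_date", "2015-01-18"), ("province", "beijing")])
print(path)
print(partition_values(location, path))
```

Because the partition values live in the path rather than in the data files, a filter on stat_date or province can be answered by choosing directories before any file is read.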

2. Inserting Data

An auxiliary non-partitioned table, partition_test_input, holds the data to be inserted into partition_test. The steps are as follows.

1) View the structure of the partition_test_input table:

hive> desc partition_test_input;

2) View the data in partition_test_input:

hive> select * from partition_test_input;

3) Insert data into one partition of partition_test:

hive> insert overwrite table partition_test partition(stat_date='2015-01-18',province='jiangsu') select member_id,name from partition_test_input where stat_date='2015-01-18' and province='jiangsu';

To insert into several partitions in one statement, put the FROM clause first and follow it with one INSERT ... SELECT per partition:

hive> from partition_test_input
    > insert overwrite table partition_test partition(stat_date='2015-01-18',province='jiangsu') select member_id,name where stat_date='2015-01-18' and province='jiangsu'
    > insert overwrite table partition_test partition(stat_date='2015-01-28',province='sichuan') select member_id,name where stat_date='2015-01-28' and province='sichuan'
    > insert overwrite table partition_test partition(stat_date='2015-01-28',province='beijing') select member_id,name where stat_date='2015-01-28' and province='beijing';

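3. Dynamic Partitions

Inserting into a partitioned table with the method above requires one INSERT per partition, which becomes tedious when the source data is large. Dynamic partitioning solves this: Hive routes each row to the matching partition based on the values returned by the query.

Dynamic partitioning is enabled with the following settings:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;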

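Using dynamic partitioning is straightforward. Suppose data is inserted under the stat_date='2015-01-18' partition and Hive is left to decide which province sub-partition each row belongs to. Here stat_date is the static partition column and province is the dynamic partition column. The dynamic column must appear last in the SELECT list.

hive> insert overwrite table partition_test partition(stat_date='2015-01-18',province) select member_id,name,province from partition_test_input where stat_date='2015-01-18';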

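Note that dynamic partitioning does not allow the main partition to be dynamic while a sub-partition is static, because every main partition would then have to create the partition defined by the static sub-partition column. The following parameters limit what a dynamic-partition insert may create.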
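The routing that dynamic partitioning performs can be sketched in Python as follows. This is an illustration of the idea under stated assumptions, not Hive's code: the function `route_rows` and the sample rows are hypothetical, and the sketch assumes, as in the INSERT above, that the dynamic partition values are the trailing fields of each selected row.

```python
from collections import defaultdict


def route_rows(rows, static_spec, dynamic_cols):
    """Group rows by target partition.

    The last len(dynamic_cols) fields of every row are treated as the
    dynamic partition values; the remaining fields are the stored data.
    """
    n = len(dynamic_cols)
    if n == 0:
        raise ValueError("at least one dynamic partition column is required")
    partitions = defaultdict(list)
    for row in rows:
        data, dynamic_values = row[:-n], row[-n:]
        spec = tuple(static_spec) + tuple(zip(dynamic_cols, dynamic_values))
        partitions[spec].append(data)
    return dict(partitions)


# Hypothetical rows shaped like: select member_id, name, province
rows = [
    ("m001", "alice", "jiangsu"),
    ("m002", "bob", "beijing"),
    ("m003", "carol", "jiangsu"),
]
result = route_rows(rows, [("stat_date", "2015-01-18")], ["province"])
for spec, data in result.items():
    print("/".join(f"{column}={value}" for column, value in spec), data)
```

Static columns come first in the partition spec and dynamic columns follow, which mirrors the rule that a dynamic main partition cannot be combined with a static sub-partition.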

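hive.exec.max.dynamic.partitions.pernode: the maximum number of dynamic partitions that each mapper or reducer node may create; exceeding it raises an error (default 100).

hive.exec.max.dynamic.partitions: the maximum total number of dynamic partitions that one DML statement may create (default 1000).

hive.exec.max.created.files: the maximum number of files that all mappers and reducers of a MapReduce job may create (default 100000).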

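Try to keep rows with the same partition column values in the same MapReduce task, so that each task creates as few new directories as possible. DISTRIBUTE BY groups rows with identical partition column values together: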

hive> insert overwrite table partition_test partition(stat_date,province) select member_id,name,stat_date,province from partition_test_input distribute by stat_date,province;
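To see why distributing by the partition columns helps, the following Python sketch counts output files under a simplified model in which each task writes one file per distinct partition it receives. The model, the function names, and the use of `zlib.crc32` as a stand-in hash are all assumptions made for illustration; Hive's actual distribution hash differs.

```python
import zlib


def files_created(row_partitions, num_tasks, assign):
    """Count files when each task writes one file per distinct partition it sees."""
    seen = [set() for _ in range(num_tasks)]
    for index, partition in enumerate(row_partitions):
        seen[assign(index, partition, num_tasks)].add(partition)
    return sum(len(partitions) for partitions in seen)


def round_robin(index, partition, num_tasks):
    """Rows are spread over tasks with no regard to their partition."""
    return index % num_tasks


def by_partition(index, partition, num_tasks):
    """Stand-in for DISTRIBUTE BY: the task is chosen from the partition key."""
    return zlib.crc32("/".join(partition).encode("utf-8")) % num_tasks


partitions = [
    ("2015-01-18", "jiangsu"),
    ("2015-01-18", "beijing"),
    ("2015-01-28", "sichuan"),
    ("2015-01-28", "beijing"),
]
# 24 hypothetical rows that cycle through the four partitions
row_partitions = [partitions[i % 4] for i in range(24)]

print(files_created(row_partitions, 3, round_robin))
print(files_created(row_partitions, 3, by_partition))
```

In this model the round-robin assignment lets every task see every partition, while assignment by partition key sends each partition to exactly one task, so the file count drops to the number of partitions.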

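Bucket Operations

A Hive table can be divided into partitions, and a table or partition can be divided further into buckets (BUCKET). Bucketing is declared with CLUSTERED BY on the table, and the data inside each BUCKET can be sorted with SORTED BY.

BUCKETs serve two main purposes.

1) Data sampling;

2) Speeding up certain queries, for example Map-Side Join.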

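Note in particular that CLUSTERED BY and SORTED BY do not affect how data is loaded. The user is responsible for loading the data correctly, including bucketing and sorting it. 'set hive.enforce.bucketing=true' makes Hive set the number of reducers in the final stage to match the number of BUCKETs. Alternatively, the user can set mapred.reduce.tasks to match the BUCKET count by hand. (In Hive 2.0 and later, bucketing is always enforced and the hive.enforce.bucketing property has been removed.) On earlier versions the recommended setting is: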

hive> set hive.enforce.bucketing=true;
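The idea behind bucketing is that a row's bucket is derived from a hash of its clustering column modulo the number of buckets. The Python sketch below illustrates this for an int column, assuming the older hashing scheme in which the hash of an int is the int itself; tables that use the newer bucketing version in Hive 3 hash differently, so treat this as a model rather than a prediction of file contents. The function name `bucket_for` is hypothetical.

```python
JAVA_INT_MAX = 0x7FFFFFFF  # mask that clears the sign bit of a 32-bit hash


def bucket_for(int_value, num_buckets):
    """Bucket index for an int clustering column, assuming hash(int) == int."""
    return (int_value & JAVA_INT_MAX) % num_buckets


ids = [1, 2, 3, 4, 5, 6]
buckets = {b: [i for i in ids if bucket_for(i, 2) == b] for b in range(2)}
print(buckets)
```

With 2 buckets, even ids land in bucket 0 and odd ids in bucket 1, which is why the number of reducers has to match the number of buckets when the table is populated.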

A worked example follows.

1) Create a temporary table student_tmp and load data into it.

hive> desc student_tmp;
hive> select * from student_tmp;

2) Create the student table.

hive> create table student(id int,age int,name string) partitioned by (stat_date string) clustered by (id) sorted by (age) into 2 buckets row format delimited fields terminated by ',';

3) Set the session property.

hive> set hive.enforce.bucketing=true;

4) Insert data.

hive> from student_tmp insert overwrite table student partition(stat_date='2015-01-19') select id,age,name where stat_date='2015-01-18' sort by age;

5) View the file directory.

$ hadoop fs -ls /user/hive/warehouse/student/stat_date=2015-01-19/

6) View sampled data.

hive> select * from student tablesample(bucket 1 out of 2 on id);

tablesample is the sampling clause. Its syntax is as follows.

tablesample(bucket x out of y [on colname])

x is the 1-based number of the bucket to sample, and y must be a multiple or a factor of the total number of BUCKETs in the table.
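The arithmetic behind "bucket x out of y" can be sketched in Python. The sketch assumes that a row is sampled when the hash of the sampled column modulo y equals x - 1, and that the hash of an int is the int itself (the older hashing scheme); both function names are hypothetical. The second function derives which bucket files can contain sampled rows, which shows why y being a factor or a multiple of the bucket count is the convenient case.

```python
JAVA_INT_MAX = 0x7FFFFFFF  # mask that clears the sign bit of a 32-bit hash


def in_sample(int_value, x, y):
    """True when a row falls in 'bucket x out of y', assuming hash(int) == int."""
    return (int_value & JAVA_INT_MAX) % y == x - 1


def buckets_to_read(num_buckets, x, y):
    """Bucket indexes that can hold sampled rows for a table bucketed on the same column."""
    if num_buckets % y == 0:
        # y is a factor of the bucket count: whole buckets are selected.
        return [b for b in range(num_buckets) if b % y == x - 1]
    if y % num_buckets == 0:
        # y is a multiple of the bucket count: part of a single bucket is selected.
        return [(x - 1) % num_buckets]
    # Otherwise the sample does not line up with bucket boundaries.
    return list(range(num_buckets))


ids = list(range(1, 9))
print([i for i in ids if in_sample(i, 1, 2)])
print(buckets_to_read(2, 1, 2))
print(buckets_to_read(4, 1, 2))
print(buckets_to_read(2, 3, 4))
```

For the student table with 2 buckets, "bucket 1 out of 2 on id" therefore corresponds to the rows stored in the first bucket under these assumptions.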

