Hive優化的示例分析

發布時間：2021-12-10 10:39:07 來源：億速云閱讀：190 作者：小新欄目：云計算

這篇文章將為大家詳細講解有關Hive優化的示例分析，小編覺得挺實用的，因此分享給大家做個參考，希望大家閱讀完這篇文章后可以有所收獲。

一、什么時候可以避免執行MapReduce？select *where語句中只有分區字段
二、Jion優化驅動表最右邊查詢表表的大小從左邊到右邊依次增大標志機制顯示的告知查詢優化器哪張表示大表/*+streamtable(table_name)*/
三、Map-side聚合sethive.map.aggr=true;這個設置可以將頂層的聚合操作放在Map階段執行，從而減輕清洗階段數據傳輸和Reduce階段的執行時間，提升總體性能。缺點：該設置會消耗更多的內存。執行select count(1) from wlan;
四、Localhadoop本地模式SETmapred.job.tracker=local; 測試 select 1 from wlan limit 5;
下面兩個參數是local mr中常用的控制參數:
1,hive.exec.mode.local.auto.inputbytes.max設置local mr的最大輸入數據量,當輸入數據量小于這個值的時候會采用local mr的方式
2,hive.exec.mode.local.auto.tasks.max設置local mr的最大輸入文件個數,當輸入文件個數小于這個值的時候會采用local mr的方式默認執行模式：

hive (default)> select count(1) t1;

Query ID = root_20150611185656_333185b7-e8b3-40b5-bc4c-2f11978f9822

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

sethive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

sethive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1433931422330_0001,Tracking URL = http://crxy176:8088/proxy/application_1433931422330_0001/

Kill Command = /usr/local/hadoop-2.6.0/bin/hadoop job -kill job_1433931422330_0001

Hadoop job information for Stage-1: number of mappers: 1; number ofreducers: 1

2015-06-11 18:56:44,749 Stage-1 map = 0%, reduce = 0%

2015-06-11 18:56:57,029 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.8 sec

2015-06-11 18:57:11,050 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.2 sec

MapReduce Total cumulative CPU time: 4 seconds 200 msec

Ended Job = job_1433931422330_0001

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce:1 Cumulative CPU: 4.2 sec HDFS Read: 312 HDFS Write: 2 SUCCESS

Total MapReduce CPU Time Spent: 4 seconds 200 msec

100

Time taken: 46.573 seconds, Fetched: 1 row(s)

對比啟動Localhadoop模式：

hive (default)> select count(1) t1;

Automatically selecting local only mode for query

Query ID = root_20150611185555_97e1a1d0-1958-4f35-8ea7-8face4cda85f

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

sethive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

sethive.exec.reducers.max=<number>

In order to set a constant number of reducers:

setmapreduce.job.reduces=<number>

Job running in-process (local Hadoop)

Hadoop job information for Stage-1: number of mappers: 0; number ofreducers: 0

2015-06-11 18:55:25,123 Stage-1 map = 100%, reduce = 100%

Ended Job = job_local1510342541_0004

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 12HDFS Write: 22 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

100

Time taken: 1.721 seconds, Fetched: 1 row(s)

五、索引Hive中的索引架構開放了一個接口，允許你根據這個接口去實現自己的索引。目前Hive自己有一個參考的索引實現（CompactIndex），后來在0.8版本中又加入位圖索引。這里就講講CompactIndex。

/*在index_test_table表的id字段上創建索引*/

create index idx on table index_test_table(id)

as 'org.apache.Hadoop.Hive.ql.index.compact.CompactIndexHandler'with deferred rebuild;

alter index idx on index_test_table rebuild;

/*索引的剪裁。找到上面建的索引表，根據你最終要用的查詢條件剪裁一下。*/

/*如果你想跟RDBMS一樣建完索引就用，那是不行的，會直接報錯，這也是其麻煩的地方*/

create table my_index

as select _bucketname， `_offsets`

from default__index_test_table_idx__ where id = 10;

/*現在可以用索引了，注意最終查詢條件跟上面的剪裁條件一致*/

set Hive.index.compact.file = /user/Hive/warehouse/my_index;

set Hive.input.format =org.apache.Hadoop.Hive.ql.index.compact.HiveCompactIndexInputFormat;

select count(*) from index_test_table where id = 10;

六、數據傾斜所謂數據傾斜，說的是由于數據分布不均勻，個別值集中占據大部分數據量，加上Hadoop的計算模式，導致計算資源不均勻引起性能下降。

傾斜分成group by造成的傾斜和join造成的傾斜：

一個是Hive.Map.aggr，默認值已經為true，意思是會做Map端的combiner。所以如果你的group by查詢只是做count(*)的話，其實是看不出傾斜效果的，但是如果你做的是count(distinct)，那么還是會看出一點傾斜效果。

另一個參數是Hive.groupby. skewindata。這個參數的意思是做Reduce操作的時候，拿到的key并不是所有相同值給同一個Reduce，而是隨機分發，然后Reduce做聚合，做完之后再做一輪MR，拿前面聚合過的數據再算結果。所以這個參數其實跟Hive.Map.aggr做的是類似的事情，只是拿到Reduce端來做，而且要額外啟動一輪Job，所以其實不怎么推薦用，效果不明顯。

改寫SQL來優化

/*改寫前*/

select a， count(distinctb) as c from tbl group by a;

/*改寫后*/

select a， count(*) as c

from (selectdistinct a， b from tbl) group by a;

七、Job間并行

首先，在Hive生成的多個Job中，在有些情況下Job之間是可以并行的，典型的就是子查詢。當需要執行多個子查詢unionall或者join操作的時候，Job間并行就可以使用了。比如下面的代碼就是一個可以并行的場景示意：

hive>FROM t4

INSERT OVERWRITE TABLE t3PARTITION (...) SELECT ...WHERE...

關于“Hive優化的示例分析”這篇文章就分享到這里了，希望以上內容可以對大家有一定的幫助，使各位可以學到更多知識，如果覺得文章不錯，請把它分享出去讓更多的人看到。

向AI問一下細節

亚洲激情专区-91九色丨porny丨老师-久久久久久久女国产乱让韩-国产精品午夜小视频观看

Hive優化的示例分析

猜你喜歡

亚洲激情专区-91九色丨porny丨老师-久久久久久久女国产乱让韩-国产精品午夜小视频观看

Hive優化的示例分析

猜你喜歡

最新資訊

相關推薦

相關標簽