hive (default)> load data local inpath '/opt/module/hive/tmp_data/stu_buck.txt' into table stu_buck;
Loading data to table default.stu_buck
Table default.stu_buck stats: [numFiles=1, totalSize=158]
OK
Time taken: 0.97 seconds
Check that the data is there
hive (default)> select * from stu_buck;
OK
stu_buck.id	stu_buck.name
1	ss1
...
16	ss16
Time taken: 0.039 seconds, Fetched: 16 row(s)
Check the WebUI
Looking at the WebUI at this point, there is only one file, not the four files the bucket setting calls for. The reason is that the load command actually invokes Hadoop's put command, which simply copies the file and does not bucket the data.
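Bucketing itself is just a hash rule: a row lands in bucket hash(bucketing column) % numBuckets, and for an int column Hive's hash is the value itself. A minimal Python sketch of that rule (the function name is mine, for illustration only):

```python
def bucket_of(key: int, num_buckets: int) -> int:
    # Hive's rule: bucket = hash(col) % numBuckets;
    # for an int column the hash is the value itself.
    return key % num_buckets

print(bucket_of(1, 4))   # 1 -> second bucket (index 1)
print(bucket_of(16, 4))  # 0 -> first bucket (index 0)
```

Since `put` only moves bytes, no process ever applies this rule to the rows, which is why all 16 rows sit in a single file.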
Bucketing
First, clear out the data in the stu_buck table
hive (default)> truncate table stu_buck;
OK
Time taken: 0.077 seconds
Create a non-bucketed table
hive (default)> create table stu(id int, name string)
              > row format delimited fields terminated by '\t';
OK
Time taken: 0.12 seconds
Load the data into the non-bucketed table
hive (default)> load data local inpath '/opt/module/hive/tmp_data/stu_buck.txt' into table stu;
Loading data to table default.stu
Table default.stu stats: [numFiles=1, numRows=0, totalSize=110, rawDataSize=0]
OK
Time taken: 0.212 seconds
Use an MR job to load the data from the non-bucketed table into the bucketed table
hive (default)> insert into stu_buck
              > select * from stu;
Query ID = root_20200107104239_1fac1d2c-64a1-4bdb-be69-e39bc07b3b27
...
Time taken: 25.352 seconds
hive (default)> SELECT * FROM stu_buck;
OK
stu_buck.id	stu_buck.name
1	ss1
...
16	ss16
Time taken: 0.061 seconds, Fetched: 16 row(s)
hive (default)> truncate table stu_buck;
OK
Time taken: 0.062 seconds
Change the properties
hive (default)> set hive.enforce.bucketing=true;
hive (default)> set mapreduce.job.reduces=-1;
Reload the data
hive (default)> insert into stu_buck
              > select * from stu;
Query ID = root_20200107104826_96ec5d1b-9587-43d0-b0cf-3b6cfb18ae7e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 4
...
Stage-Stage-1: Map: 1  Reduce: 4   Cumulative CPU: 14.83 sec   HDFS Read: 15410 HDFS Write: 398 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 830 msec
OK
stu.id	stu.name
Time taken: 31.441 seconds
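With hive.enforce.bucketing=true and mapreduce.job.reduces=-1, Hive picks one reducer per bucket (4 here, as the "Number of reduce tasks determined at compile time: 4" line shows), and each reducer writes one output file. A hypothetical Python sketch of that shuffle step (the part-file names are illustrative of MR output naming, not taken from this job):

```python
from collections import defaultdict

NUM_REDUCERS = 4  # Hive set this to the table's bucket count

def partition(key: int) -> int:
    # The shuffle routes each row to reducer hash(key) % numReducers;
    # for an int bucketing column the hash is the value itself.
    return key % NUM_REDUCERS

files = defaultdict(list)  # reducer id -> rows in its output file
for row_id in range(1, 17):  # ids 1..16 from table stu
    files[partition(row_id)].append(row_id)

for r in sorted(files):
    print(f"{r:06d}_0:", files[r])  # one HDFS file per reducer
```

The first file holds ids 4, 8, 12, 16 (all with id % 4 == 0), which is exactly what the tablesample queries below read back bucket by bucket.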
Syntax: select * from table_name tablesample(bucket x out of y on field_name)
Meaning:
y must be a multiple or a factor of the number of buckets; Hive uses y to decide the sampling ratio. For example, with 4 buckets, when y is 2 the data of (4/2=2) buckets is sampled, and when y is 8 the data of (4/8=1/2) a bucket is sampled.
x indicates which bucket to start sampling from; when more than one bucket is needed, each subsequent bucket number is the previous one plus y. For example, with 4 buckets, tablesample(bucket 1 out of 2) samples (4/2=2) buckets in total: the 1st (x) and the 3rd (x+y).
x must be less than or equal to y.
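For an int column, these rules boil down to keeping the rows where hash(id) % y == x - 1 (with the hash of an int being the value itself). A Python sketch of that selection, which reproduces the sample results below (the function is mine, for illustration):

```python
def tablesample(rows, x, y):
    """Rows kept by tablesample(bucket x out of y on col),
    assuming an int column whose hash is its own value."""
    assert x <= y, "Numerator should not be bigger than denominator"
    return [r for r in rows if r % y == x - 1]

ids = list(range(1, 17))           # the 16 ids in stu_buck
print(tablesample(ids, 1, 4))      # [4, 8, 12, 16]
print(tablesample(ids, 1, 8))      # [8, 16]
print(tablesample(ids, 2, 2))      # [1, 3, 5, 7, 9, 11, 13, 15]
```

With 4 buckets, x=1 and y=4 reads exactly one whole bucket; y=8 reads half of one bucket; y=2 reads two buckets' worth of rows.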
hive (default)> select * from stu_buck tablesample(bucket 1 out of 4 on id);
OK
stu_buck.id	stu_buck.name
16	ss16
12	ss12
8	ss8
4	ss4
Time taken: 0.133 seconds, Fetched: 4 row(s)
hive (default)> select * from stu_buck tablesample(bucket 2 out of 4 on id);
OK
stu_buck.id	stu_buck.name
9	ss9
5	ss5
1	ss1
13	ss13
Time taken: 0.04 seconds, Fetched: 4 row(s)
hive (default)> select * from stu_buck tablesample(bucket 2 out of 2 on id);
OK
stu_buck.id	stu_buck.name
9	ss9
5	ss5
1	ss1
13	ss13
3	ss3
11	ss11
7	ss7
15	ss15
Time taken: 0.038 seconds, Fetched: 8 row(s)
hive (default)> select * from stu_buck tablesample(bucket 1 out of 8 on id);
OK
stu_buck.id	stu_buck.name
16	ss16
8	ss8
Time taken: 0.049 seconds, Fetched: 2 row(s)
hive (default)> select * from stu_buck tablesample(bucket 3 out of 2 on id);
FAILED: SemanticException [Error 10061]: Numerator should not be bigger than denominator in sample clause for table stu_buck