select deptno, count(*) from emp group by mgr。这条语句如果直接执行是会报错的,此时就可以用窗口函数来解决。
窗口函数
OVER:指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变化而变化。
OVER 函数可能出现的连接属性:
用在 over 函数内的
CURRENT ROW:当前行 n PRECEDING:往前 n 行数据 n FOLLOWING:往后 n 行数据 UNBOUNDED:起点 UNBOUNDED PRECEDING:从前面的起点 UNBOUNDED FOLLOWING:到后面的起点
用在 over 函数外
LAG(col, n, default_value):往前第 n 行数据 LEAD(col, n, default_value):往后第 n 行数据 NTILE(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从 1 开始;对于每一行,NTILE 返回慈航所属组的编号;n 必须是 int 类型
hive (default)> create table business( > name string, > orderdate string, > cost int > ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; OK Time taken: 0.072 seconds
需求 1
1
selectname, count(*) from business wheresubstring(orderdate, 1, 7) = '2017-04'groupbyname;
用此语句的查询结果为:
1 2 3 4
name _c1 jack 1 mart 4 Time taken: 36.572 seconds, Fetched: 2 row(s)
hive (default)> select *, sum(cost) over(distribute by name) from business;
...
OK business.name business.orderdate business.cost sum_window_0 jack 2017-01-05 46 176 jack 2017-01-08 55 176 jack 2017-01-01 10 176 jack 2017-04-06 42 176 jack 2017-02-03 23 176 mart 2017-04-13 94 299 mart 2017-04-11 75 299 mart 2017-04-09 68 299 mart 2017-04-08 62 299 neil 2017-05-10 12 92 neil 2017-06-12 80 92 tony 2017-01-04 29 94 tony 2017-01-02 15 94 tony 2017-01-07 50 94 Time taken: 19.168 seconds, Fetched: 14 row(s)
hive (default)> select *, sum(cost) over(distribute by name sort by orderdate) from business; Query ID = root_20200108153257_3b1c7cdb-ed63-4f8d-ac93-46a090a51c27
...
OK business.name business.orderdate business.cost sum_window_0 jack 2017-01-01 10 10 jack 2017-01-05 46 56 jack 2017-01-08 55 111 jack 2017-02-03 23 134 jack 2017-04-06 42 176 mart 2017-04-08 62 62 mart 2017-04-09 68 130 mart 2017-04-11 75 205 mart 2017-04-13 94 299 neil 2017-05-10 12 12 neil 2017-06-12 80 92 tony 2017-01-02 15 15 tony 2017-01-04 29 44 tony 2017-01-07 50 94 Time taken: 17.41 seconds, Fetched: 14 row(s)
hive (default)> select *, lag(orderdate, 1, '1970-01-01') over(distribute by name sort by orderdate) from business; Query ID = root_20200108154129_6bc92c4b-4095-46a9-92bb-4e261d0cfcfa
OK business.name business.orderdate business.cost lag_window_0 jack 2017-01-01 10 1970-01-01 jack 2017-01-05 46 2017-01-01 jack 2017-01-08 55 2017-01-05 jack 2017-02-03 23 2017-01-08 jack 2017-04-06 42 2017-02-03 mart 2017-04-08 62 1970-01-01 mart 2017-04-09 68 2017-04-08 mart 2017-04-11 75 2017-04-09 mart 2017-04-13 94 2017-04-11 neil 2017-05-10 12 1970-01-01 neil 2017-06-12 80 2017-05-10 tony 2017-01-02 15 1970-01-01 tony 2017-01-04 29 2017-01-02 tony 2017-01-07 50 2017-01-04 Time taken: 24.159 seconds, Fetched: 14 row(s)
需求 7
ntile(5) over(order by orderdate):将数组分为 5 个组,按照时间排序 where ntile_5 = 1:查询第一个组,即前 20%
1 2 3 4 5 6 7 8 9 10 11 12 13
hive (default)> select name, orderdate, cost from ( > select name, orderdate, cost, ntile(5) over(order by orderdate) ntile_5 from business > ) t1 where ntile_5 = 1; Query ID = root_20200108155352_e7cd86d2-1949-4cfc-91a8-18e6fe11a3ec
...
OK name orderdate cost jack 2017-01-01 10 tony 2017-01-02 15 tony 2017-01-04 29 Time taken: 17.475 seconds, Fetched: 3 row(s)
hive (default)> select *, rank() over(partition by subject order by score desc) rank1, > dense_rank() over(partition by subject order by score desc) rank2, > row_number() over(partition by subject order by score desc) rank3 > from score; Query ID = root_20200108160934_e99e5b37-0a2c-41c8-a545-4394a10f14e4