hive与sql的一些点滴

hive: Hive可以将结构化的数据文件映射为一张数据库表，并提供完整的SQL查询功能，可以将SQL语句转换为MapReduce任务进行运行，通过自己的SQL去查询分析需要的内容，这套SQL 简称Hive SQL.而mapreduce开发人员可以把己写的mapper 和reducer 作为插件来支持Hive 做更复杂的数据分析。

I1. 顺序

SQL的执行顺序：

第一步：执行FROM
第二步：WHERE条件过滤
第三步：GROUP BY分组
第四步：HAVING条件过滤
第五步：执行SELECT投影列
第六步：执行ORDER BY 排序

     (1)from 
    (3) join 
    (2) on 
    (4) where 
    (5)group by(开始使用select中的别名，后面的语句中都可以使用)
    (6) avg,sum.... 
    (7)having 
    (8) select 
    (9) distinct 
    (10) order by
    

上面执行顺序在where里的条件有索引条件下是先执行where，如果where里的筛选条件没有索引，则会把所有的select出来，取满足where条件的，性能瓶颈较大。索引添加经验：范围内的放到后面，固定的字段放到索引前面。

建表索引：
create table table(
  id int unsigned not null auto_increment comment '自增主键',
  article_id int(11) not null comment '文章id',  
  create_time datetime not null default current_timestamp comment '创建时间',
  primary key (id),
  index idx_update_time(article_id,update_time),       #普通索引名以idx开头，联合索引
  unique index uniq_article_id(article_id)  #唯一索引名以uniq开头
)engine=innodb auto_increment=1 charset=utf8mb4 comment='例子表';

建表后针对某个字段增加索引：
alter table table add index idx_start_time(start_time);

I2. sql常用操作:

DISTINCT即去重. as : 定义别名. having:查询平均年龄在20岁以上的班级

SELECT student_class, AVG(student_age) FROM t_student WHERE AVG(student_age)>20 GROUP BY student_class;

错误,(聚合函数在WHERE之后执行，所以这里在WHERE判断条件里加入聚合函数是做不到的),下面sql是正确

SELECT student_class,AVG(student_age) AS 平均年龄 FROM t_student GROUP BY (student_class) HAVING
AVG(student_age)>20;

通配符过滤用like 单字符通配符‘_’ 多字符通配符‘%’

A LIKE B 模糊匹配 A RLIKE B 正则表达式匹配

select * from T_Employee where FName like '_erry'

WHERE column_name BETWEEN value1 AND value2 
操作符 BETWEEN ... AND 会选取介于两个值之间的数据范围。这些值可以是数值、文本或者日期。

group by: 分组,只能针对某一个列或者某个列的聚合函数
having: 针对聚合函数分组的过滤(多行),where针对每一行的过滤
group by 多个字段,group by name,number，我们可以把name和number 看成一个整体字段，以他们整体来进行分组的
判断是否null, where field is null

安全模式下无法批量更新
批量更新时候，安全模式关闭，结束后再开启
SET SQL_SAFE_UPDATES = 0;
update table_name set field_name= 1000+filed_name;
SET SQL_SAFE_UPDATES = 1;

更新update操作

insert行

alter增加字段

I3.sql聚合函数汇总

count():求满足条件的记录数(COUNT()函数会忽略指定列的值为空的行，但如果 COUNT()函数中用的是星号（*），则不忽略)(count(1))
max():求最大值
min():求最小值
sum():求和只能对数值进行计算不能运用于日期和字符串
avg():求平均值

I4. hive内置函数

hive内置函数

floor(21.2),返回21
round(21.263,2),返回21.26
abs(double a)绝对值
from_unixtime(tmsp): 返回当前时区的时间YYYY- MM – DD HH：MM：SS
unix_timestamp():如果不带参数的调用，返回一个Unix时间戳（从’1970- 01 – 0100:00:00′到现在的UTC秒数）为无符号整数
unix_timestamp(string date): 指定日期参数调用UNIX_TIMESTAMP（），它返回参数值’1970- 01 – 0100:00:00′到指定日期的秒数。
date(timesp): 将时间戳转换为年-月-日
to_date(string timestamp): 返回时间中的年月日： to_date(“1970-01-01 00:00:00″) = “1970-01-01″
year(string date),month(string date),day(string date)等
length(string A): 返回字符串的长度
reverse(string A): 返回倒序字符串
concat(string A, string B…):连接多个字符串，合并为一个字符串
lower(string A): 小写
upper(string A): 大写
get_json_object(string json_string, string path): 处理json数据
regexp_extract(string subject, string pattern, int index): 通过下标返回正则表达式指定的部分。regexp_extract(‘foothebar’, ‘foo(.*?)(bar)’, 2) returns ‘bar.’
parse_url(string urlString, string partToExtract [, string keyToExtract]):返回URL指定的部分。parse_url(‘http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1′, ‘HOST’) 返回：’facebook.com’
A IN (val1, val2, ...)
A NOT IN (val1, val2, ...)

json_tuple():Takes a set of names (keys) and a JSON string, and returns a tuple of values. This is a more efficient version of the get_json_object UDF because it can get multiple keys with just one call.

example:
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appenvets, '$.eventname') from log a;

应该用下面替换上面的:

select a.timestamp, b.*
from log a lateral view json_tuple(a.appevent, 'eventid', 'eventname') b as f1, f2;

parse_url_tuple比parser_url更高效
数据:
url1 http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1
url2 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-getjsonobject

create external table if not exists t_url(f1 string, f2 string) row format delimited fields TERMINATED BY ' ' location '/test/url';

SELECT f1, b.* FROM t_url LATERAL VIEW parse_url_tuple(f2, 'HOST', 'PATH', 'QUERY', 'QUERY:k1') b as host, path, query, query_id;
结果:
url1 facebook.com /path1/p.php k1=v1&k2=v2 v1
url2 cwiki.apache.org /confluence/display/Hive/LanguageManual+UDF NULL NULL

I5. hive聚合函数

包括了sql的聚合函数
var_pop(col): 返回指定列的方差
var_samp(col): 返回指定列的样本方差
还有偏差，协方差，相关系数等

I6. hive常用命令

hive的常用参数如下

-d,–define <key=value> Variable substitution to apply to hive commands. e.g. -d A=B or –define A=B
-e SQL from command line
-f SQL from files
-i Initialization SQL file
-S,–silent Silent mode in interactive shell
-v,–verbose Verbose mode (echo executed SQL to the console)

需要说明的

LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
local表示从本地导入数据到table
没有local表示从hdfs导入数据
关键词 'OVERWRITE' 意味着当前表中已经存在的数据将会被删除掉。
如果没有给出 'OVERWRITE'，则意味着数据文件将追加到当前的数据集中。

通过Hive导出到本地文件系统
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/t_hive' SELECT * FROM t_hive;
没有local则导出到hdfs中(s3)

CREATE EXTERNAL TABLE table1(date_1 string,resorce string,data string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as textfile
LOCATION 's3://**';
创建s3(or hdfs)上的映射表

insert overwrite directory "s3://**"
select *
from table1;
查询table1内容查询结果导入到s3(or hdfs)里

关于分区:
分区表使得查询时指定分区而不必查询整个表，节约时间
分区表的字段在整个表里其实就是一个普通的字段,每个分区以文件夹的形式单独存在表文件夹的目录下

建（映射表)分区:
create external table table_name(filed type) PARTITIONED BY (partition_field type)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' stored as textfile
location '**'
添加数据到分区:
alter table table_name add IF NOT EXISTS partition(partition_field='**') location '**';
删除分区:
ALTER TABLE table_name DROP partition_spec, partition_spec,...

将本地数据添加到分区中:
load data local inpath '/home/hadoop/Desktop/data.txt' overwrite into table rable_name partition (ds='2015-11-11'); 
往一个分区表的某个分区中添加数据:
insert overwrite table table_name partition (ds='**') select id,max(name) from test group by id; 
分区中查询:
select * from table_name where partition_field ='**'; 

版权申明

本作品采用知识共享署名-非商业性使用 4.0 国际许可协议进行许可。转载文章请注明原文出处。

天道酬勤

其实，我是一名文字工作者，同时，我也是一名技术宅！

hive与sql的一些点滴

版权申明