This article aims to cover, as briefly as possible, the installation of Hive 1.2.1 on Ubuntu 16.04; it is intended for readers who have installed Hive or Hadoop before.
The article first covers installing Hive in local mode, then in distributed mode.
About Hive
Hive is a data warehouse tool built on Hadoop. With Hive, data-processing logic can be described directly in a SQL-like language, sparing developers from writing complex Java-based MapReduce programs for big-data query and analysis jobs. In other words, Hive abstracts MapReduce behind SQL-like statements: when a statement is executed, Hive translates it into MapReduce tasks and runs them. Clearly, Hive depends on Hadoop; moreover, unlike HBase, Hive requires HDFS and cannot use the local file system. Hive works on top of Hadoop's distributed storage (HDFS and HBase) and the MapReduce parallel-computing framework.
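For instance, once Hive is up (the invites table used here is created later in this article), the EXPLAIN statement offers a peek at this translation: it prints the plan a query compiles to, including the MapReduce stage(s) Hive will launch.
EXPLAIN SELECT ds, COUNT(*) FROM invites GROUP BY ds;
-- Prints the stage plan; a grouped aggregation like this compiles to a map/reduce stage.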
Download and Initialize Hive
This article assumes you already have Hadoop installed.
Download
Use the Hive mirror from CNNIC:
$ cd ~
$ wget http://mirrors.cnnic.cn/apache/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz
$ tar -xzf apache-hive-1.2.1-bin.tar.gz
$ sudo mv apache-hive-1.2.1-bin /usr/local/hive
Initialize Hive
Hadoop Path
Since Hive depends on Hadoop, the Hadoop path must be set. If you installed Hadoop as described in the earlier article, its path is /usr/local/hadoop. Set HADOOP_HOME:
$ cd /usr/local/hive
$ cp conf/hive-env.sh.template conf/hive-env.sh
$ nano conf/hive-env.sh
Add HADOOP_HOME to the file (adjust to your actual path):
HADOOP_HOME=/usr/local/hadoop
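To verify the setting (assuming Hadoop really is at /usr/local/hadoop), the Hive CLI can print its version without opening a session:
$ bin/hive --version
# A Hive 1.2.1 version banner should appear; a failure here usually points at a wrong HADOOP_HOME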
Initialize the Configuration
$ cp conf/hive-default.xml.template conf/hive-default.xml
All subsequent operations take place under /usr/local/hive.
Local Mode
In local mode, Hive runs on the local machine's Hadoop environment; only HDFS is required, with no need for YARN or the like.
Configuration
Create conf/hive-site.xml with the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
In effect, this overrides Hadoop's configuration.
Test
Start Hive
You must start HDFS before starting Hive!
$ /usr/local/hadoop/sbin/start-dfs.sh
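Optionally, confirm that the HDFS daemons are up before launching Hive (jps ships with the JDK):
$ jps
# Expect NameNode, DataNode, and SecondaryNameNode among the listed Java processes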
Then:
$ bin/hive
The following prompt appears:
hive>
Create a Table
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
The output is:
OK
Time taken: 11.741 seconds
Show Tables
SHOW TABLES;
The output is:
OK
invites
Time taken: 0.962 seconds, Fetched: 1 row(s)
DESCRIBE invites;
The output is:
OK
foo int
bar string
ds string
# Partition Information
# col_name data_type comment
ds string
Time taken: 1.44 seconds, Fetched: 8 row(s)
Alter Table Columns
ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');
The outputs are, respectively:
OK
Time taken: 0.804 seconds
OK
Time taken: 0.577 seconds
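To confirm that the column changes took effect, you can describe the table again:
DESCRIBE invites;
-- foo, bar, and baz should now be listed, along with the ds partition column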
Load Data from Files into the Table
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
The outputs are, respectively:
Loading data to table default.invites partition (ds=2008-08-15)
Partition default.invites{ds=2008-08-15} stats: [numFiles=1, numRows=0, totalSize=5791, rawDataSize=0]
OK
Time taken: 4.879 seconds
Loading data to table default.invites partition (ds=2008-08-08)
Partition default.invites{ds=2008-08-08} stats: [numFiles=1, numRows=0, totalSize=216, rawDataSize=0]
OK
Time taken: 1.607 seconds
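As a sanity check on the load (this count itself runs as a MapReduce task), the partition loaded from kv2.txt should report 500 rows, matching the 500 rows fetched by the query in the next step:
SELECT COUNT(*) FROM invites WHERE ds='2008-08-15';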
SQL Queries
A direct query:
SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
The output is:
...
285
35
227
395
244
Time taken: 3.851 seconds, Fetched: 500 row(s)
Write the query results into HDFS:
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
This runs as a MapReduce job; the output is:
Query ID = hadoop_20160527103747_2efd85ec-b858-4fcb-8a9a-df6aa90b4d7f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-05-27 10:37:54,169 Stage-1 map = 0%, reduce = 0%
2016-05-27 10:37:55,222 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local718531458_0001
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/tmp/hdfs_out/.hive-staging_hive_2016-05-27_10-37-47_317_3150862765548535991-1/-ext-10000
Moving data to: /tmp/hdfs_out
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 11582 HDFS Write: 18798 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 8.535 seconds
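The result files can then be inspected directly in HDFS; the part-file name 000000_0 below is the typical one, but may differ:
$ /usr/local/hadoop/bin/hdfs dfs -ls /tmp/hdfs_out
$ /usr/local/hadoop/bin/hdfs dfs -cat /tmp/hdfs_out/000000_0
# Each line is one row; by default columns are separated by the ^A (\001) control character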
Drop the Table
DROP TABLE invites;
The output is:
OK
Time taken: 4.013 seconds
Exit Hive
quit;
If necessary, also stop HDFS:
$ /usr/local/hadoop/sbin/stop-dfs.sh
Distributed Mode
In distributed mode, Hive needs the support of YARN or the like.
Initialization
Configuration
You have two options. One is simply to remove hive-site.xml:
$ mv conf/hive-site.xml conf/hive-site.xml.bak
The other is to change the value of mapreduce.framework.name in hive-site.xml to yarn, i.e.:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
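Whichever option you choose, the effective value can be confirmed from inside the Hive CLI:
SET mapreduce.framework.name;
-- In distributed mode this should print mapreduce.framework.name=yarn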
Create Directories
Hive stores its data in HDFS, so the directories must be created in advance (-p is added so that missing parents and already-existing directories are handled). Under the default configuration:
$ /usr/local/hadoop/bin/hdfs dfs -mkdir -p /tmp
$ /usr/local/hadoop/bin/hdfs dfs -mkdir -p /user/hive/warehouse
$ /usr/local/hadoop/bin/hdfs dfs -chmod g+w /tmp
$ /usr/local/hadoop/bin/hdfs dfs -chmod g+w /user/hive/warehouse
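A quick way to verify the directories and their permissions (-ls -d lists the directories themselves rather than their contents):
$ /usr/local/hadoop/bin/hdfs dfs -ls -d /tmp /user/hive/warehouse
# Both entries should show group write permission, e.g. drwxrwxr-x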
Test
Start Hive
You must start HDFS and YARN before starting Hive:
$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh
Then:
$ bin/hive
The following prompt appears:
hive>
Test Table Operations
Table operations are tested exactly as in local mode, with only minor differences in the displayed results. For the step "write the query results into HDFS", the output is:
Query ID = hadoop_20160527105353_49d6a7ec-5124-4e07-bd68-87e20bf87278
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1464317481996_0001, Tracking URL = http://localhost:8088/proxy/application_1464317481996_0001/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1464317481996_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-05-27 10:55:02,780 Stage-1 map = 0%, reduce = 0%
2016-05-27 10:55:28,781 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.04 sec
MapReduce Total cumulative CPU time: 7 seconds 40 msec
Ended Job = job_1464317481996_0001
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/tmp/hdfs_out/.hive-staging_hive_2016-05-27_10-53-53_594_4595666558140512513-1/-ext-10000
Moving data to: /tmp/hdfs_out
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 7.04 sec HDFS Read: 9165 HDFS Write: 12791 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 40 msec
OK
Time taken: 98.675 seconds
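Once the job finishes, it can also be listed from the command line with the yarn CLI that ships with Hadoop (a quick cross-check; -appStates filters by application state):
$ /usr/local/hadoop/bin/yarn application -list -appStates FINISHED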
This time the MapReduce job has an ID; you can view the job by visiting http://localhost:8088/ in a browser.
Exit Hive
quit;
If necessary, also stop HDFS and YARN:
$ /usr/local/hadoop/sbin/stop-dfs.sh
$ /usr/local/hadoop/sbin/stop-yarn.sh
Starting and Stopping Hive
Start
After starting HDFS and YARN:
bin/hive
Stop
Before stopping HDFS and YARN:
quit;
Summary
From the standpoint of installation and operation, one obvious difference between Hive and HBase is that Hive has no daemon processes and needs no startup scripts. This is easy to understand: a Hive command execution always has a beginning and an end, so there is no environment to keep alive. Note that what this article calls starting and stopping Hive is, strictly speaking, entering and exiting the Hive CLI (command line interface).
References
- 黄宜华, 苗凯翔. "深入理解大数据:大数据处理与编程实践". 机械工业出版社, 2014.
- GettingStarted - Apache Hive - Apache Software Foundation