
Installing Hive 1.2.1 on Ubuntu 16.04

This article aims to cover the installation of Hive 1.2.1 on Ubuntu 16.04 as concisely as possible; it is intended for readers who have installed Hive or Hadoop before.
It first describes installing Hive in local mode, then in distributed mode.

About Hive

Hive is a data warehouse tool built on top of Hadoop. With Hive, data processing logic can be described directly in an SQL-like language, sparing developers from writing complex Java-based MapReduce programs for big-data query and analysis. In other words, Hive abstracts MapReduce behind SQL-like statements: when a statement is executed, Hive translates it into MapReduce jobs and runs them.
Clearly, Hive depends on Hadoop. Moreover, unlike HBase, Hive must rely on HDFS and cannot use the local file system; Hive works on top of Hadoop's distributed storage (HDFS and HBase) and the MapReduce parallel computing framework.

Downloading and Initializing Hive

This article assumes you already have Hadoop installed.

Download

Use the Hive mirror provided by CNNIC:

$ cd ~
$ wget http://mirrors.cnnic.cn/apache/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz
$ tar -xzf apache-hive-1.2.1-bin.tar.gz
$ sudo mv apache-hive-1.2.1-bin /usr/local/hive

Initializing Hive

Hadoop Path

Since Hive depends on Hadoop, the Hadoop path must be set. If Hadoop was installed following the earlier article, its path is /usr/local/hadoop.

Setting HADOOP_HOME

$ cd /usr/local/hive
$ cp conf/hive-env.sh.template conf/hive-env.sh
$ nano conf/hive-env.sh
Add HADOOP_HOME to it (adjust for your actual setup):
HADOOP_HOME=/usr/local/hadoop

Initializing the Configuration


$ cp conf/hive-default.xml.template conf/hive-default.xml
All subsequent operations are performed under /usr/local/hive.

Local Mode

In local mode, Hive runs on the local machine's Hadoop environment; only HDFS is needed, not YARN.

Configuration

Create conf/hive-site.xml with the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>
In effect, this overrides Hadoop's configuration.

Testing

Starting Hive

HDFS must be running before you start Hive!

$ /usr/local/hadoop/sbin/start-dfs.sh
Then:

$ bin/hive
You should see the prompt:
hive>

Creating a Table

CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
You should see:
OK
Time taken: 11.741 seconds

Showing Tables

SHOW TABLES;
You should see:
OK
invites
Time taken: 0.962 seconds, Fetched: 1 row(s)
DESCRIBE invites;
You should see:
OK
foo int
bar string
ds string

# Partition Information
# col_name data_type comment

ds string
Time taken: 1.44 seconds, Fetched: 8 row(s)

Altering Table Columns

ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
ALTER TABLE invites REPLACE COLUMNS (foo INT, bar STRING, baz INT COMMENT 'baz replaces new_col2');
These produce, respectively:
OK
Time taken: 0.804 seconds

OK
Time taken: 0.577 seconds

Loading Data from Files into the Table

LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
These produce, respectively:
Loading data to table default.invites partition (ds=2008-08-15)
Partition default.invites{ds=2008-08-15} stats: [numFiles=1, numRows=0, totalSize=5791, rawDataSize=0]
OK
Time taken: 4.879 seconds

Loading data to table default.invites partition (ds=2008-08-08)
Partition default.invites{ds=2008-08-08} stats: [numFiles=1, numRows=0, totalSize=216, rawDataSize=0]
OK
Time taken: 1.607 seconds
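Each partition loaded above becomes its own subdirectory under the table's directory in the HDFS warehouse (by default /user/hive/warehouse). A minimal sketch of the resulting layout; build_partition_path is a purely illustrative helper, not part of any Hive API:

```python
# Sketch of how Hive lays out partitioned table data in HDFS.
# The warehouse root below is Hive's default location; the helper
# itself is hypothetical, for illustration only.
WAREHOUSE = "/user/hive/warehouse"

def build_partition_path(table, **partition):
    """Build the HDFS directory a given partition's files live in."""
    parts = "/".join(f"{k}={v}" for k, v in partition.items())
    return f"{WAREHOUSE}/{table}/{parts}"

# The two LOAD DATA statements above populate these directories:
print(build_partition_path("invites", ds="2008-08-15"))
# /user/hive/warehouse/invites/ds=2008-08-15
print(build_partition_path("invites", ds="2008-08-08"))
# /user/hive/warehouse/invites/ds=2008-08-08
```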

SQL Queries

A direct query:
SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
You should see:
...
285
35
227
395
244
Time taken: 3.851 seconds, Fetched: 500 row(s)
Writing query results to HDFS:
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
This runs as a MapReduce job, producing:
Query ID = hadoop_20160527103747_2efd85ec-b858-4fcb-8a9a-df6aa90b4d7f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-05-27 10:37:54,169 Stage-1 map = 0%, reduce = 0%
2016-05-27 10:37:55,222 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local718531458_0001
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/tmp/hdfs_out/.hive-staging_hive_2016-05-27_10-37-47_317_3150862765548535991-1/-ext-10000
Moving data to: /tmp/hdfs_out
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 11582 HDFS Write: 18798 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 8.535 seconds
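The files written under /tmp/hdfs_out are plain text whose fields are separated by Hive's default delimiter, the non-printing character ^A (\x01). A hedged sketch of parsing one such row; the sample line is made up for illustration, not actual output from the invites table:

```python
# Hive's INSERT OVERWRITE DIRECTORY writes text files whose fields
# are separated by \x01 (Ctrl-A) by default. This parses one row.
HIVE_DELIM = "\x01"

def parse_row(line):
    """Split one Hive text-output row into its string fields."""
    return line.rstrip("\n").split(HIVE_DELIM)

# A made-up example row with foo, bar, and the ds partition column:
sample = "285\x01val_285\x012008-08-15\n"
print(parse_row(sample))  # ['285', 'val_285', '2008-08-15']
```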

Dropping the Table

DROP TABLE invites;
You should see:
OK
Time taken: 4.013 seconds

Exiting Hive

quit;
If necessary, also stop HDFS:

$ /usr/local/hadoop/sbin/stop-dfs.sh

Distributed Mode

In distributed mode, Hive also requires YARN.

Initialization

Configuration

You have two options.
The first is simply to remove hive-site.xml:

$ mv conf/hive-site.xml conf/hive-site.xml.bak
The second is to change the value of mapreduce.framework.name in hive-site.xml to yarn, i.e.:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/tmp/hadoop/mapred/local</value>
  </property>
</configuration>

Creating Directories

Hive stores its data in HDFS, so the directories must be created in advance (the -p flag also creates missing parents such as /user/hive). With the default settings:
$ /usr/local/hadoop/bin/hdfs dfs -mkdir -p /tmp
$ /usr/local/hadoop/bin/hdfs dfs -mkdir -p /user/hive/warehouse
$ /usr/local/hadoop/bin/hdfs dfs -chmod g+w /tmp
$ /usr/local/hadoop/bin/hdfs dfs -chmod g+w /user/hive/warehouse
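The /user/hive/warehouse path is the default value of Hive's hive.metastore.warehouse.dir property; if you want the data elsewhere, you can override it in hive-site.xml. The path shown here is just the default repeated for illustration:

```xml
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
```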

Testing

Starting Hive

Both HDFS and YARN must be running before you start Hive:

$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh
Then:
$ bin/hive
You should see the prompt:
hive>

Testing Table Operations

Table operations work exactly as in local mode, with minor differences in the output.
For the step "Writing query results to HDFS", you will see:
Query ID = hadoop_20160527105353_49d6a7ec-5124-4e07-bd68-87e20bf87278
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1464317481996_0001, Tracking URL = http://localhost:8088/proxy/application_1464317481996_0001/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1464317481996_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-05-27 10:55:02,780 Stage-1 map = 0%, reduce = 0%
2016-05-27 10:55:28,781 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.04 sec
MapReduce Total cumulative CPU time: 7 seconds 40 msec
Ended Job = job_1464317481996_0001
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to: hdfs://localhost:9000/tmp/hdfs_out/.hive-staging_hive_2016-05-27_10-53-53_594_4595666558140512513-1/-ext-10000
Moving data to: /tmp/hdfs_out
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 7.04 sec HDFS Read: 9165 HDFS Write: 12791 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 40 msec
OK
Time taken: 98.675 seconds
This time the MapReduce job has an ID; open http://localhost:8088/ in a browser to inspect it.

Exiting Hive

quit;
If necessary, also stop HDFS and YARN:

$ /usr/local/hadoop/sbin/stop-dfs.sh
$ /usr/local/hadoop/sbin/stop-yarn.sh

Starting and Stopping Hive

Starting

After starting HDFS and YARN:
$ bin/hive

Stopping

Before stopping HDFS and YARN:
quit;

Summary

From an installation and operation standpoint, one obvious difference between Hive and HBase is that Hive has no daemon processes and needs no startup scripts. This is easy to understand: a Hive command always has a beginning and an end, and there is no long-running environment to maintain. Note that "starting" and "stopping" Hive here strictly means entering and exiting the Hive CLI (command-line interface).

