Hive Practicals
First, let us start the Hadoop services and then open the Hive shell; both steps are shown below.
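Starting the services is not captured in the session below, so here is a minimal sketch, assuming the standard Hadoop 2.x sbin scripts and that HADOOP_HOME points at the hadoop-2.5.0-cdh5.3.2 installation referenced later:

$HADOOP_HOME/sbin/start-dfs.sh    # starts the NameNode, SecondaryNameNode and DataNode daemons
$HADOOP_HOME/sbin/start-yarn.sh   # starts the ResourceManager and NodeManager daemons
jps                               # verify that the daemons are running

With the daemons up, we can launch the Hive CLI: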
hdadmin@ubuntu:~/hive-0.13.1-cdh5.3.2$ bin/hive
Logging initialized using configuration in jar:file:/home/hdadmin/hive-0.13.1-cdh5.3.2/lib/hive-common-0.13.1-cdh5.3.2.jar!/hive-log4j.properties
hive> show tables;
OK
Time taken: 1.555 seconds
hive> CREATE TABLE tab01 (abc INT, xyz STRING);
OK
Time taken: 1.047 seconds
hive> show tables;
OK
tab01
Time taken: 0.057 seconds, Fetched: 1 row(s)
Here we have created a table; next we load the data file 'kv2.txt' into it. Rather than inserting rows one at a time, Hive loads whole files with the LOAD DATA statement. The LOCAL keyword means the source file is read from the local filesystem rather than from HDFS. The examples folder used below ships inside the Hive installation directory.
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE tab01;
Copying data from file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv2.txt
Copying file: file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv2.txt
Loading data to table default.tab01
Table default.tab01 stats: [numFiles=1, numRows=0, totalSize=5791, rawDataSize=0]
OK
Time taken: 3.654 seconds
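For contrast, a minimal sketch of the two forms of LOAD DATA (the HDFS path in the second statement is hypothetical, not from this session):

-- With LOCAL, the source file is read from the local filesystem and copied into the table's warehouse directory.
LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE tab01;
-- Without LOCAL, the path is an HDFS path and the file is moved into the table's directory.
LOAD DATA INPATH '/user/hdadmin/kv2.txt' OVERWRITE INTO TABLE tab01;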
When we type
select * from tab01;
it does not submit any MapReduce job; Hive simply fetches the result straight from the table's files.
But when we run an operation that involves computation, such as
select count(*) from tab01;
Hive submits a MapReduce job, because the query requires aggregation rather than a plain scan.
hive> select count(*) from tab01;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1512276014959_0001, Tracking URL = http://ubuntu:8088/proxy/application_1512276014959_0001/
Kill Command = /home/hdadmin/hadoop-2.5.0-cdh5.3.2/bin/hadoop job -kill job_1512276014959_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-12-02 21:54:54,891 Stage-1 map = 0%, reduce = 0%
2017-12-02 21:55:13,935 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.26 sec
2017-12-02 21:55:27,654 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.1 sec
MapReduce Total cumulative CPU time: 4 seconds 100 msec
Ended Job = job_1512276014959_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 4.1 sec  HDFS Read: 6000 HDFS Write: 4 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 100 msec
OK
500
Time taken: 56.123 seconds, Fetched: 1 row(s)
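One way to see the difference (a side check, not part of the captured session) is to run EXPLAIN on both queries; with the default settings the plain select typically compiles to a fetch-only plan, while the count(*) plan contains a MapReduce stage:

hive> EXPLAIN select * from tab01;
hive> EXPLAIN select count(*) from tab01;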
Now let us create a partitioned table and load data into it.
hive> CREATE TABLE tab02 (foo INT, bar STRING) PARTITIONED BY (mon STRING);
OK
Time taken: 0.36 seconds
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE tab02 PARTITION (mon='04');
Copying data from file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv2.txt
Copying file: file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv2.txt
Loading data to table default.tab02 partition (mon=04)
Partition default.tab02{mon=04} stats: [numFiles=1, numRows=0, totalSize=5791, rawDataSize=0]
OK
Time taken: 1.506 seconds
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE tab02 PARTITION (mon='07');
Copying data from file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv3.txt
Copying file: file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv3.txt
Loading data to table default.tab02 partition (mon=07)
Partition default.tab02{mon=07} stats: [numFiles=1, numRows=0, totalSize=216, rawDataSize=0]
OK
Time taken: 0.735 seconds
To check that the partitions were created, open the NameNode web UI at localhost:50070 and browse the filesystem (Utilities > Browse the file system).
Go to /user/hive/warehouse; this is where our tables tab01 and tab02 are created.
Inside tab01 we have kv2.txt.
Inside tab02 we have mon=04 and mon=07 as two separate directories, one per partition. The mon=04 directory holds the kv2.txt file and mon=07 holds the kv3.txt file.
Note that a Hive partition is a physical partition (a separate subdirectory on HDFS), not a logical one.
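A quick way to confirm this layout from the command line (output omitted here; these are the same warehouse paths seen in the web UI):

hdfs dfs -ls /user/hive/warehouse/tab02           # lists the partition directories mon=04 and mon=07
hdfs dfs -ls /user/hive/warehouse/tab02/mon=04    # lists the loaded file kv2.txt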
We can read that partition file directly from HDFS, and query a single partition, as follows:
hdadmin@ubuntu:~/hive-0.13.1-cdh5.3.2$ hdfs dfs -cat /user/hive/warehouse/tab02/mon=04/kv2.txt

hive> select count(*) from tab02 where mon='07';
OK
25
Time taken: 34.669 seconds, Fetched: 1 row(s)
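Hive can also list the partitions recorded in the metastore. This was not run in the captured session, but based on the loads above it would report mon=04 and mon=07:

hive> SHOW PARTITIONS tab02;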
When the data is already available in HDFS, we can create an external table. For an internal (managed) table we load the data into Hive's warehouse directory, whereas an external table simply points at data that already lives at an HDFS location.
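As a minimal sketch (the table name tab03_ext, the comma delimiter and the HDFS directory /data/kv_ext are hypothetical, not part of this session), an external table is declared with the EXTERNAL keyword and a LOCATION clause:

CREATE EXTERNAL TABLE tab03_ext (abc INT, xyz STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/kv_ext';    -- Hive reads whatever files already sit in this HDFS directory

Dropping an external table removes only its metadata; the files under LOCATION stay in HDFS.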
Now let us work through another example. Here we have 'student' data; we load it into Hive and run a few operations on it.
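The student.txt file has to be comma-separated to match the ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' clause used below. A sketch of its contents, consistent with the rows visible in the output (one of the five loaded rows is not visible in the captured output, so it is left out here as well):

siva,3
sai,6
rahul,9
siva,8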
hive> create table student (name string, marks int)
    > row format delimited fields terminated by ',';
OK
Time taken: 1.669 seconds
hive> show tables;
OK
students
Time taken: 0.192 seconds, Fetched: 1 row(s)
hive> show databases;
OK
default
Time taken: 0.065 seconds, Fetched: 1 row(s)
hive> load data local inpath 'student.txt' overwrite into table student;
Copying data from file:/home/hdadmin/hive-0.13.1-cdh5.3.2/student.txt
Copying file: file:/home/hdadmin/hive-0.13.1-cdh5.3.2/student.txt
Loading data to table default.students
Table default.students stats: [numFiles=1, numRows=0, totalSize=34, rawDataSize=0]
OK
Time taken: 2.753 seconds
hive> select * from student;
OK
siva 3
sai 6
rahul 9
siva 8
Time taken: 0.563 seconds, Fetched: 5 row(s)
hive> select name from student;
siva
sai
rahul
siv
Time taken: 16.04 seconds, Fetched: 5 row(s)
hive> select * from student where name = 'siva';
OK
siva 3
siva 8
Time taken: 14.417 seconds, Fetched: 2 row(s)
hive> select * from student group by name;
rahul
sai
siva
Time taken: 19.784 seconds, Fetched: 4 row(s)
hive> select name, MAX(marks) from student group by name;
rahul 9
sai 6
siva 8
Time taken: 19.427 seconds, Fetched: 4 row(s)

We can also set Hive properties as follows:

hive> set hive.execution.engine;
hive.execution.engine=mr
hive> set hive.enforce.bucketing;
hive.enforce.bucketing=false
hive> set hive.enforce.bucketing = true;
hive> set hive.enforce.bucketing;
hive.enforce.bucketing=true
hive> set hive.execution.engine = tez;
hive> set hive.execution.engine;
hive.execution.engine=tez
Hive also has built-in functions:
hive> show functions;
abs
acos
and
array
array_contains
ascii
asin
assert_true
atan
avg
base64
between
bin
case
ceil
ceiling
coalesce
collect_list
collect_set
compute_stats
concat
concat_ws
context_ngrams
conv
corr
cos
count
covar_pop
covar_samp
...
We can view a function's description as follows:
hive> describe function count;
OK
count(*) - Returns the total number of retrieved rows, including rows containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression is non-NULL.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.
Time taken: 0.089 seconds, Fetched: 3 row(s)
hive> describe function length;
OK
length(str | binary) - Returns the length of str or number of bytes in binary data
Time taken: 0.103 seconds, Fetched: 1 row(s)
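As a quick sketch (not part of the captured session), these built-ins can be used directly in queries against the student table created earlier:

hive> select name, length(name) from student;      -- length of each student name
hive> select count(DISTINCT name) from student;    -- number of distinct student names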