Hive Practicals

First, start the Hadoop services, then launch the Hive shell:

hdadmin@ubuntu:~/hive-0.13.1-cdh5.3.2$ bin/hive

Logging initialized using configuration in jar:file:/home/hdadmin/hive-0.13.1-cdh5.3.2/lib/hive-common-0.13.1-cdh5.3.2.jar!/hive-log4j.properties

hive> show tables;

OK

Time taken: 1.555 seconds

hive> CREATE TABLE tab01 (abc INT, xyz STRING);  

OK

Time taken: 1.047 seconds

hive> show tables;
OK
tab01

Time taken: 0.057 seconds, Fetched: 1 row(s)

We have created a table; next we load the file ‘kv2.txt’ into it. This version of Hive does not support row-by-row inserts, so we use the LOAD statement instead. Here ‘LOCAL’ means the file is read from the local filesystem rather than HDFS. The examples folder is located inside the Hive installation directory.

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE tab01;

Copying data from file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv2.txt

Copying file: file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv2.txt

Loading data to table default.tab01

Table default.tab01 stats: [numFiles=1, numRows=0, totalSize=5791, rawDataSize=0]

OK

Time taken: 3.654 seconds
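For comparison, if the file already resides in HDFS, the LOCAL keyword is dropped. A sketch (the HDFS path here is hypothetical):

```sql
-- Load a file that is already in HDFS (no LOCAL keyword).
-- Note: without LOCAL, Hive moves the source file into the
-- table's warehouse directory rather than copying it.
LOAD DATA INPATH '/user/hdadmin/input/kv2.txt' OVERWRITE INTO TABLE tab01;
```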

When we type

select * from tab01;

Hive does not submit a MapReduce job; it simply fetches the rows directly from the table’s files.

When we run an operation such as

select count(*) from tab01;

Hive submits a MapReduce job, because the query involves a computation (aggregation) rather than a plain fetch.

hive> select count(*) from tab01;

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1512276014959_0001, Tracking URL = http://ubuntu:8088/proxy/application_1512276014959_0001/

Kill Command = /home/hdadmin/hadoop-2.5.0-cdh5.3.2/bin/hadoop job  -kill job_1512276014959_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2017-12-02 21:54:54,891 Stage-1 map = 0%,  reduce = 0%

2017-12-02 21:55:13,935 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.26 sec

2017-12-02 21:55:27,654 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.1 sec

MapReduce Total cumulative CPU time: 4 seconds 100 msec

Ended Job = job_1512276014959_0001

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.1 sec   HDFS Read: 6000 HDFS Write: 4 SUCCESS

Total MapReduce CPU Time Spent: 4 seconds 100 msec

OK

500

Time taken: 56.123 seconds, Fetched: 1 row(s)
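To see why an aggregation needs a MapReduce stage while a plain SELECT does not, we can ask Hive for the query plan with EXPLAIN:

```sql
-- Prints the stage plan: count(*) produces a map/reduce stage,
-- while a plain SELECT * compiles to a simple fetch task.
EXPLAIN SELECT COUNT(*) FROM tab01;
```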

Now let us create a partitioned table and load data into it.

hive> CREATE TABLE tab02 (foo INT, bar STRING) PARTITIONED BY (mon STRING);

OK

Time taken: 0.36 seconds

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE tab02 PARTITION (mon='04');

Copying data from file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv2.txt

Copying file: file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv2.txt

Loading data to table default.tab02 partition (mon=04)

Partition default.tab02{mon=04} stats: [numFiles=1, numRows=0, totalSize=5791, rawDataSize=0]

OK

Time taken: 1.506 seconds

hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE tab02 PARTITION (mon='07');

Copying data from file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv3.txt

Copying file: file:/home/hdadmin/hive-0.13.1-cdh5.3.2/examples/files/kv3.txt

Loading data to table default.tab02 partition (mon=07)

Partition default.tab02{mon=07} stats: [numFiles=1, numRows=0, totalSize=216, rawDataSize=0]
OK

Time taken: 0.735 seconds

To verify that our partitions were created, open the NameNode web UI at localhost:50070 and browse the filesystem under Utilities.

Go to /user/hive/warehouse. Here our tables, tab01 and tab02, are stored.

In tab01, we have kv2.txt.

In tab02, we have mon=04 and mon=07 as two separate directories, one per partition. In mon=04 we have the kv2.txt file, and in mon=07 we have the kv3.txt file.

So a Hive partition is a physical partition (a separate directory on HDFS), not a logical one.

We can read that file directly from HDFS:

hdadmin@ubuntu:~/hive-0.13.1-cdh5.3.2$ hdfs dfs -cat /user/hive/warehouse/tab02/mon=04/kv2.txt

hive> select count(*) from tab02 where mon='07';

OK

25

Time taken: 34.669 seconds, Fetched: 1 row(s)
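Because each partition is its own directory, the WHERE mon='07' filter only scans that one directory. We can also list the partitions Hive knows about, and add or drop one manually; this is standard HiveQL, though the month value '10' below is just an illustration:

```sql
SHOW PARTITIONS tab02;            -- lists mon=04 and mon=07

-- Register an empty partition (creates the mon=10 directory):
ALTER TABLE tab02 ADD PARTITION (mon='10');

-- Remove it again (for a managed table this deletes the directory):
ALTER TABLE tab02 DROP PARTITION (mon='10');
```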

When the data is already available inside HDFS, we can create an external table. For an internal (managed) table we load the data manually, whereas an external table points at data that already exists in HDFS.
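A minimal external-table sketch, assuming the data file has already been copied to a hypothetical HDFS directory /user/hdadmin/studentdata:

```sql
-- EXTERNAL + LOCATION: Hive only records metadata; the files stay
-- where they are, and DROP TABLE does not delete them.
CREATE EXTERNAL TABLE student_ext (name STRING, marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hdadmin/studentdata';
```

Dropping an external table removes only the metadata, which is why it is the safer choice for data shared with other tools.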

Now let us take some “student” data, load it into Hive, and run a few operations. Note that the table below is created as a regular managed table; a true external table would use the EXTERNAL keyword.

hive> create table student (name string, marks int)

> row format delimited fields terminated by ',';

OK

Time taken: 1.669 seconds

hive> show tables;

OK

student

Time taken: 0.192 seconds, Fetched: 1 row(s)

hive> show databases;

OK

default

Time taken: 0.065 seconds, Fetched: 1 row(s)

hive> load data local inpath 'student.txt' overwrite into table student;

Copying data from file:/home/hdadmin/hive-0.13.1-cdh5.3.2/student.txt

Copying file: file:/home/hdadmin/hive-0.13.1-cdh5.3.2/student.txt

Loading data to table default.student

Table default.student stats: [numFiles=1, numRows=0, totalSize=34, rawDataSize=0]

OK

Time taken: 2.753 seconds

hive> select * from student;

OK

siva     3

sai       6

rahul  9

siva     8

Time taken: 0.563 seconds, Fetched: 5 row(s)

hive> select name from student;

siva

sai

rahul

siva

Time taken: 16.04 seconds, Fetched: 5 row(s)
hive> select * from student where name = 'siva';
OK
siva     3

siva     8

Time taken: 14.417 seconds, Fetched: 2 row(s)

hive> select name from student group by name;

rahul

sai

siva

Time taken: 19.784 seconds, Fetched: 4 row(s)

hive> select name, MAX(marks) from student group by name;

rahul  9

sai       6

siva     8

Time taken: 19.427 seconds, Fetched: 4 row(s)

We can also inspect and set Hive properties:

hive> set hive.execution.engine;

hive.execution.engine=mr

hive> set hive.enforce.bucketing;

hive.enforce.bucketing=false

hive> set hive.enforce.bucketing = true;
hive> set hive.enforce.bucketing;

hive.enforce.bucketing=true

hive> set hive.execution.engine = tez;

hive> set hive.execution.engine;
hive.execution.engine=tez
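The hive.enforce.bucketing setting matters when inserting into a bucketed table. A sketch of a bucketed version of the student table (the table name and bucket count are illustrative):

```sql
-- Rows are hashed on `name` into 4 bucket files.
CREATE TABLE student_bucketed (name STRING, marks INT)
CLUSTERED BY (name) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- With hive.enforce.bucketing=true, Hive sets the reducer count so
-- that this insert actually produces 4 bucket files:
INSERT OVERWRITE TABLE student_bucketed
SELECT name, marks FROM student;
```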

Hive also provides built-in functions, which we can list:

hive> show functions;

abs

acos

and

array

array_contains

ascii

asin

assert_true

atan

avg

base64

between

bin

case

ceil

ceiling

coalesce

collect_list

collect_set

compute_stats

concat

concat_ws

context_ngrams

conv

corr

cos

count

covar_pop

covar_samp

… (list truncated)

We can see a function’s definition as follows:

hive> describe function count;

OK

count(*) - Returns the total number of retrieved rows, including rows containing NULL values.

count(expr) - Returns the number of rows for which the supplied expression is non-NULL.

count(DISTINCT expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.

Time taken: 0.089 seconds, Fetched: 3 row(s)

hive> describe function length;

OK

length(str | binary) - Returns the length of str or number of bytes in binary data

Time taken: 0.103 seconds, Fetched: 1 row(s)
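DESCRIBE FUNCTION EXTENDED additionally prints a usage example, and the built-in functions can of course be used directly in queries, for example on the student table created earlier:

```sql
DESCRIBE FUNCTION EXTENDED length;

-- Apply a built-in function in a query:
SELECT name, length(name) FROM student;
```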