TABLE:
Now let us see how to create a table in hive. We have seen that the data is stored in HDFS and its metadata is stored in RDBMS. There are mainly 2 types of table i.e. Managed table and External tables.
When we create a table in hive, the default table which is created is managed table which means its Files, metastore are all managed by hive internal process and stored in the warehouse directory. Then if the managed table is dropped then all its data is lost.
In External table the files, metastore are all managed by process outside of hive. They can access data which is shared on other HDFS locations also. So we can use external tables by referring the path of the data stored. So when a table is dropped just its structure is lost in hive but its data will be available on remote location.
Let us take an example,
CREATE TABLE managed_table (name STRING); LOAD DATA INPATH '/user/input.txt' INTO table managed_table;
It will move the file inputa.txt into Hive’s warehouse directory for the managed_table table, which hdfs://user/hive/warehouse/managed_table
So If the table is later dropped, using:
DROP TABLE managed_table;
the table, including its metadata and its data, is deleted.
The location of the external data is specified at table creation time:
CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/external_table'; LOAD DATA INPATH '/user/input.txt' INTO TABLE external_table;
With the EXTERNAL keyword, Hive knows that it is not managing the data, so it doesn’t move it to its warehouse directory. When you drop an external table, Hive will leave the data untouched and only delete the metadata.
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property.
Let us see external table in practical,
Scenario 1
Here we are creating an external table and loading the data into it.
hive>create external table ext_student (name STRING, marks INT) > row format delimited fields terminated by ',' > location '/home/hdadmin'; hive> load data inpath '/data1/student.txt' into table ext_student; Loading data to table default.ext_student Table default.ext_student stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0] OK Time taken: 1.625 seconds hive> select * from ext_student; OK siva 3 sai 6 rahul 9 siva 8 Time taken: 0.44 seconds, Fetched: 5 row(s)
Let us see the location where the external table is created.
hdadmin@ubuntu:~/hive-0.13.1-cdh5.3.2$ hdfs dfs -ls /home/hdadmin Found 1 items -rwxr-xr-x 1 hdadmin supergroup 34 2017-12-10 00:03 /home/hdadmin/student.txt
Now let us delete ‘student.txt’ in same location where we created it.
hdadmin@ubuntu:~/hive-0.13.1-cdh5.3.2$ hdfs dfs -rm /home/hdadmin/student.txt 17/12/11 00:15:15 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes. Deleted /home/hdadmin/student.txt Now let us try to access the data in ext_student table. hive> select * from ext_student; OK Time taken: 1.47 seconds Here we can see that we did not got any error while accessing the table. The table is empty as we have deleted the student.txt file
Now let us try to again load the data into it.
student.txt' into table ext_student; FAILED: SemanticException Line 1:17 Invalid path ''/data1/student.txt'': No files matching path hdfs://localhost:9000/data1/student.txt Now we are getting the error as we don’t have ‘student.txt’ file.
Now try to again put ‘student.txt’ file in /data1.
hdadmin@ubuntu:~$ hdfs dfs -put student.txt /data1 hdadmin@ubuntu:~$ hdfs dfs -ls /data1 Found 1 items -rw-r--r-- 1 hdadmin supergroup 34 2017-12-11 00:18 /data1/student.txt
Now load the data,
hive> load data inpath '/data1/student.txt' into table ext_student; Loading data to table default.ext_student Table default.ext_student stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0] OK Time taken: 1.723 seconds
As we have loaded the data, ‘student.txt’ file is automatically created in the path given by ‘Location’ parameter.
hdadmin@ubuntu:~/hive-0.13.1-cdh5.3.2$ hdfs dfs -ls /home/hdadmin Found 1 items -rwxr-xr-x 1 hdadmin supergroup 34 2017-12-11 00:18 /home/hdadmin/student.txt hive> select * from ext_student; OK siva 3 sai 6 rahul 9 siva 8 Time taken: 1.47 seconds, Fetched: 5 row(s)
Scenario-2:
Now let us drop the external table and see what happens,
hive> show tables; OK ext_student Time taken: 0.225 seconds, Fetched: 2 row(s) hive> drop table ext_student; OK Time taken: 0.779 seconds hive> show tables; OK Time taken: 0.089 seconds, Fetched: 1 row(s)
We have dropped the table but ‘student.txt’ file is present in the path given by “Location” parameter.
hdadmin@ubuntu:~/hive-0.13.1-cdh5.3.2$ hdfs dfs -ls /home/hdadmin Found 1 items -rwxr-xr-x 1 hdadmin supergroup 34 2017-12-11 00:18 /home/hdadmin/student.txt /home/hdadmin/student.txt