Hive is a data warehousing framework built on top of Hadoop. It was developed at Facebook to analyse the large amounts of data arriving every day. After trying a few other storage systems, the Facebook team ultimately chose Hadoop as the storage layer for Hive since it is cost-effective and scalable.
By building Hive on Hadoop, Hive inherits Hadoop's properties such as scalability, reliability and replication. Hive was created mainly for people who have strong SQL skills.
However, it is not the ideal tool for building complex big data applications; it is mainly used to analyse huge amounts of data easily using SQL-like queries. We can run familiar SQL statements on Hive, such as CREATE TABLE, CREATE VIEW and ALTER TABLE, along with the other DDL and DML operations.
Some features of Hive: it supports only structured data; we cannot process unstructured or semi-structured data directly in Hive. We need to create a schema in Hive just as we do in an RDBMS. Hive does not store the actual data itself; the data lives in HDFS, stored as TextFile, SequenceFile, RCFile (RCFile stores data in a columnar way) and so on.
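As a sketch, the storage format is chosen with the STORED AS clause at table creation (the table and column names below are illustrative):

```sql
-- Plain text storage (the default format)
CREATE TABLE logs_text (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Columnar storage with RCFile
CREATE TABLE logs_rc (id INT, msg STRING)
STORED AS RCFILE;
```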
Let us walk through a sample execution flow in Hive:
- Create a table in Hive (define the schema).
- Load the data into Hive from the local filesystem or HDFS.
- Execute an HQL query against it.
- Hive converts the SQL query into a MapReduce job.
- Finally, we get the output.
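The steps above can be sketched in HiveQL (the table name and file path here are illustrative):

```sql
-- 1. Create the table (define the schema)
CREATE TABLE employees (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 2. Load data from the local filesystem (drop LOCAL to load from HDFS)
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;

-- 3. Run an HQL query; Hive compiles it into a MapReduce job
SELECT name, COUNT(*) FROM employees GROUP BY name;
```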
Note: the MapReduce job generated by Hive is quite different from a hand-written MapReduce job. The job Hive generates is heavily optimized, which lets us analyse very large amounts of data, even petabytes.
Let's look at the basic points this Hive tutorial needs to cover!
Hive stores its table schemas, i.e. its metadata, in the Metastore. The Metastore is a database that stores only Hive metadata. We will look at it in detail in the next sections. Generally, to install Hive we first need to install a recent version of Hadoop, and on top of it we install Hive.
In production, Hive is generally installed on the master machine or on a separate machine where Hive, Pig and other components are installed. Hive needs a metastore, which is an RDBMS, for storing schema information. It can be any database, such as Oracle, MySQL or an embedded data store. We store only metadata in the metastore, and since it holds nothing but metadata its size will be very small.
The Metastore is the central repository of Hive metadata. It is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration.
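As an illustration, the backing store can be switched from the embedded Derby instance to an external database in hive-site.xml; the MySQL host and database name below are assumed example values, not required ones:

```xml
<!-- hive-site.xml: point the metastore at an external MySQL database
     (illustrative values; the default is an embedded Derby instance) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://dbhost:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```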
In a traditional RDBMS, a table's schema is checked when we load the data. If the data being loaded does not match the schema, it is rejected. This is called schema on write, meaning the data is checked against the schema when it is written into the database. Hive takes the opposite approach, schema on read: the data is validated only when a query reads it. Let us take an example and look into this.
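A minimal contrast (table names illustrative): in an RDBMS the insert below fails at write time, whereas Hive's LOAD only copies the file into place and defers checking until the data is read:

```sql
-- RDBMS (schema on write): rejected immediately, 'abc' is not an INT
INSERT INTO users (id) VALUES ('abc');

-- Hive (schema on read): the load succeeds regardless of file contents;
-- fields that cannot be parsed simply come back as NULL when queried
LOAD DATA LOCAL INPATH '/tmp/users.txt' INTO TABLE users;
SELECT * FROM users;
```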
SerDe stands for Serializer and Deserializer. Hive uses a SerDe and a FileFormat to read and write table rows, so the main use of the SerDe interface is IO. A SerDe allows Hive to read data from a table and write it back to HDFS in any custom format. If we have loosely structured data, we can use the RegEx SerDe, which tells Hive how to parse each record.
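A sketch of the RegEx SerDe in use, assuming the built-in `org.apache.hadoop.hive.serde2.RegexSerDe` class (the regex, table and column names are illustrative):

```sql
-- Parse lines like "127.0.0.1 GET /index.html" into three columns:
-- each capturing group in input.regex maps to one column in order
CREATE TABLE access_log (host STRING, method STRING, path STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) (\\S+) (\\S+)"
)
STORED AS TEXTFILE;
```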
HiveQL is the language used for writing programs in Hive; it borrows from several SQL dialects. Here we use the Hive shell for writing queries. Now let us look at some commands to understand HiveQL. We can start the Hive shell by running the hive executable (bin/hive), and `hive --help` lists the available options.
Partitioning means dividing a table based on the values of a column. Partitions are defined at table creation time using the PARTITIONED BY clause, and they let us access data easily by that column. When we partition a table, a new directory is created for each distinct value of the partition column(s). For example, if we partition a table by date, the records for the same date are stored in one partition, so we can retrieve any day's data easily just by looking at the date.
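A sketch of a date-partitioned table (table name, file path and dates are illustrative):

```sql
-- Each distinct dt value gets its own directory, e.g. .../dt=2015-01-01/
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (dt STRING);

-- Load a file into one specific partition
LOAD DATA LOCAL INPATH '/tmp/sales_20150101.csv'
INTO TABLE sales PARTITION (dt = '2015-01-01');

-- The predicate on dt lets Hive skip every other partition's directory
SELECT SUM(amount) FROM sales WHERE dt = '2015-01-01';
```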
Bucketing is another way of dividing data sets into more manageable parts. Clustering, aka bucketing, results in a fixed number of files, since we specify the number of buckets. Hive computes a hash of the bucketing column and assigns each record to a bucket. Physically, each bucket is just a file in the table directory. Bucketing can be combined with partitioning or used on its own.
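Bucketing can be sketched as follows (the table names and bucket count are illustrative; `users_raw` is an assumed source table). On older Hive versions, setting `hive.enforce.bucketing` asks Hive to produce one file per bucket on insert:

```sql
-- 4 buckets: each row goes to bucket hash(user_id) mod 4
CREATE TABLE users_bucketed (user_id INT, name STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE users_bucketed
SELECT user_id, name FROM users_raw;
```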
First let us start the Hadoop services, and then start the Hive shell. Here we create a table and load the file 'kv2.txt' into it. In early Hive versions there is no row-level INSERT, so we use the LOAD statement instead; 'LOCAL' means the file is read from the local filesystem. The examples folder is present inside the Hive directory.
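The session described above might look like this (the table name follows the Hive Getting Started guide; the relative path assumes the shell was started from the Hive directory):

```sql
CREATE TABLE pokes (foo INT, bar STRING);

-- LOCAL: read kv2.txt from the local filesystem rather than HDFS;
-- OVERWRITE replaces any data already in the table
LOAD DATA LOCAL INPATH './examples/files/kv2.txt'
OVERWRITE INTO TABLE pokes;

SELECT COUNT(*) FROM pokes;
```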