Generally, in production, Hive is installed on the master machine or on a separate third-party machine where Hive, Pig, and other components are installed. Hive needs a metastore, which is an RDBMS, for storing schema information.
This RDBMS can be a database such as Oracle or MySQL, or an embedded data store (such as the embedded Derby database that Hive uses by default). The metastore holds only metadata, so its size is very small.
Now let us see how data is stored inside Hive internally.
- The user creates a table in Hive with a CREATE TABLE statement; the corresponding metadata, such as the column names and the number of columns, is created in the RDBMS.
- When the user submits a SQL query, Hive consults the metadata in the RDBMS.
- Hive converts the SQL query into a MapReduce job and submits it to the master node.
- When the user loads data, the data is stored in HDFS, i.e. on the slave nodes.
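The steps above can be sketched in HiveQL. This is a minimal illustration, not a definitive setup: the table name `employees`, its columns, and the HDFS path `/user/demo/employees.csv` are hypothetical and assume a comma-delimited text file already present in HDFS.

```sql
-- Hypothetical table: the name, columns, and delimiter are assumptions.
-- Creating it records the schema (column names and types) in the metastore.
CREATE TABLE employees (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Without the LOCAL keyword, the file is moved from its current HDFS
-- location into the table's warehouse directory, which is also in HDFS.
LOAD DATA INPATH '/user/demo/employees.csv' INTO TABLE employees;

-- Show the metadata Hive keeps for the table, including its HDFS location.
DESCRIBE FORMATTED employees;

-- Hive looks up the schema in the metastore and compiles this query
-- into a job that runs on the cluster.
SELECT COUNT(*) FROM employees WHERE salary > 50000;
```

Only the schema goes to the RDBMS; the rows themselves stay as files in HDFS.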
A useful property of Hive is that the data stored in Hive can also be read or processed for other purposes without any third-party tools, because it lives as ordinary files in HDFS.
Generally, we use Apache Flume to capture live data and store it in HDFS, HBase, or another storage system; we can then create a Hive table over that data and run SQL queries to fetch it.
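As a rough sketch of that workflow, an external table can be declared over a directory that Flume writes to. The directory `/flume/events`, the table name `web_events`, and its columns are hypothetical, and the example assumes Flume has been configured to write tab-delimited text files there.

```sql
-- Hypothetical external table over an HDFS directory populated by Flume.
-- EXTERNAL means Hive only records metadata; dropping the table
-- leaves the files in /flume/events untouched.
CREATE EXTERNAL TABLE IF NOT EXISTS web_events (
  event_time STRING,
  host       STRING,
  message    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/flume/events';

-- Query the captured data with ordinary SQL.
SELECT host, COUNT(*) AS event_count
FROM web_events
GROUP BY host;
```

An external table suits this case because the files are produced and owned by another tool, and Hive only needs to read them.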
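In Hive we can create a database and create tables in it, or we can create tables directly and let them be stored in the default database. Because Hive's query planner optimizes the MapReduce jobs it generates, they often perform well compared with hand-written MapReduce code for typical queries, although this depends on the query and the data.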
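The two ways of placing tables can be sketched as follows. The names `sales_db`, `orders`, and `scratch_orders` are hypothetical and used only for illustration.

```sql
-- Create a database and switch to it; tables created afterwards belong to it.
CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;

CREATE TABLE orders (
  order_id INT,
  amount   DOUBLE
);

-- A table can also be placed in the built-in default database,
-- here by qualifying the table name explicitly.
CREATE TABLE default.scratch_orders (
  order_id INT,
  amount   DOUBLE
);

-- List the tables in each database.
SHOW TABLES IN sales_db;
SHOW TABLES IN default;
```

If no `USE` statement has been issued in a session, unqualified table names refer to the default database.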