Apache Flume is used to stream data into HDFS, HBase, or other storage systems. Systems similar to Flume include Apache Kafka and Facebook's Scribe. These systems act as a buffer between the data producers and the final destination.
Apache Flume writes data to Apache Hadoop and Apache HBase in a reliable and scalable fashion, and the data it delivers can then be processed in any format by tools such as MapReduce, Hive, Pig, and Impala. So let us first see why we need Flume.
One might wonder why we should use Flume at all instead of writing data directly into HDFS or HBase. A Hadoop cluster normally processes so much data that hundreds to thousands of large servers are producing it, and having all of them write such huge volumes directly to HDFS can cause serious problems.
As we studied in HDFS, thousands of files may be written at a time, and the single NameNode must allocate blocks for all of them.
Under such heavy load the servers become stressed, network latency increases, and data can even be lost. Scaling Flume is also easier than scaling HDFS or HBase clusters, so we use Flume to collect the data and send it on to HDFS.
Let’s have a glance at this Flume tutorial!
The simplest unit of a Flume deployment is a Flume agent: a Java process that runs sources and sinks connected by channels. An agent can receive data from another agent and send it onward, which means one agent can be connected to one or more other agents. Through such a chain of agents, we can move data from one location to another.
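As a minimal sketch, the source–channel–sink wiring of a single agent can be expressed in a Flume properties file like the one below (the agent name `agent1` and all component names are illustrative):

```properties
# One agent, one source, one channel, one sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# netcat source: listens on a local TCP port and turns each line into an event
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444

# in-memory channel buffering events between source and sink
agent1.channels.ch1.type = memory

# logger sink: writes events to the agent's log
agent1.sinks.sink1.type = logger

# wire the components together
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1
```

Such an agent is started with the `flume-ng agent` command, passing the configuration file and the agent name.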
Let us take an example and see Flume at work: watch a local directory for new text files, and as files are added, send each line of each file to the console. Imagine that new files are continuously ingested into Flume; here, though, we will add the files ourselves.
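This directory-watching flow can be sketched with Flume's spooling-directory source and a logger sink; the directory path `/tmp/spooldir` and the component names here are illustrative:

```properties
# Watch a spooling directory and log each line of each new file
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# spooldir source: picks up files dropped into the watched directory
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /tmp/spooldir
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

# logger sink prints events, which appear on the console when the
# agent is run with its root logger directed to the console
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
```

Files copied into the spooling directory must be complete and immutable; Flume renames each file once it has been fully ingested.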
External events are sent from an Avro client to an Avro source, which listens on a configured port. The required properties for an Avro source are channels, type (must be avro), bind (hostname or IP address), and port.
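A property-file fragment for an Avro source might look like this (the port number and names are illustrative):

```properties
# Avro source: accepts events from Avro clients over RPC
agent1.sources = avroSrc
agent1.channels = ch1

agent1.sources.avroSrc.type = avro
agent1.sources.avroSrc.bind = 0.0.0.0
agent1.sources.avroSrc.port = 4141
agent1.sources.avroSrc.channels = ch1

agent1.channels.ch1.type = memory
```

Events can then be pushed to it with the bundled Avro client, e.g. `flume-ng avro-client --host localhost --port 4141 --filename events.txt`.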
The HDFS sink writes events from a channel into HDFS. The supported file formats are text and sequence files. We can also apply the Hive-style concepts of partitioning, bucketing, and compression. New files are created automatically based on the size of the data, elapsed time, or number of events. Flume uses the Hadoop jars to communicate with the HDFS cluster.
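A sketch of an HDFS sink configuration showing the three rolling criteria (the path and threshold values are illustrative):

```properties
# HDFS sink: rolls to a new file on time, size, or event-count thresholds
agent1.sinks = hdfsSink
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = ch1
agent1.sinks.hdfsSink.hdfs.path = /flume/events

# write plain text rather than the default sequence-file format
agent1.sinks.hdfsSink.hdfs.fileType = DataStream

# roll a new file every 30 seconds, ...
agent1.sinks.hdfsSink.hdfs.rollInterval = 30
# ... or after 1 MB of data, ...
agent1.sinks.hdfsSink.hdfs.rollSize = 1048576
# ... or after 1000 events, whichever comes first (0 disables a criterion)
agent1.sinks.hdfsSink.hdfs.rollCount = 1000
```

Whichever threshold is reached first triggers the roll; setting a property to 0 disables that criterion.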
All events are stored in the channel and then handed to the sink: the source adds events and the sink removes them. In the memory channel, events are stored in an in-memory queue with a configurable maximum size. It is ideal for flows that need higher throughput and can afford to lose the staged data in the event of an agent failure.
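The memory channel's queue bounds are configured as below (the capacity values are illustrative):

```properties
# Memory channel with a bounded in-memory queue
agent1.channels = ch1
agent1.channels.ch1.type = memory
# maximum number of events held in the channel at once
agent1.channels.ch1.capacity = 10000
# maximum number of events per source/sink transaction
agent1.channels.ch1.transactionCapacity = 1000
```

If durability matters more than throughput, the file channel can be used instead, at the cost of disk I/O per event.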
We can have multiple sources, channels, and sinks in a single Flume configuration file, so a single agent can contain several independent flows. Let us see an example of multiple flows.
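Two independent flows inside one agent can be sketched like this (all names and ports are illustrative):

```properties
# Flow 1: netcat -> memory channel -> logger
# Flow 2: spooldir -> memory channel -> logger
agent1.sources = src1 src2
agent1.channels = ch1 ch2
agent1.sinks = sink1 sink2

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

agent1.sources.src2.type = spooldir
agent1.sources.src2.spoolDir = /tmp/spooldir
agent1.sources.src2.channels = ch2

agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory

agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink2.type = logger
agent1.sinks.sink2.channel = ch2
```

The two flows never share a channel, so they run independently within the same Java process.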
Let us see how to connect two different agents, or two tiers of agents. First, why would we need to connect two agents at all? Suppose we have an agent running on node1 that produces little data, which is written to HDFS; that HDFS data will consist only of the events produced from node1.
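Agents are chained by pairing an Avro sink on the first tier with an Avro source on the second, as in this sketch (the hostname `collector-host`, the port, and the agent names are illustrative):

```properties
# Tier-1 agent: forwards its events to the tier-2 agent over Avro RPC
agent1.sinks = avroSink
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.channel = ch1
agent1.sinks.avroSink.hostname = collector-host
agent1.sinks.avroSink.port = 4141

# Tier-2 agent: receives on a matching Avro source
agent2.sources = avroSrc
agent2.sources.avroSrc.type = avro
agent2.sources.avroSrc.bind = 0.0.0.0
agent2.sources.avroSrc.port = 4141
agent2.sources.avroSrc.channels = ch2
```

The sink's hostname and port must match where the downstream source is listening; the tier-2 agent can then aggregate events from many tier-1 agents before writing to HDFS.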
A sink group allows multiple sinks to be treated as one, for failover or load-balancing purposes. If one tier-2 agent is unavailable, events are sent to another tier-2 agent and then on to the HDFS destination without any problem.
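A failover sink group can be sketched as follows, assuming two Avro sinks `sink1` and `sink2` (illustrative names) that each point at a different tier-2 agent:

```properties
# Failover sink group: sink2 takes over when sink1's tier-2 agent is down
agent1.sinkgroups = sg1
agent1.sinkgroups.sg1.sinks = sink1 sink2
agent1.sinkgroups.sg1.processor.type = failover
# the healthy sink with the highest priority receives the events
agent1.sinkgroups.sg1.processor.priority.sink1 = 10
agent1.sinkgroups.sg1.processor.priority.sink2 = 5
# maximum backoff (ms) before a failed sink is retried
agent1.sinkgroups.sg1.processor.maxpenalty = 10000
```

Setting `processor.type = load_balance` instead distributes events across all the sinks in the group rather than preferring one.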
We can have multiple destinations in Flume by fanning out the event flow. In the diagram below, the source in agent “foo” fans out the flow to three different channels. We will cover fan out in the next topic; it can be either replicating or multiplexing. Each sink has a different destination.
Fan out is the process of delivering events from one source to multiple sinks through multiple channels. There are two fan-out modes: replicating and multiplexing. In a replicating flow, each event is sent to all the configured channels; in a multiplexing flow, each event is sent to only a subset of the channels.
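The mode is chosen with a channel selector on the source. This sketch routes events by the value of an assumed event header named `type` (header and channel names are illustrative):

```properties
# Multiplexing selector: route each event to a channel chosen by
# the value of its "type" header
agent1.sources.src1.channels = ch1 ch2 ch3
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = type
agent1.sources.src1.selector.mapping.metrics = ch1
agent1.sources.src1.selector.mapping.logs = ch2
# events whose header matches no mapping fall through to the default
agent1.sources.src1.selector.default = ch3

# a replicating flow (the default) would instead be configured as:
# agent1.sources.src1.selector.type = replicating
```

With the replicating selector every channel receives a copy of every event, which is useful when the same data must reach several destinations.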
We have studied partitions in Hive; Flume supports partitioning in a similar way, and partitions let us access data faster. In Flume we can partition data by time, and a separate process can then be run to transform each completed partition. We store data in partitions by setting the hdfs.path parameter to include time-based escape sequences.
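A sketch of a time-partitioned HDFS sink path using Flume's escape sequences (the base path and layout are illustrative):

```properties
# Partition HDFS output by date: one directory per year/month/day
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = ch1
agent1.sinks.hdfsSink.hdfs.path = /flume/events/year=%Y/month=%m/day=%d

# the escape sequences need a timestamp header on each event;
# useLocalTimeStamp falls back to the agent's own clock
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
```

The `key=value` directory layout shown here matches Hive's partition naming, so the resulting directories can be registered directly as Hive partitions.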
Here we use two different nodes on which Hadoop and Flume are already installed and running. First, let us configure the Flume configuration file on Node-1.
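One possible Node-1 (tier-1) configuration collects local files and forwards them to the Node-2 agent over Avro; the hostname `node2`, the spool directory, and the port are illustrative:

```properties
# Node-1 agent: spooldir source -> memory channel -> Avro sink to Node-2
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = avroSink

agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /tmp/flume/incoming
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = node2
agent1.sinks.avroSink.port = 4141
agent1.sinks.avroSink.channel = ch1
```

The agent on Node-2 would then pair an Avro source on port 4141 with an HDFS sink to complete the pipeline.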