Flume – Architecture:

Agent:

The simplest unit of a Flume deployment is a Flume agent. It is a Java process that runs sources and sinks, which are connected by channels. An agent can receive data from one agent and send it to others, which means one agent can be connected to one or more agents. By chaining agents in this way, we can move data from one location to another.

With many agents, we can receive data from application servers and write that data to HDFS or HBase. This makes it possible to scale the number of servers and the amount of data written to HDFS simply by adding more Flume agents.

Event:

Data in Flume is represented as events, which are simple data structures with a body and a set of headers. The body of an event is a byte array that usually holds the payload Flume is transporting. The headers are represented as a map with string keys and string values.

Headers are not meant to carry the data itself. They are used for routing and for tracking attributes such as the priority or severity of the events being sent. Headers can also be used to attach event IDs or UUIDs to events.
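As a rough illustration of this structure, the sketch below models an event in Python. It is a toy stand-in, not Flume's Java API: the `Event` class, the `make_event` helper and the header names `severity` and `eventId` are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict
import uuid


@dataclass
class Event:
    """Toy stand-in for a Flume event: a byte-array body plus string headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def make_event(payload: str, severity: str) -> Event:
    # The payload goes in the body; metadata used for routing goes in headers.
    headers = {
        "severity": severity,          # hypothetical header name
        "eventId": str(uuid.uuid4()),  # hypothetical header name
    }
    return Event(body=payload.encode("utf-8"), headers=headers)


event = make_event("disk usage at 91%", "WARN")
```

The body stays opaque bytes, while anything a router might need to inspect lives in the string-to-string header map.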

Each Flume agent has three components: source, channel and sink. The source is used for getting events into the Flume agent, while the sink is used for removing the events from the agent and forwarding them to the next agent in the topology, or to HDFS, HBase, Solr, etc.

The channel acts as a buffer that stores the data the source has received until a sink has successfully written it out to the next hop or the eventual destination. Let us look at each component in turn.

Source:

Sources are components that receive data from other applications. Some sources produce data on their own, but these are used only for testing. A source can listen on one or more network ports to receive data, or it can read data from the local filesystem. A source can write to one or more channels, and it can replicate events to all or some of those channels.

Channel:

Channels are components that buffer the data written by sources until sinks read it. Multiple sources can write to the same channel, and multiple sinks can read from it. However, each sink reads from exactly one channel, and each event in a channel is delivered to only one of the sinks reading from it.
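The sharing behaviour can be sketched with a toy in-memory queue. This is only an illustration under simplifying assumptions: the `Channel` class and the source and sink names are hypothetical, and the sinks simply take turns, which is not necessarily how a real agent schedules them.

```python
from collections import deque


class Channel:
    """Toy in-memory channel: a FIFO buffer between sources and sinks."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        # Return the oldest event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


channel = Channel()

# Two sources write to the same channel.
for source_name in ("source-a", "source-b"):
    for i in range(2):
        channel.put(f"{source_name}:event-{i}")

# Two sinks take turns reading from that channel.
received = {"sink-1": [], "sink-2": []}
sink_names = list(received)
turn = 0
event = channel.take()
while event is not None:
    received[sink_names[turn % 2]].append(event)
    turn += 1
    event = channel.take()
```

Because `take` removes the event from the queue, no event ends up in both sinks' lists.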

Sink:

Sinks read events from a channel and remove them from it. They push the events to the next hop or the final destination. Once the data is safely stored at the next hop or destination, the sink informs the channel through a transaction commit, and the events are then deleted from the channel.
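The take-then-commit cycle might look roughly like the sketch below. It is a simplified model, not Flume's transaction API: the `Channel` class, its `commit` and `rollback` methods and the `drain_batch` helper are assumptions made for this example, and only one transaction is open at a time.

```python
from collections import deque


class Channel:
    """Toy channel with a single-transaction take/commit/rollback cycle."""

    def __init__(self, events):
        self._queue = deque(events)
        self._in_flight = []

    def take(self):
        # Hand an event to the sink, but remember it until commit.
        if not self._queue:
            return None
        event = self._queue.popleft()
        self._in_flight.append(event)
        return event

    def commit(self):
        # Delivery succeeded: the taken events are deleted for good.
        self._in_flight.clear()

    def rollback(self):
        # Delivery failed: put the taken events back at the front, in order.
        self._queue.extendleft(reversed(self._in_flight))
        self._in_flight.clear()

    def size(self):
        return len(self._queue)


def drain_batch(channel, deliver, batch_size=2):
    """Take up to batch_size events and deliver them as one transaction."""
    batch = []
    for _ in range(batch_size):
        event = channel.take()
        if event is None:
            break
        batch.append(event)
    try:
        deliver(batch)
    except OSError:
        channel.rollback()
        return False
    channel.commit()
    return True


def failing_deliver(batch):
    raise OSError("next hop unavailable")


delivered = []
channel = Channel(["e1", "e2", "e3"])
first = drain_batch(channel, failing_deliver)    # rolled back, nothing lost
second = drain_batch(channel, delivered.extend)  # committed
```

The first attempt fails and leaves all three events in the channel; the second attempt delivers `e1` and `e2` and only then removes them.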

[Figure: basic architecture of the Flume agent]

Sources write data into channels using channel processors, interceptors and selectors. Every source has its own channel processor, which takes the events handed over by the source and passes them to one or more interceptors. Interceptors read each event and modify or drop it based on some criterion, such as a regular expression. We can have multiple interceptors, which are called in the order in which they are defined; this follows the chain-of-responsibility design pattern. The list of events produced by the interceptor chain is then passed to the channel selector. The selector decides which of the channels attached to the source each event should be written to.
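The interceptor chain and the selector can be sketched as plain functions. This is an illustrative model only: events are represented as dictionaries, and the interceptor functions, the `host` and `severity` headers and the channel names are all hypothetical.

```python
import re


def host_interceptor(event):
    # Modify: tag the event with a (hypothetical) host header.
    event["headers"]["host"] = "app-01"
    return event


def drop_debug_interceptor(event):
    # Drop: returning None removes the event from the flow.
    if re.search(r"\bDEBUG\b", event["body"].decode("utf-8")):
        return None
    return event


def run_interceptors(events, interceptors):
    """Apply interceptors in their defined order; dropped events fall out."""
    for interceptor in interceptors:
        results = (interceptor(ev) for ev in events)
        events = [e for e in results if e is not None]
    return events


def select_channels(event, mapping, default):
    """Pick channel names from the 'severity' header (hypothetical rule)."""
    return mapping.get(event["headers"].get("severity"), default)


events = [
    {"headers": {"severity": "ERROR"}, "body": b"ERROR disk failed"},
    {"headers": {"severity": "DEBUG"}, "body": b"DEBUG heartbeat"},
    {"headers": {"severity": "INFO"}, "body": b"INFO started"},
]
kept = run_interceptors(events, [host_interceptor, drop_debug_interceptor])
routes = [
    select_channels(e, {"ERROR": ["alerts", "archive"]}, ["archive"])
    for e in kept
]
```

Each interceptor sees only what the previous one let through, and the selector runs once per surviving event.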

The selector can apply filtering criteria to determine which channels are required and which are optional. A failure when writing to a required channel causes the processor to throw a ChannelException, which tells the source to retry the event. A failure when writing to an optional channel is ignored. Once all the events are written successfully, the processor indicates success to the source, which can then send an acknowledgement to the system that sent the events and continue to accept more events. This flow is shown in the figure below.

[Figure: events processor]
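The difference between required and optional channels can be sketched as follows. This is a toy model under stated assumptions: `ChannelException` here is a local class named after the exception mentioned above, `ListChannel` and `process_event` are hypothetical, and the sketch leaves out the transaction handling a real processor would need.

```python
class ChannelException(Exception):
    """Local stand-in named after the exception mentioned in the text."""


class ListChannel:
    """Toy bounded channel that rejects writes once it is full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []

    def put(self, event):
        if len(self.events) >= self.capacity:
            raise ChannelException("channel is full")
        self.events.append(event)


def process_event(event, required, optional):
    """Write to every required channel, then best-effort to optional ones."""
    for channel in required:
        # A failure here propagates, so the source knows to retry.
        channel.put(event)
    for channel in optional:
        try:
            channel.put(event)
        except ChannelException:
            pass  # Failures on optional channels are ignored.


primary = ListChannel(capacity=1)
audit = ListChannel(capacity=0)  # always full, used as the optional channel

# Succeeds: the optional channel fails, but that failure is ignored.
process_event("e1", required=[primary], optional=[audit])

# Fails: the required channel is now full, so the source must retry.
try:
    process_event("e2", required=[primary], optional=[audit])
    needs_retry = False
except ChannelException:
    needs_retry = True
```

Only the failure on the required channel reaches the caller, which is the signal a source would use to retry the event.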

To summarize the flow through a Flume agent:

The source receives or produces data and writes it to one or more channels. One or more sinks then read that data from the channel and send it to the next agent or the final destination. How long the data resides in the channel depends on the durability guarantees of the channel.