Hadoop is often called the backbone of Big Data analytics. It is made up of several modules, each of which performs a particular task.
Four Major Components of Hadoop:
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- YARN
- MapReduce
It was released in 2005 by the Apache Software Foundation, a non-profit organization that produces open-source software powering a significant part of the Internet. And in case you are wondering where the odd name came from, it was the name of a toy elephant belonging to the child of one of its original creators!
Today, it is the most widely used framework for storing and processing data across "commodity" hardware: relatively inexpensive, off-the-shelf systems linked together, rather than costly, bespoke systems custom-built for the task at hand. In fact, it is used by almost every organization in the Fortune 500.
Today we will discuss Flume, one of the many components of the Apache Hadoop ecosystem that help with data ingestion. Before things get too complicated, we have divided the blog into parts so that it is easier for all our wonderful readers to follow.
Read: Pig Vs Hive: Difference Between Two Key Components of Hadoop Big Data
This Flume tutorial blog has the following parts-
- What is Flume?
- What are the Advantages of Flume?
- What are the Disadvantages of Flume?
- Flume Architecture Tutorial Guide
- Apache Flume Tutorial Guide for Beginners
What is Flume?
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault-tolerant, with tunable reliability mechanisms for failover and recovery scenarios.
YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.
From the description above, it should be clear that Flume first collects, then aggregates, and finally transports large amounts of streaming data, such as log files, social media events, and messages, from a variety of sources (network traffic, emails, messaging systems, etc.) into HDFS. The central idea behind Flume's design is to reliably capture streaming data coming from an assortment of web servers into HDFS. Its framework is simple yet very flexible, built around streaming data flows, and it is highly fault-tolerant, providing mechanisms for error tolerance and failure recovery.
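To make this concrete, here is a minimal single-agent configuration in Flume's standard properties format, following the quick-start pattern from the Flume documentation: a netcat source listening on a local port, feeding a memory channel that is drained by a logger sink. The agent name (a1) and file name are illustrative.

```properties
# example.conf: a single-node Flume agent named a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens for lines of text on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffers events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: logs events (useful for testing)
a1.sinks.k1.type = logger

# Wire the pieces together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Launch with:
#   bin/flume-ng agent --conf conf --conf-file example.conf --name a1
```

Sending text to localhost:44444 (for example with netcat) would then appear as Flume events in the agent's log.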
What are the Advantages of Apache Flume?
Here are some of the advantages you get with Apache Flume.
Read: Apache Storm Interview Questions and Answers: Fresher & Experienced
- Data Storage- Data flowing into any of the centralized stores (such as HDFS or HBase) can be easily stored using Apache Flume.
- Data Mediation- Flume acts as an intermediary between data producers and the centralized stores. It comes into the picture when the rate of incoming data exceeds the rate at which that data can be written to its destination.
- Steady Data Flow- Flume provides a steady flow of data by mediating between the data delivery rate and the write rate.
- Reliable Message Delivery- Flume always ensures reliable message delivery. All Flume transactions are channel-based: two transactions (one sender, one receiver) are maintained for each message.
- Multiple-Source Data Ingestion- Using Apache Flume, one can easily ingest data from numerous servers into the Hadoop system. It helps you ingest online streaming data coming from a variety of sources, such as network traffic, social media activity, emails, messages, and log files, into HDFS.
What are the Disadvantages of Apache Flume?
Truth be told, Flume is an excellent example of a well-thought-out piece of technology, and there are hardly any major disadvantages to using it. However, over the years a few of the following drawbacks have come up-
- Complex Topology- Apache Flume can have a complex topology, which means configuration and maintenance are not easy.
- Scalability and Reliability Issues- In Flume, overall throughput depends on the backing store of the channel, so scalability and reliability have been found to fall short of the mark.
- No Data Replication- Flume does not support data replication.
- Message Duplication- Apache Flume cannot guarantee 100% unique message delivery; many instances of duplicate messages have been reported.
Apache Flume Architecture Tutorial Guide
After singing the praises of this amazing platform, it is time to get into some technical concepts. To understand how Flume works, let us take a look at its components and additional components-
- Event: A byte payload with optional string headers; the unit of data that Flume transports from its point of origin to its final destination.
- Flow: The movement of events from their point of origin to their final destination is called a data flow, or simply streaming data. This is not a rigorous definition and is used only at a high level for description purposes.
- Client: An interface implementation that operates at the point where events originate and delivers them, without loss, to a Flume agent. Clients typically run in the process space of the application whose data they are consuming. For example, the Flume Log4j Appender is a client.
- Agent: A platform-independent process that hosts Flume components such as sources, channels, and sinks, and thus has the ability to receive, store, and forward events to their next-hop destination.
- Source: An interface implementation that can consume events delivered to it through a specific mechanism. For example, an Avro source is a source implementation that can receive Avro events from clients or from other agents in the flow. When a source receives an event, it hands it over to one or more channels.
- Channel: A transient store for events, where events are placed by sources operating within the agent. An event stays in a channel until a sink removes it for further transport. An example is the JDBC channel, which uses a file-system-backed embedded database to persist events until they are removed by a sink. Channels play an important role in ensuring the durability of data flows.
- Sink: An interface implementation that can remove events from a channel and transmit them to the next agent in the flow, or to the event's final destination. Sinks that deliver the event to its final destination are called terminal sinks; the Flume HDFS sink is an example. The Flume Avro sink, by contrast, is an example of a regular sink that can transmit events to other agents running an Avro source.
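These component definitions map directly onto a Flume agent's properties file. The sketch below (agent name, port, and HDFS path are illustrative) wires an Avro source to a durable file channel drained by a terminal HDFS sink:

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source: receives events from Flume clients or upstream agents
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# File channel: persists events to disk until a sink removes them
a1.channels.c1.type = file

# Terminal HDFS sink: the events' final destination
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

Swapping the HDFS sink for an Avro sink pointing at another agent's Avro source is how multi-hop flows are chained together.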
Flume Additional Components
- Interceptors: Interceptors are used to modify or inspect Flume events as they are transferred between source and channel.
- Channel Selectors: These are used to determine which channel an event should be sent to when there are multiple channels. There are two kinds of channel selectors −
- Default channel selectors− Also known as replicating channel selectors, these replicate every event into each channel.
- Multiplexing channel selectors− These decide which channel to send an event to based on a value in that event's header.
- Sink Processors: These are used to invoke a particular sink from a selected group of sinks. They are used to create failover paths for your sinks, or to load-balance events across multiple sinks from a channel.
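These additional components are also declared in the agent's properties file. The sketch below adds a timestamp interceptor to a source, a multiplexing channel selector that routes on a header named state (the header name and mapping values are hypothetical), and a failover sink processor; the channels c1, c2, c3 and sinks k1, k2 are assumed to be defined elsewhere in the same file:

```properties
# Interceptor: stamps each event with an ingest timestamp header
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# Multiplexing channel selector: routes on the "state" header (illustrative)
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CA = c1
a1.sources.r1.selector.mapping.NY = c2
a1.sources.r1.selector.default = c3

# Failover sink processor: k2 takes over if k1 fails (higher priority wins)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
```

Replacing the processor type with load_balance would spread events across the sink group instead of failing over.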
Hadoop Flume Tutorial Guide
The working of Apache Flume boils down to a very basic three-step procedure-
- Flume captures streaming data from various sources, such as social media feeds and web servers.
- It then processes and streamlines this large amount of streaming data. This is the ingestion step.
- The ingested data is then handed over to HDFS or HBase for further processing.
The components discussed in the preceding part of this blog will help you understand the architecture, configuration, and correct operation of Flume.
Read: Big Data Hadoop Tutorial for Beginners
Let us now walk through this procedure in detail-
- A typical flow in Flume NG originates from the Client.
- The Client then carries the Event it has generated to its next destination within Flume.
- This next destination is the Agent; to be more precise, a Source already running within the Agent.
- The Source that receives the Event transports it to one or more Channels. The Channels that receive the Event are ultimately drained by one or more Sinks operating within the same Agent.
- The Sink carries the Event to its terminal destination, usually HDFS or HBase.
Lastly, Flume may sound like a very complicated thing to understand, but as you have seen for yourself, it really is not. You just need to be well-versed in the terminology we discussed and the function of each component. Once you have that right, it is an easy journey from there.