The Apache Hadoop community has released a new major version, Hadoop 3.0. Its alpha and beta releases give downstream applications and end users a chance to test the platform and provide feedback, which can then be incorporated as the release matures. The new release includes thousands of fixes and improvements over the previous minor release, 2.7.0, which came out about a year earlier. This blog covers the new release of Hadoop and its features.
The new Hadoop version incorporates a number of significant enhancements that benefit Hadoop users. The Apache site has full details of every change and enhancement; this post gives an overview of the ones that matter most to Hadoop users.
Hadoop is a collection of open source projects that together provide a framework for working with big data. Handling big data can be tedious and can take an enormous amount of time. If an error only surfaces in the last iteration step, tracking it down is slow, and building queries becomes harder as a result. Hadoop already gives users a useful platform, and added functionality can make it better still. This article discusses the new functionality found in the system.
Hadoop was first introduced to open source users in 2011, and three releases or updates have followed since then. The Hadoop community has now released Apache Hadoop 3, which is still in the testing phase. The new version brings thousands of fixes and enhancements, along with some major changes.
Oracle ended public updates for JDK 7 in 2015, so in this version all Hadoop JARs are compiled to run on Java 8. Users who want to run Hadoop 3 therefore have to upgrade to Java 8.
Erasure coding is a technique, familiar from advanced RAID setups, for recovering data when a hard disk fails. It stores data durably with far less space overhead than HDFS replication. Users can choose erasure coding in place of replication and get comparable fault tolerance with much lower storage overhead.
Compared with the default three-way replication, erasure coding roughly halves the disk storage required and improves fault tolerance by about 50%. For example, a Reed-Solomon layout with six data blocks and three parity blocks survives the loss of three blocks, where three-way replication survives the loss of two copies. Hadoop customers can save substantially on infrastructure as a result: the same amount of data fits in about half the storage, or the same storage holds about twice the data.
As data volumes keep growing, erasure coding is one of the most important features of this release for many Hadoop users.
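The storage and fault-tolerance trade-off can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes three-way replication is compared against a Reed-Solomon layout with six data blocks and three parity blocks.

```python
def replication_profile(replicas):
    """Cost and fault tolerance of keeping `replicas` full copies of each block."""
    return {
        # Every logical byte is stored `replicas` times on disk.
        "raw_bytes_per_byte": float(replicas),
        # Data survives as long as one copy is left.
        "tolerated_losses": replicas - 1,
    }


def erasure_coding_profile(data_blocks, parity_blocks):
    """Cost and fault tolerance of a Reed-Solomon (data, parity) layout."""
    return {
        # Parity blocks are the only extra storage.
        "raw_bytes_per_byte": (data_blocks + parity_blocks) / data_blocks,
        # Any `parity_blocks` blocks of a stripe can be lost and rebuilt.
        "tolerated_losses": parity_blocks,
    }


rep = replication_profile(3)
ec = erasure_coding_profile(6, 3)

print("replication x3:", rep)
print("RS(6,3):       ", ec)
print("storage ratio EC/replication:",
      ec["raw_bytes_per_byte"] / rep["raw_bytes_per_byte"])
```

Under these assumptions the erasure-coded layout needs 1.5 bytes of raw storage per byte of data instead of 3, which is where the "half the storage" figure comes from.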
A vast amount of Hadoop functionality is controlled via the shell. The shell scripts have been rewritten for Hadoop 3, which fixes a number of bugs and adds many new features.
Hadoop 3 provides new hadoop-client-api and hadoop-client-runtime artifacts that shade Hadoop's dependencies into a single JAR. The hadoop-client-api artifact has compile scope, while hadoop-client-runtime has runtime scope and contains the relocated third-party dependencies from hadoop-client. Users can bundle these JARs with an application and test the complete bundle for version conflicts. In this way, Hadoop's dependencies no longer leak onto the application's classpath.
This version introduces the notion of an ExecutionType, so applications can now request containers of the Opportunistic type. These containers can be dispatched to a NodeManager (NM) for execution even if no resources are available there at that moment. They are queued at the NM and wait for resources to become available before they start. Opportunistic containers have lower priority than the default Guaranteed containers and are preempted when necessary to make room for them. This improves cluster utilization to a great extent.
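The queueing and preemption behavior described above can be sketched as a toy model. This is not YARN code: the class `NodeManagerSketch`, its methods, and the idea of counting resources as fixed "slots" are all assumptions made for illustration, and a preempted container is simply dropped rather than rescheduled.

```python
from collections import deque


class NodeManagerSketch:
    """Toy model of one NodeManager with a fixed number of container slots."""

    def __init__(self, slots):
        self.slots = slots
        self.running = []      # list of (name, execution_type)
        self.queued = deque()  # opportunistic containers waiting for a slot

    def submit(self, name, execution_type):
        # A free slot is used immediately, whatever the execution type.
        if len(self.running) < self.slots:
            self.running.append((name, execution_type))
            return "started"
        # No free slot: opportunistic containers wait in the NM queue.
        if execution_type == "OPPORTUNISTIC":
            self.queued.append(name)
            return "queued"
        # Guaranteed containers preempt a running opportunistic container.
        for i, (victim, etype) in enumerate(self.running):
            if etype == "OPPORTUNISTIC":
                self.running[i] = (name, execution_type)
                return "started after preempting " + victim
        # Simplification: a real scheduler would not send the container here.
        return "rejected"

    def finish(self, name):
        self.running = [c for c in self.running if c[0] != name]
        # Freed slots go to queued opportunistic containers, oldest first.
        while self.queued and len(self.running) < self.slots:
            self.running.append((self.queued.popleft(), "OPPORTUNISTIC"))


nm = NodeManagerSketch(slots=2)
results = [
    nm.submit("g1", "GUARANTEED"),
    nm.submit("o1", "OPPORTUNISTIC"),
    nm.submit("o2", "OPPORTUNISTIC"),
    nm.submit("g2", "GUARANTEED"),
]
nm.finish("g1")
print(results)
print(nm.running)
```

In this run the second opportunistic container waits in the queue, the second guaranteed container displaces the running opportunistic one, and the queued container starts as soon as a slot frees up.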
In previous versions, the default ports of several Hadoop services fell within the Linux ephemeral port range (32768-61000). With that configuration, services sometimes failed to bind to their port at startup because another application had already bound to it. The conflicting ports have now been moved out of this range and given new port numbers.
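A quick way to see why the old defaults were a problem is to test port numbers against the ephemeral range. The sketch below is illustrative: the helper name is hypothetical, and the old and new port pairs are examples believed to match the Hadoop 3 release notes, so they should be verified against the documentation for the version in use.

```python
# Linux ephemeral port range quoted above, inclusive at both ends.
EPHEMERAL_RANGE = range(32768, 61001)


def in_ephemeral_range(port):
    """True if the OS may hand this port to any outgoing connection."""
    return port in EPHEMERAL_RANGE


# Example (old default, new default) pairs; assumed, verify before relying on them.
PORT_MOVES = {
    "NameNode HTTP": (50070, 9870),
    "DataNode HTTP": (50075, 9864),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
}

for service, (old, new) in PORT_MOVES.items():
    print(f"{service}: {old} conflicts={in_ephemeral_range(old)}, "
          f"{new} conflicts={in_ephemeral_range(new)}")
```

Every old default in the table falls inside the range and every new one falls outside it, which is the whole point of the change.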
In previous versions of Hadoop, NameNode high availability supported only one active NameNode and one standby NameNode. With edits replicated to a quorum of three JournalNodes, that architecture could tolerate the failure of any one node in the system, but no more.
Deployments that require a higher degree of fault tolerance can now be handled, because this version allows multiple standby NameNodes. For example, with three NameNodes and five JournalNodes configured, the cluster can tolerate the failure of two nodes rather than one.
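The failure counts follow from simple arithmetic: the JournalNodes need a majority to stay alive, and the NameNodes need at least one survivor. The sketch below works through both configurations; the function names are hypothetical.

```python
def journalnode_failures_tolerated(journal_nodes):
    """A majority of JournalNodes must remain to accept edits."""
    return (journal_nodes - 1) // 2


def namenode_failures_tolerated(name_nodes):
    """At least one NameNode must remain to become (or stay) active."""
    return name_nodes - 1


def cluster_failures_tolerated(name_nodes, journal_nodes):
    """Failures of either kind the cluster is guaranteed to survive."""
    return min(namenode_failures_tolerated(name_nodes),
               journalnode_failures_tolerated(journal_nodes))


old_setup = cluster_failures_tolerated(name_nodes=2, journal_nodes=3)
new_setup = cluster_failures_tolerated(name_nodes=3, journal_nodes=5)
print("2 NameNodes + 3 JournalNodes tolerate", old_setup, "failure(s)")
print("3 NameNodes + 5 JournalNodes tolerate", new_setup, "failure(s)")
```

Adding NameNodes alone is not enough: with three NameNodes but only three JournalNodes, the quorum still limits the cluster to a single tolerated failure.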
This version also adds integration with Microsoft Azure Data Lake and the Aliyun Object Storage System, which can be used as alternative Hadoop-compatible filesystems.
A single DataNode manages multiple disks. During normal write operations the disks fill up evenly, but adding or replacing disks can lead to significant skew within a DataNode. The existing HDFS balancer cannot handle this situation, because it is concerned with inter-DataNode skew, not intra-DataNode skew. Hadoop 3.0 handles it with new intra-DataNode balancing functionality, which is invoked via the hdfs diskbalancer CLI.
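To make the idea of intra-DataNode skew concrete, the sketch below plans data moves that bring every disk on one node to the same utilization. It is a simplified illustration and not the algorithm the disk balancer actually uses; the function name, the disk names, and the gigabyte figures are all made up.

```python
def plan_intra_node_moves(disks):
    """Plan moves that equalize utilization across one node's disks.

    disks maps a disk name to (used_gb, capacity_gb).
    Returns a list of (source, destination, gb_to_move) tuples.
    """
    total_used = sum(used for used, _ in disks.values())
    total_cap = sum(cap for _, cap in disks.values())

    # Surplus is how far a disk sits above its fair share of the data.
    surplus = {
        name: used - total_used * cap / total_cap
        for name, (used, cap) in disks.items()
    }
    donors = [(n, s) for n, s in surplus.items() if s > 0]
    receivers = [[n, -s] for n, s in surplus.items() if s < 0]

    moves = []
    for src, extra in donors:
        for receiver in receivers:
            if extra <= 1e-9:
                break
            if receiver[1] <= 1e-9:
                continue
            amount = min(extra, receiver[1])
            moves.append((src, receiver[0], round(amount, 2)))
            extra -= amount
            receiver[1] -= amount
    return moves


# Two nearly full disks plus a freshly added empty one.
node = {"disk1": (900, 1000), "disk2": (900, 1000), "disk3": (0, 1000)}
moves = plan_intra_node_moves(node)
print(moves)
```

Here the two older disks each hand 300 GB to the new disk, leaving all three at 60% utilization.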
Timeline Service v.2 is an important step toward better scalability and reliability, and it introduces flows and aggregation to improve usability.
Timeline Service v.1 was limited to a single instance of the reader/writer and storage, an architecture that could not scale beyond small clusters. Version 2 uses a distributed writer architecture with a scalable backend, and it separates the writing of data from the reading of data. Distributed collectors are provided, one for each YARN application.
In this new version, YARN applications can also be grouped logically. A series of applications that together complete one logical application can be grouped into a single flow.
MapReduce has gained support for a native implementation of the map output collector, which can improve performance by 30% or more. The NativeMapOutputCollector handles the key-value pairs emitted by the mapper, so that sort, spill, and IFile serialization can all be done in native code.
Daemon heap sizes can now be configured in new ways. Auto-tuning based on the memory size of the host is now possible, and the HADOOP_HEAPSIZE variable has been deprecated. In addition, the heap size no longer needs to be specified both in the task configuration and as a Java option. Existing configurations are not affected by these changes.
Apache Hadoop 3.0 alpha1 brings a large number of changes and is a milestone for the tools that build on it. The features listed above give Hadoop developers a powerful platform. Keep checking for the new changes and improvements that continue to be made to it. Many Hadoop developers and users rely on the platform for complex business solutions, and the features and improvements in Hadoop 3.0 have been well received by users worldwide.