The Hadoop framework stores and processes big data in a distributed environment. Today, Big Data and Hadoop are two of the most widely used terms among Big Data users. Hadoop has two main components: HDFS and YARN. HDFS stores files in a distributed environment, and YARN manages resources.
Hadoop uses clusters to store and process data. Because each cluster has a finite capacity, its use for storing and accessing data must be planned. This article discusses how to plan Hadoop data node capacity according to the capacity of the nodes and the requirements of the workload.
Hadoop Cluster Capacity Planning
The ability to accommodate change is one of the foremost requirements of such a system, and Hadoop meets it. If you overestimate your storage requirement, you can scale the cluster down. If you need more storage on a limited budget, you can start processing data with a small cluster and add nodes as the data sets grow.
While planning Hadoop cluster capacity, you should account for data redundancy. Hadoop clusters replicate data to protect against data loss, so the capacity plan must include the storage used by the replicas. For example, if you estimate 5 TB of data and choose four replicas, a total of 20 TB of storage is required to hold the 5 TB of data along with its replicas.
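The replication arithmetic above can be sketched in a few lines of Python. This is only an illustration; the function name `storage_with_replicas` is hypothetical, and it assumes the replica count means the total number of copies kept.

```python
def storage_with_replicas(data_tb, replication_factor):
    """Return the storage (in TB) needed to keep `replication_factor` copies of the data.

    Assumption: the replication factor counts every stored copy, including the first.
    """
    if data_tb < 0 or replication_factor < 1:
        raise ValueError("data_tb must be >= 0 and replication_factor >= 1")
    return data_tb * replication_factor


# 5 TB of data kept as four copies, as in the example above.
print(storage_with_replicas(5, 4), "TB")
```

Changing the second argument shows how quickly the storage requirement grows with the number of replicas.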
Overhead and compression must also be taken into account when planning Hadoop cluster capacity. Compression may be the more important of the two, because a compression technique is usually applied when data is stored in the cluster. It is good practice to know which types of data will be stored, since some types benefit greatly from compression while others do not.
Scientific data, already-compressed media, and Docker containers compress very little, so compression is not useful for them. Text-rich data, on the other hand, compresses well, so compression can be beneficial for it.
Factors that affect Hadoop Cluster Capacity Planning
The main factors to consider when planning Hadoop cluster capacity are:
- Number of machines
- Machine specifications
- Volume of data
- Data retention policy
- Type of workload
- Data storage mechanism
For a few other factors, it is good to make certain assumptions. Capacity is planned for both data nodes and name nodes; this article estimates capacity for data nodes only.
Data Node Capacity Planning
Cluster or data node capacity planning is basically a top-down approach, for which you must know the following:
- Number of nodes needed
- CPU capacity of each node
- Memory capacity of each node
Let's do a quick calculation to estimate the data node capacity. Hadoop usually replicates data three ways, so three times the storage is required, along with temporary storage. Suppose we have 50 TB of data to store and compression reduces it by 60%: 50 - (50 * 60%) = 20 TB. Multiplying by 3 for replication gives 3 * 20 = 60 TB. In this calculation disk usage is kept at 70% of capacity, so 60 TB = x * 70% and x = 60 / 70% ≈ 85.7 TB is the total required storage capacity.
To find the total number of nodes required, divide this storage by the disk capacity of each node. For example, if each node has 8 HDDs of 1 TB each, along with multi-core processors, the number of required nodes is 85.7 TB / (8 * 1 TB) ≈ 10.7, which rounds up to 11 nodes.
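The storage and node-count estimate can be written out as a small Python sketch. The function and parameter names are hypothetical, and it assumes that "60% compression" means the data shrinks by 60% and that each node holds 8 disks of 1 TB. With those inputs, 50 TB of data needs about 85.7 TB of raw storage (60 / 0.7), which comes to 11 data nodes.

```python
import math


def required_raw_storage_tb(data_tb, compression_savings, replication_factor,
                            max_disk_utilization):
    """Estimate raw disk capacity (TB) for the cluster.

    compression_savings: fraction of the data removed by compression (0.60 = 60%).
    max_disk_utilization: fraction of each disk allowed to fill up (0.70 = 70%).
    """
    compressed_tb = data_tb * (1 - compression_savings)
    replicated_tb = compressed_tb * replication_factor
    return replicated_tb / max_disk_utilization


def required_data_nodes(raw_storage_tb, disks_per_node, disk_size_tb):
    """Round up to whole nodes, since a fraction of a node cannot be deployed."""
    return math.ceil(raw_storage_tb / (disks_per_node * disk_size_tb))


raw_tb = required_raw_storage_tb(50, 0.60, 3, 0.70)
nodes = required_data_nodes(raw_tb, disks_per_node=8, disk_size_tb=1)
print(f"raw storage: {raw_tb:.1f} TB, data nodes: {nodes}")
```

Rounding up rather than to the nearest integer is a deliberate choice here: rounding down would leave the cluster short of the planned capacity.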
Next, we have to calculate the number of tasks per node. Usually one core is counted per task, but if the jobs are not very heavy, the number of tasks can be greater than the number of cores. For example, with 8 cores and jobs that use approximately 75% of a CPU, suppose there are 10 free slots in total; we can then assign 7 slots to map tasks (maxMapTasks = 7) and the remaining 3 slots to reduce tasks (maxReduceTasks = 3).
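One way to arrive at a slot count like the one in this example is to divide the number of cores by the CPU share each task uses. The article only supposes 10 slots, so this rule is an assumption, and the helper names below are hypothetical. A minimal Python sketch:

```python
import math


def estimate_task_slots(cores, cpu_use_per_task):
    """Assumed rule of thumb: each task uses a fraction of one core.

    With 8 cores and tasks using about 75% of a core, 8 / 0.75 is about 10.7,
    which is rounded down to 10 whole slots.
    """
    return math.floor(cores / cpu_use_per_task)


def split_slots(total_slots, map_slots):
    """Split the slots into (map slots, reduce slots)."""
    if not 0 <= map_slots <= total_slots:
        raise ValueError("map_slots must be between 0 and total_slots")
    return map_slots, total_slots - map_slots


slots = estimate_task_slots(cores=8, cpu_use_per_task=0.75)
max_map_tasks, max_reduce_tasks = split_slots(slots, map_slots=7)
print(slots, max_map_tasks, max_reduce_tasks)
```

The 7 to 3 split between map and reduce slots is taken from the example; a different workload mix could justify a different ratio.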
The memory assigned to these tasks can also be calculated. Here, the task tracker and data node take up about 1 GB of RAM, and the OS takes around 2 GB. So, if 24 GB of memory is available in total, 24 - 1 - 2 = 21 GB is left for our 10 tasks, and we can assign about 2.1 GB to each task.
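The memory budget follows the same pattern: subtract what the daemons and the OS need, then divide the rest among the tasks. The sketch below uses a hypothetical function name and assumes the 1 GB covers the task tracker and data node together. With 24 GB in total, that leaves 21 GB, or 2.1 GB per task for 10 tasks.

```python
def memory_per_task_gb(total_gb, daemons_gb, os_gb, tasks):
    """Divide the memory left after daemons and OS evenly among the tasks.

    Assumption: daemons_gb is the combined footprint of the task tracker
    and data node. If each daemon needs 1 GB, pass daemons_gb=2 instead.
    """
    available_gb = total_gb - daemons_gb - os_gb
    if available_gb <= 0 or tasks < 1:
        raise ValueError("no memory left for tasks, or no tasks to run")
    return available_gb / tasks


per_task = memory_per_task_gb(total_gb=24, daemons_gb=1, os_gb=2, tasks=10)
print(f"{per_task:.1f} GB per task")
```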
Throughout this article, we have seen how data node capacity can be calculated for Hadoop clusters. The capacity of the name nodes can be calculated in a similar way. It is important to calculate these capacities in advance, using proper formulas or techniques, so that processing runs as expected. Following the step-by-step calculation above will get the job done.