AI, Gen AI and Cloud: NameNode and DataNode

Tuesday, September 10, 2019

NameNode and DataNode

In this post, let’s talk about the two important types of nodes in your Hadoop cluster and their functions: NameNode and DataNode.

What is HDFS?

We covered a great deal of information about HDFS in the “HDFS – Why Another Filesystem?” chapter of the Hadoop Starter Kit course. If you are new to Hadoop, we suggest taking the free course.

NameNode

NameNode is the centerpiece of HDFS.
NameNode is also known as the Master.
NameNode stores only the metadata of HDFS: the directory tree of all files in the file system and the mapping that tracks where each file's blocks live across the cluster.
NameNode does not store the actual data or the dataset. The data itself is stored in the DataNodes.
NameNode knows the list of blocks and their locations for any given file in HDFS. With this information, NameNode knows how to construct the file from its blocks.
NameNode is critical to HDFS. When the NameNode is down, the HDFS/Hadoop cluster is inaccessible and considered down.
NameNode is a single point of failure in a Hadoop cluster, unless NameNode High Availability (a standby NameNode, available since Hadoop 2) is configured.
NameNode is usually configured with a lot of memory (RAM), because the block locations are held in main memory.
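
To make the metadata idea concrete, here is a minimal sketch in Python. It is a toy model, not the real HDFS data structures: the class ToyNameNode, the block IDs, the DataNode names and the file path are all made up for illustration. It only shows that the NameNode side holds two lookups (file to blocks, block to locations) and no file contents.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ToyNameNode:
    """Toy stand-in for NameNode metadata (hypothetical, not the HDFS implementation)."""

    # file path -> ordered list of block IDs that make up the file
    file_blocks: Dict[str, List[str]] = field(default_factory=dict)
    # block ID -> names of the DataNodes that hold a replica of that block
    block_locations: Dict[str, Set[str]] = field(default_factory=dict)

    def add_file(self, path: str, blocks: List[Tuple[str, List[str]]]) -> None:
        """Record a file as (block_id, [datanode, ...]) pairs given in file order."""
        self.file_blocks[path] = [block_id for block_id, _ in blocks]
        for block_id, nodes in blocks:
            self.block_locations[block_id] = set(nodes)

    def locate(self, path: str) -> List[Tuple[str, List[str]]]:
        """Return the blocks of a file, in order, with the DataNodes holding each one."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]


nn = ToyNameNode()
nn.add_file(
    "/logs/app.log",
    [("blk_1", ["dn1", "dn2", "dn3"]), ("blk_2", ["dn2", "dn3", "dn4"])],
)

# A client would read blk_1 and then blk_2 from any of the listed DataNodes.
plan = nn.locate("/logs/app.log")
print(plan)
```

Both dictionaries live in memory in this sketch, which mirrors why the NameNode needs plenty of RAM as the number of files and blocks grows.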

DataNode

DataNode is responsible for storing the actual data in HDFS.
DataNode is also known as the Slave.
NameNode and DataNode are in constant communication.
When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for.
When a DataNode is down, it does not normally affect the availability of the data or the cluster, because each block is replicated on other DataNodes. The NameNode arranges for the blocks held by the unavailable DataNode to be re-replicated from the remaining copies.
DataNode is usually configured with a lot of hard disk space, because the actual data is stored in the DataNode.
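
The startup announcement and the re-replication step can be sketched as follows. This is a simplified illustration under stated assumptions, not how HDFS actually schedules replication: the function names block_report and handle_datanode_failure are hypothetical, the target replication factor is assumed to be 3 (the HDFS default), and targets are picked in plain alphabetical order with no rack awareness.

```python
REPLICATION = 3  # assumed target number of replicas per block


def block_report(block_locations, node, block_ids):
    """A DataNode announces itself with the list of blocks it holds."""
    for block_id in block_ids:
        block_locations.setdefault(block_id, set()).add(node)


def handle_datanode_failure(block_locations, live_nodes, failed):
    """Forget the failed node and plan copies for under-replicated blocks.

    Returns a list of (block_id, source_node, target_node) copy instructions.
    """
    live = sorted(n for n in live_nodes if n != failed)
    plan = []
    for block_id in sorted(block_locations):
        holders = block_locations[block_id]
        holders.discard(failed)
        # Live nodes that do not already hold this block are possible targets.
        candidates = [n for n in live if n not in holders]
        while len(holders) < REPLICATION and holders and candidates:
            source = min(holders)       # any surviving replica can be the source
            target = candidates.pop(0)  # naive choice: first available node
            plan.append((block_id, source, target))
            holders.add(target)
    return plan


locations = {}
block_report(locations, "dn1", ["blk_1"])
block_report(locations, "dn2", ["blk_1", "blk_2"])
block_report(locations, "dn3", ["blk_1", "blk_2"])
block_report(locations, "dn4", ["blk_2"])

# dn3 goes down: every block it held is copied to a node that lacks it.
copies = handle_datanode_failure(locations, {"dn1", "dn2", "dn3", "dn4"}, "dn3")
print(copies)
```

In this run both blocks drop to two replicas when dn3 disappears, and each gets one new copy, so the data stays readable throughout.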

Hardware Configuration

Hardware configuration of nodes varies from cluster to cluster and depends on how the cluster is used. In some Hadoop clusters the velocity of data growth is high, and in that case more importance is given to storage capacity. If the SLAs for job executions are important and cannot be missed, then more importance is given to the processing power of the nodes.

Often the term “Commodity Computers” is misunderstood. Commodity computers or nodes do not mean cheap or less powerful hardware. The term simply means inexpensive, widely available computers, and it de-emphasizes the need for specialized hardware.

Here is a sample hardware configuration for NameNode and DataNode.

NameNode Configuration

Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 128 GB
Disk: 6 x 1TB SATA
Network: 10 Gigabit Ethernet

DataNode Configuration

Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 64 GB
Disk: 12-24 x 1TB SATA
Network: 10 Gigabit Ethernet
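
To get a feel for what the DataNode disk numbers above mean in practice, here is a rough capacity calculation. It is a back-of-the-envelope sketch only: the 10-node cluster size is a made-up example, the replication factor of 3 is assumed (it is the HDFS default), and the helper usable_tb is hypothetical. It ignores space reserved for the operating system, temporary data and compression.

```python
def usable_tb(nodes, disks_per_node, tb_per_disk=1.0, replication=3):
    """Approximate HDFS capacity in TB once every block is stored `replication` times."""
    raw_tb = nodes * disks_per_node * tb_per_disk
    return raw_tb / replication


# Hypothetical 10-node cluster using the 12-disk and 24-disk DataNode options.
low = usable_tb(nodes=10, disks_per_node=12)
high = usable_tb(nodes=10, disks_per_node=24)
print(low, high)  # 40.0 80.0
```

So 120 TB to 240 TB of raw disk comes out to roughly 40 TB to 80 TB of data under these assumptions, which is worth keeping in mind when sizing for fast data growth.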
