Friday, December 20, 2019

HDFS commands

How can you debug Hadoop code?

First, we check the list of MapReduce jobs currently running. Next, we check whether any orphaned jobs are running; if so, we need to locate the ResourceManager (RM) logs.
  1. Run:
    ps -ef | grep -i ResourceManager
    Then, look for the log directory in the output. Find the job ID in the displayed list and check whether there is any error message associated with that job.
  2. Based on the RM logs, we identify the worker node that executed the task.
  3. Now, we log in to that node and run:
    ps -ef | grep -i NodeManager
  4. Then, we examine the NodeManager log. The majority of errors come from the user-level logs for each MapReduce job; a sketch of pulling these logs follows below.
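
A minimal sketch of collecting those logs, assuming YARN log aggregation is enabled (the application ID shown is a placeholder):
    mapred job -list                                          # list running MapReduce jobs and their IDs
    ps -ef | grep -i ResourceManager                          # locate the RM process and its log directory
    yarn logs -applicationId application_1576800000000_0001   # dump aggregated container logs for one job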

How to configure Replication Factor in HDFS?

HDFS is configured through the hdfs-site.xml file. Changing the dfs.replication property in hdfs-site.xml changes the default replication factor for all files subsequently placed in HDFS.
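For example, a minimal hdfs-site.xml entry setting the default replication factor to 3:
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>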
We can also modify the replication factor on a per-file basis from the Hadoop FS shell:
[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Alternatively, we can change the replication factor of all the files under a directory:
[training@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir

The replication factor is the number of replicas created for each data block, so that the system does not lose data if a block is deleted or lost.

How to compress the Mapper output without touching the Reducer output?

To achieve this compression, we should set:
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);

What is the difference between Map-side Join and Reduce-side Join?

A Map-side Join is performed in the map phase, before any data reaches the reducers, and it requires the input datasets to be strictly structured (for example, sorted and partitioned identically, or with one dataset small enough to hold in memory). A Reduce-side Join (also called a Repartitioned Join) is simpler, since the input datasets need not be structured; however, it is less efficient because every record has to go through the shuffle and sort phases, which adds network overhead. The sketch below illustrates the map-side pattern.
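A minimal sketch of a map-side join, assuming a small customers file has been shipped to every mapper with job.addCacheFile(...) and that both inputs are comma-separated (all file and field names are illustrative):
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> customers = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Each cached file is symlinked into the task's working directory under its own name
            for (URI cacheFile : context.getCacheFiles()) {
                try (BufferedReader in = new BufferedReader(
                        new FileReader(new Path(cacheFile).getName()))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        String[] parts = line.split(",", 2);  // customerId,customerName
                        customers.put(parts[0], parts[1]);
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",", 2);  // customerId,orderDetails
            String name = customers.get(parts[0]);
            if (name != null) {  // emit only matching records (an inner join)
                context.write(new Text(parts[0]), new Text(name + "\t" + parts[1]));
            }
        }
    }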

How can you transfer data from Hive to HDFS?

By writing an INSERT OVERWRITE DIRECTORY query (the target directory here is illustrative):
hive> insert overwrite directory '/user/hadoop/emp_export' select * from emp;
We can write any query for the data we want to export from Hive to HDFS. The output is stored in part files under the specified HDFS directory. Note that the query replaces the existing contents of the target directory, so it should point at a dedicated export path rather than, say, '/'.
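
Once the query finishes, the part files can be inspected directly from HDFS (000000_0 is the typical name of the first part file):
    [training@localhost ~]$ hdfs dfs -ls /user/hadoop/emp_export
    [training@localhost ~]$ hdfs dfs -cat /user/hadoop/emp_export/000000_0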