(continued from part 1, Distributed File Systems - Scaling)
(continued from part 2, Distributed File Systems - Recovery Process)
A client in Hadoop refers to the interface used to communicate with the Hadoop file system. The basic filesystem client, hdfs dfs, is used to connect to a Hadoop file system and perform basic file-related tasks. It uses the ClientProtocol to communicate with the NameNode daemon, and connects directly to DataNodes to read and write block data.
Before we get into using the HDFS Client, I’m going to assume you’re comfortable using the Unix command line interface (or CLI, for short).
To follow along, it’d be a good idea to install a Hadoop sandbox in a virtual machine, as most of us probably don’t have a personal Hadoop cluster to toy around with.
In this part, we'll cover:
Useful commands to get information from the NameNode and change meta-information.
How to read and write data with the HDFS client.
How to transfer files between local and distributed storage.
How to change the replication factor, update permissions to access data, and get a general report about files and blocks in HDFS.
If you ever need help with the HDFS client API, use the built-in help: hdfs dfs -help describes every command, and hdfs dfs -usage <command> prints a short synopsis. Breaking the command down: hdfs means you're working with the hdfs client, and dfs means you're working with the distributed file system API.
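For example (the command you ask about is up to you; the exact output formatting can vary between Hadoop versions):

$ hdfs dfs -help ls       # full description of the -ls command
$ hdfs dfs -usage mkdir   # one-line synopsis of the -mkdir command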
Let's do a read-only request to the NameNode.
-ls lets you see directory contents or file information.
-R gives you recursive output.
-h shows file sizes in a human-readable format.
Note: these file sizes don't include replicas. To see the space consumed by all replicas, use -du.
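As a quick sketch (the path /user/alice is just a placeholder; use a directory that exists in your cluster):

$ hdfs dfs -ls -R -h /user/alice   # recursive listing with human-readable file sizes
$ hdfs dfs -du -h /user/alice      # disk usage; recent Hadoop versions also show the space consumed by all replicas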
Now let's modify the structure of our file system by creating a directory called 'deep' with -mkdir.
If you try to create a deeply nested folder and the parent folder doesn't exist, you will get an error back. To create the parent folders automatically, use -p.
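A minimal sketch, with arbitrary directory names:

$ hdfs dfs -mkdir deep/nested/path      # fails if 'deep' or 'deep/nested' doesn't exist yet
$ hdfs dfs -mkdir -p deep/nested/path   # creates the missing parent folders automatically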
Alright, let's remove the 'deep' folder with -rm. Remember to use -r to delete folders recursively.
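For example, assuming the 'deep' folder created above:

$ hdfs dfs -rm -r deep   # removes the folder and everything underneath it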
In addition to folders, you can create empty files with the touchz utility. There's a difference between using touch in the local file system and touchz in the distributed file system. With touch, you update a file's meta-information (i.e. access and modification times). With touchz, you create a file of zero length; that's where the z comes from.
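For example:

$ hdfs dfs -touchz file.txt   # creates an empty (zero-length) file in HDFS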
After creating 'file.txt', let's try to move it to another location with a different name. -mv can be used the same way as in the local file system to manipulate files and folders.
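A sketch with placeholder paths:

$ hdfs dfs -mv file.txt renamed.txt        # rename in place
$ hdfs dfs -mv renamed.txt docs/file.txt   # move into another folder (here 'docs' must already exist)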
So up until now, we've been communicating with the NameNode.
Let's move on and discover how to communicate with DataNodes.
Use -put to transfer a file from the local file system into HDFS.
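For example (local and HDFS paths are placeholders):

$ hdfs dfs -put localfile.txt /user/alice/   # copy a local file into an HDFS directory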
How do we read the content of a remote file? In the local file system, we use cat, head, and tail to bring the content of a file to the screen. In HDFS, you can use -cat to print the whole file to the screen. To see only the first lines of a file, use piping, as there is no head utility in HDFS.
To see the end of a file, you can use the -tail utility. Note that the behavior of the local tail utility and the distributed tail utility is different: local file system commands are focused on text files, whereas in a distributed file system we work with binary data, so the HDFS -tail command prints the last kilobyte of a file to the screen.
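For example, with a placeholder file name:

$ hdfs dfs -cat file.txt               # print the whole file
$ hdfs dfs -cat file.txt | head -n 10  # first 10 lines, using the local head utility
$ hdfs dfs -tail file.txt              # the last kilobyte of the file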
Just as we can upload files from the local file system to HDFS, we can also download files from HDFS to the local file system by using -get.
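For example (paths are placeholders):

$ hdfs dfs -get /user/alice/file.txt .   # download into the current local directory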
With the -getmerge utility, the files in an HDFS directory can be merged into a single local file as they are downloaded.
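A sketch, assuming a directory of output files such as /user/alice/output:

$ hdfs dfs -getmerge /user/alice/output merged.txt   # concatenate the directory's files into one local file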
So that covers the basic NameNode and DataNode APIs. The following are some more advanced commands.
-chown, which stands for 'change owner', can be used to change the owner and group of files and folders, and therefore who can access the data.
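For example (user and group names are placeholders, and changing ownership typically requires superuser rights):

$ hdfs dfs -chown alice:analysts file.txt   # set owner to 'alice' and group to 'analysts'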
hdfs groups is useful to find out which groups your HDFS user belongs to.
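For example ('alice' is a placeholder username):

$ hdfs groups         # groups of the current user
$ hdfs groups alice   # groups of the user 'alice'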
-setrep provides an API to decrease or increase the replication factor of a file.
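For example:

$ hdfs dfs -setrep -w 2 file.txt   # set the replication factor to 2; -w waits until re-replication completes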
hdfs fsck, the file system checking utility, can be used to ask the NameNode for information about file blocks and their locations.
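For example (the path is a placeholder):

$ hdfs fsck /user/alice/file.txt -files -blocks -locations   # show blocks and the DataNodes that hold them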
-find is used to search for files matching a pattern, recursively within a folder.
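For example:

$ hdfs dfs -find /user/alice -name '*.txt'   # recursively find files whose names match the pattern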
Here, we've covered how to request meta-information from the NameNode and change the file system structure, how to read and write data to and from DataNodes in HDFS, and how to change the replication factor of files and get detailed information about the data stored in HDFS.
Sources:
Dral, Alexey A. 2014. Scaling Distributed File System. Big Data Essentials: HDFS, MapReduce and Spark RDD by Yandex. https://www.coursera.org/learn/big-data-essentials