What do you like best?
1) The distributed file system partitions huge datasets across multiple machines, which makes it possible to store petabytes of data. It follows the WORM model: write once, read many times.
2) A master node (the NameNode) distributes data among the nodes and maintains metadata such as file versions and paths, which makes it easy to locate files.
3) Low risk of data loss: because data is replicated across multiple DataNodes, blocks can be recovered after a failure and there is very little chance of losing data.
4) Reading, copying, and moving files in HDFS from the command line (for example, over a PuTTY session) is easy.
5) Apache Ambari provides a user interface for the Hadoop ecosystem that makes it easier to download, copy, rename, and move files and to change permissions on directories and files in HDFS.
6) Checksums are used for data integrity, which helps detect corrupted data.
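A few of the day-to-day operations mentioned above, as a minimal command-line sketch (the paths and filenames are hypothetical):

```shell
# Copy a local file into HDFS (write once)
hdfs dfs -put access.log /data/logs/

# List, move, and change permissions on files and directories
hdfs dfs -ls /data/logs
hdfs dfs -mv /data/logs/access.log /data/archive/access.log
hdfs dfs -chmod 640 /data/archive/access.log

# Print the checksum HDFS maintains for integrity verification
hdfs dfs -checksum /data/archive/access.log
```

These are the same operations Ambari exposes through its file view, so either interface can be used interchangeably.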
What do you dislike?
1) The NameNode has no built-in replication, so a NameNode failure takes a long time to recover from.
2) Because of the fixed block size, storing many small files is not efficient.
3) It doesn't allow multiple users to write to a file at the same time.
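The block-size behaviour behind the small-files issue is controlled by `dfs.blocksize` in `hdfs-site.xml`; a sketch with the common 128 MB default:

```xml
<property>
  <name>dfs.blocksize</name>
  <!-- 128 MB default; every file, however small, still costs the
       NameNode a full file-plus-block metadata entry in memory,
       which is why millions of tiny files scale poorly -->
  <value>134217728</value>
</property>
```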
Recommendations to others considering the product
1) HDFS is a file system with huge capacity that stores your files in a distributed manner across multiple networked machines.
2) It follows the write-once, read-many model, with data replicated across DataNodes.
3) A master node distributes data among the nodes and maintains metadata such as file versions and paths, which makes it easy to locate files.
What business problems are you solving with the product? What benefits have you realized?
1) We are able to load petabytes of metadata into HBase with MapReduce programs by creating HFiles in HDFS.
2) We are able to schedule our jobs with Oozie (run as the YARN user) by keeping the relevant files in HDFS.
3) We are able to store both content (any flat file) and metadata as HFiles in HDFS and then load them into HBase.
4) We store job logs in HDFS by date, which keeps track of each job.
5) We run a purging module to delete files from HDFS once they are loaded into HBase.
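The bulk-load and purge steps above can be sketched roughly as follows (the jar, job class, paths, and table name are hypothetical placeholders; `LoadIncrementalHFiles` is the classic HBase bulk-load entry point, though its package differs across HBase versions):

```shell
# Generate HFiles in HDFS with a MapReduce job (placeholder jar/class),
# then bulk-load them into HBase instead of writing through the client API
hadoop jar our-etl.jar com.example.HFileGeneratorJob /data/input /data/hfiles
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /data/hfiles content_table

# Purging module: remove the staged HFiles once they are loaded into HBase
hdfs dfs -rm -r -skipTrash /data/hfiles
```

Bulk loading pre-built HFiles this way avoids the write path of the RegionServers, which is what makes loading at petabyte scale practical.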