Hadoop commands cheat sheet

April 30, 2013
  1. %hadoop namenode -format     #This will format the file system
  2. Starting the daemons: %start-dfs.sh ; %start-mapred.sh
  3. If the configuration files mentioned in the previous post are in a separate directory, run the commands with the --config option: %start-dfs.sh --config path-to-config-directory
  4. This will start a namenode, a secondary namenode and a datanode
  5. Commands to close the daemons : stop-dfs.sh and stop-mapred.sh

You should also read http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html, and then write a post covering how the developer machine works, how it connects to and runs jobs on the Hadoop cluster, how code can be deployed or tested from the developer machine during development, and how this changes once the code is deployed.


Hadoop Configuration files

April 30, 2013

Common Properties – core-site.xml

HDFS Properties – hdfs-site.xml

MapReduce Properties – mapred-site.xml

The default settings are stored as HTML files in the docs directory of the Hadoop installation: core-default.html, hdfs-default.html, mapred-default.html.

For different configurations:

  1. http://hadoop.apache.org/docs/stable/single_node_setup.html
  2. http://hadoop.apache.org/docs/stable/cluster_setup.html
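As a concrete example, the single-node (pseudo-distributed) setup guide linked above uses minimal config files like the following; the host and port here are the values from that guide and should be adjusted for a real cluster:

```xml
<!-- core-site.xml: pseudo-distributed example -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replication of 1 makes sense only on a single node -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```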

Difference between HBase and HDFS?

April 30, 2013

HDFS is a distributed file system and has the following properties:
1. It is optimized for streaming access of large files. You would typically store files that are in the 100s of MB upwards on HDFS and access them through MapReduce to process them in batch mode.
2. HDFS is optimized for use cases where you write once and read many times like in the case of production logs. You can append to files in some of the recent versions but that is not a feature that is very commonly used. There is no concept of random writes.
3. HDFS doesn’t do random reads very well.

HBase on the other hand is a distributed column oriented database. The filesystem of choice typically is HDFS owing to the tight integration between HBase and HDFS. Having said that, it doesn’t mean that HBase can’t work on any other filesystem. It’s just not proven in production and at scale to work with anything except HDFS.
HBase provides you with the following:
1. It gives you the ability to do random reads/writes on your data, which HDFS doesn't allow you to.
2. HBase stores data in the form of key value pairs in a columnar fashion. HBase provides a flexible data model.
3. Fast scans across tables.
4. Scale in terms of writes as well as total volume of data.

An analogous comparison would be between MySQL and Ext4.


Hadoop commandline commands

April 30, 2013

http://hadoop.apache.org/docs/r1.0.4/commands_manual.html

 


New things to learn

April 26, 2013

pushd, tee, wget, curl, httpd, netstat, tcpdump, oprofile


Pass by value, reference, pointer C++

April 24, 2013

In Java, it is often said that all objects are passed by reference, but strictly speaking object references are themselves passed by value. I need to make this understanding of pass by value, reference and pointer clear first. Read it again and again.

http://www.cplusplus.com/articles/z6vU7k9E/

Then think about why it is said that everything is passed by reference in Java when it is an object.

 


Copy Constructors, Assignment Operators and Exception Safety

April 24, 2013
  1. http://www.cplusplus.com/articles/y8hv0pDG/
  2. The closest equivalent in Java is the concept of deep copying and shallow copying of objects. In Java, a class whose instances are compared or stored in hash-based collections should override the equals and hashCode methods, and equals is used for comparison. In C++, a class that defines a copy constructor typically also provides an overloaded == operator so that copied objects can be compared meaningfully.