Insert into doesn't work in Hive – bypassing it!

August 1, 2013

The INSERT INTO command doesn't work in older versions of Hive and gives the error “mismatched input ‘INTO’ expecting OVERWRITE in insert clause”.

Hence, we have to look at other methods of achieving the same result.

Hive tables are typically set up to be partitioned, so that you can load/INSERT OVERWRITE one partition at a time.
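
For example, with a table partitioned by a date column, each load overwrites only the target partition and leaves the rest of the table untouched (a sketch; the table, column and partition names are hypothetical):

-- refreshes only the dt='2013-08-01' partition
INSERT OVERWRITE TABLE myTable PARTITION (dt='2013-08-01')
SELECT col1, col2 FROM staging_table WHERE dt='2013-08-01';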

There is another, inefficient method of doing this, which rewrites the entire table on every insert :

INSERT OVERWRITE TABLE myTable
SELECT * FROM (
SELECT … FROM …
UNION ALL
SELECT * FROM myTable
) t;


Hadoop 0.20.2 Eclipse plugin not fully functioning – can’t ‘Run on Hadoop’

May 6, 2013

The Hadoop Eclipse plugin bundled with the Hadoop distribution is compatible with Eclipse only up to version 3.3. The JIRA ticket MAPREDUCE-1280 contains a patch for running the plugin in Eclipse 3.4 and upwards.

I just compiled the patched plugin with the fixes from the JIRA ticket MAPREDUCE-1280. The compiled file is attached to the ticket, where you can download it.

Simply remove the old plugin from your Eclipse installation and put the new version of the plugin into the dropins folder of your Eclipse installation.

After upgrading from an older version of the plugin you will have to start Eclipse once with the “-clean” command-line switch; the Eclipse documentation describes this and the other command-line switches.
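
The whole upgrade, as a sketch (the Eclipse binary location depends on your installation):

# remove the old plugin jar, copy the patched jar into dropins/, then restart once with:
./eclipse -clean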


5 reasons to use the hadoop-eclipse plugin

May 6, 2013
  1. You can browse the HDFS
  2. You can create new users in the HDFS
  3. You can upload files from your local development machine to HDFS directly through the plugin, with no command line needed
  4. Creating a MapReduce Project instead of a generic Java project automatically adds the prerequisite jar files to the build path. If you create a regular Java project, you must add the Hadoop jar (and its dependencies) to the build path manually.
  5. An easier way to manipulate files in HDFS may be through the Eclipse plugin. In the DFS location viewer, right-click on any folder to see a list of actions available. You can create new subdirectories, upload individual files or whole subdirectories, or download files and directories to the local disk.

How to create a directory on the Hadoop filesystem ?

May 3, 2013

hadoop dfs -fs hdfs://<url>:<port> -mkdir <directory_name>
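
For example, with a (hypothetical) NameNode at namenode.example.com:8020 :

hadoop dfs -fs hdfs://namenode.example.com:8020 -mkdir /user/alice/data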


How to list all the files in a particular directory in Hadoop Filesystem ?

May 3, 2013

hadoop dfs -fs hdfs://<url>:<port> -lsr <directory_name>
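
For example, to list everything under a user directory on the same hypothetical cluster (-lsr recurses into subdirectories, while plain -ls lists only the immediate contents):

hadoop dfs -fs hdfs://namenode.example.com:8020 -lsr /user/alice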

All the other Hadoop shell commands can be found here : http://hadoop.apache.org/docs/stable/file_system_shell.html

Besides, you can read the FsShell.java code here : http://svn.apache.org/viewvc/hadoop/common/tags/release-1.0.3/src/core/org/apache/hadoop/fs/FsShell.java?view=markup

You can even add implementations for commands that are not already there.

Usage: java FsShell
[-ls <path>]
[-lsr <path>]
[-df [<path>]]
[-du <path>]
[-dus <path>]
[-count[-q] <path>]
[-mv <src> <dst>]
[-cp <src> <dst>]
[-rm [-skipTrash] <path>]
[-rmr [-skipTrash] <path>]
[-expunge]
[-put <localsrc> ... <dst>]
[-copyFromLocal <localsrc> ... <dst>]
[-moveFromLocal <localsrc> ... <dst>]
[-get [-ignoreCrc] [-crc] <src> <localdst>]
[-getmerge <src> <localdst> [addnl]]
[-cat <src>]
[-text <src>]
[-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
[-moveToLocal [-crc] <src> <localdst>]
[-mkdir <path>]
[-setrep [-R] [-w] <rep> <path/file>]
[-touchz <path>]
[-test -[ezd] <path>]
[-stat [format] <path>]
[-tail [-f] <file>]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-chgrp [-R] GROUP PATH...]
[-help [cmd]]

Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
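
These generic options can be combined with any of the shell commands above; for example (host, file and path are illustrative):

hadoop fs -D dfs.replication=2 -fs hdfs://namenode.example.com:8020 -put local.txt /user/alice/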


How to copy files from one Hadoop Cluster to another ?

May 1, 2013

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting.

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo
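
DistCp also accepts multiple source paths; this example comes straight from the DistCp guide linked below:

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
hdfs://nn1:8020/foo/b \
hdfs://nn2:8020/bar/foo

This copies both /foo/a and /foo/b of nn1 into /bar/foo of nn2.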

Reference : http://hadoop.apache.org/docs/r0.19.1/distcp.html

To do : read up on DistCp's multi-source copy in detail and publish a blog post on it.

Read the rest of the commands guide here : http://hadoop.apache.org/docs/r0.19.1/commands_manual.html

Read the whole of the Hadoop tutorial on YDN and blog about it.

Think about all of these topics in question format and write them down.


How can you specify the Hadoop Filesystem ?

May 1, 2013

-fs [local | <file system URI>]: Specify the file system to use.
If not specified, the current configuration is used, taken from the following, in increasing precedence:
core-default.xml inside the hadoop jar file
core-site.xml in $HADOOP_CONF_DIR
‘local’ means use the local file system as your DFS.
<file system URI> specifies a particular file system to contact. This argument is optional, but if used it must appear first on the command line. Exactly one additional argument must be specified.

To get clarity in understanding, you should realize that Hadoop is running a distributed filesystem on a cluster. So even if you are pointing at a single URL, there are many nodes behind that URL in the cluster. As a Hadoop user, you are agnostic of how Hadoop maintains these nodes or how Hadoop operations run on them. So basically, when you do a distcp, that is, copy data from one cluster to another, you are copying data from one distributed Hadoop filesystem to another.
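
For example, the same listing command can be pointed at different filesystems (the HDFS host is hypothetical; ‘local’ comes from the help text above):

hadoop fs -fs hdfs://namenode.example.com:8020 -ls /user/alice
hadoop fs -fs local -ls /tmp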


Hadoop commands cheat sheet

April 30, 2013
  1. %hadoop namenode -format     #This will format the filesystem
  2. Starting the daemons :        %start-dfs.sh  ;     %start-mapred.sh
  3. If the configuration files mentioned in the previous post are in a separate directory, run the commands with the --config option : %start-dfs.sh --config path-to-config-directory
  4. This will start a namenode, a secondary namenode and a datanode
  5. Commands to stop the daemons : stop-dfs.sh and stop-mapred.sh

You should also read http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html and then write a post on how the developer machine works, how it connects to and runs jobs on the Hadoop cluster, how code can be deployed or tested from the developer machine against the cluster during development, and how this changes when the code gets deployed.


Hadoop Configuration files

April 30, 2013

Common Properties – core-site.xml

HDFS Properties – hdfs-site.xml

MapReduce Properties – mapred-site.xml

The default settings are stored in the docs directory of the Hadoop installation as HTML files called core-default.html, hdfs-default.html and mapred-default.html.
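
As a minimal sketch, a pseudo-distributed core-site.xml might look like this (the host and port are illustrative; fs.default.name is the Hadoop 1.x property name):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>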

For different configurations :

  1. http://hadoop.apache.org/docs/stable/single_node_setup.html
  2. http://hadoop.apache.org/docs/stable/cluster_setup.html

Difference between HBase and HDFS ?

April 30, 2013

HDFS is a distributed file system and has the following properties:
1. It is optimized for streaming access to large files. You would typically store files that are in the 100s of MB and upwards on HDFS, and access them through MapReduce to process them in batch mode.
2. HDFS is optimized for use cases where you write once and read many times, as with production logs. You can append to files in some of the recent versions, but that is not a commonly used feature. There is no concept of random writes.
3. HDFS doesn't do random reads very well.

HBase on the other hand is a distributed column oriented database. The filesystem of choice typically is HDFS owing to the tight integration between HBase and HDFS. Having said that, it doesn’t mean that HBase can’t work on any other filesystem. It’s just not proven in production and at scale to work with anything except HDFS.
HBase provides you with the following:
1. It gives you the ability to do random reads/writes on your data, which HDFS doesn't allow.
2. HBase stores data in the form of key-value pairs in a columnar fashion, and provides a flexible data model.
3. Fast scans across tables.
4. Scale in terms of writes as well as the total volume of data.
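
For instance, the random writes and reads in point 1 are one-liners in the HBase shell (the table and column family names here are made up):

hbase> create 'mytable', 'cf'
hbase> put 'mytable', 'row1', 'cf:col', 'some value'
hbase> get 'mytable', 'row1'

There is no equivalent single-record update in HDFS; you would rewrite the file instead.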

An analogous comparison would be between MySQL and Ext4.