August 1, 2013
The INSERT INTO command doesn't work in Hive and gives the error “mismatched input ‘INTO’ expecting OVERWRITE in insert clause”.
Hence, we have to look at other methods of achieving the same result.
Hive tables are typically set up to be partitioned, and data is loaded or overwritten one partition at a time.
There is another, inefficient method of doing this: overwrite the whole table with the union of its existing rows and the new ones (in this version of Hive, UNION ALL must sit inside a subquery):
INSERT OVERWRITE TABLE myTable
SELECT * FROM (
SELECT * FROM myTable
UNION ALL
SELECT … FROM …
) t;
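For reference, the partition-at-a-time loading mentioned above looks like this. A minimal sketch, assuming a table partitioned by a string column dt — the table, column, and staging-table names are made up for illustration:

```sql
-- table, column, and staging names are hypothetical
INSERT OVERWRITE TABLE events PARTITION (dt='2013-08-01')
SELECT col1, col2
FROM events_staging
WHERE dt = '2013-08-01';
```

Only the addressed partition is rewritten; the rest of the table stays untouched, which avoids the full-table rewrite of the UNION ALL workaround.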
May 6, 2013
The Hadoop Eclipse plugin bundled with the Hadoop distribution is compatible with Eclipse only up to version 3.3. The JIRA ticket MAPREDUCE-1280 contains a patch for running the plugin in Eclipse 3.4 and later.
I just compiled the plugin with the fixes from that ticket; the resulting file is attached to the ticket. You can find it here.
Simply remove the old plugin from your Eclipse installation and put the new version of the plugin into the dropins folder.
After upgrading from an older version of the plugin, you will have to start Eclipse once with the “-clean” command-line switch. Help on Eclipse command-line switches can be found here.
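On a Linux box the steps above might look like this; every path and the jar name here are assumptions, adjust them to your install:

```shell
# remove the old plugin bundled with the Hadoop distribution (path assumed)
rm ~/eclipse/plugins/hadoop-*-eclipse-plugin*.jar

# drop the patched plugin from MAPREDUCE-1280 into the dropins folder
cp hadoop-eclipse-plugin.jar ~/eclipse/dropins/

# start Eclipse once with -clean so it rescans its plugin registry
~/eclipse/eclipse -clean
```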
May 3, 2013
hadoop dfs -fs hdfs://<url>:<port> -mkdir <directory_name>
May 3, 2013
hadoop dfs -fs hdfs://<url>:<port> -lsr <directory_name>
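Filling in the placeholders from the two entries above with concrete values; the namenode host and port are hypothetical, substitute your own:

```shell
# create a directory on the cluster's DFS (host/port assumed)
hadoop dfs -fs hdfs://namenode:9000 -mkdir /user/demo

# recursively list it, similar to ls -R on a local filesystem
hadoop dfs -fs hdfs://namenode:9000 -lsr /user/demo
```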
All the other hadoop shell commands can be found here: http://hadoop.apache.org/docs/stable/file_system_shell.html
Besides, you can read the FsShell.java code here: http://svn.apache.org/viewvc/hadoop/common/tags/release-1.0.3/src/core/org/apache/hadoop/fs/FsShell.java?view=markup
You can also add implementations for commands that are not already there.
Usage: java FsShell
[-mv <src> <dst>]
[-cp <src> <dst>]
[-rm [-skipTrash] <path>]
[-rmr [-skipTrash] <path>]
[-put <localsrc> … <dst>]
[-copyFromLocal <localsrc> … <dst>]
[-moveFromLocal <localsrc> … <dst>]
[-get [-ignoreCrc] [-crc] <src> <localdst>]
[-getmerge <src> <localdst> [addnl]]
[-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
[-moveToLocal [-crc] <src> <localdst>]
[-setrep [-R] [-w] <rep> <path/file>]
[-test -[ezd] <path>]
[-stat [format] <path>]
[-tail [-f] <file>]
[-chmod [-R] <MODE[,MODE]… | OCTALMODE> PATH…]
[-chown [-R] [OWNER][:[GROUP]] PATH…]
[-chgrp [-R] GROUP PATH…]
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
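If a command you want is missing from the list above, the same org.apache.hadoop.fs.FileSystem API that FsShell builds on can be used directly. A minimal sketch of a du-like command, assuming the Hadoop 1.x jars are on the classpath; the class name and the idea of the command are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical custom command: sum up the sizes of the files directly
// under a DFS directory, in the spirit of FsShell's built-in commands.
public class DirSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);     // the configured filesystem
        long total = 0;
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            total += status.getLen();             // file length in bytes
        }
        System.out.println(args[0] + ": " + total + " bytes");
    }
}
```

Run it via hadoop jar (after compiling against hadoop-core) so that the cluster configuration and classpath are picked up automatically.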
May 1, 2013
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting.
bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
                    hdfs://nn2:8020/bar/foo
Reference : http://hadoop.apache.org/docs/r0.19.1/distcp.html
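On multi-source copies, which the distcp docs referenced above also cover: distcp accepts several source paths before the destination, or a file listing the sources via -f. The hosts and paths here are hypothetical:

```shell
# multiple sources; the last argument is the destination
hadoop distcp hdfs://nn1:8020/foo/a \
              hdfs://nn1:8020/foo/b \
              hdfs://nn2:8020/bar/foo

# or keep the list of sources in a file on the DFS
hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
```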
To do: read up on the multi-source copy mode of distcp and publish a blog post on it.
Read all the other command guides here: http://hadoop.apache.org/docs/r0.19.1/commands_manual.html
Read the whole of the Hadoop material on YDN and blog about it.
Think about all of it in question format and write the questions down.
May 1, 2013
-fs [local | <file system URI>]: Specify the file system to use.
If not specified, the current configuration is used,
taken from the following, in increasing precedence:
core-default.xml inside the hadoop jar file
core-site.xml in $HADOOP_CONF_DIR
‘local’ means use the local file system as your DFS.
<file system URI> specifies a particular file system to
contact. This argument is optional but if used must
appear first on the command line. Exactly one additional
argument must be specified.
To understand this clearly, realize that Hadoop is running a distributed filesystem on a cluster. So even though you point at a single URL, there are many nodes behind that URL in the cluster. As a Hadoop user, you are agnostic of how Hadoop maintains those nodes or how operations run on them. So when you do a distcp, that is, copy data from one cluster to another, you are really copying data from one distributed Hadoop filesystem to another distributed Hadoop filesystem.