Descriptive Statistics – starting with the data

September 27, 2013

These are the kinds of analysis you can do when you start with any data set. This is usually the starting point of a data science project, and it gives the first insights into the data. It is essential both for statisticians and for consumers of statistical reports.

For quantitative variables:

  1. minimum, maximum
  2. median, quartiles, interquartile range
  3. box plots
  4. mean
  5. spread of the data – standard deviation. Sometimes there are gaps in the data when we plot it as a histogram, and there are outliers. When there are special underlying rules in the way the data is generated, the data will contain outliers. For example: some football clubs can pay foreign players salaries above the salary cap, which produces outlier salaries for those players. Another example: the top deal or product on an ecommerce site gets the most clicks by virtue of its position, which creates an outlier if the deals are ranked by click-through rate (CTR). Cleaning the data is an important first step in any statistical analysis, and it is important to understand the reasons behind the outliers. In some cases it is good to remove the outliers; in others it is not, because we might lose valuable signals in the data. It is not unusual to report findings both with and without outliers.
  6. shape of the data – histograms
  7. skewed vs non-skewed, symmetric vs non-symmetric
  8. left skewed or negatively skewed – the distribution has a long left tail – mean < median < mode – the difference between the 3rd quartile and the median is smaller than the difference between the median and the 1st quartile
  9. right skewed or positively skewed – where it has a long right tail
  10. extreme values or outliers – sometimes the data has a much more regular shape when the outliers are removed
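
The summaries above are easy to compute directly. Here is a minimal sketch using Python's stdlib statistics module, with made-up salary figures; the 1.5 × IQR rule used to flag outliers is one common convention, not the only one:

```python
import statistics

# Hypothetical player salaries (in thousands); the last value is an outlier.
salaries = [40, 42, 45, 47, 50, 52, 55, 60, 300]

q1, median, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

summary = {
    "min": min(salaries),
    "max": max(salaries),
    "median": median,
    "iqr": iqr,
    "mean": statistics.mean(salaries),
    "stdev": statistics.stdev(salaries),  # sample standard deviation
}

# Common rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
outliers = [s for s in salaries if s < q1 - 1.5 * iqr or s > q3 + 1.5 * iqr]
print(summary, outliers)
```

Note how the single outlier pulls the mean far above the median – exactly the kind of skew the quantile-based summaries are robust to.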

For categorical variables:

  1. bar charts
  2. pie charts
  3. Examining the relationship between a quantitative variable and a categorical variable involves comparing the values of the quantitative variable among the groups defined by the categorical variable.
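
Point 3 amounts to grouping the quantitative values by category and comparing per-group summaries. A toy sketch with hypothetical delivery-time data:

```python
import statistics
from collections import defaultdict

# Hypothetical (category, value) records: delivery time in days by city tier.
records = [("tier1", 2), ("tier1", 3), ("tier2", 4), ("tier2", 6), ("tier2", 5)]

groups = defaultdict(list)
for category, value in records:
    groups[category].append(value)

# Compare the quantitative variable across the categorical groups.
by_group = {cat: (statistics.mean(vals), statistics.median(vals))
            for cat, vals in groups.items()}
print(by_group)
```

Side-by-side box plots of the per-group values are the visual equivalent of this comparison.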

Missing Values

We must understand why the data for some of the variables is missing; the very fact that it is missing might bias the results of our work.
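
A first step is simply counting missingness per variable, and checking whether missingness co-occurs across variables – a hint that values are not missing completely at random. A toy sketch with hypothetical rows:

```python
# Hypothetical survey rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": None, "income": None},
]

variables = ["age", "income"]
missing = {v: sum(1 for r in rows if r[v] is None) for v in variables}

# Does missingness co-occur? If so, values may not be missing completely
# at random, and dropping those rows could bias the results.
both_missing = sum(1 for r in rows if r["age"] is None and r["income"] is None)
print(missing, both_missing)
```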

Joke on Data Scientists

September 25, 2013

Today I started reading Moneyball – back to Michael Lewis after almost two years. And guess what, the current buzzword in the valley is “Data Science”.

I was having a discussion with my manager about hiring a candidate for an open position. During the review meeting, we reached the conclusion that the candidate was not so great at machine learning and not so great at programming either. So somebody in the room cracked a joke: “sounds like a data scientist”.

But jokes apart, statistics, machine learning and programming put together make a formidable skillset in the industry today. So I have decided to start a series of blog posts as a statistics refresher for myself.

And guess what, 2013 is also the international year of statistics. Sounds coincidental.

Serendipity!

Cool feature in Flipkart’s user reviews: matching a review to a particular item attribute

September 14, 2013

The Flipkart product management and research teams have come up with a cool idea: matching a user review for a particular item to a specific attribute of that item. They call it “product features users are talking about”.



As you can see, they have identified operating system, games, value for money and apps as the features for the iPhone.

Now, based on the feature you choose, you can see all the reviews that are clustered under that feature.


And then, you can select a particular review and read that review in detail. 



This is a really cool feature and will massively improve the buyer’s experience. It could also lead the way to more granular recommendations in the future. If Flipkart knows which features of a product you are looking for, it can recommend products that are strong on that feature, based on the reviews of users who have used it – a strong case for collaborative filtering. Better recommendations should follow once they have a good data set and more money.

I think this is a nice example of the product management team and the research (NLP and machine learning) team coming together to bring out a new feature for Flipkart.

It would be interesting to see how many other products or categories Flipkart shows this feature for.

For watches they are not.

Another cool feature on their website is certified buyer reviews. This adds authenticity to a review and makes it more credible to the reader. They also indicate when a review comes from a first-time reviewer.


Difference between Information Retrieval and Information Filtering

September 10, 2013

Information retrieval is about fulfilling immediate queries from a library of available information.

Example : you have a deal store containing 100 deals and a query comes from a user. You show the deals that are relevant to that query.

Information filtering is about processing a stream of information to match your static set of likes, tastes and preferences.

Example : a clipper service which reads all the news articles published today and serves you content that is relevant to you based on your likes and interests.
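
The contrast fits in a few lines of code: retrieval ranks a fixed corpus against an ad-hoc query, while filtering matches an incoming stream against a fixed user profile. A toy word-overlap matcher, with all data hypothetical:

```python
# Hypothetical deal store and news stream; matching is naive word overlap.
deals = ["nike running shoes", "leather wallet", "running socks"]

def retrieve(query, corpus):
    """Information retrieval: answer an ad-hoc query from a stored library."""
    terms = set(query.split())
    return [doc for doc in corpus if terms & set(doc.split())]

profile = {"cricket", "startups"}  # the user's static interests

def filter_stream(articles, interests):
    """Information filtering: match an incoming stream to a static profile."""
    return [a for a in articles if interests & set(a.split())]

print(retrieve("running shoes", deals))
print(filter_stream(["cricket scores today", "pasta recipes"], profile))
```

Note the symmetry: in retrieval the query changes and the corpus is fixed; in filtering the stream changes and the profile is fixed.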

Insert into doesn’t work in Hive – bypassing it!

August 1, 2013

The INSERT INTO command doesn’t work in Hive and gives the error “mismatched input ‘INTO’ expecting OVERWRITE in insert clause”.

Hence, we have to look at other methods of achieving the same result.
Typically, Hive tables are set up to be partitioned, and you load or INSERT OVERWRITE one partition at a time.

There is another, inefficient method: overwrite the table with the union of its existing rows and the new rows (table names here are illustrative):

INSERT OVERWRITE TABLE myTable
SELECT * FROM (
  SELECT * FROM myTable
  UNION ALL
  SELECT * FROM newRows
) t;

This rewrites the whole table just to append a few rows, which is why it is inefficient.

Difference between HBase and HDFS ?

April 30, 2013

HDFS is a distributed file system and has the following properties:
1. It is optimized for streaming access of large files. You would typically store files that are in the 100s of MB upwards on HDFS and access them through MapReduce to process them in batch mode.
2. HDFS is optimized for use cases where you write once and read many times like in the case of production logs. You can append to files in some of the recent versions but that is not a feature that is very commonly used. There is no concept of random writes.
3. HDFS doesn’t do random reads very well.

HBase on the other hand is a distributed column oriented database. The filesystem of choice typically is HDFS owing to the tight integration between HBase and HDFS. Having said that, it doesn’t mean that HBase can’t work on any other filesystem. It’s just not proven in production and at scale to work with anything except HDFS.
HBase provides you with the following:
1. It gives you the ability to do random reads/writes on your data, which HDFS doesn’t allow.
2. HBase stores data in the form of key value pairs in a columnar fashion. HBase provides a flexible data model.
3. Fast scans across tables.
4. Scale in terms of writes as well as total volume of data.

An analogous comparison would be between MySQL and Ext4.

Benchmark Bond Trade Price Challenge : Kaggle

December 26, 2012

This post was long overdue. I participated in this bond pricing challenge on Kaggle and used a regression-based approach to predict bond prices. Here is an outline of the approach.

  1. Build on the training set and predict on the test set. The dependent variable we are trying to predict is the bond price, and the independent variables are the last 10 trade prices.
  2. Prepare frequency charts based on whether callability is 0 or 1.
  3. Divide the data into 12 parts based on: callability; price > 100 or price < 100 (the bond price always converges to 100 at maturity, so the two curves look different); and the type of trade (dealer to dealer, dealer to client, client to client – a quote-driven market).
  4. Some of the values were missing – missing value treatment (based on exponential weights)
  5. Run regression on these sub-data sets and analyze the results
  6. Some of the t-tests failed – bond ids and time to delay – the p-value cut-off for rejecting coefficients was kept at 3%.

Notes :

  1. R2 is a statistic that will give some information about the goodness of fit of a model. In regression, the R2 coefficient of determination is a statistical measure of how well the regression line approximates the real data points. An R2 of 1.0 indicates that the regression line perfectly fits the data.
  2. R2 measures goodness of fit, but it will not detect overfitting because it increases with any new predictor (unless it has already reached 1).
  3. Assumptions of regression – the mnemonic LINE:
  4. L – Linear relationship
     I – Independent observations
     N – Normally distributed around the regression line
     E – Equal variance across X’s
  5. Multicollinearity: when two independent variables are correlated – detected via the Variance Inflation Factor (VIF).
  6. A p-value is the probability of observing a result as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. So if p < .03, that probability is quite small, and we keep that independent variable (i.e. we reject the hypothesis that its coefficient is zero).


I have been away from data science for a while because of a job change and interviewing. New Year resolution: get back to Kaggle.