Methods of Data Collection and Biases

September 30, 2013

Methods :

  1. Simple Random Sampling
  2. Stratified Sampling – divide the population into non-overlapping subgroups called strata and take a simple random sample (SRS) within each stratum. When the strata are internally homogeneous, the variance within each stratum is smaller than the overall population variance, which makes the estimates more precise.
  3. Cluster Sampling
  4. Systematic Sampling – select every kth item; hidden periodic patterns in the ordering of the items can bias the sample
  5. Convenience or Volunteer Sampling : select whichever points are easiest to reach, for example the first n points
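
The methods above can be sketched in a few lines of Python. This is only a rough illustration: the population of 100 unit ids and the grouping into blocks of ten are made up for the example, and the function names are my own.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n items uniformly at random, without replacement."""
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Split the items into strata, then take an SRS inside each stratum."""
    strata = {}
    for item in population:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

def cluster_sample(population, cluster_of, n_clusters, rng):
    """Pick whole clusters at random and keep every item in them."""
    clusters = {}
    for item in population:
        clusters.setdefault(cluster_of(item), []).append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for key in chosen for item in clusters[key]]

def systematic_sample(population, k, rng):
    """Take every kth item, starting from a random offset in [0, k)."""
    start = rng.randrange(k)
    return population[start::k]

def convenience_sample(population, n):
    """Take the first n items: easy to do, but not random."""
    return population[:n]

rng = random.Random(0)           # fixed seed so the run is repeatable
population = list(range(100))    # hypothetical population of unit ids
block_of_ten = lambda x: x // 10 # hypothetical grouping into 10 groups

print("SRS:        ", simple_random_sample(population, 5, rng))
print("stratified: ", stratified_sample(population, block_of_ten, 1, rng))
print("cluster:    ", cluster_sample(population, block_of_ten, 2, rng))
print("systematic: ", systematic_sample(population, 20, rng))
print("convenience:", convenience_sample(population, 5))
```

Note how stratified sampling takes a little from every group, while cluster sampling takes everything from a few groups.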


Bias : 

  1. Selection Bias – the sample is not representative of the target population. For example : predicting poll results from Twitter data.
  2. Measurement or Response Bias – the way the questions are worded or asked pushes the recorded answers away from the true values.
  3. Non-response Bias – the individuals who respond differ systematically from those who do not. For example : a mandatory survey in Canada that was sent to 1/5th of the people was made optional and sent to 1/3rd of the people. Since the response was voluntary rather than mandatory, new immigrants were much less likely to respond to it.
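
To see how non-response bias moves an estimate, here is a small sketch. All the numbers (group sizes, average values, response rates) are hypothetical and chosen only to make the effect visible; they are not from the Canadian survey.

```python
def population_mean(groups):
    """True mean over everyone. groups: list of (size, mean_value, response_rate)."""
    total_size = sum(size for size, _, _ in groups)
    return sum(size * mean for size, mean, _ in groups) / total_size

def respondent_mean(groups):
    """Expected mean over only the people who answer the survey."""
    responders = sum(size * rate for size, _, rate in groups)
    weighted = sum(size * rate * mean for size, mean, rate in groups)
    return weighted / responders

# Hypothetical population: (size, average value of the survey variable, response rate)
groups = [
    (8000, 50.0, 0.8),  # established residents, respond often
    (2000, 30.0, 0.4),  # new immigrants, respond less often
]

print("true mean:      ", population_mean(groups))
print("respondent mean:", round(respondent_mean(groups), 2))
```

Because the group that responds less also has different values, the respondents' mean drifts away from the true mean, and sending the survey to more people does not fix it.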

Descriptive Statistics – starting with the data

September 27, 2013

These are the kinds of analysis you can do when you start with any data set. They are often the starting point of a data science project and give early insight into the data. They are essential both for statisticians and for consumers of statistical reports.

For quantitative variables :

  1. minimum, maximum
  2. median, quartiles, interquartile range
  3. box plots
  4. mean
  5. spread of the data – standard deviation. When we plot the data as a histogram there may sometimes be gaps, with outliers sitting beyond them. Outliers often appear when special rules govern how the data is generated. For example : some football clubs can pay foreign players salaries above the salary cap, which produces outlier salaries for those players. Another example : the top deal or product on an ecommerce site gets the most clicks by virtue of its position, so if the deals are ranked it will show up as an outlier in click-through rate (CTR). Cleaning the data is an important first step in any statistical analysis, and it is important to understand the reasons behind the outliers. In some cases it is good to remove them; in others it is not, because we might lose valuable signal. It is not unusual to report findings both with and without outliers.
  6. shape of the data – histograms
  7. skewed vs non-skewed, symmetric vs non-symmetric
  8. left skewed or negatively skewed – has a long left tail – typically mean < median < mode – the difference between the 3rd quartile and the median is smaller than the difference between the median and the 1st quartile
  9. right skewed or positively skewed – has a long right tail – typically mode < median < mean
  10. extreme values or outliers – sometimes the data has a much more regular shape when the outliers are removed
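
A minimal sketch of these summaries using Python's standard `statistics` module. The salary figures are invented, the 1.5 × IQR fence is one common convention for flagging outliers (not the only one), and comparing mean with median is only a rough hint about skew.

```python
import statistics

def describe(values):
    """Basic descriptive statistics for a quantitative variable."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mean = statistics.mean(data)
    if mean > median:
        skew_hint = "right"      # long right tail pulls the mean up
    elif mean < median:
        skew_hint = "left"       # long left tail pulls the mean down
    else:
        skew_hint = "symmetric"
    return {
        "min": data[0], "max": data[-1],
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "mean": mean, "stdev": statistics.stdev(data),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
        "skew_hint": skew_hint,
    }

# Hypothetical salaries (in thousands); one player is paid above the cap
salaries = [40, 42, 45, 47, 50, 52, 55, 58, 60, 250]

full = describe(salaries)
trimmed = describe([x for x in salaries if x not in full["outliers"]])

# Report the findings both with and without the outliers
for label, summary in (("with outliers", full), ("without outliers", trimmed)):
    print(label, {k: round(v, 2) if isinstance(v, float) else v
                  for k, v in summary.items()})
```

The single large salary inflates both the mean and the standard deviation, while the median barely moves, which is why the median and the interquartile range are often preferred for skewed data.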

For categorical variables

  1. bar charts
  2. pie charts
  3. Examining the relationship between a quantitative variable and a categorical variable involves comparing the values of the quantitative variable among the groups defined by the categorical variable.
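
A small sketch of both ideas: counting categories (the numbers behind a bar chart or pie chart) and comparing a quantitative variable across the groups a categorical variable defines. The deal categories and click counts are made up for illustration.

```python
from collections import Counter, defaultdict
import statistics

def category_counts(labels):
    """Frequency of each category: the heights of the bars in a bar chart."""
    return Counter(labels)

def summarize_by_group(rows):
    """rows: iterable of (category, value). Summarize the values per category."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return {
        category: {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
        for category, values in groups.items()
    }

# Hypothetical data: (deal category, clicks)
deals = [("travel", 120), ("food", 80), ("travel", 200), ("food", 60), ("spa", 40)]

print(category_counts(category for category, _ in deals))
for category, summary in summarize_by_group(deals).items():
    print(category, summary)
```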

Missing Values

We must understand why the data for some of the variables is missing, because the fact that it is missing might bias the results of our work.

Statistics 101

September 25, 2013
Dependent variable: a variable that represents the aspect of the world that the experimenter predicts will be affected by the independent variable.
Descriptive statistics: procedures used to summarize, organize, and simplify data.
Double blind experiment: an experiment in which neither the experimenter nor the subject knows whether the treatment is experimental or control.
Independent variable: a variable manipulated by the experimenter.
Inferential statistics: procedures that allow for generalizations about population parameters based on sample statistics.
Parameter: a numerical measure that describes a characteristic of a population.
Population: the entire collection of cases to which one attempts to generalize.
Sample: a subset of the population.
Statistic: a numerical measure that describes a characteristic of a sample.
Quasi-independent variable: a variable that resembles an independent variable but is not manipulated by the experimenter.
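
The population, sample, parameter and statistic definitions fit together as in the sketch below. The population of heights is simulated and entirely hypothetical; in practice the parameter is usually unknown, which is why we estimate it from a sample.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Population: the entire collection of cases (hypothetical heights in cm)
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: a numerical measure that describes the population
population_mean = statistics.mean(population)

# Sample: a subset of the population
sample = rng.sample(population, 100)

# Statistic: a numerical measure that describes the sample.
# Inferential statistics uses it to generalize about the parameter.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```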

Joke on Data Scientists

September 25, 2013

Today I started reading Moneyball. Back to Michael Lewis after almost two years. And guess what, the current buzzword in the valley is “Data Science”.

I was having a discussion with my manager regarding hiring a candidate for an open position. During the review meeting, we reached the conclusion that the candidate was not so OK on machine learning and not so OK on programming. So somebody in the room cracked a joke : “sounds like a data scientist”.

But jokes apart, statistics, machine learning and programming put together are a formidable skill set in the industry today. So I have decided to start a series of blog posts as a statistics refresher for myself.

And guess what, 2013 is also the international year of statistics. Sounds coincidental.

Serendipity !

Moneyball review

September 25, 2013

Michael Lewis has a gripping writing style. He writes about different industries. Two years back, I read Liar’s Poker, before I joined investment banking. At that point, I had only a rough idea of banking, gathered from various sources on the internet. What I found in Liar’s Poker was that Michael Lewis made me feel a part of the industry. Two years later, now that I am reading Moneyball, I am going through the same feeling again. He uses terms like “a soft tosser”, which the scouts used to mean “not worth my time”, and that makes me feel that I am a part of the industry. His way of engaging the reader is emphatic.

Michael Lewis builds his character in front of the reader and only then names the character. The reader goes through the transformation of the character quickly and relates to it easily. He introduces the character David Beck and illustrates how his hand might twist and turn in different directions. The reader can almost see it in front of him, and then Lewis gives David Beck the name “The Creature”. It’s as if the reader sees his arm movement, then hears the name, and agrees that he should be called “The Creature”.


Quotes from Moneyball

September 25, 2013
  1. The human mind played tricks on itself when it relied exclusively on what it saw, and every trick it played was a financial opportunity for someone who saw through the illusion to the reality.



Cool feature in Flipkart’s user reviews trying to match a review to a particular item attribute

September 14, 2013

The Flipkart Product Management and Research team has come up with a cool idea : matching a user review for an item to a specific attribute of that item. They call it “product features users are talking about”.



As you can see, they have identified operating system, games, value for money and apps as the features for the iPhone.

Now, based on the feature you choose, you can see all the reviews that are clustered under that feature.


And then, you can select a particular review and read that review in detail. 



This is a really cool feature and will massively improve the buyer’s experience. In future it will also lead the way to more granular recommendations. If Flipkart knows which features you are looking for in a product, it can recommend products that are good on those features, based on the recommendations of users who have used them. A strong case for collaborative filtering. Expect better recommendations in the future, when they have a good data set and more money.

I think this is a nice example of the product management team and the research (NLP and machine learning) team coming together to bring out a new feature for Flipkart.

It would be interesting to see on how many other products or categories Flipkart is showing this feature.

For watches they are not.

Some other cool features on their website are certified buyer reviews. This adds authenticity to the review and makes it more credible to the reader. They also indicate when a review comes from a first-time reviewer.


Difference between Information Retrieval and Information Filtering

September 10, 2013

Information retrieval is about fulfilling immediate queries from a library of information available.

Example : you have a deal store containing 100 deals and a query comes from a user. You show the deals that are relevant to that query.

Information filtering is about processing a stream of information to match your static set of likes, tastes and preferences.

Example : a clipper service which reads all the news articles published today and serves you content that is relevant to you based on your likes and interests.
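
The contrast can be sketched in code: in retrieval the collection is fixed and the query changes, while in filtering the profile is fixed and the items keep arriving. The word-overlap matching below is a deliberately naive stand-in for a real relevance model, and the deals, articles and interests are invented.

```python
def tokens(text):
    """Lowercased set of words in a piece of text."""
    return set(text.lower().split())

def retrieve(store, query):
    """Information retrieval: rank a fixed store against a one-off query."""
    query_words = tokens(query)
    scored = [(len(query_words & tokens(doc)), doc) for doc in store]
    ranked = sorted(scored, key=lambda pair: -pair[0])  # best match first
    return [doc for score, doc in ranked if score > 0]

def filter_stream(stream, profile):
    """Information filtering: test each arriving item against a fixed profile."""
    interests = {word.lower() for word in profile}
    for item in stream:
        if interests & tokens(item):
            yield item

# Retrieval: a static deal store, a query arrives
deal_store = ["pizza deal downtown", "spa discount", "pizza and pasta combo"]
print(retrieve(deal_store, "pizza pasta"))

# Filtering: a static set of interests, articles arrive over time
todays_articles = ["Cricket score update", "Stock markets fall", "New cricket league"]
print(list(filter_stream(todays_articles, {"cricket"})))
```

`filter_stream` is written as a generator on purpose, since a stream has no natural end and items should be passed on as they arrive.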

Argentina Debt Crisis, 2002 – A Debt Restructuring Case Study

September 10, 2013

In 2002, Argentina faced a financial crisis and had no choice but to devalue its currency and default on its debt. Before this, the Argentine peso was pegged to the dollar at 1:1; after the devaluation it settled at 4:1.

Solution : Debt Restructuring. Argentina exchanged the old debt for new debt at 30 cents to the dollar, with returns linked to growth through GDP indexed bonds.

What are GDP indexed bonds ?

Suppose a country has been growing in the last few years at an average rate of 3% and is expected to do so in the coming years. Suppose also that this country can issue debt using a fixed-income bond with a coupon of 7%. This country can issue a GDP-linked bond that pays 7% when output growth at the end of the year is exactly 3% and will pay more or less according to its economic performance. That is, for example, if the country grows 1% instead of 3%, then the GDP-linked bond will pay a coupon of 5% instead of 7%. Conversely, if economic performance is unusually strong and the country grows 5% instead of 3%, then the GDP-linked bond will pay a coupon of 9%. <Source : Wikipedia>
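
The arithmetic in this example is a one-for-one adjustment of the coupon by the growth surprise. The sketch below encodes exactly that simplified rule; the function name is my own, and real GDP-linked instruments have more detailed terms than this.

```python
def gdp_linked_coupon(actual_growth, baseline_growth=3.0, baseline_coupon=7.0):
    """Coupon (in %) that moves one-for-one with growth above or below the baseline.

    Simplified rule taken from the example in the text, not the terms of any real bond.
    """
    return baseline_coupon + (actual_growth - baseline_growth)

for growth in (1.0, 3.0, 5.0):
    print(f"growth {growth}% -> coupon {gdp_linked_coupon(growth)}%")
```

The debtor pays less in a bad year and more in a good one, so the debt burden moves with its ability to pay.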

Why did GDP indexed bonds work ?

The incentives of the creditors and the debtor were aligned, as both wanted the country to come out of recession.

How is American Corporate Debt restructured ?

Bonds are swapped for equity, with bondholders becoming equity holders, under a Chapter 11 bankruptcy filing.

What are vulture funds ?

Vulture funds buy distressed debt cheaply and then sue for full repayment. Here they took advantage of a clause that all claimants are treated equally. That is, if Argentina paid the claimants who had accepted the debt restructuring, it would also have to pay what it owed to the other claimants.



A viewpoint on the current run on emerging nations’ currencies

September 9, 2013
  1. Popular view : US QE tapering
  2. Contrary view : a shrink in consumption, leading to lower exports and hence a higher current account deficit (CAD). The Eurozone is now running an overall current account surplus. It has run austerity measures, and hence demand in the Eurozone has shrunk.

This post is in continuation to the post :