Methods of Data Collection and Biases

September 30, 2013

Methods :

  1. Simple Random Sampling
  2. Stratified Sampling – divide the population into non-overlapping subgroups called strata and take a simple random sample (SRS) within each stratum. When the strata are internally homogeneous, the variance within each stratum is smaller than the overall population variance, which makes the estimates more precise.
  3. Cluster Sampling
  4. Systematic Sampling – select every kth item from an ordered list. Hidden periodic patterns in the ordering can bias the sample.
  5. Convenience or Volunteer Sampling – select whichever points are easiest to obtain, for example the first n points or the people who volunteer.
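
A minimal R sketch of the first four methods, run on a made-up population of 1,000 units. The data frame `population`, its columns, and the sample sizes are assumptions chosen only for illustration.

```r
set.seed(42)  # make the random draws reproducible

# Hypothetical population: 1,000 units in 4 regions, 20 clusters of 50 units
population <- data.frame(
  id      = 1:1000,
  region  = rep(c("north", "south", "east", "west"), each = 250),
  cluster = rep(1:20, each = 50),
  value   = rnorm(1000, mean = 50, sd = 10)
)

# 1. Simple random sampling: every unit has the same chance of selection
srs <- population[sample(nrow(population), 100), ]

# 2. Stratified sampling: an SRS of 25 units inside each region (stratum)
strata_rows <- unlist(lapply(
  split(seq_len(nrow(population)), population$region),
  function(rows) sample(rows, 25)
))
stratified <- population[strata_rows, ]

# 3. Cluster sampling: pick 2 whole clusters at random and keep all their units
chosen_clusters <- sample(unique(population$cluster), 2)
clustered <- population[population$cluster %in% chosen_clusters, ]

# 4. Systematic sampling: random start, then every k-th unit
k <- 10
start <- sample(k, 1)
systematic <- population[seq(start, nrow(population), by = k), ]

table(stratified$region)  # 25 units from each of the four regions
```

If the `id` column happened to follow a cycle of length 10 (say, every tenth unit were a weekend record), the systematic sample would pick up only one phase of that cycle, which is the hidden-pattern risk.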

 

Bias : 

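  1. Selection Bias – the sample is not representative of the population. For example : predicting poll results from Twitter data, since Twitter users are not a random sample of voters.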
  2. Measurement or Response Bias – the way a question is worded or asked systematically pushes the recorded answers away from the true values.
  3. Non-response Bias – the individuals who respond differ systematically from those who do not. For example : a mandatory survey in Canada that was sent to 1/5th of the people was changed to an optional one sent to 1/3rd of the people. Since the response was voluntary, new immigrants were much less likely to respond to this survey.
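
Non-response bias can be illustrated with a small simulation. This is only a sketch: the group sizes, the survey variable, and the response rates below are made up and are not taken from the Canadian survey.

```r
# Hypothetical population: 800 established residents and 200 new immigrants,
# with a made-up survey variable (household size) that differs by group
population <- data.frame(
  group          = rep(c("established", "new_immigrant"), times = c(800, 200)),
  household_size = rep(c(2, 4), times = c(800, 200))
)

# Assumed response rates for a voluntary survey: 70% vs 30%
set.seed(1)
response_rate <- ifelse(population$group == "established", 0.7, 0.3)
responded <- runif(nrow(population)) < response_rate

true_mean     <- mean(population$household_size)             # whole population
observed_mean <- mean(population$household_size[responded])  # respondents only

c(true = true_mean, observed = observed_mean)
```

Because the group with larger households responds less often, the mean among respondents falls below the population mean, no matter how many surveys are sent out.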

Descriptive Statistics – starting with the data

September 27, 2013

These are the kinds of analysis that you can do when you start with any data set. They are a natural starting point for a data science project and give early insights about the data. This is essential both for statisticians and for consumers of statistical reports.

For quantitative variables :

  1. minimum, maximum
  2. median, quartiles, inter-quartile range
  3. box plots
  4. mean
  5. spread of the data – standard deviation. When the data is plotted as a histogram there may be gaps, which point to outliers. Outliers appear when special rules underlie the way the data is generated. For example : some football clubs can pay foreign players salaries above the salary cap, which produces outlier salaries for those players. Another example : when the deals on an ecommerce site are ranked, the top deal gets the most clicks by virtue of its position, which makes it an outlier in CTR. Cleaning the data is an important first step in any statistical analysis, and it is important to understand the reasons behind the outliers. In some cases it is good to remove them; in others, removing them loses valuable signal. It is not unusual to report findings both with and without outliers.
  6. shape of the data – histograms
  7. skewed vs non-skewed, symmetric vs non-symmetric
  8. left skewed or negatively skewed – has a long left tail; typically mean < median < mode, and the distance from the median to the 3rd quartile is smaller than the distance from the 1st quartile to the median
  9. right skewed or positively skewed – where it has a long right tail
  10. extreme values or outliers – sometimes the data has a much more regular shape when the outliers are removed
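
A short R sketch of these summaries on a made-up salary vector. The numbers are illustrative only; the last value plays the role of an above-the-cap salary.

```r
# Made-up salaries (in thousands); the last value is the deliberate outlier
salary <- c(40, 42, 45, 47, 50, 52, 55, 58, 60, 250)

range(salary)      # minimum and maximum
median(salary)     # middle value
quantile(salary)   # minimum, quartiles, maximum
IQR(salary)        # inter-quartile range: Q3 - Q1
mean(salary)       # pulled upward by the outlier
sd(salary)         # spread

# Values beyond 1.5 * IQR from the hinges, as flagged by a box plot
outliers <- boxplot.stats(salary)$out

# Report the mean both with and without the outliers
mean_with    <- mean(salary)
mean_without <- mean(salary[!salary %in% outliers])

# A mean above the median hints at a long right tail
mean(salary) > median(salary)

hist(salary, main = "Salary distribution")   # shape of the data
boxplot(salary, main = "Salary box plot")    # quartiles and outliers
```

Dropping the single outlier moves the mean from about 69.9 to about 49.9 while the median barely changes, which is why reporting both versions is useful.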

For categorical variables :

  1. bar charts
  2. pie charts
  3. Examining the relationship between a quantitative variable and a categorical variable involves comparing the values of the quantitative variable among the groups defined by the categorical variable.
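
A minimal sketch of these three items in R, assuming a made-up data frame `players` with one categorical column (`team`) and one quantitative column (`salary`).

```r
# Made-up data: one categorical variable (team) and one quantitative (salary)
players <- data.frame(
  team   = c("A", "A", "B", "B", "B", "C"),
  salary = c(40, 60, 50, 70, 90, 30)
)

counts <- table(players$team)   # frequency of each category
barplot(counts, main = "Players per team")
pie(counts, main = "Share of players per team")

# Compare the quantitative variable across the groups of the categorical one
mean_by_team <- tapply(players$salary, players$team, mean)
boxplot(salary ~ team, data = players, main = "Salary by team")
```

The side-by-side box plot is usually the quickest way to see whether the groups differ in centre, spread, or both.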

Missing Values

We must understand why the data for some of the variables is missing, because the missingness itself might bias the results of our work.
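
One simple check, sketched here on a made-up `survey` data frame, is to compare the rows that have a missing value against the rows that do not.

```r
# Made-up survey: income is missing for some respondents
survey <- data.frame(
  age    = c(25, 31, 47, 52, 38, 29),
  income = c(30, NA, 80, 90, NA, 35)
)

missing_income <- is.na(survey$income)
mean(missing_income)   # share of rows with missing income

# Do rows with missing income look different on another variable?
tapply(survey$age, missing_income, mean)

# Mean of the observed values only; biased if missingness depends on income
mean(survey$income, na.rm = TRUE)
```

If the two groups differ clearly on the other variables, dropping the incomplete rows is unlikely to be harmless.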


Joke on Data Scientists

September 25, 2013

Today I started reading Moneyball. Back to Michael Lewis after almost two years. And guess what, the current buzzword in the valley is “Data Science”.

I was having a discussion with my manager about hiring a candidate for an open position. During the review meeting, we reached the conclusion that the candidate was not so ok on machine learning and not so ok on programming. So somebody in the room cracked a joke: “sounds like a data scientist”.

But jokes apart, statistics, machine learning and programming put together is a formidable skillset in the industry today. So, I have decided to start a series of blog posts as a statistics refresher for myself.

And guess what, 2013 is also the international year of statistics. Sounds coincidental.

Serendipity !


Introduction to R

January 7, 2013

1) How to read a csv file in R ?

data<-read.csv(filename,header=TRUE)

2) How to display the first n lines of the file ?

head(data,n) : The default value of n is 6.

3) How to display the last n lines of the file ?

tail(data,n) 

4) Count the missing values in each column of the data set ?

colSums(is.na(data))

Other functions that can be used for this purpose are sapply and apply.
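
A short sketch of the three variants side by side. It uses R's built-in `airquality` data set as a stand-in because it has the same column names as the output in this post; whether it is the same file that was loaded here is an assumption.

```r
data <- airquality  # stand-in for the CSV file; an assumption, not the original file

na_counts <- colSums(is.na(data))            # vectorised count per column
sapply(data, function(col) sum(is.na(col)))  # same counts, column by column
apply(is.na(data), 2, sum)                   # same counts via apply over columns

na_counts
```

`is.na(data)` returns a logical matrix, so summing a column counts its `TRUE` entries, that is, its missing values.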

5) Calculate the mean of a column without the missing values ?

colMeans(data,na.rm=TRUE)
     Ozone    Solar.R       Wind       Temp      Month        Day 
 42.129310 185.931507   9.957516  77.882353   6.993464  15.803922 
 colMeans(data)
    Ozone   Solar.R      Wind      Temp     Month       Day 
       NA        NA  9.957516 77.882353  6.993464 15.803922 
 colMeans(data["Ozone"],na.rm=TRUE)
   Ozone 
42.12931 

6) Extract the subset of rows of the data frame where Ozone values are above 31 and Temp values are above 90. What is the mean of Solar.R in this subset?

colMeans(subset(data, Ozone > 31 & Temp > 90))
 Ozone Solar.R    Wind    Temp   Month     Day 
 89.5   212.8     5.6    93.4     8.2    14.5
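
Since the question asks only about Solar.R, the column can also be pulled out of the subset directly. A minimal sketch, again assuming the built-in `airquality` data set as a stand-in for the CSV file:

```r
data <- airquality  # assumed stand-in for the CSV file

# Rows where either condition evaluates to NA are dropped by subset()
hot_ozone <- subset(data, Ozone > 31 & Temp > 90)

mean(hot_ozone$Solar.R)   # mean of Solar.R within the subset
```

Add `na.rm = TRUE` to `mean()` if the subset may contain missing Solar.R values.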

Additional info on Subset

7) Find the mean temperature in month n ? The output below is for n = 6, and the answer is the value in the Temp column.

colMeans(subset(data,Month==n))
    Ozone   Solar.R      Wind      Temp     Month       Day 
    NA 190.16667  10.26667  79.10000   6.00000  15.50000 
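
To get just the temperature, or the mean for every month at once, something like the following works. This is a sketch that assumes the built-in `airquality` data set as a stand-in for the CSV file.

```r
data <- airquality  # assumed stand-in for the CSV file
n <- 6              # month of interest

june_temp <- mean(subset(data, Month == n)$Temp)   # mean temperature in month n
june_temp

# Mean temperature for every month in one call
tapply(data$Temp, data$Month, mean)
```

`tapply` splits `Temp` by `Month` and applies `mean` to each piece, which avoids repeating the `subset` call once per month.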

Additional Resources :
1) Filling in nas with column medians in R

2) Apply function and its variants