September 25, 2013
Today I started reading Moneyball. Back to Michael Lewis after almost two years. And guess what, the current buzzword in the valley is "Data Science".
I was having a discussion with my manager about hiring a candidate for an open position. During the review meeting, we reached the conclusion that the candidate was only so-so at machine learning and only so-so at programming. So somebody in the room cracked a joke: "sounds like a data scientist".
But jokes apart, statistics, machine learning, and programming put together make a formidable skill set in the industry today. So I have decided to start a series of blog posts as a statistics refresher for myself.
And guess what, 2013 is also the International Year of Statistics. Sounds coincidental.
December 26, 2012
This post was long overdue. I participated in the Benchmark Bond Trade Price Challenge and used a regression-based approach to predict bond prices. Here is an outline of the approach.
- Fit on the training set and predict on the test set. The dependent variable we are trying to predict is the bond price, and the independent variables are the last 10 trade prices.
- Prepare frequency charts split by whether callability is 0 or 1.
- Divide the data into 12 parts based on callability, price above or below 100 (a bond price will always converge to 100 at maturity, so the two curves look different), and the type of trade in the bond (dealer-to-dealer, dealer-to-client, client-to-client; it is a quote-driven market).
- Some of the values were missing, so they needed missing-value treatment (imputation based on exponential weights).
- Run regressions on these sub-datasets and analyze the results.
- Some of the t-tests failed (bond IDs and time to delay). The p-value cutoff for rejecting coefficients was 3%.
- R2, the coefficient of determination, is a statistic that measures the goodness of fit of a model: how well the regression line approximates the real data points. An R2 of 1.0 indicates that the regression line fits the data perfectly.
- R2 measures goodness of fit, but it will not detect overfitting, because it can only increase when a new predictor is added (unless it has already reached 1).
- Assumptions of regression (LINE):
- L: Linear relationship
- I: Independent observations
- N: Normally distributed residuals around the line
- E: Equal variance across X's
- Multicollinearity: when two independent variables are correlated. It can be detected with the Variance Inflation Factor (VIF).
- A p-value is the probability of observing the given result, or a more extreme one, purely by chance. So if p < .03, that probability is quite small, the coefficient is likely genuine, and we can keep that independent variable.
I have been away from data science for a while now because of a job change and interviewing. New Year's resolution: get back to Kaggle.