This post is long overdue. I participated in the Benchmark Bond Trade Price Challenge and used a regression-based approach to predict bond prices. Here is an outline of the approach.
- Build the model on the training set and predict on the test set. The dependent variable is the bond price, and the independent variables are the last 10 trade prices.
- Prepare frequency charts split by whether callability is 0 or 1.
- Divide the data into 12 parts (2 x 2 x 3) by callability (0 or 1), price > 100 or price < 100 (a bond's price converges to 100 as it approaches maturity, so the curve looks different on each side), and type of trade in the bond (dealer to dealer, dealer to client, client to client; this is a quote-driven market).
- Some of the values were missing, so apply a missing value treatment (based on exponential weights).
- Run a regression on each of these sub-datasets and analyze the results.
- Some of the t-tests failed (bond id and time to delay). The p-value cutoff for rejecting a coefficient was 3%.
- R2, the coefficient of determination, is a statistical measure of how well the regression line approximates the real data points. An R2 of 1.0 indicates that the regression line fits the data perfectly.
- R2 measures goodness of fit, but it will not detect overfitting, because it never decreases when a new predictor is added (and usually increases, unless it has already reached 1).
- Assumptions of regression (LINE):
  - L: Linear relationship
  - I: Independent observations
  - N: Normally distributed around the line
  - E: Equal variance across X’s
- Multicollinearity: two or more independent variables are correlated with each other. It can be detected with the Variance Inflation Factor (VIF).
- A p-value is the probability of seeing a result at least as extreme as the one observed if the null hypothesis is true (here, that the coefficient is zero). So if p < .03, such a result is unlikely to arise by chance alone, and we can keep that independent variable.
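The segmentation and missing value steps can be sketched roughly as below. This is an illustration, not the exact code I used: the field names (`is_callable`, `last_price`, `trade_type`), the trade type labels, the choice of the last trade price for the above/below 100 split, and the decay parameter `alpha` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical labels for the three trade types.
TRADE_TYPES = ("dealer_to_dealer", "dealer_to_client", "client_to_client")


def segment_key(row):
    """Return (callability, above 100?, trade type) for one trade record.

    Assumes the above/below 100 split is made on the last trade price.
    """
    return (row["is_callable"], row["last_price"] > 100.0, row["trade_type"])


def split_into_segments(rows):
    """Group trade records into the 2 x 2 x 3 = 12 sub-datasets."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_key(row)].append(row)
    return segments


def fill_missing(prices, alpha=0.5):
    """Replace None entries with an exponentially weighted mean of known prices.

    prices[0] is the most recent trade; the price at lag k gets weight
    (1 - alpha) ** k, so recent trades count more. alpha is an assumed value.
    """
    known = [(k, p) for k, p in enumerate(prices) if p is not None]
    if not known:
        return list(prices)  # nothing to average over, leave as-is
    total_weight = sum((1 - alpha) ** k for k, _ in known)
    weighted_mean = sum((1 - alpha) ** k * p for k, p in known) / total_weight
    return [weighted_mean if p is None else p for p in prices]
```

For example, `fill_missing([100.0, None, 98.0])` fills the gap with a value closer to 100 than to 98, because the most recent trade carries the larger weight.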
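To make the points about R2 and the Variance Inflation Factor concrete, here is a small sketch using NumPy on synthetic data. The data is made up for illustration (three random columns standing in for lagged trade prices), and the helper names `r_squared` and `vif` are my own, not from any library. It uses the standard definition VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on the other predictors.

```python
import numpy as np


def r_squared(X, y):
    """R2 of an ordinary least squares fit of y on X (intercept added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def vif(X):
    """Variance Inflation Factor for each column of X."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        factors.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return factors


rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for three lagged trade prices around 100.
lags = 100 + rng.normal(size=(n, 3))
y = lags @ np.array([0.6, 0.3, 0.1]) + rng.normal(scale=0.2, size=n)

# Adding a pure-noise predictor cannot lower R2.
noise = rng.normal(size=(n, 1))
r2_small = r_squared(lags, y)
r2_big = r_squared(np.column_stack([lags, noise]), y)

# A near-copy of the first column creates strong multicollinearity.
near_copy = lags[:, [0]] + 0.01 * noise
vif_independent = vif(lags)
vif_collinear = vif(np.column_stack([lags, near_copy]))
```

Here `r2_big` is at least as large as `r2_small` even though the extra column is noise, which is why R2 alone cannot flag overfitting. The VIFs for the independent columns stay close to 1, while the first and last columns of the collinear design get very large values.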
I have been away from data science for a while because of a job change and interviewing. New Year's resolution: get back to Kaggle.