scikit-learn - Different results: sklearn vs. statsmodels, and sklearn on different machines
I'm running into a real head-scratcher. I have a Python 2 notebook that I'm using to run a linear regression on both my laptop and my desktop. On the laptop, sklearn gives the same results as statsmodels. On the desktop, however, statsmodels gives the correct result while sklearn gives the wrong one: a number of the coefficient estimates have blown up to be roughly 8 orders of magnitude larger than they should be, e.g., 304952680 vs. -0.1271. If I save the notebook, pull it up on the laptop, and run it again, the statsmodels and sklearn linear regression results are equal. If I reconnect and re-run the notebook from scratch on the desktop, statsmodels is again correct and sklearn LinearRegression blows up again. I'm mystified. Does anyone have any ideas?
Here are 2 gists, linked through nbviewer. They are long, but compare, for example, cells 59 and 62, variable m12_cs_months_since_last_gift. In the laptop notebook, statsmodels (cell 59) agrees with sklearn (cell 62). On the desktop, they disagree (see the blow-up for that variable in desktop cell 62).
One thing that may be worth noting: the data is characterized by large segments of the predictor space corresponding to the same observed value. Maybe that suggests near collinearity, as has been suggested? I'll check the singular values. Additional suggestions, or follow-ups to that suggestion, are welcome.
The laptop is 64-bit Windows 8.1 with statsmodels v0.6.1 and sklearn 0.17. The desktop is 64-bit Windows 10 with the same statsmodels and sklearn versions.
Laptop notebook: http://nbviewer.jupyter.org/gist/andersrmr/fb7378f3659b8dd48625
Desktop notebook: http://nbviewer.jupyter.org/gist/andersrmr/76e219ad14ea9cb92d9e
I looked at the notebooks. It looks like the performance of the laptop and desktop models on the training set is virtually identical. That means these large coefficient values balance each other out on the training set. So the desktop's result isn't wrong per se; it just defies the kind of interpretation you might want to attach to it. It also has a larger risk of being overfit (I didn't see whether you scored it on a testing set, but you should). Basically, if you attempt to apply the fitted model to an example that violates the collinearity observed in the training set, you'll get ridiculous predictions.
Why is this occurring on one machine and not the other? Basically, the coefficients on a set of collinear predictors are numerically unstable, meaning that small perturbations can lead to large differences. Differences in the underlying numerical libraries, which are invisible to the user, can therefore lead to significant changes in the coefficients. If you think about it in terms of linear algebra, it makes sense why this happens: if two predictors are collinear (say, two identical columns), only the sum of their coefficients is fixed, so either coefficient can grow without bound as long as the other balances it out.
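To make the "sum is fixed, individual values are not" point concrete, here is a minimal sketch using made-up data (the arrays and the helper name predictions are illustrative assumptions, not taken from the notebooks). Two identical columns are paired with one modest and one enormous coefficient vector, and both reproduce the training target.

```python
import numpy as np

# Hypothetical data: two exactly collinear (identical) predictors.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([x, x])
y = 3.0 * x  # any coefficient pair that sums to 3 fits this exactly


def predictions(b1, b2):
    """Predict with the given coefficients on the two collinear columns."""
    return X @ np.array([b1, b2])


small = predictions(1.5, 1.5)        # modest coefficients
huge = predictions(1e8 + 3.0, -1e8)  # enormous coefficients that cancel out

# Both coefficient vectors fit the training data equally well.
print(np.allclose(small, y), np.allclose(huge, y))
```

On the training set the two fits are indistinguishable, which is why the choice between them is left to the rounding behaviour of whatever solver happens to be underneath.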
What is the solution? If there is a real, exact dependence between these variables that will always be present, you can ignore the issue. However, I wouldn't, because you never know. Otherwise, either remove the dependent columns manually (which will not hurt prediction), pre-process with an automatic variable selection or dimension reduction technique, or use a regularized regression method (such as ridge regression).
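As a rough illustration of the regularization option, the sketch below fits ordinary least squares and ridge regression to synthetic, nearly collinear data. The data, the alpha=1.0 penalty, and the variable names are assumptions chosen for the example; a sensible alpha for the real data would need to be tuned, for instance by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data: the second column is a nearly exact copy of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x + 1e-10 * rng.normal(size=200)])
y = 3.0 * x + 0.1 * rng.normal(size=200)

# Unregularized fit: the individual coefficients are poorly determined
# and may come out very large with opposite signs.
ols = LinearRegression().fit(X, y)

# Ridge adds an L2 penalty, which keeps the coefficients small and
# spreads the effect across the collinear columns.
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

The ridge coefficients should stay close to an even split of the true effect, whereas the unregularized ones can differ from run to run or machine to machine.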
Note: it's possible I'm wrong in my assumptions here. You should validate the collinearity by checking the singular values. If so, please comment.
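One way to run that check is sketched below, assuming the design matrix is available as a NumPy array. The helper name condition_report and the synthetic matrix are hypothetical. A smallest singular value that is tiny relative to the largest (a very large condition number) points to collinearity.

```python
import numpy as np


def condition_report(X):
    """Return the singular values and condition number of the centered X."""
    Xc = X - X.mean(axis=0)  # center the columns, as a fit with an intercept would
    s = np.linalg.svd(Xc, compute_uv=False)  # returned in descending order
    cond = s[0] / s[-1] if s[-1] > 0 else np.inf
    return s, cond


# Hypothetical data: the third column is the sum of the first two.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))
X = np.column_stack([A, A[:, 0] + A[:, 1]])

s, cond = condition_report(X)
print("singular values: ", s)
print("condition number:", cond)
```

Columns on very different scales also inflate the condition number, so it may be worth standardizing the predictors before reading too much into the result.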
Second note: there are least squares solvers that will automatically zero out dependent columns. If you look at scipy.linalg.lstsq, you can pass a cutoff argument (cond) in order to zero out small singular values. Also, some solvers are more stable than others, as you've seen. You can always just use a more stable solver.
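A minimal sketch of the cond cutoff, again on made-up data: the 1e-8 threshold is an assumption picked for this example, not a recommended value. Singular values smaller than cond times the largest singular value are treated as zero, so the solver returns the minimum-norm solution instead of a pair of huge, cancelling coefficients.

```python
import numpy as np
from scipy.linalg import lstsq

# Hypothetical data: two nearly identical predictors, no intercept.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x + 1e-12 * rng.normal(size=100)])
y = 3.0 * x + 0.1 * rng.normal(size=100)

# Treat singular values below 1e-8 * (largest singular value) as zero.
coef, residues, rank, sv = lstsq(X, y, cond=1e-8)

print("effective rank:", rank)
print("coefficients:  ", coef)
```

The reported rank drops below the number of columns, which is itself a useful signal that the design matrix contains a dependent direction.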