scikit-learn - Different results between sklearn and statsmodels, and between sklearn runs on different machines -


I'm finding this one a real head-scratcher. I have a Python 2 notebook I'm using to do linear regression on both a laptop and a desktop. On the laptop, sklearn gives the same results as statsmodels. However, on the desktop, statsmodels gives the correct result, but sklearn gives a wrong result. A number of the coefficient estimates have blown up to about 8 orders of magnitude larger than they should be, e.g., 304952680 vs -0.1271. Then I save the notebook, pull it up on the laptop, run it again, and the statsmodels vs sklearn linear regression results are equal. I re-connect and re-run the notebook from scratch on the desktop and, again, statsmodels is correct but sklearn LinearRegression blows up again. I'm mystified. Does anyone have any ideas?

Here are the two notebooks as gists, linked through nbviewer. They're long, but compare, for example, cells 59 and 62 for the variable m12_cs_months_since_last_gift. In the laptop notebook, statsmodels (cell 59) agrees with sklearn (cell 62). On the desktop, they disagree (see the blown-up variable in desktop cell 62). One thing that may be worth noting: the data are characterized by large segments of the predictor space corresponding to the same observed value. Maybe that suggests the near-collinearity that was suggested? I'll check the singular values. Additional suggestions or follow-ups to the suggestion are welcome. Laptop: 64-bit Windows 8.1 / statsmodels v0.6.1 / sklearn 0.17. Desktop: 64-bit Windows 10, same statsmodels/sklearn module versions.

Notebook: http://nbviewer.jupyter.org/gist/andersrmr/fb7378f3659b8dd48625
Desktop: http://nbviewer.jupyter.org/gist/andersrmr/76e219ad14ea9cb92d9e

I looked at the notebooks. It looks like the performance of both the laptop and desktop models on the training set is virtually identical. That means the large coefficient values balance each other out on the training set. So the blown-up result isn't exactly wrong, but it defies any kind of interpretation you might attach to it. It also carries a larger risk of overfitting (I didn't see whether you scored on a testing set, but you should). Basically, if you attempt to apply the fitted model to an example that violates the collinearity observed in the training set, you'll get ridiculous predictions.

Why is it occurring on one machine and not another? Basically, the coefficients on a set of collinear predictors are numerically unstable, meaning that small perturbations can lead to large differences. Differences in the underlying numerical libraries that are invisible to the user can therefore lead to significant changes in the coefficients. If you think about it in terms of linear algebra, it makes sense why this happens. If two predictors are collinear, only a weighted sum of their coefficients is fixed; either of the two coefficients can grow without bound as long as the other balances it out.
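A minimal sketch of this effect (the data here are synthetic, not from your notebooks): with two perfectly collinear columns, only the combination `coef_[0] + 2 * coef_[1]` is determined by the data, and the solver is free to return any of infinitely many coefficient pairs that satisfy it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.rand(100)
# Two perfectly collinear predictors: the second column is 2 * the first.
X = np.column_stack([x, 2 * x])
y = 3 * x + rng.normal(scale=0.01, size=100)

model = LinearRegression().fit(X, y)
# The individual coefficients are arbitrary -- any pair with
# coef_[0] + 2 * coef_[1] == 3 fits the training data equally well,
# so which pair you get depends on the underlying numerical library.
effective_slope = model.coef_[0] + 2 * model.coef_[1]
print(model.coef_, effective_slope)
```

Different LAPACK/BLAS builds can land on wildly different individual coefficients while the effective slope (and hence training-set fit) stays the same, which matches what you're seeing across machines.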

What's the solution? If the exact dependence between these variables will always be present, you could ignore the issue. However, I wouldn't, because you never know. Otherwise, either remove the dependent columns manually (which should not hurt prediction), pre-process with an automatic variable-selection or dimension-reduction technique, or use a regularized regression method (such as ridge regression).
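To illustrate the regularized option, here is a sketch using `sklearn.linear_model.Ridge` on the same kind of synthetic collinear data as above (`alpha=1.0` is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x = rng.rand(100)
X = np.column_stack([x, 2 * x])  # collinear columns
y = 3 * x + rng.normal(scale=0.01, size=100)

# The L2 penalty makes the problem well-posed: it picks out the unique
# minimum-norm-ish solution, so both coefficients stay small and finite
# even though the design matrix is singular.
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)
```

In practice you'd tune `alpha` by cross-validation (e.g. with `RidgeCV`), but even a small penalty is enough to stabilize the coefficients.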

Note: it's possible I'm wrong in my assumptions here. You can validate the collinearity by checking the singular values. If I'm wrong, please comment.
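One way to run that check, sketched on synthetic collinear data (with your notebooks you'd pass the actual design matrix): a singular value at or near zero, i.e. a huge condition number, indicates (near-)collinear columns.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(100)
X = np.column_stack([x, 2 * x])  # collinear by construction

# Singular values of the design matrix; for collinear columns the
# smallest one collapses toward zero.
s = np.linalg.svd(X, compute_uv=False)
cond_number = s[0] / s[-1]  # very large => (near-)collinearity
print(s, cond_number)
```

As a rough rule of thumb, a condition number much larger than 1/machine-epsilon of your float type means the individual coefficients are not reliably determined.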

Second note: there are least-squares solvers that will automatically zero out dependent columns. If you look at scipy.linalg.lstsq, you can pass a cutoff argument (cond) in order to zero out small singular values. Also, some solvers are more stable than others, as you've seen, so you could switch to a more stable solver.
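A sketch of that cutoff in action, again on synthetic collinear data: singular values below `cond` times the largest singular value are treated as zero, so the dependent column can't inflate the coefficients, and the reported rank drops accordingly.

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.RandomState(0)
x = rng.rand(100)
X = np.column_stack([x, 2 * x])  # collinear columns
y = 3 * x + rng.normal(scale=0.01, size=100)

# cond is a relative cutoff: singular values smaller than
# cond * max(singular values) are zeroed out before solving.
coef, residues, rank, sv = lstsq(X, y, cond=1e-10)
print(coef, rank)  # rank is 1: one dependent direction was dropped
```

The returned solution is the minimum-norm one, so the coefficients stay small and reproducible across machines.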

