python - FeatureUnion in scikit-learn and incompatible row dimensions -
I have started using scikit-learn's text feature extraction. When I use the standard CountVectorizer and TfidfTransformer in a pipeline and then try to combine them with new features (a concatenation of matrices), I get a row-dimension problem.
This is my pipeline:
```python
pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('ngram_tfidf', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer())])),
        ('addned', AddNed()),
    ])),
    ('clf', SGDClassifier()),
])
```
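As background for what follows: a FeatureUnion fits each of its transformers on the same X and column-stacks their outputs, so every branch must return exactly one row per input sample. A minimal sketch of that behavior (using FunctionTransformer as a stand-in for the real branches, not my actual code):

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Two toy branches; each returns the same number of rows as its input.
fu = FeatureUnion([
    ('doubled', FunctionTransformer(lambda X: X * 2)),
    ('shifted', FunctionTransformer(lambda X: X + 1)),
])

X = np.arange(6).reshape(3, 2)
out = fu.fit_transform(X)
print(out.shape)  # columns are stacked: (3, 2) + (3, 2) -> (3, 4)
```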
The class AddNed adds 30 new features to each document (sample).
```python
class AddNed(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, **transform_params):
        # do_something
        X_new_feat = np.array(list_feat)
        print(type(X))
        X_np = np.array(X)
        print(X_np.shape, X_new_feat.shape)
        return np.concatenate((X_np, X_new_feat), axis=1)

    def fit(self, X, y=None):
        return self
```
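For contrast, a transformer that sizes its extra block from X itself can never disagree with whatever fold it is handed. This is only a sketch with placeholder zero features (ShapeSafeAddNed and n_new are hypothetical names, not my real list_feat logic):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ShapeSafeAddNed(BaseEstimator, TransformerMixin):
    """Hypothetical variant: the extra block gets one row per sample in X."""
    def __init__(self, n_new=30):
        self.n_new = n_new

    def fit(self, X, y=None):
        return self

    def transform(self, X, **transform_params):
        # Works for sparse matrices, arrays, and plain lists of documents.
        n_samples = X.shape[0] if hasattr(X, 'shape') else len(X)
        # Placeholder features: zeros, sized by the incoming fold.
        return np.zeros((n_samples, self.n_new))

t = ShapeSafeAddNed()
print(t.transform(np.ones((5, 3))).shape)  # (5, 30)
```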
And here is the first part of the main program:
```python
data = load_files('ho_without_tag')
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=20)
print(len(data.data), len(data.target))
grid_search.fit(X, y).transform(X)
```
But I get this result:
```
486 486
Fitting 3 folds for each of 3456 candidates, totalling 10368 fits
[CV] feats__ngram_tfidf__vect__max_features=3000 ....
323
<class 'list'>
(323,) (486, 30)
```
and of course an IndexError exception:
```
return np.concatenate((X_np, X_new_feat), axis = 1)
IndexError: axis 1 out of bounds [0, 1
```
When the parameter X reaches the transform function (class AddNed), why doesn't X have the numpy array shape (486, 3000)? It has shape (323,) instead. I don't understand, because if I remove the FeatureUnion and AddNed() from the pipeline, CountVectorizer and tf-idf work correctly with the right features and the right shape. Does anyone have an idea? Thanks a lot.
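For what it's worth, the 323 looks exactly like a cross-validation training fold: GridSearchCV with 3 folds fits each candidate on roughly two thirds of the 486 documents (324 with a plain KFold; stratified splitting over uneven classes can yield 323). A quick check of the fold sizes, using the modern sklearn.model_selection API:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(486)
# First of the 3 folds: 2/3 of the samples go to training, 1/3 to validation.
train_idx, test_idx = next(KFold(n_splits=3).split(X))
print(len(train_idx), len(test_idx))  # 324 162
```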
OK, I will try to give more explanation. When I do_something, I actually do_nothing with X. In the class AddNed, if I rewrite:
```python
def transform(self, X, **transform_params):
    print(X.shape)   # print X shape on the first line, before anything else
    print(type(X))   # for information
    # do_nothing_with X
    # construct a new matrix of shape (number of samples, 30 new features)
    X_new_feat = np.array(list_feat)  # get the new matrix as a numpy array
    print(X_new_feat.shape)
    return X_new_feat
```
In the transform above, I do not concatenate the X matrix and the new matrix. I presume the FeatureUnion does that... And the result is:
```
486 486   # here: print(len(data.data), len(data.target))
Fitting 3 folds for each of 3456 candidates, totalling 10368 fits
[CV] clf__alpha=1e-05, vect__max_df=0.1, clf__penalty=l2, feats__tfidf__use_idf=True, feats__tfidf__norm=l1, clf__loss=hinge, vect__ngram_range=(1, 1), clf__n_iter=10, vect__max_features=3000
(323, 3000)   # X shape matrix
<class 'scipy.sparse.csr.csr_matrix'>
(486, 30)     # new matrix shape
Traceback (most recent call last):
  File "pipe_line_learning_union.py", line 134, in <module>
    grid_search.fit(X, y).transform(X)
  .....
  File "/data/maclearnve/lib/python3.4/site-packages/scipy/sparse/construct.py", line 581, in bmat
    raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions
```
To go further: if I set the number of cross-validation folds on GridSearchCV, it modifies the sample size:
```python
grid_search = GridSearchCV(pipeline, parameters, cv=2, n_jobs=1, verbose=20)
```
I get this result:
```
486 486
Fitting 2 folds for each of 3456 candidates, totalling 6912 fits
[CV] ......
(242, 3000)   # this is the new sample size, due to cross-validation
<class 'scipy.sparse.csr.csr_matrix'>
(486, 30)
..........
ValueError: blocks[0,:] has incompatible row dimensions
```
Of course, if necessary, I can give the code of do_nothing_withx. What I don't understand is why the sample size seen by the CountVectorizer + tf-idf pipeline is not equal to the number of files loaded by the sklearn.datasets.load_files() function.