python - FeatureUnion in scikit-learn and incompatible row dimensions -


I have started using scikit-learn's text feature extraction. When I use the standard CountVectorizer and TfidfTransformer in a pipeline everything works, but when I try to combine them with new features (concatenation of matrices) I get a row dimension problem.

This is the pipeline:

pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('ngram_tfidf', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer())])),
        ('addned', AddNed()),
    ])),
    ('clf', SGDClassifier()),
])

The AddNed class adds 30 new features to each document (sample):

class AddNed(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, x, **transform_params):
        do_something
        x_new_feat = np.array(list_feat)
        print(type(x))
        x_np = np.array(x)
        print(x_np.shape, x_new_feat.shape)
        return np.concatenate((x_np, x_new_feat), axis=1)

    def fit(self, x, y=None):
        return self
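As far as I understand, every branch of a FeatureUnion receives the same raw input X; the addned branch does not see the tf-idf output of its sibling branch. A minimal, self-contained sketch (with a hypothetical LenFeature transformer, not part of my real code) to check that:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

class LenFeature(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: one column, the length of each document."""
    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # x is the same raw list of documents in every branch of the union
        print(type(x), len(x))
        return np.array([[len(doc)] for doc in x])

docs = ["first document", "second doc", "third"]
union = FeatureUnion([('a', LenFeature()), ('b', LenFeature())])
out = union.fit_transform(docs)
print(out.shape)  # (3, 2): one column per branch, rows == len(docs)
```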

And the first part of the main program:

data = load_files('ho_without_tag')
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=20)
print(len(data.data), len(data.target))
grid_search.fit(x, y).transform(x)

But the result is:

486 486
Fitting 3 folds for each of 3456 candidates, totalling 10368 fits
[CV] feats__ngram_tfidf__vect__max_features=3000 ....
323
<class 'list'>
(323,) (486, 30)

and of course an IndexError exception:

return np.concatenate((x_np, x_new_feat), axis = 1)
IndexError: axis 1 out of bounds [0, 1)
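For what it's worth, the IndexError itself is easy to reproduce: np.array over a list of raw documents gives a 1-D array, and np.concatenate with axis=1 needs 2-D inputs. A minimal sketch (depending on the numpy version, the exception may be numpy's AxisError, which subclasses both IndexError and ValueError):

```python
import numpy as np

docs = np.array(["doc one", "doc two", "doc three"])  # 1-D: shape (3,)
new_feat = np.ones((3, 2))                            # 2-D: shape (3, 2)
try:
    np.concatenate((docs, new_feat), axis=1)
except (IndexError, ValueError) as e:
    # older numpy: "axis 1 out of bounds"; newer numpy raises AxisError
    print(type(e).__name__, e)
```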

When I receive the parameter X in the transform function (class AddNed), why don't I have a numpy array X of shape (486, 3000)? I only have a shape of (323,). I don't understand, because if I delete FeatureUnion and AddNed() from the pipeline, CountVectorizer and tf-idf work correctly, with the right features and the right shape. Does anyone have an idea? Thanks a lot.

OK, I will try to give more explanation. In do_something, I actually do nothing with X. In the class AddNed, if I rewrite:

def transform(self, x, **transform_params):
    print(x.shape)    # print x shape on the first line, before anything else
    print(type(x))    # for information
    do_nothing_withx  # construct a new matrix of shape (number of samples, 30 new features)
    x_new_feat = np.array(list_feat)  # get the new matrix as a numpy array
    print(x_new_feat.shape)
    return x_new_feat

In the transform case above, I do not concatenate the X matrix and the new matrix myself. I presume FeatureUnion does that... and the result is:

486 486   # here I print (data.data, data.target)
Fitting 3 folds for each of 3456 candidates, totalling 10368 fits
[CV] clf__alpha=1e-05, vect__max_df=0.1, clf__penalty=l2, feats__tfidf__use_idf=True, feats__tfidf__norm=l1, clf__loss=hinge, vect__ngram_range=(1, 1), clf__n_iter=10, vect__max_features=3000
(323, 3000)   # X shape matrix
<class 'scipy.sparse.csr.csr_matrix'>
(486, 30)     # new matrix shape
Traceback (most recent call last):
  File "pipe_line_learning_union.py", line 134, in <module>
    grid_search.fit(x, y).transform(x)
  .....
  File "/data/maclearnve/lib/python3.4/site-packages/scipy/sparse/construct.py", line 581, in bmat
    raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions
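The ValueError comes from the stacking step: the tf-idf branch returns a sparse matrix with one row per document in the training fold (323), while the addned branch returns 486 rows, and scipy's hstack/bmat refuses to stack blocks with different row counts. A minimal reproduction of just that stacking step, with random placeholder data:

```python
import numpy as np
from scipy import sparse

tfidf_block = sparse.random(323, 3000, format='csr')  # fold-sized tf-idf output
extra_block = np.ones((486, 30))                      # features built for ALL 486 docs

err = None
try:
    sparse.hstack([tfidf_block, extra_block])
except ValueError as e:
    err = e
print(err)  # mentions incompatible row dimensions
```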

To go further: if I set cross validation on GridSearchCV, it modifies the sample size:

grid_search = GridSearchCV(pipeline, parameters, cv=2, n_jobs=1, verbose=20)

I get this result:

486 486
Fitting 2 folds for each of 3456 candidates, totalling 6912 fits
[CV] ......
(242, 3000)   # this is the new sample size, due to cross validation
<class 'scipy.sparse.csr.csr_matrix'>
(486, 30)
..........
ValueError: blocks[0,:] has incompatible row dimensions

Of course, if necessary, I can give the code of do_nothing_withx. What I don't understand is why the sample size in the CountVectorizer + tf-idf pipeline is not equal to the number of files loaded with the sklearn.datasets.load_files() function.
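One way out (a sketch only, not wired to my real do_nothing_withx logic): compute the 30 extra features inside transform() from whatever X the pipeline actually passes in, so the number of rows automatically follows the training fold instead of being fixed at 486:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class AddNedFixed(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: build the 30 extra features from the rows of X
    that the pipeline actually passes in, so row counts always match."""
    def fit(self, x, y=None):
        return self

    def transform(self, x, **transform_params):
        n_samples = len(x)  # fold size during CV, full size otherwise
        # placeholder for the real per-document feature computation
        x_new_feat = np.zeros((n_samples, 30))
        return x_new_feat

docs = ["one", "two", "three"]
print(AddNedFixed().fit_transform(docs).shape)  # (3, 30)
```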

