python - PySpark groupByKey finding tuple length


My question is based on the first answer to this question. I want to count the elements per key. How can I do that?

example = sc.parallelize([(u'alpha', u'd'), (u'alpha', u'd'), (u'beta', u'e'), (u'gamma', u'f')])
abc = example.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
# gives [(u'alpha', [u'd', u'd']), (u'beta', [u'e']), (u'gamma', [u'f'])]

I want the output below:

alpha: 2, beta: 1, gamma: 1

I came to know the answer below, but why is it so complex? Is there a simpler answer? Doesn't s contain the key plus the values? Why can't I use len(s) - 1, subtracting 1 to remove the key s[0]?

map(lambda s: (s[0], len(list(set(s[1]))))) 
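(For reference, that snippet is meant to run on the output of groupByKey, so each s is a (key, values) pair. A runnable sketch with the example data above, assuming that is how the linked answer used it:

example.groupByKey().map(lambda s: (s[0], len(list(set(s[1]))))).collect()
# gives [(u'alpha', 1), (u'beta', 1), (u'gamma', 1)]
# note: set() collapses alpha's duplicate u'd' values to one; drop it if duplicates should count)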

Well, it is not complex at all. All you need here is yet another word count:

from operator import add

example.map(lambda x: (x[0], 1)).reduceByKey(add)
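To see what that pipeline produces with the example RDD above, collect it back to the driver (the ordering of the collected pairs is not guaranteed):

example.map(lambda x: (x[0], 1)).reduceByKey(add).collect()
# e.g. [(u'alpha', 2), (u'beta', 1), (u'gamma', 1)]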

If you plan to collect the result anyway, you can use countByKey:

example.countByKey()
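Keep in mind that countByKey() brings the counts back to the driver as a plain dictionary (a defaultdict(int) in PySpark), so it is only appropriate when the number of distinct keys fits comfortably in driver memory:

example.countByKey()
# e.g. defaultdict(int, {u'alpha': 2, u'beta': 1, u'gamma': 1})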

You don't really want to use groupByKey here, but assuming there is a hidden reason to apply it after all:

example.groupByKey().mapValues(len)
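Here mapValues keeps each key untouched and applies len to the grouped values (PySpark materializes them as a ResultIterable, which supports len):

example.groupByKey().mapValues(len).collect()
# e.g. [(u'alpha', 2), (u'beta', 1), (u'gamma', 1)]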

Why doesn't len(s) - 1 work? Because example is a pairwise RDD, or in other words, it contains key-value pairs. The same thing applies to the result of groupByKey. It means that len(s) is always equal to 2: the key and the grouped values.
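A quick sketch to convince yourself (which key first() returns depends on partitioning):

pair = example.groupByKey().first()
# pair is a 2-tuple of (key, grouped values), e.g. (u'alpha', <pyspark.resultiterable.ResultIterable ...>)
len(pair)     # always 2, no matter how many values the key has
len(pair[1])  # the per-key count you are actually after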


