python - PySpark groupByKey finding tuple length
My question is based upon the first answer to this question. I want to count the number of elements per key. How can I do that?
example = sc.parallelize([(alpha, u'd'), (alpha, u'd'), (beta, u'e'), (gamma, u'f')])

abc = example.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
# gives [(alpha, [u'd', u'd']), (beta, [u'e']), (gamma, [u'f'])]

I want the output below:
alpha:2, beta:1, gamma:1

I came to know the answer below. Why is it so complex? Isn't there a simpler answer? Doesn't s contain the key plus the values? Why can't I just use len(s) - 1, subtracting 1 to remove the key s[0]?
map(lambda s: (s[0], len(list(set(s[1]))))) 
Well, it is not that complex. All you need here is yet another word count:
from operator import add

example.map(lambda x: (x[0], 1)).reduceByKey(add)
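For reference, a minimal self-contained sketch of that word count, assuming a local SparkContext and plain string keys standing in for the question's undefined alpha, beta and gamma:

from operator import add
from pyspark import SparkContext

sc = SparkContext("local", "count-per-key")  # assumed local context, just for illustration

# string keys stand in for the question's undefined alpha, beta, gamma
example = sc.parallelize([(u'alpha', u'd'), (u'alpha', u'd'),
                          (u'beta', u'e'), (u'gamma', u'f')])

# classic word count: emit (key, 1) for every pair, then sum the ones per key
counts = example.map(lambda x: (x[0], 1)).reduceByKey(add)

print(counts.collect())
# [(u'alpha', 2), (u'beta', 1), (u'gamma', 1)]  (order may vary)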
If you plan to collect the result anyway, you can even use countByKey:

example.countByKey()
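countByKey does the collecting for you: it returns a dict-like result on the driver instead of an RDD. Continuing with the assumed example RDD from the sketch above:

print(dict(example.countByKey()))
# {u'alpha': 2, u'beta': 1, u'gamma': 1}  (exact repr depends on the Python version)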
You don't want to use groupByKey here, but assuming there is a hidden reason to apply it after all:

example.groupByKey().mapValues(len)

Why doesn't len(s) - 1 work? Because example is a pairwise RDD, or in other words it contains key-value pairs. The same thing applies to the result of groupByKey. It means that len(s) is always equal to 2.
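To make that concrete, a small sketch (again using the assumed string-keyed example RDD from above): every element produced by groupByKey is a 2-tuple of (key, ResultIterable), so len of that element is always 2 no matter how many values the key has, while mapValues(len) measures the iterable of values itself:

grouped = example.groupByKey()

# each element is a (key, <iterable of values>) pair, so its length is always 2
print(grouped.map(lambda s: len(s)).collect())
# [2, 2, 2]

# mapValues applies len to the value iterable, which gives the per-key counts
print(grouped.mapValues(len).collect())
# [(u'alpha', 2), (u'beta', 1), (u'gamma', 1)]  (order may vary)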