python - PySpark groupByKey finding tuple length
My question is based on the first answer to this question. I want to count the number of elements per key. How do I do that?
example = sc.parallelize([(u'alpha', u'd'), (u'alpha', u'd'), (u'beta', u'e'), (u'gamma', u'f')])
abc = example.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
# gives [(u'alpha', [u'd', u'd']), (u'beta', [u'e']), (u'gamma', [u'f'])]
I want the output below:
alpha: 2, beta: 1, gamma: 1
I came to know the answer below, but why is it so complex? Is there a simpler answer? Does s contain the key plus the values? If so, why can't I just use len(s) - 1, subtracting 1 to remove the key s[0]?

map(lambda s: (s[0], len(list(set(s[1])))))
Well, it is not complex at all. All you need here is yet another word count:

from operator import add

example.map(lambda x: (x[0], 1)).reduceByKey(add)
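For completeness, here is a minimal runnable sketch of that word-count approach (an assumption on my part: it creates its own local SparkContext named sc, and reuses the sample data from the question):

from operator import add
from pyspark import SparkContext

sc = SparkContext("local[*]", "count-per-key")  # assumption: no existing context

example = sc.parallelize([(u'alpha', u'd'), (u'alpha', u'd'),
                          (u'beta', u'e'), (u'gamma', u'f')])

# Replace every value with 1, then sum the ones per key.
counts = example.map(lambda x: (x[0], 1)).reduceByKey(add)

print(counts.collect())
# [(u'alpha', 2), (u'beta', 1), (u'gamma', 1)]  -- order may vary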
If you plan to collect the results, you can use countByKey instead:

example.countByKey()
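As a sketch (assuming the same example RDD as above), countByKey returns its result directly to the driver as a dictionary-like object, so no separate collect is needed:

result = example.countByKey()   # returns a collections.defaultdict(int)
print(dict(result))
# {u'alpha': 2, u'beta': 1, u'gamma': 1}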
You don't want to use groupByKey here, but assuming there is a hidden reason to apply it after all:

example.groupByKey().mapValues(len)
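If the grouping really is needed for something else, one possible sketch is to cache the grouped RDD and derive the counts from it (again assuming the example RDD from the question):

grouped = example.groupByKey().cache()   # keep the grouping around for other work
counts = grouped.mapValues(len)          # len() of each ResultIterable of values
print(counts.collect())
# [(u'alpha', 2), (u'beta', 1), (u'gamma', 1)]  -- order may vary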
Why doesn't len(s) - 1 work? Because example is a pairwise RDD, or in other words it contains key-value pairs. The same thing applies to the result of groupByKey. This means len(s) is always equal to 2: the key and the iterable of grouped values.
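You can see this directly (assuming the example RDD from the question): each element of the grouped RDD is a 2-tuple of key and values, so len counts those two fields, not the values.

pair = example.groupByKey().first()   # e.g. (u'alpha', <ResultIterable of [u'd', u'd']>)
print(len(pair))                      # 2 -- key plus the iterable, regardless of the data
print(len(list(pair[1])))             # 2 -- the actual number of values for that key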