python - PySpark groupByKey finding tuple length
My question is based on the first answer to this question. I want to count the number of elements per key. How do I do that?
example = sc.parallelize([(u'alpha', u'd'), (u'alpha', u'd'), (u'beta', u'e'), (u'gamma', u'f')])
abc = example.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
# gives [(u'alpha', [u'd', u'd']), (u'beta', [u'e']), (u'gamma', [u'f'])]
I want the output below:
alpha: 2, beta: 1, gamma: 1
I came to know the answer below, but why is it so complex? Is there a simpler answer? Does s contain the key plus the values? If so, why can't I just use len(s) - 1, subtracting 1 to remove the key s[0]?

map(lambda s: (s[0], len(list(set(s[1])))))
Well, it is not complex at all. All you need here is yet another word count:

from operator import add

example.map(lambda x: (x[0], 1)).reduceByKey(add)
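For completeness, here is a minimal runnable sketch of that word-count approach (an assumption on my part: it creates its own local SparkContext named sc, and reuses the sample data from the question):

from operator import add
from pyspark import SparkContext

sc = SparkContext("local[*]", "count-per-key")  # assumption: no existing context

example = sc.parallelize([(u'alpha', u'd'), (u'alpha', u'd'),
                          (u'beta', u'e'), (u'gamma', u'f')])

# Replace every value with 1, then sum the ones per key.
counts = example.map(lambda x: (x[0], 1)).reduceByKey(add)

print(counts.collect())
# [(u'alpha', 2), (u'beta', 1), (u'gamma', 1)]  -- order may vary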
If you plan to collect the results, you can use countByKey instead:

example.countByKey()
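As a sketch (assuming the same example RDD as above), countByKey returns its result directly to the driver as a dictionary-like object, so no separate collect is needed:

result = example.countByKey()   # returns a collections.defaultdict(int)
print(dict(result))
# {u'alpha': 2, u'beta': 1, u'gamma': 1}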
You don't want to use groupByKey here, but assuming there is a hidden reason to apply it after all:

example.groupByKey().mapValues(len)
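If the grouping really is needed for something else, one possible sketch is to cache the grouped RDD and derive the counts from it (again assuming the example RDD from the question):

grouped = example.groupByKey().cache()   # keep the grouping around for other work
counts = grouped.mapValues(len)          # len() of each ResultIterable of values
print(counts.collect())
# [(u'alpha', 2), (u'beta', 1), (u'gamma', 1)]  -- order may vary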
Why doesn't len(s) - 1 work? Because example is a pairwise RDD, or in other words it contains key-value pairs. The same thing applies to the result of groupByKey. This means len(s) is always equal to 2: the key and the iterable of grouped values.
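You can see this directly (assuming the example RDD from the question): each element of the grouped RDD is a 2-tuple of key and values, so len counts those two fields, not the values.

pair = example.groupByKey().first()   # e.g. (u'alpha', <ResultIterable of [u'd', u'd']>)
print(len(pair))                      # 2 -- key plus the iterable, regardless of the data
print(len(list(pair[1])))             # 2 -- the actual number of values for that key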