scala - How to retrieve the record with the min value in Spark?
Let's say we have an RDD of (String, Date, Int):
[("sam", 02-25-2016, 2), ("sam", 02-14-2016, 4), ("pam", 03-16-2016, 1), ("pam", 02-16-2016, 5)]
and we want to convert it to a list:
[("sam", 02-14-2016, 4), ("pam", 02-16-2016, 5)]
where each record keeps the value for the minimum date of each key. What is the best way to do this?
I assume, since you tagged the question as being related to Spark, that you mean an RDD as opposed to a List.
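For concreteness, here is a minimal sketch of how that sample data might be built up as an RDD, assuming a SparkContext named sc is in scope (as in spark-shell) and that the dates are parsed into java.time.LocalDate:

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical setup: turn the sample rows into an RDD[(String, LocalDate, Int)]
val fmt = DateTimeFormatter.ofPattern("MM-dd-yyyy")
val rdd = sc.parallelize(Seq(
  ("sam", "02-25-2016", 2),
  ("sam", "02-14-2016", 4),
  ("pam", "03-16-2016", 1),
  ("pam", "02-16-2016", 5)
)).map { case (name, date, value) => (name, LocalDate.parse(date, fmt), value) }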
Making each record a 2-tuple, with the key as the first element, allows you to use the reduceByKey method, like this:
rdd
  .map(t => (t._1, (t._2, t._3)))
  .reduceByKey((a, b) => if (a._1 < b._1) a else b)
  .map(t => (t._1, t._2._1, t._2._2))
Alternatively, using pattern matching for clarity (I find the ._* accessors on tuples a bit confusing to read):
rdd
  .map { case (name, date, value) => (name, (date, value)) }
  .reduceByKey((a, b) => (a, b) match {
    case ((adate, aval), (bdate, bval)) => if (adate < bdate) a else b
  })
  .map { case (name, (date, value)) => (name, date, value) }
Replace a._1 < b._1 (or adate < bdate) with whatever comparison is appropriate for the date type you are working with.
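For example, if the dates are java.time.LocalDate values (as in the setup sketch above), one way to write that comparison would be:

rdd
  .map { case (name, date, value) => (name, (date, value)) }
  .reduceByKey((a, b) => if (a._1.isBefore(b._1)) a else b)
  .map { case (name, (date, value)) => (name, date, value) }

isBefore is just LocalDate's own ordering; any total ordering on your date type will do, since picking the minimum that way keeps the reduce function associative and commutative, which is what reduceByKey requires.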
See http://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs for documentation on reduceByKey and the other things you can do with key/value pairs in Spark.
If you are looking for a plain old Scala List, the following will work:
list
  .groupBy(_._1)
  .mapValues(l => l.reduce((a, b) => if (a._2 < b._2) a else b))
  .values
  .toList
And again, a pattern-matched version for clarity:
list
  .groupBy { case (name, date, value) => name }
  .mapValues(l => l.reduce((a, b) => (a, b) match {
    case ((aname, adate, avalue), (bname, bdate, bvalue)) =>
      if (adate < bdate) a else b
  }))
  .values
  .toList
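As a quick usage check, the same grouping can also be written with minBy instead of reduce, which reads a bit more directly. A sketch, again assuming LocalDate dates (toEpochDay is used only to give Scala a numeric value it already knows how to order):

import java.time.LocalDate

val list = List(
  ("sam", LocalDate.of(2016, 2, 25), 2),
  ("sam", LocalDate.of(2016, 2, 14), 4),
  ("pam", LocalDate.of(2016, 3, 16), 1),
  ("pam", LocalDate.of(2016, 2, 16), 5)
)

list
  .groupBy { case (name, _, _) => name }
  .mapValues(_.minBy { case (_, date, _) => date.toEpochDay })
  .values
  .toList
// => List((pam,2016-02-16,5), (sam,2016-02-14,4))   (key order may vary)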