PySpark: how to split data without randomizing
There is a function that can randomly split data (the weights are normalized, so [6, 2, 2] behaves as a 60/20/20 split):

training_rdd, validation_rdd, test_rdd = rdd.randomSplit([6, 2, 2], seed=0)
I'm curious whether there is a way to generate the same partitioning (train 60 / validation 20 / test 20) without randomizing, i.e. using the data in its current order: the first 60% as train, the next 20% as validation, and the last 20% as test.

Is there a way to split the data into those proportions without randomizing it?
The basic issue here is that unless you have an index column in your data, there is no concept of "first rows" or "next rows" in an RDD; it is an unordered set. If you do have an integer index column, you can do something like this (note that index % 5 yields values 0 through 4, so 0-2 covers 60%, and 3 and 4 each cover 20%):

train = rdd.filter(lambda r: r['index'] % 5 <= 2)
validation = rdd.filter(lambda r: r['index'] % 5 == 3)
test = rdd.filter(lambda r: r['index'] % 5 == 4)
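If the data has no index column, one way to create one is zipWithIndex(), which pairs each element with its position in the RDD's current partition order. Here is a minimal sketch; the SparkContext setup and the toy rdd are placeholder assumptions, not part of the original answer:

from pyspark import SparkContext

sc = SparkContext("local", "sequential-split")  # placeholder setup
rdd = sc.parallelize(range(100))                # toy data standing in for your RDD

# Attach a sequential index: each element becomes a (value, index) pair.
indexed = rdd.zipWithIndex()
n = indexed.count()

# First 60% -> train, next 20% -> validation, last 20% -> test,
# following the RDD's current order.
train = indexed.filter(lambda vi: vi[1] < 0.6 * n).map(lambda vi: vi[0])
validation = indexed.filter(lambda vi: 0.6 * n <= vi[1] < 0.8 * n).map(lambda vi: vi[0])
test = indexed.filter(lambda vi: vi[1] >= 0.8 * n).map(lambda vi: vi[0])

Keep in mind that this order is only as stable as the RDD itself: any repartitioning or shuffle before zipWithIndex() will change which rows come "first".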