PySpark: how to split data without randomizing


There is a function that can randomly split data:

trainingRDD, validationRDD, testRDD = rdd.randomSplit([0.6, 0.2, 0.2], seed=0)

I'm curious whether there is a way to generate the same partition (train 60 / valid 20 / test 20) without randomizing, using the data in its current order: the first 60% becomes the train set, the next 20% the validation set, and the last 20% the test set.

Is there a way to split the data like that, without randomization?

The basic issue here is that unless you have an index column in your data, there is no concept of "first rows" or "next rows" in an RDD; it's an unordered set. If you have an integer index column, you can carve out a 60/20/20 split with a modulo on the index, like this:

train = rdd.filter(lambda r: r['index'] % 5 < 3)        # remainders 0, 1, 2 -> 60%
validation = rdd.filter(lambda r: r['index'] % 5 == 3)  # remainder 3 -> 20%
test = rdd.filter(lambda r: r['index'] % 5 == 4)        # remainder 4 -> 20%
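
If there is no index column yet, here is a minimal sketch of the contiguous "first 60 / next 20 / last 20" split the question asks for. It assumes the RDD's current iteration order is the order you care about; the sample data in sc.parallelize(range(100)) is purely illustrative:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100))   # hypothetical example data

n = rdd.count()
train_end = int(n * 0.6)           # index cutoff for the train set
valid_end = int(n * 0.8)           # index cutoff for the validation set

indexed = rdd.zipWithIndex()       # pairs of (element, index)

train = indexed.filter(lambda x: x[1] < train_end).keys()
validation = indexed.filter(lambda x: train_end <= x[1] < valid_end).keys()
test = indexed.filter(lambda x: x[1] >= valid_end).keys()

Note that zipWithIndex assigns indices following the RDD's partition order (and triggers a Spark job to compute partition sizes), so this only reproduces the "first 60% = train" semantics if that order is meaningful for your data.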
