PySpark: how to split data without randomizing


There is a function that can randomly split the data:

trainingrdd, validationrdd, testrdd = rdd.randomSplit([6, 2, 2], seed=0)

I'm curious whether there is a way to generate the same partition (train 60 / valid 20 / test 20) without randomizing, i.e. using the current order of the data: the first 60% becomes train, the next 20% validation, and the last 20% test.

Is there a way to split the data like that without randomizing it?

The basic issue here is that, unless you have an index column in your data, there is no concept of "first rows" or "next rows" in an RDD; it is an unordered set. If you have an integer index column, you can do something like this:

train = rdd.filter(lambda r: r['index'] % 5 <= 2)       # residues 0-2: 60%
validation = rdd.filter(lambda r: r['index'] % 5 == 3)  # residue 3: 20%
test = rdd.filter(lambda r: r['index'] % 5 == 4)        # residue 4: 20%
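If the data does not already carry an index column, one way to get the contiguous first-60/next-20/last-20 split the question asks for is to attach positions with zipWithIndex(). Below is a minimal sketch; the SparkContext sc and the sample rdd are illustrative assumptions, and zipWithIndex() numbers records in the RDD's current partition and iteration order, so this is only meaningful if that order is the one you care about.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100))  # hypothetical example data

n = rdd.count()
indexed = rdd.zipWithIndex()  # (record, position) pairs in current order

# first 60% = train, next 20% = validation, last 20% = test
train = indexed.filter(lambda x: x[1] < 0.6 * n).keys()
validation = indexed.filter(lambda x: 0.6 * n <= x[1] < 0.8 * n).keys()
test = indexed.filter(lambda x: x[1] >= 0.8 * n).keys()

print(train.count(), validation.count(), test.count())  # 60 20 20

Unlike the modulo filter above, which interleaves every fifth record into a different split, this keeps each split contiguous in the original order; the trade-off is the extra count() pass and the job that zipWithIndex() triggers to compute partition sizes.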
