apache spark - StandardScaler returns NaN


Environment:

Spark 1.6.0, Scala 2.10.4

Usage:

    import org.apache.spark.ml.feature.StandardScaler

    // row of df: DataFrame = (String, String, Double, Vector) -> (id1, id2, label, feature)
    val df = sqlContext.read.parquet("data/labeled.parquet")

    val sc = new StandardScaler()
      .setInputCol("feature")
      .setOutputCol("scaled")
      .setWithMean(false)
      .setWithStd(true)
      .fit(df)

    val scaled = sc.transform(df)
      .drop("feature")
      .withColumnRenamed("scaled", "feature")

The code follows the example here: http://spark.apache.org/docs/latest/ml-features.html#standardscaler

NaN values appear in scaled, as well as in sc.mean and sc.std.

I don't understand why StandardScaler produces NaN even in the mean, or how to handle this situation. Any advice is appreciated.

The Parquet data is 1.6 GiB in size; if more detail is needed, let me know.
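
For anyone hitting the same symptom, a quick way to check whether the input itself contains NaN, Infinity, or extreme values before scaling (a minimal sketch against the df / "feature" names above; the 1e300 threshold is only an illustrative cutoff, not something from the question):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.sql.Row

    // count rows whose feature vector contains NaN, Infinity, or huge values
    val suspicious = df.select("feature").rdd.filter {
      case Row(v: Vector) =>
        v.toArray.exists(x => x.isNaN || x.isInfinite || math.abs(x) > 1e300)
    }.count()

    println(s"rows with suspicious feature values: $suspicious")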

Update:

I went through the code of StandardScaler: the problem is a precision issue with Double when MultivariateOnlineSummarizer aggregates the columns.

Thanks @zero323.

I located the problem: there is a value equal to Double.MaxValue, and when StandardScaler sums the columns, the result overflows.
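
To illustrate (a quick REPL-style sketch in plain Scala, not Spark-specific), this is how summary statistics can end up as NaN once Double.MaxValue is involved:

    scala> val x = Double.MaxValue
    x: Double = 1.7976931348623157E308

    scala> x + x              // the running sum already overflows
    res0: Double = Infinity

    scala> x * x              // so does the sum of squares
    res1: Double = Infinity

    scala> (x * x) - (x * x)  // a variance-style subtraction of two infinities
    res2: Double = NaN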

Simply casting the column values to scala.math.BigDecimal works.

Ref here:

http://www.scala-lang.org/api/current/index.html#scala.math.bigdecimal
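
Another way to sidestep the overflow, if clipping the extreme values is acceptable for your data, is to cap the vector components before fitting the scaler. This is a hedged alternative sketch, not the BigDecimal approach above; the df / "feature" names follow the question, and the 1e15 cap is an arbitrary example value:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.sql.functions.udf

    val cap = 1e15  // example cutoff; choose one that suits your data

    // clip each component of the feature vector into [-cap, cap]
    val clip = udf { (v: Vector) =>
      Vectors.dense(v.toArray.map(x => math.max(math.min(x, cap), -cap)))
    }

    val clipped = df.withColumn("feature", clip(df("feature")))
    // then fit the StandardScaler on `clipped` as above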

