Apache Spark - StandardScaler returns NaN
Env:
Spark 1.6.0, Scala 2.10.4
Usage:

    import org.apache.spark.ml.feature.StandardScaler

    // row of df: DataFrame = (String, String, Double, Vector) = (id1, id2, label, feature)
    val df = sqlContext.read.parquet("data/labeled.parquet")
    val sc = new StandardScaler()
      .setInputCol("feature").setOutputCol("scaled")
      .setWithMean(false).setWithStd(true)
      .fit(df)
    val scaled = sc.transform(df)
      .drop("feature").withColumnRenamed("scaled", "feature")
The code follows the example here: http://spark.apache.org/docs/latest/ml-features.html#standardscaler
NaN values appear in the scaled output, and also in sc.mean and sc.std.
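A minimal check, assuming sc is the fitted StandardScalerModel from the snippet above, to list which feature indices come out as NaN:

    // List the feature indices whose mean or std came out as NaN
    // (sc is the fitted StandardScalerModel from above).
    val nanStd  = sc.std.toArray.zipWithIndex.collect { case (v, i) if v.isNaN => i }
    val nanMean = sc.mean.toArray.zipWithIndex.collect { case (v, i) if v.isNaN => i }
    println(s"NaN std at indices:  ${nanStd.mkString(", ")}")
    println(s"NaN mean at indices: ${nanMean.mkString(", ")}")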
I don't understand why StandardScaler would produce NaN in the mean, or how to handle this situation. Any advice is appreciated.
The data is about 1.6 GiB as Parquet; if more detail is needed, let me know.
Update:
I stepped through the code of StandardScaler; the problem is one of Double precision when MultivariateOnlineSummarizer aggregates the values.
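A rough way to confirm this is to scan the feature vectors for extreme values whose squares or sums blow up towards Infinity; the 1e150 threshold below is an arbitrary assumption, not something taken from the Spark code:

    import org.apache.spark.mllib.linalg.Vector

    // Sample a few values big enough to overflow the summarizer's sums;
    // 1e150 is an arbitrary cutoff, adjust for your data.
    val extremes = df.select("feature").rdd
      .flatMap(_.getAs[Vector]("feature").toArray)
      .filter(x => x.isInfinite || math.abs(x) > 1e150)
      .take(10)
    println(s"Sample of extreme feature values: ${extremes.mkString(", ")}")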
Thanks @zero323. I located the problem: there is a value equal to Double.MaxValue, and when StandardScaler sums the columns the result overflows. Simply casting the column to scala.math.BigDecimal works.
Ref: http://www.scala-lang.org/api/current/index.html#scala.math.bigdecimal
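For completeness, another workaround (my own sketch, not the BigDecimal fix above) is to clip the Double.MaxValue entries in the vector column before fitting; the replacement value 0.0 is an assumption and should be whatever makes sense for the data:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.sql.functions.udf

    // Replace Double.MaxValue entries (here with 0.0, an assumption)
    // so the summarizer's sums stay finite.
    val clip = udf { (v: Vector) =>
      Vectors.dense(v.toArray.map(x => if (x >= Double.MaxValue) 0.0 else x))
    }
    val cleaned = df.withColumn("feature", clip(df("feature")))
    val model = new StandardScaler()
      .setInputCol("feature").setOutputCol("scaled")
      .setWithMean(false).setWithStd(true)
      .fit(cleaned)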