Utility function to compute mean and standard deviation on a given dataset.
- input dataset whose statistics are computed
- number of features
- number of examples in the input dataset
(yMean, xColMean, xColSd) - Tuple consisting of:
- yMean - mean of the labels
- xColMean - row vector with the mean of every column (feature) of the input data
- xColSd - row vector with the standard deviation of every column (feature) of the input data
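As a rough illustration of the statistics described above, the following sketch computes the same three quantities over a plain Scala collection rather than an RDD. The names `StatsSketch`, `computeStats`, `nfeatures`, and `nexamples` are illustrative, the `(label, features)` tuple is a stand-in for a labeled point, and the use of the population standard deviation (dividing by n) is an assumption; the source does not say which variant is used.

```scala
object StatsSketch {
  /** Returns (yMean, xColMean, xColSd) for a non-empty dataset.
    * Assumption: population standard deviation (divide by n, not n - 1). */
  def computeStats(
      data: Seq[(Double, Array[Double])],
      nfeatures: Int,
      nexamples: Long): (Double, Array[Double], Array[Double]) = {
    require(nexamples > 0, "dataset must not be empty")
    val n = nexamples.toDouble
    var ySum = 0.0
    val xSum = new Array[Double](nfeatures)   // per-column sum of x
    val xSqSum = new Array[Double](nfeatures) // per-column sum of x^2
    for ((label, features) <- data) {
      ySum += label
      var j = 0
      while (j < nfeatures) {
        xSum(j) += features(j)
        xSqSum(j) += features(j) * features(j)
        j += 1
      }
    }
    val xColMean = xSum.map(_ / n)
    // Var(x) = E[x^2] - (E[x])^2; clamp at 0 to absorb rounding error.
    val xColSd = Array.tabulate(nfeatures) { j =>
      math.sqrt(math.max(xSqSum(j) / n - xColMean(j) * xColMean(j), 0.0))
    }
    (ySum / n, xColMean, xColSd)
  }
}
```

A single pass accumulating sums and sums of squares keeps the sketch short; a numerically more careful version might use a two-pass or streaming variance update instead.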
Load labeled data from a file. The data format is <L>, <f1> <f2> ..., where <L> is the label and <f1>, <f2>, ... are the corresponding feature values, all of type Double.
SparkContext
Directory containing the input data files.
An RDD of LabeledPoint. Each labeled point has two elements: the first element is the label, and the second element represents the feature values (an array of Double).
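To make the input format concrete, here is a minimal sketch of parsing one line of the form `<L>, <f1> <f2> ...`. It runs without Spark: `LoadSketch`, `Labeled`, and `parseLine` are hypothetical names, and `Labeled` is only a stand-in for `LabeledPoint`. Treating the features as whitespace-separated is an assumption read off the format string.

```scala
object LoadSketch {
  // Hypothetical stand-in for LabeledPoint: a label plus its feature values.
  final case class Labeled(label: Double, features: Array[Double])

  /** Parses one line of the form "<L>, <f1> <f2> ...". */
  def parseLine(line: String): Labeled = {
    // Split once on the comma that separates the label from the features.
    val parts = line.split(",", 2)
    require(parts.length == 2, s"expected '<L>, <f1> <f2> ...' but got: $line")
    val label = parts(0).trim.toDouble
    // Trim the space after the comma, then split the features on whitespace.
    val features = parts(1).trim.split("\\s+").filter(_.nonEmpty).map(_.toDouble)
    Labeled(label, features)
  }
}
```

Over an RDD, a function like this could presumably be applied with `sc.textFile(dir).map(LoadSketch.parseLine)`, where `sc` is the SparkContext and `dir` the input directory.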
Save labeled data to a file. The data format is <L>, <f1> <f2> ..., where <L> is the label and <f1>, <f2>, ... are the corresponding feature values, all of type Double.
An RDD of LabeledPoints containing data to be saved.
Directory in which to save the data.
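The output side can be sketched the same way: each labeled point becomes one line of text in the `<L>, <f1> <f2> ...` format. `SaveSketch` and `formatLine` are hypothetical names, and the exact separators (a comma followed by a space, then single spaces between features) are an assumption taken from the format string above.

```scala
object SaveSketch {
  /** Formats one labeled example as "<L>, <f1> <f2> ...". */
  def formatLine(label: Double, features: Array[Double]): String =
    label.toString + ", " + features.mkString(" ")
}
```

With an RDD of labeled points, the formatted lines could then be written out with something like `data.map(p => SaveSketch.formatLine(p.label, p.features)).saveAsTextFile(dir)`.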
Return the squared Euclidean distance between two vectors.
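For reference, the squared Euclidean distance between vectors a and b is the sum over i of (a_i - b_i)^2, with no final square root. A minimal sketch over plain arrays follows; `DistanceSketch` and `squaredDistance` are illustrative names, and the equal-length check is an added assumption.

```scala
object DistanceSketch {
  /** Squared Euclidean distance: sum over i of (v1(i) - v2(i))^2. */
  def squaredDistance(v1: Array[Double], v2: Array[Double]): Double = {
    require(v1.length == v2.length, "vectors must have the same length")
    var sum = 0.0
    var i = 0
    while (i < v1.length) {
      val d = v1(i) - v2(i)
      sum += d * d
      i += 1
    }
    sum
  }
}
```

Skipping the square root is cheaper and preserves ordering, so the squared form is enough whenever distances are only compared against each other.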
Helper methods to load, save, and pre-process data used in MLlib.