pyspark.mllib.linalg.distributed.IndexedRowMatrix
Represents a row-oriented distributed Matrix with indexed rows.
Parameters
rows : pyspark.RDD
An RDD of IndexedRows or (int, vector) tuples, or a DataFrame consisting of an int-typed column of indices and a vector-typed column.
numRows
Number of rows in the matrix. A non-positive value means unknown, in which case the number of rows is determined by the maximum row index plus one.
numCols
Number of columns in the matrix. A non-positive value means unknown, in which case the number of columns is determined by the size of the first row.
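The fallback rules for non-positive dimensions can be sketched in plain Python, without Spark. The helper name `infer_dims` is hypothetical and only mirrors the rule stated above; it is not part of the PySpark API.

```python
def infer_dims(indexed_rows, num_rows=0, num_cols=0):
    """Mimic the documented fallback for unknown (non-positive) dimensions.

    `indexed_rows` is a list of (index, values) pairs standing in for an RDD.
    """
    if num_rows <= 0:
        # Unknown row count: highest row index plus one.
        num_rows = max(index for index, _ in indexed_rows) + 1
    if num_cols <= 0:
        # Unknown column count: size of the first row.
        num_cols = len(indexed_rows[0][1])
    return num_rows, num_cols


rows = [(0, [1, 2, 3]), (6, [4, 5, 6])]
print(infer_dims(rows))        # (7, 3)
print(infer_dims(rows, 7, 6))  # (7, 6)
```

Explicit positive dimensions are returned unchanged, which is why the second call reports 6 columns even though each stored row has 3 values.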
Methods
columnSimilarities()
Compute all cosine similarities between columns.
computeGramianMatrix()
Computes the Gramian matrix A^T A.
computeSVD(k[, computeU, rCond])
Computes the singular value decomposition of the IndexedRowMatrix.
multiply(matrix)
Multiply this matrix by a local dense matrix on the right.
numCols()
Get or compute the number of cols.
numRows()
Get or compute the number of rows.
toBlockMatrix([rowsPerBlock, colsPerBlock])
Convert this matrix to a BlockMatrix.
toCoordinateMatrix()
Convert this matrix to a CoordinateMatrix.
toRowMatrix()
Convert this matrix to a RowMatrix.
Attributes
rows
Rows of the IndexedRowMatrix stored as an RDD of IndexedRows.
Methods Documentation
columnSimilarities()
Examples
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows)
>>> cs = mat.columnSimilarities()
>>> print(cs.numCols())
3
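To make the quantity behind columnSimilarities() concrete, here is a minimal plain-Python sketch of cosine similarity between columns. The function name `column_similarities` and the choice to keep only pairs with i < j are assumptions of this sketch, not a description of how Spark computes or stores the result.

```python
import math


def column_similarities(rows):
    """Return {(i, j): cosine similarity} for column pairs with i < j.

    Assumes every column has at least one nonzero entry.
    """
    n = len(rows[0])
    cols = [[row[j] for row in rows] for j in range(n)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    sims = {}
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            sims[(i, j)] = dot / (norms[i] * norms[j])
    return sims


# Only the two stored rows matter: absent rows add zeros to every dot product.
sims = column_similarities([[1, 2, 3], [4, 5, 6]])
for pair, value in sorted(sims.items()):
    print(pair, round(value, 4))
```

For three columns there are three distinct pairs, which is consistent with the 3-column result reported in the example.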
New in version 2.0.0.
computeGramianMatrix()
Notes
This cannot be computed on matrices with more than 65535 columns.
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows)
>>> mat.computeGramianMatrix()
DenseMatrix(3, 3, [17.0, 22.0, 27.0, 22.0, 29.0, 36.0, 27.0, 36.0, 45.0], 0)
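The values in the DenseMatrix above can be checked by hand. The following plain-Python sketch computes A^T A for the same two rows and flattens it column by column; the helper names `gramian` and `column_major` are hypothetical.

```python
def gramian(rows):
    """Compute A^T A for a list of equal-length rows."""
    n = len(rows[0])
    return [[sum(row[i] * row[j] for row in rows) for j in range(n)]
            for i in range(n)]


def column_major(matrix):
    """Flatten a list-of-rows matrix column by column."""
    return [float(matrix[i][j])
            for j in range(len(matrix[0]))
            for i in range(len(matrix))]


g = gramian([[1, 2, 3], [4, 5, 6]])
print(column_major(g))
# [17.0, 22.0, 27.0, 22.0, 29.0, 36.0, 27.0, 36.0, 45.0]
```

Because A^T A is symmetric, the row-by-row and column-by-column flattenings coincide here.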
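computeSVD(k[, computeU, rCond])
The given row matrix A of dimension (m x n) is decomposed into U * s * V^T, where
U: (m x k) (left singular vectors) is an IndexedRowMatrix whose columns are the eigenvectors of (A * A^T)
s: DenseVector consisting of the square roots of the eigenvalues (singular values) in descending order.
V: (n x k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A^T * A)
For more specific details on the implementation, please refer to the Scala documentation.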
New in version 2.2.0.
Parameters
k
Number of leading singular values to keep (0 < k <= n). Fewer than k may be returned if there are numerically zero singular values, or if not enough Ritz values converge before the maximum number of Arnoldi update iterations is reached (for example, when matrix A is ill-conditioned).
computeU
Whether or not to compute U. If True, U is computed as A * V * s^-1.
rCond
Reciprocal condition number. All singular values smaller than rCond * s[0] are treated as zero, where s[0] is the largest singular value.
Returns
SingularValueDecomposition
>>> rows = [(0, (3, 1, 1)), (1, (-1, 3, 1))]
>>> irm = IndexedRowMatrix(sc.parallelize(rows))
>>> svd_model = irm.computeSVD(2, True)
>>> svd_model.U.rows.collect()
[IndexedRow(0, [-0.707106781187,0.707106781187]), IndexedRow(1, [-0.707106781187,-0.707106781187])]
>>> svd_model.s
DenseVector([3.4641, 3.1623])
>>> svd_model.V
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, ...0.0], 0)
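The singular values in this example can be reproduced without Spark from the eigenvalues of A * A^T, as a rough cross-check of the decomposition described above. This sketch only handles a matrix with exactly two rows (so A * A^T is 2 x 2), and the helper names `singular_values_two_rows` and `drop_small` are hypothetical. The second helper illustrates the rCond threshold with illustrative values.

```python
import math


def singular_values_two_rows(a):
    """Singular values of a 2 x n matrix via the eigenvalues of A A^T."""
    # A A^T is the symmetric 2 x 2 matrix [[p, q], [q, r]].
    p = sum(x * x for x in a[0])
    q = sum(x * y for x, y in zip(a[0], a[1]))
    r = sum(y * y for y in a[1])
    # Closed-form eigenvalues of a symmetric 2 x 2 matrix, largest first.
    mean = (p + r) / 2
    spread = math.sqrt(((p - r) / 2) ** 2 + q * q)
    eigenvalues = [mean + spread, mean - spread]
    # Clamp tiny negative round-off before taking square roots.
    return [math.sqrt(max(ev, 0.0)) for ev in eigenvalues]


def drop_small(s, r_cond):
    """Keep singular values that are at least r_cond * s[0]."""
    return [v for v in s if v >= r_cond * s[0]]


s = singular_values_two_rows([(3, 1, 1), (-1, 3, 1)])
print([round(v, 4) for v in s])   # [3.4641, 3.1623]
print(len(drop_small(s, 1e-9)))   # 2
```

Here A * A^T is [[11, 1], [1, 11]], whose eigenvalues are 12 and 10, so the singular values are sqrt(12) and sqrt(10).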
multiply(matrix)
Parameters
matrix : pyspark.mllib.linalg.Matrix
A local dense matrix whose number of rows must match the number of columns of this matrix.
>>> mat = IndexedRowMatrix(sc.parallelize([(0, (0, 1)), (1, (2, 3))]))
>>> mat.multiply(DenseMatrix(2, 2, [0, 2, 1, 3])).rows.collect()
[IndexedRow(0, [2.0,3.0]), IndexedRow(1, [6.0,11.0])]
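The result above is easier to follow once the DenseMatrix values are read in column-major order, so that `[0, 2, 1, 3]` stands for the matrix with rows (0, 1) and (2, 3). The plain-Python sketch below reproduces the product under that reading; `from_column_major` and `multiply_rows` are hypothetical helper names.

```python
def from_column_major(num_rows, num_cols, values):
    """Rebuild a list-of-rows matrix from column-major values."""
    return [[values[j * num_rows + i] for j in range(num_cols)]
            for i in range(num_rows)]


def multiply_rows(indexed_rows, local):
    """Multiply each indexed row by `local` on the right, keeping indices."""
    result = []
    for index, row in indexed_rows:
        product = [float(sum(row[k] * local[k][j] for k in range(len(row))))
                   for j in range(len(local[0]))]
        result.append((index, product))
    return result


local = from_column_major(2, 2, [0, 2, 1, 3])   # [[0, 1], [2, 3]]
print(multiply_rows([(0, (0, 1)), (1, (2, 3))], local))
# [(0, [2.0, 3.0]), (1, [6.0, 11.0])]
```

Each output row depends only on the matching input row, which is why the row indices carry over unchanged.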
numCols()
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(1, [4, 5, 6]),
...                        IndexedRow(2, [7, 8, 9]),
...                        IndexedRow(3, [10, 11, 12])])
>>> mat = IndexedRowMatrix(rows)
>>> print(mat.numCols())
3
>>> mat = IndexedRowMatrix(rows, 7, 6)
>>> print(mat.numCols())
6
numRows()
>>> mat = IndexedRowMatrix(rows)
>>> print(mat.numRows())
4
>>> mat = IndexedRowMatrix(rows, 7, 6)
>>> print(mat.numRows())
7
toBlockMatrix([rowsPerBlock, colsPerBlock])
Parameters
rowsPerBlock
Number of rows that make up each block. The blocks forming the final rows are not required to have the given number of rows.
colsPerBlock
Number of columns that make up each block. The blocks forming the final columns are not required to have the given number of columns.
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows).toBlockMatrix()
>>> # This IndexedRowMatrix will have 7 effective rows, due to
>>> # the highest row index being 6, and the ensuing
>>> # BlockMatrix will have 7 rows as well.
>>> print(mat.numRows())
7
>>> print(mat.numCols())
3
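The remark that the final blocks need not have the full size can be illustrated with a small plain-Python sketch. The helper `block_sizes` and the block sizes 3 and 2 are hypothetical values chosen to keep the output short; they are not taken from the PySpark API or its defaults.

```python
def block_sizes(total, per_block):
    """Split `total` rows (or columns) into consecutive block sizes."""
    full, remainder = divmod(total, per_block)
    # Every block is full except possibly a smaller trailing one.
    return [per_block] * full + ([remainder] if remainder else [])


print(block_sizes(7, 3))   # [3, 3, 1]
print(block_sizes(3, 2))   # [2, 1]
```

With 7 rows and 3 rows per block, the last block holds a single row, and the sizes always sum back to the original dimension.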
toCoordinateMatrix()
>>> rows = sc.parallelize([IndexedRow(0, [1, 0]),
...                        IndexedRow(6, [0, 5])])
>>> mat = IndexedRowMatrix(rows).toCoordinateMatrix()
>>> mat.entries.take(3)
[MatrixEntry(0, 0, 1.0), MatrixEntry(0, 1, 0.0), MatrixEntry(6, 0, 0.0)]
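The entries in this output follow from expanding each stored row into one (row, column, value) triple per element, zeros included. A minimal plain-Python sketch of that expansion, with the hypothetical helper name `to_entries`:

```python
def to_entries(indexed_rows):
    """Expand (index, values) rows into (i, j, value) triples."""
    return [(i, j, float(v))
            for i, values in indexed_rows
            for j, v in enumerate(values)]


entries = to_entries([(0, [1, 0]), (6, [0, 5])])
print(entries[:3])
# [(0, 0, 1.0), (0, 1, 0.0), (6, 0, 0.0)]
```

Rows 1 through 5 produce no triples at all, since they were never stored in the first place.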
toRowMatrix()
>>> rows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                        IndexedRow(6, [4, 5, 6])])
>>> mat = IndexedRowMatrix(rows).toRowMatrix()
>>> mat.rows.collect()
[DenseVector([1.0, 2.0, 3.0]), DenseVector([4.0, 5.0, 6.0])]
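As the collected output suggests, the conversion keeps the row vectors and discards their indices. A plain-Python sketch of that step, using the hypothetical helper `drop_indices`:

```python
def drop_indices(indexed_rows):
    """Keep only the row values, discarding the row indices."""
    return [[float(v) for v in values] for _, values in indexed_rows]


indexed = [(0, [1, 2, 3]), (6, [4, 5, 6])]
plain = drop_indices(indexed)
print(plain)        # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(len(plain))   # 2
```

The gap between indices 0 and 6 is no longer represented after this step: only the two stored vectors remain.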
Attributes Documentation
rows
>>> mat = IndexedRowMatrix(sc.parallelize([IndexedRow(0, [1, 2, 3]),
...                                        IndexedRow(1, [4, 5, 6])]))
>>> rows = mat.rows
>>> rows.first()
IndexedRow(0, [1.0,2.0,3.0])