Tall Array Support, Usage Notes, and Limitations

Descriptive Statistics and Visualization

Function	Notes or Limitations
`geomean`
`harmmean`
`kurtosis`
`range`
`skewness`
`zscore`
`corr`	Only `'Pearson'` type is supported.
`tabulate`
`crosstab`	The fourth output, `labels`, is returned as a cell array containing `M` unevaluated tall cell arrays, where `M` is the number of input grouping variables. Each unevaluated tall cell array, `labels{j}`, contains the labels for one grouping variable.
`grpstats`	If the input data is a tall array, then all grouping variables must also be tall and have the same number of rows as the data. The `whichstats` option cannot be specified as a function handle. In addition to the current built-in options, `whichstats` can also be `'Count'` (number of non-NaNs), `'NNZ'` (number of nonzeros and non-NaNs), `'Kurtosis'` (compute kurtosis), `'Skewness'` (compute skewness), or `'all-stats'` (compute all summary statistics). Group order is not guaranteed to match that of the in-memory `grpstats` computation. Summary statistics for nonnumeric variables return NaNs. `grpstats` always operates on the first dimension. If the input is a tall table, then the output is also a tall table. However, rather than including row names, the output tall table contains an extra variable, `GroupLabel`, that contains the same information.
`binScatterPlot`	This function is specifically designed for visualizing large data sets. Instead of plotting millions of data points, which is rarely practical, `binScatterPlot` summarizes the data points into bins. This "scatter plot of bins" reveals high-level trends in the data.
`ksdensity`	Options that require extra passes or sorting of the input data are not supported: `'Censoring'` and `'Support'` (support is always unbounded). For tall arrays, `ksdensity` uses the standard deviation (instead of the median absolute deviation) to compute the bandwidth.
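
As a rough illustration of how the functions in this table are typically called, the following sketch applies a few of them to a tall array. The data are synthetic and the variable names are hypothetical; `mapreducer(0)` is used here only on the assumption that you want evaluation to stay in the local MATLAB session.

```matlab
% Minimal sketch on synthetic data (all variable names are hypothetical).
mapreducer(0);                    % evaluate tall arrays in the local session
rng default                       % reproducible random numbers
x  = rand(1000,3) + 1;            % strictly positive in-memory data
tx = tall(x);                     % convert to a tall array

g = geomean(tx);                  % results stay unevaluated until gathered
z = zscore(tx);
r = corr(tx,'Type','Pearson');    % only the Pearson type is supported

g = gather(g);                    % evaluate and bring results into memory
z = gather(z);
r = gather(r);
```

Each `gather` call triggers a pass over the data, so in practice you would gather only the results you need.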

Probability Distributions

Function	Notes or Limitations
`datasample`	`datasample` is useful as a precursor to plotting a random subset of a very large data set. Sampling a large data set preserves trends in the data without requiring that you plot millions of data points. Supported syntaxes: `Y = datasample(data,k,'Replace',false)` returns `k` observations sampled uniformly at random from `data`, without replacement. `Y = datasample(data,k,1,'Replace',false)` returns a sample taken along the first dimension of `data`. `[Y,idx] = datasample(___)` also returns a tall logical index `idx`. `[___] = datasample(s,___)` uses the random number stream `s` to generate random numbers. If no random number stream is provided, then `datasample` uses the global stream. If the global stream does not support parallel streams, then `datasample` uses `'mrg32k3a'`.
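
The sampling workflow described above might look like the following sketch. The data are synthetic, and the names `data`, `k`, `Y`, and `idx` are illustrative only.

```matlab
% Minimal sketch: sample a tall array without replacement, then plot.
mapreducer(0);                    % evaluate tall arrays in the local session
rng default                       % reproducible global stream
data = tall(randn(10000,2));      % synthetic stand-in for a large data set
k = 200;                          % number of observations to keep

[Y,idx] = datasample(data,k,'Replace',false);
[Y,idx] = gather(Y,idx);          % evaluate both outputs together

scatter(Y(:,1),Y(:,2),'.')        % plot only the sampled observations
```

Here `idx` is gathered only for inspection; for a genuinely large data set you would normally leave it as a tall logical array.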

Cluster Analysis

Function	Notes or Limitations
`kmeans`	Supported syntaxes: `idx = kmeans(X,k)` performs classic k-means clustering. `[idx,C] = kmeans(X,k)` also returns the `k` cluster centroid locations. `[idx,C,sumd] = kmeans(X,k)` additionally returns the `k` within-cluster sums of point-to-centroid distances. `[___] = kmeans(___,Name,Value)` specifies additional name-value pair options using any of the other syntaxes. Valid options are `'Start'` and `'Options'`. `'Start'` is the method used to choose the initial cluster centroid positions. Its value can be `'plus'` (default), which selects `k` observations from `X` using a variant of the kmeans++ algorithm adapted for tall data; `'sample'`, which selects `k` observations from `X` at random; or a numeric k-by-p matrix that explicitly specifies the starting locations. `'Options'` is an options structure created using the `statset` function. For tall arrays, `kmeans` uses the fields listed here and ignores all other fields in the options structure: `'Display'`, the level of display (choices are `'iter'` (default), `'off'`, and `'final'`); `'MaxIter'`, the maximum number of iterations (default is `100`); and `'TolFun'`, the convergence tolerance for the within-cluster sums of point-to-centroid distances (default is `1e-4`; this field works only with tall arrays).
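
To make the name-value options above concrete, here is a minimal sketch that clusters synthetic tall data. The data, the choice of two clusters, and the specific option values are assumptions made for illustration.

```matlab
% Minimal sketch: k-means on a tall array with explicit options.
mapreducer(0);                            % evaluate in the local session
rng default
x = [randn(500,2) + 4; randn(500,2) - 4]; % two well-separated synthetic groups
X = tall(x);

% Only Display, MaxIter, and TolFun are used for tall arrays.
opts = statset('Display','off','MaxIter',50,'TolFun',1e-4);

[idx,C,sumd] = kmeans(X,2,'Start','sample','Options',opts);

idx  = gather(idx);                       % cluster index per observation
C    = gather(C);                         % 2-by-2 centroid locations
sumd = gather(sumd);                      % within-cluster distance sums
```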

Regression

Function	Notes or Limitations
`cvpartition`	For tall arrays, only stratified `'HoldOut'` partitions are supported. `c = cvpartition(group,'HoldOut',p)` randomly partitions observations into a training set and a test set with stratification, using the class information in `group`. `p` is a scalar such that `0 < p < 1`. To obtain nonstratified partitions, use a uniform grouping variable built from the data samples. For example, assuming `X` is a tall numeric array, you can use `groups = X(:,1).*0; c = cvpartition(groups,'HoldOut',p)`.
`fitlm`	If any input argument to `fitlm` is a tall array, then all of the other inputs must be tall arrays as well. This includes nonempty variables supplied with the `'Weights'` and `'Exclude'` name-value pairs. The `'RobustOpts'` name-value pair is not supported with tall arrays. For tall data, `fitlm` returns a `CompactLinearModel` object that contains most of the same properties as a `LinearModel` object. The main difference is that the compact object is designed to conserve memory: it omits properties that contain the data or an array of the same size as the data. The compact object does not contain these `LinearModel` properties: `Diagnostics`, `Fitted`, `ObservationInfo`, `ObservationNames`, `Residuals`, `Steps`, and `Variables`. You can compute the residuals directly from the compact object returned by `LM = fitlm(X,Y)` using `RES = Y - predict(LM,X); S = LM.RMSE; histogram(RES,linspace(-3*S,3*S,51))`. If the `CompactLinearModel` object is missing lower-order terms that include categorical factors, then the `plotEffects` and `plotInteraction` methods are not supported, and the `anova` method with the `'components'` option is not supported.
`fitglm`	If any input argument to `fitglm` is a tall array, then all of the other inputs must be tall arrays as well. This includes nonempty variables supplied with the `'Weights'`, `'Exclude'`, `'Offset'`, and `'BinomialSize'` name-value pairs. The default number of iterations is 5. To change it, create an options structure with `statset` that specifies a different value for `MaxIter`, and pass the structure using the `'Options'` name-value pair. For tall data, `fitglm` returns a `CompactGeneralizedLinearModel` object that contains most of the same properties as a `GeneralizedLinearModel` object. The main difference is that the compact object is designed to conserve memory: it omits properties that contain the data or an array of the same size as the data. The compact object does not contain these `GeneralizedLinearModel` properties: `Diagnostics`, `Fitted`, `Offset`, `ObservationInfo`, `ObservationNames`, `Residuals`, `Steps`, and `Variables`. You can compute the residuals directly from the compact object returned by `GLM = fitglm(X,Y)` using `RES = Y - predict(GLM,X); S = sqrt(GLM.SSE/GLM.DFE); histogram(RES,linspace(-3*S,3*S,51))`.
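
The nonstratified holdout split and the residual computation described in this table can be combined as in the following sketch. The data are synthetic, the 30% holdout fraction is an arbitrary choice, and names such as `Xtrain` and `Ytrain` are hypothetical.

```matlab
% Minimal sketch: holdout split, linear fit, and residual histogram.
mapreducer(0);                           % evaluate in the local session
rng default
x = randn(5000,2);
y = 1 + x*[2; -1] + 0.5*randn(5000,1);   % synthetic linear response
D = tall([x y]);                         % one tall array, so all pieces align
X = D(:,1:2);
Y = D(:,3);

% Uniform grouping variable gives a nonstratified holdout partition.
groups = X(:,1).*0;
c = cvpartition(groups,'HoldOut',0.3);
Xtrain = X(training(c),:);
Ytrain = Y(training(c),:);

LM  = fitlm(Xtrain,Ytrain);              % returns a CompactLinearModel
RES = Ytrain - predict(LM,Xtrain);       % residuals as a tall array
S   = LM.RMSE;
histogram(RES,linspace(-3*S,3*S,51))
```

Because the compact model has no `Residuals` property, the residuals are rebuilt from `predict`, as the `fitlm` row suggests.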

Classification

Function	Notes or Limitations
`fitcdiscr`	Supported name-value pairs are `'ClassNames'`, `'Cost'`, `'DiscrimType'`, `'PredictorNames'`, `'Prior'`, `'ResponseName'`, `'ScoreTransform'`, and `'Weights'`. For tall arrays and tall tables, `fitcdiscr` returns a `CompactClassificationDiscriminant` object, which contains most of the same properties as a `ClassificationDiscriminant` object. The main difference is that the compact object is designed to conserve memory: it omits properties that contain the data or an array of the same size as the data. The compact object does not contain these `ClassificationDiscriminant` properties: `ModelParameters`, `NumObservations`, `ParameterOptimizationResults`, `RowsUsed`, `XCentered`, `W`, `X`, and `Y`. Additionally, the compact object does not support these `ClassificationDiscriminant` methods: `compact`, `crossval`, `cvshrink`, `resubEdge`, `resubLoss`, `resubMargin`, and `resubPredict`.
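
As a hedged illustration of the compact model described above, the following sketch trains a discriminant classifier on synthetic tall data with numeric class labels. The data and variable names are assumptions. Because the `resub*` methods are unavailable on the compact object, the sketch calls `predict` on the training data instead.

```matlab
% Minimal sketch: discriminant analysis on a tall array.
mapreducer(0);                             % evaluate in the local session
rng default
x = [randn(500,2) + 3; randn(500,2) - 3];  % two synthetic classes
y = [ones(500,1); 2*ones(500,1)];          % numeric class labels
D = tall([x y]);                           % one tall array, so rows align
X = D(:,1:2);
Y = D(:,3);

Mdl = fitcdiscr(X,Y,'DiscrimType','linear');

% resubPredict is not supported; predict on the training data instead.
label = gather(predict(Mdl,X));
acc   = mean(label == y);                  % in-sample accuracy
```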

Dimensionality Reduction

Function	Notes or Limitations
`pcacov`, `factoran`	`pcacov` and `factoran` do not work directly on tall arrays. Instead, use `C = gather(cov(X))` to compute the covariance matrix of a tall array. Then, you can use `pcacov` or `factoran` on the in-memory covariance matrix. Alternatively, you can use `pca` directly on a tall array.
`pca`	`pca` works directly with tall arrays by computing the covariance matrix and using the in-memory `pcacov` function to compute the principal components. Supported syntaxes are `coeff = pca(X)`, `[coeff,score,latent] = pca(X)`, `[coeff,score,latent,explained] = pca(X)`, `[coeff,score,latent,tsquared] = pca(X)`, and `[coeff,score,latent,tsquared,explained] = pca(X)`. Name-value pair arguments are not supported.
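
The two routes described in this table, gathering a covariance matrix for `pcacov` or calling `pca` on the tall array, can be sketched as follows. The data are synthetic and the output names are hypothetical.

```matlab
% Minimal sketch: principal components of a tall array, two ways.
mapreducer(0);                            % evaluate in the local session
rng default
x = randn(2000,3) * diag([3 2 1]);        % columns with distinct variances
X = tall(x);

% Route 1: gather the small covariance matrix, then work in memory.
C = gather(cov(X));                       % 3-by-3 in-memory matrix
[coeffCov,latentCov] = pcacov(C);

% Route 2: call pca directly on the tall array.
coeff = gather(pca(X));
```

Route 1 is also the way to reach `factoran`, which likewise accepts only in-memory input.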
