The most commonly used function for calculating the matrix rank is probably the one in the MATLAB software package [1]. It is based on the singular values of the matrix A: the number of singular values greater than some tolerance defines the rank. The algorithm is simple:
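In outline, it looks like this (a sketch only; the exact tolerance expression may vary between MATLAB releases):

s = svd(A);                       % singular values, in descending order
tol = max(size(A)) * eps * s(1);  % tolerance scaled by the largest singular value
r = sum(s > tol);                 % rank = number of singular values above the tolerance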
The same rank calculation is used in the software package Octave; mathematically it is completely equivalent. The function "rank.m" works surprisingly well, but I wonder what the theoretical foundation behind it could be. Using the SVD, the original data is modeled as A = U*S*V', and if only r factors are kept, it becomes A = U(:,1:r)*S(1:r,1:r)*V(:,1:r)' + E. The singular values of A are equal to the square roots of the variance captured by each factor. (The square root of a variance is simply a standard deviation.)
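To make the notation concrete, here is a minimal numerical illustration of the truncated model (the matrix A is arbitrary random data, chosen only for the example):

A = randn(6,4);                      % arbitrary example data
[U,S,V] = svd(A);
r = 2;                               % keep r factors
Ar = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';  % the rank-r model
E = A - Ar;                          % the residual term from the text
s = diag(S);
norm(E,'fro')^2                      % equals sum(s(r+1:end).^2), the discarded variance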
If the singular values are scaled so that the first one equals 1, the machine precision, eps = 2.2e-16 for double precision, can be used as the cutoff: a factor whose contribution, measured as added standard deviation, is smaller than eps is considered insignificant. However, if additional data is added to a matrix, the singular values will in practice always become larger. This is related to the total sum of squares of a matrix, which equals the sum of all the singular values squared. Therefore the cutoff limit should be scaled by something, and in the MATLAB rank function it is scaled by the factor max(size(A)).
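Both claims are easy to verify numerically (a quick check with arbitrary example matrices, not part of any rank code):

A = randn(5,3);
[sum(A(:).^2)  sum(svd(A).^2)]  % the two sums of squares are identical
B = [A; randn(2,3)];            % append more data as extra rows
[svd(A)  svd(B)]                % every singular value grows (or stays equal)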
Again, this works well, but it is difficult to understand the principle behind it. Questions that come to mind are:
- Why are standard deviations used for the cutoff?
- Why not use the variance instead, which relates to the variance explained?
- Why is it scaled with max(size(A))?
- Why not look at the root mean square errors (rms) of the residuals instead?
- Can its performance be improved, and can it be made more intuitive?
To address these questions, and to create something that was easy for myself to understand, I made a new algorithm for calculating the rank, called RANKS, where the trailing letter s stands for search (as in rank-search). I don't know if it is better, and it is not my intention to make the superior or best algorithm for rank estimation. The purpose is just to make something reasonable and understandable for myself. The core idea looks like this:
[m,n] = size(X);                  % X is the data matrix
rms_tol = eps;                    % tolerance on the relative rms of the residuals
s = flipud(svd(X));               % singular values, smallest first
csq = cumsum(s.^2);               % csq(i) = sum of squares of the i smallest singular values
maxval = csq(end);                % total sum of squares of X
if maxval > 0
  csq_nrm = csq/maxval;           % relative residual sums of squares
  df = max(m,n)*(min(m,n):-1:1)'; % 'degrees of freedom' turning sums into mean squares
  msq = csq_nrm./df;              % relative mean square of the residuals
  r = sum(msq > rms_tol^2);       % count the residuals that exceed the tolerance
else
  r = 0;                          % the zero matrix has rank 0
end
%for i = 1:length(msq)
%  fprintf('%5g %20e\n', i, sqrt(msq(i)));
%end
%semilogy(1:min(m,n), sqrt(msq), 'bo-', [1 min(m,n)], [rms_tol rms_tol], 'k-'); figure(gcf);
The algorithm can be simplified, but the above makes the idea easy to understand. The full algorithm, modified to avoid divisions and unnecessary operations, follows below.
function r = ranks(X, rms_tol)
%RANKS Matrix rank *search*.
% RANKS(X,rms_tol) is the number of linearly independent
% rows or columns, counted as the number of factors for
% which the root mean square error of the residuals is
% larger than rms_tol; this is slightly different from the
% standard RANK function in MATLAB.
%
% RANKS(X) provides an estimate of the number of linearly
% independent rows or columns of a matrix X using the
% default root mean square error rms_tol = eps = 2.2e-16
% (the machine precision).
%
% RANKS searches over r for the SVD model with r factors
% U(:,1:r)*S(1:r,1:r)*V(:,1:r)'
% where the relative root mean square of the residuals
% E = X - U(:,1:r)*S(1:r,1:r)*V(:,1:r)'
% is less than rms_tol.
%
% See also RANK
%
% Martin Andersson, 2012
if nargin==1
  rms_tol = eps; % default tolerance: the machine precision
end
[m,n]=size(X);
s = flipud(svd(X));                    % singular values, smallest first
csq = cumsum(s.^2);                    % cumulative residual sums of squares
maxval = csq(end);                     % total sum of squares (a zero matrix gives r = 0)
df = max(m,n)*(min(m,n):-1:1)';        % 'degrees of freedom' for each residual
r = sum(csq > df.*(maxval*rms_tol^2)); % same test as above, without divisions
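A quick sanity check (the test matrix is a made-up example, assuming the function above is saved as ranks.m):

X = [1 2 3; 2 4 6; 0 1 1]; % row 2 is twice row 1, so the rank is 2
ranks(X)                   % returns 2
rank(X)                    % the built-in agrees on this example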
Finally, a remark about the difference between the classic RANK algorithm implemented in MATLAB and the new RANKS algorithm presented here, and why nobody seems to have cared about it. The reason is probably that the topic is close to meaningless: if the calculated rank differs slightly between the two algorithms, it rarely changes the conclusions about the original data set.
😛
[1] Dongarra, J.J., J.R. Bunch, C.B. Moler, and G.W. Stewart, LINPACK Users’ Guide, SIAM, Philadelphia, 1979.