Hi all,
This is a Java question, despite the set up looking long!
I am writing a program that involves constructing and using the inverse of a large correlation matrix many times. Suppose you have a data matrix X with d columns representing d variables and n rows representing the different data points. A correlation matrix (R)_ij for the data is a matrix with R_ij = corr(X_i,X_j), where X_i means 'row i'. The type of correlation functions I use are purely based on the distance between the columns. E.g.
R_ij = prod_{k=1}^d exp{-c_k|X_ik - X_jk|^{p_k}}
where the powers p_k and constants c_k are known and input by the user. (These will change each time we construct the matrix).
OK that's the goal. I am an R programmer and the construction can be done very very quickly in R using the function dist(). What dist() does is give a matrix of distances between the X's without using loops. The without loops part is extremely important because loops are very very slow in R. So, in R you would first pre-scale the columns by c_k^(1/p_k). Then the code would look like
D <- dist(rep(1:n))
for(i in 1:d){ D = D+dist(X[,i])^p[k]}
exp{-D}
Note the product went into the exponent as a sum. OK that's super quick. About 1.5 seconds for a 3000x5 matrix (generating a 3000x3000 matrix).
The problem is that in the code I am writing, I have to do this (along with many other operations) thousands and thousands of times. Of course we can do that in R, but it takes too long. Loops are just bad in R. For example, making the matrix using a double loop the "obvious" way, takes 22 minutes for the same size matrix as dist does in 1.5s! So, impressed by the relative speed of java in loops, I decided to code there. I require fast cholesky decompositions, so have been using the jBLAS package and making my matrices DoubleMatrix types.
Here is my correlation function method. It requires jblas.
public static Double correlationFunction(DoubleMatrix x1, DoubleMatrix x2,
DoubleMatrix CorrLengths, DoubleMatrix tPowers){
DoubleMatrix diff = x1.sub(x2);
diff = MatrixFunctions.abs(diff);
DoubleMatrix clengths = DoubleMatrix.diag(CorrLengths);
DoubleMatrix lhs = diff.mmul(clengths);
tPowers = tPowers.sub(1.0);
DoubleMatrix rhs = MatrixFunctions.powi(diff,tPowers);
rhs = rhs.transpose();
DoubleMatrix minusLogAns = lhs.mmul(rhs);
Double mMLans = minusLogAns.get(0);
return Math.exp(-mMLans);
}
public static DoubleMatrix CorrMatrix(DoubleMatrix tData, DoubleMatrix CorrLengths,
DoubleMatrix tPowers){
int n = tData.rows;
DoubleMatrix ans = new DoubleMatrix(n,n);
Double tempCorrelation = null;
for (int i=0; i<n; i++) {
ans.put(i,i,1);
}
for (int j=1; j<n; j++) {
for (int k=0; k<j; k++) {
tempCorrelation = correlationFunction(tData.getRow(j),tData.getRow(k),
CorrLengths, tPowers);
ans.put(j,k,tempCorrelation);
ans.put(k,j,tempCorrelation);
}
}
return ans;
}
The first function is a correlation function to get the ijth point in the matrix, using some matrix algebra to calculate the exponential correlation function. The second fills the matrix. This is what I would call the "obvious" way. This code works.
I compared the time it takes to read in a 3000x5 data matrix and create the correlation matrix
R dist, 1.5s
Java 6.2s.
(the answers were the same). This time difference tells me that my Java code is horribly inefficient. This task will have to be repeated tens of thousands of times in a loop! Saving around 4s is therefore important.
My question is, is there a method like "dist" in Java that I don't know about? Failing that, am I making any bad time-wasting mistakes in my code? It works and it produces the right answers, but I may be doing something in a loop that there are standard functions/fast libraries for. A fruitless search in google hasn't helped. Running R in java looks bad because what time you save using dist is more than lost transfering the matrices from what I've read.
Cheers