Fast Distance Matrices in Java

Question

DannyW 0 Newbie Poster

12 Years Ago

Hi all,

This is a Java question, despite the set up looking long!

I am writing a program that involves constructing and using the inverse of a large correlation matrix many times. Suppose you have a data matrix X with d columns representing d variables and n rows representing the different data points. A correlation matrix (R)_ij for the data is a matrix with R_ij = corr(X_i,X_j), where X_i means 'row i'. The type of correlation functions I use are purely based on the distance between the columns. E.g.

R_ij = prod_{k=1}^d exp{-c_k|X_ik - X_jk|^{p_k}}

where the powers p_k and constants c_k are known and input by the user. (These will change each time we construct the matrix).

OK that's the goal. I am an R programmer and the construction can be done very very quickly in R using the function dist(). What dist() does is give a matrix of distances between the X's without using loops. The without loops part is extremely important because loops are very very slow in R. So, in R you would first pre-scale the columns by c_k^(1/p_k). Then the code would look like

D <- dist(rep(1:n))
for(i in 1:d){ D = D+dist(X[,i])^p[k]}
exp{-D}

Note the product went into the exponent as a sum. OK that's super quick. About 1.5 seconds for a 3000x5 matrix (generating a 3000x3000 matrix).

The problem is that in the code I am writing, I have to do this (along with many other operations) thousands and thousands of times. Of course we can do that in R, but it takes too long. Loops are just bad in R. For example, making the matrix using a double loop the "obvious" way, takes 22 minutes for the same size matrix as dist does in 1.5s! So, impressed by the relative speed of java in loops, I decided to code there. I require fast cholesky decompositions, so have been using the jBLAS package and making my matrices DoubleMatrix types.

Here is my correlation function method. It requires jblas.

public static Double correlationFunction(DoubleMatrix x1, DoubleMatrix x2, 
                                             DoubleMatrix CorrLengths, DoubleMatrix tPowers){
        DoubleMatrix diff = x1.sub(x2);
        diff = MatrixFunctions.abs(diff);
        DoubleMatrix clengths = DoubleMatrix.diag(CorrLengths);
        DoubleMatrix lhs = diff.mmul(clengths);
        tPowers = tPowers.sub(1.0);
        DoubleMatrix rhs = MatrixFunctions.powi(diff,tPowers);
        rhs = rhs.transpose();
        DoubleMatrix minusLogAns = lhs.mmul(rhs);
        Double mMLans = minusLogAns.get(0);
        return Math.exp(-mMLans);
    }

    public static DoubleMatrix CorrMatrix(DoubleMatrix tData, DoubleMatrix CorrLengths, 
                                          DoubleMatrix tPowers){
        int n = tData.rows;
        DoubleMatrix ans = new DoubleMatrix(n,n);
        Double tempCorrelation = null;
        for (int i=0; i<n; i++) {
            ans.put(i,i,1);
        }
        for (int j=1; j<n; j++) {
            for (int k=0; k<j; k++) {
                tempCorrelation = correlationFunction(tData.getRow(j),tData.getRow(k),
                                                      CorrLengths, tPowers);
                ans.put(j,k,tempCorrelation);
                ans.put(k,j,tempCorrelation);
            }
        }
        return ans;
    }

The first function is a correlation function to get the ijth point in the matrix, using some matrix algebra to calculate the exponential correlation function. The second fills the matrix. This is what I would call the "obvious" way. This code works.

I compared the time it takes to read in a 3000x5 data matrix and create the correlation matrix

R dist, 1.5s
Java 6.2s.

(the answers were the same). This time difference tells me that my Java code is horribly inefficient. This task will have to be repeated tens of thousands of times in a loop! Saving around 4s is therefore important.

My question is, is there a method like "dist" in Java that I don't know about? Failing that, am I making any bad time-wasting mistakes in my code? It works and it produces the right answers, but I may be doing something in a loop that there are standard functions/fast libraries for. A fruitless search in google hasn't helped. Running R in java looks bad because what time you save using dist is more than lost transfering the matrices from what I've read.

Cheers

java

2 Contributors
4 Replies
725 Views
1 Day Discussion Span
Latest Post 12 Years Ago Latest Post by DannyW

~s.o.s~ 2,560 Failure as a human

12 Years Ago

AFAIK, writing number crunching algorithms in pure Java never was a viable solution. Scientific Java applications typically use third party libraries tuned towards doing fast math operations. Though I haven't personally worked with it, take a look at JBlas. R performs better because I think it is a language specifically tuned for statistical computing.

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

DannyW 0 Newbie Poster · Answer 1 · 2012-07-20T11:45:16+00:00

Thanks for the reply. If you notice I already am using jBlas, though there is nothing in there that does this. R is fast at matrix operations because they are coded in C. However, since R is not compiled and has poor memory handling, is not good for coding anything that needs to go in a loop with many iterations. Oh well.

~s.o.s~ 2,560 Failure as a human Team Colleague Featured Poster · Answer 2 · 2012-07-20T12:08:20+00:00

Whoops, my bad, totally missed that part. I read a post somewhere claiming that Java with jBlas takes a the same amount of time as R for the given test case; I guess it was for a specific use case. I would totally recommend hitting the jBlas mailing list since they might be able to suggest something out of their own experiences. If you already have and the reply was not satisfactory, maybe try to max out the performance by branching out computations to different threads and then merging them (if it's possible).

DannyW 0 Newbie Poster · Answer 3 · 2012-07-20T12:54:14+00:00

I think I read the same test case. That was for matrix multiplication. Indeed, this is the reason I'm using jBLAS and I can confirm through my own tests that the performance is the same as R. The cholesky decomposition, which is the killer O(n^3) flops operation is as fast as R and C with jBLAS and that's why I'm using it. I guess the problem here is that they don't have methods for distances and my "by hand" method is slow. Good tip on the jBlas mailing list. Cheers