Dear all,
I am implementing a small program and I'd like to work with threads. I should say up front that I am certainly not an expert C++ user (I mostly use Perl, ouch, but in this case Perl was far too slow for what I wanted to do).
That being said, my program is divided into two main parts:
a) reading and filling a set of vectors (with functions)
b) running one function to compute a score (with the vectors loaded in part a)) and then running it again on a vector of randomized vectors of Elem (a bootstrap to estimate the significance of the score).
The vectors loaded in the first part are declared as:
vector <string> datasetNames; // vector of string
vector <Elem> rankedData; // vector of Elem. An elem is an object that contains one int and one double
vector <vector <bool> > datasets;// vector of vectors of booleans
vector <vector <Elem> > shuffledRankedVectors; // vector of numerous (at least 5000) vectors of Elem
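To make the types above concrete, here is a minimal sketch of what Elem could look like. The post only says that it holds one int and one double, so the member names and the helper makeExample are assumptions made for illustration (C++11 assumed).

```cpp
#include <vector>

// Hypothetical layout: only "one int and one double" is known;
// the member names are assumptions.
struct Elem {
    int index;     // assumed: identifier or position of the item
    double value;  // assumed: value used for ranking
};

// Hypothetical helper: builds a small ranked vector for illustration.
std::vector<Elem> makeExample() {
    return { {0, 2.5}, {1, 1.0}, {2, -0.5} };
}
```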
In the beginning, I used a function es (see below) that computed the score once on the real values (vector <Elem> rankedData) and a given number of times on the randomized values (vector <vector <Elem> > shuffledRankedVectors).
double es (vector <Elem> &rankedData, int rankedSize, vector <bool> &dataset, vector <double> ess)
Even though this was far faster than the Perl version, the bootstrap part was still far too slow, so I wanted to use threads. Unfortunately, I read that a thread function can only take ONE argument, so I could not pass es directly. I therefore wrote a function run_es_pval (see below) that takes a struct containing integers and pointers to the objects I need for my calculations.
struct input_struct {
    int start;
    int end;
    vector <string> *datasetNames;
    int rankedSize;
    vector <Elem> *rankedData;
    vector <vector <bool> > *datasets;
    vector <vector <Elem> > *shuffledRankedVectors;
};
void* run_es_pval(void *inptr) {
    input_struct in = *((input_struct*)(inptr));
    int start = in.start;
    int end = in.end;
    vector <string> *datasetNames_ptr = in.datasetNames;
    vector <string> datasetNames = *datasetNames_ptr;
    int rankedSize = in.rankedSize;
    vector <Elem> *rankedData_ptr = in.rankedData;
    vector <Elem> rankedData = *rankedData_ptr;
    vector <vector <bool> > *datasets_ptr = in.datasets;
    vector <vector <bool> > datasets = *datasets_ptr;
    vector <vector <Elem> > *shuffledRankedVectors_ptr = in.shuffledRankedVectors;
    vector <vector <Elem> > shuffledRankedVectors = *shuffledRankedVectors_ptr;
    for (int i = start; i < end; i++) {
        string datasetName = datasetNames[i];
        vector <double> esResults;
        esResults.resize(rankedSize);
        // real computations
        double esi = es(rankedData, rankedSize, datasets[i], esResults);
        // random controls (bootstrapping)
        double pval = getpval (shuffledRankedVectors, rankedSize, datasets[i], esi);
        // print pvalues
        cout << datasetName << "\t" << esi << "\t"<< pval<< endl;
    }
}
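A note on the local variables in run_es_pval: a statement of the form vector <Elem> rankedData = *rankedData_ptr; copy-constructs a new vector, so each call holds its own copy of the data it points to. Binding a reference to the pointed-to vector avoids that copy. The sketch below contrasts the two by comparing the address of the underlying storage. The Elem layout and the function names copyShares and referenceShares are assumptions for illustration, not part of the original program.

```cpp
#include <vector>

// Assumed layout: the post only says Elem holds one int and one double.
struct Elem {
    int index;
    double value;
};

// Dereferencing into a new vector object copies every element,
// so the copy has its own storage.
bool copyShares(std::vector<Elem>* ptr) {
    std::vector<Elem> local = *ptr;          // copy-constructs a new vector
    return local.data() == ptr->data();
}

// Binding a const reference only gives the existing vector a second name.
bool referenceShares(std::vector<Elem>* ptr) {
    const std::vector<Elem>& local = *ptr;   // no copy, same storage
    return local.data() == ptr->data();
}
```

With a non-empty vector, copyShares returns false and referenceShares returns true. Whether these copies account for all of the extra memory described below is not something the post alone can confirm.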
pthread_t* h1 = new pthread_t;
pthread_attr_t* atr = new pthread_attr_t;
pthread_attr_init(atr);
pthread_attr_setscope(atr,PTHREAD_SCOPE_SYSTEM);
pthread_create(h1,atr,run_es_pval,(void *) &thisthreadvalues);
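The post does not show where thisthreadvalues lives or whether the thread is joined, so the cause of the crash cannot be determined from it. As a hedged sketch only, the following shows a pattern that avoids three common pitfalls with pthread_create: the argument struct stays alive until the thread has finished, the caller waits with pthread_join before using the results, and the start routine returns a value. A simple sum stands in for es and getpval, and the names RangeArgs, sumRange and parallelSum are hypothetical.

```cpp
#include <pthread.h>
#include <vector>

// Hypothetical argument bundle: it holds a pointer, so no data is copied.
struct RangeArgs {
    int start;
    int end;
    const std::vector<double>* values;  // shared, read-only input
    double sum;                         // result slot written by the worker
};

// Start routine with the void* (*)(void*) signature pthread_create expects.
void* sumRange(void* p) {
    RangeArgs* args = static_cast<RangeArgs*>(p);
    const std::vector<double>& values = *args->values;  // reference, no copy
    args->sum = 0.0;
    for (int i = args->start; i < args->end; ++i) {
        args->sum += values[i];
    }
    return NULL;  // a non-void function must return a value
}

// Splits the index range across nThreads workers and waits for all of them.
// Error checking of the pthread calls is omitted to keep the sketch short.
double parallelSum(const std::vector<double>& values, int nThreads) {
    std::vector<pthread_t> threads(nThreads);
    std::vector<RangeArgs> args(nThreads);  // outlives every thread below
    int n = static_cast<int>(values.size());
    for (int t = 0; t < nThreads; ++t) {
        args[t].start = t * n / nThreads;
        args[t].end = (t + 1) * n / nThreads;
        args[t].values = &values;
        args[t].sum = 0.0;
        pthread_create(&threads[t], NULL, sumRange, &args[t]);
    }
    double total = 0.0;
    for (int t = 0; t < nThreads; ++t) {
        pthread_join(threads[t], NULL);     // wait before reading the result
        total += args[t].sum;
    }
    return total;
}
```

Because every worker only reads the shared vector through a pointer, the memory footprint of the input does not grow with the number of threads in this pattern.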
However, I observe two main issues:
- when calling run_es_pval directly, it uses twice as much memory (which means that with more than one thread, the memory usage would increase further) ... but at least it works!
- when calling run_es_pval through pthread_create, I get a segmentation fault at run time.
So my question is twofold:
- How can I keep the memory usage from increasing, even if I use 1000 threads?
- Why do I get a segmentation fault?
At the moment, the only solution I see is to store all my vectors as global variables, but I don't find that very elegant!
Thank you very much for taking the time to read this.
Regards,
Sylvain