Hi all!
I've been working on a school/hobby project for some time now. It's a simple statistical tool for analyzing data from psychological experiments. Not being a professional programmer, I've run into a couple of problems with data storage. Let me explain how the data is structured now and why I think it's wrong ;-)
- I have my main data organized in a dictionary (OrderedDict) called main_data. It holds the data for each experiment part (with keys like "Exp1", "Exp2"). Each dictionary entry is a list (one item per subject) of lists (that subject's results).
- I also have extra data for each subject (like sex, age-band, type of experimental treatment and many more). This data is stored in another OrderedDict called extra_data, with the keys being variable names (like "sex") and each value being a list with the data for all subjects.
All data are sorted by subject id number. That way, when I look at, for example, the fifth entry in the "Exp1" list, I can also look at the fifth entry in the "sex" list from the extra_data dictionary and know whether my fifth subject is a man or a woman.
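To make the layout concrete, here is a minimal sketch of the two dictionaries. The numbers, the three-subject size and the two results per subject are invented for illustration; only the names main_data and extra_data and the keys come from the description above.

```python
from collections import OrderedDict

# Hypothetical sample values; the real data set is much larger.
# main_data: experiment part -> list of subjects -> list of results
main_data = OrderedDict([
    ("Exp1", [[0.51, 0.62], [0.48, 0.70], [0.55, 0.59]]),
    ("Exp2", [[1.2, 1.4], [1.1, 1.3], [1.5, 1.6]]),
])

# extra_data: variable name -> one value per subject, in the same order
extra_data = OrderedDict([
    ("sex", [1, 2, 1]),
    ("age-band", [3, 1, 4]),
])

# The same position in every list refers to the same subject.
subject = 2  # zero-based position of the third subject
results = main_data["Exp1"][subject]
sex = extra_data["sex"][subject]
print(results, sex)
```

Everything depends on the lists staying aligned by position, so any filtering has to pick the same positions from every list.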
One of the most important features I need in my program is the ability to perform calculations for only a subset of subjects (e.g. only men). Right now I do it like this: the filtering function takes a dictionary such as {"sex": [1,2], "age-band":[1,2,3,4]}, and so on for all variables. All keys in this input dictionary (let's call it input_dict) are the same as the keys in extra_data. When filtering, I iterate over the keys and values of input_dict. For each subject I check whether that subject's value for the key in extra_data is among the allowed values in input_dict. If it is (e.g. I have 1 under "sex" in input_dict and 1 under "sex" in extra_data for a particular subject), I copy that subject's main_data for each experiment part (Exp1, Exp2 and so on) to a new dictionary.
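Roughly, the filtering described above looks like the sketch below. This is a simplified reconstruction, not my exact code: the function name filter_subjects and the sample values are made up, and it assumes a subject is kept only when it matches every variable in input_dict.

```python
from collections import OrderedDict

def filter_subjects(main_data, extra_data, input_dict):
    """Copy the results of each subject that matches all criteria in input_dict."""
    n_subjects = len(next(iter(extra_data.values())))
    filtered = OrderedDict((part, []) for part in main_data)
    for i in range(n_subjects):
        # Assumption: the subject must match every variable to be kept.
        if all(extra_data[key][i] in allowed
               for key, allowed in input_dict.items()):
            for part, subjects in main_data.items():
                # This per-subject, per-part copy is the slow step.
                filtered[part].append(list(subjects[i]))
    return filtered

# Hypothetical sample data in the layout described above.
main_data = OrderedDict([
    ("Exp1", [[0.51, 0.62], [0.48, 0.70], [0.55, 0.59]]),
    ("Exp2", [[1.2, 1.4], [1.1, 1.3], [1.5, 1.6]]),
])
extra_data = OrderedDict([
    ("sex", [1, 2, 1]),
    ("age-band", [3, 1, 4]),
])

input_dict = {"sex": [1], "age-band": [1, 2, 3, 4]}
selected = filter_subjects(main_data, extra_data, input_dict)
print(selected["Exp1"])
```

With this sample data, the first and third subjects pass the filter, and their results are copied for both experiment parts.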
The problem is that when my data sets get fairly big (about 20 experiment parts, about 20 extra_data variables and about 300 subjects), this approach is very slow, because it involves a lot of copying of data.
So my question is: how do you think I should organize and filter the data to make this work faster? I'd be grateful for any ideas.
Sorry for such a long post.
Best regards
Yemu