Hi all!

I've been working on a school/hobby project for some time now. It's a simple statistical tool for analysing data from psychological experiments. Not being a professional programmer, I ran into a couple of problems concerning data storage. Let me explain how the data is structured now and why I think it's wrong ;-)

- My main data is organized in a dictionary (OrderedDict) called main_data. It stores the data for each experiment part under keys like "Exp1", "Exp2". Each entry is a list (of subjects) of lists (each subject's results).

- I also have extra data for each subject (like sex, age band, type of experimental treatment and many more). This data is stored in another OrderedDict called extra_data, with the variable name (like "sex") as the key and a list with the values for all subjects as the data.

All data is sorted by subject id number. That way, when I check, for example, the fifth entry of the "Exp1" list, I can also check the fifth entry of the "sex" list in extra_data and know whether my fifth subject is a man or a woman.

One of the most important features I need in my program is the ability to perform calculations for only a subset of subjects (e.g. only men). At the moment I do it like this: the filtering function takes a dictionary like {"sex": [1,2], "age-band":[1,2,3,4]}, and so on for all variables (let's call this dict input_dict). All its keys are the same as the keys in extra_data. When filtering, I iterate over the keys and values of input_dict. For each subject I check whether the value for this key in extra_data is among the values in input_dict. If it is (e.g. I have 1 in "sex" in input_dict and 1 in "sex" in extra_data for a particular subject), I copy the main_data for this subject for each experiment part (Exp1, Exp2 and so on) to a new dictionary.
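A simplified sketch of the filtering described above might look like this. The sample values and the function name filter_subjects are made up for illustration; only main_data, extra_data and input_dict come from the description, and the real code may differ in detail.

```python
from collections import OrderedDict

# Made-up sample data in the layout described above (3 subjects).
main_data = OrderedDict([
    ("Exp1", [[1, 2], [3, 4], [5, 6]]),
    ("Exp2", [[7, 8], [9, 10], [11, 12]]),
])
extra_data = OrderedDict([
    ("sex", [1, 2, 1]),
    ("age-band", [1, 3, 5]),
])

def filter_subjects(main_data, extra_data, input_dict):
    """Copy the results of every subject that matches all filters."""
    n_subjects = len(next(iter(extra_data.values())))
    filtered = OrderedDict((exp, []) for exp in main_data)
    for i in range(n_subjects):
        # A subject is kept only if each filtered variable has an allowed value.
        if all(extra_data[key][i] in allowed
               for key, allowed in input_dict.items()):
            for exp, results in main_data.items():
                filtered[exp].append(results[i])
    return filtered

input_dict = {"sex": [1], "age-band": [1, 2, 3, 4]}
men = filter_subjects(main_data, extra_data, input_dict)
```

With this sample data only the first subject passes both filters, so men holds one result list per experiment part.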

The problem is that when my data sets get quite big (about 20 experiment parts, about 20 extra_data variables, and about 300 subjects), this approach is very slow, because it involves a lot of copying of data.

So my question is: how do you think I should organize and filter the data to make it faster? I'd be grateful for any ideas.
Sorry for such a long post.

Best regards
Yemu

All 6 Replies

I think you should work with a Subject class, each Subject instance containing the results to all the experiments and the extra data, like this

from random import randint

class Subject(object):

    def __init__(self, sid):
        self.sid = sid
        self.results = dict()
        self.sex = 1
        self.age_band = 1

class Experiment(object):
    def __init__(self, eid):
        self.eid = eid

def main():
    # create a list of 20 experiments
    experiments = [Experiment(i) for i in range(20)]
    # create a list of 300 subjects
    subjects = [Subject(i) for i in range(300)]

    # create a random result list for each experiment and each subject
    for exp in experiments:
        for sub in subjects:
            sub.results[exp] = [randint(0,20) for i in range(5)]

    # create random extra data for each subject
    for sub in subjects:
        sub.sex = randint(1, 2)
        sub.age_band = randint(1, 9)

    # function to select a subset of a subjects sequence
    def select(subjects, **kwd):
        for s in subjects:
            # keep the subject only if every given attribute has an allowed value
            if all(getattr(s, key) in value for key, value in kwd.items()):
                yield s

    # get a list of selected subjects
    selection = list(select(subjects, sex=[1], age_band=[1, 2, 3, 4]))
    print(len(selection))

    # The results of the selected subjects are available through the instances
    # print the results of the selected subjects for the 3rd experiment:
    print([sub.results[experiments[2]] for sub in selection])

if __name__ == "__main__":
    main()

You can also add extra data to Experiment instances.

All I can propose is that you use a database.
An ordered dictionary is only usable if you know the queries beforehand or you have hardly any data.

I made a diagram of how I understand your data structure, but Dia crashed and I lost it.
If you are still interested in solving this problem, let me know.

I would use an sqlite database, with the following tables:

  • experiment, list of experiments
  • subject, list of possible subjects
  • subject_data, list of possible values of a subject, if it is not numerical
  • experiment_data, which subject got which subject_data or numerical value in which experiment.

If this model is right, then a filter for men is roughly:

select something
from 
experiment_data as e
,subject as s
,subject_data as sd
where e.subject_id=s.id
and sd.subject_id=s.id
and e.subject_data_id=sd.id
and s.name='sex'
and sd.name='man'
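
To make the database idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. Note that it assumes a simpler two-table layout than the four tables listed above (one row per subject, one row per result); the table names, column names and sample values are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE subject (id INTEGER PRIMARY KEY, sex TEXT, age_band INTEGER);
    CREATE TABLE result (
        subject_id INTEGER REFERENCES subject(id),
        experiment TEXT,
        value REAL
    );
""")
# Made-up sample rows: three subjects, two experiment parts.
conn.executemany("INSERT INTO subject VALUES (?, ?, ?)",
                 [(1, "man", 2), (2, "woman", 3), (3, "man", 7)])
conn.executemany("INSERT INTO result VALUES (?, ?, ?)",
                 [(1, "Exp1", 10.0), (2, "Exp1", 12.0), (3, "Exp1", 14.0),
                  (1, "Exp2", 20.0), (2, "Exp2", 22.0), (3, "Exp2", 24.0)])

# Filter: Exp1 results of men only. Nothing is copied up front;
# the database returns just the matching rows.
rows = conn.execute("""
    SELECT r.subject_id, r.value
    FROM result AS r
    JOIN subject AS s ON s.id = r.subject_id
    WHERE s.sex = ? AND r.experiment = ?
    ORDER BY r.subject_id
""", ("man", "Exp1")).fetchall()
```

Further conditions, such as a set of age bands, would become extra terms in the WHERE clause.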

I second the use of SQLite; then you can select where sex = 'Male', etc. If you want to keep your original structure, a dictionary of lists might work better, with each position in the list holding a specific data value.
exp_dict[Exp#] = [sex, age_band, type_of_treatment]

Otherwise, you would want a separate dictionary with "Male" and "Female" as keys, each pointing to a list of experiment numbers (or whatever the key of the main dictionary is), so you don't have to iterate through all of the main dictionary's keys and check the gender stored under each one.
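
A rough sketch of such a lookup dictionary, adapted to the layout from the original post (parallel lists sorted by subject), could look like this. Here the index stores list positions of subjects rather than experiment numbers, and the names by_sex, sex and exp1 as well as the values are made up for the example.

```python
from collections import defaultdict

# Made-up values, one entry per subject, in the same order in both lists.
sex = ["Male", "Female", "Male", "Female"]
exp1 = [[3, 5], [4, 4], [2, 6], [5, 1]]

# Build the index once: value -> positions of the subjects having it.
by_sex = defaultdict(list)
for position, value in enumerate(sex):
    by_sex[value].append(position)

# Later lookups only touch the matching positions.
male_exp1 = [exp1[i] for i in by_sex["Male"]]
```

The index has to be rebuilt (or updated) whenever subjects are added, which is the price for not scanning every subject on each filter.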

thank you very much for your solutions!
I have to take a close look at them, and decide what would suit me best. And to do that I have to understand them well first ;-)
best regards
y

Python 3 offers the named tuple, which sounds like something you may be interested in. Here is an example ...

# named tuple instances require no more memory than regular tuples
# tested with Python 3.1.1

import collections as co

EmpRec = co.namedtuple('EmpRec', 'name, department, salary')

bob = EmpRec('Bob Zimmer', 'finance', 77123)
tim = EmpRec('Tim Bauer', 'shipping', 34231)

fred_list = ['Fred Flint', 'purchasing', 42350]
# create an instance from a list
fred = EmpRec._make(fred_list)

# create an instance from an existing instance
john = fred._replace(name='John Ward', salary=49200)

# create a default instance for hourly manufacturing workers
default = EmpRec('addname', 'manufacturing', 26000)
mike = default._replace(name='Mike Holz')
gary = default._replace(name='Gary Wood')
carl = default._replace(name='Carl Boor')

# access by named index
print(bob.name, bob.salary)  # Bob Zimmer 77123
# or access by numeric index
print(tim[0], tim[2])  # Tim Bauer 34231

print('-'*40)

# access from a list of instances
emp_list = [bob, fred, tim, john, mike, gary, carl]
for emp in emp_list:
    print( "%-15s works in %s" % (emp.name, emp.department) )

print('-'*40)

# convert an instance to a dictionary via OrderedDict
print( dict(bob._asdict()) )
"""
{'department': 'finance', 'salary': 77123, 'name': 'Bob Zimmer'}
"""

# list the fieldnames of an instance
print(bob._fields)  # ('name', 'department', 'salary')

Note: Python 2.6 includes the named tuple already.
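
Applied to the data from the original question, a named tuple per subject might look roughly like the sketch below. The field names (sid, sex, age_band, results) and the sample values are assumptions chosen for illustration, not something taken from the actual project.

```python
from collections import namedtuple

# One record per subject; 'results' maps experiment part -> result list.
Subject = namedtuple("Subject", "sid sex age_band results")

subjects = [
    Subject(1, 1, 2, {"Exp1": [4, 7], "Exp2": [3, 9]}),
    Subject(2, 2, 4, {"Exp1": [5, 5], "Exp2": [8, 2]}),
    Subject(3, 1, 6, {"Exp1": [6, 1], "Exp2": [7, 7]}),
]

# Filtering keeps references to the same records, so no result data is copied.
selected = [s for s in subjects if s.sex == 1 and s.age_band in (1, 2, 3, 4)]
exp1_results = [s.results["Exp1"] for s in selected]
```

Because each record keeps a subject's extra data and results together, the parallel lists no longer have to be kept in sync by position.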
