I have the following scenario:
There are several trips, each with several people, and each people has several data, like... idk., height, weight, hair color, and of course passport number, name...
So the point is to retrieve these from several sources, store them, then make statistics.
I'm only interested in speed, the disk usage is irrelevant. I use Java DB (Apache Derby), or I may use anything cross platform through JDBC.
I can think of 2 designs:
- Normal: I create an autoinc surrogate key for the people, and use joins between trips and people tables.
- Simpler: I store the joined data in only one table, for each person the trip datas will be replicated. Key is tripDate/tripName/person.seat.
And here is the problem:
For sequential querying the Normal form is faster, because there is no replicated data about trips, so optimally there are less clusters accessed. Also this way there are 2 lookups, but I think this is irrelevant. But the problem occurs when some data source is slower than others, or they inserts data in random order (like paper based ones), and the data becomes fragmented: I can sort the trips table by date, but the autoinc field in the person table keeps it fragmented. On the other hand, if I create only 1 table, there are large amounts of data replicated, but everything can be sorted by date and searched sequentially.
So which way is the better after all? Or any other idea?
Thanks for helping!