sjvr767:

Hi,

Firstly, I'm neither a programmer nor a web developer, but rather a statistician. I am doing equity-related (shares/stocks) research and I need to put my data in a database (MySQL), since it is too big to work with in a spreadsheet.

I have weekly closing price, volume, market cap, PE etc. data for about 900 companies over a 16-year period. Obviously, I can't throw it all into one table, so I was considering a table per company, with year and week defining the primary key. But a join across 900 tables to find the 20 companies with the lowest market cap in the first week of 1990 would be very cumbersome.

So I'd like to know from anyone here how they would go about designing this database.

First I suggest you sit down with a good book on relational database design ('Database Design For Mere Mortals' by Michael Hernandez and 'MySQL' by Paul DuBois come to mind). We can help you help yourself, but can't do your work for you.

sjvr767:

Oh, I fully understand that you can't do the work for me and I'll have a look to see if I can't get hold of those books.

Was thinking (read "drawing diagrams") that I should maybe do a table per year. Each table would still hold a large number of records (just under 47 000), but fewer than if I dumped everything into one table. Making joins across 16 tables (one per year) should be easier than doing it across 900!

My big concern is the kind of performance I'll get from a table with some 47 000 records... Is that big? Is it too big?

Anyway, I'll experiment and report back on my results (might take a few days). Any further pointers are more than welcome in the mean time.

47000 records is trivial - MySQL can handle billions of records, and many companies use it for large databases and heavy traffic.

I'm not sure what kinds of queries you will be making, but one point to consider in any relational database design is that it is driven by the anticipated queries. Maybe post a few pseudo-queries that you expect to run, and that might prompt some suggestions.

You mentioned some of your data already - 'closing price, volume, market cap, PE'. And 900 companies is also trivial. I would look at holding your various points of interest in one table, along with a company ID. Then you would need a bridge table to tie that to company data in a separate table.
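For what it's worth, a rough sketch of that layout (all names here are placeholders, not a prescription):

-- company details live in their own table
CREATE TABLE company (
    company_id INT PRIMARY KEY,
    share_code VARCHAR(10) NOT NULL,
    full_name  VARCHAR(100) NOT NULL
);

-- one row per company per week; company_id ties each row back to company
CREATE TABLE weekly_data (
    company_id  INT NOT NULL,
    year        SMALLINT NOT NULL,
    week        TINYINT NOT NULL,
    close_price DECIMAL(10,2),
    volume      BIGINT,
    market_cap  DECIMAL(15,2),
    pe          DECIMAL(8,2),
    PRIMARY KEY (company_id, year, week),
    FOREIGN KEY (company_id) REFERENCES company (company_id)
);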

Or you could put all your various points of interest in separate tables (how many of these are you tracking?). Again, bridge tables would be needed to tie these to the various companies.

It depends on how you are going to be using the data.

Indexing can greatly speed up queries too.
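For example, against the sketch above, a composite index lets a query like "the 20 lowest market caps in the first week of 1990" run without scanning the whole table (names again illustrative):

-- index ordered by date first, then market cap
CREATE INDEX idx_year_week_cap ON weekly_data (year, week, market_cap);

SELECT c.share_code, w.market_cap
FROM weekly_data w
JOIN company c ON c.company_id = w.company_id
WHERE w.year = 1990 AND w.week = 1
ORDER BY w.market_cap
LIMIT 20;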

hi,

Storing the data (physically) in 900 different tables seems like a rather bad idea, because you would first have to UNION them all to draw any useful information from them (see the sketch at the end of this post). Aside from that, handling 900 different tables when adding actual data every week isn't exactly convenient. May I suggest you give a meaningful example with concrete data, something like this:

week  clPrice  volume  marketC  PE(?)  Company  ...
1     100      1000    30%      ??     INTG
2     ...
...
10    ...

You should add the further missing but important attributes to that big table. Don't worry about decomposition into smaller tables; that second step is quite simple once one has broadly understood your data. To speed up processing your data you can add some indexes.
btw, 900 x 52 x 16 = 748 800 records isn't that much data for any RDBMS; even MS Access would cope with this task.
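To make the UNION point concrete, here is what merely reading one week would look like with per-company tables (hypothetical names; only two of the 900 branches shown):

-- every query would start by gluing the per-company tables together
SELECT 'INTG' AS share_code, market_cap
  FROM prices_intg WHERE year = 1990 AND week = 1
UNION ALL
SELECT 'ACME' AS share_code, market_cap
  FROM prices_acme WHERE year = 1990 AND week = 1
-- ...and so on for the other 898 companies...
ORDER BY market_cap
LIMIT 20;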

brs, cliff

sjvr767:

Hi,

Thanks, trudge, for all your input. I couldn't get my hands on those specific books, but I stumbled across some tutorials on the net that got me started in the right direction.

While my database isn't exactly 2NF, it is pretty close, with a slight amount of redundancy in order to facilitate the type of queries I'll run. Taking your advice to keep the anticipated queries in mind really helped in the design process.

So, the end result (after much Python scripting to populate the tables) is quite good. I have a table containing the share code (primary key), full name and status (delisted or not) for all the companies in my study. Then I have a table for each year which contains the share code, week, close, volume, cap, pe, dy and year for each company. Share code and week form the primary key. The year column, though somewhat redundant, is there to ease joins across multiple year tables.
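A cross-year query then looks something like this (a sketch; my real table and column names differ slightly):

-- glue the year tables of interest together, then join in the company details
SELECT c.full_name, d.year, d.week, d.cap
FROM (
    SELECT share_code, year, week, cap FROM data_1990
    UNION ALL
    SELECT share_code, year, week, cap FROM data_1991
    -- ...one branch per year table needed...
) AS d
JOIN companies c ON c.share_code = d.share_code
WHERE d.week = 1
ORDER BY d.cap
LIMIT 20;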

Tesuji will notice that the above design looks quite similar to his suggestion. I appreciate his input too, but separating the company details out of the year tables seemed like the logical thing to do. I have also gotten hold of some fundamental data, which is yearly, and keeping the company details in their own table helps when adding that kind of data.

Thanks again for all the help!
