Hi,
I am building a database that will be populated with financial data. Taking into account what the data looks like, how it will be organised, what I am trying to do with it, and my hardware, budget and competency, I am looking for the most efficient design for very fast querying of this data (write performance is not a concern). I am hoping to get some guidance on how to proceed.
What the Data Looks Like
I have collected monthly stock data covering a 25 year period: for every stock in the world, its associated financial data (approximately 50 mostly numeric fields), recorded at a monthly frequency.
e.g.
Microsoft, May 2000 -> associated financial data (50 fields)
Cisco, May 2000 -> associated financial data (50 fields)
Microsoft, June 2000 -> associated financial data (50 fields)
Cisco, June 2000 -> associated financial data (50 fields)
Microsoft, July 2000 -> associated financial data (50 fields)
How the Data will be organised
The data will be organised into one large table. Initially there will be approximately 3 million rows and 50 columns. The database will grow by approximately 15,000 rows per month.
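For reference, a rough sketch of the kind of table I have in mind is below. The table and column names (stock_data, ticker, period_month, country, revenue, etc.) are just placeholders I made up; the real table has roughly 50 mostly numeric fields.

-- Placeholder schema only; the real table has ~50 mostly numeric columns.
CREATE TABLE stock_data (
    ticker        VARCHAR(20)   NOT NULL,  -- stock identifier
    period_month  DATE          NOT NULL,  -- month of the observation
    country       VARCHAR(40),             -- used for region-style filters
    revenue       DECIMAL(20,4),           -- ...plus ~47 more numeric fields
    net_income    DECIMAL(20,4),
    market_cap    DECIMAL(20,4),
    PRIMARY KEY (ticker, period_month)
);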
What I am trying to do with this data
The data will be used in a few different ways (example queries follow the list below):
1. Time series analysis, where I retrieve some or all of the financial data for a stock over the entire 25 year period, e.g. all of Microsoft's financial data at monthly intervals over the 25 years. I may perform this sort of query quite frequently, especially when I request these data arrays for a group of stocks, e.g. some or all of the financial data, at monthly intervals over the 25 years, for stocks A, B, C, D and E.
2. Aggregation of columns, where I aggregate financial data for a group of stocks over their entire 25 year history. For example, I may want to aggregate, at monthly intervals over the 25 years, the financial data of all stocks that meet particular criteria (located in a specific region, above a certain size, in similar types of business, highly profitable, etc.). These queries are likely to be quite dynamic, so I will never know in advance how the data will be aggregated.
3. Same as 1 and 2, but at a single point in time only, e.g. give me all the financial data for stock A as of May 2001, or aggregate the financial data of all Japanese companies in June 2011.
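To make those three patterns concrete, the queries would look roughly like this (using the placeholder table and column names from the sketch above):

-- 1. Time series for one stock, or for a small group of stocks
SELECT * FROM stock_data
WHERE ticker = 'MSFT'
ORDER BY period_month;

SELECT * FROM stock_data
WHERE ticker IN ('MSFT', 'CSCO', 'IBM')
ORDER BY ticker, period_month;

-- 2. Dynamic aggregation over the full history; the filter and the
--    aggregated columns change from query to query
SELECT period_month, SUM(revenue), SUM(net_income), AVG(market_cap)
FROM stock_data
WHERE country = 'Japan' AND market_cap > 1000   -- arbitrary example criteria
GROUP BY period_month
ORDER BY period_month;

-- 3. The same, restricted to a single month
SELECT * FROM stock_data
WHERE ticker = 'MSFT' AND period_month = '2001-05-01';

SELECT SUM(revenue), SUM(net_income)
FROM stock_data
WHERE country = 'Japan' AND period_month = '2011-06-01';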
I would like to be able to perform these queries very, very fast, ideally in milliseconds.
Hardware
Dual-core Xeon server, 3.5 GHz, 3 GB RAM.
Budget
Free open source technology.
Competency
I am a good programmer (Java, C, C++, .NET and PHP), but know little about database design beyond basic MySQL/SQL Server. My database experience is limited to SQL-type databases, but I am willing to try something new if it will serve my purpose.
I have tried to do this before using one large table in MySQL, with both the MyISAM and Infobright storage engines. Infobright is good, but when pulling back time series data across all 50 columns it provides little benefit over MyISAM; when aggregating over 50 columns it is significantly faster, but still takes a few seconds. I have been reading about in-memory solutions but am not sure how best to implement them (naively, 3 million rows × 50 mostly 8-byte numeric values is roughly 1.2 GB of raw data, but I have no idea how much overhead to allow on top of that). I have also been reading about Map/Reduce solutions, but am not sure I would get blazing fast aggregate query results from them, especially since I only have one machine rather than a cluster of nodes. Does anyone have any advice?
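To be concrete about the in-memory part: the closest thing I could guess at is MySQL's MEMORY engine, along the lines of the sketch below (again using the placeholder table from above). I do not know whether this is what people actually mean by an in-memory solution, or whether it would hold up for the aggregation queries.

-- Pure guesswork on my part: copy the table into a MEMORY-engine table
-- so that reads are served from RAM. As I understand it, MEMORY tables
-- are lost on restart and are capped by max_heap_table_size, which I
-- assume would have to be raised well above its default first.
SET SESSION max_heap_table_size = 2147483648;  -- roughly 2 GB

CREATE TABLE stock_data_mem LIKE stock_data;
ALTER TABLE stock_data_mem ENGINE=MEMORY;
INSERT INTO stock_data_mem SELECT * FROM stock_data;

If that is completely the wrong direction, I am happy to be told so.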
Rameez