Hello everyone.

I'm now working on a banner software which allows admins to see full statistics of banner displays, clicks and other information in selected time interval by minutes, hours, days etc.
I want to consult any experts who have experience using large tables with JOIN queries and optiomize my code as much as possible. Also everyone is welcome to suggest different approaches to complete such task.

I decided to store all data in mysql using MyISAM table engine.
I've attached table structure and sample data of 5000 rows in a file.

Table created with query

CREATE TABLE IF NOT EXISTS `sample_stats` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `bannerid` int(11) NOT NULL,
  `url` text NOT NULL,
  `dateline` int(11) NOT NULL,
  `ip` char(255) NOT NULL,
  `user_agent` char(255) NOT NULL,
  `platform` char(255) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `bannerid` (`bannerid`),
  KEY `dateline` (`dateline`)
) ENGINE=MyISAM  DEFAULT CHARSET=latin1 PACK_KEYS=1 ROW_FORMAT=FIXED ;

I have decided to add indexes to columns bannerid and dateline as they are heavily used in all queries getting statistics data.

I found that the best way to get data from the table for different time periods is self joining strategy. So I'm generating SQL in php with as much joins as it's necessary to get all required data.

For example, query to get data from 13 May 2009 to 16 May 2009 will be

SELECT 	COUNT(DISTINCT stat0.id),
		COUNT(DISTINCT stat1.id),
		COUNT(DISTINCT stat2.id)
FROM sample_stats AS stat
LEFT JOIN sample_stats AS stat0 ON (
	stat0.dateline>=1242172800
	AND
	stat0.dateline<1242259200)
LEFT JOIN sample_stats AS stat1 ON (
	stat1.dateline>=1242259200
	AND 
	stat1.dateline<1242345600)
LEFT JOIN sample_stats AS stat2 ON (
	stat2.dateline>=1242345600 
	AND 
	stat2.dateline<1242432000)
WHERE stat.id=(SELECT MIN(id) AS id FROM sample_stats)

This code on table with ~26 000 000 rows and 10 000 rows returned for each JOIN takes 50-60 minutes to complete.

EXPLAIN shows that all joins are using keys and range scan for dateline field.

Profiling shows that founding rows and all other operations getting done quite quick and the main time takes the stage "Sending data". So i tried to remove any selected data (COUNT(DISTINCT stat0.id) ...) from query and in that case query takes milliseconds even with tens of joins (maximum is 61 as I can see from my experience).

The sub-query

WHERE stat.id=(SELECT MIN(id) AS id FROM sample_stats)

added to limit the number of scaned rows in stats table (main table in from clause) to 1, otherwise it would scan all 26 000 000 rows in this table and join to all rows. That greatly improved performance of this query. As you can see I really do not need any data from table stats, it is only used as a fake table to join over tables to it and get rows for date intervals.

So this is all I am doing now and it works quite well for not such huge tables and small count of joins (intervals) and here my questions not so clear for me so far

1. How "Sending data" phrase of query can be improved. It seems that COUNT(DISTINCT ..) is too slow to get count of unique matched rows. I have read about filtered insert (using temporary table and insert into data using unique key and getting row count in that table) but I don't now enough about it to test and see if it's fasted than COUNT function on indexed column. Any ideas?

2. How else I can force engine not to scan main table without using subquery. Or may be any way to use fake (unreal) table instead if table with millions of rows?

3. At last is this really good to use multiple joins instead of separate simple queries for each time interval. So far it seems that with small count of rows (tens of thousands) joining is much quicker VS separate queries. But with growing number of rows joining gets slower and slower (but as I said founding rows for joins still is quite quick, but returning data is too slow). So may be any check for row count before deciding how much joins can be used?

4. I tried to benchmark my queries using timers in my PHP script but it seems that mysql heavily caches queries. The same query run once again takes only ~0.2 seconds. Tried to disable cache using

query_cache_type=0
query_cache_size=0

and changing my data (inserting and deleting rows) but the query is still to fast. As mentioned in MySQL documentation query cache is used only for exact the same query and only if table is not updated. But even changing query to something like

SELECT 1 AS a, COUNT...

did not take much that ~0.3 seconds to complete. So can any one suggest any optimal way to benchmark queries>

5. I'm adding keys in table creation query and inserting random data in it using PHP. phpMyAdmin shows that a lot of table data is in indexes(~25%) and from this I assume that index data is updated on every INSERT query and there is no need to do additional ANALYZE TABLE to populate index data. Is it right or something happens to index data during first query which is the main reason that over queries run much faster than first one?

6. Searched a lot connected with

PACK_KEYS=1 ROW_FORMAT=FIXED

but didn't realized is their usage performance gain or bottleneck?

That's all so far. Any suggestions are highly appreciated. Any questions connected with this topic are welcome.

Thanks.

Hello.
Sorry for double posting but I did a lot of changes in table scheme (thanks to Rick James at forums.mysql.com) and moved a lot of data to separate tables and normalized data. Here is my table definition now

CREATE TABLE `sample_stats` (
 `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
 `banner_id` int(11) unsigned NOT NULL,
 `location_id` smallint(3) NOT NULL,
 `url_id` int(11) unsigned NOT NULL,
 `page_id` int(11) unsigned NOT NULL,
 `dateline` int(11) unsigned NOT NULL,
 `ip` int(11) unsigned NOT NULL,
 `browser_id` smallint(3) unsigned NOT NULL,
 `platform_id` smallint(3) unsigned NOT NULL,
 PRIMARY KEY (`id`),
 KEY `bannerid` (`banner_id`),
 KEY `dateline` (`dateline`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 PACK_KEYS=1 ROW_FORMAT=FIXED

I've attached new sample data of 5000 rows in a file.

Thanks.

Hi.
Sorry for triple-posting this time :).

I just wanted to say that I've changed my main query to get data using sub-queries and not joins. It seems that mysql has problems with joining large tables.
I changed my query to something like

SELECT
   ( SELECT COUNT(*) FROM stat WHERE dateline ... ) AS count_1,
   ( SELECT COUNT(*) FROM stat WHERE dateline ... ) AS count_2,
   ...

and it solved my performance issue even for tables with much more data.

Anyway thanks to all who have thought on my problem and tried to solved it.

Bye.

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.