Let’s suppose that you have log files of some sort pouring in and you want to put aggregate data representing the logs into an RDBMS. To begin, let’s start with a blank slate, i.e., just dumping the data in. And let’s have a simple table, which in MySQL is created via
CREATE TABLE `history` (
`id` int(11) NOT NULL auto_increment,
`hits` int(11) NOT NULL,
PRIMARY KEY (`id`)
);
I did a pass each for both MyISAM and InnoDB with a million inserts.
engine | values per insert | seconds (lower is better)
--- | --- | ---
MyISAM | 10000 | 7.046952963
MyISAM | 1000 | 7.342753172
MyISAM | 100 | 8.521313906
MyISAM | 10 | 31.44731498
MyISAM | 1 | 135.3045712
MyISAM | load data infile | 4.927606106
InnoDB | 10000 | 19.76374817
InnoDB | 1000 | 30.58060002
InnoDB | 100 | 89.54839206
InnoDB | 10 | 723.135994
InnoDB | load data infile | 17.25715899
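The load data infile rows use MySQL’s bulk loader rather than INSERT statements. A minimal sketch of what that looks like, loading into history just for illustration (the file path and tab-delimited format are assumptions, not the exact command from the benchmark):
LOAD DATA INFILE '/tmp/hits.tsv'  -- hypothetical path to a tab-separated dump of the log data
INTO TABLE history
FIELDS TERMINATED BY '\t'
(`hits`);  -- id is auto_increment, so only hits is loaded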
A multi-value insert for three values looks like this:
INSERT INTO today (hits) VALUES (?), (?), (?)
Then I execute it with the three values bound to the placeholders.
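With concrete values bound to the placeholders (the numbers here are just made up), the statement the server ends up running is equivalent to:
INSERT INTO today (hits) VALUES (12), (7), (3);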
The fact that inserts with 1000 values start to approach the load data infile numbers is a little compelling. But let’s suppose we want to do every insert via a bulk load while still keeping a table (like history above) that holds aggregate data, += style. Is it possible? Sure.
Here is one approach for MySQL (a full sketch of the sequence follows the list):
- Create a temp table, which I will call today
- Bulk load the data into today
- Run the query INSERT INTO history (SELECT * FROM today) ON DUPLICATE KEY UPDATE history.hits = history.hits + today.hits;
- Drop today
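Putting the four steps together, a minimal sketch might look like the following. The temp-table columns, the input file path, and the tab delimiter are assumptions for illustration; the INSERT … ON DUPLICATE KEY UPDATE statement is the one from step 3.
-- 1. create a temp table shaped like the incoming aggregate rows (schema assumed)
CREATE TEMPORARY TABLE today (
`id` int(11) NOT NULL,
`hits` int(11) NOT NULL
);
-- 2. bulk load the data into today (path and delimiter are hypothetical)
LOAD DATA INFILE '/tmp/today.tsv'
INTO TABLE today
FIELDS TERMINATED BY '\t'
(`id`, `hits`);
-- 3. fold today into history, += style on duplicate ids
INSERT INTO history (SELECT * FROM today)
ON DUPLICATE KEY UPDATE history.hits = history.hits + today.hits;
-- 4. drop today
DROP TEMPORARY TABLE today;
Since today is a TEMPORARY table it would also disappear on its own when the session ends, so the explicit drop is mostly housekeeping.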
I would like to apply this strategy and contribute some Pig code that allows for bulk insert. This would (I think) allow for some pretty large-scale aggregating, all from within a “simple” Pig script. I would also like to start using Chukwa, but it looks a little tough. I think the architecture would then look something like
web servers -> Chukwa -> Pig -> MySQL
I think I would then be pretty much at Yahoo! or Facebook scale.
Guess we’ll see how it all goes 🙂
Enjoy!
Earl