cd $HOME

DISTINCT optimization

One must not underestimate the importance of the type of the index used for the queries.

I am currently working on a project dealing with graph analysis. The database schema is pretty straightforward, the graph is represented by a table with two columns. Each of them is a node and a row is a (directed) edge between these two nodes.


mysql> SHOW CREATE TABLE edges\G
*************************** 1. row ***************************
       Table: edges
Create Table: CREATE TABLE `edges` (
  `u` int(11) NOT NULL,
  `v` int(11) NOT NULL,
  PRIMARY KEY  (`u`,`v`),
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.02 sec)

Recall that the choice of the engine is important since MyISAM supports both BTREE and RTREE indexes whereas InnoDB supports only BTREE indexes.

Let’s try to count the number of different nodes

mysql> SELECT count(distinct u) FROM edges;
+-------------------+
| count(distinct u) |
+-------------------+
|             27743 |           
+-------------------+
1 row in set (2 min 0.24 sec)
MySQL will actually sort the entire table to compute the DISTINCT count. The I/O cost depends on the number of buffer pages available but this computation will results in a couple of thousands I/Os… You can speed up your query by adding a (unclustered) BTREE index on u. MySQL will just have to read the pages in the index (which are already sorted) and won’t scan the entire file anymore:

mysql> CREATE INDEX u_idx USING BTREE ON edges (u);
Query OK, 10510515 rows affected (1 hour 19 min 5.82 sec)

mysql> SELECT count(distinct u) FROM edges;
+-------------------+
| count(distinct u) |
+-------------------+
|             27743 |
+-------------------+
1 row in set (52.68 sec)

Recall that you add overhead to your INSERT because MySQL will from now on have to update the BTREE.

Back

Comments



The content of this field is kept private and will not be shown publicly.