Tags

, , , ,

MySQL Cluster does an automatic sharding/partitioning to the tables across data nodes, enabling databases to scale horizontally to serve read and write-intensive workloads, but what is the default sharding key used in partitioning the data?
According to the recent update (Oct, 2016) of the MySQL Cluster white paper, primary key is the default sharding key:

By default, sharding is based on hashing of the primary key, which generally leads to a more even distribution of data and queries across the cluster than alternative approaches such as range partitioning.

However, that is not the case in all MySQL Cluster versions so far!
In this post, I’ll do some test cases on MySQL Cluster (of 4 datanodes) to confirm the default sharding key.

Testing on MySQL Cluster 7.2.26

Creating a simple table of one column (as primary key), insert a single value in that table and check which partition(s) will be used to retrieve that value:
mysql> CREATE TABLE shard_check_pk (id int(11) primary key) engine=ndbcluster;
Query OK, 0 rows affected (0.38 sec)
mysql> INSERT INTO shard_check_pk values (1);
Query OK, 1 row affected (0.01 sec)
mysql> EXPLAIN PARTITIONS SELECT * FROM shard_check_pk WHERE id=1;
+----+-------------+-------------+------------+--------+---------------+---------+---------+-------+------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------------+--------+---------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | shard_check | p3 | eq_ref | PRIMARY | PRIMARY | 4 | const | 1 | |
+----+-------------+-------------+------------+--------+---------------+---------+---------+-------+------+-------+
1 row in set (0.00 sec)

If the PK is the sharding key, then a PK value lookup will be checked in only one partition and – according to the previous explain output – that was the case already in this example, a single id value (id=1) does only require scanning one partition, ”p3”.

What is the case then if we didn’t specify a PK for a table in MySQL Cluster?

mysql> CREATE TABLE shard_check_no_pk (id int(11)) engine=ndbcluster;
Query OK, 0 rows affected (0.19 sec)
mysql> INSERT INTO shard_check_no_pk values (1);
Query OK, 1 row affected (0.00 sec)
mysql> EXPLAIN PARTITIONS SELECT * FROM shard_check_no_pk WHERE id=1;
+----+-------------+-------+-------------+------+---------------+------+---------+------+------+-----------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+------+---------------+------+---------+------+------+-----------------------------------+
| 1 | SIMPLE | shard_check_no_pk | p0,p1,p2,p3 | ALL | NULL | NULL | NULL | NULL | 2 | Using where with pushed condition |
+----+-------------+-------+-------------+------+---------------+------+---------+------+------+-----------------------------------+
1 row in set (0.00 sec)

As we can see in the previous explain plan, all partitions in the table (p0, p1, p2 & p3) will be scanned to retrieve a single id value (id=1). The reason for that is because MySQL Cluster creates a hidden column to be the sharding key on tables without PKs.

Up to here, the results are as expected but in the latter MySQL Cluster versions (7.3, 7.4 and 7.5), the explain plan has different output!

Testing on MySQL Cluster 7.5.4 (same results on 7.3 and 7.4):

mysql> CREATE TABLE shard_check_pk (id int(11) primary key) engine=ndbcluster;
Query OK, 0 rows affected (0.38 sec)
mysql> INSERT INTO shard_check_pk values (1);
Query OK, 1 row affected (0.01 sec)
mysql> EXPLAIN SELECT * FROM shard_check_pk WHERE id=1;
+----+-------------+-------------+-------------+--------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+-------------+--------+---------------+---------+---------+-------+------+----------+-------+
| 1 | SIMPLE | shard_check_pk | p0,p1,p2,p3 | eq_ref | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL |
+----+-------------+-------------+-------------+--------+---------------+---------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)

Starting from MySQL Cluster 7.3, the explain plan shows as the hidden column is always used as the default sharding key even on a table that has a PK (check my bug report about this case Bug #84374) which is not true according to the ndb_desc tool! Thanks Maurits for the correction.
If we specified the sharding key explicitly, the output will be corrected:

mysql> ALTER TABLE shard_check_pk PARTITION BY KEY (`id`);
Query OK, 1 row affected (1.10 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> EXPLAIN SELECT * FROM shard_check_pk WHERE id=1;
+----+-------------+-------------+------------+--------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+--------+---------------+---------+---------+-------+------+----------+-------+
| 1 | SIMPLE | shard_check_pk | p3 | eq_ref | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | NULL |
+----+-------------+-------------+------------+--------+---------------+---------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)

Conclusion:

  1. Primary key is the default sharding key in MySQL Cluster 7.2 and a hidden column will be used if no PKs defined for a table.
  2. Until the bug got fixed and starting from MySQL Cluster 7.3, the hidden column is always the default sharding key even on a table that has a PK The PK is also the default sharding key but the explain plan shows as a hidden column is always used instead even on a table that has a PK!