Having big tables is one of the expected database problems, especially, for the fast growing database systems. In fact, big tables itself is not a problem but with big tables, the following problems are strongly expected:
- Retrieving data from big tables is so slow.
- It is a very hard job to maintain those tables like adding/removing an index, adding/dropping/modifying a column, … etc.
- System resources, especially, the IO system might not be able to handle such huge traffic of writes and reads.
- When it comes to the reporting queries, it might be a horrible nightmare!
- Always cause disk space problem.
All the above problems will show up the need for scaling! So, let’s check out what are the possible solutions for that problem:
- MySQL Partitioning: Is a good solution but we will still face disk space and server resources problems.
- MySQL Replication: Actually, MySQL replication is good for read scaling not write scaling which means only retrieving data and reporting queries problems can be solved but not the other ones!
- Galera Cluster for MySQL: Mainly for High Availability and also read scaling but not for writes as data will be available (replicate) on all nodes!!
- MySQL Sharding: In most cases, this is the most suitable solution for such problems and here I’ll talk about a promising storage engine called Spider for MySQL as a sharding solution.
Note: MySQL Cluster is one of the MySQL sharding solutions and can be used to solve such problems with specific circumstances and conditions.
What is sharding?
Sharding is splitting up (or horizontally partition) big tables among several isolated DB instances, each partition or DB instance called a database shard!
The problem of sharding (not all sharding solutions) is that a logic or a decision point is required between the application and the shards to determine which shard(s) should be selected for each query!
Now, what is Spider for MySQL?
Spider is a Storage Engine (SE) depends on XA transactions and partitioning which allows MySQL server to treating tables on different MySQL instances as they are only one table on a same instance!
Still not clear?! Let’s check the following figure:
With Spider for MySQL, we can divide a big table among multiple database instances (having the normal MySQL binaries) and from another MySQL instance (having Spider plugin) we will have – let’s say – a virtual table pointing to the tables in the other instances. The application will only need to connect to the Spider instance and will be able to get the required data from the other ones – through Spider – without the need to connect to them directly or even care about the logic for choosing which server(s) or shard(s) for each query.
Sounds interesting, right?! let’s check the following example for deeper dive:
If we have a big table called “logs” containing data for years 2012, 2013, 2014 and 2015 then we can divide that table among four MySQL instances by storing the data for years 2012, 2013, 2014 and 2015 on servers A, B, C and D respectively. Then we will create a mapping table on the Spider instance pointing to the other ones using MySQL partitioning specifying for each partition the connection information for the server in which will be used when hitting that condition.
If for ex. a query on the spider wanted to get the data for the year 2013 (hitting partition2 condition) then the host information for server B will be selected and the data will be retrieved from server B.
By the way, this is not only for selects but also for DMLs (insert, update and delete statements) the same rule will be used.
Advantages and disadvantages of Spider SE
- Divide the load among many database servers (scaling reads and writes) and processing in parallel.
- Independent from the application (storage layer).
- Easy to extend. When introducing a new host, a new partition needs to be added to the spider table.
- Creating non Spider SE tables is also possible on the Spider bundled server.
- If a shard failed, the application won’t be affected except for those queries which targeting that failed shard.
- Spider SE was introduced in MariaDB starting from version 10.0.4.
- A slave for the Spider server can be created to have all data in one server (XA transaction must be enabled).
- Spider supports SSL.
- Query cache and FULL TEXT indexes are not supported in Spider tables. However, FULL TEXT indexes can be added on the individual backend tables.
- Physical backups (like Xtrabackup) won’t backup the data on the spider tables as the physical data is stored on the shards (shards can be backed up individually).
- Spider server is a single point of failure (we might need to make it High Available by duplicating it and using VIP on the top of them).
- Depends on MySQL Partitioning, which means choosing wrong partition key or condition might kill the performance.
- Also as a result of using MySQL Partitioning, foreign keys are not supported.
Spider SE is not ready for production yet but at least its a promising Engine.
Spider SE for MySQL is a good solution for MySQL sharding, if you face big tables problem and you are evaluating MySQL sharding. Just give it a try!
In the next post, I’ll go through a simple example on how to implement Spider for MySQL.
If you want to have a slave you probably want to have it configured in the same way as the spider master, the slave also using spider to remotely access each slave shard. A colleague tried that some time ago and found issues. I need to try and reproduce that if I get some free time but making spider replication safe or confirming it is would be most interesting. It’s a project in an early stage of development but one with a lot of interesting possibilities for those people whose system grows to a point that scaling is needed, but sharding at the application level might be rather expensive. This is available in MariaDB and it would also be nice to see it in the community MySQL release too.
I think it depends on the purpose of having a slave for spider server.
If the slave is required to have all data collected in one server for the backup purpose then Spider SE should not be used and the actual SE used in the shards should be used instead. In other words, tables should be created on the slave using the same structure like it is on the spider but having InnoDB (for ex.) as SE not Spider.
While if a slave is required as a redundant node for the Spider master to provide HA in case of the Spider master failed then it should use the same structure like the spider master having spider SE used for those tables.
BTW, I’ve tested the first case already and indeed, all data was physically replicated (from the spider server’s binlogs) to the slave while no actual data is there on the spider master but the shards!
And yes, I like the spider idea as it is less expensive from the application point of view and having it ported to MariaDB is an advantage over MySQL.
Pingback: Spider for MySQL – Implementation | MySQL Step-by-Step Blog