No, not SHARED lessons, I mean SHARD lessons. I have to admit that until about a year ago I didn't really know the term shards in relation to databases. Now don't confuse that with not understanding how databases can be horizontally scaled. I was introduced to that concept and helped to define the various ways it can be done but we just called it splits. Regardless of what you call it, there are some interesting challenges that are introduced. The well known challenges of consistency are discussed ad nauseam, even by me, so I'm not going there with this article. But besides that, there are some other lessons to learn when applying the pattern to your data.
Lesson 1: Right Size Your Shards
Sounds like a fast food commercial when I put it in those terms but the idea is actually the same. Determining the initial number of shards can be tricky. You don't have infinite resources and no matter how good your tools are, more shards are more problematic to manage than fewer. Yet you also want to select a number that going to last you for a while. I would recommend picking a number that will give you 18-24 months of growth with a margin for safety.
Simple enough, right? Well, not quite yet. I also wouldn't start with a single digit number of shards even if that will last you for two years. Why start with 4 when the incremental overhead of 12 is manageable. You allow a longer growth path and by hosting multiple databases on the same physical machine, you limit your hardware expenses. Of course then why not 24 or more? Well, at some point you add unnecessary overheads and get diminishing returns.
Lesson 2: Use Math on Shard Counts
If you notice the numbers I selected in the previous lesson, they are multiples of 12. Why that and not multiples of 10 or 7 or just whatever number happens to be your lucky number? Well, let's say you are going to run multiple shards per physical machine. You're a small organization. You pick 10 as your number of shards and spread it on two machines. Five each, things are perfect. Now you find your growth means you need more capacity. Would be nice to add one machine, but that leaves you with a 3, 3, 4 split and one machine has 33% more load than the other two. In fact, your next physical growth is 5, which may not be terrible if you are running your databases on commodity hardware, as you should!
But had you started with 12, you find your growth options to have fewer cost step functions. You start with 2 boxes, 6 shards each. Move to 3 with 4 shards, then 4 with 3 shards, and finally 6 with 2 shards. If you had projected 12 shards would allow you to scale for 2 years, you can see how you are able to grow your hardware requirements in a smoother fashion than with 10 shards.
Lesson 3: Carefully Consider the Spread
The way you spread the data across the shards needs to be determined based on what you are hoping to optimize. Certainly the simplest strategy is a uniform distribution which can be achieved using simple modulo math on the primary access key. This approach requires very little to implement and works well when the primary goal is to evenly distribute the load and you have very few access paths.
There may be times however where you want different locality policies. For example, you may want to keep all members of a group on the same shard. This can improve performance for applications that tend to operate upon a group. It also introduces challenges in balancing loads as some groups may be larger or more active than others. The benefits of clustering by a group must be weighed against the challenges of balancing work loads.
Lesson 4: Plan for Exceeding Your Shards
No matter what number you pick, the possibility of reaching capacity of your databases is very real. You selected a shard pattern to allow for a scale out strategy. So as you reach the limits of your current hardware you really don't want to shift to s a scale up on those shards. That's why it is important to bake into your shard strategy your scale out plan.
From Lesson 3, if you are clustering by group, adding more hosts to support new groups is probably straightforward. This helps when your scaling challenge is a growth in the number of groups and not just a few groups growing larger. Shards that cluster based on a group will probably need to support a scheme to migrate groups to allow capacity to be more readily managed.
Uniform spreads can be expanded through another trick of math. Each shard can be treated as a root node in a shard tree. When it is time to grow the capacity, another layer of shards can be added to grow capacity geometrically. For simplicity, let's say you originally started with 4 shards. You calculate the shard number with the following formula:
shard = key % 4
Now you need to scale. Let's say we will expand from 4 to 16 hosts. The technique for computing the root shard and final shard is:
root = key % 4
leaf = (key / 4) % 4
You now have 4 * 4 or 16 possible shard combinations using root.leaf as the shard identifier. Furthermore migration can be done online simply by following the scheme of checking root.leaf first and if you do not find your record, check root. Appropriate locking mechanisms will obviously be required by migration scripts to insure that no data is lost or corrupted. And this technique does require a temporary increase in the number of database instances to (old shard + new shard) hosts (in this example, 20 hosts total).
Lesson 5: Shard Early and Often
No matter how disciplined you think or hope your team is, if you give them a monolithic schema, and tight deadlines, they will eventually create a mission critical query that assumes two rows are on the same physical host. And, in my experience without fail, those two rows can't possibly live on any sane shard spread strategy. So if you have any hint at all that you will need to scale out a schema, do it early. The longer you wait, the more your application will depend upon a single schema instance and the harder it will be to migrate to a shard schema.
Care to share lessons you have learned from shard patterns? I am sure there are more that are worth knowing!