Sharding Techniques

Introduction

Sharding is a technique in which a database table is split across multiple database servers to optimize read/write performance.

Benefits include:

  • Optimized query time
    • Instead of having one huge database table, you have multiple smaller tables in more than one machine. Smaller tables = faster lookup performance.
  • Reduce load on a single machine
    • Too much load on a single machine can cause the machine to crash or have bad performance. Splitting up the load into multiple machines helps prevent this.

Challenges include:

  • Management of sharding, especially in RDBMS where ACID compliance is desired
  • Can be complicated to implement

In this article, I cover three sharding techniques to improve the read/write performance of databases.

Horizontal Partitioning

Partitioning is simple and straightforward to implement, but not well suited to long-term use.

At its core, partitioning applies a function that takes the hash of a database row's key (e.g. the primary ID) modulo the number of machines \(N\) available for sharding:

\(hash(key)\;\%\;N\;=\;D\), where \(D\) is the machine to store the key-value of the database row.
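As a minimal sketch of the formula above (the function and key names are illustrative, not from any particular database):

```python
import hashlib

def partition(key: str, num_machines: int) -> int:
    """Hash-mod partitioning: hash(key) % N picks the machine index D."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_machines
```

For example, `partition("user:42", 4)` always returns the same machine index for the same key, but changing `num_machines` reshuffles almost every key.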

Pros:

  • Simple to implement

Cons:

  • Things fall apart when the number of machines \(N\) increases or decreases: the partitioning function must be updated, and data on the existing machines needs to be remapped to account for the new layout.
  • Because the modulo ignores each machine's current load, a machine that is already a hotspot can keep receiving a disproportionate share of keys, which is not ideal.

Dynamic Sharding

Dynamic Sharding is implemented by having an external service that keeps track of which machine is responsible for storing each range of database keys.

Example:

{
    0-100: MachineA
    101-200: MachineB
    201-300: MachineC
    301-400: MachineD
    ...
}

This approach is useful because it gives more control over which machines are assigned to which database keys. The key ranges can be updated at will, and if a machine goes down, its entries can be removed from the mapping above.
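A minimal sketch of such a lookup service, assuming the range table from the example above (the class and method names are hypothetical):

```python
import bisect

class ShardDirectory:
    """Maps inclusive key ranges to machines, like the table above."""

    def __init__(self, ranges):
        # ranges: list of (upper_bound_inclusive, machine), sorted by bound
        self.bounds = [bound for bound, _ in ranges]
        self.machines = [machine for _, machine in ranges]

    def lookup(self, key):
        # find the first range whose upper bound covers the key
        i = bisect.bisect_left(self.bounds, key)
        if i == len(self.bounds):
            raise KeyError(f"no shard covers key {key}")
        return self.machines[i]

    def remove_machine(self, machine):
        # drop a dead machine's entries; its keys fall through
        # to the next range in the table
        keep = [(b, m) for b, m in zip(self.bounds, self.machines)
                if m != machine]
        self.bounds = [b for b, _ in keep]
        self.machines = [m for _, m in keep]
```

With the table above, `ShardDirectory([(100, "MachineA"), (200, "MachineB"), (300, "MachineC"), (400, "MachineD")]).lookup(150)` returns `"MachineB"`.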

Pros:

  • No need to change a partitioning function when \(N\), the number of machines, increases or decreases; the mapping is updated in one place, giving more control.

Cons:

  • Separate service, separate host: the lookup service becomes a single point of failure. If its host dies, clients can no longer locate their shards. The host can be replicated to backup hosts to combat this.
    • While a backup host is serving traffic in place of a failed primary, it is safer to disable writes to the mapping and allow only read access.

Shoutouts:

  • This is similar to how MapReduce is implemented, where there is a controller entity that manages which machines are in charge of processing certain chunks of a large file.

Consistent Hashing

Consistent Hashing ensures that when a new machine is added, some of the rows from another machine are transferred to the new machine. Conversely, if a machine is removed, the rows on the removed machine are redistributed to another machine. Traditional consistent hashing uses a ring-based model and a hashing function (similar to partitioning) to determine where on the ring a new machine is placed; a portion of the database rows is then transferred from its ring neighbor.
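A minimal ring sketch under some simplifying assumptions (MD5 as the ring hash and one position per machine; real systems typically place each machine at many virtual positions to even out load):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # position on the ring, derived from an MD5 digest
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Each key belongs to the first machine clockwise from its hash."""

    def __init__(self, machines):
        self.ring = sorted((ring_hash(m), m) for m in machines)

    def add(self, machine):
        bisect.insort(self.ring, (ring_hash(machine), machine))

    def remove(self, machine):
        self.ring.remove((ring_hash(machine), machine))

    def lookup(self, key: str) -> str:
        # first ring position at or after the key's hash, wrapping around
        i = bisect.bisect(self.ring, (ring_hash(key), ""))
        return self.ring[i % len(self.ring)][1]
```

Adding a machine only moves the keys that fall between it and its ring predecessor; every other key keeps its owner, which is the graceful remapping described above.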

Pros:

  • A much more graceful way to handle adding/removing of new hosts
  • Each time a machine is added or removed, remapping with consistent hashing involves only the neighboring machines, whereas remapping in the partitioning method touches ALL of the machines.
  • Commonly used

Cons:

  • Load distribution can still be uneven

Shoutouts:

  • There are different variations out there, like Google's Jump Hash, depending on your data storage needs.
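For reference, Jump Hash (the jump consistent hash published by Lamping and Veach at Google) fits in a few lines. It maps a 64-bit integer key to a bucket without storing a ring at all, at the cost of only supporting bucket removal from the end of the range:

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: when num_buckets grows from N to N + 1,
    a key either keeps its bucket or moves to the new bucket N."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, masked to stay in range
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b
```

Because only a \(1/(N+1)\) fraction of keys move when a bucket is added, it gives the same minimal-remapping guarantee as the ring without any per-machine state.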