For almost any organisation, the daily influx of data is impossible to avoid, and more data means more opportunities for growth. However, this increase can result in a data management crisis, a challenge businesses face when they collect large amounts of data from reviews, surveys, and customer feedback. Issues may also arise when fetching data from the resulting database to address specific needs. Imagine a hospital with thousands of files stacked in one long pile. If the data is not properly managed, it will be difficult to locate a specific record, such as the file of Joe Goldberg, who was born on August 15th in Atlanta.
This raises the question: how do we put this influx of data to work for business growth while also managing it properly in a traditional relational database? The solution is not to stop collecting data; rather, it is to become more strategic about how it is managed. We require robust, organised databases that can cope with the load. This allows businesses to harness the true power of their data and transform it from a jumbled mess into a resource.
In this article, we will look at two effective techniques for managing and handling data. We will also discuss their advantages and how they compare to one another.
In today’s world, it is commonly said that data is king, but how do we rule this kingdom without getting lost in the information jungle? This is where the concepts of data partitioning and sharding come in. They are introduced with some analogies, as follows.
Let’s say I’m Joe Goldberg, a bookworm, and I’m on the hunt for a book about criminal minds. Okay, let’s not get too creepy; I need a book on history. Rather than aimlessly strolling through the library, I can just go straight to the librarian, who will guide me to the “History” section – boom! A whole section just for me—all about history! That is data partitioning in action: dividing your data into sections for easy retrieval.
On the other hand, consider a library divided into rooms, each dedicated to a different subject, such as history, science, or literature. If you want a history book, the librarian mentioned earlier will direct you to the “history room.” This configuration reduces congestion and makes it easier to locate what you need quickly. This is similar to data sharding, which involves partitioning data across multiple databases or servers, each handling a subset of the data. This configuration improves the speed and efficiency of data retrieval.
So, partitioning is essentially sorting your bookshelf by genre, while sharding is having multiple bookshelves in different rooms, each with its own theme.
A logical table is the table as your queries see it: data organised into named columns, each with a data type. Logical tables serve as the foundation of your database, allowing you to easily organise and use data in your queries. Partitioning keeps this logical table intact while splitting its rows into smaller physical pieces behind the scenes.
The example discussed in this section creates an SQL table and adds partitions to it, using MySQL syntax.
In this example:
● We create a table named sales with columns that include id, product_name, and sale_date.
● We define a partitioning scheme that splits the table by ranges of years based on the YEAR function applied to the sale_date column.
● We then create two initial partitions, p_2023 and p_2024, to store data for those respective years.
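As a rough sketch of what the partition statements can look like, the helper below builds the two ADD PARTITION statements as strings. The helper name add_partition_sql is hypothetical, and the sketch assumes that sales was already created with PARTITION BY RANGE (YEAR(sale_date)), since MySQL only accepts ADD PARTITION on a table that is already partitioned.

```python
def add_partition_sql(table: str, year: int) -> str:
    """Build one MySQL-style ADD PARTITION statement for a yearly range partition."""
    # Rows whose YEAR(sale_date) is below year + 1 fall into this partition.
    return (
        f"ALTER TABLE {table} ADD PARTITION "
        f"(PARTITION p_{year} VALUES LESS THAN ({year + 1}));"
    )


# Table and partition names follow the example in the text.
statements = [add_partition_sql("sales", year) for year in (2023, 2024)]
for statement in statements:
    print(statement)
```

Generating the statements from the year keeps the partition name and its upper bound in step, which is an easy place to make a mistake when writing them by hand.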
Let’s analyse this code line by line.
This code establishes a new table named “sales” within the database. The table is defined with three columns:
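● id: This column is an integer (INT) and acts as the primary key. Primary keys ensure each row has a unique identifier. Note that MySQL requires the primary key of a partitioned table to include every column used in the partitioning expression, so in practice the key would also have to cover sale_date.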
● product_name: This column stores product names as strings (VARCHAR) with a maximum length of 255 characters.
● sale_date: This column captures the date of each sale using the DATE data type.
The NOT NULL constraint specified after each column definition enforces that these columns cannot contain missing values. Finally, the ENGINE=InnoDB clause indicates that the InnoDB storage engine, a popular choice for MySQL tables, will be used for this table.
These lines establish two initial partitions within the existing sales table structure. The ALTER TABLE sales command initiates the modification. Following this, ADD PARTITION creates each new partition. You can assign descriptive names to these partitions, as done here with p_2023 and p_2024. Critically, the VALUES LESS THAN clause defines the range of years each partition will hold. Rows where the year extracted from sale_date falls below 2024 will be stored in p_2023. On the other hand, rows with a year less than 2025 (but at least 2024) will be placed in p_2024. This approach effectively sorts the sales data based on the year it occurred.
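To make the VALUES LESS THAN rule concrete, here is a small model of how a row's year selects its partition. It is only an illustration in plain Python, not how a database engine is implemented; the function name partition_for and the two lists are assumptions made for this sketch.

```python
from bisect import bisect_right
from datetime import date

# Upper bounds mirror VALUES LESS THAN (2024) and VALUES LESS THAN (2025).
UPPER_BOUNDS = [2024, 2025]
PARTITION_NAMES = ["p_2023", "p_2024"]


def partition_for(sale_date: date) -> str:
    """Return the partition that would hold a row with this sale_date."""
    # bisect_right finds the first bound that is strictly greater than the year.
    index = bisect_right(UPPER_BOUNDS, sale_date.year)
    if index == len(UPPER_BOUNDS):
        # No bound is high enough, so no partition can accept the row.
        raise ValueError(f"no partition accepts year {sale_date.year}")
    return PARTITION_NAMES[index]


print(partition_for(date(2023, 6, 1)))
print(partition_for(date(2024, 1, 15)))
```

Notice that any year below 2024, not just 2023, ends up in p_2023, because the first partition only has an upper bound.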
A partitioning key is best described as a label attached to every database section. It’s the headline that gives meaning or definition to a section.
Think back to our library analogy—a library with all the books neatly organised. How do you identify which books belong to science, art, history, or that juicy romance section? A label!
The partitioning key works the same way for your data sections. It labels each one so that the database knows exactly where to start looking when you issue a query.
Choosing the right partitioning key is crucial for efficient querying. Think of it as the primary search term in your database catalogue. You could define “region” as the key for geographically organised data, making it easier to filter data based on specific areas. Or you might choose “timestamp” for data organised by date and time, or “customer ID” when each customer’s records should be kept together.
The partitioning key acts as a filter, directing the database engine to the exact partition where your desired data resides. By leveraging these labels, your queries become laser-focused, diving straight into the relevant data partition and delivering results in a flash.
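The filtering effect can be sketched with a few in-memory rows. The sample data and the helper names build_partitions and query_by_key are hypothetical; the point is only that a query on the partitioning key reads one labelled section instead of every row.

```python
from collections import defaultdict

# Hypothetical rows; "region" plays the role of the partitioning key.
ROWS = [
    {"customer_id": 1, "region": "north", "amount": 120},
    {"customer_id": 2, "region": "south", "amount": 80},
    {"customer_id": 3, "region": "north", "amount": 45},
    {"customer_id": 4, "region": "west", "amount": 300},
]


def build_partitions(rows, key):
    """Group rows into partitions labelled by the value of the partitioning key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return partitions


def query_by_key(partitions, key_value):
    """Read only the matching partition and report how many rows were scanned."""
    rows = partitions.get(key_value, [])
    return rows, len(rows)


partitions = build_partitions(ROWS, "region")
north_rows, scanned = query_by_key(partitions, "north")
print(scanned, "of", len(ROWS), "rows scanned")
```

A query that does not filter on the key gets no such shortcut and still has to visit every partition, which is why the choice of key matters so much.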
Database sharding, as we discussed earlier, is a bigger step: it spreads our data across several servers, each of which is called a shard. This is a great technique when dealing with massive databases, whereas partitioning typically keeps all the data on a single server.
Think of it like having several giant warehouses instead of just one. Each warehouse (server) is called a “shard,” and it holds a specific chunk of your data. This distribution allows for fast access and scalability, especially for massive databases that would overwhelm a single server.
In the world of shards, the shard key acts like a GPS that routes every query to the shard where it needs to be executed. Like the partitioning key, it is a special label derived from your data, but instead of pointing to a section within one database, it tells your queries exactly which shard to go to. Need user data for a specific region? The shard key acts like a zip code, directing your query to the shard that holds that data.
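One common way to turn a shard key into a destination is to hash it, which is what the sketch below assumes. The shard names and the shard_for function are hypothetical, and hashing is only one option; range-based or lookup-table routing would serve the zip code analogy just as well.

```python
import hashlib

# Hypothetical shard names; in practice each would be a separate server.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(shard_key: str, shards=SHARDS) -> str:
    """Map a shard key to one shard using a stable hash."""
    # hashlib returns the same digest on every run, unlike Python's built-in
    # hash(), which is randomised per process for strings.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


print(shard_for("customer-42"))
```

Because the mapping is deterministic, every query for the same key is routed to the same shard without consulting any central directory.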
Let’s briefly outline some major benefits of partitioning our database.
– Improved Query Performance: Queries over large datasets run faster when the data is partitioned. When a query filters on the partitioning key, such as a date or customer ID, the database can jump straight to the right partition instead of scanning the whole table. Imagine looking for a record from 2005. If the table is partitioned by year, the query only needs to search the partition that holds 2005, which beats going through the entire database to find that record.
– Better Data Management: Partitioning makes sorting and managing files much easier. Partitioning your data by year or customer ID simplifies data management because you can target specific sections and perform various operations on them. Also, data partitioning enables us to archive old files that are no longer in use into separate partitions and keep needed data readily accessible.
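The archiving idea can be sketched as moving whole yearly partitions rather than individual rows. The function archive_before and the sample data are hypothetical and stand in for whatever archive mechanism a real database offers.

```python
def archive_before(partitions: dict, cutoff_year: int):
    """Split yearly partitions into active ones and archived ones."""
    active, archived = {}, {}
    for year, rows in partitions.items():
        # Whole partitions move at once; individual rows are never inspected.
        target = archived if year < cutoff_year else active
        target[year] = rows
    return active, archived


# Hypothetical yearly partitions of sales rows.
by_year = {
    2021: ["sale-a"],
    2022: ["sale-b", "sale-c"],
    2023: ["sale-d"],
    2024: ["sale-e"],
}
active, archived = archive_before(by_year, cutoff_year=2023)
print(sorted(active), sorted(archived))
```

The cost of the operation depends on the number of partitions, not on the number of rows inside them.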
Some major benefits of database sharding are outlined below.
– Massive Database Handling: Sharding is very useful when handling large datasets, as it reduces the bulky nature of your database by dividing it into shards.
– Availability: If one server/shard goes down because of a fault, the other shards continue operating as intended, so the data they hold remains accessible while the failed shard is unavailable.
– Simultaneous Data Processing: Sharding allows various queries to run simultaneously on different shards to locate data. This essential feature is good for processing complex queries, as it allows us to get the job done in less time.
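The last two benefits can be illustrated together with a minimal scatter-gather sketch: the same filter is sent to every shard at once, and the answers from the healthy shards are merged. The in-memory SHARD_DATA, the function names, and the unavailable parameter are all assumptions made for this example; real shards would be reached over a network.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory shards; each list stands in for a separate server.
SHARD_DATA = {
    "shard-0": [{"id": 1, "amount": 120}, {"id": 4, "amount": 300}],
    "shard-1": [{"id": 2, "amount": 80}],
    "shard-2": [{"id": 3, "amount": 45}, {"id": 5, "amount": 210}],
}


def query_shard(name, min_amount, unavailable=frozenset()):
    """Run the filter on one shard; raise if that shard is down."""
    if name in unavailable:
        raise ConnectionError(f"{name} is unavailable")
    return [row for row in SHARD_DATA[name] if row["amount"] >= min_amount]


def scatter_gather(min_amount, unavailable=frozenset()):
    """Query all shards concurrently and merge what the healthy ones return."""
    results, failed = [], []
    with ThreadPoolExecutor(max_workers=len(SHARD_DATA)) as pool:
        futures = {
            name: pool.submit(query_shard, name, min_amount, unavailable)
            for name in SHARD_DATA
        }
        for name, future in futures.items():
            try:
                results.extend(future.result())
            except ConnectionError:
                failed.append(name)
    return sorted(results, key=lambda row: row["id"]), failed


rows, failed = scatter_gather(100)
partial_rows, down = scatter_gather(100, unavailable={"shard-0"})
```

When a shard is down, the result is partial rather than empty, and the caller can see which shards failed and decide whether a partial answer is acceptable.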
Disadvantages of Data Partitioning and Sharding
While partitioning and sharding are great data management techniques, there are some downsides to using them. Some of these are outlined below.
– Data Distribution Challenges: Choosing partitioning and shard keys that represent your data well can be difficult. A poor choice can lead to uneven distribution, where a few partitions or shards hold most of the data while others sit nearly empty.
– Cost Implications: Sharding can lead to higher costs. You might need additional hardware and infrastructure, especially if shards are spread across multiple servers or data centres. Additionally, the increased development and operational complexity can contribute to higher overall costs.
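The distribution problem is easy to see by counting rows per shard for two candidate keys. This is a sketch under assumed data: the customer rows, the two-country split, and the helper names are hypothetical, and hash-based placement is assumed.

```python
from collections import Counter
import hashlib


def shard_index(value: str, shard_count: int) -> int:
    """Stable hash of a key value, reduced to a shard number."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def rows_per_shard(rows, key, shard_count):
    """Count how many rows each shard would receive for a given key column."""
    counts = Counter(shard_index(str(row[key]), shard_count) for row in rows)
    return [counts.get(i, 0) for i in range(shard_count)]


# Hypothetical data: 1,000 customers, but only two distinct countries.
rows = [{"customer_id": i, "country": "NG" if i % 10 else "GH"} for i in range(1000)]

by_country = rows_per_shard(rows, "country", 4)
by_customer = rows_per_shard(rows, "customer_id", 4)
print("country key:", by_country)
print("customer key:", by_customer)
```

With only two distinct country values, at most two of the four shards can ever receive data, whereas the customer ID has enough distinct values to spread the rows across all of them.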
When choosing between the two methods, we weigh the data size and various other factors. Partitioning is most useful when dealing with medium-sized databases and often has an advantage over sharding in terms of query speed. This is because the search runs on only one database, with no need to coordinate across servers, allowing it to return results quickly.
Sharding, on the other hand, comes in handy when dealing with massive databases, as you can split the data across multiple shards. Each shard holds a unique chunk of the data, making it super scalable for enormous databases. With sharding, you can easily grow your database by continuously adding more shards (horizontal scaling), with the added benefit that a malfunctioning shard does not affect the rest.
In this article, we have explored strategies for efficient data management, focusing on data partitioning and sharding. Partitioning breaks down large datasets into manageable chunks, while sharding distributes data across multiple servers. They rely on partitioning keys and shard keys, respectively, to efficiently locate specific data. While partitioning offers benefits in query performance and manageability, sharding excels at handling massive databases and concurrent processing.
NITDA IT Hub, Unilag is a platform created to drive Tech innovations by being the bridge between academia, industry, startups and investors interested in enhancing the digital economy of the nation.
Copyright © 2022 NitHub. All rights reserved.