Storage & compute separation for DataOps
Delivering data analytics solutions at speed and with quality requires the right infrastructure and platform: one that gives you the agility to orchestrate infrastructure to develop / test / deploy, and to scale in real time as demand changes. An important part of this is separating storage and compute. Without it, the infrastructure is heavy & clunky to use – hindering orchestration of environments on demand.
What does separating storage & compute mean
Let’s try to understand this with a crude example. A laptop has both processing power and storage (your hard drive) in the same physical machine. If we place important data on a remote storage system connected through a network (let’s say a network drive) – that would mean your compute and storage are now separate.
In the world of databases – we can have the data sitting in a separate persistent layer / storage like Azure’s ADLS or Amazon’s S3. We can read/write data to / from the data store easily and at speed. For any processing on the data – we would have to use separate ‘compute’ resources which can be from the same cloud provider, a different vendor or an on-prem setup – depending on your architecture.
This gives you the flexibility to scale your compute and storage resources independently of one another. Without this, it can be costly and time-consuming to scale your analytics projects, and to develop / test them.
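To make the idea concrete, here is a minimal sketch in Python. The `ObjectStore` class is a hypothetical in-memory stand-in for a service like S3 or ADLS (it is not a real client library), and `partition_total` and `run_job` are illustrative names. It only aims to show that the size of the compute pool can change without touching the stored data.

```python
from concurrent.futures import ThreadPoolExecutor


class ObjectStore:
    """Hypothetical in-memory stand-in for a remote store such as S3 or ADLS."""

    def __init__(self):
        self._objects = {}

    def put(self, key, rows):
        self._objects[key] = list(rows)

    def get(self, key):
        return list(self._objects[key])

    def keys(self):
        return sorted(self._objects)


def partition_total(store, key):
    # Compute is stateless: it reads from the store and keeps nothing locally.
    return sum(store.get(key))


def run_job(store, workers):
    # 'workers' is the size of the compute pool; storage is unaffected by it.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda k: partition_total(store, k), store.keys()))


store = ObjectStore()
for i in range(4):
    store.put(f"sales/part-{i}", range(i * 10, i * 10 + 10))

small = run_job(store, workers=1)  # a single small compute node
large = run_job(store, workers=4)  # scaled-out compute, same data
```

Both runs read the same four partitions and produce the same total. Only the amount of compute applied to the job differs.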
OLTP vs OLAP
It helps to understand why storage and compute were traditionally tightly coupled, and why this was not seen as a challenge before. For this, let’s examine how the requirements on databases are changing.
Typical software applications / OLTP (Online Transactional Processing) systems have particular requirements in terms of processing speed and the size and type of data they process. Typically they have to process transactions within milliseconds, performing CRUD operations on a small number of records at a time. Depending on the system, these numbers vary significantly – for example airline booking systems can in some cases process around 60,000 operations per second, whereas a search query on a shopping website can take up to a few seconds to return results.
To achieve this speed – the processing and storage needed to be very close together and easy to access. Network speed and bandwidth have not always been as good as they are today, hence most traditional systems were designed with both on the same infrastructure – in other words – compute and storage were tightly coupled.
For data warehouses / data analytics applications / OLAP (Online Analytical Processing) the requirements are very different. These systems don’t necessarily care much about the speed of writing data – since many times this will be batch processing or near-real time. However they do require high compute power to process the data / transform it / model it and present it.
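The difference in access patterns can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely as a convenient single-file database; the `bookings` table and its values are made up for the example, and a real OLTP or OLAP system would of course be a very different product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (id INTEGER PRIMARY KEY, route TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO bookings (id, route, fare) VALUES (?, ?, ?)",
    [(1, "LHR-JFK", 500.0), (2, "LHR-JFK", 450.0), (3, "SYD-SIN", 300.0)],
)

# OLTP-style: touch one record by key, as a booking system would.
conn.execute("UPDATE bookings SET fare = ? WHERE id = ?", (475.0, 2))
one_booking = conn.execute(
    "SELECT route, fare FROM bookings WHERE id = ?", (2,)
).fetchone()

# OLAP-style: scan and aggregate every record, as a report would.
revenue_by_route = conn.execute(
    "SELECT route, SUM(fare) FROM bookings GROUP BY route ORDER BY route"
).fetchall()
conn.commit()
```

The first query needs fast access to a single row. The second has to read the whole table, which is where compute power matters more than write latency.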
Often new analytics projects start with just testing the waters. It’s not uncommon to run a few iterations on smaller data sets to find which information / measures / metrics will be most helpful. As the project evolves, the demands on data size & processing speed can grow rapidly, which can quickly become a problem with traditional servers. In other words, scaling infrastructure for analytics is significantly harder.
How compute & storage separation helps
Imagine having a server (on-premise or in the cloud) procured for a project with x compute & storage capacity. When demand increases – it’s not going to be straightforward to add additional nodes (compute). On-premise it will be significantly harder since you may have to procure new hardware. Even in the cloud you’ll have to go through some steps to get those extra resources – which will have an associated additional cost regardless of whether you use them or not.
If compute were separate – all we would need to do is spin up more compute clusters as we need them and pay accordingly. Scaling becomes much easier. Storage on the cloud is not a problem – it’s quite cheap and easy to get.
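A rough back-of-the-envelope comparison shows why paying only for the compute you use can matter. All numbers below are made up for illustration (the usage profile and the price of 1.0 per node-hour are assumptions, not real cloud pricing), and the sketch ignores that on-demand rates are often higher than reserved ones.

```python
def fixed_cost(peak_nodes, hours, rate_per_node_hour):
    # Tightly coupled: capacity is sized for the peak and paid for all the time.
    return peak_nodes * hours * rate_per_node_hour


def on_demand_cost(nodes_per_hour, rate_per_node_hour):
    # Separate compute: pay only for the nodes actually running each hour.
    return sum(nodes_per_hour) * rate_per_node_hour


# Hypothetical day: 2 nodes for 20 quiet hours, 10 nodes for a 4-hour peak.
usage = [2] * 20 + [10] * 4
rate = 1.0  # made-up price per node-hour

fixed = fixed_cost(peak_nodes=max(usage), hours=len(usage), rate_per_node_hour=rate)
on_demand = on_demand_cost(usage, rate)
```

With this profile the fixed setup pays for 240 node-hours while the on-demand setup pays for 80. The gap depends entirely on how spiky the workload is.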
Sharing & usage of data
Data sitting in a data lake can have a lot of different users – who might be using it for varied processes. Some might be using it for real-time processing, like feeding data to an operational system; others for near-real-time processes that update internal records for operational systems; and others for processes that prepare data models for BI reporting / analytics.
If access to this dataset is not readily available and the data lake cannot support the additional load on the I/O and network, we might end up making copies of that data for different use cases, which opens up a whole new set of challenges and upkeep.
Having all that data in cloud storage like an S3 bucket or ADLS makes access much easier. We would just need to change the configuration – the rest of the hard work of routing and networking is handled by the cloud provider. If all your data resides in the cloud, access to that data for all services becomes easy & quick.
Spinning up environments
For building DataOps practices – it’s vital we have an infrastructure that supports orchestrating environments on demand and scaling them down when not needed. Imagine having a data pipeline with 100s of tests running across different stages. Each stage of tests may need to spin up resources for compute and storage – and free them up after execution.
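The spin-up / run / free-up cycle described above can be sketched as a Python context manager. Everything here is hypothetical: `ephemeral_environment`, `run_stage` and the `active_environments` set are illustrative names, and the provisioning step is a placeholder for whatever your cloud provider or infrastructure-as-code tool actually does.

```python
from contextlib import contextmanager

active_environments = set()  # stand-in for resources that are currently billed


@contextmanager
def ephemeral_environment(name):
    # Hypothetical provisioning step: a real version would call the cloud
    # provider or an infrastructure-as-code tool here.
    active_environments.add(name)
    try:
        yield {"name": name, "storage_path": f"tmp/{name}/"}
    finally:
        # Always release the resources, even if the tests fail.
        active_environments.discard(name)


def run_stage(stage, tests):
    # Each stage gets its own short-lived environment.
    with ephemeral_environment(f"test-{stage}") as env:
        return [test(env) for test in tests]


results = run_stage("staging", [lambda env: env["name"] == "test-staging"])
```

The `finally` clause is the important design choice: an environment that outlives a failed test run is exactly the kind of idle resource that keeps costing money.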
With tightly coupled systems – creating a new environment is a pain. It often means heavy infrastructure or licensing cost upfront. This can significantly hamper the development / improvement of your pipeline because of the limitations on environments and execution resources the teams would have.
Such environments tend to be one big shared resource pool – used by multiple teams and processes. These dynamics easily create problems of compute or even storage resources peaking and causing problems for teams / processes. You’ll often find teams trying to schedule jobs around one another and setting up limitations on when to run jobs to ensure they don’t interfere.
Finally – you may often need different environments with different security controls if you are working with personal data. For example, you may need 2 staging environments – one with tokenized personal data and the other un-tokenized. With compute and storage separate, this is easy to do. If tightly coupled – this would require a huge upfront cost.
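As a rough illustration of the two staging datasets, the sketch below derives a tokenized copy from the raw rows using a keyed hash from Python's standard `hmac` module. This is only one possible approach and an assumption on our part: real tokenization is often done by a dedicated service or vault, and the field list, secret and sample row are made up.

```python
import hashlib
import hmac

PERSONAL_FIELDS = {"name", "email"}  # assumed list of personal-data columns


def tokenize(value, secret):
    # Keyed hash: the same input always maps to the same token, so joins still work.
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def tokenized_copy(rows, secret):
    # Replace only the personal fields; leave everything else untouched.
    return [
        {k: tokenize(v, secret) if k in PERSONAL_FIELDS else v for k, v in row.items()}
        for row in rows
    ]


raw = [{"name": "Ada", "email": "ada@example.com", "country": "UK"}]
staging_untokenized = raw
staging_tokenized = tokenized_copy(raw, secret=b"demo-only-secret")
```

With storage separate from compute, each of these datasets can sit under its own path with its own access controls, and a short-lived compute cluster can be pointed at whichever one a given test is allowed to see.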
With on-prem servers, it can become a challenge to move your datacenter. Managing datacenters and maintaining them is a tricky and costly business. In some cases companies can’t move offices just because they cannot move their data center! Moving your data to the cloud saves you all that hassle and also in many cases is way more cost effective.
Pay as you go
Storage is extremely cheap today – especially through cloud providers. With on-prem servers / local storage the cost can be 5x higher, and it still may not be as effective (low latency / high bandwidth) as what the cloud providers offer.
Migrating to separate compute
The activities needed to make the shift would primarily depend on what cloud provider you choose and what tech stack you are migrating to. Finalizing a cloud provider can be a time-consuming project, but clearing your tech debt to support / scale up to the new desired platform will be an even bigger task. Often teams would need to migrate their jobs / workflows to newer versions or different platforms to be ready for the migration.
At the end of the day – as a data analytics team our main business is not to build / manage datacenters; instead we should worry only about how to process raw data and serve it reliably. Cloud providers’ main job is to run datacenters and develop services to use them. So let’s leave them to what they do best and focus on what we should primarily focus on – delivering analytics to support Data Driven Decision Making!