Data pipeline architecture – Open source vs managed services

Deciding which tech stack to work with can be overwhelming, given ever-changing business demands, the vast choice of platforms, the skills you can easily hire for and, most importantly, the cost.

 

As soon as cost is mentioned, people often think the answer is open source, because it’s free. That is not entirely true. Open source does not always mean you’re not investing anything. In many cases you are investing the most costly asset of all: the time engineers spend building what open source platforms lack out of the box. Managing a team’s capacity is one of the most important aspects of software development.

 

There are a few dimensions to this problem, all of which indirectly affect cost. Let’s look at it from different perspectives to get a holistic view.

 

Building services

 

Building data pipelines covers ingesting, cleansing, curating and modelling data. At the end of the day, all of that really requires just plain SQL to manipulate data. However, to get to the point where we can write those SQL (or SQL-like) queries, we need to build a lot of underlying services to:

 

    • Read data from source locations (databases, APIs, Kafka, FTP etc.) in different formats (SQL tables, JSON, XML, CSV etc.) and save it in a distributed file system (HDFS, ADLS etc.).

 

    • For curation, set configs, read data from different formats / locations into dataframes, perform cleansing and curation, then save the results back.

 

    • Write util functions for commonly repeated activities, plus UDFs (user-defined functions), of which data pipelines need quite a few.

 

    • Maintain these util libraries to support changes in underlying platform / libraries (e.g. Spark, configs, logging etc).

 

    • Build additional services on top of the common utils for other frequent activities, like pushing data to other platforms, querying data from various locations across the data pipeline or serving it through APIs for real-time use cases.

 

All of these activities take considerable time and effort to develop and mature. Most of this work can be done more effectively with managed services, whose vendors have teams specializing in building libraries and services on top of open source technologies.
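
As a rough illustration of the plumbing in the list above, here is a minimal sketch of one such util: a reader that normalizes CSV and JSON input into a common list-of-dicts shape. The function name read_records and the format labels are hypothetical, and a real pipeline would read from a storage location into dataframes rather than from in-memory strings.

```python
import csv
import io
import json


def read_records(raw: str, fmt: str) -> list[dict]:
    """Parse raw text in the given format into a list of row dicts.

    `fmt` is a hypothetical label ("csv" or "json") used only in this sketch.
    """
    if fmt == "csv":
        # DictReader treats the first line as the header row.
        return [dict(row) for row in csv.DictReader(io.StringIO(raw))]
    if fmt == "json":
        data = json.loads(raw)
        # Accept either a single object or an array of objects.
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported format: {fmt}")


csv_rows = read_records("id,name\n1,Ada\n2,Grace\n", "csv")
json_rows = read_records('[{"id": "3", "name": "Edsger"}]', "json")
all_rows = csv_rows + json_rows
```

Even this toy version has to make decisions about headers, types (CSV values stay strings) and unsupported formats, which hints at how much effort the production-grade equivalents absorb.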


Keeping up with releases


It does not end once these services are built. The underlying technologies are in constant flux, and the chosen tech stack often keeps changing as demand changes and the industry evolves. That means maintaining these services is often a big piece of work too.


Furthermore, flakiness in the operation of these services has a direct impact on the analytics solutions being designed. Teams who choose to build these services on open source foundations often find that a good portion of their production issues are attributed to these services not working as intended.


Reusing common services


The activity of building these services on top of vanilla open source libraries is often not given its due importance. Engineers often don’t consider it a significant piece of work, and easily end up writing their own logic for something that has already been solved.


A theme I have seen across multiple organizations is how much rework teams end up doing. Instead of reusing what another team has built, each team writes its own ‘reusable service’, which is then reused only by them! This is a massive and unnecessary overhead.


Ultimately it’s SQL


To get the insights and analytics we need, the core activity is writing SQL logic across the pipeline. Considering all the points above, that usually turns out to be a very small portion of the total effort. Most of the team’s energy is spent managing and maintaining these services, while most of the value to the business comes from the small portion of time invested in writing SQL.
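
To show how small that core can be, here is a minimal sketch that cleanses and models a toy dataset in a single SQL statement. It uses Python’s built-in sqlite3 module as a stand-in for whatever engine the pipeline really runs on, and the table and column names (raw_orders, customer, amount) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(" alice ", 10.0), ("bob", None), ("alice", 5.5), ("BOB", 4.0)],
)

# The "business logic": cleanse (trim, lower-case, drop missing amounts)
# and model (total per customer) in one statement.
curated = conn.execute(
    """
    SELECT LOWER(TRIM(customer)) AS customer_key, SUM(amount) AS total
    FROM raw_orders
    WHERE amount IS NOT NULL
    GROUP BY LOWER(TRIM(customer))
    ORDER BY customer_key
    """
).fetchall()
conn.close()
```

Everything outside the query (connecting, loading, fetching) is the kind of supporting work the previous sections describe; in a real pipeline that surrounding part is far larger.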


Departmental Objectives


As discussed, building these services is time consuming. Another dimension to consider is your departmental objectives. As an analytics team, are you tasked with generating insights for the organization? Or are you meant to create a platform for business users to generate insights themselves? And yes, there is a big difference.


Generating insights


This is where most teams start: business users want insights into their operations and seek help from a newly formed analytics team to generate them. As they see value, the number of use cases grows and so does the team. Depending on how far data driven decision making has been adopted, an organization can remain heavily dependent on the analytics team to generate BI reports for it.


In some cases analytics teams wise up a little and also let business users create some custom reports, freeing up engineers’ time from translating small requests into basic SQL queries that join a few tables / columns in a view.


If this is where your team is today, then perhaps your primary job is to build data models and insights for the business. Building common services ‘to support’ this work is a distraction, and not necessarily something you want to spend time perfecting yourself.


Truly self service


As organizations mature and business users become tech-savvy enough to query a data warehouse and build reports for themselves, you start moving to self-service. By then it’s highly likely you are bringing in a considerable amount of data from various parts of the organization, meaning your compute usage is shooting up and the cost of managed services may be starting to offset their benefit.


At this stage it may make sense for the core data engineering team to start building those common services and make that their primary job. By now their main concern is bringing new data sources into the data warehouse on a solid foundation, while the rest of the organization has enough skilled people to generate reports and basic insights from the data warehouse, and analytics teams run ML models from the data lake.


Tech stack considerations


Another common reason for choosing open source and building services on top of it in-house is flexibility & portability, which it is hoped will make the platform future proof. At first this makes sense, but there are some nuances.


Vendor lock-in


Technology, platforms and the maturity of design practices are all changing at an incredible pace. Within a few years the whole picture can change, and a tech stack once considered top of the line quickly becomes legacy and sluggish.


In this climate of change, relying heavily on any one cloud vendor does not sit well with organizations, and rightly so. Portability is on everyone’s mind: many mature organizations get stuck with one vendor and may end up paying significantly higher costs just to continue business as usual.


With open source, organizations hope to maintain portability so they can switch platforms / cloud providers if the need arises. That is partly true: if, for example, the bulk of your pipeline is built with Spark, most major platforms support Spark, which makes the move possible. However, with data pipelines there are a lot more considerations than just support for, say, Spark.


Across a typical data pipeline, there are a LOT of platforms you’d need integrations with, plus a lot of peripheral services like scheduling, logging & monitoring and security / RBAC. Even if your main ETL pipelines are open source, there is a lot more around them which is not as portable and will have tight dependencies on the platform you are using. Many times I’ve seen teams struggle just to migrate from one version to the next. The point is that any migration is going to be an overhead and a painful process. It can get a little easier if you use the right design principles, but it will still pose a considerable challenge.
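
One design principle that tends to help is keeping platform-specific calls behind a small interface, so pipeline logic never touches a vendor SDK directly. The sketch below assumes a hypothetical ObjectStore interface (not any vendor’s API) with an in-memory and a local-disk implementation; a cloud-backed class would implement the same two methods.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical interface: the only storage surface pipeline code sees."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class LocalDiskStore:
    def __init__(self, root: Path) -> None:
        self._root = root

    def put(self, key: str, data: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def copy_uppercased(store: ObjectStore, src: str, dst: str) -> None:
    """A pipeline step written against the interface, not a platform."""
    store.put(dst, store.get(src).upper())


memory_store = InMemoryStore()
memory_store.put("raw/greeting.txt", b"hello")
copy_uppercased(memory_store, "raw/greeting.txt", "curated/greeting.txt")

with tempfile.TemporaryDirectory() as tmp:
    disk_store = LocalDiskStore(Path(tmp))
    disk_store.put("raw/greeting.txt", b"hello")
    copy_uppercased(disk_store, "raw/greeting.txt", "curated/greeting.txt")
    disk_result = disk_store.get("curated/greeting.txt")
```

This only covers storage. Scheduling, monitoring and RBAC would each need similar treatment, which is part of why a migration stays a sizeable effort even with clean abstractions.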


Future proof?


The question remains: is there a way to future proof our tech stack? The honest truth is it’s a balancing act, because no matter what tech stack you opt for today, there is a very high chance it will need major upgrades a few years down the line.


This is a sticky topic, but IMHO it is very hard to have a future proof tech stack. The aim should be to choose a tech stack which is scalable and can deliver features for the coming few years with very little maintenance.


Summary


Here are the summarized points to consider:


  • Building and maturing services on top of open source platforms / libraries can take considerable effort

  • Maintaining these services is also an on-going activity

  • Teams sometimes struggle to reuse services and end up building their own logic for the same function again – especially true for larger or distributed teams

  • If the analytics department is primarily tasked with building insights, then trying to build these services will take far more time, taking you away from the primary objective

  • Open source does not mean switching platforms / cloud providers is going to be easy – data pipelines require a lot of integrations and peripheral services which are not always easy to migrate

Given the points above, it may make more sense for analytics teams which are early in their journey or on a growth trajectory to use managed services instead of going open source. The cost of managed services might not be that high while volumes are still low, and the focus at that point should be on adopting data driven decision making rather than on cost optimization.


Once an organization has reached a level of maturity and the analytics team is primarily moving towards supporting self-service, the volumes can outrun the benefit of using managed services, making it more cost effective to build services on open source platforms.
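
The crossover described above can be sketched as a simple break-even calculation. This assumes a linear cost model, which is a simplification of this example rather than anything measured: managed services cost a higher rate per unit of volume, while building on open source costs a fixed engineering effort plus a lower rate. All figures are placeholders, not real prices.

```python
def break_even_volume(fixed_cost: float, managed_rate: float, self_rate: float) -> float:
    """Volume above which building on open source becomes cheaper.

    Linear model (an assumption of this sketch):
      managed cost    = managed_rate * volume
      self-built cost = fixed_cost + self_rate * volume
    Setting the two equal gives volume = fixed_cost / (managed_rate - self_rate).
    """
    if managed_rate <= self_rate:
        raise ValueError("managed_rate must exceed self_rate for a break-even to exist")
    return fixed_cost / (managed_rate - self_rate)


# Placeholder figures: 200,000 per year of engineering effort,
# 50 per TB on managed services, 10 per TB when self-built.
volume = break_even_volume(fixed_cost=200_000, managed_rate=50, self_rate=10)
```

With these made-up numbers the break-even sits at 5,000 TB per year. Below that, the managed option is cheaper under this model; above it, the fixed engineering cost starts to pay for itself.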

Storage & compute separation for DataOps

Delivering data analytics solutions at speed and with quality requires the right infrastructure and platform, with the agility to orchestrate infra to develop / test / deploy and scale in real time as demand requires. An important part of this is separating storage and compute. Without it, the infrastructure is heavy & clunky to use, hindering the orchestration of environments on demand.

 

What does separating storage & compute mean

 

Let’s try to understand this with a crude example. A laptop has both processing power and storage (your hard drive) in the same physical machine. If we place important data on a separate storage device that the laptop only connects to (let’s say a portable drive or a network share), your compute and storage are now separate.

 

In the world of databases – we can have the data sitting in a separate persistent layer / storage like Azure’s ADLS or Amazon’s S3. We can read/write data to / from the data store easily and at speed. For any processing on the data – we would have to use separate ‘compute’ resources which can be from the same cloud provider, a different vendor or an on-prem setup – depending on your architecture.

 

This gives you the flexibility to scale your compute resources and storage resources independently of one another. Without this, it can be costly and time consuming to scale your analytics projects, and to develop / test them.

 

OLTP vs OLAP

 

It helps to understand why storage and compute were traditionally tightly coupled, and why this was not seen as a challenge before. For this, let’s examine how the requirements on databases are changing.

 

OLTP

 

Typical software applications / OLTP (Online Transaction Processing) systems have unique requirements in terms of processing speed and the size and type of data they process. Typically they have to process transactions within milliseconds, performing CRUD operations on a small number of records at a time. Depending on the system, these numbers vary significantly: airline booking systems in some cases can process around 60,000 operations per second, whereas a search query on a shopping website can take up to a few seconds.

 

To achieve this speed, processing and storage needed to be very close together and easy to access. Network speed and bandwidth have not always been as good as they are today, hence most traditional systems were designed to run on the same infrastructure. In other words, compute and storage were tightly coupled.

 

OLAP

 

For data warehouses / data analytics applications / OLAP (Online Analytical Processing) the requirements are very different. These systems don’t necessarily care much about the speed of writing data – since many times this will be batch processing or near-real time. However they do require high compute power to process the data / transform it / model it and present it.

 

Often new analytics projects start with just testing the waters. It’s not uncommon to run a few iterations on smaller data sets to find which information / measures / metrics will be most helpful. As the project evolves, the demands on data size & processing speed may grow rapidly, which can quickly pose a problem with traditional servers. In other words, scaling infrastructure for analytics is significantly harder.

 

How compute & storage separation helps

 

Scaling compute

 

Imagine having a server (on premise or in the cloud) procured for a project with x compute & storage capacity. When demand increases, it’s not going to be straightforward to add additional nodes (compute). On-premise will be significantly harder since you may have to procure new hardware. Even on the cloud you’ll have to go through some steps to get those extra resources, which will have an associated additional cost regardless of whether you use them or not.


If compute were separate, all we would need to do is spin up more compute clusters as we need them and pay accordingly. Scaling becomes far easier. Storage on the cloud is not a problem, since it’s quite cheap and easy to get.
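
As a loose analogy only, the sketch below keeps the data in one place and chooses the amount of compute per job. A temporary directory stands in for an S3 bucket or ADLS container, and a thread pool stands in for a compute cluster; the function names are made up for the example.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_lines(path: Path) -> int:
    """Unit of work: count the rows in one stored file."""
    with path.open() as handle:
        return sum(1 for _ in handle)


def run_job(storage: Path, workers: int) -> int:
    """Process every file in `storage` with a pool of `workers` threads.

    The pool size is chosen per job; the storage layout never changes.
    """
    files = sorted(storage.glob("*.csv"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_lines, files))


with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp)  # stands in for a bucket / container
    for i in range(8):
        (storage / f"part-{i}.csv").write_text("a,b\n" * 100)

    small = run_job(storage, workers=1)  # quiet day: one worker
    large = run_job(storage, workers=4)  # busy day: more compute, same data
```

Both runs return the same answer from the same files. The only thing that changed between them is how much compute was attached, which is the property that makes pay-as-you-need scaling possible.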


Sharing & usage of data


Data sitting in a data lake can have a lot of different users, who might be using it for varied processes. Some might use it for real-time processing, like feeding data to an operational system. Others might run near-real-time processes that update internal records for operational systems, or processes that prepare data models for BI reporting / analytics.


If access to this dataset is not readily available and the data lake cannot support the additional load on the I/O and network, we might end up making copies of that data for different use cases, which opens up a whole new set of challenges and upkeep.


Having all that data on cloud storage like an S3 bucket or ADLS makes access much easier. We just need to change the configurations, and the hard work of routing and networking is handled by the cloud provider. If all your data resides on the cloud, access to that data is easy & quick for all services.


Spinning up environments


For building DataOps practices, it’s vital to have an infrastructure that supports spinning up environments on demand and scaling them down when not needed. Imagine a data pipeline with hundreds of tests running across different stages. Each stage of tests may need to spin up compute and storage resources and free them up after execution.
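
The spin up, run, free up cycle can be sketched with a context manager. Here a temporary directory stands in for the resources of one test stage; in practice the setup and teardown steps would call your cloud provider or infrastructure tooling, and the name ephemeral_environment is hypothetical.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def ephemeral_environment(name: str) -> Iterator[Path]:
    """Create a scratch workspace for one test stage and remove it afterwards."""
    workspace = Path(tempfile.mkdtemp(prefix=f"{name}-"))
    try:
        yield workspace
    finally:
        # Free the resources even if the tests inside the block fail.
        shutil.rmtree(workspace, ignore_errors=True)


with ephemeral_environment("ingest-tests") as env:
    (env / "sample.csv").write_text("id\n1\n2\n")
    # Count data rows, excluding the header line.
    row_count = len((env / "sample.csv").read_text().splitlines()) - 1
    existed_during_run = env.exists()

exists_after_run = env.exists()
```

The useful property is that teardown is guaranteed by the structure of the code rather than by someone remembering to clean up, so environments only exist (and only cost money) while a stage is running.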


With tightly coupled systems – creating a new environment is a pain. It often means heavy infrastructure or licensing cost upfront. This can significantly hamper the development / improvement of your pipeline because of the limitations on environments and execution resources the teams would have.


Such environments tend to be one big shared resource pool – used by multiple teams and processes. These dynamics easily create problems of compute or even storage resources peaking and causing problems for teams / processes. You’ll often find teams trying to schedule jobs around one another and setting up limitations on when to run jobs to ensure they don’t interfere.


Finally, you may often need different environments with different security controls if you are working with personal data. For example, you may need 2 staging environments: one with tokenized personal data and the other un-tokenized. With compute and storage separate, this is really easy to do. If they are tightly coupled, it would require a huge upfront cost.
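
A minimal sketch of the two-environment idea follows. How tokenization is done is not specified above, so keyed hashing (HMAC-SHA256) is used here as one common approach; the field names, the sample rows and the hard-coded key are all placeholders.

```python
import hashlib
import hmac


def tokenize(value: str, secret: bytes) -> str:
    """Replace a personal value with a keyed, deterministic token."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenized_view(rows: list[dict], pii_fields: set[str], secret: bytes) -> list[dict]:
    """Return a copy of `rows` with the listed fields tokenized."""
    return [
        {k: tokenize(v, secret) if k in pii_fields else v for k, v in row.items()}
        for row in rows
    ]


SECRET = b"demo-only-secret"  # placeholder; a real key would come from a secret store
source_rows = [
    {"email": "ada@example.com", "country": "UK"},
    {"email": "ada@example.com", "country": "FR"},
]

staging_untokenized = source_rows  # restricted-access environment
staging_tokenized = tokenized_view(source_rows, {"email"}, SECRET)
```

Because the token is deterministic for a given key, the same email maps to the same token in every row, so joins and counts still work in the tokenized environment without exposing the original value.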


Mobility


With on-prem servers, it can become a challenge to move your datacenter. Managing and maintaining datacenters is a tricky and costly business. In some cases companies can’t move offices just because they cannot move their datacenter! Moving your data to the cloud saves you all that hassle and in many cases is also far more cost effective.


Pay as you go


Storage is extremely cheap today, especially through cloud providers. With on-prem servers / local storage the cost can be 5x higher and still not be as effective (low latency / high bandwidth) as the cloud providers.


Migrating to separate compute


The activities needed to make the shift primarily depend on which cloud provider you choose and which tech stack you are migrating to. Finalizing a cloud provider can be a time consuming project, but clearing your tech debt to support / scale up to the new platform will be an even bigger task. Often teams need to migrate their jobs / workflows to newer versions or different platforms to be ready for the migration.



At the end of the day, as a data analytics team our main business is not to build / manage datacenters. Instead, we should worry only about how to process raw data and serve it reliably. Cloud providers’ main job is to run datacenters and develop services to use them. So let’s leave them to what they do best and focus on what we should primarily focus on: delivering analytics to support Data Driven Decision Making!