Data Plumbing Essentials: Keeping Your Production Pipelines Leak-Free

There has been a lot of focus lately on data quality, governance policies and process automation as ways to improve decision making in organisations. Data lives everywhere, in all sorts of shapes and formats. Sitting idle, data is useless. You need to put it to work to generate value for the business, which usually means extracting, transforming and loading it one way or another and making sure it is fit for purpose, so that it can provide insight and drive decisions.

Today, we focus on a key process that makes this possible: data pipelines. This blog explores what constitutes a data pipeline and the complexities data engineers encounter in making it robust. Simple pipelines for data exploration are quick to put together. Things become far more complex when you need to integrate diverse data streams into digestible sets that analysts and systems can reliably consume on a recurring basis. This is what production data pipelines must do.

The creation of production data pipelines isn’t dissimilar to constructing a sophisticated railway system for data. It is an exercise in precision engineering, requiring meticulous planning, robust construction, and continuous maintenance. The goal is to ensure that data from various sources arrives at its destination accurately, on time, and ready for consumption.

What is the Use Case?

The journey begins with an in-depth consultation phase with the stakeholders. Data engineers must understand the business objectives, the data’s origin, its destination, and the insights that the analytics team aims to glean. This phase is critical, as it informs the choice of technology, the design of the pipeline, and the strategies for data integration, quality checks, and compliance with data governance standards.

It is crucial to validate your understanding of the use case with the stakeholders. I’ve seen many a project go haywire when the delivery team built something very different from what the business was expecting.

Designing the Data Pipeline

After the goals are set and validated, data engineers draft the architecture of the data pipeline. This design specifies how the data will be extracted, where it will be staged, and the transformations it will undergo. The process is often iterative, with frequent reviews to ensure alignment with business objectives and technical feasibility. This phase tends to take a long time to set in stone because of the constant back and forth between the business and the engineering teams.

Selection of Technologies

Selecting the right tools and technologies is paramount. Data engineers must consider the volume of data, the frequency of updates, and the complexity of transformations. Solutions like IOblend may be deployed for handling real-time data streams and complex integration use cases, whereas batch processing might rely on the modern data stack (MDS) or the Hadoop ecosystem. For data warehousing, technologies like Amazon Redshift, Google BigQuery, or Snowflake are popular choices.

The key here is to create an architecture and select technologies that best fit the purpose, rather than hacksawing a pipeline design into something that was never meant for the job at hand. The biggest mistake many businesses make is to limit themselves to a particular stack and then spend a lot of time and resources on workarounds. Always run a full life cycle cost analysis on any tech you bring in or already use.

Now, let’s see what the steps are for developing a production-grade data pipeline.

Data Extraction (E)

The first concrete step in the pipeline is extracting data from source systems. This can range from simple databases to complex, distributed systems. Data engineers must navigate various formats and protocols, employing ETL tools to ingest the data into a staging area (physical or in-memory).

Typically, data is extracted from APIs, databases, enterprise service buses (ESBs), IoT sensors, flat files, blobs and other business systems. It can be structured or unstructured, batch or real-time. If the data exists in a digital form, it can theoretically be ingested.

Regardless of the data extraction methodology, the pipeline must be able to connect to the source and ingest the data reliably and at a specified frequency.

Transformation and Enrichment (T)

Data rarely arrives in the right shape for analysis. It needs validating and transforming (cleaning, normalising or enriching) before it becomes useful for the business.

One popular approach is to ingest raw data into a cloud lake or warehouse and then work on it there. The Snowflake ecosystem is a notable example, where the data is ingested into the warehouse via an ELT process and then processed there by various tools.
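To make the ELT pattern concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for a cloud warehouse. The table and column names (`raw_sales`, `sales_by_country`) are made up for illustration. The raw rows are landed untouched first, and the clean-up happens afterwards in SQL, inside the "warehouse".

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

# E + L: land the source rows untouched in a raw table.
conn.execute("CREATE TABLE raw_sales (sale_id TEXT, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [("s1", " 10.50", "gb"), ("s2", "7.25 ", "GB"), ("s3", "3.00", "fr")],
)

# T: transform inside the warehouse with SQL, separately from ingestion.
conn.execute(
    """
    CREATE TABLE sales_by_country AS
    SELECT UPPER(country) AS country,
           ROUND(SUM(CAST(TRIM(amount) AS REAL)), 2) AS total_amount
    FROM raw_sales
    GROUP BY UPPER(country)
    """
)

totals = conn.execute(
    "SELECT country, total_amount FROM sales_by_country ORDER BY country"
).fetchall()
print(totals)  # [('FR', 3.0), ('GB', 17.75)]
```

Because the raw table is kept as it arrived, the transformation can be rewritten and rerun later without going back to the source system.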

Alternatively, the data can be “staged” virtually, in memory, for use cases that consume real-time data or require pre-processing. Real-time feeds often rely on change data capture (CDC), for instance, so the processing is applied while the data is in transit.

This transformation process is where much of the magic happens. Engineers write scripts, often in SQL or Python, that turn the raw data into something that can answer business questions. There is a plethora of tools that specialise in data transformations, with dbt being a popular choice for in-warehouse transforms.

The transformation layer takes care of the business logic, data cleaning, quality checks and governance, among other things. As indicated previously, this layer can run either in a data warehouse or in memory. In the warehouse case, the pipeline is split into EL and T parts, with the T taking place separately from the pipeline.
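The idea behind CDC can be sketched by comparing two snapshots of a table and emitting only what changed. This is a deliberate simplification: production CDC tools usually read the database's transaction log rather than diffing snapshots, and the function name `diff_snapshots` and the sample data are hypothetical.

```python
def diff_snapshots(previous: dict, current: dict) -> list[tuple]:
    """Compare two snapshots keyed by primary key and emit change events."""
    events = []
    for key, row in current.items():
        if key not in previous:
            events.append(("insert", key, row))
        elif previous[key] != row:
            events.append(("update", key, row))
    for key, row in previous.items():
        if key not in current:
            events.append(("delete", key, row))
    return events


previous = {1: {"status": "booked"}, 2: {"status": "booked"}}
current = {1: {"status": "cancelled"}, 3: {"status": "booked"}}

events = diff_snapshots(previous, current)
for event in events:
    print(event)
# ('update', 1, {'status': 'cancelled'})
# ('insert', 3, {'status': 'booked'})
# ('delete', 2, {'status': 'booked'})
```

Downstream steps then only need to process the three change events rather than reloading the whole table.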

In an ETL case, transformations take place as part of the data pipeline. The manipulation of the data occurs “in-flight” before the data ever gets to its intended destination, be it the aforementioned warehouse or directly to the consumption layer (e.g. systems, apps, dashboards, etc).
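One way to picture “in-flight” transformation is as a chain of Python generators, where each record is cleaned and enriched as it streams through, before anything is written to the destination. This is only a sketch: the step names (`clean`, `enrich`) and the store-to-region lookup are assumptions made for the example.

```python
from typing import Iterable, Iterator


def clean(records: Iterable[dict]) -> Iterator[dict]:
    """Drop records without an id and trim whitespace from names."""
    for record in records:
        if record.get("id") is None:
            continue
        yield {**record, "name": record.get("name", "").strip()}


def enrich(records: Iterable[dict], region_by_store: dict) -> Iterator[dict]:
    """Attach a region to each record using a lookup table."""
    for record in records:
        region = region_by_store.get(record["store"], "unknown")
        yield {**record, "region": region}


source = [
    {"id": 1, "name": " Alice ", "store": "LON1"},
    {"id": None, "name": "Bob", "store": "NYC1"},
    {"id": 3, "name": "Carol", "store": "PAR9"},
]
regions = {"LON1": "EMEA", "NYC1": "AMER"}

# Records flow through both steps one at a time; nothing is staged on disk.
delivered = list(enrich(clean(source), regions))
```

Here `delivered` contains two records: Alice with her name trimmed and region `EMEA`, and Carol with region `unknown` because her store is not in the lookup. The record without an id never reaches the destination.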

Persisting the Data (L)

Once transformed, data needs to be persisted into a warehouse or system store for onward consumption and archiving. It must be organised into tables and schemas that reflect the business context and support efficient querying. The loading process may be scheduled in batches or streamed in real-time, depending on the nature of the data and the use cases.
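A loading step is easier to operate if it is idempotent, meaning a rerun of the same batch does not create duplicates. Below is a minimal sketch using SQLite's upsert syntax (available in SQLite 3.24 and later) as a stand-in for a warehouse load. The `bookings` table and the `load_bookings` function are hypothetical.

```python
import sqlite3


def load_bookings(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Insert new bookings and update existing ones, keyed on booking_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookings ("
        "booking_id TEXT PRIMARY KEY, customer TEXT, amount REAL)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO bookings (booking_id, customer, amount) "
            "VALUES (:booking_id, :customer, :amount) "
            "ON CONFLICT(booking_id) DO UPDATE SET "
            "customer = excluded.customer, amount = excluded.amount",
            rows,
        )


conn = sqlite3.connect(":memory:")
batch = [
    {"booking_id": "b1", "customer": "Alice", "amount": 120.0},
    {"booking_id": "b2", "customer": "Bob", "amount": 80.0},
]

load_bookings(conn, batch)
load_bookings(conn, batch)  # rerunning the same batch does not duplicate rows
load_bookings(conn, [{"booking_id": "b2", "customer": "Bob", "amount": 95.0}])

stored = conn.execute(
    "SELECT booking_id, amount FROM bookings ORDER BY booking_id"
).fetchall()
print(stored)  # [('b1', 120.0), ('b2', 95.0)]
```

The same function serves a scheduled batch or a stream of small micro-batches; only the size and frequency of `rows` changes.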

Quality Assurance and Testing

Throughout the ETL process, quality assurance is crucial. Data engineers implement automated tests to verify that each stage of the pipeline behaves as expected. This might involve checking that data is complete, that transformations preserve data integrity, and that loading processes do not introduce errors.
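The checks described above can start out very simple. Here is a sketch of a batch-level check covering completeness (row counts match), required fields and duplicate keys. The function name `check_batch` and the field names are assumptions for the example; dedicated data quality tools cover far more ground.

```python
def check_batch(
    source_rows: list[dict],
    loaded_rows: list[dict],
    required_fields: list[str],
) -> list[str]:
    """Return a list of readable failures; an empty list means the batch passed."""
    failures = []

    # Completeness: everything that was extracted should have been loaded.
    if len(loaded_rows) != len(source_rows):
        failures.append(
            f"row count mismatch: {len(source_rows)} extracted, "
            f"{len(loaded_rows)} loaded"
        )

    # Integrity: required fields must be present and non-empty.
    for index, row in enumerate(loaded_rows):
        for field in required_fields:
            if row.get(field) in (None, ""):
                failures.append(f"row {index}: missing {field}")

    # Loading must not introduce duplicate keys.
    ids = [row.get("id") for row in loaded_rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate ids in loaded data")

    return failures


source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}, {"id": 3, "amount": 7.5}]
loaded = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]

failures = check_batch(source, loaded, required_fields=["id", "amount"])
for failure in failures:
    print(failure)
# row count mismatch: 3 extracted, 2 loaded
# row 1: missing amount
```

Returning a list of failures rather than raising on the first one makes it easier to log everything that went wrong with a batch in one go.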

Deployment and Monitoring

After thorough testing, the pipeline is ready for deployment in production.

Yet, the engineers’ work is never finished. They must monitor the pipelines, ensuring data flows smoothly and efficiently, and that any errors or bottlenecks are swiftly addressed (and don’t bring the house down in the process!).

There are numerous requirements placed on the production data pipelines to ensure they are robust and provide reliable data feeds (see the table below). This functionality is either scripted by the developers and/or provided by separate tools to varying degrees. The more critical the data pipeline is, the more thorough the checks will be.

Data lineage	Data tables management
Auditability	CI/CD versioning and deployment
Data quality management	Data archiving
Error management	Data monitoring
Data recovery	Scheduling
Late arriving data management	Automated alerting
Change Data Capture (CDC)	Schema drift management
Stream and batch processing	Cloud integration
Metadata management	On-prem integration
Reliable data ingestion	Testing framework
Complex data aggregations	High volume processing
Slowly Changing Dimensions (SCD)	Automatic state management
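To give a flavour of one item from the table, schema drift management, here is a minimal sketch that compares an incoming record against the schema the pipeline expects. The function name `detect_schema_drift` and the expected schema are hypothetical, and comparing Python type names is a simplification of what a real schema registry or validation tool would do.

```python
def detect_schema_drift(expected: dict, record: dict) -> dict:
    """Report fields that are missing, unexpected, or of a different type."""
    actual = {name: type(value).__name__ for name, value in record.items()}
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "type_changed": sorted(
            name
            for name in set(expected) & set(actual)
            if expected[name] != actual[name]
        ),
    }


# Expected schema: field name -> Python type name (an assumption for this sketch).
expected = {"sale_id": "str", "amount": "float", "store": "str"}

# The source has dropped "store", added "channel" and started sending amounts as text.
record = {"sale_id": "s1", "amount": "10.50", "channel": "web"}

drift = detect_schema_drift(expected, record)
print(drift)
# {'missing': ['store'], 'unexpected': ['channel'], 'type_changed': ['amount']}
```

What the pipeline does with such a report (halt, quarantine the record, or adapt and alert) depends on how critical the feed is.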

The Consumption Layer

Finally, the processed data is channelled into a system or analytics dashboard. For example, if you are running a reservation system, bookings data flows in real time to power its algorithms. Alternatively, for data visualisations, tools like Tableau, Power BI, or Looker are employed to create interactive dashboards. Here, the data becomes insight, informing decisions and sparking ideas that drive the business forward.

How The Pipeline Comes Together

Consider a multinational retail chain that wishes to integrate sales data from various point-of-sale (POS) systems across the globe. The data engineers use Kafka for real-time data ingestion, ensuring a steady and scalable flow of sales data. They implement an Apache Spark-based transformation layer that standardises and enriches the data with currency conversion and time zone alignment. MS Synapse serves as the data warehouse, due to its powerful analytics capabilities and seamless integration with Microsoft services. The final dashboards are built in Power BI, offering regional managers real-time insights into sales trends and inventory needs.
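The standardisation step in this example, currency conversion and time zone alignment, can be sketched in plain Python (rather than Spark) to show the logic. Everything here is illustrative: the exchange rates, store codes and fixed UTC offsets are made-up values, and a real implementation would use live rates and IANA time zone names so that daylight saving time is handled properly.

```python
from datetime import datetime, timedelta, timezone
from decimal import Decimal

# Hypothetical reference data for the sketch.
RATES_TO_USD = {"USD": Decimal("1"), "GBP": Decimal("1.25"), "JPY": Decimal("0.0070")}
STORE_UTC_OFFSET_HOURS = {"TYO1": 9, "NYC1": -5}  # fixed offsets, ignoring DST


def standardise_sale(sale: dict) -> dict:
    """Convert a POS sale to UTC time and USD amount."""
    offset = timezone(timedelta(hours=STORE_UTC_OFFSET_HOURS[sale["store"]]))
    local_time = datetime.fromisoformat(sale["local_time"]).replace(tzinfo=offset)

    amount_usd = Decimal(sale["amount"]) * RATES_TO_USD[sale["currency"]]

    return {
        "store": sale["store"],
        "sold_at_utc": local_time.astimezone(timezone.utc).isoformat(),
        "amount_usd": str(amount_usd.quantize(Decimal("0.01"))),
    }


sales = [
    {"store": "TYO1", "local_time": "2023-10-20T09:30:00", "amount": "1500", "currency": "JPY"},
    {"store": "NYC1", "local_time": "2023-10-19T20:15:00", "amount": "40.00", "currency": "USD"},
]

standardised = [standardise_sale(sale) for sale in sales]
for row in standardised:
    print(row)
# {'store': 'TYO1', 'sold_at_utc': '2023-10-20T00:30:00+00:00', 'amount_usd': '10.50'}
# {'store': 'NYC1', 'sold_at_utc': '2023-10-20T01:15:00+00:00', 'amount_usd': '40.00'}
```

Once both sales are on the same clock and in the same currency, they can be aggregated in the warehouse and compared across regions on the dashboards.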

The pipeline unlocks new analytical capabilities that directly lead to business value – cross-selling capabilities, dynamic product packaging and real-time up-selling.

Time and Cost Considerations

The time to construct a production data pipeline can vary tremendously, from a few weeks to several months or more, depending on the complexity and scale of the task. Simple pipelines might take shape quickly, while more intricate systems, with numerous data sources and sophisticated transformations, require much longer time frames, especially in mission-critical settings.

This is one of the biggest stumbling blocks to businesses adopting data analytics on a much larger scale.

Production pipelines take a considerable effort by very skilled data engineers to create and manage, especially since the design specifications vary from one pipeline to the next and keep evolving over time (e.g. new data sources, different transforms, changing sinks, etc).

The pipelines are expected to perform. If a pipeline stumbles in a mission-critical system, the consequences are severe. Imagine your pipeline brings down the revenue management system, and the company loses a million bookings. I wouldn’t want to be the guy on the receiving end of the CFO’s wrath.

This is why production-grade data pipelines must be robust and resilient. And that is why they take so much time to develop and maintain properly.

Try something different

At IOblend, we believe that developing and maintaining production data pipelines should be much simpler than it currently is. To that end, we have built a low-code / no-code solution that embeds the production features into every single data pipeline, thus significantly cutting the pipeline development effort and cost.

Design, build and test each ETL component as you progress to “in-prod”. Whatever your use case is (data migration, simple or complex integrations, real-time and batch analytics, data syncing, pipeline automation, etc), IOblend will make it much easier and quicker to develop. Whether you do ETL or ELT, batch or real-time, it makes no difference. All data integration use cases are covered in one tool.

The technology that makes data pipeline engineering easier is evolving fast. Explore it, find what works best for your use cases, and get using it!
