Data Plumbing Essentials: Keeping Your Production Pipelines Leak-Free

There has been a lot of focus lately on data quality, governance policies and process automation as ways to improve decision making in organisations. Data lives everywhere, in all sorts of shapes and formats. Sitting idle, data is useless. You need to put it to work to generate value for the business, which usually means extracting, transforming and loading it one way or another, and making sure it is fit for purpose so that it provides insight and drives decisions.

Today, we focus on a key process that makes this possible: data pipelines. This blog explores what constitutes a data pipeline and the complexities data engineers face in making it robust. Simple pipelines are straightforward to put together quickly for data exploration. Things become far more complex when you need to integrate diverse data streams into digestible sets that analysts and systems can reliably consume on a recurring basis. This is what production data pipelines must do.

The creation of production data pipelines isn’t dissimilar to constructing a sophisticated railway system for data. It is an exercise in precision engineering, requiring meticulous planning, robust construction, and continuous maintenance. The goal is to ensure that data from various sources arrives at its destination accurately, on time, and ready for consumption.

What is the Use Case?

The journey begins with an in-depth consultation phase with the stakeholders. Data engineers must understand the business objectives, the data’s origin, its destination, and the insights that the analytics team aims to glean. This phase is critical, as it informs the choice of technology, the design of the pipeline, and the strategies for data integration, quality checks, and compliance with data governance standards.

It is crucial to validate your understanding of the use case with the business. I’ve seen many a project go haywire because the delivery team built something very different from what the business was expecting.

Designing the Data Pipeline

After the goals are set and validated, data engineers draft the architecture of the data pipeline. This design specifies how the data will be extracted, where it will be staged, and the transformations it will undergo. The process is often iterative, with frequent reviews to ensure alignment with business objectives and technical feasibility. This phase tends to take a long time to set in stone because of the constant back and forth between the business and the engineering teams.

Selection of Technologies

Selecting the right tools and technologies is paramount. Data engineers must consider the volume of data, the frequency of updates, and the complexity of transformations. Solutions like IOblend may be deployed for handling real-time data streams and complex integration use cases, whereas batch processing might rely on the modern data stack (MDS) or the Hadoop ecosystem. For data warehousing, technologies like Amazon Redshift, Google BigQuery, or Snowflake are popular choices.

The key here is to create an architecture and select technologies that best fit the purpose, rather than hacksawing a pipeline design into something that was never meant for the job at hand. The biggest mistake many businesses make is to limit themselves to a particular stack and then spend a lot of time and resources on workarounds. Always run a full life cycle cost analysis on any tech you bring in or already use.

Now, let’s see what the steps are for developing a production-grade data pipeline.

Data Extraction (E)

The first concrete step in the pipeline is extracting data from source systems. This can range from simple databases to complex, distributed systems. Data engineers must navigate various formats and protocols, employing ETL tools to ingest the data into a staging area (physical or in-memory).

Typical sources include APIs, databases, systems, ESBs, IoT sensors, flat files, blobs, etc. The data can be structured or unstructured, and it can arrive in batches or in real time. If the data exists in a digital form, it can theoretically be ingested.

Regardless of the data extraction methodology, the pipeline must be able to connect to the source and ingest the data reliably and at a specified frequency.
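One common way to make ingestion reliable is to retry transient failures with an increasing delay. The sketch below is illustrative only: `extract_with_retry` and its parameters are hypothetical names, and `fetch` stands in for whatever call actually reads from your source.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def extract_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to alerting
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Only connection and timeout errors are retried here. Anything else (a bad credential, a malformed query) fails immediately, which is usually what you want because retrying will not fix it.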

Transformation and Enrichment (T)

Data rarely comes in the right shape for analysis. It usually needs validating and transforming (cleaning, normalising, or enriching) before it becomes useful to the business.

One popular approach is to ingest raw data into a cloud lake or warehouse and then work on it there. The Snowflake ecosystem is a notable example: the data is ingested into the warehouse via an ELT process and is then processed there by various tools.

Alternatively, the data can be “staged” virtually, in memory, for use cases that consume real-time data or require pre-processing. Real-time feeds often rely on change data capture (CDC), for instance, so the processing is applied while the data is in transit.

This transformation process is where much of the magic happens. Engineers write scripts, often in SQL or Python, that turn the raw data into something that can answer business questions. There is a plethora of tools that specialise in data transformations, with dbt being a popular choice for in-warehouse transforms.

The transformation layer takes care of the business logic, data cleaning, quality checks and governance, among other things. As indicated previously, this layer can run either in a data warehouse or in memory. In the warehouse case, the pipeline is split into EL and T parts, where the T takes place separately from the pipeline.
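To make the CDC idea concrete, here is a minimal sketch that derives change events by comparing two snapshots of a table keyed by primary key. This is a simplification: production CDC is more often driven by the database's transaction log, and `diff_snapshots` and the sample fields are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]
Change = Tuple[str, int, Optional[Row]]


def diff_snapshots(previous: Dict[int, Row], current: Dict[int, Row]) -> List[Change]:
    """Return (operation, key, row) events that turn `previous` into `current`."""
    changes: List[Change] = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes
```

Downstream steps then only have to process the rows that actually changed, rather than the full table on every run.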

In an ETL case, transformations take place as part of the data pipeline. The manipulation of the data occurs “in-flight” before the data ever gets to its intended destination, be it the aforementioned warehouse or directly to the consumption layer (e.g. systems, apps, dashboards, etc).
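As a rough illustration of an in-flight transform, the generator below cleans and normalises records one at a time before they reach the sink. The field names (`customer_id`, `email`, `country`) and the rule for rejecting records are assumptions made up for this sketch.

```python
from typing import Dict, Iterable, Iterator


def transform(records: Iterable[Dict]) -> Iterator[Dict]:
    """Clean and normalise records as they stream through the pipeline."""
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        if "@" not in email:
            continue  # quality check: skip records without a usable email
        yield {
            "customer_id": int(rec["customer_id"]),
            "email": email,
            "country": (rec.get("country") or "unknown").strip().upper(),
        }
```

In a real pipeline you would probably route the rejected records to a quarantine area for inspection rather than drop them silently.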

Persisting the Data (L)

Once transformed, data needs to be persisted into a warehouse or system store for onward consumption and archiving. It must be organised into tables and schemas that reflect the business context and support efficient querying. The loading process may be scheduled in batches or streamed in real-time, depending on the nature of the data and the use cases.
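A load step is easier to operate when it is idempotent, so that re-running a batch does not create duplicates. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a real warehouse; the `bookings` table and its columns are hypothetical.

```python
import sqlite3
from typing import Dict, Iterable


def load_bookings(conn: sqlite3.Connection, rows: Iterable[Dict]) -> None:
    """Load a batch so that replaying it leaves the table unchanged."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            """CREATE TABLE IF NOT EXISTS bookings (
                   booking_id  INTEGER PRIMARY KEY,
                   customer_id INTEGER NOT NULL,
                   amount      REAL NOT NULL
               )"""
        )
        conn.executemany(
            "INSERT OR REPLACE INTO bookings (booking_id, customer_id, amount) "
            "VALUES (:booking_id, :customer_id, :amount)",
            rows,
        )
```

Keying the table on `booking_id` means a replayed row overwrites the earlier copy instead of adding a second one.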

Quality Assurance and Testing

Throughout the ETL process, quality assurance is crucial. Data engineers implement automated tests to verify that each stage of the pipeline behaves as expected. This might involve checking that data is complete, that transformations preserve data integrity, and that loading processes do not introduce errors.
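The checks described above can be as simple as a function that returns a list of problems for each batch. This is a minimal sketch, assuming the source reports its row count; `check_batch` and its parameters are hypothetical names.

```python
from typing import Dict, List, Sequence


def check_batch(
    source_count: int, rows: List[Dict], required: Sequence[str], key: str
) -> List[str]:
    """Return human-readable problems found in a batch; empty means it passed."""
    problems: List[str] = []
    if len(rows) != source_count:  # completeness against the source
        problems.append(
            f"row count mismatch: expected {source_count}, got {len(rows)}"
        )
    for i, row in enumerate(rows):  # required fields must be populated
        missing = [f for f in required if row.get(f) in (None, "")]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
    keys = [row.get(key) for row in rows]  # keys must be unique
    if len(keys) != len(set(keys)):
        problems.append(f"duplicate {key} values")
    return problems
```

A pipeline run would typically fail, or at least raise an alert, whenever the returned list is not empty.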

Deployment and Monitoring

After thorough testing, the pipeline is ready for deployment in production.

Yet, the engineers’ work is never finished. They must monitor the pipelines, ensuring data flows smoothly and efficiently, and that any errors or bottlenecks are swiftly addressed (and don’t bring the house down in the process!).
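One of the simplest monitoring checks is data freshness: has the feed delivered anything recently? The sketch below is illustrative, with `freshness_alert` and its parameters being hypothetical; in practice the message would go to your alerting tool of choice.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_alert(
    feed: str,
    last_loaded_at: datetime,
    max_lag: timedelta,
    now: Optional[datetime] = None,
) -> Optional[str]:
    """Return an alert message if the feed is staler than `max_lag`, else None."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_loaded_at
    if lag > max_lag:
        return f"{feed}: no new data for {lag}, allowed lag is {max_lag}"
    return None
```

Both timestamps are expected to be timezone-aware, so the comparison stays correct regardless of where the job runs.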

There are numerous requirements placed on production data pipelines to ensure they are robust and provide reliable data feeds (see the table below). This functionality is scripted by the developers, provided by separate tools, or both, to varying degrees. The more critical the data pipeline is, the more thorough the checks will be.

Data lineage	Data tables management
Auditability	CI/CD versioning and deployment
Data quality management	Data archiving
Error management	Data monitoring
Data recovery	Scheduling
Late arriving data management	Automated alerting
Change Data Capture (CDC)	Schema drift management
Stream and batch processing	Cloud integration
Metadata management	On-prem integration
Reliable data ingestion	Testing framework
Complex data aggregations	High volume processing
Slowly Changing Dimensions (SCD)	Automatic state management
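To give a flavour of what one of these requirements involves, here is a minimal sketch of schema drift detection: comparing an incoming record against the columns and types the pipeline expects. The function name and the sample schema in the usage note are hypothetical.

```python
from typing import Any, Dict, List


def detect_schema_drift(
    expected: Dict[str, type], record: Dict[str, Any]
) -> Dict[str, List[str]]:
    """Report columns that were added, removed, or changed type."""
    added = sorted(set(record) - set(expected))
    removed = sorted(set(expected) - set(record))
    retyped = sorted(
        name
        for name in set(expected) & set(record)
        if record[name] is not None and not isinstance(record[name], expected[name])
    )
    return {"added": added, "removed": removed, "retyped": retyped}
```

For example, with an expected schema of `{"sku": str, "qty": int, "price": float}`, a record `{"sku": "A1", "qty": "3", "currency": "EUR"}` is reported as having `currency` added, `price` removed, and `qty` retyped. What the pipeline does next (fail, quarantine, or evolve the target schema) is a design decision for each use case.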

The Consumption Layer

Finally, the processed data is channelled into a system or analytics dashboard. If you are running a reservation system, say, bookings data will flow in real time to power the algorithms. Alternatively, for data visualisations, tools like Tableau, Power BI, or Looker are employed to create interactive dashboards. Here, the data becomes insight, informing decisions and sparking ideas that drive the business forward.

How The Pipeline Comes Together

Consider a multinational retail chain that wishes to integrate sales data from various point-of-sale (POS) systems across the globe. The data engineers decide to use Kafka for real-time data ingestion, ensuring a steady and scalable flow of sales data. They implement an Apache Spark-based transformation layer that standardises the data and enriches it with currency conversion and time zone alignment. Microsoft Azure Synapse serves as the data warehouse, thanks to its analytics capabilities and its integration with other Microsoft services. The final dashboards are built in Power BI, offering regional managers real-time insights into sales trends and inventory needs.
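The currency conversion and time zone alignment step from this example could look something like the sketch below. It uses plain Python rather than Spark to stay self-contained, and the field names and the static `RATES_TO_USD` table are assumptions for illustration only.

```python
from datetime import datetime, timezone
from decimal import Decimal
from typing import Dict

# Hypothetical static rates; a real pipeline would look up the rate per day.
RATES_TO_USD = {
    "USD": Decimal("1"),
    "EUR": Decimal("1.10"),
    "JPY": Decimal("0.0070"),
}


def standardise_sale(sale: Dict[str, str]) -> Dict[str, str]:
    """Convert a POS sale to a USD amount and a UTC timestamp."""
    local_time = datetime.fromisoformat(sale["sold_at"])
    if local_time.tzinfo is None:
        raise ValueError("sold_at must include a UTC offset")
    amount_usd = Decimal(sale["amount"]) * RATES_TO_USD[sale["currency"]]
    return {
        "store_id": sale["store_id"],
        "sold_at_utc": local_time.astimezone(timezone.utc).isoformat(),
        "amount_usd": str(amount_usd.quantize(Decimal("0.01"))),
    }
```

`Decimal` is used instead of floats so that monetary amounts do not pick up binary rounding errors along the way.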

The pipeline unlocks new analytical capabilities that directly lead to business value – cross-selling capabilities, dynamic product packaging and real-time up-selling.

Time and Cost Considerations

The time to construct a production data pipeline can vary tremendously, from a few weeks to several months or more, depending on the complexity and scale of the task. Simple pipelines might take shape quickly, while more intricate systems, with numerous data sources and sophisticated transformations, require much longer time frames, especially in mission-critical settings.

This is one of the biggest stumbling blocks to businesses adopting data analytics on a much larger scale.

Production pipelines take considerable effort from highly skilled data engineers to create and manage, especially since the design specifications vary from one pipeline to the next and keep evolving over time (e.g. new data sources, different transforms, changing sinks).

The pipelines are expected to perform. If a pipeline stumbles in a mission-critical system, the consequences are severe. Imagine your pipeline brings down the revenue management system, and the company loses a million bookings. I wouldn’t want to be the guy on the receiving end of the CFO’s wrath.

This is why production-grade data pipelines must be robust and resilient. And that is why they take so much time to develop and maintain properly.

Try something different

At IOblend, we believe that developing and maintaining production data pipelines should be much simpler than it currently is. To that end, we have built a low-code / no-code solution that embeds the production features into every single data pipeline, significantly cutting pipeline development effort and cost.

Design, build and test each ETL component as you progress to “in-prod”. Whatever your use case is (data migration, simple or complex integrations, real-time and batch analytics, data syncing, pipeline automation, etc.), IOblend will make it much easier and quicker to develop. Whether you do ETL or ELT, batch or real-time, it makes no difference. All data integration use cases are covered in one tool.

The technology to help make data pipeline engineering easier is evolving fast. Find what works best for your use cases, explore, and get started!

admin
