Q&A with Barry Devlin on Automated Data Pipelines

By |2021-12-03T13:05:42+00:00November 16th, 2021|

Automated data pipelines are a crucial piece of the data warehousing puzzle. They allow modern enterprises to maintain accurate and high-quality data that fuels their BI and analytics initiatives. We recently launched Astera DW Builder, a robust data warehouse builder that radically accelerates the construction, orchestration, and management of data pipelines to enable faster, data-driven decisions.

Astera DW Builder’s self-regulating data pipelines allow you to seamlessly populate your data warehouse with cross-enterprise data. Using the product’s automated ETL and ELT functionalities, you can cover the data journey from source to insights with minimal manual effort.

We recently had the chance to chat with Barry Devlin, a foremost authority on all things data warehousing, from business intelligence best practices to big data and beyond. Here are some of the insights Barry provided on key considerations for data pipelining in the modern EDW.

Modern data warehouses process massive volumes of data. When it comes to building data pipelines that can actively deliver these large amounts of data, are there any best practices that you’d recommend?

Well, before talking about best practices, I’d like to do one best practice upfront myself which is defining and elaborating on some of the phrases we’ll use here. [That’s] because there are very many different meanings of key words and phrases in the marketplace at the moment.

So first, data pipelines, many people use this phrase to mean a complete integrated pathway from source data all the way to the tools used by the businessperson. I see that your use of the term is less focused on that full soup to nuts pathway but rather the possibility of a shorter pipeline between, say, different components like an SAP instance and, a website click stream to an enterprise data warehouse. I believe that this is a better approach, and the one that’s more likely to succeed.  

Secondly, there is the question of what we mean when we say data warehouse. Because there are many contrasting approaches. So, in most cases, when I say data warehouse, I mean a two-layer structure consisting of an enterprise data warehouse, an EDW, and a loosely third normal form — or more likely today, data vault model — feeding multiple data marts, many of which may be dimensional in nature. And this doesn’t exclude other structures most of which are a simplification of this two-layer approach.

Finally, I’ll use the phrase “data or information preparation” to cover all types of processing like ETL, ELT, streaming, data warehouse automation, and even data virtualization because all these latter terms are specific instances of preparing data and information or the technical approaches to it.

So having said that upfront, let me talk about the best practices that you’ve asked me about and I’m going to limit myself to three, but there are many more.

So first up, I’d say easy-to-use comprehensive tooling and automation — and that’s fairly obvious. Data prep [preparation] is the most expensive part of a data warehouse. So, setting up teams of programmers to build multiple inconsistent and unmaintainable bespoke solutions is, or should be, a thing of the past. That’s point one.

The best practice two is: get the design phase right by involving the business closely and continuously in the process. This means iterative, agile development approaches [that is] less on the delivery of business apps but more on the delivery of useful parts of the data warehouse information scope.

And thirdly, I’d say maintenance and change management. These are key features that are often neglected. So, focus on ensuring that upgrades and maintenance are considered early and put a process in place that is closely linked to the initial design and build.

So, coming back to my first point, tooling and automation play a central role in making this happen. So, there are three best practices that I think are really important in this area.

ETL has been synonymous with data warehousing since the technology’s inception. How has this process evolved in the age of so-called big data that we have now? What type of innovations are you seeing that can bring down the cost and complexity of traditional ETL?

I believe we always have to learn from history. Data preparation, in my view, has evolved in fits and starts with frequent backward steps. For example, recently initial data lakes and cloud-based data warehouse were very often populated via scripts and similar programmatic solutions in Hadoop. That’s been a big disappointment to me.

It’s good to see now a refocus on automated tooling. The trend of the long term has been broadly from handcrafted and bespoke solutions to the use of ETL — and more recently ELT tools. This has been supplemented by a shift from batch delivery of data to a variety of incremental delivery approaches as data volumes increase and the business demands more and more timely information.

In the old days, ETL was all from operational systems often old-fashioned in design with many arcane codes. Understanding the complexities as well as deep business rules which were obsolete but were still in the system was a key requirement for an ETL programmer.

And then there was the absolute demand to limit the impact on the operational systems, [in terms of] both their design and performance. ETL engineers had to be deeply experts in the mysterious structures and content of these legacy source systems.

The good news now is the operational systems are often better designed, more modern, and sometimes cloud-based. So, there’s less need to deal with arcane systems and more focus on innovative design and target systems. Of course, [it] brings down the costs and complexity of data preparation.

However, external data sources like the internet of things (IoT) do drive new needs both in terms of timeliness of data and the types of analytics involved. So, on the technical side, we also have to consider streaming versus batch load or replication approaches.

Less archaic sources also enable a shift from bespoke, hand-coded transformations to more of a model or metadata-driven approach to information preparation which is really at the heart of data warehouse automation.

You mentioned a couple of times there ELT is as a newer approach that’s taken hold. How do you see that in relation to ETL? Are these complementary approaches or is there a versus type of thing going on here?

I think there’s a lot of questions around that but, first, I’d just like to say something that I think ETL was a really unfortunate choice of terminology in the early days. Extract, transform, and load — because it focuses attention on the sequence of actions rather than the overall process, which is why I prefer to use the term data or information preparation.

The order or location of the steps is less important to me than how we actually perform them. Hopefully, in some meta-data driven way. So, it’s worth noting as well that transformation is the most complex and important step of the process and understanding where it happens turns out to be a key design consideration.

So, that said and back to your question. ELT — extract, load, and transform — simply means that the transformation occurs in the target environment using the computing power [of the] approaches such as a relational database found there. Is that good? Yeah. Mostly.

Transforming on the source system, maybe that should have been called transform, extract, and load [TEL], is always done to some degree but was historically kept the absolute minimum to reduce source system impact. Transforming on an intermediate system — ETL — was then an obvious choice, minimizing impacts on both source and target systems.

However, target systems especially on the cloud have little or no performance constraints and the target relational DBMS has good transformational abilities, so ELT makes a lot of sense, and yes I do think they are complementary and particularly an approach that combines ETL and ELT seems to me to be the most flexible and powerful one, especially if there’s some intelligence and automation in choosing when and in what circumstances the function is pushed from the ETL server [i.e.,] where it is first defined and run to the target system. So, absolutely! Not a question of choosing between the two of them. I think they can both co-exist, and I think that’s a really good approach.

How do you ensure that correct data is brought into your data warehouse that is combined and transformed optimally to suit reporting and analytics requirements? What are some tips that you would give to organizations that are looking to ensure the quality of data in their enterprise data warehouse?

Oh, they’re great questions. I mean quality is central to all forms of decision-making based on data or I would say more proper information. So, the first thing for me is always a thorough and quality data or information modeling job. I know that some people will throw up their hands and say, “Oh, too slow [and] too expensive. Let’s do schema on read” But honestly, I don’t think there is any other way to get highly consistent data in your warehouse other than modeling upfront.

So, that’s really important, and I think this modeling can and must be done in stages — an agile staged approach. That’s a huge topic and probably for another day. I know that many organizations have settled on a fully dimensional data warehouse approach and that can be ideal for up to mid-size businesses.

But for larger enterprises, a consistent store of reconciled core data and EDW is generally preferable to create and maintain optimal quality data and information. And finally — beyond modeling, employing a good and flexible modeling team is probably the next most important aspect of implementing a top-class data preparation environment.

Especially, in mid-sized organizations, I would suggest that a data warehouse automation approach is probably best both in terms of quality and cost because typically that does include the modeling step in the process of moving forward.

[Say] this architecture you have is built. You have multiple data pipelines coming from a variety of source systems. You mentioned operational systems, obviously, IoT streams perhaps, [so] there’s a lot of possibilities there when you’re talking about the modern enterprise. You have different data latencies working here and interdependencies in many cases. Orchestrating all of this is obviously a tough ask. What would you say are the essential elements for getting that part of the equation correct?

Well, first, I’d like to say that we have always been dealing with multiple data pipelines coming from various source systems with different data latencies and interdependencies as you’ve mentioned — certainly, in large enterprises that I’ve dealt with.

I think the more important aspect today and you’ve mentioned it is that we’re looking to ingest and analyze huge quantities of raw data and that’s often that dirty data — and it’s unknown how dirty it is — from the web and the internet of things. And such data has very different characteristics from operational system data which I call process mediated data, or PMD for short, that were the main sources of data until let’s say 10 years ago.

So, given that difference in data quality, I’d say that trying to push all this new data into the same data warehouse as your PMD is a mistake. You need multiple pillars of data processing. I explained all that in my book “Business Unintelligence”, so if you want to go dive deeper there, please do so. Some of these pillars will be traditional data warehouses others will be more like data lakes, but all will, of course, be fed with data pipelines as I defined earlier in the interview.

So, key to all of that working well, of course, is to ensure that data in these pillars is well interconnected, related, and reconciled as opposed to what we saw in traditional data silos and what you call orchestration across the data pipelines is a key component in making this happen.

Now this brings us back to modeling but also to metadata because an essential element in orchestrating all these pipelines is metadata. Now I prefer to call it context-setting information, or CSI, because this word ‘meta’ is getting a bit abused these days, even Facebook has adopted it.

So, this kind of context-setting information is what is contained and mentioned in Gartner’s Data Fabric Architecture and is really this idea of being active and actively maintained and kept internally consistent and aligned with changing business often Gartner suggests you need to use AI [artificial intelligence] to do that.

So, getting this idea of active metadata involved is really important. My view is that we’re actually quite far from this today, but we really do have to focus on it because I think it’s really going to either make this whole system work or it’s going to fall apart.

So talking about the actual data warehouse and where it’s built. Obviously, the cloud seems to be where everybody is taking their BI systems nowadays. What are some considerations enterprise data teams should keep in mind when building data pipeline for these platforms?

Yeah, I mean everybody’s talking about cloud. In architectural terms, my view is cloud is simply [that it’s] another storage pool albeit huge and highly distributed, and if your data preparation tools support cloud sources and targets then you’re good to go, right? Well, I’d say not quite.

It’s [the] cloud’s distributed nature that you must first and always keep in mind. Data management was a lot simpler, I always say, when all your data was in a single mainframe. But copies of data are the bane of data governance and the more places and the further apart these copies are, the tougher the challenge becomes. So, the cloud enables and — to some extent — promotes data duplication, some of it [is] largely invisible.

So, my first advice in terms of cloud is to beef up on your data governance and management teams because they’re really going to be needed. The second consideration is what’s called data gravity. Where does the data originate?

If it comes from on-premises there will be costs to lifting it up to the cloud. [And] if it comes from the cloud, it’s generally more effective to keep it there.

Of course, most businesses today have data from both sources, so there’s no single correct answer as to where to store and process your data. But clearly the pressure is on to get it onto the cloud, often for financial reasons. But here’s the rob — the main thing here is to keep an eye on the relative costs of storage processing and data transport, and I think it’s that last one that can catch you out, especially when we’re talking about data preparation and data pipelines.

So, moving on here to the final question and I guess it’s the one we’ve been skirting around and mentioning in different ways and in some previous answers, but to just get an approximate where do you think automation fits into all of this? How can it make the process of building and maintaining data pipelines more efficient?

Yeah, I think you’re right. Everything now points towards automation. As I’ve already said, I mean I’m a big proponent of automation in data and information preparation. That said, I’m deeply wary of automation in other areas, particularly where life-affecting decisions are being made and I think we have to be careful about automation there.

But back to data preparation, I think anything that reduces the cost of what is effectively just plumbing is highly welcome. And automation should be applied at every stage of the process, wherever possible — from initial specification and design to daily operations and all the way through to ongoing maintenance and change management.

That’s a very broad and very wide scope of what it is that we would need to do and what we need to automate. I just don’t throw in one caveat, you know, based on my long years of experience and that is that automation tends to push you towards predefined transformation functions, and they’re great! But the ability to create custom transformations that are unique to your own environment is often key to the in-house success of data preparation tools.

Such needs may be only five percent or less than the [predefined] transformations but their importance is often easily underestimated because they’re often the most important transformations from a business point of view.

In my view, data warehouse automation tools need to allow these very specific transformation routines to be developed and embedded within the workflow. That said, automation does offer the opportunity to reduce or eliminate the repetitive and common parts of the data preparation process. And data warehouse automation tools also augment the human aspects of the process.

I’m thinking here about requirements definition, consensus building, change management, and so on. In this sense, DWA is a bit like artificial intelligence, in that it contains aspects of both automation and augmentation. So, I think that’s a particularly important aspect of what it is that we need to think about.

As a final thought, I’ve always felt that data warehouse automation like ETL was an unfortunate choice of name for this class of tools because these tools do more than automation as I’ve just said, and they can and must apply to a far wider range of data storage and management systems than just data warehouses.

All parts of the name have got their weaknesses. So, what the name should be? I don’t really know unless you’d like to try and call it the Swiss army knife of data management and I think that’s probably a good place to stop.

Astera DW Builder: Automate Your Data Pipelines

Astera DW Builder is a unified data warehousing solution that allows you to build a metadata-driven data pipeline architecture. The end-to-end platform offers a zero-code environment to develop your data warehouse on a logical level and automate your design and engineering processes to get rich insights that satisfy your business intelligence requirements.

The agile, automated solution comes with best-in-class ETL and ELT capabilities, including built-in workflow orchestration and job scheduling components, to build self-regulating data pipelines. Take Astera DW Builder for a test drive to learn how it can help deliver accurate and reliable data to your BI and reporting architecture.