Automated data pipelines are the backbone of a data-driven ecosystem. They enable enterprises to extract data from disparate sources, apply transformations, and integrate it efficiently, reliably, and quickly. More businesses are opting for data warehouse automation to improve data analytics and compete more strategically.
We recently launched Astera DW Builder, an end-to-end data warehouse automation platform that provides an iterative, code-free environment to design, develop, and deploy data pipelines at unprecedented speeds.
To educate modern businesses on self-regulating data pipelines, we hosted a live webinar titled "Futureproof Your Data Warehouse with Self-Regulating Data Pipelines" on November 2nd, where we had the opportunity to speak with Paul Kellett. He has over 25 years of experience working on enterprise-level business intelligence projects for organizations.
In our Q&A session, we got some valuable insights on building high-quality, automated data pipelines, modern data warehouse processes, cloud data warehousing, and more.
Afnan: Modern data warehouses process massive volumes of data. Would you recommend any best practices that people should use to build data pipelines that can effectively deliver these large amounts of data to their data warehouse?
Paul: Yes, but I would also add that it's not just the volumes of data. It's the variety of sources, the variety of formats of those sources, and the fact that, particularly in a corporate environment, you're frequently accessing tens of systems that are in a constant state of change. So, the type of data that you're getting will generally change.
Those systems don't stand still; businesses innovate and change, so you're looking at several problems here. You need to get the data in reliably, and you need to process it in a robust manner with as few interventions as possible. Historically, people might build a whole series of extracts out of their source systems; they'd be doing point-to-point solutions where you could have several different mechanisms for receiving the data. I would say, try to have a single, standard mechanism so you have a single type of practice.
You then need to put in place tools that are suited for these things. So, avoid handcrafted or point-to-point solutions as much as possible. What we see in a lot of historical data warehouses is a number of bespoke, handcrafted solutions to get the data from System A and a different one for System B. They end up with quality and robustness issues, they end up with maintenance issues, and they tend to adapt rather slowly to change.
So, it's a triple whammy in terms of doing that. You want to be using things that do the heavy lifting for you. You don't want to be repeating standard stuff like error handling. You need it to be simple, easy, robust, consistent, and standard. My last point on that question would be to try, if possible, to go and pull the data from the source systems rather than have it provided to you as an extract.
Afnan: Data pipelines and ETL are concepts that have been synonymous with data warehousing since the inception of the technology. So, how do you think ETL and data pipelining have evolved in the age of big data? What type of innovations do you think could bring down both the cost and complexity of traditional ETL?
Paul: A lot of the costs historically have come from two main areas. One has been handcrafted solutions, which are quite expensive and quite limited. The other, and I'm not getting at ETL tools here, is that those tools have been big and expensive. They require specialized resources and their own dedicated infrastructure, hardware, servers, and platforms, and the resources they require are difficult to get hold of.
So, what we're seeing now is a move to make those types of processes easier. Rather than having to work out what you're going to get, these tools go and build the mapping for you automatically. It's a lot more point and click than has historically been the case. So, we're seeing that essentially bring down the need for manual coding and speed things along.
Afnan: One major requirement that we're seeing come up a lot is that more organizations want to build ELT pipelines now instead of traditional ETL pipelines. So, what do you think about that approach? Do you think it could work for every organization, or are there certain things that organizations need to keep in mind before going for ELT instead of ETL?
Paul: So firstly, there's never one solution that works for everything. There are cases where ETL is quite suitable, in fact preferred. But what we are seeing is that the preferred starting point nowadays would probably be ELT. Database software and architectures have improved substantially. One of the historical drivers for ETL was the database's inability to process the large volumes of transforms required in the available time scales; databases can now handle a very large number of those use cases.
I personally moved towards ELT. I can't remember the last time I did ETL; it would have been at least ten years ago. The main driver would be simplicity. You've made your solution simpler: you have one less platform to go wrong and to stand up, as well as one less set of test platforms to go through. So, you've dropped your complexity.
You've also got cost, in that you don't have those extra platforms, so the things that were driving the need for ETL have essentially lessened. If I were looking today at a greenfield environment, I would assume my starting point would be ELT and then move away from that if I felt I needed to because of some special circumstances.
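To make the ELT pattern Paul describes concrete, here is a minimal sketch in Python. It uses sqlite3 purely as a stand-in for a real warehouse, and the table and column names are hypothetical: the raw rows are landed in a staging table untouched, and the transform runs inside the database afterwards.

```python
import sqlite3

# Minimal ELT sketch: land raw data first, then transform inside the database.
# The warehouse (sqlite3 here) and all table/column names are illustrative stand-ins.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_sales (sale_id INTEGER, product TEXT, amount TEXT)")

# 1. Extract + Load: copy source rows as-is, with no transformation on the way in.
raw_rows = [(1, "Widget ", "10.50"), (2, None, "4.00")]
conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", raw_rows)

# 2. Transform: let the database engine do the heavy lifting after loading.
conn.execute("""
    CREATE TABLE fact_sales AS
    SELECT sale_id,
           TRIM(COALESCE(product, 'Unknown')) AS product,
           CAST(amount AS REAL)               AS amount
    FROM stg_sales
""")
print(conn.execute("SELECT * FROM fact_sales").fetchall())
```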
Afnan: How can you ensure that you have the correct data in your data warehouse, and that it's being combined, consolidated, and transformed in a manner that suits your reporting and analytics requirements?
Paul: So firstly, you can't really get guaranteed correct data. The reason for that is you are reliant on the data the source systems give you and, as anyone who's worked in the area will testify, they will often give you incorrect data, inconsistent data, or data that has issues when presented in a different fashion and gives you the wrong picture.
But what you can do is try to give the best possible picture of the data in the best possible way. You shouldn't be setting yourself up by saying we're going to give perfect data, because that doesn't happen. Luckily, it's not that important, because you're generally talking analytics and understanding the data in volume, so it's not necessarily an issue if you manage it correctly.
If you want to get the best possible data, there are a couple of tactics I'd advise. One of them would be to over-collect. If we take the example of, say, sales transactions: you're asked to provide sales reporting or sales analysis, and someone will work out that you need fields A, B, and C from these two tables, plus a few fields from elsewhere, and you'll get just the data that's required to solve the problem.
My advice generally is if you need sales information, grab the entire sales transaction and all the associated data. Also, take it in as untransformed a fashion as possible. Don't run the risk of introducing your own translation errors in a transform or some derivation of the data on the way in. Bring it into your data warehouse and do the work there.
I'd also be looking to build in some feedback loops, so that I'd have ways to do high-level checks confirming I've got the data I'm expecting to get. That typically means using trusted reports or figures out of the source systems and matching them against something similar derived in the warehouse as you go through.
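A feedback loop like that can be as simple as comparing a trusted control figure from the source system against the same figure derived in the warehouse. The sketch below is one hypothetical way to express it; the totals and tolerance are purely illustrative.

```python
def reconcile(source_total: float, warehouse_total: float, tolerance: float = 0.01) -> bool:
    """High-level check: does the warehouse-derived total match the trusted source figure?"""
    drift = abs(source_total - warehouse_total)
    if drift > tolerance * max(abs(source_total), 1.0):
        print(f"WARNING: totals differ by {drift:.2f}; investigate before publishing.")
        return False
    return True

# e.g. daily sales total from a trusted source report vs. the total the warehouse derives
reconcile(source_total=125_430.00, warehouse_total=125_398.50)
```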
It's important to understand what's good enough for the business. For instance, historically, accounting transactions have to be perfect to the cent, but if your sales figures are off by a little bit, it's not the end of the world. On top of that, there are standard tricks and techniques like consistent data formatting and stripping trailing spaces and stray characters like commas. Make up your mind that you're going to do these and do them in a standard way.
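Those cleanup rules are easiest to keep standard if they live in one shared routine that every pipeline calls. A small, hypothetical sketch of what that might look like:

```python
def standardize_text(value):
    """Apply the same basic cleanup everywhere: trim whitespace and stray trailing characters."""
    if value is None:
        return None
    cleaned = value.strip().rstrip(",;")    # strip trailing spaces and stray delimiters
    return " ".join(cleaned.split())        # collapse internal runs of whitespace

print(standardize_text("  ACME  Corp ,"))   # -> "ACME Corp"
```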
Afnan: When you're talking about collecting all this data from different sources, you will be dealing with multiple data pipelines, and all these pipelines will obviously have different data latencies and interdependencies. So, what do you think are the essential elements for orchestrating these pipelines?
Paul: There are pretty standard, well-tested techniques in dimensional modeling out there. Kimball is a very good place to start to go and look at the type of advice and design techniques they've given. These are very well suited for building your data warehouse in a manner where your data is consistent and presented in a common format as you go forward.
They will handle things like missing information. So, if you don't have XYZ coming through from a particular source, say you don't know a product definition, there are standard techniques: I will go and create a placeholder product so at least my sales report adds up by product. I might not know the product details, but I'll know that I've got sales against, say, a product called freight. I don't know any more about that product, but at least it's captured.
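The placeholder technique Paul mentions is a standard dimensional-modeling pattern: when a fact row arrives with a product the dimension doesn't know yet, point it at a reserved "unknown" member instead of dropping the row. A minimal sketch, with hypothetical keys and product codes:

```python
# Dimension lookup with a reserved "unknown" member, so fact rows are never dropped
# just because the product definition hasn't arrived yet. Keys and codes are hypothetical.
UNKNOWN_PRODUCT_KEY = -1
product_dim = {"P100": 1, "P200": 2}   # source product code -> surrogate key

def resolve_product_key(source_product_code):
    """Return the surrogate key for a product, or the placeholder key if it's unknown."""
    return product_dim.get(source_product_code, UNKNOWN_PRODUCT_KEY)

print(resolve_product_key("P100"))     # 1
print(resolve_product_key("FREIGHT"))  # -1: the sales still add up; details can be filled in later
```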
The second thing is that you need to drive the way you process your information off the data content itself [the metadata], not off how the data is delivered or accessed. So, it's things like: if on Monday you're getting Sunday's transactions, don't assume you're getting Sunday's transactions. Drive it all off the dates within the data. Always try to grab as many dates as possible out of the data so you know what's going on, and that way you can match things against each other.
Then you're going to get some inconsistencies between systems. Particularly when you've got tens of systems delivering to your data warehouse, invariably one of them will be down or unavailable at some point, and this will happen on a frequent basis. To that end, present what is missing as part of your solution; make it clear and obvious that, say, we don't have Monday's inventory data for distribution center 27.
Deal with it as part of your processing. Those would be my main comments: use the data to drive it, Kimball is king, and make sure the business knows when things that should have appeared have not.
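Both points above, driving processing off the dates inside the data rather than the load schedule, and surfacing what has not arrived, can be illustrated with a short sketch. The sources, dates, and row counts here are hypothetical:

```python
from datetime import date

# Rows as they actually arrived, keyed by the transaction date inside the data,
# not by when the load job happened to run. All names and figures are illustrative.
arrived = {
    ("sales_system", date(2021, 11, 1)): 1_042,      # row counts received per source per day
    ("inventory_dc27", date(2021, 10, 31)): 310,
}
expected_sources = ["sales_system", "inventory_dc27"]
business_date = date(2021, 11, 1)

# Surface what is missing as part of the load, rather than silently loading what showed up.
for source in expected_sources:
    if (source, business_date) not in arrived:
        print(f"MISSING: no data for {business_date} received from {source}")
```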
Afnan: Cloud data warehousing has been picking up a lot of steam; especially this year, we've been hearing about it all over the place. So, what do you think are some considerations that enterprise data teams need to keep in mind when they're building data pipelines specifically for a cloud data warehouse?
Paul: Okay, so I'm going to assume that we're talking about a purchased cloud service for hosting and managing your data warehouse infrastructure. From a technical perspective, there's not a huge amount that is that different about going to the cloud.
The main technical differences would be that you're obviously on the internet, as it were, and you may be moving large data volumes around, so you've got to give a lot of thought to how you're going to move those large volumes. Are your source systems and your cloud-hosted infrastructure close enough to each other, from a network perspective, that you can move these things? And are the connections between your various systems robust enough that you will have reliability, again, in the data?
The other element to look at is that data warehousing solutions frequently include dashboard-type elements, and dashboards need to feel responsive. They need to react quite quickly to users: you click here, and you go and get the next set of data down.
Latency does matter. If the ping time between your users and your cloud infrastructure is high, that can make your dashboards look bad even though they're not. Most of the other considerations are commercial, regulatory, or infrastructure-related. When you're going to the cloud, you're typically picking a vendor. So, you're now not just technology reliant; you're reliant upon a vendor to keep its systems up.
It's very much about measuring the vendor and their abilities rather than the technology. Some of the potential regulatory issues are that, where I'm based for example, you basically are not allowed to take health data out of the country without special permission, because that's personal data and there are rules about what you do with personal data.
Similarly, there's data security to look at, in that you're now handing responsibility for looking after your data to a third party. Actually, they're probably going to be better at data security than you are, because it's part of their day-to-day business, but you still need to check on that. In fact, I'd say that's probably one of the areas where you can rest a little easier.
One of the things about moving to the cloud is that you get a lot more capability in terms of your ability to adapt. There are a number of cases where I've been with clients and their data warehouse has been sitting on 10-year-old architecture that's been slowly creaking, with the daily loads arriving later and later in the morning. So you're not getting your reports till noon, but the task of moving has been incredibly difficult.
They've had all sorts of issues trying to recruit and retain resources capable of doing that sort of work, so being able to hand a lot of that problem to someone else helps. Don't do it for cost purposes, though, because it will generally cost about the same; even though the costing model may differ, you're buying more and getting better. So those would be some of the considerations for moving to the cloud.
Afnan: Where do you think automation fits into all of this? [Using] automation and orchestration, how do you think you can make the entire process of building and maintaining your data pipelines more efficient?
Paul: Firstly, as much as possible, avoid point-to-point solutions; have something that does the heavy lifting for you. You want something that does the monitoring for you. Frequently, these loads are happening in the middle of the night. You want standard automated capabilities like being able to restart from a particular point, skip steps, and all that sort of job control and job management functionality.
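That kind of restartability usually comes down to recording which steps have completed, so an overnight failure can be resumed rather than rerun from scratch. A minimal, hypothetical sketch of the idea:

```python
import json
import os

CHECKPOINT_FILE = "pipeline_checkpoint.json"   # hypothetical state file

def run_pipeline(steps):
    """Run named steps in order, skipping any already recorded as complete,
    so a failed overnight load can be restarted from where it stopped."""
    done = set()
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            done = set(json.load(f))
    for name, step in steps:
        if name in done:
            print(f"skipping {name} (already completed)")
            continue
        step()                                      # if this raises, the next run resumes here
        done.add(name)
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump(sorted(done), f)

run_pipeline([
    ("extract",   lambda: print("extracting...")),
    ("load",      lambda: print("loading...")),
    ("transform", lambda: print("transforming...")),
])
```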
You want something that is essentially easy to build with. The easier it is to put the system together, the more data you'll get, the faster you'll get it, and the fewer errors you'll have when you bring that data in, and the more likely you are to do it the way the business wants you to do it.
I mean, as a side point, I frequently comment that we are our own worst enemy. If we are successful with a business intelligence solution, we normally know it because we're completely overloaded with demand. So, you've got to be able to do these things in the easiest possible way to cope with that demand.
Historically, the cost of creating and migrating data warehouses has been on the order of sixty percent, maybe even two-thirds, sitting on the ETL side, depending on your time scales. So, you really want to make sure that you've got something that does as much of the repeatable work for you as possible, in as simple a way as possible, because it's such a large share of what can be quite a significant cost.
Astera DW Builder: Automated Data Warehouse Platform
Astera DW Builder is an end-to-end data warehousing solution that allows you to develop automated data pipelines in a code-free environment. The unified platform comes with a metadata-driven architecture and streamlines your design and engineering processes to provide accurate and relevant insights that facilitate better decision-making.
Enterprises can create self-regulating data pipelines by leveraging Astera DW Builder's advanced ETL and ELT capabilities, such as its built-in workflow orchestration and job scheduling components. Try Astera DW Builder today to find out how it can add value to your organization.