Automation is transforming the modern corporate landscape, helping businesses become faster and improve performance. It’s no surprise that the global artificial intelligence (AI) market is expected to break the $500 billion mark in 2023. AI holds the power to revolutionize business efficiency, and it’s high time for organizations that still rely on manual, archaic methods of processing documents to modernize.
In our recent webinar, we had the privilege to speak with Douglas Laney. Boasting 35 years of industry experience, Mr. Laney is a leading data and analytics expert. Currently, he’s serving as a Data and Analytics Strategy Innovation Fellow at West Monroe, where he consults business leaders on conceiving and implementing new data-driven value streams.
Douglas Laney previously served as Vice President and Distinguished VP Analyst at Gartner and is a three-time recipient of Gartner’s thought leadership award. He also originated the field of infonomics, developing methods to quantify information’s economic value and apply asset management practices to information assets.
In the webinar, we talked to Mr. Laney to get deeper insights into the key value drivers that make data extraction from unstructured document sources a vital task and how it can help streamline document processing.
Live Q&A Session with Douglas Laney on Data Extraction Automation
Host: Modern organizations are producing more and more data with time. It has been repeatedly stated that data is the currency of the future, so what are your thoughts on that? What value does data bring to an enterprise?
Douglas Laney: That’s a good question! You know, interestingly, information has always been a currency of sorts. Kings paid handsomely for information, for example, about their enemies’ troop movements. Even the term business intelligence was coined almost a century and a half ago by Richard Devins and Sinclair Hamilton in their 1865 book, The Encyclopedia of Commercial and Business Anecdotes, in which they recount how a gentleman named Sir Henry Furness was rewarded handsomely, including being given a diamond ring by King William, for bringing him news of battles throughout Holland, Flanders, and France. The first credit bureaus were paid by banks in the early 1900s for compiling information and anecdotes about the repayment propensities of businessmen.
Today, however, we’ve really taken this to the next level, codifying, automating, and even governing the practice of collecting and monetizing data. Where the analogy between data and currency falls short is in some of the unique economic qualities of data. Once you spend a dollar or a euro, you can no longer spend it again, and you can only spend it one way at a time. Data, on the other hand, is what economists call a non-depleting, non-rivalrous asset. You can use it over and over again without it getting used up, and you can use it multiple ways simultaneously. So, the businesses that have capitalized on these characteristics of data are really the ones winning in today’s digital economy.
Host: Would you say data plays a key role in the production process?
Douglas Laney: Yeah! In fact, I have come to the conclusion that data is the fifth factor of production. You know, economists at the turn of the last century identified four key factors of production: land, labor, capital, and entrepreneurship, and increasingly, data has become a substitute for almost all of them.
For example, no longer do manufacturers need massive warehouses because just-in-time inventory management systems substitute supply chain information for on-site inventory, and, of course, we’ve seen data and analytics replace number crunching and other knowledge workers, and today companies are paying for goods and services using data.
So, consider your own experience at the grocery store. Data and analytics are even used to come up with new business models, products, drugs, etc. So, I contend that data should be considered the fifth factor of production.
Host: We know that a lot of the data that businesses and organizations receive is in an unstructured format. Why does this unstructured data tend to be underutilized compared to structured data?
Douglas Laney: I think just because it is unstructured. Unstructured data is found in documents like PDFs, emails, social media posts, and multimedia; it’s data that’s not organized into neat little rows and columns. Unstructured data has to be processed to extract discrete information and insights. I’ve often said that unstructured content can only be read, shared, and edited until you actually extract or add some kind of structure to it.
There’s a lot of meat in there, and because of vagaries and nuances in things like language and semantics, tagging unstructured data or extracting from it is difficult to do, let alone do efficiently and consistently. However, since people say that between 80 and 90 percent of data today is unstructured, I think it’s really fertile territory for those looking to gain a competitive advantage.
Going back a couple of decades or so, I came up with the concept of the three V’s of big data: volume, velocity, and variety. We often talk about unstructured data having significant volume, which it does by its nature, but any organization’s unstructured data also comes from a great variety of sources.
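To make the idea of adding structure to unstructured content concrete, here is a minimal, hypothetical Python sketch; the field names, sample text, and regular expressions are purely illustrative, not part of any product. It pulls discrete values out of a free-text snippet so they can be stored in neat rows and columns:

```python
import re

# Hypothetical free-text snippet, e.g. from a scanned report.
text = """Claim #4471 filed on 2023-02-14.
Adjuster: J. Smith. Estimated damage: $12,500."""

# Extract discrete, queryable fields from the unstructured text.
record = {
    "claim_id": re.search(r"Claim #(\d+)", text).group(1),
    "date": re.search(r"filed on (\d{4}-\d{2}-\d{2})", text).group(1),
    "adjuster": re.search(r"Adjuster: (.+?)\. Estimated", text).group(1),
    "damage_usd": float(
        re.search(r"\$([\d,]+)", text).group(1).replace(",", "")
    ),
}
print(record)
# {'claim_id': '4471', 'date': '2023-02-14', 'adjuster': 'J. Smith', 'damage_usd': 12500.0}
```

Real extraction tools replace these hand-written patterns with trained models and templates, but the principle is the same: once the fields are discrete, the content becomes queryable.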
Host: We have established that around 90 percent of enterprise data is, as you said, unstructured. Do you have some insights on how organizations can integrate this unstructured data into their existing data pipelines and data warehouses?
Douglas Laney: Yeah. You know, it’s not enough to just drop unstructured content into our data warehouses or data lakes. I suggest you first really need to extract data from that content, or tag it and link to it in some way that makes it queryable. Even linking concepts across pieces of content to create a knowledge graph has proven beneficial for some organizations, especially those looking to do things like identify fraudulent behavior or bad actors.
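As a toy illustration of that linking step, the sketch below builds a simple co-occurrence graph from entities assumed to have already been tagged in each document; the document names and entities are invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical output of an earlier tagging step: entities per document.
tagged_docs = {
    "report_1": {"Acme Corp", "J. Smith", "warehouse fire"},
    "report_2": {"J. Smith", "warehouse fire", "Globex Ltd"},
    "report_3": {"Globex Ltd", "Acme Corp"},
}

# Link entities that co-occur in the same document.
graph = defaultdict(set)
for entities in tagged_docs.values():
    for a, b in combinations(sorted(entities), 2):
        graph[a].add(b)
        graph[b].add(a)

# Everything connected to "J. Smith" across all documents:
print(sorted(graph["J. Smith"]))
# ['Acme Corp', 'Globex Ltd', 'warehouse fire']
```

A production knowledge graph would add typed relationships and a graph database, but even this simple structure supports the kind of cross-document queries that help surface bad actors.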
Host: We know that unstructured documents hold enormous value. What are the examples of unstructured data being used to generate innovative value streams for organizations?
Douglas Laney: Yes. Here’s an example: an insurance company realized it was sitting on an archive of adjuster reports. Someone submits a claim, the insurance company investigates it, the investigator writes up a report, and that report is used to process the individual claim.
But what they realized was that they could mine the content of those adjuster reports for indications of fraud, such as the language used, omissions, or inconsistencies. When they deployed this text-mining algorithm against the data, they were able to subrogate, or recover, millions of dollars in previously paid-out fraudulent claims and also bake that model into their claims processing system.
Another example is Lockheed Martin, the manufacturer of fighter jets and other military equipment. They took an idea I gave them to identify leading indicators of project issues, such as scope, budget, personnel, or technology-related issues, by mining the project communications of the personnel on those projects rather than just relying on the old status-reporting method.
They were looking for leading indicators of project issues, and in doing so, they ended up with three times greater foresight into project issues than they once had and are saving hundreds of millions of dollars in cost overruns. I also just learned yesterday, speaking with a consultant in Ukraine, how they’re using facial recognition to identify saboteurs and using maps and satellite imagery to help identify and publicize ever-shifting supply chain and evacuation routes.
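A rough sketch of the kind of text mining the insurance example describes might look like the following; the keyword list, sample reports, and threshold are invented for illustration and are not any insurer’s actual model:

```python
# Illustrative fraud-indicative phrases; a real model would be trained
# on labeled claims, not hand-picked keywords.
FRAUD_TERMS = {"inconsistent", "no witnesses", "cash only", "prior claim"}

def fraud_score(report: str) -> int:
    """Count how many fraud-indicative phrases appear in a report."""
    lowered = report.lower()
    return sum(term in lowered for term in FRAUD_TERMS)

reports = [
    "Vehicle damage consistent with police report. Two witnesses.",
    "Statement inconsistent with photos. No witnesses. Prior claim on file.",
]

# Flag reports that trip two or more indicators for human review.
flagged = [r for r in reports if fraud_score(r) >= 2]
print(len(flagged))  # 1
```

Baking even a simple scorer like this into a claims pipeline is what lets flagged reports be routed for deeper review automatically rather than discovered after payout.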
Host: While we’re on the topic of using this unstructured data, can you tell us some common issues that organizations face when extracting the data from these unstructured sources?
Douglas Laney: Great question! It’s great to be aware of these issues and get in front of them. I mentioned before the issue of multiple languages, and even ambiguities within a single language are difficult. Creating glossaries and synonyms, and identifying sentiment through sentiment analysis, is sometimes as much an art as a science. Then there’s indexing, classifying, and tagging content, and determining what’s relevant or not. With natural language processing, we’re also typically dealing with large volumes of data.
What do we retain or not retain? Retention is important. How do we forget something once we’ve learned it? At what point do we diminish its value over time? It’s also difficult to gauge the quality of unstructured content; it’s much easier to determine the quality of structured content. Then, of course, security, privacy, consent, and masking personally identifiable information are also key technology-related issues.
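To illustrate why sentiment analysis is as much art as science, here is a deliberately naive, hypothetical lexicon-based scorer; the word lists are invented, and real systems are far more sophisticated, yet even this tiny example shows how mixed language defeats simple word counting:

```python
# Tiny illustrative sentiment lexicons (not from any real system).
POSITIVE = {"prompt", "helpful", "resolved"}
NEGATIVE = {"delay", "damaged", "dispute"}

def sentiment(text: str) -> int:
    """Positive-minus-negative word count: >0 positive, <0 negative."""
    words = set(text.lower().replace(".", "").split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

# Mixed language nets out to zero, hiding the customer's frustration.
print(sentiment("Claim resolved after a long delay."))  # 0
```

The second example is exactly the kind of nuance, a resolution overshadowed by a complaint, that keyword counting misses and that makes sentiment work so tricky.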
Host: Can you shed some light on automation as the future? Why is it such a key facet of this journey?
Douglas Laney: Look at some of the challenges I mentioned before. Most forms of unstructured content are too voluminous to manually tag, index, or extract, even using crowdsourcing methods. Using multiple humans to crowdsource this kind of effort tends to result in a high degree of inconsistency as well. Look, for example, at how Facebook tags posts that don’t meet its criteria or standards.
The way they do that tends to be inconsistent, and there’s latency in doing it as well. So, to get usable real-time or near-real-time insights from unstructured content of almost any volume or velocity, you really need to automate it.
Host: What would be your advice? Would you suggest organizations get on board [with automation]? What benefits can they get from automating their extraction processes?
Douglas Laney: I would suggest starting small. Identify and pilot ways to leverage unstructured content. Perhaps run some workshops to identify the potential value streams involved in doing so; this is something I do with clients all the time, running these ideation workshops. We look not only at structured content but also at unstructured content. Then be aware of the challenges I mentioned with unstructured content and make sure you’re prepared to deal with all of them.
Even if you run the pilot manually and it works, that degree of manual effort is probably not going to scale. So, it really should compel you to look at ways to automate.
Host: So, getting back to the first thing that we discussed, [that] data is the currency of the future. Where do you see data extraction in the future?
Douglas Laney: I think we’ve done a great job of building data extraction capabilities for structured data assets. In the future, because of the volume and the potential value embodied in unstructured data, I think we’re going to see more and more organizations make unstructured data extraction, tagging, and classification a core part of their data management capabilities and portfolio of tools.
An Automated Data Extraction Solution for Modern Enterprises
Astera ReportMiner is an enterprise-grade data extraction solution that simplifies and streamlines document processing. Combining the power of automation, parallel processing, and smart data extraction, our code-free platform makes it easy for organizations to instantly transform large volumes of unstructured data into actionable insights.
With Astera ReportMiner, you don’t have to rely on manual data entry. You can automatically load files from a configured location and write the extracted data to your preferred destination. Using our solution, you spend less time extracting data and more time putting it to use. Contact our team to get started with Astera ReportMiner today.
Authors:
- Ammar Ali