Trying to deliver on AI without a data strategy that addresses data quality challenges is like putting the cart before the horse, writes Paul Morgan, head: data, analytics and AI at Altron Karabina.
ChatGPT exploded onto the global landscape at the end of 2022, and now Artificial Intelligence (AI) is the hot topic on everyone’s lips. Discussions abound on how well AI can write a blog post, code a website or drive a vehicle, and whether it should.
According to Statista, the global AI market is valued at $142.3 billion in 2023, while McKinsey found that the adoption of AI-driven solutions had doubled since 2017. All impressive figures – but what has become apparent to adopters of AI is that without a reliable mechanism to collate, clean and pre-process the data to power the AI engines, it is extremely unlikely that the expected benefits of AI will actually arrive.
Judson Althoff, executive vice-president and chief commercial officer at Microsoft, puts it this way: “As leaders look to embrace AI, it becomes more critical than ever to prioritize having a data-driven business, fortified with digital and cloud capabilities. This approach will help organizations leverage generative AI as an accelerant to transformation.”
AI algorithms are trained on large sets of unstructured and structured data in the hope of providing insights for decision-making processes – but the trainers need to be sure that this source material is accurate, unbiased and appropriate. ChatGPT is a good example of this – the OpenAI Large Language Model was trained on an extraordinarily large body of text: 570GB of documents from Reddit, Wikipedia, Common Crawl, GitHub and other sources. As a result, it can consolidate query responses into summaries of information that are highly pertinent to the user.
However, to ensure that the source texts were suitable for business consumption, human labellers were used to clean and curate the source data sets, red-flagging documents that contained misogynist, abusive, racist or other unacceptable content, as well as disinformation.
In contrast, Microsoft didn’t follow this approach in 2016 when it released its Twitter-trained chatbot, Tay, and had to take the bot offline after 16 hours of embarrassing and offensive tweets.
Clean data is necessary in both AI and traditional analytics, and it is generally accepted that at least six dimensions of data quality jointly define the “clean” label: accuracy, completeness, consistency, validity, integrity and uniqueness. While data quality tools have been available for decades, achieving high levels of data quality still involves a considerable amount of work, both human and automated. There are now also AI tools available that can help improve the quality of your data.
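To make a few of these dimensions concrete, here is a minimal sketch in Python that scores completeness, uniqueness and validity on a handful of records. The records, the field names and the email pattern are all assumptions made up for illustration; a real data quality tool would cover all six dimensions and far richer rules.

```python
import re

# Hypothetical supplier records; the field names are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "payment_terms_days": 30},
    {"id": 2, "email": "not-an-email", "payment_terms_days": None},
    {"id": 2, "email": "b@example.com", "payment_terms_days": 60},
    {"id": 4, "email": "", "payment_terms_days": 45},
]

# Deliberately loose pattern (an assumption): something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, field):
    """Share of distinct values for the field across all rows."""
    if not rows:
        return 0.0
    values = [r.get(field) for r in rows]
    return len(set(values)) / len(values)


def validity(rows, field, pattern):
    """Share of non-empty values that match the expected pattern."""
    values = [r[field] for r in rows if r.get(field)]
    if not values:
        return 0.0
    return sum(1 for v in values if pattern.match(v)) / len(values)


report = {
    "completeness(payment_terms_days)": completeness(records, "payment_terms_days"),
    "uniqueness(id)": uniqueness(records, "id"),
    "validity(email)": validity(records, "email", EMAIL_RE),
}

for check, score in report.items():
    print(f"{check}: {score:.2f}")
```

Scores like these are only a starting point: they show where to look, while deciding what counts as acceptable for each field remains a business call.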
Clean data is also essential for traditional data science use cases. It’s very difficult to identify customer segments when you don’t know if the 50 Paul Morgans in your mailing list are the same person, five different people or 40 different people – and a forecast of potential revenue improvements from one customer will look very different from a forecast for 40. Forecasting cash flow is equally meaningless if contracted supplier payment terms are left blank on your financial system or ERP supplier records.
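The mailing-list problem can be illustrated with a small Python sketch. It assumes, purely for illustration, that two entries with the same normalised email address are the same person; the sample entries and field names are made up, and real entity resolution typically needs fuzzier matching across several attributes.

```python
from collections import defaultdict

# Hypothetical mailing-list entries; the data is invented for illustration.
mailing_list = [
    {"name": "Paul Morgan", "email": "Paul.Morgan@example.com"},
    {"name": "paul  morgan", "email": "paul.morgan@example.com "},
    {"name": "Paul Morgan", "email": "pmorgan@example.org"},
    {"name": "P. Morgan", "email": "pmorgan@example.org"},
]


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def group_by_email(rows):
    """Group entries that share a normalised email address.

    Treating a shared email as "same person" is an assumption of this
    sketch, not a general rule.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[normalise(row["email"])].append(row)
    return dict(groups)


groups = group_by_email(mailing_list)
distinct_contacts = len(groups)
rows_named_paul_morgan = sum(
    1 for row in mailing_list if normalise(row["name"]) == "paul morgan"
)

print(f"{len(mailing_list)} rows, {distinct_contacts} distinct contacts")
print(f"{rows_named_paul_morgan} rows carry the name 'paul morgan'")
```

Even in this tiny sample, counting rows, counting names and counting email addresses give three different answers, which is exactly why segment sizes and revenue forecasts built on uncleaned lists are unreliable.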
Data quality and stewardship have long been important success factors for delivering accurate historical analysis, but they become even more important for predicting the future accurately and identifying possible insights. It is very easy to go off on a completely incorrect tangent if an AI-generated insight or prediction is based on incorrect data.
Model bias is another area where care must be taken over the data used for algorithms. There are many real-life cases where AI output has discriminated against people based on race, gender, religion and other demographics. This isn’t easy to overcome if you have inadequate training data for certain groups, but if you can’t remove the bias from the training data, then you need to ensure you test the outputs for discernible bias at the end of the process.
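One simple way to test outputs for discernible bias is to compare how often each group receives a favourable decision. The Python sketch below does this on made-up model outputs; the group labels, the data and the choice of metric are assumptions for illustration, and a low ratio is a prompt for investigation rather than proof of discrimination.

```python
from collections import defaultdict

# Hypothetical model outputs: (group label, 1 = favourable decision).
outcomes = [
    ("A", 1), ("A", 1), ("A", 0), ("A", 1),
    ("B", 1), ("B", 0), ("B", 0), ("B", 0),
]


def selection_rates(rows):
    """Share of favourable decisions per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, decision in rows:
        totals[group] += 1
        positives[group] += decision
    return {group: positives[group] / totals[group] for group in totals}


def disparate_impact_ratio(rates):
    """Lowest group selection rate divided by the highest."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group received a favourable decision
    return min(rates.values()) / highest


rates = selection_rates(outcomes)
ratio = disparate_impact_ratio(rates)

for group, rate in sorted(rates.items()):
    print(f"group {group}: selection rate {rate:.2f}")
print(f"disparate impact ratio: {ratio:.2f}")
```

A commonly cited rule of thumb treats a ratio below 0.8 as a signal worth investigating; the right metric and threshold depend on the use case and the applicable regulation.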
In the rush to use AI for real business value, let’s not forget that both humans and AI models need good samples of historical or training data to learn from. At Altron Karabina we have been solving customer data challenges for over two decades. Feel free to contact our data team if you want to discuss your data quality and data cleanliness challenges, your AI data needs – or any other data issues.