In many use cases, occasional mistakes are fine. It doesn’t matter if one out of ten movie recommendations was not what the viewer was looking for, so long as most of the time they get to watch something they enjoy. If that were not the case, many people would simply stop watching that channel and move on to another.
By Gary Allemann, MD of Master Data Management
However, the outcome is vastly different when it comes to high-stakes decision-making, where errors can literally kill. Poor data practices, for example, have led to hundreds of artificial intelligence (AI) tools built to help diagnose Covid-19 missing the mark, Google Flu Trends missing the flu peak by 140%, and IBM’s cancer treatment AI suffering reduced accuracy.
Solving these problems at scale requires the application of good old-fashioned data cleansing. Google’s research shows that AI’s effectiveness is no longer limited by the models (algorithms) but by the quality of the data.
Data cascades
Poor-quality data fed into AI or machine learning (ML) models frequently leads to multiple negative downstream events that Google calls data cascades. These cascades are driven by conventional AI/ML practices that undervalue data quality, and they are now rendering high-stakes AI useless. In the words of Google’s researchers: “Data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact.”
High-stakes AI refers to the increasing use of AI and ML in making life-and-death decisions – in areas such as public health, conservation, or justice. Simply adding more data compounds the problem: AI models are better built with smaller, high-quality data sets than with vast data sets of dubious or poor quality.
Data scientists and other advanced analytics specialists need support to ensure that they are supplied with high-quality datasets that accurately reflect the real world, so that they can develop and train models that can safely make high-stakes decisions.
Embrace DataOps
Ironically, these issues are preventable.
DataOps brings together cross-functional data analytics teams, agile methodologies, and modern data management tools. Together these enhance collaboration and data curation, capture and share a sound understanding of available data, and speed time to insight by developing a culture of data excellence. People, supported by tools, remain the key to the delivery of high-quality data for AI.
DataOps helps to ensure that data integrity is managed through the entire data lifecycle – from data creation through to maintaining live data after deployment of a model. It is this capability that ensures that models can be safely moved into production, particularly for high-stakes applications.
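To make the idea of managing data integrity across the lifecycle a little more concrete, here is a minimal sketch of a data quality gate in Python. It is an illustration only, not a description of any particular DataOps tool: the field names, the validity rules, and the 95% thresholds are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical schema for illustration; a real project defines its own fields.
REQUIRED_FIELDS = ("patient_id", "age", "test_result")
ALLOWED_RESULTS = {"positive", "negative"}


@dataclass
class QualityReport:
    total: int
    complete: int
    valid: int

    @property
    def completeness(self) -> float:
        # Share of records with every required field populated.
        return self.complete / self.total if self.total else 0.0

    @property
    def validity(self) -> float:
        # Share of records that are complete and pass the value rules.
        return self.valid / self.total if self.total else 0.0


def is_complete(record: dict) -> bool:
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def is_valid(record: dict) -> bool:
    # Assumed rules: age is a plausible integer, result is a known label.
    if not is_complete(record):
        return False
    age = record["age"]
    return (
        isinstance(age, int)
        and 0 <= age <= 120
        and record["test_result"] in ALLOWED_RESULTS
    )


def assess(records) -> QualityReport:
    records = list(records)
    return QualityReport(
        total=len(records),
        complete=sum(is_complete(r) for r in records),
        valid=sum(is_valid(r) for r in records),
    )


def passes_gate(report: QualityReport,
                min_completeness: float = 0.95,
                min_validity: float = 0.95) -> bool:
    # Thresholds are illustrative assumptions, not recommended values.
    return (report.completeness >= min_completeness
            and report.validity >= min_validity)


if __name__ == "__main__":
    sample = [
        {"patient_id": "p1", "age": 34, "test_result": "positive"},
        {"patient_id": "p2", "age": 250, "test_result": "negative"},  # implausible age
        {"patient_id": "p3", "age": None, "test_result": "negative"},  # missing age
        {"patient_id": "p4", "age": 61, "test_result": "negative"},
    ]
    report = assess(sample)
    print(report.completeness, report.validity, passes_gate(report))
```

The same check could, in principle, be run on a training set before a model is built and again on live data after deployment, so that a drop in completeness or validity is caught before it cascades downstream.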