Many companies are certainly seeing the advantages of capturing, analysing and using new sources of data in their business.
By Barry de Waal, chief executive of strategy and sales at 9TH BIT Consulting
But the concept of ‘big data’ is often discussed without properly addressing some of the biggest stumbling blocks to generating value from that data.
The concept of data drift should demand greater attention from CIOs and their teams, as new sources of data (such as sensor logs, user interactions and web clickstreams) are continually evolving and advancing.
And as we bring in entirely new data sources – such as data from IoT devices and other smart objects – the realm of big data only becomes more complicated.
So, just what is data drift?
At a high level, it’s essentially the mutation of data’s core characteristics, caused by the operation, modernisation or maintenance of the systems that produce that data. Though we’re referring to digital concepts, parallels can be drawn with the organic world: as the structure and behaviour of data change, it starts to look a little like genetic mutations occurring within animal and plant species.
As data morphs in form, it changes the way that it interacts with downstream applications, causing a ripple effect throughout the organisation’s analytics stack. Managing big data flows can become an extremely difficult endeavour.
The repercussions of poor data fidelity
At a more technical level, data drift can be broken into three categories (the first two are illustrated in the short sketch after this list):
* Schematic drift: a change to the structure of the data at the source – such as the addition, deletion, reordering or type change of data fields.
* Semantic drift: the structure of the data may stay the same, but its meaning changes (such as a move from imperial to metric measurements, or from IPv4 addresses to IPv6).
* Infrastructure drift: a change to the data processing software and capabilities – causing incompatibilities throughout the analytics stack.
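To make the first two categories concrete, here is a minimal, hypothetical sketch in Python; the field names, record shapes and unit change are invented purely for illustration.

```python
# Original upstream record the pipeline was built against.
record_v1 = {"device_id": "A17", "temp": 71.6, "ts": "2017-03-01T08:00:00Z"}

# Schematic drift: a field is renamed and a new one appears, so code that
# expects record_v1's exact shape breaks with a KeyError.
record_v2 = {"sensor_id": "A17", "temp": 22.0, "firmware": "2.4",
             "ts": "2017-03-01T08:00:00Z"}

# Semantic drift: the structure is identical to record_v1, but "temp" is now
# reported in Celsius instead of Fahrenheit -- nothing crashes, yet every
# downstream calculation that assumes Fahrenheit is silently wrong.
record_v3 = {"device_id": "A17", "temp": 22.0, "ts": "2017-03-01T08:00:00Z"}

def fahrenheit_report(record):
    # Downstream logic written for the original schema and meaning.
    return f"{record['device_id']}: {record['temp']} degF"

print(fahrenheit_report(record_v1))   # works as intended
print(fahrenheit_report(record_v3))   # runs, but the value is misinterpreted
# fahrenheit_report(record_v2) would raise KeyError: 'device_id'
```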
The result is ultimately that the business starts losing confidence in its data, or begins to make misguided decisions based on unreliable data. It can also cause huge amounts of extra work and wasted energy, as teams scramble to resolve data integrity issues.
In the worst cases, data drift can even cause problems with financial and regulatory reporting, with potentially disastrous consequences in terms of both financial loss and company reputation.
Thinking differently
A recent whitepaper from StreamSets, a global leader in data performance management and a 9TH BIT Consulting partner, guides organisations on how to prevent data drift from occurring.
Starting from the first step of ingesting data, it’s important for your systems to specify ingest pipelines based on the intent of the data, rather than rigidly fixating on the schema of the data itself. This sets the tone for increased flexibility and resilience in your data operations.
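As a rough illustration of intent-driven ingestion (this is not StreamSets’ actual API; the field names, aliases and JSON shape below are hypothetical), an ingest step might pull only the fields the downstream use case needs and tolerate extra, missing or reordered fields:

```python
import json

# The downstream use case only needs a device identifier and a temperature
# reading, so the ingest step extracts those by intent and tolerates extra,
# missing or reordered fields rather than validating the full schema.
REQUIRED = {"device_id": ("device_id", "sensor_id"),   # accepted aliases
            "temp": ("temp", "temperature")}

def ingest(raw_line):
    record = json.loads(raw_line)
    out = {}
    for target, aliases in REQUIRED.items():
        value = next((record[a] for a in aliases if a in record), None)
        out[target] = value          # None flags a gap instead of crashing
    return out

print(ingest('{"sensor_id": "A17", "temperature": 22.0, "firmware": "2.4"}'))
# {'device_id': 'A17', 'temp': 22.0}
```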
Once the data is ‘in-stream’, organisations should specify the relevant processing steps to cater for drift, helping to sanitise the data so that it is ready for consumption by the next actors in the value chain.
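A minimal sketch of such an in-stream sanitisation step, again with hypothetical field names and rules chosen purely for illustration, might look like this:

```python
def sanitise(record):
    """In-stream cleaning step applied before the record reaches consumers."""
    clean = dict(record)
    # Coerce the reading to a float; malformed values become None so a later
    # step can decide whether to drop, default or escalate the record.
    try:
        clean["temp"] = float(clean["temp"])
    except (TypeError, ValueError):
        clean["temp"] = None
    # Normalise identifiers so joins downstream behave consistently.
    if clean.get("device_id"):
        clean["device_id"] = str(clean["device_id"]).strip().upper()
    return clean

print(sanitise({"device_id": " a17 ", "temp": "22.0"}))
# {'device_id': 'A17', 'temp': 22.0}
```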
At the same time, organisations should continually monitor the data flowing through their pipelines, both to understand the ever-changing nature of the data and to set new data drift conditions that route affected traffic to exception-processing systems.
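The sketch below shows the general idea of diverting suspect records for separate handling; the drift condition and the in-memory exception queue are assumptions made for this example, not any particular product’s routing mechanism:

```python
from collections import deque

# Hypothetical drift condition: a record whose fields differ from the set the
# pipeline expects, or whose reading is missing, is routed to an exception
# queue instead of flowing onward to consumers.
EXPECTED_FIELDS = {"device_id", "temp"}
exception_queue = deque()

def route(record):
    if set(record) != EXPECTED_FIELDS or record.get("temp") is None:
        exception_queue.append(record)     # held for inspection / reprocessing
        return None
    return record                          # continues down the main pipeline

route({"device_id": "A17", "temp": 22.0})              # passes through
route({"device_id": "A17", "temp": None})               # routed to exceptions
route({"device_id": "A17", "temp": 22.0, "fw": "2.4"})  # unexpected field
print(len(exception_queue))  # 2
```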
As data is monitored and cleansed, the problem of semantic drift is addressed by tools that analyse both the data and the metadata of the data stream, with any anomalies raised for either human inspection or automated exception processing.
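One simple way to picture such a check is to compare a batch’s statistical profile against a profile captured when the pipeline was known to be healthy; the baseline value and threshold below are assumptions invented for this example:

```python
from statistics import mean

# Baseline profile captured when the pipeline was known to be healthy.
BASELINE_MEAN_TEMP = 70.0      # historical average, assumed Fahrenheit
DRIFT_THRESHOLD = 0.5          # flag if the batch mean shifts by more than 50%

def check_semantic_drift(batch):
    observed = mean(r["temp"] for r in batch if r.get("temp") is not None)
    shift = abs(observed - BASELINE_MEAN_TEMP) / BASELINE_MEAN_TEMP
    if shift > DRIFT_THRESHOLD:
        # Raised for human inspection or automated exception processing.
        return f"possible semantic drift: batch mean {observed:.1f} vs baseline {BASELINE_MEAN_TEMP}"
    return "ok"

# A batch that quietly switched from Fahrenheit to Celsius trips the check.
print(check_semantic_drift([{"temp": 21.5}, {"temp": 22.0}, {"temp": 23.1}]))
```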
When it comes to big data, we’re heading into a new era, transitioning away from static, traditional data that’s stored in very stable enterprise applications, towards a chaotic world where data is ever-changing.
It’s only by thinking differently about how they ingest, channel, monitor and analyse that data that organisations can truly unlock the myriad benefits of big data – and derive the insights that power greater business success.