A solid data management strategy can pay dividends for any organization looking to unlock the potential of data. Still, the road to data-driven decision-making is fraught with challenges and puzzles. […]
Some call data the new oil. Others call it the new gold. Philosophers and economists may argue about the quality of the metaphor, but there is no doubt that organizing and analyzing data is vital for any business looking to make data-driven decisions.
And for that, a solid data management strategy is key. Data management encompasses data governance, DataOps, data warehousing, data engineering, data analytics, data science and more; done right, it can give organizations a competitive edge in any industry.
The good news is that many facets of data management are already well researched and rest on solid principles that have evolved over decades. They may not be easy to use or understand, but thanks to the work of scientists and mathematicians, organizations now have a set of logical frameworks for analyzing data and drawing conclusions. More importantly, we also have statistical models that delineate the boundaries of our analysis with meaningful error rates.
However, for all the benefits that have come from studying data science and the disciplines that underlie it, we are still sometimes puzzled. Companies often run up against the limits of the field. Some of the paradoxes relate to the practical challenges of collecting and organizing so much data. Others are philosophical in nature and test our ability to reason about abstract properties. And then there's the growing concern about privacy, given how much data is being collected.
Below are some of the many mysteries that make data management such a challenge for many organizations.
Unstructured data is difficult to analyze
Much of the data stored in a company's archives has no structure at all. A friend of mine wanted to use AI to search the text notes written by his bank's call center agents. These datasets could contain insights that might help improve the bank's lending and services. But the notes were written by hundreds of different people, each with their own ideas about what to record about a particular call.
Employees also have different writing styles and skills. Some wrote almost nothing at all; others recorded too much information about each call. Text on its own doesn't have much structure, and when you have a stack of notes written by hundreds or thousands of employees over dozens of years, what little structure exists becomes even weaker.
Even structured data often remains unstructured
Good data scientists and database administrators bring order to databases by defining the type and structure of each field. Sometimes, in the name of even better structure, they limit the values in a particular field to integers within certain ranges or to predefined choices.
Even then, the people filling out the forms that feed the database find ways to introduce errors and irregularities. Some fields are left blank. Some users enter a hyphen or the letters "na" when they think a question doesn't apply. Some people even write their names differently from year to year, from day to day, or even from line to line on the same form.
Good developers can catch some of these issues with validation, and good data scientists can reduce some of the uncertainty through cleanup. But it's still frustrating that even the most well-structured tables contain questionable entries – and that those entries can introduce unknowns and even errors into the analysis.
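To make this concrete, here is a minimal sketch of the kind of normalization a developer or data scientist might apply to a single free-form field. The missing-value markers and the name-normalization rule are illustrative assumptions, not a standard recipe.

```python
import re

# Hypothetical cleanup helpers for one free-form field, illustrating the
# kind of validation and cleanup described above.
MISSING_MARKERS = {"", "-", "na", "n/a", "none"}

def normalize_field(raw: str) -> str | None:
    """Return a cleaned value, or None when the entry means 'not applicable'."""
    value = raw.strip()
    if value.lower() in MISSING_MARKERS:
        return None                       # treat blanks, hyphens and "na" alike
    return re.sub(r"\s+", " ", value)     # collapse inconsistent whitespace

def normalize_name(raw: str) -> str | None:
    """Reduce spelling variants like 'SMITH, J.' and 'J Smith' to one form."""
    value = normalize_field(raw)
    if value is None:
        return None
    parts = re.split(r"[,\s.]+", value)
    return " ".join(sorted(p.title() for p in parts if p))
```

Even this toy version hints at why the effort never quite ends: every new variant someone types in calls for yet another rule.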
Data schemas are either too strict or too loose
No matter how hard data teams try to formulate schema constraints, the resulting schemas for defining the values in the various data fields end up either too strict or too loose. When the data team adds strict constraints, users complain that their answers don't fit the narrow list of acceptable values. If the schema is too accommodating, users add strange values with little consistency. It's almost impossible to get the schema just right.
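The trade-off is easy to see in miniature. The sketch below uses a made-up "contact_method" field to show how a strict rule rejects reasonable answers while a loose one accepts anything.

```python
# Toy illustration of the strict-versus-loose trade-off for a single
# hypothetical "contact_method" field.
STRICT_CHOICES = {"email", "phone", "mail"}

def validate_strict(value: str) -> str:
    # Strict schema: anything outside the list is rejected, including a
    # perfectly reasonable answer like "text message".
    if value not in STRICT_CHOICES:
        raise ValueError(f"{value!r} is not an allowed contact method")
    return value

def validate_loose(value: str) -> str:
    # Loose schema: everything is accepted, so "EMAIL", "e-mail" and
    # "carrier pigeon" all land in the same column with little consistency.
    return value.strip()
```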
Data protection laws are extremely strict
Privacy and data protection laws are strict and getting stricter. Regulations like GDPR, HIPAA, and a dozen others can make collecting data very difficult, but leaving it lying around waiting for a hacker to break in is even more dangerous. In many cases, it's easier to spend more money on lawyers than on programmers or data scientists. This headache is why some companies simply discard their data as soon as they're allowed to.
The cost of data cleansing is enormous
Many data scientists will attest that 90% of the work is collecting the data, putting it into a consistent form and dealing with the endless holes and errors. The person handing over the data will always say, "It's all in a CSV file and ready to go." But they don't mention the blanks or the erroneous entries. It's easy to spend ten times as much time cleaning data for a data science project as running the R or Python routines that actually perform the statistical analysis.
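As a rough illustration, the pandas snippet below shows the unglamorous first pass over such a "ready to go" CSV file; the file name and the missing-value markers are assumptions made for the sake of the example.

```python
import pandas as pd

# First pass over a supposedly ready-to-go CSV: map the usual placeholders
# ("", "-", "na") to proper missing values before any statistics are run.
df = pd.read_csv(
    "calls.csv",                          # placeholder file name
    na_values=["", "-", "na", "N/A"],
    skipinitialspace=True,
)

df = df.drop_duplicates()                 # remove exact duplicate rows

# Report which columns have the most holes -- usually a sobering list.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share.head(10))

# Only after this kind of cleanup does the "actual" analysis begin.
print(df.describe(include="all"))
```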
Users are increasingly suspicious of how their data is handled
End users and customers are becoming increasingly suspicious of companies' data management practices, and some AI algorithms and the ways they are used only add to the fear. Many people are deeply unsettled about what happens to the data that captures their every move. These fears drive legislation and often land companies, and even well-meaning data scientists, in public trouble.
And not only that: some users intentionally load the data collection with wrong values or false answers. Sometimes half the job is dealing with malicious partners and customers.
The integration of external data can be an advantage – and a fiasco
It's one thing when a company takes ownership of the data it collects itself; the IT department and the data scientists remain in control of it. But more and more aggressive companies are figuring out how to integrate their own information with third-party data and the vast seas of personalized information on the Internet.
Regulators are cracking down on data usage
In a recent example from Canada, the government investigated how some donut shops were tracking customers who also bought from the competition. A recent press release stated:
"The investigation found that Tim Hortons' contract with a US-based third-party location service provider contained language so vague and revealing that it would have allowed the company to sell 'de-identified' location data for its own purposes." And what for? To sell more donuts? Regulatory authorities are becoming ever more vigilant when it comes to personal data.
Your clever data scheme might not be worth it
We imagine that an ingenious algorithm could make everything more efficient and profitable. Sometimes such an algorithm is actually possible, but the price can be too high. As a result, consumers and businesses alike are increasingly questioning the value of targeted marketing built on sophisticated data management systems. Some point out that we often see ads for something we've already bought, because the ad trackers haven't figured out that we're no longer interested.
The same fate often awaits other clever systems. Sometimes rigorous data analysis identifies the worst-performing factory, but that doesn't matter because the company has a 30-year lease on the building. Businesses need to be prepared for the possibility that their ingenious data science insights lead to an answer that is unacceptable.
At the end of the day, data-related decisions are often a matter of opinion
Numbers can be very precise, but what they mean often depends on how people interpret them. After all the data analysis and AI magic, most algorithms still require someone to decide whether a value is above or below a threshold. Sometimes scientists want a p-value less than 0.05. Maybe a police officer wants to ticket cars going 20% faster than the speed limit.
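Both examples boil down to the same final step: a human-chosen cut-off applied to a precise number. A minimal sketch, with made-up limits:

```python
# The thresholds below are policy choices, not mathematical truths.
SPEED_LIMIT_KMH = 100
TICKET_FACTOR = 1.20        # "20% faster than the speed limit"
P_VALUE_CUTOFF = 0.05       # the conventional significance level

def should_ticket(speed_kmh: float) -> bool:
    return speed_kmh > SPEED_LIMIT_KMH * TICKET_FACTOR

def is_significant(p_value: float) -> bool:
    return p_value < P_VALUE_CUTOFF

print(should_ticket(118.0))   # False: just under the 120 km/h cut-off
print(is_significant(0.049))  # True: barely under 0.05
```

Change the factor or the cut-off and the same data yields a different decision, which is why the last step remains a matter of opinion.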
The cost of storing data is constantly increasing
Yes, hard drives keep getting bigger and the price per terabyte keeps falling, but our programmers are collecting bits faster than prices can drop. Internet of Things (IoT) devices are uploading more and more data, and users expect to rummage through a rich collection of those bytes forever. Meanwhile, compliance officers and regulators are demanding more and more data for future audits.
It would be one thing if someone actually looked at some of those bits, but we have limited time. The percentage of stored data that is ever accessed again continues to shrink, while the bill for storing the ever-growing volume continues to rise.