“We continue to see a ‘more is better’ attitude inside many organizations, collecting data for data’s sake, without fully considering the risks and whether we really need it”, says Charles S. Golvin, Senior Director Analyst at Gartner.
The main risks are not only reaching wrong conclusions and missing the intended goals, but also driving up costs by continuing to harvest, store and needlessly process “digital” garbage.
Every day, companies spend considerable amounts of money collecting customer information and processing and mining data with the sole goal of optimizing their business strategies. Although data-driven business decisions have increased dramatically in the past few years, identifying the right data remains one of the most critical challenges. An interesting, if slightly counterintuitive, suggestion was proposed by Prof. Eric Bradlow: we should focus on “better data, not big data”.
“The more data, the better.” That is the view of Kenneth Cukier, data editor at The Economist. Cukier claims that big data and machine learning are our hope for the future, since they will let us solve problems and make educated decisions to tackle the biggest challenges of our lives: “You have more information. You can do things that you couldn’t do before.”
His statement that “the more data computers collect, the smarter and smarter they get” is indeed supported by technological progress in data storage, fast data streaming and more efficient digital architectures. In his famous TED Talk, Cukier shares his vision of a perfect digital world where only those things that make sense are connected and related with one another; more importantly, even if they do not make sense right now, they certainly will in the near future. A similarly optimistic vision, albeit somewhat terrifying to some, has been presented on several occasions by Google CEO Sundar Pichai: not only “the more, the better”, but also the smarter computers will get, and to achieve this we need to go “quantum” with our computers. The aforementioned “the more, the better” paradigm relies on the implicit assumption that ML models will improve alongside hardware infrastructure, becoming able to spot hidden but real trends and correlations in all the data while automatically discarding the false ones: harvesting all the data to support our lives for good.
Unfortunately, as explained by Dr. V. Flovik, “correlation does not imply causation”: the natural human quest for patterns and explanations may lead us to spot apparent correlations, also known as “spurious correlations”, which occur when events are associated but not causally related, typically due to a third, unseen lurking variable. To ensure that lurking variables are also captured in our data lake, the “more is better” strategy therefore still applies.
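As a minimal sketch of how a lurking variable manufactures a spurious correlation, consider the following simulation (a hypothetical example with made-up variable names, not taken from the source): two quantities that never influence each other end up strongly correlated simply because both depend on a third, unobserved one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Lurking variable: daily temperature (assume it is NOT in our data lake).
temperature = rng.normal(loc=20, scale=5, size=1000)

# Two variables that both depend on temperature but not on each other.
ice_cream_sales = 3.0 * temperature + rng.normal(scale=4, size=1000)
sunburn_cases = 1.5 * temperature + rng.normal(scale=4, size=1000)

# A strong correlation appears, although neither variable causes the other.
r = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]
print(f"correlation(ice cream, sunburn) = {r:.2f}")  # typically ~0.9

# Conditioning on the lurking variable makes the association vanish:
# correlate the residuals after removing the temperature effect from each.
res_ice = ice_cream_sales - np.polyval(
    np.polyfit(temperature, ice_cream_sales, 1), temperature)
res_sun = sunburn_cases - np.polyval(
    np.polyfit(temperature, sunburn_cases, 1), temperature)
r_partial = np.corrcoef(res_ice, res_sun)[0, 1]
print(f"partial correlation given temperature = {r_partial:.2f}")  # ~0.0
```

Once the lurking variable (here, temperature) is recorded and conditioned on, the apparent association collapses, which is exactly why capturing such variables in the data lake matters.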
As written by Dr. M. Mizrahi, trying to predict the future is like “playing God”, and one should “think carefully about all the ways in which the use of new technologies could go seriously wrong”. Setting aside any moral and ethical evaluation, the main unresolved question is when decisions based on data analysis start influencing community members, actively modifying the future. Such indirect influence may induce customers to take decisions and risks they would never have taken otherwise. While in the past the lag between a decision and the action it triggered was much longer, nowadays any action is just one click away.
One way to improve the quality of data lakes is to share information, models and analytics with customers, building trust and transparency. Since lurking variables are unknown to both parties, only a two-way exchange of information can shed some light on them. At the end of the day, “sharing is caring”, especially when it goes both ways.