26/06/2018
"Big" was not always the adjective that would go with "Data" to describe its complexity and size. Towards the end of last decade, along with the popularity of Hadoop ecosystem, the term Big Data became mainstream. People started using it to classify data which was complex enough to require a cluster (like a Hadoop ecosystem) and the traditional architectures were no more enough to process and store it. Soon the term Big Data became one of the most popular buzzwords in IT. The recent spike in the use of Artificial Intelligence and Machine Learning has given Big Data even more spotlight as they depend on the data available.
Interestingly, the exact definition of Big Data remains a bit elusive. The consensus is that the 3Vs of data (Volume, Velocity, and Variety) define Big Data: whenever the storage and processing requirements of data become demanding due to these Vs, we can safely classify it as "Big Data." Looking at the timeline, the rise in interest in Big Data coincides with a phenomenal drop in hardware prices. With the ever-decreasing price of commodity hardware combined with the popularity of the Hadoop ecosystem, companies, people, and governments have started to collect even the kinds of data they never previously thought worth storing. In the current IoT era, the mesh of sensors around us makes it possible to track and record every move, and data is being collected at a scale no one could have imagined before. For example, governments are collecting as much data as possible in the hope that it will enable them to govern better. Many IT companies are collecting the logs generated by their hardware infrastructure and software systems in the hope that this will eventually help them build more reliable systems. And social media platforms are trying to record any information they can get from their users. No doubt we are on a data collection spree of a scale never seen before. Surprisingly, in spite of all this, a major hindrance in applying Machine Learning or Artificial Intelligence to many important problems is the insufficient availability of relevant data. Even when the data is available, its quality is often not good enough to provide conclusive evidence about the phenomenon of interest.
Currently, the data collection process is seldom tied to the eventual purpose for which the data is being collected. Usually there is a disconnect between the data collection process and the data consumption system about the requirements, a disconnect that is not always unavoidable. We need smarter data collection processes that can focus on the data relevant to the data consumption system. Collecting as much data as possible does not guarantee that sufficient evidence of what is actually important is being collected. In many cases, it is the rare, occasionally occurring patterns that carry the important information, and many current data collection processes are unable to adjust the granularity of their pre-defined batch schedules when those comparatively rare data points are generated. As a simple example, the heart-rate sensor in my smartwatch may keep recording my heart rate only once every 10 minutes even when the rate becomes unusually high in the middle of the night. While such a reading may still be within the acceptable range, the data collection process should have focused on that anomaly and created many more data points around it. IT companies using server logs to improve reliability constantly face the question of when to enable detailed logging: once enabled, it generates so much data that analyzing it becomes a challenge of its own, and current log generation processes mostly lack the ability to automatically adjust the logging level according to the state of the system. Through such a smart data collection process, Smart Data, not necessarily Big Data, can be collected and stored to get meaningful insights.
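To make this concrete, here is a minimal sketch in Python of what such an adaptive collection loop might look like. The read_heart_rate stub, the intervals, and the deviation threshold are all illustrative assumptions rather than a real device API; the point is only that the sampling interval tightens as soon as a reading drifts away from the recent baseline.

    import time
    import random
    from statistics import mean

    def read_heart_rate():
        # Hypothetical sensor read; a real device API would replace this stub.
        return random.gauss(65, 5)

    NORMAL_INTERVAL_S = 600   # one reading every 10 minutes
    FOCUSED_INTERVAL_S = 10   # dense sampling while an anomaly is suspected
    WINDOW = 30               # number of recent readings kept as the baseline

    readings = []
    while True:
        bpm = read_heart_rate()
        readings = (readings + [bpm])[-WINDOW:]
        baseline = mean(readings)
        # If the latest reading drifts well away from the recent baseline,
        # tighten the sampling interval to capture the anomaly in detail.
        if abs(bpm - baseline) > 15:      # illustrative threshold
            time.sleep(FOCUSED_INTERVAL_S)
        else:
            time.sleep(NORMAL_INTERVAL_S)

The same pattern applies to server logs: a service could switch its logger from WARNING to DEBUG when its error rate crosses a threshold, and drop back once the system stabilizes, so that the cost of detailed logging is only paid when the logs carry information.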
Data consumption requirements may not always be known at the data collection phase. But wherever they are, with the advent of powerful real-time data processing engines like Apache Spark, new design patterns are emerging in which live data aggregation and real-time dashboards are used to find the right focus and granularity for the data collection process. The underlying process or phenomenon being recorded can be monitored through a real-time processing engine, and as soon as potentially important trends appear, the focus of the data collection process can be sharpened, for example by triggering more sensors or increasing the granularity of the batch process.
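As a sketch of this emerging pattern, the following PySpark Structured Streaming snippet aggregates a live stream of sensor readings into one-minute windows and flags windows whose peak deviates strongly from the window average. The socket source on localhost:9999, the JSON schema, the 1.5x threshold, and the focus_collection idea are all assumptions for illustration; in a real system the reaction step would call back into the collection layer to request finer granularity.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, TimestampType)

    spark = SparkSession.builder.appName("smart-collection-monitor").getOrCreate()

    schema = StructType([
        StructField("sensor_id", StringType()),
        StructField("value", DoubleType()),
        StructField("ts", TimestampType()),
    ])

    # Readings arrive as JSON lines on a socket (an assumption for this sketch).
    raw = (spark.readStream
           .format("socket")
           .option("host", "localhost")
           .option("port", 9999)
           .load())
    events = raw.select(F.from_json(F.col("value"), schema).alias("e")).select("e.*")

    # Live aggregation: one-minute windows per sensor.
    stats = (events
             .withWatermark("ts", "2 minutes")
             .groupBy(F.window("ts", "1 minute"), "sensor_id")
             .agg(F.avg("value").alias("avg_value"),
                  F.max("value").alias("max_value")))

    def react(batch_df, batch_id):
        # Flag windows whose peak is far above the window average; a real
        # system would notify the collection layer here (the hypothetical
        # focus_collection(sensor_id)) to trigger denser sampling.
        anomalies = batch_df.where(F.col("max_value") > F.col("avg_value") * 1.5)
        for row in anomalies.collect():
            print("Focusing collection on sensor", row.sensor_id)

    query = stats.writeStream.outputMode("update").foreachBatch(react).start()
    query.awaitTermination()

The same feedback loop could equally drive a real-time dashboard, leaving a human operator to decide when to widen or narrow the collection focus.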
The truth is that in the era of IoT devices we are collecting and storing large amounts of data that we will never use. We are clogging our storage and data processing pipelines with useless information; yet, when the time comes, the information about the trends we are really interested in is missing. Wherever possible, an intelligent data collection process that self-tunes according to the data consumption requirements will produce Smart Data: data which may or may not be classified as Big Data, but which contains enough evidence of the patterns relevant to the process of interest to answer business questions, through descriptive analytics or machine learning algorithms, with much better results. As data engineers and data scientists, our emphasis should be on using existing technologies to make data collection systems smarter wherever possible.
Author: Imran Ahmad
Related Training:
Big Data