A term that has attracted a great deal of curiosity lately, and one that will probably keep growing in importance as the Internet makes information ever more widely available, is Big Data. Although the word has been ringing all around me and my place of work for quite some time, what really triggered my interest are two books that I am currently alternating between: “The Long Tail” by Chris Anderson, a book that describes how endless choice is creating unlimited demand, and “Big Data” by Viktor Mayer-Schönberger and Kenneth Cukier, a book that sets out to describe a concept that would revolutionize the way we live and think.
Wikipedia defines big data as
a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications
But perhaps a more fitting definition is the one described in the book “Big Data”: a large set of data derived from a sample size N where N = all. The reason I find the latter more fitting is that a data set does not have to be big as long as it covers the entire population being studied. For example, the book describes a study of corruption in sumo wrestling in Japan. The study collected data from almost 65,000 matches across 7 years to look for a correlation. The data in this case was not as big as one would imagine. But the fact that it “surveyed” the entire set of matches across those 7 years, rather than limiting itself to a sample of them, made me lean towards calling it “big data”.
Big data changes the fundamental aspect of life by giving it a quantitative dimension
say Viktor and Kenneth in their book. Humans have long tried to quantify aspects of human behavior in order to gain insights for predictive analysis. One of the terms I used in the previous paragraph is of particular relevance here: “survey”. Surveys were perhaps one of the earliest forms of gathering relevant data. One of the major challenges of a survey is that your sample size is now N < all, which means you only have data on the part of the population that actually took your survey. The results are then biased towards the characteristics of that limited population, which does not necessarily represent the whole. As statisticians worked on this problem, they found that results were more accurate if the sample was chosen at random than if the sample size was simply increased. Studies have shown that extrapolating from a survey of a random sample yields more accurate results than extrapolating from a larger sample drawn from one specific section of the population.
This still does not solve a challenge that I'd like to call active polling vs passive polling. In almost all cases, a survey is a specific set of questions answered by a specific group of people; simply put, a survey is active polling. As a way to truly understand human behavior this can be inaccurate, because when answering a question humans tend to stop and think. It is analogous to studying how people behave in a group while a tutor or professor is sitting in that group. The mere awareness that a study is being conducted can skew the behavior. If the same group of people can instead be observed "passively", the information gathered is likely to be closer to the truth. The same can be said of any method of predictive analysis.
Big Data analysis methodologies, in my view, are far more passive in the way they gather data and hence tend to lean more towards being accurate.
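To make the earlier point about random samples versus large but skewed samples a little more concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the population of 100,000 people, the "online"/"offline" split, and the 80% and 40% yes-rates are not from either book. It compares three numbers: the true share of "yes" answers when N = all, the estimate from a small random sample, and the estimate from a much larger sample drawn only from the online group.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population (all numbers invented for illustration):
# about 30% are "online" and say yes 80% of the time,
# the rest are "offline" and say yes 40% of the time.
population = []
for _ in range(100_000):
    online = random.random() < 0.30
    says_yes = random.random() < (0.80 if online else 0.40)
    population.append((online, says_yes))


def share_yes(people):
    """Fraction of people in the list who answered yes."""
    return sum(1 for _, yes in people if yes) / len(people)


# N = all: no sampling, so this is the true answer for this population.
true_share = share_yes(population)

# Small sample, but chosen at random from everybody.
random_sample = random.sample(population, 1_000)

# Much larger sample, but drawn only from the online group.
online_only = [person for person in population if person[0]]
biased_sample = random.sample(online_only, 20_000)

random_error = abs(share_yes(random_sample) - true_share)
biased_error = abs(share_yes(biased_sample) - true_share)

print(f"true share (N = all):      {true_share:.3f}")
print(f"random sample of 1,000:    off by {random_error:.3f}")
print(f"online-only sample 20,000: off by {biased_error:.3f}")
```

In this toy setup the skewed sample is twenty times larger, yet it lands much further from the truth, because collecting more answers from the same slice of the population does nothing to remove the bias. It says nothing about active versus passive polling, which is a separate problem.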
In the coming weeks, as I wander through the world of Big Data, I plan to post more examples and insights into this amazing field, which has been gaining significant relevance in today's world. I plan to talk about one aspect in each post so as to limit yet another challenge of big data: information overload! But that does not entirely solve the problem. I also plan to interact more with my readers to gather more information as I meander through. Feel free to enthral me with your comments.