Ngrams and Google


When talking about Big Data, one initiative is worth a mention – Google Ngrams. Google, in its own magnanimous way, started a program back in 2004 to digitize every single printed document, within copyright limits. It began as a partnership with some of the well-known libraries around the globe, such as the New York Public Library, the Harvard University Library and the Bodleian Library at the University of Oxford. The plan was to make high-resolution digital images of all printed documents – books, magazines et al. – and save them in a huge, searchable repository.

As the collection grew, Google realized the potential to digitize it one word at a time. Through a tool known as reCAPTCHA, they started to extract every word from every single image that was scanned. What was born out of it was an amazingly large data set of words dating back to 1500. By 2012, they had almost 15% of all printed books digitized, amounting to almost 700 billion words! What came out of this was Google Ngrams!

An “ngram” is a sequence of letters of any length – it could be a word, a misspelling, a phrase or plain gibberish.
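To make the definition concrete, here is a minimal Python sketch of how n-grams are extracted from a sequence of tokens. The function name and the toy sentence are my own, purely for illustration:

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens from the sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Word-level bigrams (2-grams) from a tiny "corpus"
words = "to be or not to be".split()
print(ngrams(words, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```

The same function works at the character level (`ngrams(list("word"), 3)`), which is closer to how “a misspelling or gibberish” can still be a valid ngram.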

Google Ngrams is a searchable word repository that graphs the occurrence of a word or a phrase in a “corpus of books” (as Google themselves put it). It plots those occurrences across time, and the result is a visualization of how frequently the words were used over the years.

Curious as I was, I decided to try out a few of today’s jargon terms to see how far back they were used. The results were alarming!



The word “technology” (keep in mind the search is case sensitive) was used as far back as the early 1500s, which is fair enough considering it is a well-defined term in the English dictionary. What was even more puzzling is that the word “Internet” appears in the 1590s! What could that have been referring to? Also, although the whole slew of ARPANET and packet switching started to evolve in the 1960s, it wasn’t until the 1990s that the word “Internet” came into wide use in printed form!


Not only SQL but also ….


Keeping with the theme of Big Data, as we spoke a couple of days back, the concept of N=all suddenly started to give rise to a whole slew of new challenges – an obvious consequence of dealing with such large chunks of data: storage and retrieval! The ability to quickly retrieve, analyze and correlate data to derive information becomes essential when dealing with big data. And for such massive amounts of data, relational databases do not seem to fit all that well. One of the major reasons for this is that relational (I may now safely call them “traditional”) databases require a structure to the data they store. When you try to correlate users’ location data vs. local deals (as an example) and add in users’ personal credit card usage, the data does not always fall into a structured pattern that can be stored in a relational database.

Along came NoSQL. The name was borrowed from the 1998 open-source RDBMS developed by Carlo Strozzi, and was later popularized by Eric Evans of Rackspace.

Unlike SQL or any of the other traditional databases, noSQL can be viewed more as a collective term for a variety of new data storage backends, with the concept of transactions taken out. With its eternally loose definition, a noSQL store can aggregate into a single record data that would span rows across multiple tables in a traditional relational database. This obviously results in enormous chunks of data, posing storage challenges. However, with the cost of storage decreasing rapidly, that can be ignored compared to the potential you now have. Couchbase, one of the companies that caught on quickly to this new revolution in data storage and retrieval with its document-oriented database technology, has an interesting article outlining why noSQL.
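To illustrate what “aggregating rows that span multiple tables” looks like, here is a hedged Python sketch: a toy order that a relational schema would split across three tables gets denormalized into one self-contained, JSON-style document – the shape a document store like Couchbase holds. All table names and fields are invented for the example:

```python
# Relational view: one order spread across three "tables" (rows as dicts)
customers = [{"id": 1, "name": "Ada"}]
orders    = [{"id": 10, "customer_id": 1}]
items     = [{"order_id": 10, "sku": "book", "qty": 2}]

def to_document(order_id):
    """Denormalize one order into a single self-contained document."""
    order = next(o for o in orders if o["id"] == order_id)
    customer = next(c for c in customers if c["id"] == order["customer_id"])
    return {
        "order_id": order["id"],
        "customer": customer["name"],
        "items": [i for i in items if i["order_id"] == order_id],
    }

doc = to_document(10)
print(doc)
```

The document duplicates data that the relational schema kept normalized – which is exactly the storage-for-speed trade-off described above.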

They are not the only ones that have grown into this new idea. Hadoop is yet another one of those, and it has quickly become a household name. Developed and sustained by a group of unpaid volunteers, Hadoop is a framework to process large data sets – better known as big data. Said to have begun as a free implementation of Google MapReduce, the framework now has several big names building services and solutions around it, some of the notable ones being Amazon Web Services (AWS), VMware Hadoop Virtual Extensions (HVE) and IBM BigInsights.
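The MapReduce idea that Hadoop implements can be sketched in a few lines of plain Python. This is only a single-process illustration of the map and reduce phases (the canonical word-count example), not how Hadoop itself distributes work across a cluster:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of text."""
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big ideas", "big frameworks"]
emitted = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(emitted))
# {'big': 3, 'data': 1, 'ideas': 1, 'frameworks': 1}
```

In a real Hadoop job, the map calls run in parallel on many machines and the framework shuffles the emitted pairs so that all counts for one word land on the same reducer.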

Yet another database that has been gaining popularity of late is MongoDB – a project spun off by 10gen. Like Couchbase, it is a document-oriented database, and it has picked up several notable adopters, including SAP, MTV and SourceForge.

With an “unstructured” database come the challenges of querying it. Mongo uses a binary-encoded variant of JSON (known as BSON, or Binary JSON) for representing documents and queries, whereas Couchbase has adopted a SQL-like query language, known as UnQL (Unstructured Query Language), that is slowly becoming a standard worldwide.
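To give a flavor of what Mongo-style querying looks like, here is a toy Python matcher supporting a tiny subset of MongoDB’s query syntax (plain equality plus the `$gt`/`$lt` comparison operators). The documents and fields are invented for the example, and a real application would of course use a driver such as pymongo rather than this sketch:

```python
def matches(doc, query):
    """Check one document against a tiny subset of Mongo-style query syntax."""
    for field, cond in query.items():
        if isinstance(cond, dict):  # operator form, e.g. {"$gt": 100}
            for op, value in cond.items():
                if op == "$gt" and not doc.get(field, float("-inf")) > value:
                    return False
                if op == "$lt" and not doc.get(field, float("inf")) < value:
                    return False
        elif doc.get(field) != cond:  # plain equality, e.g. {"title": "nosql"}
            return False
    return True

posts = [{"title": "ngrams", "views": 120}, {"title": "nosql", "views": 45}]
popular = [p for p in posts if matches(p, {"views": {"$gt": 100}})]
print(popular)  # only the post with more than 100 views
```

The query itself is just a document – which is the key convenience of this style: data and queries share one representation.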

While all of these are still in the nascent stages of development, and as the big data wave rapidly approaches its peak, let me leave you with a slide deck from QCon London 2013, presented by Matt Asay, VP of Corporate Strategy at 10gen, on the “Past, Present and Future of noSQL.”

The new library of Alexandria – the power of Big Data


A term that has been attracting a substantial amount of curiosity in the recent past – and one that will likely keep growing in importance as the era of the Internet unfolds and information flow becomes more widely available – is Big Data. Although the word has been ringing all around me and my place of work for quite some time, what really triggered my interest are two books that I am currently alternating between: “The Long Tail” by Chris Anderson, a book that describes how endless choice is creating unlimited demand, and “Big Data” by Viktor Mayer-Schönberger and Kenneth Cukier, a book that sets forth to describe a concept that could revolutionize the way we live and think.

Wikipedia defines big data as

a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications

But perhaps a more fitting definition is the one described in the book “Big Data” – a large set of data derived from a sample size N where N = all. The reason I find the latter more befitting is that data sets do not always have to be big, as long as they encompass the entire world they measure. For example, the book describes a study done on corruption in Sumo wrestling in Japan. The study collected data from almost 65,000 matches across 7 years in Japan to find a correlation. The data in this case was not as big as one would imagine it to be. But the fact that it “surveyed” the entire set of matches across those 7 years, rather than limiting itself to certain samples, made me lean towards calling it “big data”.

Big data changes the fundamental aspect of life by giving it a quantitative dimension

say Viktor and Kenneth in their book. Humans have long tried to quantify several aspects of human behavior in order to gain insights and perform predictive analysis. One of the terms I used in my previous paragraph is of interesting relevance here – “survey”. Surveys were perhaps one such primitive form of gathering relevant data. One of the major challenges of a survey is that your sample size is N < all, which means you only have the data associated with the population that actually took your survey. The results then become biased towards the characteristics of that limited population, which does not necessarily portray the entirety. As this problem started to be understood, statisticians found that the results were more accurate if the sample was chosen at random, rather than by simply increasing the sample size. Studies have shown that extrapolating a survey done on a random sample yields more accurate results than a large sample drawn from a specific subset of the population.

This still does not solve a challenge that I’d like to call active polling vs. passive polling. In almost all cases, a survey deals with a specific set of questions answered by a specific group of people – simply put, a survey is active polling. For truly understanding human behavior, this can prove inaccurate, especially because when answering a question, humans tend to stop and think. It is quite analogous to studying how humans interact within a group by placing a tutor or a professor in the group: the mere awareness of a study being conducted could skew the behavior. Whereas if the same group of people can be “passively” observed, the information gathered comes closer to being accurate. The same can be said about any method of predictive analysis. Big Data analysis methodologies, in my view, are far more passive in the way they poll data and hence tend to lean more towards being accurate.
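The point about a small random sample beating a large but biased one can be illustrated with a quick simulation – a sketch under invented assumptions (a population with two subgroups whose averages differ, and a survey that only reaches one of them):

```python
import random

random.seed(42)  # make the simulation reproducible

# A population split into two subgroups with different averages:
# group A centers around 10, group B around 20, so the true mean is ~15.
group_a = [random.gauss(10, 2) for _ in range(50_000)]
group_b = [random.gauss(20, 2) for _ in range(50_000)]
population = group_a + group_b
true_mean = sum(population) / len(population)

# A large but biased "survey": only group A happened to respond.
biased = random.sample(group_a, 10_000)
biased_mean = sum(biased) / len(biased)

# A much smaller sample drawn at random from everyone.
rand = random.sample(population, 500)
rand_mean = sum(rand) / len(rand)

print(f"true {true_mean:.1f}, biased {biased_mean:.1f}, random {rand_mean:.1f}")
```

Despite being twenty times smaller, the random sample lands near the true mean, while the biased sample sits stuck near its own subgroup’s average – no amount of extra biased respondents fixes that.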

In the coming weeks, as I wander through the world of Big Data, I plan to post more examples and insights into this amazing field that has been gaining significant relevance in today’s world. I plan to talk about one aspect in each of my posts, so as to limit yet another challenge of big data, known as information overload! But that does not entirely solve the problem. My plan is also to invite more interaction from my readers as I meander through. Feel free to enthrall me with your comments.