We live in a connected world where more than 3.7 billion people are connected to the internet. We use many digital gadgets every day. We communicate with our friends and the world by posting messages, likes and forwards on Facebook and Twitter, through email, and through mobile apps like WhatsApp. Millennials record every event of their lives in photos and videos: what food they are eating, which movie they are watching, at which airport they are waiting for a flight, and so on. With all these interactions, we are generating a humongous amount of data. Internet search engines are an inseparable part of our daily life. Google processes 3.5 billion searches a day!
Why should we care about this? The challenges of handling such a large amount of data seem daunting. Does it even make sense to embark on such an arduous task? Let us assume that we are interested in knowing how a particular disease is spreading across a country, because it could cause an epidemic and early action could prevent an outbreak. As a part of the solution, one may try to communicate with all the hospitals and private medical practitioners to get a head start on the information. With this data, aid can be provided to crisis-hit areas. However, this would be a huge effort and would also be time consuming.
Another way to look at this issue is with the help of search engines. Observe which search queries related to the disease or its medicines are being made on the search engine, and how many of them are as recent as one week or 15 days. Analysis of such data would be very useful in tracking how the disease is spreading. It would also be very quick, almost real time. Hence it makes a lot of sense to process such data.
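As a rough illustration of the idea, the sketch below counts how many queries in a small search log mention a disease within the last 7 and the last 15 days. The log entries, the keyword and the function name `count_recent_matches` are all invented for this example; a real search engine would work with far larger logs and more careful text matching.

```python
from datetime import date, timedelta

# Hypothetical search log: (date of query, query text). Sample data invented for illustration.
SEARCH_LOG = [
    (date(2024, 3, 1), "dengue symptoms"),
    (date(2024, 3, 10), "cheap flights"),
    (date(2024, 3, 12), "dengue fever medicine"),
    (date(2024, 3, 14), "dengue symptoms in children"),
    (date(2024, 3, 15), "movie tickets"),
]


def count_recent_matches(log, keyword, today, window_days):
    """Count queries that contain `keyword` and were made in the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    return sum(
        1
        for day, text in log
        if day >= cutoff and keyword in text.lower()
    )


today = date(2024, 3, 15)
last_week = count_recent_matches(SEARCH_LOG, "dengue", today, 7)
last_15_days = count_recent_matches(SEARCH_LOG, "dengue", today, 15)
print("Last 7 days:", last_week)
print("Last 15 days:", last_15_days)
```

Comparing the two counts gives a crude signal of whether interest in the disease is rising or fading, which is the kind of trend the paragraph above describes.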
In a distance MBA in AI and ML, big data processing and analytics form an important part of the curriculum.
3 V’s of Big Data
Let us see what characteristics data needs to have to be called big data. Initially, three qualities, volume, velocity and variety, popularly called the 3 V’s, were used to qualify anything as big data.
Very large volume is the first characteristic of big data. Here are some sample statistics on the number of users who produce large volumes of data every day through comments, posts, photos, videos, likes etc. on different social media platforms.
- Facebook: 2000 million users
- Google+: 111 million users
- Instagram: 1000 million users
- LinkedIn: 562 million users
- Pinterest: 200 million users
- Reddit: 542 million users
- Snapchat: 186 million daily users
- Twitter: 326 million users
- WhatsApp: 900 million users
- YouTube: 1500 million users
Data posted on social media qualifies as big data. So does the data collected from thousands of sensors in IoT-based applications.
Velocity refers to the speed with which data is generated from human interactions with social media, mobile apps, websites etc.
Let us look at some statistics about what happens every minute on social media.
- Snapchat users share 527,760 photos
- More than 120 professionals join LinkedIn
- Users watch 4,146,600 YouTube videos
- 456,000 tweets are sent on Twitter
- Instagram users post 46,740 photos
This is a very distinctive characteristic of big data. One should be able to handle this velocity to get insights and use them for competitive advantage.
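To get a feel for what this velocity means for a processing system, the short sketch below converts some of the per-minute figures quoted above into per-second and per-day rates. The dictionary name and helper functions are made up for this illustration; the numbers are the ones listed in this article.

```python
# Per-minute figures taken from the list above.
PER_MINUTE = {
    "Snapchat photos shared": 527_760,
    "YouTube videos watched": 4_146_600,
    "Tweets sent": 456_000,
    "Instagram photos posted": 46_740,
}


def per_second(per_minute):
    """Average number of events per second for a given per-minute rate."""
    return per_minute / 60


def per_day(per_minute):
    """Number of events in 24 hours if the per-minute rate stays constant."""
    return per_minute * 60 * 24


for name, rate in PER_MINUTE.items():
    print(f"{name}: {per_second(rate):,.0f} per second, {per_day(rate):,} per day")
```

Even the smallest figure here works out to hundreds of events every second, which is why systems that handle such data have to ingest it continuously rather than in occasional batches.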
Variety refers to the different types of data. We generate different types of data like text files, PDFs, Excel sheets, emails, photos, databases, videos and data generated by sensors. This includes structured data like database records and unstructured data like comments, likes etc. Unstructured data cannot easily be retrieved using Structured Query Language (SQL). Hence, a different type of database, known as a NoSQL database, is used to handle such data. MongoDB and CouchDB are examples of such databases.
At the start, when big data was introduced as a concept, the focus was on these 3 V’s. As more and more applications were developed using big data, additional V’s were added.
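The difference between the two kinds of data can be sketched with Python's standard library alone. The first half below stores fixed-column records in an in-memory SQLite table and retrieves them with SQL; the second half handles JSON documents whose fields vary from one record to the next, similar in spirit to the documents a store like MongoDB holds. The table, names and sample records are invented for this example, and the code does not use any actual NoSQL database API.

```python
import json
import sqlite3

# Structured data: every record has the same columns, so SQL works well.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Mumbai")],
)
pune_users = [
    row[0]
    for row in conn.execute("SELECT name FROM users WHERE city = ?", ("Pune",))
]
conn.close()

# Loosely structured data: each document carries its own set of fields.
DOCUMENTS = [
    '{"user": "Asha", "comment": "Great movie!", "likes": 12}',
    '{"user": "Ravi", "photo": "airport.jpg", "tags": ["travel"]}',
]
posts = [json.loads(doc) for doc in DOCUMENTS]

# There is no fixed schema, so we check whether a field exists before using it.
with_comments = [post["user"] for post in posts if "comment" in post]

print("Users in Pune:", pune_users)
print("Users who commented:", with_comments)
```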
Veracity, Validity, Volatility
The dictionary meaning of the word ‘veracity’ is ‘conformity to facts’ or ‘accuracy’. In data processing, the bigger challenge is the cleanliness of the data. As data is gathered from multiple sources, the chances of getting a lot of noise or bad data are very high. If you use this so-called dirty data without cleaning it, predictions and analysis would not be useful or may sometimes be grossly wrong. Veracity refers to this accuracy or cleanliness of data. Along with veracity, the validity of data in its context is equally important.
Volatility refers to how long the data remains valid in the business context. For example, Facebook comments about a recently released movie may not have a lasting impact. User sentiment may keep changing every week. The data therefore has to be processed quickly so that corrective action can be taken in time.
It makes sense to process big data only if it has value. Extracting that value becomes difficult due to velocity and volatility.
Big data technologies are designed to handle these challenges. Tools from the Hadoop ecosystem such as MapReduce, Pig, Hive, HBase, Sqoop, Spark and Storm are popular for big data processing. NoSQL databases like MongoDB are also used. A distance MBA in AI and ML would cover some of these technologies along with analytics. After completing the distance MBA program, you would use these technologies in a data science job role to overcome these challenges.
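A minimal sketch of what cleaning dirty data can look like is shown below. It assumes a small list of hypothetical case reports gathered from different sources, then drops records with a missing city or a non-numeric count and removes duplicates. The field names, the sample records and the cleaning rules are assumptions made for this example; real pipelines need rules suited to their own data.

```python
# Hypothetical raw records collected from several sources (invented sample data).
RAW_RECORDS = [
    {"city": "Pune", "cases": "12"},
    {"city": " pune ", "cases": "12"},    # duplicate once whitespace and case are normalised
    {"city": "Mumbai", "cases": "n/a"},   # case count is not a number
    {"city": "", "cases": "7"},           # city is missing
    {"city": "Nagpur", "cases": "5"},
]


def clean(records):
    """Return records with normalised city names, integer counts and no duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        city = record.get("city", "").strip().title()
        cases = record.get("cases", "").strip()
        # Skip records that are incomplete or not numeric.
        if not city or not cases.isdigit():
            continue
        key = (city, int(cases))
        # Skip records already seen in normalised form.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"city": city, "cases": int(cases)})
    return cleaned


CLEANED = clean(RAW_RECORDS)
print(CLEANED)
```

Of the five raw records, only two survive cleaning, which shows how much noise can hide in even a tiny data set.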