Wednesday, July 2, 2014

Five V’s of big data

I have just begun working with big data issues, supporting the EU project Optique (http://www.optique-project.eu/) with events and by setting up a network of interested people, industries, authorities and organizations. If you would like to participate, send me an email.

Big data is a fascinating topic. The term “big data” leads one's thoughts toward the vast volumes of data continuously generated through modern technology. It is not difficult to find impressive examples (http://en.wikipedia.org/wiki/Big_data): the Hubble Space Telescope gathered 120 terabytes (TB) of data between 1990 and 2007. The Large Hadron Collider (LHC) at CERN captures around 25,000 TB every year from its acceleration/collision runs. The animated film “Despicable Me” used 142 TB of data to produce 95 minutes of film. The online radio site Pandora has 250 TB of music in its archives. Walmart handles over 1 million customer transactions every hour and has accumulated over 2,560 TB of data about its customers' shopping habits. YouTube has 530 petabytes (PB) of video streams in its archives. The governance of these huge data assets poses technical, managerial and financial challenges for storage, processing (indexing, search and retrieval), high-bandwidth transmission, quality assurance and protection.

But big data is much more than just its volume. There is also the velocity of big data, or in other words, the time it takes to create/capture, update, index, manage, analyze and use data. Some data might be captured millions of times every second (like the LHC collision sensors), while other data might be updated manually once a year. Data that are created, updated and used on different timelines must be analyzed and synchronized before they can serve a comparative user query. If it takes too much lead time to query, search, retrieve and assemble a result, then the big data resource will not be used as anticipated. A big challenge is therefore the ability to “crunch” a lot of data in a short time span, and to analyze data in real time, automatically.
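To make the velocity challenge a little more concrete, here is a minimal sketch of one common technique, sliding-window analysis, which lets you compute a rolling statistic over a fast stream without storing every reading that ever arrived. The sensor values and window size below are made up purely for illustration.

```python
from collections import deque
from statistics import mean

# Hypothetical illustration: a fixed-size sliding window keeps only the most
# recent readings, so a high-velocity stream can be analyzed as it arrives.
class SlidingWindowMonitor:
    def __init__(self, window_size=1000):
        self.window = deque(maxlen=window_size)  # oldest readings fall out automatically

    def ingest(self, reading):
        """Add one reading and return the current rolling average."""
        self.window.append(reading)
        return mean(self.window)

# Usage: feed readings as they arrive and act on the rolling statistic.
monitor = SlidingWindowMonitor(window_size=3)
for value in [21.0, 21.5, 22.0, 30.0]:
    print(monitor.ingest(value))
```

The point of the sketch is the trade-off it embodies: by keeping only a bounded window, memory and query time stay constant no matter how fast the data arrives.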

When one has a firm grasp of the volume and velocity, it is time to understand the variety of big data. Integrating data from different sources can result in severe inconsistencies, because the same data may be structured or unstructured and based on different representations, formats and media. We could, for example, find the data term “temperature” represented in many different forms: a number, a text string, a stream of bits, a color, an image or icon, an animation, a sound, a frequency or an algorithm, expressed in Celsius, Fahrenheit, Kelvin or some new scale suited for a special purpose (Mexican food: scorching, very hot, hot, just fine and gringo). Some data are structured, with a data model already linked to them; other data sources are unstructured and need hands-on work to sort out their interpretation. The challenge is to make sure that data can be mapped to something (an object, a property or a reference) and that they have a clear definition and references. If data are unstructured, we need to map them to existing structures or rapidly create new objects, properties or references. Large amounts of time could be saved if this could be done automatically.
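As a small illustration of taming variety, the sketch below maps a few hypothetical temperature representations (numbers, numeric strings, different units) onto one canonical structure, a float in degrees Celsius. The input formats and unit codes are assumptions for the example, not a standard.

```python
# Hypothetical sketch: normalize heterogeneous "temperature" representations
# to one canonical form before integrating data from different sources.
def to_celsius(value, unit="C"):
    """Normalize a temperature reading to degrees Celsius."""
    value = float(value)                      # accepts numbers or numeric strings
    unit = unit.upper()
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    raise ValueError(f"Unknown temperature unit: {unit}")

# Usage: readings from different sources, different formats, one result type.
readings = [("23.4", "C"), (98.6, "F"), (300, "K")]
print([round(to_celsius(v, u), 2) for v, u in readings])   # [23.4, 37.0, 26.85]
```

Real variety problems are of course messier (images, free text, missing units), but the principle is the same: agree on a target structure and map everything toward it, automatically wherever possible.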

These three V’s (volume, velocity and variety) formed Gartner’s initial 2001 definition of big data characteristics. But there are more things that we need to take into account. Some organizations regard value as an essential characteristic, that is, how to create value from big data (http://www.bigdatavalue.eu/). This is naturally very important, and it will surely become the natural state of affairs in the future. Most organizations will depend on big data sources to better understand their dependents, customers, employees, students, patients and clients and to provide them with the right products and services.
But value is not a big data characteristic. It is more an output effect, based on the demand for specific information and the big data environment’s capability to supply data efficiently and effectively to meet that demand.

IBM and others correctly assume that big data sources will not be used if one cannot ensure availability of, and trust in, the data. They call this veracity: making sure that these huge data assets are validated, truthful, reliable and, in short, trusted. Big data links here to a number of ongoing initiatives such as data/information quality (making sure that data are available, authentic, up to date and accurate), information security (protecting sensitive content from unauthorized access and manipulation), regulatory openness and transparency (SOX), personal integrity (PUL) and intellectual property rights (IPR and copyright). Veracity, however, is not a characteristic of big data itself. It is rather a set of requirements to ensure its validity and trust.
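To show how such veracity requirements can be made operational, here is a hypothetical sketch of automated record validation. The field names, freshness limit and plausibility range are invented for illustration and would differ for every real data source.

```python
from datetime import datetime, timedelta

# Hypothetical sketch: veracity requirements expressed as concrete checks that
# every incoming record must pass before it is treated as trusted.
def validate_record(record, max_age_days=30):
    errors = []
    if not record.get("source"):                          # authenticity: known origin
        errors.append("missing source")
    timestamp = record.get("timestamp")
    if timestamp is None:
        errors.append("missing timestamp")
    elif datetime.utcnow() - timestamp > timedelta(days=max_age_days):
        errors.append("stale data")                       # freshness: recent enough
    value = record.get("value")
    if value is None or not (-90.0 <= value <= 60.0):     # accuracy: plausible range
        errors.append("value out of plausible range")
    return errors                                         # empty list means trusted

record = {"source": "station-17", "timestamp": datetime.utcnow(), "value": 21.3}
print(validate_record(record))   # []
```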

Some consider complexity to be the final characteristic of big data. I do not agree: complexity is nothing more than a consequence of the relationships between volume, velocity and variety (see picture below). So the five V’s of big data are volume, velocity, variety, value and veracity.

