I have just begun working on big data issues, supporting the EU
project Optique (http://www.optique-project.eu/) with events and by setting up a network of
interested people, industries, authorities and organizations. If you would like
to participate, send me an email.
Big data is a fascinating topic. The term “big data” leads one's
thoughts toward the vast volumes of data continuously generated by
modern technology. It is not difficult to find impressive examples (http://en.wikipedia.org/wiki/Big_data): the Hubble Space Telescope gathered 120 terabytes
(TB) of data between 1990 and 2007. The Large Hadron Collider (LHC) at CERN
captures around 25,000 TB every year from its acceleration/collision runs. The
animated film “Despicable Me” used 142 TB of data to produce 95 minutes of
film. The online radio site Pandora has 250 TB of music in its archives.
Walmart handles over 1 million customer transactions every hour and has accumulated
over 2,560 TB about its customers' shopping habits. YouTube has 530 petabytes
(PB) of video streams in its archives. The governance of these huge data
assets poses technical, managerial and financial challenges for storage,
processing (indexing, search and retrieval), high-bandwidth transmission,
quality assurance and protection.
But big data is much more than just its volume. There is also the velocity
of big data, in other words the time it takes to create/capture, update,
index, manage, analyze and use data. Some data might be captured millions of
times every second (like the LHC collision sensors) while other data might be
updated manually once a year. Data that are created, updated and used on
different timelines must be analyzed and synchronized before they can serve a
comparative user query. If it takes too much lead time to query, search,
retrieve and assemble a result, then the big data resource will not be used as
anticipated. A big challenge is therefore the ability to “crunch” a lot of data
in a short time span, and to analyze data in real time, automatically.
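Just to make the synchronization point concrete, here is a minimal sketch (in Python with the pandas library, not anything from the Optique project) that aligns a fast sensor stream with a reference value updated far less often, so the two can be compared in one query. The column names and timestamps are invented for illustration.

    import pandas as pd

    # Sensor readings captured every second (a fast stream)
    sensor = pd.DataFrame(
        {"pressure": [1.01, 1.02, 0.99, 1.00]},
        index=pd.date_range("2013-06-01 12:00:00", periods=4, freq="s"),
    )

    # A reference limit updated only once a day (a slow stream)
    reference = pd.DataFrame(
        {"pressure_limit": [1.05]},
        index=pd.to_datetime(["2013-06-01"]),
    )

    # Match each sensor reading with the most recent reference value,
    # so both can be used in the same comparative query.
    aligned = pd.merge_asof(
        sensor.sort_index(), reference.sort_index(),
        left_index=True, right_index=True,
    )
    print(aligned[aligned["pressure"] > aligned["pressure_limit"]])

The point is not the specific library call but the lead time: every alignment step like this adds to the delay between asking a question and getting an answer.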
When one has a firm grasp of volume and velocity, it is
time to understand the variety of big data. Integrating data from
different sources can result in severe inconsistencies, since the same data can
be either unstructured or structured and based on different representations, formats
and media. We could, for example, find the data term “temperature” represented
in many different forms. It could be a number, a text string, a stream of bits,
a color, an image or icon, an animation, a sound, a frequency or an algorithm, and
expressed in Celsius, Fahrenheit, Kelvin or some new scale suited for a special
purpose (for Mexican food: scorching, very hot, hot, just fine and gringo). Some
data are structured, with a data model already linked to them, while other data
sources are unstructured and need hands-on work to sort out their interpretation.
The challenge consists of making sure that data can be mapped to something: an
object, a property or a reference, each with a clear definition and references.
If data is unstructured, then we need to map it to existing
structures or rapidly create new objects, properties or references, as sketched below. Large
amounts of time could be saved if this could be done automatically.
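To illustrate what such a mapping could look like, here is a small hypothetical Python sketch that maps different representations of “temperature” (a plain number, or a text string with a Fahrenheit or Kelvin suffix) onto one structured form in degrees Celsius. It is only an illustration of the idea, not a recipe.

    import re

    def to_celsius(value):
        """Map a heterogeneous 'temperature' value to a float in degrees Celsius."""
        if isinstance(value, (int, float)):
            return float(value)                  # already a number: assume Celsius
        text = str(value).strip().lower()
        match = re.match(r"(-?\d+(\.\d+)?)\s*(c|f|k)?", text)
        if not match:
            raise ValueError("Cannot interpret temperature: %r" % (value,))
        number, _, scale = match.groups()
        number = float(number)
        if scale == "f":                         # Fahrenheit -> Celsius
            return (number - 32) * 5 / 9
        if scale == "k":                         # Kelvin -> Celsius
            return number - 273.15
        return number                            # Celsius, or no scale given

    print(to_celsius(21))          # 21.0
    print(to_celsius("98.6 F"))    # 37.0
    print(to_celsius("300 K"))     # 26.85

Doing this by hand for two or three sources is easy; doing it automatically across thousands of sources is the real challenge.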
These three V’s, volume, velocity and variety, formed Gartner’s
initial 2001 definition of big data characteristics. But there are more things
that we need to take into account. Some organizations regard value as
an essential characteristic, that is, how to create value from big data (http://www.bigdatavalue.eu/). This is naturally very important, and it
will surely become the natural state of the art in the future. Most organizations will
depend on big data sources to better understand and provide the right
products and services to their dependents, customers, employees, students,
patients and clients.
But value is not a big data characteristic. It is more of an output effect,
based on the demand for specific information and on the capability of the big data
environment to supply/deliver data efficiently and effectively to cover that demand.
IBM and others correctly assume that big data sources will not be
used if one cannot ensure availability of, and trust in, the data. They call
this veracity: making sure that these huge data assets are validated, truthful,
reliable and, in short, trusted. Big data links here to a number of
already ongoing initiatives such as data/information quality (making sure that
data is available, authentic, current and accurate), information security (protecting
sensitive content from unauthorized access and manipulation), regulatory openness
and transparency (SOX), personal integrity (PUL) and intellectual property rights
(IPR and copyright). Veracity is not a characteristic of big data itself. It is more
a set of requirements to ensure its validity and trustworthiness.
Some regard complexity as the final characteristic of big data.
I do not agree with that: complexity is nothing more than a consequence of
the relationships between volume, velocity and variety (see the picture
below). So the five V’s of big data are: Volume, Velocity, Variety, Value and
Veracity.
