Dimensions of Data Sets

The following figure outlines important dimensions to consider when analyzing a new data set.

Figure: Important dimensions to classify data sets for analysis.

The 3Vs

A number of definitions propose the so-called 3Vs to decide whether data qualifies as "Big" or not. Regardless of whether we speak of "Big Data", the 3Vs serve as useful dimensions when deciding on the methods to process and analyze a given data set.

Volume

The first V stands for the volume of the data set. We measure volume in bits, bytes, megabytes, gigabytes, and so forth. The question is: How many 0s and 1s does it take to store the data on a storage medium such as a hard drive or the computer's memory (RAM)?

Depending on the answer, we need to take different measures to analyze the data. The most important question is whether the data can be processed and analyzed on a single machine (a laptop or server), or whether we need a set of interconnected machines (a so-called cluster) to distribute the data and the processing workload across multiple nodes.

There is no single answer to this question, because it depends on the hardware of the machine as well as on the types of analysis we want to perform. Are we running queries that must take the whole data set into account, or do we only need a subset of the data? And since hardware keeps improving, the boundaries are pushed further every day.
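
As a back-of-the-envelope check, the following Python sketch compares a data set's size on disk against an assumed memory budget. The file path, the memory budget, and the safety factor are all hypothetical placeholders; the real decision depends on your tools and workload.

```python
import os

# Hypothetical file path for illustration; adjust to your own data set.
DATA_PATH = "measurements.csv"

# Assumed values, not hard rules: a machine with 16 GB of RAM and a rough
# factor for the in-memory overhead of tools such as pandas.
MEMORY_BUDGET_BYTES = 16 * 1024**3
SAFETY_FACTOR = 3

size_on_disk = os.path.getsize(DATA_PATH)

if size_on_disk * SAFETY_FACTOR < MEMORY_BUDGET_BYTES:
    print(f"{size_on_disk / 1024**3:.2f} GB on disk: a single machine should suffice.")
else:
    print(f"{size_on_disk / 1024**3:.2f} GB on disk: consider chunked processing or a cluster.")
```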

Velocity

The second V addresses the frequency at which new data records are produced at the source and how quickly they need to be incorporated into the analysis. Consider the example of sensor data in a manufacturing environment: here, potentially millions of records are produced every second, which increases the velocity and with it the requirements for our data capturing solution.

But the frequency of production is not the only relevant factor. It makes a huge difference whether we need to analyze new records in real time, or whether we can simply collect the data and perform a batch analysis every 24 hours or so. The former requires a real-time streaming solution; the latter can be achieved with simple batch scripts (e.g. in SQL, Python, or R) running once per day, as sketched below.
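
For the batch case, such a script could look roughly like the following Python sketch, run once per day (e.g. scheduled via cron). The file and column names are hypothetical.

```python
import pandas as pd

# A minimal sketch of a daily batch job over yesterday's sensor data.
# "sensor_readings.csv" and its columns are made up for illustration.
readings = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

# Keep only the records produced in the last 24 hours.
cutoff = pd.Timestamp.now() - pd.Timedelta(hours=24)
recent = readings[readings["timestamp"] >= cutoff]

# Aggregate per machine and persist the daily summary.
summary = recent.groupby("machine_id")["temperature"].agg(["mean", "max"])
summary.to_csv("daily_summary.csv")
```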

Variety

To complete the trio, the third V takes the variety of the data into account: What types of data are we dealing with in our data set? At the highest level, we can distinguish between structured and unstructured data. Structured data is the data we find in spreadsheets or relational databases, where data is organized in columns and rows. Each column has a data type and a clearly defined range of values it can hold. Structured data can be queried directly with SQL using operations such as filtering, grouping, and sorting.
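
As a small illustration, the following Python sketch applies exactly those operations to a structured table using pandas. The data set is invented for this example.

```python
import pandas as pd

# A small structured data set: every column has a clear type and value range.
orders = pd.DataFrame({
    "country":  ["DE", "US", "DE", "FR"],
    "quantity": [3, 1, 5, 2],
    "price":    [9.99, 24.50, 9.99, 14.00],
})

# Filtering, grouping, and sorting map directly onto structured data.
revenue = (
    orders[orders["quantity"] > 1]                         # filter
    .assign(total=lambda df: df["quantity"] * df["price"])
    .groupby("country")["total"].sum()                     # group
    .sort_values(ascending=False)                          # sort
)
print(revenue)
```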

Unstructured data, on the other hand, lacks a clear definition of a data type and a range of values. A good example is text data, such as social media posts, news articles, or documents in general. For a column containing free text, we have no direct way to apply the typical operations such as filtering, grouping, and sorting, because nearly every entry is going to be unique. This doesn't mean we can't analyze unstructured data, but we first need to define a structure and impose it onto the data.

In Data Analysis, there is no such thing as unstructured data. Only data that structure hasn't been applied to yet.
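
As a minimal illustration of imposing structure onto text, the following Python sketch turns a handful of hypothetical social media posts into word counts, i.e. a structured table of tokens and frequencies. The posts and the simple tokenization rule are made up for this example.

```python
from collections import Counter

# Hypothetical unstructured input: raw social media posts.
posts = [
    "Loving the new update!",
    "The new update broke my app...",
    "Great update, very stable so far.",
]

# Impose a simple structure: lowercase the text, split it into tokens,
# and count how often each token occurs.
tokens = [word.strip(".,!?").lower() for post in posts for word in post.split()]
word_counts = Counter(tokens)

# The result is structured data that can be filtered, grouped, and sorted again.
print(word_counts.most_common(3))
```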

Some definitions additionally include semi-structured data, such as lists or objects. These data structures do not lend themselves to direct analysis and querying, but they do come with a clearly defined structure that can be leveraged easily.
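
JSON records are a common form of semi-structured data. The sketch below uses pandas' json_normalize to flatten hypothetical nested records into a flat table; the field names are invented for illustration.

```python
import pandas as pd

# Hypothetical semi-structured records: nested objects and a list field.
records = [
    {"id": 1, "user": {"name": "Ada", "city": "London"}, "tags": ["a", "b"]},
    {"id": 2, "user": {"name": "Alan", "city": "Manchester"}, "tags": ["c"]},
]

# The nesting follows a known schema, so it can be flattened into a table
# and then queried like any other structured data.
flat = pd.json_normalize(records)
print(flat)
```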

Quality

Coming soon.

Format

Coming soon.

Number of Tables

Coming soon.

Original Purpose

Coming soon.
