
Some big data terminology

This year I've helped a lot of clients with their big data projects, and it's likely that anyone who works in DevOps, or even as a regular IT person, will have to deal with big data in the coming years. Businesses rely more and more on data analytics to make decisions that wouldn't have been possible before. Whether it's an insurance company adjusting rates based on real-time car data streaming in, a security company automatically alerting its agents when something suspicious is detected on one of its many surveillance systems, or a small business trying to gain more insight from web traffic, big data is everywhere.

But before you can deal with big data, you need to know some of the common terms being used, what they refer to, and how they typically apply within the enterprise. This will allow you to successfully engage with the different stakeholders and make sure everyone is on the same page, so projects don't over-promise and under-deliver.

Types of data

It's one thing to say that virtually every business will need to use big data at some point, but where does that data even come from? Typically, we refer to big data as anything produced at a high rate, usually in an automated fashion, although that's not always the case. Data producers are the resources or people that generate this data: IoT devices, software programs, hardware devices, or even people writing reports and notes. The data can take many formats, but it usually falls into one of two categories: structured and unstructured.

Structured data is anything that can easily be stored in a database: CSV or JSON files with strict rows and columns, or fields that can be defined and imported automatically. Unstructured data, however, has none of that. A PowerPoint presentation or a Word document would be classified as unstructured. That doesn't mean unstructured data can't be useful. In fact, most businesses rely heavily on Microsoft Office applications and produce hundreds if not thousands of Office documents per year, so you may need to handle unstructured data as well.
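The difference is easy to see in code. Here's a minimal sketch with made-up sample data: the structured CSV and JSON records map straight onto named fields, while the unstructured report offers nothing to import without extra processing.

```python
import csv
import io
import json

# Structured: a CSV snippet with strict rows and columns maps
# directly onto database fields.
csv_data = "device_id,temperature,timestamp\nsensor-1,21.5,2024-01-01T00:00:00"
rows = list(csv.DictReader(io.StringIO(csv_data)))
print(rows[0]["temperature"])  # → "21.5"

# Structured: JSON with predictable fields imports just as easily.
record = json.loads('{"device_id": "sensor-1", "temperature": 21.5}')
print(record["temperature"])  # → 21.5

# Unstructured: a report has no schema to import; without extra
# processing, all you can do is treat it as free text.
report = "Q3 summary: traffic grew noticeably, mostly from mobile."
print(len(report.split()))  # a word count is about all you get for free
```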

Storing data

Two very common terms when talking about storage are the data lake and the data warehouse. A data lake is usually the first place your data ends up. It's typically a large array of hard drives where data gets dumped into folders, or an object storage system like Amazon S3. A typical scenario would have a series of S3 buckets with an API Gateway in front of them; all of your data producers send data to those buckets, and that becomes your data lake.
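To make the "dumped into folders" idea concrete, here is a minimal file-based sketch of a data lake using only the standard library. The `dump_to_lake` helper and the source/day layout are assumptions for illustration; in practice this would be S3 keys like `s3://bucket/gps/2024-01-15/part-0000.json` behind an API Gateway.

```python
import json
import pathlib
import tempfile

# Hypothetical lake root; a real lake would be an S3 bucket.
lake_root = pathlib.Path(tempfile.mkdtemp()) / "data-lake"

def dump_to_lake(source: str, day: str, payload: dict) -> pathlib.Path:
    """Write one raw record under a source/day prefix, the way
    producers dump objects under date-partitioned S3 keys."""
    prefix = lake_root / source / day
    prefix.mkdir(parents=True, exist_ok=True)
    path = prefix / f"part-{len(list(prefix.iterdir())):04d}.json"
    path.write_text(json.dumps(payload))
    return path

p = dump_to_lake("gps", "2024-01-15", {"lat": 48.85, "lon": 2.35})
print(p.relative_to(lake_root))  # gps/2024-01-15/part-0000.json
```

Note that the lake stores the records raw: nothing is parsed, validated, or aggregated at this stage.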

The idea behind a data lake is to have all of your important data in a single location, making it easier to process. But the data lake isn't where you do the processing, nor where you gain insights. A modern data platform has a second level called a data warehouse: a structured solution such as a database or a specialized platform like Redshift or Snowflake. Data is loaded from the data lake into the warehouse for processing. Usually, you want your compute power separated from your storage, so that you can provision as much computing power as each task requires.

For example, let's say you have millions of GPS devices around the world, dumping 10 TB of data into your data lake every day. Trying to constantly analyze that raw stream would be a massive job. Instead, you categorize the data, pick the useful bits out of the noise, and send easy-to-process content to your data warehouse. Once triaged and categorized, the data is much easier to make sense of.
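The triage step above can be sketched in a few lines. The GPS records, the HDOP accuracy threshold, and the table layout are all assumptions for illustration, and sqlite stands in for a real warehouse like Redshift or Snowflake: noisy fixes are discarded, and only the useful rows are loaded.

```python
import json
import sqlite3

# Hypothetical raw records as they might sit in the lake; in
# practice these would be files read back from S3.
raw_lake_records = [
    '{"device": "gps-1", "lat": 48.85, "lon": 2.35, "hdop": 0.9}',
    '{"device": "gps-2", "lat": 0.0, "lon": 0.0, "hdop": 50.0}',   # bad fix
    '{"device": "gps-3", "lat": 40.71, "lon": -74.0, "hdop": 1.2}',
]

# The "warehouse": an in-memory sqlite table for this sketch.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE positions (device TEXT, lat REAL, lon REAL)")

for line in raw_lake_records:
    rec = json.loads(line)
    # Triage: drop fixes with poor accuracy (high HDOP) — the noise.
    if rec["hdop"] < 5.0:
        warehouse.execute("INSERT INTO positions VALUES (?, ?, ?)",
                          (rec["device"], rec["lat"], rec["lon"]))

count = warehouse.execute("SELECT COUNT(*) FROM positions").fetchone()[0]
print(count)  # → 2: the bad fix never reaches the warehouse
```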

Gaining insights

None of these processes would have any point if you couldn't gain useful business insights from them. There are many tools out there to help you process data and run analytics, but you usually have to write code to make sense of your data, because every business case is slightly different. This is typically done by data scientists in Python, R, or Scala, using Apache Spark or Databricks.
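As a taste of what that analytics code computes, here is a tiny group-and-aggregate job in plain Python so it stays self-contained; a data scientist would express the same logic as a Spark DataFrame job over far more data. The trip records are invented for the example.

```python
from statistics import mean

# Hypothetical warehouse rows: speed samples per GPS device.
trips = [
    {"device": "gps-1", "speed_kmh": 42.0},
    {"device": "gps-1", "speed_kmh": 58.0},
    {"device": "gps-2", "speed_kmh": 12.0},
]

# Group by device, then average — the kind of insight an insurance
# company might feed into its rate adjustments.
by_device = {}
for t in trips:
    by_device.setdefault(t["device"], []).append(t["speed_kmh"])

avg_speed = {dev: mean(speeds) for dev, speeds in by_device.items()}
print(avg_speed)  # → {'gps-1': 50.0, 'gps-2': 12.0}
```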

Big data is a very exciting field, but it can quickly become overwhelming if you don't put systems in place to automate and efficiently process the data. As IT professionals, it's part of our role to build those systems, and we have more tools than ever to do just that.