Big Data analysis using Apache Spark

The Dendory Capital Datalab creates solutions, workflows, tools and pipelines for our clients' Big Data needs. As part of this process, we need to make sure the solutions we provide our clients work properly. Today, we're going to go over a simple use case of analyzing a dataset to gain useful insights about a particular problem. We're going to load a CSV file containing data from the NASA Near Earth Objects project, and try to find out whether any large object is going to pass close to Earth in the next week.

Signing up for Databricks

Databricks is a commercial platform built on top of Apache Spark, and provides a handy web-based interface to create and manage clusters, start a notebook, and run Python code without any administration overhead. Better yet, they have a community edition we're going to be able to use for free.

So the first thing to do is to go to the Databricks site and sign up for a community account. Once you confirm your email address, you can log into the control plane with the account you just created. You should see the following screen:

Creating a cluster

The idea behind a modern data workflow is that the data is separated from the compute. So the next step will be to create a cluster in order to run code on the data we're going to be accessing. Click on the Clusters link on the left side, then create a cluster: give it a name and click on Create. With the community edition, you don't have access to a lot of the additional options, but for now keeping everything as default is fine.

Creating a notebook

A notebook is basically where you can run all of your code. It can run multiple languages such as R, Scala, and SQL, but by default it runs Python. So go back to the main page by clicking on the home icon at the top of the menu, then click on New Notebook, and give it a name. You should see a brand new notebook:

Your notebook has to be attached to a cluster for your code to run. So make sure your newly created cluster is selected in the dropdown menu, next to the green dot. To test it out, write the following in the first cell and press the play button to run it:

print("Hello World")

If all went well, you should see a new output with the string Hello World in it.

Loading data

For convenience, we've provided the dataset we need as a CSV file in an S3 bucket. So in order to load the data into Spark, you can use the following code. It will simply connect to S3, read the CSV file, and load it into a dataframe, which is the most common data structure used in data engineering:

df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("s3a://dendory-data/cneos_closeapproach_data.csv")

If all goes well, Databricks will load the data and assign the proper headers and data types for each column. If you look at the screenshot, you can see that the date column was assigned the string type, while the velocity columns were assigned the double type.
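You can confirm the inferred types yourself by printing the schema, and cast any column that wasn't inferred the way you want. Here's a minimal sketch using a tiny stand-in dataframe; on the real data you would call printSchema() on the dataframe you just loaded, and the column name and date format below are assumptions, not taken from the actual CSV:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook, a SparkSession named `spark` already exists;
# this local session is only so the sketch runs on its own.
spark = SparkSession.builder.master("local[1]").getOrCreate()

# Stand-in dataframe with a date stored as a string, like the real CSV
df = spark.createDataFrame([("2024-06-03",)], ["close_approach_date"])
df.printSchema()  # the column shows up as string

# Cast the string column to a proper date type
df = df.withColumn("close_approach_date", F.to_date("close_approach_date", "yyyy-MM-dd"))
df.printSchema()  # the column now shows up as date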

Displaying basic information

To start with, we can display some basic information, such as the number of rows in our dataset, with:

df.count()
We can also see a sample of the data, for example the first 5 rows, with:

df.show(5)
As you can see, the dataset contains 26,559 entries, and each entry has information such as the date, name, size of the object, how close it's going to come, and so on.
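To give an idea of what answering our original question might look like, here is a sketch of filtering for large objects passing nearby. The column names, units, and thresholds are illustrative assumptions; the real CSV's columns may be named differently, so adjust the filter to match your data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook, `spark` already exists; this is only for running standalone.
spark = SparkSession.builder.master("local[1]").getOrCreate()

# Hypothetical sample rows mirroring the shape of the dataset
rows = [
    ("2024-Jun-03 12:00", "(2021 AB)", 0.9, 120.0),
    ("2024-Jun-05 08:30", "(2019 XY)", 1.2, 850.0),
]
df = spark.createDataFrame(rows, ["date", "object", "dist_ld", "diameter_m"])

# Keep only objects wider than 500 m passing within 2 lunar distances
big_and_close = df.filter((F.col("diameter_m") > 500) & (F.col("dist_ld") < 2))
big_and_close.show()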

At this point, it's just a matter of doing data analysis, which is where data scientists typically come in. Since I'm more on the engineering side, that's where my work stops and where actual scientists take over.

You can refer to the Apache Spark documentation for all the available transformation functions; with those, you should be able to analyze your data in any way that fits your use case.
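As one small taste of those transformations, here is a hedged sketch of a groupBy aggregation, for instance finding the closest recorded approach per object. Again, the column names and sample values are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# `spark` already exists in Databricks; local session for standalone runs only.
spark = SparkSession.builder.master("local[1]").getOrCreate()

# Hypothetical data: multiple approaches per object, distance in lunar distances
df = spark.createDataFrame(
    [("(2021 AB)", 1.2), ("(2021 AB)", 0.8), ("(2019 XY)", 3.0)],
    ["object", "dist_ld"],
)

# Example transformation: the closest recorded approach for each object
closest = df.groupBy("object").agg(F.min("dist_ld").alias("closest_ld"))
closest.show()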