Understanding The Benefits Of Data Processing
To most people, raw data is ugly. It has bad characters, stray carriage returns, wrong data types, NULLs, and misplaced commas. The steps for transforming raw data into actionable, intelligible information are rarely written about because they are boring. Now that cloud computing is the standard rather than the exception, the need for efficient and easily understood data processing steps is only growing.
It’s not uncommon for data to move across three or four different networks before it reaches all the proper storage locations. Data lakes are also much more common now: a data lake stores data without any processing, keeping the raw data available for future analysis.
Running a data environment can be very hectic if tasks and jobs are constantly failing or not working properly. Keeping all of those jobs and tasks running is the job of a data orchestration platform, which schedules and executes the tasks that make up the ETL process.
The platform Petram utilizes is Apache Airflow. Airflow is a Python-based tool for running jobs on a schedule, with built-in integrations for numerous cloud and data services. These integrations make it simple to connect source systems of data to their destinations while transforming the data into a more usable state along the way.
Airflow’s core concept is the DAG (Directed Acyclic Graph): a collection of tasks, organized with dependencies and relationships, that describes how they should run.
For Petram, a DAG typically collects data from a source system, performs transformation steps, and finally builds a visualization.
[Graphic: a simple example of a typical DAG]
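To make that concrete, here is a minimal sketch of what such a DAG definition might look like in Airflow 2.x. The DAG id, schedule, and task bodies are placeholders for illustration, not a real Petram pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Pull raw data down from a source system (API, SFTP, file drop, etc.)
    pass


def transform():
    # Clean the raw data and verify data types
    pass


def load():
    # Write the cleaned data to its final destination (database/warehouse)
    pass


with DAG(
    dag_id="example_etl",            # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependencies: extract must finish before transform, transform before load
    extract_task >> transform_task >> load_task
```

Because the graph is acyclic, Airflow knows what order to run the tasks in and can retry or backfill an individual step without rerunning the whole pipeline.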
Data processing can be broken down into thousands of different steps and processes, but the most common way to describe it is as some variation of these three steps.
Extract
Transform
Load
Extract
Data extraction is simply taking data from one source system and landing it in another. How this is done varies greatly by source system and by company. For data available through an API, this step typically entails writing a Python script that pulls data down from the source API. For systems that don’t have an API, a .txt or .csv file is placed onto a file storage location such as S3, Google Cloud Storage, or an SFTP server, and a Python script or native Airflow operator grabs that file and kicks off the next step of the ETL process: transformation.
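As a rough illustration of the API case, here is a sketch of an extract script that pulls JSON from a source API and lands it in an S3 bucket. The URL, bucket, and key are made up, and AWS credentials are assumed to be configured in the environment.

```python
import json
from datetime import datetime, timezone

import boto3
import requests

# Hypothetical source API and landing bucket -- replace with real values
SOURCE_URL = "https://api.example.com/v1/orders"
LANDING_BUCKET = "example-raw-data"


def extract_to_s3() -> str:
    """Pull raw JSON from the source API, land it in S3, and return the object key."""
    response = requests.get(SOURCE_URL, timeout=30)
    response.raise_for_status()

    # Key the raw file by date so each run lands in its own location
    run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    key = f"raw/orders/{run_date}.json"

    s3 = boto3.client("s3")
    s3.put_object(Bucket=LANDING_BUCKET, Key=key, Body=json.dumps(response.json()))
    return key
```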
Transform
The transformation steps of an ETL process are the ones that vary the most, and it is becoming more common to skip or defer them entirely. As mentioned before, the rise of cloud computing allows raw data to be stored as-is and then transformed “on the fly” when it needs to be accessed in a report or some other visualization.
When transformation is performed, it typically involves data cleaning and data type verification. Data cleaning is the process of evaluating individual data points and weeding out bad ones, such as missing values, clearly wrong values, or generally malformed input. Once data is cleaned, it’s usually ready to be loaded into an environment.
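Here is a small sketch of what that cleaning and type verification might look like with pandas, assuming a raw CSV with made-up column names:

```python
import pandas as pd

# Assume a raw extract with "order_date", "amount", and "customer_id" columns
df = pd.read_csv("raw_orders.csv")

# Drop rows missing fields we cannot recover
df = df.dropna(subset=["customer_id", "amount"])

# Verify/coerce data types; bad values become NaT/NaN instead of breaking the load
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Strip stray whitespace and bad characters from string columns
df["customer_id"] = df["customer_id"].astype(str).str.strip()

# Discard clearly wrong values, e.g. negative amounts
df = df[df["amount"] >= 0]
```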
There are additional transformation steps that can be taken if data is to be used in a machine learning environment or a reporting layer. When designing the transformation step, it may be beneficial to build logic that prepares data for a standardized report or ML environment; these steps entail further cleaning and formatting for the reporting layer. In the case of ML, they can include turning columns into 1’s and 0’s and assigning numeric values to strings (one-hot and label encoding, respectively).
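For the ML case, pandas can do both directly; the column name here is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"region": ["east", "west", "east", "south"]})

# One-hot encode: each distinct region becomes its own 1/0 column
encoded = pd.get_dummies(df, columns=["region"], prefix="region")

# Alternatively, assign a numeric code to each distinct string value
df["region_code"] = df["region"].astype("category").cat.codes
```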
Load
Once data is clean, the load step begins. Loading data in a cloud-first world is a bit simpler than it was when most companies kept data storage on-prem. Understanding the overall architecture of the cloud data system is crucial so that data is loaded into the most efficient location. The final location for data is usually a database, often a traditional one such as Microsoft SQL Server, MySQL, or PostgreSQL.
Data can also be loaded into a nontraditional store such as Google BigQuery or Amazon Redshift (cloud data warehouses) or Couchbase (a NoSQL database). Each of these systems has advantages and disadvantages, but it’s nearly impossible to beat the speed, stability, and ease of writing data to these managed databases, largely because much of the work of setting up and connecting to them is already done and open sourced. The downside is that they can be costly, particularly when extracting data that leaves the cloud provider’s network (egress charges).
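As one example of how little code the load step can take, a cleaned DataFrame can be written to BigQuery in a few lines. The project, dataset, and table names below are placeholders, and Google Cloud credentials are assumed to be configured.

```python
import pandas as pd
from google.cloud import bigquery

# Cleaned data produced by the transform step
df = pd.read_parquet("clean_orders.parquet")

client = bigquery.Client()  # uses application-default credentials

# Placeholder destination in "project.dataset.table" form
table_id = "my-project.analytics.orders"

job = client.load_table_from_dataframe(df, table_id)
job.result()  # wait for the load job to finish
```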
Once data is loaded, the final application or user can view and utilize the data.
How Petram does this
We utilize Airflow to orchestrate tasks for our various clients. We follow the ETL process with some additions for extra security and backups. For example, we almost always store raw data files (.json, .csv, .txt) in a cloud storage bucket as well as a database to ensure data persistence. Additionally, since Airflow works best when tasks are atomic and self-contained, almost all data is moved around through cloud storage. We try to avoid passing data back and forth in memory; in general, we take data from a source and get it into our virtual private cloud as the first step, after which we begin the rest of the ETL process.
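A sketch of that pattern using Airflow’s TaskFlow API: each task reads from and writes to cloud storage and passes only an object key downstream, so the data itself never moves through Airflow’s memory. The DAG name, keys, and task bodies are placeholders.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def etl_via_cloud_storage():
    @task
    def extract() -> str:
        # Land the raw file in a cloud storage bucket (see the extract sketch above)
        # and return only its key, not the data itself.
        return "raw/orders/2023-01-01.json"  # placeholder key

    @task
    def transform(raw_key: str) -> str:
        # Read the raw object from the bucket, clean it, write the result back,
        # and again pass only the cleaned object's key downstream.
        return raw_key.replace("raw/", "clean/")

    @task
    def load(clean_key: str) -> None:
        # Load the cleaned object from the bucket into the destination database.
        ...

    load(transform(extract()))


etl_via_cloud_storage()
```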
Using Airflow and these other tools, tips, and techniques, Petram is capable of building data pipelines to process hundreds of millions of events and petabytes of data per day!