The code we'll be working with in this piece is a set of Python functions that use Pandas to read in and process data. It includes a function to read the raw data in chunks, plus a few functions that perform some transformations on that raw data.
# data_processing.py
import pandas as pd
from pandas import DataFrame


def read_raw_data(file_path: str, chunk_size: int = 1000) -> DataFrame:
    csv_reader = pd.read_csv(file_path, chunksize=chunk_size)
    processed_chunks = []

    for chunk in csv_reader:
        # drop repeated header rows and rows with missing values
        chunk = chunk.loc[chunk["Order ID"] != "Order ID"].dropna()
        processed_chunks.append(chunk)

    return pd.concat(processed_chunks, axis=0)


def split_purchase_address(df_to_process: DataFrame) -> DataFrame:
    df_address_split = df_to_process["Purchase Address"].str.split(",", n=3, expand=True)
    df_address_split.columns = ["Street Name", "City", "State and Postal Code"]

    df_state_postal_split = (
        df_address_split["State and Postal Code"]
        .str.strip()
        .str.split(" ", n=2, expand=True)
    )
    df_state_postal_split.columns = ["State Code", "Postal Code"]

    return pd.concat([df_to_process, df_address_split, df_state_postal_split], axis=1)


def extract_product_pack_information(df_to_process: DataFrame) -> DataFrame:
    # pull any parenthesized pack info out of the product name
    df_to_process["Pack Information"] = (
        df_to_process["Product"].str.extract(r".*\((.*)\).*").fillna("Not Pack")
    )
    return df_to_process


def one_hot_encode_product_column(df_to_process: DataFrame) -> DataFrame:
    return pd.get_dummies(df_to_process, columns=["Product"])


def process_raw_data(file_path: str, chunk_size: int) -> DataFrame:
    df = read_raw_data(file_path=file_path, chunk_size=chunk_size)

    return (
        df.pipe(split_purchase_address)
        .pipe(extract_product_pack_information)
        .pipe(one_hot_encode_product_column)
    )
Next, we can get started with implementing our first data validation test. If you're going to follow along in a notebook or IDE, you should import the following in a new file (or in another cell in your notebook):
import pandas as pd
import numpy as np
import pytest
from pandas import DataFrame
from data_processing import (
    read_raw_data,
    split_purchase_address,
    extract_product_pack_information,
    one_hot_encode_product_column,
)
from pandas.testing import assert_series_equal, assert_index_equal
You can read more on how to actually run pytest (naming conventions for files and how tests are discovered) here, but for our case, all you need to do is create a new file called test_data_processing.py, and in your IDE, as you add to the file, you can simply run pytest, optionally with --verbose.
Quick Introduction to pytest and a Simple Data Validation Check
Pytest is a testing framework in Python that makes it easy for you to write tests for your data pipelines. You can primarily make use of the assert statement, which essentially checks whether a condition you place after assert evaluates to True or False. If it evaluates to False, it raises an AssertionError (and when used with pytest, this will cause the test to fail).
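For example, these two lines on their own illustrate the behavior:
assert 1 + 1 == 2  # condition is True, so nothing happens
assert 1 + 1 == 3  # condition is False, so this raises AssertionError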
So first, let's test something simple. All we're going to do is check whether the output of one of our functions (the first one, which reads the raw data) returns a DataFrame.
As a quick aside, you'll notice that in the original function we use the arrow (->) syntax to add type hints to the function, saying that the function should return a DataFrame. This means that if you write your function to return something other than a DataFrame, your IDE will flag it as returning an invalid output (but this won't technically break your code or prevent it from running).
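As a quick hypothetical illustration (get_row_count is made up just for this aside, not part of our pipeline):
def get_row_count(df: DataFrame) -> DataFrame:  # the hint promises a DataFrame...
    return len(df)  # ...but an int comes back, so the IDE flags it (the code still runs)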
To actually check whether the function returns a DataFrame, we'll implement a function to test the read_raw_data function and just call it test_read_raw_data.
def test_read_raw_data():
    """Testing output of raw table read-in is a DataFrame"""
    test_df = read_raw_data(file_path="Updated_sales.csv", chunk_size=1000)
    assert isinstance(test_df, DataFrame)  # checking if it's a DataFrame
In this function, we add a one-line docstring to explain that our test function is just checking whether the output is a DataFrame. Then, we assign the output of the existing read_raw_data function to a variable and use isinstance, which returns True or False depending on whether the specified object is of the type you pass in. In this case, we check whether test_df is a DataFrame.
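For example:
isinstance(pd.DataFrame(), DataFrame)  # True, an empty DataFrame is still a DataFrame
isinstance("not a frame", DataFrame)   # False, so an assert on this would fail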
We can do the same for the rest of our functions, which just take a DataFrame as input and are expected to return a DataFrame as output. Implementing it can look like this:
def test_pipe_functions_output_df():
    """Testing output of each pipe function is a DataFrame"""
    test_df = read_raw_data(file_path="Updated_sales.csv", chunk_size=1000)
    all_pipe_functions = [
        split_purchase_address,
        extract_product_pack_information,
        one_hot_encode_product_column,
    ]
    for function in all_pipe_functions:
        assert isinstance(function(test_df), DataFrame)
Note that you can also use the assert statement in a for loop, so we just go through each of the functions, passing in a DataFrame as input and checking whether the output is also a DataFrame.
Implementing fixtures in pytest for more efficient testing
You can see above that we had to write the exact same line twice in our two different test functions:
test_df = read_raw_data(file_path="Updated_sales.csv", chunk_size=1000)
This is because both test functions needed a DataFrame as input so they could check whether the output of our data processing functions is also a DataFrame. To avoid copying the same code into all of your test functions, you can use fixtures, which let you write a piece of code that pytest will let you reuse across your different tests. Doing so looks like this:
@pytest.fixture
def test_df() -> DataFrame:
    return read_raw_data(file_path="Updated_sales.csv", chunk_size=1000)


def test_read_raw_data(test_df):
    """Testing output of raw table read-in is a DataFrame"""
    assert isinstance(test_df, DataFrame)  # checking if it's a DataFrame


def test_pipe_functions_output_df(test_df):
    """Testing output of each pipe function is a DataFrame"""
    all_pipe_functions = [
        split_purchase_address,
        extract_product_pack_information,
        one_hot_encode_product_column,
    ]
    for function in all_pipe_functions:
        assert isinstance(function(test_df), DataFrame)
This time, we define test_df in a function that returns the raw DataFrame. Then, in our test functions, we just include test_df as a parameter and we can use it just as we did before.
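One optional refinement (not something our example needs): by default, pytest rebuilds a fixture for every test function that requests it, so the CSV above gets re-read several times. If that ever becomes slow, you can widen the fixture's scope:
@pytest.fixture(scope="session")
def test_df() -> DataFrame:
    # scope="session" builds the DataFrame once and shares it across all tests
    return read_raw_data(file_path="Updated_sales.csv", chunk_size=1000)
The trade-off is that all tests then share one mutable DataFrame, so a function that modifies its input in place (like our extract_product_pack_information) can leak state between tests.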
Next, let's get into checking our split_purchase_address function, which essentially outputs the same DataFrame passed as input, but with additional address columns. Our test function will look like this:
def test_split_purchase_address(test_df):
    """Testing multiple columns in output and rows unchanged"""
    split_purchase_address_df = split_purchase_address(test_df)
    assert len(split_purchase_address_df.columns) > len(test_df.columns)
    assert split_purchase_address_df.index.__len__() == test_df.index.__len__()
    assert_index_equal(split_purchase_address_df.index, test_df.index)  # using the Pandas testing module
Here, we'll check two main things:
- Does the output DataFrame have more columns than the original DataFrame?
- Does the output DataFrame have the same index as the original DataFrame?
First, we run the split_purchase_address function, passing test_df as input and assigning the result to a new variable. This gives us the output of the original function, which we can then test.
To actually do the test, we could check whether a specific column exists in the output DataFrame, but a simpler (not necessarily better) way of doing it is just checking whether the output DataFrame has more columns than the original with the assert statement. Similarly, we can assert that the length of the index is the same for both DataFrames.
You can also check the Pandas testing documentation for some built-in testing functions, but there are only a few, and they essentially just check whether two DataFrames, indexes, or Series are equal. We use the assert_index_equal function to do the same thing that we do with index.__len__().
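As a minimal sketch of the DataFrame variant (comparing a DataFrame with a copy of itself, just to show the call; the check_dtype flag lets you loosen the comparison if dtypes are allowed to differ):
from pandas.testing import assert_frame_equal

def test_frames_equal(test_df):
    """Sketch: assert_frame_equal raises AssertionError on any mismatch"""
    assert_frame_equal(test_df, test_df.copy(), check_dtype=True)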
As mentioned before, we can also check whether a DataFrame contains a specific column. We'll move on to the next function, extract_product_pack_information, which should always output the original DataFrame with an additional column called "Pack Information". Our test function will look like this:
def test_extract_product_pack_information(test_df):
    """Test specific output column in new DataFrame"""
    product_pack_df = extract_product_pack_information(test_df)
    assert "Pack Information" in product_pack_df.columns
Here, all we do is call columns again on the output of the original function, but this time check specifically whether the "Pack Information" column is in the list of columns. If for some reason we edited our original extract_product_pack_information function to return additional columns, or renamed the output column, this test would fail. This would be a good reminder to check whether whatever we used the final data for (like a machine learning model) also took that change into account.
We could then do two things:
- Make changes downstream in our code pipeline (like code that refers to the "Pack Information" column);
- Edit our tests to reflect the changes in our processing function.
Another thing we should be doing is checking whether the DataFrame returned by our functions has columns of the data types we want. For example, if we're doing calculations on numerical columns, we should check that the columns are returned as an int or float, depending on what we need.
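As a minimal sketch of what that could look like (assuming, hypothetically, that the raw data has a numeric "Price Each" column; swap in whatever numeric column your data actually has):
def test_numeric_column_dtype(test_df):
    """Sketch: check a numeric column came back with a float dtype"""
    # "Price Each" is a hypothetical column name used only for illustration
    assert test_df["Price Each"].dtype == np.dtype("float64")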
Let's test data types on our one_hot_encode_product_column function, where we do a typical feature-engineering step on one of the categorical columns in the original DataFrame. We expect all the new columns to be of the uint8 dtype (what the get_dummies function in Pandas returns by default), so we can test that like this.
def test_one_hot_encode_product_column(test_df):
    """Testing if column types are correct"""
    encoded_df = one_hot_encode_product_column(test_df)
    encoded_columns = [column for column in encoded_df.columns if "_" in column]
    for encoded_column in encoded_columns:
        assert encoded_df[encoded_column].dtype == np.dtype("uint8")
The output of the get_dummies function also happens to produce columns that contain an underscore (this, of course, could be done better by checking the exact column names, like in the previous test function where we check for a specific column).
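A sketch of that stricter variant, relying on get_dummies naming the new columns with the original column name as a prefix (so "Product_<value>"):
def test_one_hot_encoded_column_names(test_df):
    """Sketch: check the exact one-hot column names instead of just an underscore"""
    encoded_df = one_hot_encode_product_column(test_df)
    expected_columns = {f"Product_{value}" for value in test_df["Product"].unique()}
    assert expected_columns.issubset(set(encoded_df.columns))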
Here, all we're doing is looping over the target columns and checking that each of them is of the np.dtype("uint8") data type. I verified this beforehand in a notebook by checking the data type of one of the output columns with column.dtype.
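One caveat: in pandas 2.0 and later, get_dummies returns bool columns by default rather than uint8, which would make this test fail even though the encoding itself is fine. If you're on a newer version, you can pin the dtype explicitly in the original function:
def one_hot_encode_product_column(df_to_process: DataFrame) -> DataFrame:
    # pinning the dtype so the test doesn't depend on the installed pandas version
    return pd.get_dummies(df_to_process, columns=["Product"], dtype="uint8")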
Another good practice, in addition to testing the individual functions that make up your data processing and transformation pipelines, is testing the final output of your pipeline.
To do so, we'll simulate running our entire pipeline in the test, and then check the resulting DataFrame.
def test_process_raw_data(test_df):
    """Testing the final output DataFrame as a final sanity check"""
    processed_df = (
        test_df.pipe(split_purchase_address)
        .pipe(extract_product_pack_information)
        .pipe(one_hot_encode_product_column)
    )

    # check if all original columns are still in the DataFrame
    for column in test_df.columns:
        if column not in processed_df.columns:
            raise AssertionError(f"COLUMN -- {column} -- not in final DataFrame")
    assert all(
        element in list(processed_df.columns) for element in list(test_df.columns)
    )

    # check that the final DataFrame doesn't have duplicates
    assert_series_equal(
        processed_df["Order ID"].drop_duplicates(), test_df["Order ID"]
    )
Our final test_process_raw_data will check for two final things:
- Checking if the original columns are still present in the final DataFrame. This isn't always a requirement, but it may be that you want all the raw data to still be available (and untransformed) in your output. Doing so is simple: we just need to check whether each column in test_df is still present in processed_df. This time, we raise an AssertionError ourselves (similar in effect to just using an assert statement) if a column is not present. This is a good example of how you can output a specific message in your tests when needed.
- Checking that the final DataFrame doesn't have any duplicates. There are a lot of different ways you can do this; in this case, we're just using the "Order ID" column (which we expect to act like an index) and assert_series_equal to see whether the output DataFrame generated any duplicate rows. A shorter equivalent is sketched after this list.
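That shorter equivalent of the duplicates check, under the same assumption that "Order ID" should be unique:
def assert_no_duplicate_order_ids(processed_df: DataFrame) -> None:
    # no "Order ID" value should appear more than once in the final output
    assert not processed_df["Order ID"].duplicated().any()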
Checking the pytest output
For a quick look at what running pytest looks like, in your IDE just run:
pytest --verbose
Pytest will pick up the new test file with all the test functions and run them! This is a simple implementation of having a series of data validation and testing checks on your data processing pipeline. If you run the above, the output should look something like this:
You can see that our final test failed, specifically at the part of the test where we check whether all of the columns from the initial DataFrame are present in the final one. Also, the custom error message in the AssertionError we defined earlier is populating correctly: the "Product" column from our original DataFrame is not showing up in the final DataFrame (see if you can find out why based on our initial data processing functions).
There's a lot more room to improve on this testing; we just have a really simple implementation with basic testing and data validation cases. For more complex pipelines, you may want to have a lot more testing, both for your individual data processing functions and on your raw and final output DataFrames, to ensure that the data you end up using is data you can trust.