A quick introduction to enums
You might first be wondering: "What's an enum?"
An enum, short for enumeration, is a "set of symbolic names (members) bound to unique values" (Python docs, 2023). Practically speaking, this means you can define and use a set of related variables under one main "class".
A simple example would be having an enum class "Color" with names like "Red", "Green", and "Blue" that you could use whenever you wanted to refer to specific colors.
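As a minimal sketch of that idea (the Color class here is purely illustrative and separate from the sales pipeline below):

from enum import Enum

class Color(Enum):
    RED = "Red"
    GREEN = "Green"
    BLUE = "Blue"

# referring to a member instead of repeating a raw string like "Red"
print(Color.RED)        # Color.RED
print(Color.RED.value)  # Red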
Next, you're probably wondering what the point is of defining some variables in a separate enum class when you could just call the names you need directly in your data processing pipelines.
Enums have a few key benefits:
- Defining enums lets you have related constants organized in one (or many) classes that can act as a source of truth for dimensions, measures, and other constants you need to call in your pipelines;
- Using enums will allow you to avoid passing invalid values in your data pipelines, assuming you correctly define and maintain the enum class;
- Enums allow users to work with a standardized set of data points and constants, which is helpful when multiple people are aggregating or creating models based on one main source of data (to help avoid having multiple definitions or aliases for the same column in the raw data source).
This sounds a little abstract, so let's take a look at how you can practically apply enums when working with data in an example of a standard pre-processing pipeline.
Using enums in your data processing pipelines
We already have our initial DataFrame, so let's start by creating a function to add a few more columns to our data by splitting the purchase address.
import pandas as pd
from pandas import DataFrame

def split_purchase_address(df_to_process: DataFrame) -> DataFrame:
    # split "Purchase Address" into street, city, and state + postal code
    df_address_split = df_to_process["Purchase Address"].str.split(",", n=3, expand=True)
    df_address_split.columns = ["Street Name", "City", "State and Postal Code"]

    # split the last part into separate state code and postal code columns
    df_state_postal_split = (
        df_address_split["State and Postal Code"]
        .str.strip()
        .str.split(" ", n=2, expand=True)
    )
    df_state_postal_split.columns = ["State Code", "Postal Code"]

    return pd.concat([df_to_process, df_address_split, df_state_postal_split], axis=1)
Next, we can apply this to our existing table by using the native Pandas pipe method like so, where we call pipe on the DataFrame and pass the function name as an argument.
processed_df = df.pipe(split_purchase_address)
Next, you'll see that the data we have is still at a very granular level, with the Order ID as the primary key of the table. When we want to aggregate the data for further analysis, we can use the groupby method in Pandas to do so.
Some code you might see in Pandas to group data on a set of columns and then do an aggregate count on one of the dimensions (in this case we'll use the Order ID) can look like this:
# groupby normally
grouped_df = (
    processed_df
    .groupby(
        ["Product", "Quantity Ordered", "Street Name", "City", "State Code", "Postal Code"]
    )
    ["Order ID"]
    .count()
    .reset_index()
    .sort_values("Order ID", ascending=False)
    .rename({"Order ID": "Count of Order IDs"}, axis=1)
)
The result of this is a new DataFrame that looks like this:
In this simple example, grouping by just six columns isn't too difficult, and we can pass a list of these columns directly to the groupby method. However, this has a few drawbacks:
- What if we were working with a much larger dataset and wanted to group by 20 columns?
- What if we had new requirements come in from end users and we needed to tweak the specific columns to group by?
- What if the underlying table changed and the names or aliases of the columns changed with it?
We can partially address some of these issues by defining the columns in an enum class. Specifically for this case, we can define the group by columns pertaining to our sales table in a new class SalesGroupByColumns as follows:
from enum import Enum

class SalesGroupByColumns(Enum):
    PRODUCT = "Product"
    QUANTITY_ORDERED = "Quantity Ordered"
    STREET_NAME = "Street Name"
    CITY = "City"
    STATE_CODE = "State Code"
    POSTAL_CODE = "Postal Code"
What we're doing here is ultimately just defining the columns as constants within a new Enum class (made available by the import from enum import Enum shown above).
Now that we have these new enum values defined, we can access individual members of the enum like this:
SalesGroupByColumns.PRODUCT        # the enum member itself
SalesGroupByColumns.PRODUCT.value  # the underlying string: 'Product'
Just calling the enum name returns the enum member, and calling value on the target member lets us access its string value directly. Now, to get all members of the enum into a list we can pass to the groupby method, we can use a list comprehension like this:
[column.value for column in SalesGroupByColumns]
Having gotten them into a list, we can assign this output to a variable and then pass that variable to our groupby method instead of passing the raw list of strings directly:
# groupby adjusted
groupby_columns = [column.value for column in SalesGroupByColumns]

grouped_df = (
    processed_df
    .groupby(groupby_columns)
    ["Order ID"]
    .count()
    .reset_index()
    .sort_values("Order ID", ascending=False)
    .rename({"Order ID": "Count of Order IDs"}, axis=1)
)

grouped_df.head()
We arrive at the same table as before, but with code that's a little cleaner to look at. The benefit of this for maintainability becomes clear if you're working on the processing pipeline over time.
For example, you might find that you want to group by a few new columns, say if you also wanted to do a little more feature engineering and create a house number and product category column to then add to the group by. You could update your enum class like this:
# what's the benefit? adding new columns!
class SalesGroupByColumns(Enum):
    PRODUCT = "Product"
    QUANTITY_ORDERED = "Quantity Ordered"
    STREET_NAME = "Street Name"
    CITY = "City"
    STATE_CODE = "State Code"
    POSTAL_CODE = "Postal Code"
    HOUSE_NUMBER = "House Number"
    PRODUCT_CATEGORY = "Product Category"

# then you run the same code as before and it would still work
Then, you wouldn't need to change your existing processing code, because the list comprehension would automatically grab all the values in the SalesGroupByColumns class and apply them to your aggregation logic.
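To make that concrete, rerunning the same list comprehension as before would now pick up the two new members automatically, since iterating over an Enum class yields its members in definition order:

groupby_columns = [column.value for column in SalesGroupByColumns]
# ['Product', 'Quantity Ordered', 'Street Name', 'City',
#  'State Code', 'Postal Code', 'House Number', 'Product Category']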
A good caveat here is that all of this only works if you know exactly what you're defining in the enum class and use the members only as intended. If you make changes here and you're grabbing all of these columns to group by in several different tables, it's important that that's actually what you intend to do.
Otherwise, you could instead define the set of enums you need for a specific table either in separate classes or, if it makes sense, in a separate list of columns (so you still avoid passing a raw list of strings to the groupby method), as in the sketch below.
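As a rough sketch of the separate-classes approach (the table and column names here are hypothetical, just to illustrate the pattern):

# one enum class per table, so changing one table's group by columns
# can't silently affect the aggregation logic for another table
class CustomersGroupByColumns(Enum):
    CUSTOMER_ID = "Customer ID"
    CITY = "City"

class ReturnsGroupByColumns(Enum):
    ORDER_ID = "Order ID"
    RETURN_REASON = "Return Reason"

customers_groupby_columns = [column.value for column in CustomersGroupByColumns]
returns_groupby_columns = [column.value for column in ReturnsGroupByColumns]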
Using enums in your data aggregation in Pandas
For another example, say we had a different case where we apply a few further transformations to our data by changing some columns' data types and creating a new total cost column. We can add to our previous pipeline like so:
def convert_numerical_column_types(df_to_process: DataFrame) -> DataFrame:
    # cast the numeric columns to their proper types
    df_to_process["Quantity Ordered"] = df_to_process["Quantity Ordered"].astype(int)
    df_to_process["Price Each"] = df_to_process["Price Each"].astype(float)
    df_to_process["Order ID"] = df_to_process["Order ID"].astype(int)
    return df_to_process

def calculate_total_order_cost(df_to_process: DataFrame) -> DataFrame:
    df_to_process["Total Cost"] = df_to_process["Quantity Ordered"] * df_to_process["Price Each"]
    return df_to_process
processed_df = (
df
.pipe(split_purchase_address)
.pipe(convert_numerical_column_types)
.pipe(calculate_total_order_cost)
)
Now that our DataFrame is transformed at an Order ID level, let's next perform another group by on a new set of columns, but this time aggregate on a few different measures:
# let's say we now have a file "SalesColumns.py"
# we can add to it
from enum import Enum

import numpy as np

class AddressColumns(Enum):
    STREET_NAME = "Street Name"
    CITY = "City"
    STATE_CODE = "State Code"
    POSTAL_CODE = "Postal Code"

class SalesMeasureColumns(Enum):
    TOTAL_COST = "Total Cost"
    QUANTITY_ORDERED = "Quantity Ordered"

# then separately we can do the groupby
groupby_columns = [column.value for column in AddressColumns]

grouped_df = (
    processed_df
    .groupby(groupby_columns)
    .agg(
        Total_Cost=(SalesMeasureColumns.TOTAL_COST.value, np.sum),
        Total_Quantity_Ordered=(SalesMeasureColumns.QUANTITY_ORDERED.value, np.sum)
    )
    .reset_index()
    .sort_values("Total_Cost", ascending=False)
)
There are a few key points to note here:
- We've defined a new set of enum classes: AddressColumns and SalesMeasureColumns. Now, for a different table where we want to group specifically on address fields, we can instead define the groupby_columns list to include those columns and later pass it to the groupby method on the transformed DataFrame.
- The SalesMeasureColumns class includes the measures that we want to aggregate. Putting column names from the raw table in the class means that if other people also want to sum the cost and quantity ordered, they call the correct columns.
We could additionally extend our earlier pipeline of chained pipes and the functions we defined before, putting this code to collect the list of columns and aggregate the table into a new function, as sketched below. Then the final code becomes easier to read, and potentially to debug and log, over time.
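A minimal sketch of that refactor, reusing the enum classes and piped functions defined above (the name aggregate_sales_by_address is just an illustrative choice, not from the original code):

def aggregate_sales_by_address(df_to_process: DataFrame) -> DataFrame:
    # collect the group by columns from the enum, then aggregate the measures
    groupby_columns = [column.value for column in AddressColumns]
    return (
        df_to_process
        .groupby(groupby_columns)
        .agg(
            Total_Cost=(SalesMeasureColumns.TOTAL_COST.value, np.sum),
            Total_Quantity_Ordered=(SalesMeasureColumns.QUANTITY_ORDERED.value, np.sum)
        )
        .reset_index()
        .sort_values("Total_Cost", ascending=False)
    )

grouped_df = (
    df
    .pipe(split_purchase_address)
    .pipe(convert_numerical_column_types)
    .pipe(calculate_total_order_cost)
    .pipe(aggregate_sales_by_address)
)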
For the aggregation, it's also possible that total cost and quantity ordered are defined differently for different tables, teams, and end users. Defining them in the SalesMeasureColumns enum means that for the sales table and these measures, all users can do the aggregation on the same columns with the same definitions.