# Understanding the Transform Function in Pandas

Posted by Chris Moffitt in articles

## Introduction

One of the compelling features of pandas is that it has a rich library of methods for manipulating
data. However, there are times when it is not clear what the various functions
do and how to use them. If you are approaching a problem from an Excel mindset,
it can be difficult to translate the planned solution into the unfamiliar pandas command.
One of those “unknown” functions is the ```
transform
```

method.
Even after using pandas for a while, I have never had the chance to use this function
so I recently took some time to figure out what it is and how it could be helpful
for real world analysis. This article will walk through an example where
```
transform
```

can be used to efficiently summarize data.

## What is transform?

I have found the best coverage of this topic in Jake VanderPlas’ excellent Python Data Science Handbook. I plan to write a review on this book in the future but the short and sweet is that it is a great resource that I highly recommend.

As described in the book, ```
transform
```

is an operation used in conjunction
with ```
groupby
```

(which is one of the most useful operations in pandas). I
suspect most pandas users likely have used ```
aggregate
```

, ```
filter
```

or
```
apply
```

with ```
groupby
```

to summarize data. However, ```
transform
```

is a little more difficult to understand - especially coming from an Excel world.
Since Jake made all of his book available via jupyter notebooks it is a good place
to start to understand how transform is unique:

While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean.

With that basic definition, I will go through another example that can explain how this is useful in other instances outside of centering data.

## Problem Set

For this example, we will analyze some fictitious sales data. In order to keep the dataset small, here is a sample of 12 sales transactions for our company:

account | name | order | sku | quantity | unit price | ext price | |
---|---|---|---|---|---|---|---|

0 | 383080 | Will LLC | 10001 | B1-20000 | 7 | 33.69 | 235.83 |

1 | 383080 | Will LLC | 10001 | S1-27722 | 11 | 21.12 | 232.32 |

2 | 383080 | Will LLC | 10001 | B1-86481 | 3 | 35.99 | 107.97 |

3 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 48 | 55.82 | 2679.36 |

4 | 412290 | Jerde-Hilpert | 10005 | S1-82801 | 21 | 13.62 | 286.02 |

5 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 9 | 92.55 | 832.95 |

6 | 412290 | Jerde-Hilpert | 10005 | S1-47412 | 44 | 78.91 | 3472.04 |

7 | 412290 | Jerde-Hilpert | 10005 | S1-27722 | 36 | 25.42 | 915.12 |

8 | 218895 | Kulas Inc | 10006 | S1-27722 | 32 | 95.66 | 3061.12 |

9 | 218895 | Kulas Inc | 10006 | B1-33087 | 23 | 22.55 | 518.65 |

10 | 218895 | Kulas Inc | 10006 | B1-33364 | 3 | 72.30 | 216.90 |

11 | 218895 | Kulas Inc | 10006 | B1-20000 | -1 | 72.18 | -72.18 |

You can see in the data that the file contains 3 different orders (10001, 10005 and 10006) and that each order consists has multiple products (aka skus).

The question we would like to answer is: “What percentage of the order total does each sku represent?”

For example, if we look at order 10001 with a total of $576.12, the break down would be:

- B1-20000 = $235.83 or 40.9%
- S1-27722 = $232.32 or 40.3%
- B1-86481 = $107.97 or 18.7%

The tricky part in this calculation is that we need to get a total for each order and combine it back with the transaction level detail in order to get the percentages. In Excel, you could try to use some version of a subtotal to try to calculate the values.

## First Approach - Merging

If you are familiar with pandas, your first inclination is going to be trying to group the data into a new dataframe and combine it in a multi-step process. Here’s what that approach would look like.

Import all the modules we need and read in our data:

```
import pandas as pd
df = pd.read_excel("sales_transactions.xlsx")
```

Now that the data is in a dataframe, determining the total by order is simple with the
help of the standard ```
groupby
```

aggregation.

```
df.groupby('order')["ext price"].sum()
```

order 10001 576.12 10005 8185.49 10006 3724.49 Name: ext price, dtype: float64

Here is a simple image showing what is happening with the standard ```
groupby
```

The tricky part is figuring out how to combine this data back with the original dataframe. The first instinct is to create a new dataframe with the totals by order and merge it back with the original. We could do something like this:

```
order_total = df.groupby('order')["ext price"].sum().rename("Order_Total").reset_index()
df_1 = df.merge(order_total)
df_1["Percent_of_Order"] = df_1["ext price"] / df_1["Order_Total"]
```

account | name | order | sku | quantity | unit price | ext price | order total | Order_Total | Percent_of_Order | |
---|---|---|---|---|---|---|---|---|---|---|

0 | 383080 | Will LLC | 10001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 576.12 | 0.409342 |

1 | 383080 | Will LLC | 10001 | S1-27722 | 11 | 21.12 | 232.32 | 576.12 | 576.12 | 0.403249 |

2 | 383080 | Will LLC | 10001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 576.12 | 0.187409 |

3 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 8185.49 | 0.327330 |

4 | 412290 | Jerde-Hilpert | 10005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 8185.49 | 0.034942 |

5 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 8185.49 | 0.101759 |

6 | 412290 | Jerde-Hilpert | 10005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 8185.49 | 0.424170 |

7 | 412290 | Jerde-Hilpert | 10005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 8185.49 | 0.111798 |

8 | 218895 | Kulas Inc | 10006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 3724.49 | 0.821890 |

9 | 218895 | Kulas Inc | 10006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 3724.49 | 0.139254 |

10 | 218895 | Kulas Inc | 10006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 3724.49 | 0.058236 |

11 | 218895 | Kulas Inc | 10006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | 3724.49 | -0.019380 |

This certainly works but there are several steps needed to get the data combined in the manner we need.

## Second Approach - Using Transform

Using the original data, let’s try using ```
transform
```

and ```
groupby
```

and
see what we get:

```
df.groupby('order')["ext price"].transform('sum')
```

0 576.12 1 576.12 2 576.12 3 8185.49 4 8185.49 5 8185.49 6 8185.49 7 8185.49 8 3724.49 9 3724.49 10 3724.49 11 3724.49 dtype: float64

You will notice how this returns a different size data set from our normal ```
groupby
```

functions. Instead of only showing the totals for 3 orders, we retain the same number
of items as the original data set. That is the unique feature of using ```
transform
```

.

The final step is pretty simple:

```
df["Order_Total"] = df.groupby('order')["ext price"].transform('sum')
df["Percent_of_Order"] = df["ext price"] / df["Order_Total"]
```

account | name | order | sku | quantity | unit price | ext price | order total | Order_Total | Percent_of_Order | |
---|---|---|---|---|---|---|---|---|---|---|

0 | 383080 | Will LLC | 10001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 576.12 | 0.409342 |

1 | 383080 | Will LLC | 10001 | S1-27722 | 11 | 21.12 | 232.32 | 576.12 | 576.12 | 0.403249 |

2 | 383080 | Will LLC | 10001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 576.12 | 0.187409 |

3 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 8185.49 | 0.327330 |

4 | 412290 | Jerde-Hilpert | 10005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 8185.49 | 0.034942 |

5 | 412290 | Jerde-Hilpert | 10005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 8185.49 | 0.101759 |

6 | 412290 | Jerde-Hilpert | 10005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 8185.49 | 0.424170 |

7 | 412290 | Jerde-Hilpert | 10005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 8185.49 | 0.111798 |

8 | 218895 | Kulas Inc | 10006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 3724.49 | 0.821890 |

9 | 218895 | Kulas Inc | 10006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 3724.49 | 0.139254 |

10 | 218895 | Kulas Inc | 10006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 3724.49 | 0.058236 |

11 | 218895 | Kulas Inc | 10006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | 3724.49 | -0.019380 |

As an added bonus, you could combine into one statement if you did not want to show the individual order totals:

```
df["Percent_of_Order"] = df["ext price"] / df.groupby('order')["ext price"].transform('sum')
```

Here is a diagram to show what is happening:

After taking the time to understand ```
transform
```

, I think you will agree that this
tool can be very powerful - even if it is a unique approach as compared to the
standard Excel mindset.

## Conclusion

I am continually amazed at the power of pandas to make complex numerical manipulations
very efficient. Despite working with pandas for a while, I never took the time to figure
out how to use ```
transform.
```

Now that I understand how it works, I am sure
I will be able to use it in future analysis and hope that you will find this useful
as well.

## Comments