Pandas vs. Polars - A Data Workflow Showdown

If you've ever handled data operations in Python, you've likely used the pandas package for your data manipulation needs. Whether it's reading data, storing it in convenient DataFrames, or performing various data exploration tasks, pandas has been a key tool in most data projects. But have you heard of Polars? It can handle almost everything pandas does, and it's designed to be significantly faster, which can save you a lot of time when working with large datasets.


Is it genuinely that fast, and is it worth switching from pandas, which has become muscle memory for many of us and is almost synonymous with data manipulation in Python? I had these questions myself when I began delving into Polars and weighing whether a switch might be beneficial. To convince myself (and perhaps many of you reading this), I spent the past couple of weeks exploring the Polars API and ran a benchmarking exercise comparing the time taken by Polars and pandas for common data manipulation operations on the same dataset. For this exercise I used a modestly sized sample dataset from the Maven Analytics Free Data Repository: Mexico Toy Sales. It comprises 829,262 rows and 14 attributes, covering sales and inventory data for a fictitious toy store chain based in Mexico. Using this data, I assessed the performance of both pandas and Polars on core data exploration tasks, timing each operation with Python's %%timeit magic command over 10 runs (-r10), each looping the operation 10 times (-n10). Here's how they performed on these 8 core data manipulation tasks:


| Task | Pandas | Polars |
| --- | --- | --- |
| Reading Data | 461 ms ± 78.2 ms | 208 ms ± 56.1 ms |
| Joining Data | 684 ms ± 50.4 ms | 290 ms ± 48.7 ms |
| Writing Data | 5.97 s ± 142 ms | 1.52 s ± 50.3 ms |
| Selecting Data Columns | 49.9 ms ± 1.43 ms | 13.8 µs ± 5.2 µs |
| Filtering | 80.1 ms ± 19.5 ms | 41.9 ms ± 1.66 ms |
| Grouping and Aggregation | 170 ms ± 33.3 ms | 46.2 ms ± 1.06 ms |
| Changing Data Types | 1.48 ms ± 124 µs | 46.1 µs ± 12.7 µs |
| Sorting | 317 ms ± 53.9 ms | 388 ms ± 57.1 ms |
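For reference, each entry above was timed in a Jupyter cell like the sketch below (shown here for the reading task; the other operations were wrapped in %%timeit the same way, and the CSV path is illustrative):

%%timeit -n10 -r10
# -n10: run the statement 10 times per run; -r10: repeat for 10 runs,
# reporting mean ± standard deviation (assumes pandas is imported as pd
# in an earlier cell)
pd.read_csv("output_file_pl.csv")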

Our test case shows that Polars significantly outperformed pandas in nearly every task, often by an order of magnitude; sorting was the one task where pandas came out ahead. Here's my sample code file that runs various data exploration operations in both pandas and Polars, illustrating their speed differences. You can replicate this work to test whether Polars could save you time for your specific use case. Furthermore, I'll explain why there are such stark differences in execution times between the two packages and how, ironically, laziness makes Polars so "blazingly fast".


The following factors are often credited as the engine behind Polars' speed:


  1. Written in Rust: Polars is built using Rust, a programming language known for its speed and efficiency. This means Polars can execute tasks quickly because it doesn't need to be interpreted line-by-line like Python code.


  2. Parallelisation: Polars makes full use of your computer's CPU by splitting tasks across all available cores. It’s like having multiple workers handling different parts of a job simultaneously, making everything faster.


  3. Vectorised Query Engine: Polars uses Apache Arrow, a columnar data format, to process data in chunks. This method speeds up data handling by efficiently processing multiple data points at once.


  4. Out-of-Core Processing: Polars can work with large datasets without having to load everything into memory at once. This is useful for handling massive amounts of data that wouldn't fit in your computer's RAM (see the sketch after this list).


  5. Lazy Evaluation: Polars uses lazy evaluation, meaning it doesn’t execute each operation immediately. Instead, it waits until it has a complete picture of what needs to be done and then runs everything in one go with an optimised query plan, hence optimising the performance.
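To make the out-of-core point concrete, here's a minimal sketch using Polars' streaming engine. The file name is hypothetical, and collect(streaming=True) is the relevant flag in recent Polars releases:

import polars as pl

# Scan lazily, then collect with the streaming engine so Polars
# processes the file in batches rather than loading it all into RAM
result = (
    pl.scan_csv("very_large_file.csv")  # hypothetical large file
    .group_by("Product_Category")
    .agg(pl.col("Sale_ID").count())
    .collect(streaming=True)
)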




How Does Being Lazy Help Polars?


If your dad asks you to bring 10 items from the grocery store, and you keep running back and forth each time you hear a new item, you might be seen as hardworking, but it’s not the most efficient approach. It would be a much smarter (obvious in this case) choice to make a list of all the items first and then make a single trip to get everything at once. In simpler terms, this is exactly the difference between eager and lazy evaluation. With eager evaluation, you handle each task immediately, like making multiple trips to the store for each item. With lazy evaluation, you wait until you have a complete plan, then tackle everything in one go, which is much more efficient.


Pandas uses eager evaluation, meaning it executes each line of code as it encounters it. In contrast, Polars supports both eager and lazy evaluation. With lazy evaluation, Polars builds an optimised query plan and executes all the operations together in a single, efficient step.


In pandas, this four-step data exploration process (reading data, filtering, grouping and aggregating, then sorting) executes in the order written, as the interpreter encounters each step:


import pandas as pd

# a) Read the CSV file into a pandas DataFrame
df = pd.read_csv("output_file_pl.csv")

# b) Filter the DataFrame
filtered_df = df[df['Store_City'] == "Cuidad de Mexico"]

# c) Group by 'Product_Category' and count 'Sale_ID's
grouped_df = (filtered_df
               .groupby('Product_Category')
               .agg(Total_Transactions=('Sale_ID', 'count'))
               .reset_index())

# d) Sort by the count of transactions in descending order
sorted_df = grouped_df.sort_values(by='Total_Transactions', ascending=False)
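For comparison, here's the same pipeline in Polars' eager API, a sketch mirroring the pandas steps: read_csv() loads the file immediately, and each chained method runs as soon as it's called.

import polars as pl

# Polars eager API: every step executes immediately, just like pandas
eager_pl_df = (
    pl.read_csv("output_file_pl.csv")
    .filter(pl.col('Store_City') == "Cuidad de Mexico")
    .group_by('Product_Category')
    .agg(pl.col('Sale_ID').count().alias("Total Transactions"))
    .sort("Total Transactions", descending=True)
)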

However, with lazy evaluation using scan_csv(), as in the code below, execution is deferred until the whole pipeline has been described. Only when collect() is called does Polars build an optimal query plan and execute everything in a single efficient pass. To learn more about the optimisations Polars applies to your query plan, read here.


# Using scan_csv (lazy API)

lazy_pl_query = (
    pl.scan_csv("output_file_pl.csv")                    # 1) Scan the CSV lazily
    .filter(pl.col('Store_City') == "Cuidad de Mexico")  # 2) Filtering
    .group_by('Product_Category')                        # 3) Grouping and Aggregation
    .agg(pl.col('Sale_ID').count().alias("Total Transactions"))
    .sort("Total Transactions", descending=True)         # 4) Sorting
)

# Nothing has executed yet; collect() builds the optimised plan and runs it
lazy_pl_df = lazy_pl_query.collect()
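If you'd rather inspect the plan as text, LazyFrame also offers an explain() method that returns the optimised plan collect() will run:

# Print the optimised logical plan as text
print(lazy_pl_query.explain())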

In my test case, when I compared the performance of eager and lazy evaluation across both pandas and Polars, the lazy evaluation API of Polars significantly outperformed the eager evaluation approach, as anticipated. See the results in the table below:

| Evaluation Methodology | Pandas | Polars |
| --- | --- | --- |
| Eager Evaluation | 1.99 s ± 52.5 ms | 830 ms ± 108 ms |
| Lazy Evaluation | - | 236 ms ± 62.3 ms |

Additionally, Polars provides a show_graph() method to visualise the query plan. Note that show_graph() belongs to the LazyFrame, so call it on the query itself rather than on the collected DataFrame:

# Show the query plan; called on the LazyFrame, not the collected DataFrame
lazy_pl_query.show_graph(optimized=True)

Here's a visual representation of how Polars executes our query in an optimised way. Notice how the filter is applied first (predicate pushdown) when optimized=True in the second image below.




[Query plan graph with optimized=False]

[Query plan graph with optimized=True]

Although my dataset wasn't large enough to fully showcase the time savings, for genuinely big data, the execution time difference between Polars and pandas could easily save you several minutes, if not hours. I hope I’ve demonstrated how Polars is outpacing pandas and provided you with enough information to start experimenting with it yourself.


Signing Off,

Yash






