Refactoring for Scalable Python Code With Pandas | by Charlie Shelbourne | May, 2022

Python design pattern for writing scalable data-wrangling pipelines

Photo by Siora Photography on Unsplash

Some of the beauties of Python are its flexibility and simplicity. However, these abilities are a double-edged sword. If you do not put the work in early on, to design reusable, manageable, and testable code, you will run into progress issues as your codebase scales.

When using Python’s Pandas module, it’s easy to move away from an object-oriented style of coding. A common pitfall is to write quick code, that becomes hard to test and messy to scale.

This post shows a design pattern for reusable and low maintenance code when data-wrangling with Pandas.

  1. Build a metrics pipeline with Pandas

2. Refactor pipeline to be easily extendable and testable

🐍 Codebase found here.

We’ll be using a free data set from Kaggle containing: β€œA complete list of unicorn companies in the world.”

Context

β€œA unicorn company, or unicorn startup, is a private company with a valuation over $1 billion. As of March 2022, there are 1,000 unicorns around the world. Popular former unicorns include Airbnb, Facebook and Google. Variants include a decacorn, valued at over $10 billion, and a hectocorn, valued at over $100 billion.”

πŸ“Š Kaggle Unicorn Companies dataset found here.

We will build 3 tables of Unicorn statistics using Pandas.

  1. Country-level metrics
  2. Country-level time series metrics
  3. Investor metrics

πŸ† Winning Combination of Pandas Methods

We will be using a combination of Pandas methods that breeze data manipulation and keep our pipeline looking clean.

🌍 Country Level Stats

Firstly, we will calculate the count of Unicorns per country and the average Unicorn valuation (in billions) per country.

output

πŸ“ˆ Example Plot using country_stats Output Table

The plot below is one example of a plot made with our country_stats table. We can quickly see the US leading the world in the total number of Unicorn companies.

⏳ Country Level Time Series

For these metrics, we group by country and date_joined columns, to count the number of Unicorns over time and sum the valuations.

πŸ’‘ Note I previously sorted the dataframe by date_joined.

output

βž• Cumulative Time Series

So far we have only generated time-series metrics at the point in time. However, it’s easier on the eye, to view the cumulative sum over time.

These steps take the generated time_series table and use an expanding window to calculate a cumulative sum.

output

πŸ“ˆ Example Plot Using time_series Output Table

The time series plot below is made with our cumulative results, for the number of Unicorns per country. We can see since 2020–2021 the US has reached a trajectory for producing Unicorns that could not be matched by China. Whilst, the India and UK may be just beginning their growth stages.

πŸ§‘β€πŸ’Ό Investor Stats

Generating investor metrics is more complex. Each company’s investors are stored as a coma-separated string.

For example, β€œAvant” have select_investors “RRE Ventures, Tiger Global, August Capital”.

We want to reuse the same code format, as with the country-level metrics, to make use of the pandas.DataFrame.groupby method. This will help us refactor later on.

example output

πŸ— Un-pivoting Investors

Un-pivoting is key to this design pattern, as we want to make use of the groupby method on individual investors.

Using pandas.DataFrame.explode, we generate an additional column for individual investors. Note that we now have multiple rows per company in our table.

πŸ’‘ Here I have used explode to un-pivot. Another method to check out is melt.

output

The next step, is to generate simple investor stats, company count and valuation of the companies, in each investor’s portfolio.

πŸ“ˆ Example Plot using investors_stats

The histogram below shows the distribution of investors by the total number of Unicorns in their portfolio. We see power-law type distribution where most investors have only 1 unicorn, whilst few have invested in many. This type of distribution can also be found in populational monetary wealth and social networks.

So far we have a metrics pipeline that looks fairly neat, but we are only generating a total of 8 metrics. If we were to extend this to 20–30 metics, our script would start to see a lot of repetition.

The format of our code up to now is a simple python script. Thus our code cannot be isolated and unit-tested.

Our only testing option is to run the entire script and assess the output, in an end-to-end style test. This is not great as it could take a long time to run.

πŸšͺOpen-close Principle

β€œOpen for extension but closed for modification.”

Refactor our code to follow the open-close principle as best possible.

  1. Move metrics functions to a Metrics class and enlist the use pandas.DataFrame.apply method.
  2. Remove the repeat calls to pandas.DataFrame.groupbywith a generate_metrics function, using Python’s built in getattr function.
  3. Create a metrics config file, which is passed to our generate_metrics function, with the meta-data required to generate our metrics.

πŸ’‘ Check out SOLID design principles here.

πŸ§‘πŸŽ“ Metrics Class

By shifting our metrics to a class, we can isolate the metrics, and build unit tests for each metric. By using the Pandas.DataFrame.apply method, we can add personalized metrics, and leverage other python packages that are not included in Pandas.

Following the open-close principle, if we wanted to add metrics we would create a new class that inherits our class Metrics.

πŸ“‚ Config File

The config file has a list of metrics for each table we want to generate. If we want to add or remove metrics or change naming etc, we simply change the config file. This way we are not editing our codebase itself.

βš™οΈ Generate Metrics Functions

This function takes in our Unicorn data, an instance of our Metrics class, and our metrics config, and returns a metrics dataframe.

Steps:

  1. Uses getattr to create a pandas.DataFrame.groupby.apply object
  2. Uses getattr to create a Metrics class method object (eg Metrics.count)
  3. Calls the pandas.DataFrame.groupby.apply object passing the Metrics method object

πŸͺ„ Refactored Pipeline

Finally, we arrive at our refactored pipeline. We can easily add metrics to existing tables by defining new metrics classes and adding them to our config file.

  • Standardise code in our pipeline to use pandas.DataFrame.groupby.applyto run transformations on our data.
  • Un-pivot data to give unique rows (for use with pandas.DataFrame.groupby).
  • Host the metrics we wish to generate in a class and pass them to pandas.DataFrame.groupby.apply.
  • Use Python’s built-in function getattrand a metadata dictionary, to loop through our metrics, rather than repeat calls to pandas.DataFrame.groupby.apply.

Benefits

  • Code is modular and more manageable for scaling up our metrics.
  • A config file give us more flexiblility to add and takeout metrics without touching code.
  • Easier to test functionality of our code as it has been isolated.

Tradeoffs

  • More code to maintain for a small pipeline.
  • Readability of our code has been reduced.
  • Harder to debug for a small pipeline.

🐍 Codebase found here.

πŸ“Š Kaggle Unicorn Companies dataset found here.

Third-party libraries:

Want to Connect?I'm just starting out my blogging journey. Let's Connect on Twitter!

Leave a Comment