
Boost Your Data Analysis Efficiency
Save Time Looping Through DataFrames
A few months ago, I shared a tip on how to optimize your code, and I was thrilled to receive such positive feedback from the community. If you missed it, check out my post on using Python's Multiprocessing package. In that post, I demonstrated how to divide your code to run in parallel across different processes, significantly improving performance.
Today, I'm excited to share another time-saving tip that will help keep data analysts motivated and efficient in their work. Stay tuned for a practical technique that will streamline your data analysis process!

Nowadays, it's becoming increasingly common to use AI agents to assist with daily work activities. Today, anyone can access high-quality information for free and leverage it to adapt their workflow using a Large Language Model (LLM) trained specifically for their tasks.
Creating an assistant for yourself depends on several variables, which I'll be happy to explain in another post. Today, however, I want to focus on the time it takes to manipulate data, particularly the data used to train an LLM. One of the most exhausting tasks is preparing datasets for supervised machine learning components; getting a training dataset ready for ML takes a lot of effort.
Another example I've noticed lately is the time spent applying complex calculations to dataframes with 60k instances. If you have had to process large amounts of data, you've probably found yourself waiting, arms folded, for a Jupyter Notebook cell to finish executing: sometimes 10, 20, 40 seconds, or more. At first this might not seem like much, but over time, as you refactor code and consider new parameters, the waiting becomes discouraging. And trust me, it does!
So today, I will share a tip on how to iterate through a dataframe with a reasonable number of columns, making it easier to apply complex calculations to instances and cutting the time by a factor of 2-3, without using multiprocessing.
To prove this, let's conduct a small experiment with financial market data (yet again). I set aside a dataset from a variable-income asset, the mini-dollar futures contract (WDO), and applied the Relative Strength Index (RSI) calculation to the closing prices on a 5-minute timeframe. Think of the timeframe as the interval between each instance in the time series. In other words, the calculation runs over a dataframe spanning one month of trading for the asset, at 5-minute intervals.
The mathematical formula for the RSI is:
IFR = 100 - (100 / (1 + FR))
FR = MH / ML
Where:
- IFR (Índice de Força Relativa): the Portuguese name for the RSI (Relative Strength Index), an indicator that measures the relative strength of price movements.
- FR (Força Relativa): the relative strength, obtained by dividing MH, the average closing-price gain of upward movements, by ML, the average closing-price loss of downward movements.
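The formula above can be sketched in pandas like this. This is my own illustrative code, using the common 14-period rolling window and the simple-average variant of the indicator, not necessarily the exact implementation used in the experiment:

```python
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI = 100 - 100 / (1 + FR), with FR = MH / ML computed
    over a rolling window (simple-average variant)."""
    delta = close.diff()
    gains = delta.clip(lower=0)         # upward movements feed MH
    losses = -delta.clip(upper=0)       # downward movements feed ML
    mh = gains.rolling(period).mean()   # average gain
    ml = losses.rolling(period).mean()  # average loss
    fr = mh / ml
    return 100 - 100 / (1 + fr)

# Sanity check: a steadily rising series has no losses, so RSI -> 100
prices = pd.Series(range(1, 21), dtype=float)
print(rsi(prices).iloc[-1])  # 100.0
```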
I applied this calculation to a dataset of 2,506 instances. In Figure 1, I show the time it took to generate a result, and after a mere change in the code, I obtained the result in Figure 2.
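If you want to reproduce the measurement yourself, here is a minimal timing sketch using the standard library's time.perf_counter. The dataset here is random stand-in data rather than the WDO data, so the absolute numbers will differ:

```python
import time

import numpy as np
import pandas as pd

# Stand-in for the 2,506-row WDO dataset
df = pd.DataFrame({"close": np.random.rand(2_506)})

start = time.perf_counter()
for row in df.itertuples():
    _ = row.close                 # attribute access by column name
t_tuples = time.perf_counter() - start

start = time.perf_counter()
for row in df.values:
    _ = row[0]                    # positional access on a NumPy row
t_values = time.perf_counter() - start

print(f"itertuples: {t_tuples:.4f}s, values: {t_values:.4f}s")
```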


Isn't that a significant improvement? Less than half the time it used to take. I'll provide more details now.
As I mentioned before, when you're analyzing data in a dataframe, you're probably iterating over it like this:
for index, row in enumerate(df.itertuples()):
    # rest of the code here, reading values as row.column_name
And there's nothing wrong with using it like this, but, from what I understand, pandas takes quite some time to associate each row value with its respective column name.

So, instead of reading the row values with "row.column", change the way you iterate and treat each row like a list, indexed by position. To do this, start by replacing "df_test.itertuples()" with "df_test.values". Notice in Figures 3 and 4 that the first row of both is the same, but one is in dataframe format, while in the other each row is a list (a NumPy array, to be precise).
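In code, the switch looks like this. This is a small self-contained sketch; the df_test contents and column names are illustrative, not the article's actual dataset:

```python
import pandas as pd

df_test = pd.DataFrame({"open": [1.0, 2.0, 3.0], "close": [1.5, 2.5, 3.5]})

# Before: each row is a namedtuple, values read by column name
total_tuples = sum(row.close for row in df_test.itertuples())

# After: df_test.values yields plain NumPy rows, read by position
CLOSE = df_test.columns.get_loc("close")  # position looked up once
total_values = sum(row[CLOSE] for row in df_test.values)

print(total_tuples, total_values)  # 7.5 7.5
```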

You might be wondering,
"Sure, it improves processing speed, but won't using a list compromise code readability? Associating index numbers with dataframe columns might not seem like the most intuitive approach."
Fear not, as there's a solution. Instead of relying on raw index numbers, prepare a small lookup, such as a dictionary, that pairs each column name (key) with its positional index (value). This approach keeps your code clear while improving performance.
"So how does this work?"
Here's the breakdown: you still refer to the dataframe row data by column name, but behind the scenes you are effectively looking up an index and retrieving the value at that position in the row. Despite the added layer, this method proves faster than the standard way of iterating over a dataframe.
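One way to sketch that idea (again, the column names and values here are illustrative):

```python
import pandas as pd

df_test = pd.DataFrame({
    "open":  [10.0, 11.0],
    "high":  [10.5, 11.5],
    "close": [10.2, 11.2],
})

# Build the lookup once: column name (key) -> positional index (value)
col = {name: i for i, name in enumerate(df_test.columns)}

spreads = []
for row in df_test.values:
    # Reads almost like row.high and row.open, but resolves
    # to fast positional access on a NumPy row
    spreads.append(row[col["high"]] - row[col["open"]])

print(spreads)  # [0.5, 0.5]
```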
By implementing this approach, you can significantly enhance the processing speed of your dataframes without sacrificing code readability. It's a win-win situation that streamlines your data analysis workflow and boosts productivity.
Stay tuned for more insights on optimizing your Python code for efficient data analysis!