The Sun Says: Identify Bottleneck Functions with Python cProfile!

9 min


We!
Ooh!
Gather 'round!
Meow!

Hello, I'm Munou.
Lately, I've been enjoying listening to Masahiro Sakurai's channel (he formerly worked at HAL Laboratory) in the background like a radio while tinkering with my PC.
It brought back memories of playing my 3DS, and the Nintendo Directs hosted by President Iwata that I watched back then were always fun. Going further back, when you watch documentaries about figures like Fujitsu's Toshio Ikeda, you can feel their passion. It's an almost unbelievable world, with people traveling to the US again and again in that era. The Vietnam War was going on, and the shadow of the postwar years must still have been present, so where did all that energy come from?

Introduction

I hadn't paid much attention to the execution speed of scripting languages, but running cuckooget, which I'm developing, made me curious.
In some parts, functions written in Rust are exposed to the Python interpreter via PyO3/maturin.
Put more simply, the Rust functions are packaged as a Python library, so they execute as compiled native code. The simple reasoning was that CPU-intensive work should be faster that way. It did turn out faster, but I made a big mistake here:
I took the measurements without even benchmarking the pure-Python code first.
Even 'Notes on C Programming' by Rob Pike, often called the father of Go, says:

Rule 1: You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't guess or put in a speed hack until you've proven where the bottleneck is.

If that's the case, then measure. If this is a principle even in low-level C, there's no way it wouldn't apply to higher-level layers.

I sometimes forget these things, so I strongly feel the need for regular check-ins, something like the UNIX philosophy, to discipline myself.

That's what I wrote, but being human, even knowing it, I still stray from it at times...

Measurement with cProfile

Initially, I was using a primitive method: import time, take start_time = time.time() before the function call and end_time = time.time() after it, then print(f"exec time {end_time - start_time}") (note the order; subtracting the other way around gives a negative number). But I thought, 'Why do something so inefficient with a computer!' and looked for something that could display execution time per function.
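For reference, a minimal sketch of that manual approach (busy_work is just a placeholder for whatever is being measured); time.perf_counter() is a better clock for measuring intervals than time.time():

```python
import time

def busy_work():
    # Placeholder for the function being measured
    return sum(i * i for i in range(100_000))

start_time = time.perf_counter()  # monotonic clock, suited to interval timing
busy_work()
end_time = time.perf_counter()
print(f"exec time {end_time - start_time:.6f}s")
```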
Apparently, there's something called cProfile in the standard library: Analyzing code performance with Python cProfile

In my case, I checked it as follows:

python3 -m cProfile -o profile.pstats ./main.py https://soulminingrig.com/ site

This stores the execution time of functions run by ./main.py in a binary file named profile.pstats.
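The same kind of data can also be collected in-process, without the command line. A minimal sketch using the standard cProfile and pstats modules (busy_work is a stand-in for real work, not from the original code):

```python
import cProfile
import io
import pstats

def busy_work():
    # Stand-in for the real workload
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
busy_work()
pr.disable()

# Sort by total time spent inside each function and show the top 5
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("tottime").print_stats(5)
print(buf.getvalue())
```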

Checking Function Execution Results

The article above reads the results from Python code, but I'll check them more easily by entering pstats' command-line mode.

python3 -m pstats profile.pstats

This enters pstats command-line mode, and I'll check the output results as follows:

profile.pstats% stats 5
Thu Nov  7 20:24:10 2024    profile.pstats

         48488696 function calls (47625490 primitive calls) in 51.151 seconds

   Random listing order was used
   List reduced from 3017 to 5 due to restriction <5>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1586    0.001    0.000    0.001    0.000 /home/haturatu/.local/lib/python3.12/site-packages/aiohttp/client_reqrep.py:909(__del__)
       80    0.000    0.000    0.001    0.000 /usr/lib/python3.12/copy.py:247(_reconstruct)
        4    0.000    0.000    0.005    0.001 /usr/lib/python3.12/encodings/__init__.py:71(search_function)
        3    0.000    0.000   98.745   32.915 /usr/lib/python3.12/asyncio/base_events.py:651(run_until_complete)
     1103    0.001    0.000    0.003    0.000 /usr/lib/python3.12/re/_parser.py:312(_class_escape)

You can check other commands with help.

profile.pstats% help

Documented commands (type help <topic>):
========================================
EOF  add  callees  callers  help  quit  read  reverse  sort  stats  strip

In my case, I printed the output with stats 100, copied and pasted it into vim in another terminal, and then quit.
I'll then slice up this saved text file with awk and sort.

In the case of tottime

$ awk '{ print $2","$6 }' bench | sort | column -t -s "," | tail 
0.009    /usr/lib/python3.12/re/_compiler.py:243(_optimize_charset)
0.010    /usr/lib/python3.12/concurrent/futures/_base.py:428(result)
0.012    /usr/lib/python3.12/asyncio/base_events.py:627(run_forever)
0.013    /usr/lib/python3.12/re/__init__.py:280(_compile)
0.018    /usr/lib/python3.12/re/_compiler.py:37(_compile)
0.022    /usr/lib/python3.12/concurrent/futures/_base.py:497(set_running_or_notify_cancel)
0.030    /home/haturatu/git/devgit/async_web_mirror.py:186(download_and_save_image)
0.031    /usr/lib/python3.12/re/_parser.py:512(_parse)
0.061    /usr/lib/python3.12/concurrent/futures/_base.py:537(set_result)
tottime  filename:lineno(function)

Somehow, it feels like tottime alone provides enough information for a function-by-function benchmark.

Is the Beautiful Soup 4 bottleneck just my imagination?

From measurements in other environments, the slow part seemed to be search_tag in element.py, called from bs4, which was dragging things down.
What caught my attention was this:

        if ((not self.name)
            or call_function_with_tag_data
            or (markup and self._matches(markup, self.name))
            or (not markup and self._matches(markup_name, self.name))):

It's hard to believe that a mere or expression would be the bottleneck...
However, there are only 7 places in this element.py where or is evaluated, and 3 of them are right here.
Something about this spot smells fishy.


Well, the topic has drifted to bs4, so I'll stop here.
Until next time.
