The Sun Says: Identify Bottleneck Functions with Python cProfile!
Hello, I'm Munou.
Lately, I've been enjoying having Masahiro Sakurai's channel (he's formerly of HAL Laboratory) playing in the background like a radio while I tinker with my PC.
It brought back memories of playing on my 3DS, and of the Nintendo Directs hosted by President Iwata that I watched back then, which were always fun. In the same vein, when you watch documentaries about figures like Fujitsu's Toshio Ikeda, you can feel their passion. It's astonishing that people traveled to the US so many times in that era. The Vietnam War was going on, and the shadow of the post-World War II years must have still lingered, so where did all that energy come from?
Introduction
I hadn't paid much attention to the execution speed of scripting languages, but running cuckooget, which I'm developing, made me curious.
Some parts use PyO3/maturin to let functions written in Rust be executed within the interpreter.
More simply, functions written in Rust are turned into a Python library, and calls to them run as compiled binary code. The simple reasoning was that CPU-intensive processing ought to be faster that way. It did turn out faster, but there was a big mistake here.
I had taken my measurements without even benchmarking the Python-only version first.
Even "Notes on Programming in C" by Rob Pike, one of the creators of Go, says:
Rule 1. You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is.
In other words: measure. And if this is a principle in low-level C, there's no way it wouldn't apply to higher-level layers.
I sometimes forget these things, so I strongly feel I need something like the UNIX philosophy to check myself against regularly and stay disciplined.
That's what I wrote, but being human, even knowing it, I still stray from it...
Measurement with cProfile
Initially, I was using a primitive method: import time, take start_time = time.time() before running the function and end_time = time.time() after it, then print(f"exec time {end_time - start_time}"). But I thought, 'Why am I doing something this inefficient on a computer!' and looked for something that could display the execution time of each function.
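For reference, that primitive method looks roughly like this. slow_function is a made-up stand-in for whatever is being timed, and time.perf_counter() is the more appropriate clock than time.time() for measuring elapsed time:

```python
import time

def slow_function():
    # hypothetical stand-in for the function being timed
    total = 0
    for i in range(1_000_000):
        total += i
    return total

# time.perf_counter() is a monotonic clock, better suited to
# elapsed-time measurement than time.time()
start_time = time.perf_counter()
slow_function()
end_time = time.perf_counter()
print(f"exec time {end_time - start_time:.4f}s")  # end minus start
```

This works, but it only measures the one function you wrapped by hand, which is exactly why a per-function profiler is the better tool.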
Apparently, the standard library includes something called cProfile.
In my case, I checked it as follows:
python3 -m cProfile -o profile.pstats ./main.py https://soulminingrig.com/ site
This stores the execution time of functions run by ./main.py in a binary file named profile.pstats.
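Incidentally, the same kind of profile file can also be produced from inside the code rather than via the -m cProfile flag. A minimal sketch, with work() standing in for the real workload:

```python
import cProfile

def work():
    # hypothetical stand-in for the script's real workload
    return sum(i * i for i in range(200_000))

# Same effect as `python3 -m cProfile -o profile.pstats ./main.py ...`,
# but driven from inside the code: cProfile.Profile has been usable
# as a context manager since Python 3.8, and dump_stats() writes the
# same binary format that -o produces
with cProfile.Profile() as pr:
    work()
pr.dump_stats("profile.pstats")
```

Either way, you end up with a binary stats file that pstats can read.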
Checking Function Execution Results
The site I referenced did this from a Python script, but I'll take the easier route of entering pstats's command-line mode.
python3 -m pstats profile.pstats
This drops you into pstats command-line mode, and I check the output like so:
profile.pstats% stats 5
Thu Nov 7 20:24:10 2024 profile.pstats
48488696 function calls (47625490 primitive calls) in 51.151 seconds
Random listing order was used
List reduced from 3017 to 5 due to restriction <5>
ncalls tottime percall cumtime percall filename:lineno(function)
1586 0.001 0.000 0.001 0.000 /home/haturatu/.local/lib/python3.12/site-packages/aiohttp/client_reqrep.py:909(__del__)
80 0.000 0.000 0.001 0.000 /usr/lib/python3.12/copy.py:247(_reconstruct)
4 0.000 0.000 0.005 0.001 /usr/lib/python3.12/encodings/__init__.py:71(search_function)
3 0.000 0.000 98.745 32.915 /usr/lib/python3.12/asyncio/base_events.py:651(run_until_complete)
1103 0.001 0.000 0.003 0.000 /usr/lib/python3.12/re/_parser.py:312(_class_escape)
You can check other commands with help.
profile.pstats% help
Documented commands (type help <topic>):
========================================
EOF add callees callers help quit read reverse sort stats strip
In my case, I printed the results with stats 100, opened vim in another terminal, pasted them in, saved, and quit.
Then I slice up that saved text file with awk and sort.
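For what it's worth, the same tottime ranking can be pulled without awk at all, since pstats can sort and truncate by itself. A sketch (work() is just a stand-in workload so the snippet runs on its own; in the article's case profile.pstats already exists from the earlier run):

```python
import cProfile
import pstats

def work():
    # tiny stand-in workload so this sketch runs on its own
    return sum(i * i for i in range(100_000))

# regenerate a small profile.pstats for the example
cProfile.runctx("work()", globals(), locals(), "profile.pstats")

# sort by tottime and print the top 5 rows -- inside pstats
# command-line mode the equivalent is `sort tottime` then `stats 5`
stats = pstats.Stats("profile.pstats")
stats.sort_stats("tottime").print_stats(5)
```

Still, dumping everything to a text file and attacking it with awk has its own charm.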
In the case of tottime
$ awk '{ print $2","$6 }' bench | sort | column -t -s "," | tail
0.009 /usr/lib/python3.12/re/_compiler.py:243(_optimize_charset)
0.010 /usr/lib/python3.12/concurrent/futures/_base.py:428(result)
0.012 /usr/lib/python3.12/asyncio/base_events.py:627(run_forever)
0.013 /usr/lib/python3.12/re/__init__.py:280(_compile)
0.018 /usr/lib/python3.12/re/_compiler.py:37(_compile)
0.022 /usr/lib/python3.12/concurrent/futures/_base.py:497(set_running_or_notify_cancel)
0.030 /home/haturatu/git/devgit/async_web_mirror.py:186(download_and_save_image)
0.031 /usr/lib/python3.12/re/_parser.py:512(_parse)
0.061 /usr/lib/python3.12/concurrent/futures/_base.py:537(set_result)
tottime filename:lineno(function)
Honestly, tottime alone (the time spent inside the function itself, excluding sub-calls, whereas cumtime includes them) feels like enough information for a function-by-function benchmark.
Is the Beautiful Soup 4 bottleneck just my imagination?
What seemed slow in other environments as well was search_tag in element.py, called from bs4, dragging things down.
What caught my attention was this:
if ((not self.name)
or call_function_with_tag_data
or (markup and self._matches(markup, self.name))
or (not markup and self._matches(markup_name, self.name))):
It's hard to believe that an or operation would be a bottleneck...
However, or appears in only 7 places in this element.py, and 3 of them are used right here.
Something about this place smells fishy.
But the topic has drifted to bs4 now, so I'll stop here.
Until next time.