Scalene: a high-performance, high-precision CPU and memory profiler for Python (github.com/emeryberger)
139 points by matt_d on Jan 9, 2020 | 17 comments



Sadly, relying on Python's signal handling is not enough to get robust profiling information the moment your code is spending a significant chunk of time outside of simple Python calls. This is because signals don't get delivered to the Python level until the interpreter comes back from C land, and it's possible to get stuck in C land even with pure Python code.

To wit:

  $ python -m scalene trace.py
  f1 1.3852 312499987500000
  f2 1.2420 2812499962500000
  f3 1.5018 1.5
  trace.py: % of CPU time =  33.66% out of   4.13s.
  Line  | CPU %    |  [trace.py]
     1  |          | #!/usr/bin/env python
     2  |          | 
     3  |          | import time
     4  |          | 
     5  |          | def timed(f):
     6  |          |     def f_timed(*args, **kwargs):
     7  |          |         t = time.time()
     8  |          |         r = f(*args, **kwargs)
     9  |          |         t = time.time() - t
    10  |          |         print('%s %.4f %s' % (f.__name__, t, r))
    11  |          |     return f_timed
    12  |          | 
    13  |          | @timed
    14  |          | def f1(n):
    15  |          |     s = 0
    16  |  17.99%  |     for i in range(n):
    17  |  81.29%  |         s += i
    18  |          |     return s
    19  |          | 
    20  |          | @timed
    21  |          | def f2(n):
    22  |   0.72%  |     return sum(range(n))
    23  |          | 
    24  |          | @timed
    25  |          | def f3(t):
    26  |          |     time.sleep(t)
    27  |          |     return t
    28  |          | 
    29  |          | if __name__ == '__main__':
    30  |          |     f1(25_000_000)
    31  |          |     f2(75_000_000)
    32  |          |     f3(1.5)
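
To make the mechanism concrete, here is a minimal sketch (my own illustration, independent of Scalene; the 0.1s timer and the size of the range are arbitrary). The timer fires after 0.1s, but the Python-level handler cannot run until the single C-level call to sum() returns:

  import signal
  import time

  start = time.time()

  def handler(signum, frame):
      # Runs at the next bytecode boundary, not when the timer fired.
      print('handler ran after %.2fs (timer was set for 0.1s)'
            % (time.time() - start))

  signal.signal(signal.SIGALRM, handler)
  signal.setitimer(signal.ITIMER_REAL, 0.1)

  sum(range(200_000_000))  # one long C-level call holds the handler off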


Good observation. My personal POV is that the best way to optimize your Python code is to use native code (whether as libraries or through pure Python code that is essentially a thin wrapper over C) rather than living in the interpreter. I want to see the parts of the program that are spending that time in the interpreter.

In short, a profiler that tells me that a program is spending a lot of time in C is not generally providing me particularly actionable information.

(In any event, the top-line report is that the Python part of the program only accounts for 33.66% of the execution time, which looks just about right.)


I can understand this approach, but I fundamentally disagree with it - the first duty of a profiling tool is not to mislead the user.

In this example, the program spends a third of its time just sleeping / blocked, a third on CPU but at the C level, and the remainder evaluating Python loops. Unless you already know how the profiler is implemented, that 33.66% is not easy to interpret, and the docs don't say what exactly is being profiled. Specifically, these samples aren't a % of CPU time; they're the % of real time during which a Python-level interrupt could actually be delivered, and even that isn't a great explanation. I think most users would still expect line 22 to get traced properly.

I very much do want to know about C time, too, because it's very much actionable for most of the optimizations I end up making in production systems.

That said, I don't think this line of discussion is super productive for either of us; we seem to have different goals in mind, which is fine. ;)

So I'll close by saying that I was impressed by your LD_PRELOAD hacks for memory profiling, which isn't an approach that I've ever seen in other Python profilers.


Re: being impressed by the memory profiling hacks, thanks! :)

And you persuaded me. New version to be uploaded momentarily that attributes time to line 22.

  % python3 -m scalene sample-sample.py                                        
  f1 1.8081 312499987500000
  f2 1.7910 2812499962500000
  f3 1.5034 1.5
  sample-sample.py: % of CPU time =  70.53% out of   5.10s.
    Line  | CPU %    |  [sample-sample.py]
       1  |          | #!/usr/bin/env python
       2  |          |  
       3  |          | import time
       4  |          |      
       5  |          | def timed(f):
       6  |          |     def f_timed(*args, **kwargs):
       7  |          |         t = time.time()
       8  |          |         r = f(*args, **kwargs)
       9  |   0.32%  |         t = time.time() - t
      10  |          |         print('%s %.4f %s' % (f.__name__, t, r))
      11  |          |     return f_timed
      12  |          | 
      13  |          | @timed
      14  |          | def f1(n):
      15  |          |     s = 0
      16  |   8.58%  |     for i in range(n):
      17  |  41.34%  |         s += i
      18  |          |     return s
      19  |          | 
      20  |          | @timed
      21  |          | def f2(n):
      22  |  49.76%  |     return sum(range(n))
      23  |          | 
      24  |          | @timed
      25  |          | def f3(t):
      26  |          |     time.sleep(t)
      27  |          |     return t
      28  |          | 
      29  |          | if __name__ == '__main__':
      30  |          |     f1(25000000)
      31  |          |     f2(75000000)
      32  |          |     f3(1.5)


This discussion inspired me to refine Scalene's output further. Scalene now separates out CPU time spent in Python from time spent in C (that is, not executing in the interpreter). I think this is the best of both worlds: all CPU time is accounted for, and it's easy for users to see what code is running where. Thanks!

  sample-sample.py: % of CPU time =  99.58% out of   3.34s.
           | CPU %    | CPU %    | 
    Line   | (Python) | (C)      | [sample-sample.py]
  --------------------------------------------------------------------------------
       1   |          |          | #!/usr/bin/env python
       2   |          |          |  
       3   |          |          | import time
       4   |          |          |      
       5   |          |          | def timed(f):
       6   |          |          |     def f_timed(*args, **kwargs):
       7   |          |          |         t = time.time()
       8   |          |          |         r = f(*args, **kwargs)
       9   |          |          |         t = time.time() - t
      10   |          |          |         print('%s %.4f %s' % (f.__name__, t, r))
      11   |          |          |     return f_timed
      12   |          |          | 
      13   |          |          | @timed
      14   |          |          | def f1(n):
      15   |          |          |     s = 0
      16   |   6.61%  |   0.41%  |     for i in range(n):
      17   |  35.17%  |   3.97%  |         s += i
      18   |          |          |     return s
      19   |          |          | 
      20   |          |          | @timed
      21   |          |          | def f2(n):
      22   |   0.30%  |  53.53%  |     return sum(range(n))
      23   |          |          | 
      24   |          |          | @timed
      25   |          |          | def f3(t):
      26   |          |          |     time.sleep(t)
      27   |          |          |     return t
      28   |          |          | 
      29   |          |          | if __name__ == '__main__':
      30   |          |          |     f1(25000000)
      31   |          |          |     f2(75000000)
      32   |          |          |     f3(1.5)
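
For anyone curious how a split like this might work using nothing but signals, here is a rough sketch of one possible approach (my reading of the idea, not Scalene's actual code; INTERVAL and the attribution rule are assumptions). If a sample arrives on schedule, the interpreter must have been executing bytecode, so the interval is charged to Python; any delay beyond the interval was time spent stuck in C:

  import signal
  import time
  from collections import defaultdict

  INTERVAL = 0.01                   # assumed sampling period, in seconds
  python_time = defaultdict(float)  # (file, line) -> seconds in Python
  c_time = defaultdict(float)       # (file, line) -> seconds in C
  last = time.time()

  def sample(signum, frame):
      global last
      now = time.time()
      elapsed, last = now - last, now
      where = (frame.f_code.co_filename, frame.f_lineno)
      # On-time delivery means Python bytecode was running; any delay
      # past the interval was spent inside C calls that held us off.
      # (A real timer also ticks while blocked/sleeping; the actual
      # scheme would have to be more careful than this.)
      python_time[where] += min(elapsed, INTERVAL)
      c_time[where] += max(elapsed - INTERVAL, 0.0)
      signal.setitimer(signal.ITIMER_REAL, INTERVAL)

  signal.signal(signal.SIGALRM, sample)
  signal.setitimer(signal.ITIMER_REAL, INTERVAL)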


I like it, and I appreciate that you changed your mind - I'm still staunchly in the flamegraph camp of capturing and displaying whole stacks, but now I could see using this for small code samples, especially with the memory usage information added on top.

Thanks for being good with my grumpy feedback!


Just wanted to take a moment to appreciate the awesome response to feedback and new added feature.


Yeah, love this kind of discussion and interaction on HN!


Yeah, this is some of the best HN back-and-forth I've seen!

Kudos to everyone involved.


I love the idea of fast Python profilers that don't require modified source code to run.

One profiler I used recently which isn't mentioned in the readme is pyflame (developed at uber):

https://pyflame.readthedocs.io/

pyflame likewise claims to run on unmodified source code and be fast enough to run in production, so it might be worth adding to the comparison. It generates flamegraphs, which greatly sped up debugging the other day when I needed to figure out why something was slow somewhere in a Django request-response callstack.


Similar to PyFlame, there's also py-spy: https://github.com/benfred/py-spy

> While pyflame is a great project, it doesn't support Python 3.7 yet and doesn't work on OSX, Windows or FreeBSD.

I wonder how the CPU profiling in Scalene is different; the README doesn't mention PyFlame or py-spy at all. Of course, the memory profiler is a nice extra.


Scalene author here. I was not aware of either tool - many thanks for the pointers! I just tried py-spy (I will try PyFlame on a Linux box momentarily). It's pretty cool, though having to run it as root on OS X isn't great (scalene runs without the need for root privileges; the CPU profiling part is pure Python). Py-spy does appear to efficiently track CPU perf at the line granularity (and does a lot of other stuff). It does a lot of things that scalene does not do, but not memory profiling. Also, I personally prefer scalene's line-level display of annotated source code vs. the flame graphs, but YMMV.


The possibility of running py-spy on an already-running Python program in production is pretty awesome. Do you think Scalene could do this?
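
(For reference, attaching is a one-liner; these invocations are from memory of py-spy's README, so double-check the flags:)

  $ py-spy top --pid 12345                    # live, top-like view
  $ py-spy record -o profile.svg --pid 12345  # capture a flamegraph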


You can fairly easily write helper scripts using e.g. pkgutil to automatically add profiler decorators (e.g. with kernprof) so that profiling never requires modifying code.

My team has a large body of profiling code written with kernprof and none of it modifies the underlying source. Profiler annotations are solely added automatically by the little profiler runner tooling we wrote.
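
Here is a bare-bones sketch of the idea using line_profiler (the engine behind kernprof); target_module and its main() are placeholders, and a real runner would walk packages with pkgutil.walk_packages rather than instrument a single module:

  import inspect
  from line_profiler import LineProfiler

  profiler = LineProfiler()

  def instrument(module):
      # Register every function defined in `module` with the profiler,
      # so no @profile decorators are needed in the source itself.
      for _, obj in inspect.getmembers(module, inspect.isfunction):
          if obj.__module__ == module.__name__:
              profiler.add_function(obj)

  import target_module  # placeholder for the code under test
  instrument(target_module)

  profiler.enable_by_count()
  target_module.main()  # placeholder entry point
  profiler.disable_by_count()
  profiler.print_stats()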

Not to say other profiling tools aren’t worth it.


Sadly, it seems that pyflame is no longer maintained:

> Pyflame: A Ptracing Profiler For Python. This project is deprecated and not maintained

per https://github.com/uber-archive/pyflame


Another Python profiler that isn't mentioned is pprofile [0]. It is also a statistical profiler with line-level granularity. It doesn't have the memory profiling abilities (which are pretty slick...), but it also doesn't require dynamic library injection.

I don't know how its statistical profiling speed compares to Scalene's, but it would be great to see the comparison.

Also, does anyone know how the malloc interception interacts with PyTorch?

[0] https://github.com/vpelletier/pprofile


Another similar sampling profiler which has decent performance is pyinstrument: https://github.com/joerick/pyinstrument



