Support for Perf Maps
----------------------
-On supported platforms (as of this writing, only Linux), the runtime can take
+On supported platforms (Linux and macOS), the runtime can take
advantage of *perf map files* to make Python functions visible to an external
-profiling tool (such as `perf <https://perf.wiki.kernel.org/index.php/Main_Page>`_).
-A running process may create a file in the ``/tmp`` directory, which contains entries
-that can map a section of executable code to a name. This interface is described in the
+profiling tool (such as `perf <https://perf.wiki.kernel.org/index.php/Main_Page>`_ or
+`samply <https://github.com/mstange/samply/>`_). A running process may create a
+file in the ``/tmp`` directory, which contains entries that can map a section
+of executable code to a name. This interface is described in the
`documentation of the Linux Perf tool <https://git.kernel.org/pub/scm/linux/
kernel/git/torvalds/linux.git/tree/tools/perf/Documentation/jit-interface.txt>`_.
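+
+Each entry in such a map file is one line containing the start address and the
+size of a machine-code region (both in hexadecimal), followed by a
+human-readable name. As a minimal, purely illustrative sketch (the helper below
+is hypothetical, not part of the C API), a map file written for process *pid*
+could be read like this::
+
+    def read_perf_map(pid):
+        # Each line has the form "<start-hex> <size-hex> <name>", for example
+        # "7f5c84c12000 80 py::foo:/home/user/script.py".
+        entries = []
+        with open(f"/tmp/perf-{pid}.map") as f:
+            for line in f:
+                start, size, name = line.rstrip("\n").split(" ", 2)
+                entries.append((int(start, 16), int(size, 16), name))
+        return entries
+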
.. _perf_profiling:
-==============================================
-Python support for the Linux ``perf`` profiler
-==============================================
+========================================================
+Python support for ``perf map``-compatible profilers
+========================================================
:author: Pablo Galindo
-`The Linux perf profiler <https://perf.wiki.kernel.org>`_
-is a very powerful tool that allows you to profile and obtain
-information about the performance of your application.
-``perf`` also has a very vibrant ecosystem of tools
-that aid with the analysis of the data that it produces.
+`The Linux perf profiler <https://perf.wiki.kernel.org>`_ and
+`samply <https://github.com/mstange/samply>`_ are powerful tools that allow you to
+profile and obtain information about the performance of your application.
+Both tools have vibrant ecosystems that aid with the analysis of the data they produce.
-The main problem with using the ``perf`` profiler with Python applications is that
-``perf`` only gets information about native symbols, that is, the names of
+The main problem with using these profilers with Python applications is that
+they only get information about native symbols, that is, the names of
functions and procedures written in C. This means that the names and file names
-of Python functions in your code will not appear in the output of ``perf``.
+of Python functions in your code will not appear in the profiler output.
Since Python 3.12, the interpreter can run in a special mode that allows Python
-functions to appear in the output of the ``perf`` profiler. When this mode is
+functions to appear in the output of compatible profilers. When this mode is
enabled, the interpreter will interpose a small piece of code compiled on the
-fly before the execution of every Python function and it will teach ``perf`` the
+fly before the execution of every Python function and it will teach the profiler the
relationship between this piece of code and the associated Python function using
:doc:`perf map files <../c-api/perfmaps>`.
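+
+As a minimal sketch of what this mode does (assuming your build reports
+``HAVE_PERF_TRAMPOLINE`` support, see the note below), the trampoline can also
+be activated from inside a running process with
+:func:`sys.activate_stack_trampoline` and the resulting per-process map file
+inspected directly::
+
+    import os
+    import sys
+
+    sys.activate_stack_trampoline("perf")   # same machinery as -X perf
+
+    def workload():
+        return sum(range(1_000_000))
+
+    workload()
+
+    # The interpreter records entries such as "py::workload:<script>" in the
+    # map file that compatible profilers read for this process.
+    with open(f"/tmp/perf-{os.getpid()}.map") as f:
+        print(f.read())
+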
.. note::
- Support for the ``perf`` profiler is currently only available for Linux on
- select architectures. Check the output of the ``configure`` build step or
+ Support for these profilers is currently available on Linux and macOS on
+ select architectures. ``perf`` is only available on Linux, while samply
+ can be used on both Linux and macOS; on macOS, samply support requires
+ Python 3.15 or later.
+ Check the output of the ``configure`` build step or
check the output of ``python -m sysconfig | grep HAVE_PERF_TRAMPOLINE``
to see if your system is supported.
+Using the samply profiler
+-------------------------
+
+samply is a modern profiler that can be used as an alternative to ``perf``.
+It uses the same perf map files that Python generates, making it compatible
+with Python's profiling support. samply is particularly useful on macOS,
+where ``perf`` is not available.
+
+To use samply with Python, first install it following the instructions at
+https://github.com/mstange/samply, then run::
+
+ $ PYTHONPERFSUPPORT=1 samply record python my_script.py
+
+This will open a web interface where you can analyze the profiling data
+interactively. The advantages of samply are its modern web-based interface
+for analyzing profiles and the fact that it works on both Linux and macOS.
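+
+samply can also be driven non-interactively: the ``--save-only`` flag together
+with ``--output`` writes a gzipped JSON profile to disk instead of opening the
+web interface (this is how the accompanying tests exercise it). A rough sketch
+of that pattern, with an illustrative output path::
+
+    import gzip
+    import subprocess
+    import sys
+
+    out = "/tmp/profile.json.gz"
+    subprocess.run(
+        ["samply", "record", "--save-only", "-o", out,
+         sys.executable, "-X", "perf", "my_script.py"],
+        check=True,
+    )
+    with gzip.open(out, mode="rt", encoding="utf-8") as f:
+        # Python frames show up as "py::<name>:<file>" entries in the profile.
+        print("py::" in f.read())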
+
+On macOS, samply support requires Python 3.15 or later. Also note that on
+macOS, samply cannot profile signed Python executables due to macOS security
+restrictions. You can
+profile with Python binaries that you've compiled yourself, or which are
+unsigned or locally-signed (such as anything installed by Homebrew). In
+order to attach to running processes on macOS, run ``samply setup`` once (and
+every time samply is updated) to self-sign the samply binary.
+
How to enable ``perf`` profiling support
----------------------------------------
import os
-import sys
+import sysconfig
import unittest
try:
except ImportError:
raise unittest.SkipTest("requires _testinternalcapi")
+def supports_trampoline_profiling():
+ perf_trampoline = sysconfig.get_config_var("PY_HAVE_PERF_TRAMPOLINE")
+ if not perf_trampoline:
+ return False
+ return int(perf_trampoline) == 1
-if sys.platform != 'linux':
- raise unittest.SkipTest('Linux only')
-
+if not supports_trampoline_profiling():
+ raise unittest.SkipTest("perf trampoline profiling not supported")
class TestPerfMapWriting(unittest.TestCase):
def test_write_perf_map_entry(self):
--- /dev/null
+import unittest
+import subprocess
+import sys
+import sysconfig
+import os
+import pathlib
+from test import support
+from test.support.script_helper import (
+ make_script,
+)
+from test.support.os_helper import temp_dir
+
+
+if not support.has_subprocess_support:
+ raise unittest.SkipTest("test module requires subprocess")
+
+if support.check_sanitizer(address=True, memory=True, ub=True, function=True):
+ # gh-109580: Skip the test because it does crash randomly if Python is
+ # built with ASAN.
+ raise unittest.SkipTest("test crash randomly on ASAN/MSAN/UBSAN build")
+
+
+def supports_trampoline_profiling():
+ perf_trampoline = sysconfig.get_config_var("PY_HAVE_PERF_TRAMPOLINE")
+ if not perf_trampoline:
+ return False
+ return int(perf_trampoline) == 1
+
+
+if not supports_trampoline_profiling():
+ raise unittest.SkipTest("perf trampoline profiling not supported")
+
+
+def samply_command_works():
+ try:
+        cmd = ["samply", "--help"]
+        subprocess.check_output(cmd, text=True)
+ except (subprocess.SubprocessError, OSError):
+ return False
+
+    # Check that we can do a simple samply recording
+ with temp_dir() as script_dir:
+ try:
+ output_file = script_dir + "/profile.json.gz"
+ cmd = (
+ "samply",
+ "record",
+ "--save-only",
+ "--output",
+ output_file,
+ sys.executable,
+ "-c",
+ 'print("hello")',
+ )
+ env = {**os.environ, "PYTHON_JIT": "0"}
+ stdout = subprocess.check_output(
+ cmd, cwd=script_dir, text=True, stderr=subprocess.STDOUT, env=env
+ )
+ except (subprocess.SubprocessError, OSError):
+ return False
+
+ if "hello" not in stdout:
+ return False
+
+ return True
+
+
+def run_samply(cwd, *args, **env_vars):
+ env = os.environ.copy()
+ if env_vars:
+ env.update(env_vars)
+ env["PYTHON_JIT"] = "0"
+ output_file = cwd + "/profile.json.gz"
+ base_cmd = (
+ "samply",
+ "record",
+ "--save-only",
+ "-o", output_file,
+ )
+ proc = subprocess.run(
+ base_cmd + args,
+ stdout=subprocess.PIPE,
+ stderr=subprocess.PIPE,
+ env=env,
+ )
+ if proc.returncode:
+ print(proc.stderr, file=sys.stderr)
+ raise ValueError(f"Samply failed with return code {proc.returncode}")
+
+ import gzip
+ with gzip.open(output_file, mode="rt", encoding="utf-8") as f:
+ return f.read()
+
+
+@unittest.skipUnless(samply_command_works(), "samply command doesn't work")
+class TestSamplyProfilerMixin:
+    def run_samply(self, script_dir, script, activate_trampoline=True):
+ raise NotImplementedError()
+
+ def test_python_calls_appear_in_the_stack_if_perf_activated(self):
+ with temp_dir() as script_dir:
+ code = """if 1:
+ def foo(n):
+ x = 0
+ for i in range(n):
+ x += i
+
+ def bar(n):
+ foo(n)
+
+ def baz(n):
+ bar(n)
+
+ baz(10000000)
+ """
+ script = make_script(script_dir, "perftest", code)
+ output = self.run_samply(script_dir, script)
+
+ self.assertIn(f"py::foo:{script}", output)
+ self.assertIn(f"py::bar:{script}", output)
+ self.assertIn(f"py::baz:{script}", output)
+
+ def test_python_calls_do_not_appear_in_the_stack_if_perf_deactivated(self):
+ with temp_dir() as script_dir:
+ code = """if 1:
+ def foo(n):
+ x = 0
+ for i in range(n):
+ x += i
+
+ def bar(n):
+ foo(n)
+
+ def baz(n):
+ bar(n)
+
+ baz(10000000)
+ """
+ script = make_script(script_dir, "perftest", code)
+ output = self.run_samply(
+ script_dir, script, activate_trampoline=False
+ )
+
+ self.assertNotIn(f"py::foo:{script}", output)
+ self.assertNotIn(f"py::bar:{script}", output)
+ self.assertNotIn(f"py::baz:{script}", output)
+
+
+@unittest.skipUnless(samply_command_works(), "samply command doesn't work")
+class TestSamplyProfiler(unittest.TestCase, TestSamplyProfilerMixin):
+ def run_samply(self, script_dir, script, activate_trampoline=True):
+ if activate_trampoline:
+ return run_samply(script_dir, sys.executable, "-Xperf", script)
+ return run_samply(script_dir, sys.executable, script)
+
+ def setUp(self):
+ super().setUp()
+ self.perf_files = set(pathlib.Path("/tmp/").glob("perf-*.map"))
+
+ def tearDown(self) -> None:
+ super().tearDown()
+ files_to_delete = (
+ set(pathlib.Path("/tmp/").glob("perf-*.map")) - self.perf_files
+ )
+ for file in files_to_delete:
+ file.unlink()
+
+ def test_pre_fork_compile(self):
+ code = """if 1:
+ import sys
+ import os
+ import sysconfig
+ from _testinternalcapi import (
+ compile_perf_trampoline_entry,
+ perf_trampoline_set_persist_after_fork,
+ )
+
+ def foo_fork():
+ pass
+
+ def bar_fork():
+ foo_fork()
+
+ def foo():
+ import time; time.sleep(1)
+
+ def bar():
+ foo()
+
+ def compile_trampolines_for_all_functions():
+ perf_trampoline_set_persist_after_fork(1)
+ for _, obj in globals().items():
+ if callable(obj) and hasattr(obj, '__code__'):
+ compile_perf_trampoline_entry(obj.__code__)
+
+ if __name__ == "__main__":
+ compile_trampolines_for_all_functions()
+ pid = os.fork()
+ if pid == 0:
+ print(os.getpid())
+ bar_fork()
+ else:
+ bar()
+ """
+
+ with temp_dir() as script_dir:
+ script = make_script(script_dir, "perftest", code)
+ env = {**os.environ, "PYTHON_JIT": "0"}
+ with subprocess.Popen(
+ [sys.executable, "-Xperf", script],
+ universal_newlines=True,
+ stderr=subprocess.PIPE,
+ stdout=subprocess.PIPE,
+ env=env,
+ ) as process:
+ stdout, stderr = process.communicate()
+
+ self.assertEqual(process.returncode, 0)
+ self.assertNotIn("Error:", stderr)
+ child_pid = int(stdout.strip())
+ perf_file = pathlib.Path(f"/tmp/perf-{process.pid}.map")
+ perf_child_file = pathlib.Path(f"/tmp/perf-{child_pid}.map")
+ self.assertTrue(perf_file.exists())
+ self.assertTrue(perf_child_file.exists())
+
+ perf_file_contents = perf_file.read_text()
+ self.assertIn(f"py::foo:{script}", perf_file_contents)
+ self.assertIn(f"py::bar:{script}", perf_file_contents)
+ self.assertIn(f"py::foo_fork:{script}", perf_file_contents)
+ self.assertIn(f"py::bar_fork:{script}", perf_file_contents)
+
+ child_perf_file_contents = perf_child_file.read_text()
+ self.assertIn(f"py::foo_fork:{script}", child_perf_file_contents)
+ self.assertIn(f"py::bar_fork:{script}", child_perf_file_contents)
+
+ # Pre-compiled perf-map entries of a forked process must be
+ # identical in both the parent and child perf-map files.
+ perf_file_lines = perf_file_contents.split("\n")
+ for line in perf_file_lines:
+ if f"py::foo_fork:{script}" in line or f"py::bar_fork:{script}" in line:
+ self.assertIn(line, child_perf_file_contents)
+
+
+if __name__ == "__main__":
+ unittest.main()
Billy G. Allie
Jamiel Almeida
Kevin Altis
+Nazım Can Altınova
Samy Lahfa
Skyler Leigh Amador
Joe Amenta
--- /dev/null
+Add support for the perf trampoline on macOS, allowing profilers with JIT map
+support to resolve Python calls. Set ``PYTHONPERFSUPPORT=1`` while profiling to
+enable the trampoline.
.text
+#if defined(__APPLE__)
+ .globl __Py_trampoline_func_start
+#else
.globl _Py_trampoline_func_start
+#endif
# The following assembly is equivalent to:
# PyObject *
# trampoline(PyThreadState *ts, _PyInterpreterFrame *f,
# {
# return evaluator(ts, f, throwflag);
# }
+#if defined(__APPLE__)
+__Py_trampoline_func_start:
+#else
_Py_trampoline_func_start:
+#endif
#ifdef __x86_64__
#if defined(__CET__) && (__CET__ & 1)
endbr64
addi sp,sp,16
jr ra
#endif
+#if defined(__APPLE__)
+ .globl __Py_trampoline_func_end
+__Py_trampoline_func_end:
+#else
.globl _Py_trampoline_func_end
_Py_trampoline_func_end:
.section .note.GNU-stack,"",@progbits
+#endif
# Note for indicating the assembly code supports CET
#if defined(__x86_64__) && defined(__CET__) && (__CET__ & 1)
.section .note.gnu.property,"a"
#ifdef PY_HAVE_PERF_TRAMPOLINE
/* Standard library includes for perf jitdump implementation */
-#include <elf.h> // ELF architecture constants
+#if defined(__linux__)
+# include <elf.h> // ELF architecture constants
+#endif
#include <fcntl.h> // File control operations
#include <stdio.h> // Standard I/O operations
#include <stdlib.h> // Standard library functions
#include <sys/types.h> // System data types
#include <unistd.h> // System calls (sysconf, getpid)
#include <sys/time.h> // Time functions (gettimeofday)
-#include <sys/syscall.h> // System call interface
+#if defined(__linux__)
+# include <sys/syscall.h> // System call interface
+#endif
// =============================================================================
// CONSTANTS AND CONFIGURATION
* based on the actual unwind information requirements.
*/
+
+/* These constants are defined in <elf.h>, which is not available outside of Linux. */
+#if !defined(__linux__)
+# if defined(__i386__) || defined(_M_IX86)
+# define EM_386 3
+# elif defined(__arm__) || defined(_M_ARM)
+# define EM_ARM 40
+# elif defined(__x86_64__) || defined(_M_X64)
+# define EM_X86_64 62
+# elif defined(__aarch64__)
+# define EM_AARCH64 183
+# elif defined(__riscv)
+# define EM_RISCV 243
+# endif
+#endif
+
/* Convenient access to the global trampoline API state */
#define trampoline_api _PyRuntime.ceval.perf.trampoline_api
typedef struct {
struct BaseEvent base; // Common event header
uint32_t process_id; // Process ID where code was generated
- uint32_t thread_id; // Thread ID where code was generated
+ uint64_t thread_id; // Thread ID where code was generated
uint64_t vma; // Virtual memory address where code is loaded
uint64_t code_address; // Address of the actual machine code
uint64_t code_size; // Size of the machine code in bytes
return NULL; // Failed to get page size
}
+#if defined(__APPLE__)
+ // On macOS, samply uses a preload to find jitdumps and this mmap can be slow.
+ perf_jit_map_state.mapped_buffer = NULL;
+#else
/*
* Map the first page of the jitdump file
*
close(fd);
return NULL; // Memory mapping failed
}
+#endif
perf_jit_map_state.mapped_size = page_size;
ev.base.size = sizeof(ev) + (name_length+1) + size;
ev.base.time_stamp = get_current_monotonic_ticks();
ev.process_id = getpid();
+#if defined(__APPLE__)
+ pthread_threadid_np(NULL, &ev.thread_id);
+#else
ev.thread_id = syscall(SYS_gettid); // Get thread ID via system call
+#endif
ev.vma = base; // Virtual memory address
ev.code_address = base; // Same as VMA for our use case
ev.code_size = size;
perf_trampoline=yes ;; #(
aarch64-linux-gnu) :
perf_trampoline=yes ;; #(
+ darwin) :
+ perf_trampoline=yes ;; #(
*) :
perf_trampoline=no
;;
esac
AC_MSG_RESULT([$SHLIBS])
-dnl perf trampoline is Linux specific and requires an arch-specific
+dnl perf trampoline is Linux and macOS specific and requires an arch-specific
dnl trampoline in assembly.
AC_MSG_CHECKING([perf trampoline])
AS_CASE([$PLATFORM_TRIPLET],
[x86_64-linux-gnu], [perf_trampoline=yes],
[aarch64-linux-gnu], [perf_trampoline=yes],
+ [darwin], [perf_trampoline=yes],
[perf_trampoline=no]
)
AC_MSG_RESULT([$perf_trampoline])