June 28, 2025

A Pybind11 Demo

This is a brief demonstration of how to use pybind11 to build a Python module from C++ source. I’ve put the example into a single C++ source file, named pbtest.cpp:

// Some examples of how to use pybind11. Note that this is 
// NOT production code. It has several issues (particularly the 
// *_sum_* functions)

#include <iostream>
#include <algorithm>  // for std::max, used in numpy_max()
#include <cmath>      // for INFINITY, used in numpy_max()
#include <string>     // for std::string, used by Pet

#include <omp.h>
#include <immintrin.h>

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;


/** An example function */
int add(int i, int j) {
    return i + j;
}


/** Sum with a single thread (scalar version) */
__attribute__ ((hot)) float numpy_sum_st(const py::array_t<float, py::array::c_style> &array)
{
    float sum = 0.0f;

    const float* data = array.data();

    const unsigned int size = array.size();

    for (unsigned int i=0; i < size; ++i)
    {
        sum += *data;
        ++data;
    }
    return sum;
}


/** Sum with a single thread (vectorized with AVX) 
 * expects the array to be aligned properly. **/
__attribute__ ((hot)) float numpy_sum_avx(const py::array_t<float, 
                                          py::array::c_style> &array)
{
    __m256 sum = _mm256_setzero_ps();

    const float* data = array.data();

    const unsigned int size = array.size();

    unsigned int i = 0;
    for (; i + 8 < size; i+=8)
    {
        _mm_prefetch(reinterpret_cast<const char*>(data + 16), _MM_HINT_T0);
        const __m256 octet = _mm256_load_ps(data);
        sum = _mm256_add_ps(sum, octet);
        data += 8;
    }

    float tail_sum = 0.0f;
    for (; i < size; ++i) {
        tail_sum += data[0];
        ++data;
    }

    union { __m256 vector; float array[8]; } val;

    val.vector = sum;

    return tail_sum + val.array[0] + val.array[1] + val.array[2] + val.array[3]
           + val.array[4] + val.array[5] + val.array[6] + val.array[7];
}


/** Sum with multiple threads, using OpenMP */
float numpy_sum_mt(const py::array_t<float, py::array::c_style> &array)
{
    float sum = 0.0f;

    const float* data = array.data();

    const unsigned int size = array.size();

#pragma omp parallel for reduction(+ : sum) schedule(static) num_threads(4)
    for (unsigned int i=0; i < size; ++i)
    {
        sum += data[i];
    }
    return sum;
}


float numpy_sum(const py::array_t<float, py::array::c_style> &array)
{
    const unsigned int size = array.size();

    float result = 0.0f;

    if (size < 40000u)
    {
        result = numpy_sum_st(array);
    }
    else
    {
        result = numpy_sum_mt(array);
    }
    return result;
}


float numpy_max(py::array_t<float, py::array::c_style> &array)
{
    float max = -INFINITY;

    const float* data = array.data();

    const unsigned int size = array.size();

    for (unsigned int i=0; i < size; ++i)
    {
        //max = std::max(array.at(i), max);
        max = std::max(*data, max);
        ++data;
    }
    return max;
}


/** An example class */
class Pet
{
public:
    enum Class { Mammal = 0, Reptile, Bird, Fish, Arachnid, Insect };

    Pet(const std::string &name) : name(name) { }

    void set_name(const std::string &name_) { name = name_; }

    const std::string& get_name() const { return name; }

    void set_age(const float age) { _age = age; }

    float get_age() const { return _age; }

    void set_species(const std::string &species) {
        std::cout << "We are in the setter for species. We can perform some checks here.\n";
        _species = species;
    }

    const std::string& get_species() const { return _species; }

    std::string name;

    float mass = 0;

    int n_years = 0;

protected:

    float _age = 0;

    std::string _species = "";
};


PYBIND11_MODULE(example, m) {

    // module docstring

    m.doc() = "pybind11 example";

    // expose a C++ function, name its parameters, and give it default parameters

    m.def("add", &add, "A function which adds two numbers",
          py::arg("i")=0, // default argument
          py::arg("j")=0
          );

    m.def("numpy_sum", &numpy_sum, "A function for summing a NumPy array.",
          py::arg("array"));

    m.def("numpy_sum_mt", &numpy_sum_mt, "A function for summing a NumPy array.",
          py::arg("array"));

    m.def("numpy_sum_st", &numpy_sum_st, "A function for summing a NumPy array.",
          py::arg("array"));

    m.def("numpy_sum_avx", &numpy_sum_avx, "A function for summing a NumPy array.",
          py::arg("array"));

    m.def("numpy_max", &numpy_max, 
          "A function for finding the value of the largest element of a NumPy array.",
          py::arg("array"));

    // cast a C++ type to a Python type and store it as a Python object

    py::object world = py::cast("World");

    // add attributes to the module

    m.attr("the_answer") = 42;
    m.attr("what") = world;

    // expose the Pet class

    py::class_<Pet> pet(m, "Pet");

    pet.def(py::init<const std::string &>())
       .def("set_name", &Pet::set_name, "docstring for set_name")
       .def("get_name", &Pet::get_name, "docstring for get_name")
       .def("set_age", &Pet::set_age, "docstring for set_age")
       .def("get_age", &Pet::get_age, "docstring for get_age")
       .def_readwrite("mass", &Pet::mass)
       .def_property("species", &Pet::get_species, &Pet::set_species, "docstring for species")
       .def("__str__", [](const Pet &p) { return "<a Pet named " + p.get_name() + ">"; });

    // add the enumeration to the Pet class

    py::enum_<Pet::Class>(pet, "Class", py::arithmetic())
        .value("Mammal",  Pet::Class::Mammal)
        .value("Reptile", Pet::Class::Reptile)
        .value("Bird",    Pet::Class::Bird)
        .value("Fish",    Pet::Class::Fish)
        .value("Arachnid", Pet::Class::Arachnid)
        .value("Insect",  Pet::Class::Insect)
        .export_values();
}

Note that the __attribute__ ((hot)) annotation is not necessary. It is just a hint telling the compiler that the associated function should be optimized aggressively because it appears in a “hot” part of the code (a part that is executed many times and accounts for much of the execution time).

Compiling

On a GNU/Linux system that has all of the dependencies installed (g++, OpenMP, pybind11), you can build this using:

g++ -Ofast -march=native -mfpmath=sse -fopenmp -shared -std=c++11 \
    -fPIC `python3 -m pybind11 --includes` pbtest.cpp \
    -o example`python3-config --extension-suffix`

strip *.so

The name of the output file will vary, depending on your version of Python and your operating system. On my system, it is called example.cpython-312-x86_64-linux-gnu.so.
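The suffix comes from Python itself, so you can also query it directly from the standard library rather than shelling out to python3-config:

```python
import sysconfig

# The same suffix that `python3-config --extension-suffix` prints.
# The shared object must carry this suffix (or plain ".so") for
# `import example` to find it.
print(sysconfig.get_config_var("EXT_SUFFIX"))
```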

Using the module

If you open a Python interpreter in the directory containing the output file, or add the directory to your PYTHONPATH, you can import the module. For example, if I cd into the directory containing the output file and start ipython, I can then enter

import example

to use the code within the module. The class is available as example.Pet and the functions are available as well. To learn more, you can check out the pybind11 documentation (or ask your favorite LLM, like ChatGPT), build the example code, and experiment.
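A short session might look like the sketch below. It assumes the compiled module is importable (built as shown above and on the path); to stay self-contained, it falls back to plain NumPy when the module is absent:

```python
import numpy as np

# Assumes the example.*.so built above is in the current directory
# or on PYTHONPATH; falls back to pure NumPy if it is not.
try:
    import example
except ImportError:
    example = None

arr = np.arange(10, dtype=np.float32)

if example is not None:
    print(example.add(2, 3))        # the C++ add() with named arguments
    print(example.numpy_sum(arr))   # hybrid sum of the array
    pet = example.Pet("Rex")
    pet.species = "dog"             # routed through the C++ setter
    print(pet)                      # uses the bound __str__
else:
    print(float(arr.sum()))         # what numpy_sum(arr) would return
```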

Benchmarks

The rest of this article deals with the performance characteristics of the various sum functions, which add the elements of a NumPy array of 32-bit floats. In addition to the different functions, two compiler optimization levels are compared: -O1 and -Ofast. The functions compared are the scalar single-threaded numpy_sum_st(), the manually-vectorized numpy_sum_avx(), the OpenMP-parallelized numpy_sum_mt(), the hybrid numpy_sum(), and NumPy's built-in numpy.sum().
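A minimal timing harness for such a comparison might look like this sketch, with numpy.sum standing in for the compiled functions (example.numpy_sum_st and friends would be timed the same way):

```python
import time
import numpy as np

def bench(fn, arr, repeats=5):
    """Return the best-of-`repeats` wall-clock time of fn(arr), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(arr)
        best = min(best, time.perf_counter() - t0)
    return best

# A float32 array, matching what the C++ functions expect.
arr = np.random.rand(100_000).astype(np.float32)
print(f"numpy.sum: {bench(np.sum, arr) * 1e6:.1f} microseconds")
```

Taking the best of several repeats reduces the influence of scheduler noise and cold caches on the reported time.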

Since the Ryzen 9 7950X3D CPU that I am using has two chiplets with different clock speeds and different amounts of level-3 cache, I used the taskset command to distribute the threads over the physical cores of the CPU in different ways when running the tests for numpy_sum_mt(): pinned to the chiplet whose cores boost to higher clock frequencies, pinned to the chiplet with the extra level-3 cache, and spread across both chiplets.
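On Linux, the same pinning can be done from inside Python with os.sched_setaffinity, which is roughly what taskset does under the hood. The core selection below is a stand-in: which logical CPUs belong to which chiplet is system-specific, so check /proc/cpuinfo or lstopo before pinning for real:

```python
import os

# Restrict this process (and any OpenMP threads it spawns) to a subset
# of cores, similar to `taskset -c 0-3 python3 ...`.
# NOTE: the first four logical CPUs are used purely as a stand-in; the
# actual chiplet-to-core mapping varies by system.
available = os.sched_getaffinity(0)
subset = set(sorted(available)[:4])
os.sched_setaffinity(0, subset)
print(sorted(os.sched_getaffinity(0)))
```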

The results are plotted below.

Scaling with array size - full view.

Zooming in to the small array regime:

Scaling with array size - Zooming in on the small-array regime.

Overall, the lightly-optimized -O1 scalar version of the code is slowest. The built-in numpy.sum() is the second worst performer, overall; for very small arrays, it actually manages to lose to the lightly-optimized scalar version. The fastest function for summing arrays shorter than about 40,000 elements is the compiler-vectorized, single-threaded -Ofast version. For very large arrays, the OpenMP parallelized version is the fastest. The overhead of using multiple threads causes the OpenMP version to be slow for short arrays. The manually-vectorized function performs nearly the same when -O1 and -Ofast optimizations are used, except when summing extremely short arrays, and it competes fairly well against the compiler-vectorized version. The best performer overall (by design) is the hybrid one, which uses the compiler-vectorized function for arrays smaller than 40,000 elements and uses the OpenMP version when summing larger arrays.

OpenMP on Ryzen 9 7950X3D

When distributing the task over multiple CPU threads, short arrays benefit from running on the chiplet whose cores are allowed to boost to higher clock frequencies, while large arrays are summed faster by the chiplet that contains the extra level-3 cache. For very large arrays, spreading the threads across both chiplets yields the best overall performance. At the largest array sizes, the processing speed is ultimately limited by system memory (RAM) throughput. The plot below shows the net data throughput rate, including the function call overhead and thread creation overhead. The effect of the larger level-3 cache on chiplet 0 is evident; the vertical dotted lines mark the L3 cache size plus the total L2 cache available on the four cores that are being used.

Net throughput rate, including function call overhead and thread creation overhead.
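The throughput values in such a plot follow from simple arithmetic: a float32 array of N elements occupies 4*N bytes, so the net rate is that byte count divided by the elapsed time. The timing below is a made-up illustration, not a measurement from the benchmark:

```python
# A float32 array of N elements occupies 4*N bytes, so net throughput
# is 4*N / elapsed_seconds. The elapsed time here is hypothetical.
N = 100_000_000          # 100 million floats = 400 MB
elapsed = 0.01           # hypothetical: 10 ms to sum the array
gb_per_s = 4 * N / elapsed / 1e9
print(f"{gb_per_s:.1f} GB/s")  # 40.0 GB/s
```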