Console #107 -- Interview With Pablo of Bloomberg about Memray
Featuring Nexe, InfiniCache, and toxiproxy.
🤝 Sponsor - Fractal
A risk-optimized approach to building a successful startup
Fractal provides aspiring entrepreneurs with a business idea, a highly capable and complementary co-founder, capital, and ongoing support. There is no better pathway to becoming a successful startup founder than Fractal.
- Fractal de-risks the founding process, accelerating you past the biggest hurdles a new founder faces.
- Fractal is the best platform for a first-time entrepreneur to build a successful startup.
🏗️ Projects
Nexe
Nexe is a command-line utility that compiles your Node.js application into a single executable file.
language: TypeScript, stars: 11,371, issues: 110, last commit: May 05, 2022
repo: github.com/nexe/nexe
InfiniCache
InfiniCache is a first-of-its-kind, cost-effective, high-performance in-memory object cache built atop ephemeral cloud functions. InfiniCache is 31x - 96x cheaper than traditional cloud cache services.
language: Go, stars: 185, issues: 0, last commit: May 18, 2022
repo: github.com/mason-leap-lab/infinicache
site: mason-leap-lab.github.io/infinicache/
toxiproxy
Toxiproxy is a framework for simulating network conditions. It's made specifically to work in testing, CI, and development environments. It supports deterministic tampering with connections, as well as randomized chaos and customization.
language: Go, stars: 8,104, issues: 52, last commit: May 23, 2022
repo: github.com/Shopify/toxiproxy
🎤 Interview With Pablo of Bloomberg about Memray
Memray is a memory profiler for Python that was recently published as open source by Bloomberg's Python Infrastructure team. It can track memory allocations in Python code, in native extension modules, and in the Python interpreter itself. It can generate several different types of reports to help you analyze the captured memory usage data.
repo: github.com/bloomberg/memray
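To see what profiling with Memray looks like in practice, here is a minimal script with an obvious allocation hotspot, followed by Memray's documented CLI commands. The script name and contents are illustrative, not from the Memray docs.

```python
# leaky.py -- a tiny example script with an obvious allocation hotspot,
# useful as a first target for Memray (script name is just an example).

def build_table(n):
    # Allocates n small lists; this frame should dominate the memory profile.
    return [[i] * 10 for i in range(n)]

def main():
    table = build_table(100_000)
    print(len(table))

if __name__ == "__main__":
    main()

# To profile, run the script under Memray, then render a report:
#   memray run -o output.bin leaky.py
#   memray flamegraph output.bin   # writes an HTML flame graph
```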
Hey Pablo! Thanks for joining us! Let’s start with your background. Where have you worked in the past, where are you from, how did you learn to program, and what languages or frameworks do you like?
I am from the north of Spain. As I got older, I moved more and more to the south of Spain, and I finally moved to London where I live today. I studied Physics and Mathematics. Before working as a Software Engineer, I worked in academia doing a Ph.D. in general relativity. I learned to program as part of my undergraduate studies, starting with C and Fortran 77 (yeah, we are really old school in physics). I started using other languages such as C++, CUDA, and Python when I began my Ph.D. Since then, I have been using mainly C, C++, and Python for almost everything. I have also been doing quite a lot of Rust in my free time over the past three years for different physics and simulation projects.
Who or what are your biggest influences as a developer?
My biggest influences are likely my colleagues on the CPython core team. That’s people like Yury Selivanov, Carol Willing, Łukasz Langa, Guido van Rossum, Victor Stinner, and Batuhan Taskaya. Not only have I learned incredibly valuable lessons from them – ranging from very deep computer science to how to navigate the human parts of software engineering – but also I consider these people very good friends today. All of them are industry experts, as well as very generous and excellent people. I couldn't be luckier than to work together with them.
What's an opinion you have that most people don't agree with?
That software engineering is not really about writing code.
What’s your most controversial programming opinion?
That C is a beautiful language. And I will die on this hill.
What is your favorite software tool?
Not sure if I have one favorite tool, but if I had to choose one, it would certainly be a debugger, like GDB. I would have lost hundreds of hours without them!
If you could dictate that everyone in the world should read one book, what would it be?
“Gödel, Escher, Bach: An Eternal Golden Braid” by Douglas Hofstadter. Without question, this is one of the books that has most influenced the way I think and how I approach abstract concepts.
If you had to suggest 1 person developers should follow, who would it be?
Any people in the Python core team :) <spam> Also, you can follow me on Twitter at @pyblogsal. I speak a lot about Python development, compilers, linkers, and many other cool topics </spam>
If you could teach every 12-year-old in the world one thing, what would it be and why?
Being happy is something you need to learn and practice; it is not simply something that will “happen to you.”
What are you currently learning?
How to do sweep picking on an electric guitar. It’s so hard to make it sound just right!
What have you been listening to lately?
Vivaldi’s The Four Seasons: Recomposed by Max Richter. I have been playing that on repeat like crazy. That being said, I have been listening to the new single “The Foundations of Decay” from My Chemical Romance a lot this week.
How do you separate good project ideas from bad ones?
By having a lot of bad ideas, failing with grace, and learning to do good pattern-matching against those. Tearing apart bad ideas and understanding why they won’t work is also a very good exercise where you can learn a lot. I would say that is even more valuable than having good ideas. It is like being told the answer to a math problem: you will not learn how to solve it by yourself. Bad ideas are (almost) the best ideas!
What’s the funniest GitHub issue you’ve received?
Someone claimed that detecting syntax errors in code blocks that were never executed was a bad idea and was breaking their use case. It reminds me a lot of https://xkcd.com/1172/.
Why was Memray started?
We started building Memray because developers at Bloomberg came to us a few years ago asking for a good memory profiler for their Python applications. The problem is that most Python code at Bloomberg actually calls and interacts with high-performance native code written in C and C++. This is a problem for most existing profilers because they are unable to see the memory allocations that happen in C and C++, and those that can see them cannot report where this happens in the native layer.
After investigating existing alternatives, we decided that it was time for a new tool and we started to develop Memray. Memory profiling is a very complex topic and it becomes even more complex when you need to do it across many languages. But we put a lot of effort into making sure our tool was fast, flexible, and very easy to use. To do that, we utilized our expertise with Python, linkers, compilers, and low-level tools to produce a very integrated and cohesive tool.
Where did the name for Memray come from?
The internal tool we built was originally called “pensieve,” a reference to an object that appears in the Harry Potter series of books that different characters use to bring each other's memories to life. We thought there was some funny relationship between this and “computer memory.” The problem is that when we published the tool as open source, that name was already chosen in the Python Package Index (PyPI), so we needed to come up with another name. We tried every combination of Greek gods and pop culture references that had anything to do with memory or computers, but they were either already picked or were not very memorable. Out of the blue, someone suggested “Memray” and it just felt right.
Who, or what was the biggest inspiration for Memray?
Probably experiencing the frustration of hundreds of engineers at the company when they had to debug memory problems in Python. Solving that problem kept us motivated and inspired us to create the best tool we could.
Are there any overarching goals of Memray that drive design or implementation? If so, what trade-offs have been made in Memray as a consequence of these goals?
We wanted the tool to be fast, precise, and flexible. Many profilers achieve speed by doing the computations in the same memory space as the application being profiled, and they produce one single output and that’s it. But we wanted to be able to run the application once and produce multiple different outputs that can adapt to different use cases. That’s the reason we decided to just dump all the information that we captured onto the disk.
This exposed us to the very challenging problem of managing a gigantic stream of data, while keeping the tool performant and not destroying users’ hard drives in the process. While this approach sacrifices some optimizations that we could have done if we were aggregating in memory, in retrospect it has been a success, as it has allowed us to offer many, many different ways to analyze the data, thereby allowing different users to tackle their very specific problems in the way that’s most natural for them.
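The "capture once, report many ways" design described above can be sketched as follows. This is purely illustrative: the JSON-lines format, the event fields, and the two reporters are assumptions for the sake of the example, not Memray's actual on-disk format or API.

```python
# Illustrative sketch of "capture once, report many ways":
# dump every allocation event to disk, then run independent reporters
# over the same file. Not Memray's real format -- just the design idea.
import json
import tempfile

def capture(events, path):
    # Write every allocation event to disk as one JSON record per line.
    with open(path, "w") as f:
        for ev in events:
            f.write(json.dumps(ev) + "\n")

def report_total(path):
    # Reporter 1: total bytes allocated across the whole run.
    with open(path) as f:
        return sum(json.loads(line)["size"] for line in f)

def report_by_site(path):
    # Reporter 2: bytes aggregated by (hypothetical) call site.
    totals = {}
    with open(path) as f:
        for line in f:
            ev = json.loads(line)
            totals[ev["site"]] = totals.get(ev["site"], 0) + ev["size"]
    return totals

events = [
    {"site": "parse", "size": 1024},
    {"site": "parse", "size": 512},
    {"site": "render", "size": 2048},
]
path = tempfile.mktemp()
capture(events, path)
print(report_total(path))    # 3584
print(report_by_site(path))  # {'parse': 1536, 'render': 2048}
```

Because the raw events live on disk, new reporters can be added later without re-running the profiled application, which is the flexibility the answer above describes.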
What is the most challenging problem that’s been solved in Memray, so far?
To avoid having to modify how programs are executed and to ensure that there is no overhead when the profiler is deactivated (i.e., this allows users to only profile specific sections of the code and not pay for when they are not running the profiler), we had to resort to some very tricky dynamic linker techniques. Apart from the complexity of dealing with linkers and relocations, this has exposed us to some weird platform-specific behavior related to how different implementers deal with the ELF specification. For example, take a look at https://github.com/bloomberg/memray/blob/8623bfb1f329250db8b9f0acf6d6f564e22ebbb9/src/memray/_memray/elf_shenanigans.cpp#L98-L102.
We also had to deal with some tricky behavior of thread-local variables. This is because we are tracking every memory de-allocation that happens in the program. To do this, we use some thread-local variables that allow us to not recurse infinitely in our tracking functions. Unfortunately, we also “see” when these thread-local variables are de-allocated, so we had to deal with the very challenging problem of ensuring that we don’t crash or go into undefined behavior when these thread-local variables are destroyed and we track the destruction. Check out this gigantic comment on our repo to be exposed to the horrors: https://github.com/bloomberg/memray/blob/8623bfb1f329250db8b9f0acf6d6f564e22ebbb9/src/memray/_memray/tracking_api.cpp#L58-L84.
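The non-recursion technique described above is a classic thread-local reentrancy guard. The sketch below shows the idea in Python; Memray does this in C++ with real allocator hooks, and the names here (`track_allocation`, `record`) are hypothetical.

```python
# Sketch of a thread-local reentrancy guard: a per-thread flag keeps the
# tracker from recursing when its own bookkeeping triggers the hook again.
# (Illustrative only; Memray implements this in C++ around allocator hooks.)
import threading

_guard = threading.local()
captured = []

def record(size):
    # The tracker's own work -- which may itself allocate and re-enter the hook.
    captured.append(size)
    track_allocation(8)  # simulate a nested allocation made by the tracker

def track_allocation(size):
    if getattr(_guard, "active", False):
        return  # already inside the tracker on this thread: ignore the event
    _guard.active = True
    try:
        record(size)
    finally:
        _guard.active = False

track_allocation(100)
print(captured)  # [100] -- the nested event was suppressed, no infinite recursion
```

The subtlety the answer describes is what happens when the thread-local flag itself is destroyed at thread exit while the tracker is still watching de-allocations; the guard only works while the flag is alive.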
Are there any competitors or projects similar to Memray? If so, what were they lacking that made you consider building something new?
We are not competing with anyone. Memray is and will always be free. The landscape of profiling tools for Python is quite vibrant and filled with innovation and a lot of very smart people. We like to collaborate with them to ensure that we can collectively learn from each other and provide the community with the best tools possible. Competing with one another is the worst way to achieve this goal, so we don’t do it. We don’t plan to be the best profiler for Python. We just want to do what we do in the best possible way, and we believe that collaborating with our fellow maintainers is the best way to achieve this goal.
What was the most surprising thing you learned while working on Memray?
If there is anything underspecified in an official specification (like ELF or DWARF), every implementor will eventually implement every possible variation of it. And none of those things are ever documented :)
What is your typical approach to debugging issues filed in the Memray repo?
It’s very similar to any other open source project. First, we work with the contributor to ensure that we can reproduce the problem or that we, at least, have enough information to understand what’s going on. Once we understand the issue, we then discuss among ourselves what the best solution is, what the compromises are (if any), and how we are going to prioritize the issue in our planning.
This can be very challenging sometimes (like https://github.com/bloomberg/memray/discussions/66#discussioncomment-2778037) because you need to run the contributor’s application yourself, and that can be extremely tough, as many of these applications only run in very specific environments and you don’t initially have any idea of what they do or how they work. But, navigating this effectively comes with experience and, fortunately, we have quite a lot of it :)
What is the release process like for Memray?
It is automated via GitHub Actions. We push a signed tag, create a release object in GitHub, and a lot of binary artifacts for different platforms and architectures are automatically built, tested, validated, and uploaded to PyPI. It feels like magic!
Is Memray intended to eventually be monetized if it isn’t monetized already?
No, Memray is and always will be free and open source.
How do you balance your work on open source with your day job and other responsibilities?
This is a very difficult thing to do, and this is what a lot of fellow open source maintainers struggle with. Fortunately, Bloomberg is generous enough to give me 50% of my time to work on CPython and other open source work, like my collaboration with the “Faster CPython project” that Guido van Rossum is leading at Microsoft. In general, you need a very good understanding of how you work, what keeps you motivated, and very good time management skills. Getting burned out is very easy if you do a poor job of balancing your contributions to open source with your personal life.
What is the best way for a new developer to contribute to Memray?
They can go to https://github.com/bloomberg/memray and review the active issues. We normally try to describe them in as much detail as possible so anyone who has the will to help and contribute has all the context they need.
They can also ask us any questions they want in the issue tracker or the Discussions page. We are very happy to say that we had 10 new contributors to the latest release and we already have a great active community of people who are helping every single day to make Memray better. If someone is thinking about contributing, they should know that we will welcome any new contributors, plus we are very friendly and compassionate people (and we don’t bite!). So please, come and contribute to Memray! You will certainly learn something new along the way :)
If you plan to continue developing Memray, where do you see the project heading next?
We want to provide more specialized reporters and more ways to analyze the raw data that we collect. We collect a lot of data, so we’re sure there are many different ways to see and analyze this data. It would be great if users told us what they need or helped us directly to craft new cool reporters and analysis tools for Memray.
In addition, we also want to add new optimizations and ways to execute the tool to ensure that we can adapt to many use cases and situations where different compromises are needed to guarantee the best experience when profiling applications.
Are there any other projects besides Memray that you’re working on?
Apart from all the work I do for CPython development, which takes almost all my free non-work time, we also have a bunch of other cool stuff we are working on at Bloomberg that we will publish as open source in the near future. Stay tuned!
Do you have any other project ideas that you haven’t started?
Oh, plenty of them. If I only had the time… :)
Where do you see open source heading next?
I would like to see a more sustainable model for open source, where companies that use open source libraries and products directly help fund them by supporting the maintainers either economically or with infrastructure (or both). Almost every part of the industry runs in some shape or form on open source, and very little of the revenue this generates goes toward keeping these building blocks stable and sustainable. Many of these fundamental projects are run by a very limited number of developers (sometimes even a single person), and the current model is not really viable in the long term.
This situation is certainly starting to shift, so there is hope in this regard. But more and more changes are still needed if we want to ensure a healthy and sustainable relationship between users, companies, and maintainers.
Want to join the conversation about one of the projects featured this week? Drop a comment, or see what others are saying!
Interested in sponsoring the newsletter or know of any cool projects or interesting developers you want us to interview? Reach out at console.substack@gmail.com or mention us @ConsoleWeekly!