Console #100 -- Lapce, Semgrep, and Meerschaum

An Interview with Bennett of Meerschaum

Apr 10, 2022


🏗️ Projects

Lapce

Lapce is a lightning-fast and powerful code editor written in pure Rust, with the UI in Druid. It uses Xi-Editor's Rope Science for text editing and the Wgpu Graphics API for rendering.

language: Rust, stars: 9811, forks: 242, issues: 159, last commit: April 8, 2022
site: lapce.dev
repo: github.com/lapce/lapce

Semgrep

Semgrep is a fast, open-source, static analysis tool for finding bugs and enforcing code standards at editor, commit, and CI time.

language: OCaml, stars: 6346, forks: 303, issues: 351, last commit: April 8, 2022
site: semgrep.dev
repo: github.com/returntocorp/semgrep

Meerschaum

Meerschaum is a tool for quickly synchronizing time-series data streams called pipes. With Meerschaum, you can have a data visualization stack running in minutes.

language: Python, stars: 39, forks: 4, issues: 0, last commit: April 10, 2022
site: meerschaum.io
repo: github.com/bmeares/Meerschaum

🎤 Interview with Bennett of Meerschaum

Hey Bennett! Thanks for joining us! Let’s start with your background. Where have you worked in the past, where are you from, how did you learn how to program, and what languages or frameworks do you like?

I’m a software developer and data engineer from South Carolina, USA. I primarily write in Python and SQL, though I’ve worked in everything from C and C++ to JavaScript and even (God forbid) PHP.
For a number of years, while earning my undergrad and master’s in computer science, I worked as the data engineer in a research group tasked with analyzing my university’s utility data for ways to reduce carbon emissions. We needed to process several billion rows of sensor data, so I built a time-series ETL system to quickly cache the data we needed without overwhelming the sensitive production databases. Over time, I had amassed a suite of tools my coworkers relied upon, and I realized I wanted to build an open-source library to make data engineering more accessible to data scientists and analysts.
Little did I know it at the time, but tackling those problems in that initial ETL system led me down the path to studying the time-series synchronization problem for the next few years, including as the topic of my master’s thesis.

Who or what are your biggest influences as a developer?

I look up to the great developer and speaker Robert “Uncle Bob” Martin, Will McGugan (the author of Rich), Sebastián Ramírez (author of FastAPI), and my friend Casey Doran for my inspiration as a developer. Also, I owe my outlook on Meerschaum to my close friends Drew Emery, Zach Smith, and Harrison Hall and to the current developers at my previous job: Keaton Myers and Tahj Anderson to name a few.

What's an opinion you have that most people don't agree with?

Developers sometimes tend to be very dogmatic about the “correct” way to write code, and I have to admit that I sometimes fall into that line of thinking too. But the beauty of software development is that we’re allowed to “break the rules” when we need to. “Best practices” are usually good guidelines to follow, especially when working with others and unlearning bad programming habits, but they don’t have to be enforced as law.

What’s your most controversial programming opinion?

Not sure if this is controversial, but software development in Python can be just as technical as when done in more “serious” languages like C or C++. I suppose because Python is so popular and recommended as a first language, there’s a lot of bad Python code floating around, but because C was my first language in school, I’ve written a fair amount of C and C++ and can attest that there’s plenty of bad C code out there as well.

What is your favorite software tool?

It’s a toss-up between Grafana and DuckDB. I always prefer to manipulate my data in SQL, and when it comes to visualization, Grafana is the perfect time-series BI platform for me. That’s why Grafana integrates so well into the default Meerschaum stack.
When building data pipelines, I use DuckDB on a near-daily basis: it lets you manipulate in-memory Pandas DataFrames as if they were SQL tables; at first, I didn’t realize I needed that functionality, but now it’s a library I can’t live without.

What are you currently learning?

Lately, I’ve been working more with unstructured data, so I’ve been brushing up on information retrieval. For example, this week I’ve discovered the magic of PostgreSQL’s full-text capabilities with to_tsvector().

What have you been listening to lately?

My fiancée thinks it’s funny, but one of my favorite ways of relaxing is sitting in my comfy chair and watching conference talks. I’ve recently gone through Robert “Uncle Bob” Martin’s talks on “Clean Code” and highly recommend them. He’s an amazing and captivating speaker.

What’s the funniest GitHub issue you’ve received?

More wholesome than funny, but my very first GitHub issue was from my friend Casey whom I had met through LUPLUG (a Linux user group for the podcast Linux Unplugged). At the time, I had only released up to v0.0.38, and my documentation was pretty sparse, so he started the issue to offer to help write my README. He’s a senior software engineer and was very patient and willing to try out Meerschaum despite its (at the time) barebones appearance. I took his advice to heart, and a few months later, when I came back to LUPLUG to demo my progress, he came to my talk and later told me that he was proud of how far I had come. He’s an incredible inspiration, and I’m grateful that he was one of the first engineers to encourage me to work my hardest.

Why was Meerschaum started?

Time-series ETL is my niche, and Meerschaum is my way of sharing my tools with the world. After stepping down from my previous data engineer position, I missed the packages I had left behind, and there were some architectural changes I wished I had made. I decided to build a new system from a clean slate and focus on the features I and my users actually needed in practice. That’s really the gist of it ― I eat my own dog food, in that I use Meerschaum as a dependency in most of the projects I take on.

Where did the name for Meerschaum come from?

I had previously used plumbing or industrial analogies for data pipelines when talking about ETL, but for this package, I wanted to evoke a more sophisticated, Sherlock-Holmes-type image. So I instead went with a Meerschaum pipe, which is an expensive, fancy smoking pipe. I don’t condone smoking, so I chose to depict the logo as a bubble pipe. Plus, as a bonus, Meerschaum is etymologically related to my surname (Meares).

Are there any overarching goals of Meerschaum that drive design or implementation?

Yes, Meerschaum prioritizes UX: from the beginning, I wanted a nice shell interface with sensible commands. You can think of Meerschaum as a tightly integrated collection of scripts for managing your time-series data tables. The CLI has a simple verb-noun syntax with a standard collection of flags. For example, the command show pipes -c sql:mydb displays all of the pipes with that connector, and users just need to change show to sync to update those chosen tables. The same syntax is shared amongst nearly all the commands.
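As an illustration of the verb-noun syntax described above, here is a minimal Python sketch of how such a command could be parsed and dispatched. It is not Meerschaum's implementation: the ACTIONS registry, the run function, and both handlers are hypothetical names made up for this example, and only the -c flag is handled.

```python
import shlex

# Hypothetical handlers: they only describe what they would do.
def show_pipes(connector):
    return f"showing pipes for connector {connector}"

def sync_pipes(connector):
    return f"syncing pipes for connector {connector}"

# Hypothetical registry mapping (verb, noun) pairs to handlers.
ACTIONS = {
    ("show", "pipes"): show_pipes,
    ("sync", "pipes"): sync_pipes,
}

def run(command):
    """Parse a 'verb noun [-c connector]' command and dispatch it."""
    tokens = shlex.split(command)
    if len(tokens) < 2:
        raise ValueError("expected a verb and a noun")
    verb, noun, flags = tokens[0], tokens[1], tokens[2:]

    # The same flag handling applies to every verb.
    connector = None
    if "-c" in flags:
        index = flags.index("-c")
        if index + 1 >= len(flags):
            raise ValueError("-c requires a connector key")
        connector = flags[index + 1]

    handler = ACTIONS.get((verb, noun))
    if handler is None:
        raise ValueError(f"unknown action: {verb} {noun}")
    return handler(connector)

print(run("show pipes -c sql:mydb"))
print(run("sync pipes -c sql:mydb"))
```

Because the flags are parsed independently of the verb, swapping show for sync reuses the same selection of pipes, which is the property the answer highlights.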
Another major goal is to ensure that Meerschaum works as expected on almost all platforms and with as many database flavors as possible. I often test Meerschaum in some unanticipated environments, like through Termux on my phone or on Raspberry Pis. To get a sense of how I prioritize compatibility, take a look at my unit and integration tests: I test each combination of pipes using TimescaleDB / PostgreSQL, MSSQL, MariaDB / MySQL, CockroachDB, and even Oracle SQL (shudder).

What trade-offs have been made in Meerschaum as a consequence of these goals?

To ensure compatibility in as many situations as possible, I’ve found myself limited in the tools I have at my disposal. For example, PostgreSQL has a wealth of features, like a JSON data type, but I need to maintain a similar experience for users with SQLite as a backend. I suppose that’s a common trade-off that comes with ORM libraries.

What is the most challenging problem that’s been solved in Meerschaum, so far?

One problem I faced early on was how to handle adding dependencies for new features. I didn’t want to be limited in what dependencies I chose, but I also wanted to avoid having my users install dozens of packages for features they may never use.
The solution I came up with was my dynamic dependency system. I have a registry of dependency groups, so if you want all of the packages for certain features, you can specify them at installation (e.g. pip install meerschaum[api] for the Docker build of the API server). Otherwise, whenever a feature is first used, Meerschaum will automatically install the newly required packages into a virtual environment. This approach comes with several benefits: uninstallation is significantly easier because everything is kept in the root Meerschaum folder (~/.config/meerschaum/), and multiple installations can be emulated with the environment variable $MRSM_ROOT_DIR. I’ve also incorporated this dynamic dependency system into my plugins system: each plugin gets its own virtual environment, and users may specify a list of required packages or other plugins (similar to requirements.txt).
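The install-on-first-use idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not Meerschaum's code: lazy_import and pip_install are hypothetical names, a plain directory filled with pip's --target option stands in for a real virtual environment, and the sketch assumes the import name matches the package name.

```python
import importlib
import subprocess
import sys

def pip_install(package, target_dir):
    """Install a package into target_dir with pip (hypothetical default installer)."""
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--target", target_dir, package]
    )

def lazy_import(name, target_dir, installer=pip_install):
    """Import a module, installing it into target_dir on first use if it is missing."""
    # Make the isolated directory importable.
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)
    try:
        return importlib.import_module(name)
    except ImportError:
        # First use of this feature: install the dependency, then retry.
        installer(name, target_dir)
        importlib.invalidate_caches()
        return importlib.import_module(name)
```

Keeping every lazily installed package under one directory is what makes uninstallation simple in this sketch: deleting that directory removes them all.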

Are there any competitors or projects similar to Meerschaum? If so, what were they lacking that made you consider building something new?

There are plenty of ETL systems out there ― in a recent issue of Console, you interviewed Ido of Ploomber, for example. Other similar projects include Apache Airflow, dbt, and Meltano. Here is an article I wrote to compare alternatives and describe Meerschaum’s place in the world.
The niche Meerschaum fills is the need for a lightweight, no-frills, time-series ETL system. Most ETL projects are generalized to handle any sort of data, but if you make just a few assumptions about your data streams (i.e. time-series and immutable data), you can pick up significant gains in efficiency. Meerschaum can certainly handle non-time-series data, however.
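To see why those two assumptions help, consider a simplified Python sketch of an incremental sync. It is not Meerschaum's algorithm: the sync function, the timestamp key, and the sample rows are hypothetical. Because rows are assumed to be immutable and ordered by time, only rows newer than the latest timestamp already in the target need to be copied.

```python
from datetime import datetime

def sync(target_rows, source_rows, datetime_key="timestamp"):
    """Append only the source rows newer than the newest row already in the target.

    Assumes rows are immutable once written, so older rows never need re-checking.
    """
    if target_rows:
        latest = max(row[datetime_key] for row in target_rows)
        new_rows = [row for row in source_rows if row[datetime_key] > latest]
    else:
        # Empty target: the first sync copies everything.
        new_rows = list(source_rows)
    target_rows.extend(new_rows)
    return len(new_rows)

# Hypothetical sample data: the target already holds the first row.
source = [
    {"timestamp": datetime(2022, 4, 1), "value": 1.0},
    {"timestamp": datetime(2022, 4, 2), "value": 2.0},
    {"timestamp": datetime(2022, 4, 3), "value": 3.0},
]
target = [dict(source[0])]

added = sync(target, source)
print(added, len(target))
```

Against a real database, the same filter would typically be pushed into the query (a WHERE clause on the datetime column) so that old rows are never transferred at all.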
For smaller projects without a dedicated data engineer on staff, using larger ETL systems can feel like cutting a cake with a chainsaw. Meerschaum offers a small, comfortable framework for building your time-series data streams without getting in your way. Due to its dynamic design, you can use just the parts of Meerschaum you need as components in a larger system.

What was the most surprising thing you learned while working on Meerschaum?

Over the last two years, I’ve learned about so many incredible libraries and tools while building Meerschaum. To name a few, I’ve fallen in love with projects like Plotly Dash (for building React dashboards in Python), PyOxidizer (for building standalone Python executables), and prompt-toolkit (for building responsive terminal interfaces).
I’ve also come to realize good documentation is just as important as the code it documents. I’m very grateful to the developers of pdoc3, mkdocs, and mkdocs-material for their projects which I rely upon to provide high-quality documentation.

What is the release process like for Meerschaum?

Generally speaking, I release a new version every couple of weeks. I used to release much more often when the project was still fast and loose, but now I verify that all of my tests pass before publishing. My advice for users is to update often; you can get the latest release with mrsm upgrade mrsm or pip install --upgrade meerschaum.
I try to follow semantic versioning for my releases. For small intermediate releases, I increment the third “patch” number (e.g. v0.5.12 -> v0.5.13), and for larger feature releases, I increment the second “minor” number. Backwards compatibility is important to me, which is why I have not yet released v1.0.0: I’m saving that version for a significant future release in case I need to break backwards compatibility.
Typically, when releasing new versions, I list my changes in my changelog and write descriptive commit messages (with emoji, of course!). I also credit the authors of dependencies on my acknowledgments page.
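The numbering scheme can be made concrete with a small Python sketch. The bump function is a hypothetical helper written for this example, not part of Meerschaum's release tooling; it follows the usual semantic versioning convention of resetting the lower numbers when a higher one is incremented.

```python
def bump(version, part):
    """Return the next version after incrementing 'major', 'minor', or 'patch'."""
    prefix = "v" if version.startswith("v") else ""
    major, minor, patch = (int(n) for n in version[len(prefix):].split("."))
    if part == "patch":
        patch += 1
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "major":
        major, minor, patch = major + 1, 0, 0
    else:
        raise ValueError(f"unknown part: {part}")
    return f"{prefix}{major}.{minor}.{patch}"

print(bump("v0.5.12", "patch"))  # v0.5.13
print(bump("v0.5.13", "minor"))  # v0.6.0
```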

Is Meerschaum intended to eventually be monetized if it isn’t monetized already? If so, how?

I have a GitHub Sponsors page set up for users and organizations to support my work or contract my help. No sponsors yet, but I have received a $5 donation on BuyMeACoffee.com!

How do you balance your work on open-source with your day job and other responsibilities?

Because I use it so heavily in my contracting work, Meerschaum kind of is my day job. I’ve been considering stepping into a more full-time position, but because I truly believe in the work that I’m doing, I’m dedicating this time to getting the word out and making Meerschaum the best project it can be.

What is the best way for a new developer to contribute to Meerschaum?

Check out the contributing guide! The first way to help is to join the discussion: ask questions, report bugs, or showcase your work! Next, you can write a Meerschaum plugin and publish it to the public repository: just run the commands register user <myuser> -i api:mrsm and register plugin <myplugin>.

If you plan to continue developing Meerschaum, where do you see the project heading next?

I have a wish list / to-do page on my website, and as of now most of the major features have been implemented. The next big things on my mind are building a dependency graph of pipes to sort out multi-stage / derivative pipes and incorporating my findings from several experimental synchronization strategies I tested last year for my master’s thesis. Some lesser-known features I have also added are a GUI frontend and a web terminal (run the commands start GUI and start webterm), and later down the road, I will add a TUI built with Will McGugan’s new library Textual as well as a proper desktop application.
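A dependency graph of pipes is still a planned feature, so the following is only a guess at the general technique: a topological sort that orders derivative pipes after the pipes they are built from. The pipe names and the sync_order function are hypothetical, and the sketch relies on Python's standard graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical pipe names: each key lists the pipes it is derived from.
parents = {
    "weather_raw": set(),
    "weather_hourly": {"weather_raw"},
    "weather_daily": {"weather_hourly"},
    "energy_raw": set(),
    "energy_vs_weather": {"energy_raw", "weather_daily"},
}

def sync_order(graph):
    """Return an order in which every pipe comes after the pipes it depends on."""
    return list(TopologicalSorter(graph).static_order())

order = sync_order(parents)
print(order)
```

If two pipes depended on each other, TopologicalSorter would raise graphlib.CycleError, which is a convenient way to reject an invalid multi-stage configuration before syncing anything.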

What motivates you to continue contributing to Meerschaum?

Knowing that Meerschaum is being used in production and helping make people’s lives easier is the encouragement that keeps me going. For example, the new devs at my old job have been migrating my old legacy system to one they’re building with Meerschaum, and I’m so proud of what they’ve accomplished!

Are there any other projects besides Meerschaum that you’re working on?

Yes, much of my time is now spent on a few contracts. Most of the time, I implement these projects as privately hosted Meerschaum plugins.
Most of my side projects nowadays are packaged as Meerschaum plugins, such as the covid and apex plugins. To get started writing your own plugins, check out this episode in my tutorial series.

Where do you see open-source heading next?

This is kind of a grim outlook, but because of the severity of issues like Log4J, left-pad, faker-js, etc., I foresee open-source heading to a more sanitized, corporate future. I think many projects will be brought into the private sector, which isn’t necessarily a bad thing: projects need funding to survive. Just look at the corporate relationships built by the Python Software Foundation or Linux Foundation. They’ve managed to keep their integrity as open-source organizations while securing funding.
I firmly believe in the power of FOSS and even admire the GPL. In an ideal world, software wouldn’t be tied up by proprietary licenses, but pragmatically speaking, the best way I see new open-source projects gaining adoption is with more permissive licenses like the Apache 2.0 License, which is what I’m using. I’m following in the footsteps of Docker, which saw industry-wide adoption a few years back due in part to its permissive license, but I’m wary of being taken advantage of by companies like Amazon, as happened with Grafana and Elasticsearch.

Do you have any suggestions for someone trying to make their first contribution to an open-source project?

Don’t let impostor syndrome stop you! We’re all faking it until we make it to some degree. You’d be surprised how many developers are eager to answer your questions or review your PR. Just be sure to do a bit of research first, especially for larger projects.

What is one question you would like to ask another open-source developer that I didn’t ask you?

I would ask the developers Sebastián Ramírez of FastAPI and Will McGugan of Rich how they first spread the word about their projects. It’s not easy convincing people to try a new library, and I’m not exactly a marketing person! I look up to them for inspiration and encouragement to continuously improve the project.

Console by CodeSee.io
