Console #132 -- Interview with Zach of Dolt - Git for Data

Featuring Sapling, ntfy, and Dolt

Nov 20, 2022

🏗️ Projects

Browse through open source projects on OpenSourceHub.io, add your project to get more exposure and connect with other maintainers and contributors!

🌱 Sapling

Sapling SCM is a cross-platform, highly scalable, Git-compatible source control system open sourced by Facebook.

language: Rust, stars: 3502, issues: 42, last commit: yesterday
repo: github.com/facebook/sapling
site: sapling-scm.com

🔔 ntfy

ntfy.sh (notify) is a simple HTTP-based pub-sub notification service. It allows you to send notifications to your phone or desktop via scripts from any computer, entirely without signup or cost.

language: Go, stars: 7688, issues: 83, last commit: yesterday
repo: github.com/binwiederhier/ntfy
site: ntfy.sh

📂 Dolt

Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a Git repository. Connect to Dolt just like any MySQL database to run queries or update the data using SQL commands.

language: Go, stars: 13395, issues: 257, last commit: 2 days ago
repo: github.com/dolthub/dolt
site: dolthub.com


🎤 Interview With Zach Musgrave of Dolt

Hey Zach! Thanks for joining us! Let’s start with your background. Where have you worked in the past, where are you from, how did you learn how to program, what languages or frameworks do you like?

I learned to program when I was 12 or so on my Macintosh LC II, making point-and-click adventure games in HyperCard, basically through trial and error. I took my first programming class in college. Before joining DoltHub I spent 8 years at Amazon and then 5 at Google and learned a ton at both. I grew up in the Seattle area and went to school at the UW, which made me one of the only locals working at Amazon.
In terms of languages, I was a Java dev most of my career and have now switched to Golang. I will never program in Java again if I have the option of Golang. For scripting, I love Perl and hate Python. I can’t live without JetBrains IDEs.

What’s your most controversial programming opinion?

Code quality doesn’t actually matter. Customers don’t see code, they only see results. Engineers obsess over code quality because we like pretty code, but its relationship to product quality is very weak. All the processes people put in place to try to improve code quality are a huge tax on development with very mixed results. Google had incredibly restrictive policies around code quality that cost thousands of engineering years to enforce annually, and yet it was trivial to find terrible code in the repository. And Google can’t ship software to save its life.
I’m not saying code quality isn’t a good thing, or that bad code quality doesn’t make it harder to ship software. Both these things are true. But it doesn’t make your product good. As an industry we invest in code quality far past the point of diminishing returns, and still end up with bad code.

What are you currently learning?

Related to Dolt, I’ve been learning the intricacies of MySQL’s transaction model so that Dolt can reproduce it bug-for-bug. The interesting thing is that MySQL does a lot of locking to guarantee consistency among clients, while Dolt does no locking and tries to sort out everything via merge at commit time.
Outside work, I’m learning about dog training because we’re getting a family dog soon, and learning to play some classic rock songs on the guitar. I kind of only know how to play mid-2000s indie rock songs and it’s a little obnoxious.

What have you been listening to lately?

I know when I’m working really hard because I go to the same few instrumental albums and listen to them on repeat. Today it’s Archipelago by Hidden Orchestra. Also all of El Ten Eleven’s oeuvre.

How do you separate good project ideas from bad ones?

It’s hard. It’s pretty easy to identify bad technical ideas, but hard to know a bad product idea.
For Dolt, our job is somewhat easier, because we’ve chosen to copy two separate products: MySQL for the database side, and Git for the versioning side. A lot of our project planning is simply deciding in what order to build things. We know we’re eventually going to build 100% of each.

Why was Dolt started?

Dolt was originally intended to make sharing data on the internet easier. It’s hard today because people either mail CSV files around, or rely on APIs, making it really hard to collaborate effectively. We thought if we could get people to think of their data the same way they think of their source code, and use tools that encourage collaboration like Git does, then we could bootstrap a data sharing community. That’s why we wrote Dolt.
We still believe in this vision, but we think it’s going to take a long time. We have to convince people that version control for data matters, and that Dolt is the right way to do it. It took Git over 5 years to reach a critical mass, and people were already sold on the concept of version control for source code. So we’re going to have to be patient. In the meantime, our customers want to use Dolt as an application database server. So we now see Dolt becoming successful first as an application database, and then as a means of publishing and sharing data second, probably much later.

It seems like a lot of your customers would be ML projects. By "critical mass", do you mean outside ML use cases, or are you having difficulty making inroads even there?

ML is one of our biggest use cases, and we’re getting good traction in the ML space. Those customers want to version their training data and model outputs so their workflows are reproducible. But what they’re not doing is sharing their data. That’s what we think will take a long time to materialize.
We’re trying to bootstrap data sharing and collaboration by paying people to do it via our data bounties program. We’ve paid out around $50k in bounty money so far and gotten some really great open datasets produced for that money. But we just think it’s going to take a long time for this idea to catch on, and we’re OK with that.

Who, or what, was the biggest inspiration for Dolt?

Definitely Git. We named Dolt to pay homage to how Linus Torvalds named Git.
Torvalds sarcastically quipped about the name git (which means "unpleasant person" in British English slang): "I'm an egotistical bastard, and I name all my projects after myself. First 'Linux', now 'git'."
We wanted a word meaning "idiot", starting with D for Data, short enough to type on the command line, and not taken in the standard command line lexicon. So, Dolt.
Dolt’s command line copies Git’s exactly, so if you know how to use Git you know how to use Dolt.

Are there any overarching goals of Dolt that drive design or implementation?

The obvious answer here is that Dolt storage has to be a commit graph. Without this, it’s not possible to implement branching and merging the way Git does, which was our most important design goal. This requirement drives every other technical decision in the product.
From the other direction, we want to be a 100% compatible, drop-in MySQL replacement, so that if you have a MySQL based application, you can port it to Dolt by just changing the connection string. The SQL layer is built mostly on top of the storage layer, but some of these requirements do find their way all the way down to the bottom layer.
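To make the commit graph requirement concrete, here is a minimal Go sketch of why merging depends on it: a merge needs a common ancestor of the two branches, and that is found by walking parent links. The Commit type, the mergeBase function, and the c1..c5 commit IDs are hypothetical names invented for this illustration; this is not Dolt's implementation.

```go
package main

import "fmt"

// Commit is a hypothetical node in a commit graph: an ID plus parent IDs.
type Commit struct {
	ID      string
	Parents []string
}

// ancestors returns the set of commits reachable from start, including start.
func ancestors(graph map[string]Commit, start string) map[string]bool {
	seen := map[string]bool{}
	queue := []string{start}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		if seen[id] {
			continue
		}
		seen[id] = true
		queue = append(queue, graph[id].Parents...)
	}
	return seen
}

// mergeBase walks back from b breadth-first and returns the first commit
// that is also an ancestor of a. ok is false if the histories are unrelated.
func mergeBase(graph map[string]Commit, a, b string) (base string, ok bool) {
	fromA := ancestors(graph, a)
	seen := map[string]bool{}
	queue := []string{b}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		if seen[id] {
			continue
		}
		seen[id] = true
		if fromA[id] {
			return id, true
		}
		queue = append(queue, graph[id].Parents...)
	}
	return "", false
}

func main() {
	// c1 <- c2 <- c3        (main)
	//         \
	//          c4 <- c5     (feature)
	graph := map[string]Commit{
		"c1": {ID: "c1"},
		"c2": {ID: "c2", Parents: []string{"c1"}},
		"c3": {ID: "c3", Parents: []string{"c2"}},
		"c4": {ID: "c4", Parents: []string{"c2"}},
		"c5": {ID: "c5", Parents: []string{"c4"}},
	}
	base, ok := mergeBase(graph, "c3", "c5")
	fmt.Println(base, ok)
}
```

With a base commit in hand, a three-way merge can compare each branch against it to decide which side changed what. Without parent links there is nothing to walk, which is presumably why the storage layer has to be a graph.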

What would the syntax for something like branching look like in SQL?

To get versioning features off the command line and into SQL, we introduce a bunch of custom SQL functions and system tables. For example, you can examine the diff on a table named myTable with this query:
SELECT * FROM dolt_diff_myTable WHERE to_commit IS NULL;
To switch to a different branch you can set some special session variables, use a different connection string, or use a special SQL function:
SELECT DOLT_CHECKOUT('-b', 'myBranch');
And you can always query any revision of a table with the AS OF syntax:
SELECT * FROM myTable AS OF 'feature-branch';
We have a documentation site that covers all of this in depth.

What is the most challenging problem that’s been solved in Dolt, so far?

The most challenging technical aspect of Dolt is probably the storage format itself. It uses a novel data structure called a ProllyTree to get structural sharing across revisions, so that you can keep multiple versions of the data around without blowing up storage costs. It also makes diff and merge performant. We’ve published a bunch of technical articles about it, e.g.:
https://www.dolthub.com/blog/2020-06-16-efficient-diff-on-prolly-trees/

Are there any competitors or projects similar to Dolt? If so, what were they lacking that made you consider building something new?

Dolt is the only SQL database that you can branch and merge, fork and clone, and nobody else is building a direct competitor right now.
A lot of products call themselves “Git for data,” but they’re not, not really. What they mean is that they’re a data product that has some version control features. But most of them version only the schema of a database, not the actual data, and the rest can’t branch or merge the data, or even diff two revisions. The exception here is TerminusDB, which does branch and merge. But it’s a graph database, not SQL.
We wrote a roundup on all the products calling themselves “Git for data” over a year ago, and it hasn’t changed much.
https://www.dolthub.com/blog/2020-03-06-so-you-want-git-for-data/

What was the most surprising thing you learned while working on Dolt?

I love Golang, but I was very surprised to learn that a panic in any goroutine (of which there can be thousands) will kill your entire process. Not just the thread that panicked, but the entire process. There’s no top-level mechanism to ensure that you can catch these, no way to install a global panic handler at the program’s entry point. You have to handle any possible source of panic individually not only in your code, but in all the third-party libraries you use. It’s a serious problem with the language runtime I hope they address.
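The per-goroutine handling described above usually looks something like the following Go sketch. safeGo and runAll are hypothetical helper names invented for this illustration, not part of Dolt or the standard library. The pattern relies on the fact that recover only works inside a deferred function in the same goroutine that panicked.

```go
package main

import (
	"fmt"
	"sync"
)

// safeGo runs fn in a new goroutine and converts a panic into an error
// sent on errs, so one failing task does not take down the whole process.
func safeGo(wg *sync.WaitGroup, errs chan<- error, fn func()) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		// recover must be called from a deferred function in this goroutine.
		defer func() {
			if r := recover(); r != nil {
				errs <- fmt.Errorf("recovered panic: %v", r)
			}
		}()
		fn()
	}()
}

// runAll starts every task through safeGo and returns the recovered errors.
func runAll(tasks []func()) []error {
	var wg sync.WaitGroup
	errs := make(chan error, len(tasks)) // buffered so senders never block
	for _, t := range tasks {
		safeGo(&wg, errs, t)
	}
	wg.Wait()
	close(errs)

	var out []error
	for err := range errs {
		out = append(out, err)
	}
	return out
}

func main() {
	errs := runAll([]func(){
		func() {},                // completes normally
		func() { panic("boom") }, // would crash the process without recover
	})
	fmt.Println(len(errs), errs)
}
```

A wrapper like this only protects goroutines launched through it. A goroutine started inside a third-party library with a plain go statement is out of its reach, which is the gap the answer above is pointing at.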

Why was Go chosen for the Dolt implementation?

Golang is a great language and we’re generally very happy with it, but we chose it for practical business reasons.
We built Dolt on top of a fork of an open source graph database called noms, written in Golang. Noms implements the ProllyTree data storage and commit graph, and we built Dolt’s table and schema semantics on top of that. Building on top of noms saved us at least a year of engineering work and let us get to market much faster.
Picking Golang also enabled us to adopt the go-mysql-server project, also written in pure Golang, to build our SQL engine implementation. We’re really fortunate to have found these two great Golang projects lying around for us to extend.

What is your typical approach to debugging issues filed in the Dolt repo?

My favorite way to debug an issue is by pure vibes, where I let my intuition guide me to where I just know the source of the problem must be even if I can’t explain why. Feels good man.
But often I have no idea what the problem is, which means I get a repro set up and put some breakpoints in GoLand to see what’s going on.

What motivates you to continue contributing to Dolt?

Dolt is my full-time job and I have equity in the company, so it would be really nice for it to succeed and make me rich. But beyond that, contributing to Dolt is satisfying because we ship all the time. We go from whiteboard discussion to feature launch in a couple weeks. When customers find a bug or ask for a feature, we usually get them a release the same week. Working at a place that values moving fast and putting tangible results in customers’ hands on a continual basis feels great.

What is the release process like for Dolt?

We have a great release engineer, Dustin Brown, who has automated everything for us. We do continuous integration and deployment with thousands of automated tests and performance benchmarks on every PR. Cutting a release is as simple as clicking a button on GitHub. I wrote a janky Perl script to generate release notes, including changes in dependent projects, because none of the other release note generators I could find had this feature.

How are you currently monetizing Dolt?

We make money by selling support contracts to companies using Dolt as their application database, similar to other database companies. We also make money using the same private repository model that GitHub does, where people using DoltHub can pay us $50 a month to get private repositories. Eventually we’ll sell database server hosting as well.

What is the best way for a new developer to contribute to Dolt?

The best way to contribute to Dolt is to start using it, and find out what Git or MySQL features it doesn’t have that you need. Then start implementing them! We want them all, and we’re implementing them in the order people ask (paying customers first). There are a ton of things left to implement, and most of them aren’t hard, just not urgent for us.

If you plan to continue developing Dolt, where do you see the project heading next?

Dolt is headed in several exciting directions next.
One big push is making Dolt as performant as MySQL for the OLTP use case. Right now we’re about 4-8x slower on average, depending on the query. Then we need to benchmark and improve our numbers on concurrent transactions. Lots of work to do there, but we’re confident we can pull even with MySQL on performance in the next year.
Our big planned feature launch is a hosted solution for Dolt databases. The idea is if you are using DoltHub as your remote, then you can click a button on DoltHub to spin up a VM running your database as a server, and we give you the connection string to it. Then whenever you push to DoltHub, your running database gets updated with the data you just pushed.
We’re also going to build a cloud-native version of Dolt that can scale to any amount of data (petabytes) and separates storage from processing, like data warehouses do.

A lot of people have been moving in this direction lately, but it's never occurred to me to ask about the technical details. I'm imagining you'd use Kubernetes to achieve this, or will you use something else?

Right, we’re going to deploy a Dolt container into Kubernetes for every hosted server. Our whole stack is deployed on Kubernetes, so we already have the infrastructure to make this happen.

What are the current technical challenges you're having with this now, and do you see any more arising in the future?

The technical challenge there is managing fleets of these server containers at scale and having sufficient monitoring and automation to keep them alive and responsive. Kubernetes helps a lot there, but it’s not magic. You still have to do the work.
Another technical challenge is implementing hosted read replicas, which people are going to want for performance. We’re still designing how those will work with Dolt, and we have a bunch of ideas.


Console by CodeSee.io
