Banned, Bevy, and Airbyte
Code on the Table
Code on the Table is an online event about open-source business models happening on 03/24. Prominent open-source speakers will discuss the following topics:
How has FOSS changed?
Can open-core survive Amazon?
Is there more pressure for developer tools to be free?
What are enterprise companies looking for when they choose open-source software?
Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
language: Java (core platform) + language-agnostic connectors, stars: 1940, watchers: 61, forks: 158, issues: 489
last commit: March 16, 2021, first commit: July 28, 2020
banned.h is a header file in the git repo with a list of banned C functions.
language: C, stars: 37271, watchers: 2306, forks: 21120, issues: 58
last commit: March 19, 2021, first commit: April 07, 2005
Simple, open-source, lightweight, and privacy-friendly web analytics alternative to Google Analytics.
language: Elixir, stars: 6932, watchers: 87, forks: 300, issues: 25
last commit: March 18, 2021, first commit: September 02, 2019
Bevy is a “refreshingly simple” data-driven game engine built in Rust.
language: Rust, stars: 7312, watchers: 191, forks: 576, issues: 401
last commit: March 20, 2021, first commit: November 13, 2019
If you’re interested in posting a help wanted ad for your project to thousands of developers, send an email to email@example.com.
An Interview With Michel of Airbyte
Hey Michel! Let’s start with your background. Where have you worked in the past, where are you from, how did you learn to program, what languages or frameworks do you like, etc.?
I’ve been working in data engineering for 15 years. Originally from France, I came to the US in 2011 to join a small startup named LiveRamp. As the company grew, I became Head of Integrations and Director of Engineering, where my team built and scaled over 1,000 data ingestion and distribution connectors to replicate hundreds of TB worth of data every day.
After LiveRamp’s acquisition and later IPO (NYSE:RAMP), I wanted to go back to an early stage startup. So I joined rideOS as Director of Engineering, again deep in data engineering. While there, I realized that companies were always trying to solve the same problem over and over again. This problem should be solved once and for all.
This is when I decided to start a new company, and Airbyte was born.
Who or what are your biggest influences as a developer?
Over my career, I have always been battling against complexity, whether in code, infrastructure, processes or organization. It is always possible to do complex things in a simple way, and that has always been my North Star wherever I’ve been and whatever I’ve done.
What’s your most controversial programming opinion?
An amazing software engineer is the one who writes the least code.
Let me explain. Programming as we know it today has been around for 40 years, so it is very rare that you’re the first person to encounter a particular problem. Before starting any project, everyone should consider using an existing solution. For every spec and decision, there should be a rationale for why you are not using one, and you should be able to explain how building your own is an asset rather than just a rebuild of something that already exists.
Applying this day to day has several effects. First, you get more features out of the box. Second, if you encounter an issue, it is likely that someone outside of your company has faced it and will fix it, and if not, you can fix it for everyone. Third, all the code you write is valuable code for the product and an actual asset instead of a reinvention of the wheel.
This is the reason we decided to build Airbyte. There are existing solutions (Fivetran, Stitch, etc.), but they are all closed source and cloud based. There is no existing solution that can be used out of the box, in the safety of your cloud, that can be extended and customized, and that can support the long tail of connectors. We wanted to make sure that the next time an engineer builds a connectivity layer into a product or an analytics infrastructure, they won’t have to spend countless hours building and managing connectors and will instead have access to a stable, community-maintained, full-featured solution.
If you could dictate that everyone in the world should read one book, what would it be?
Sapiens. It is one of the best-written and most approachable books about how we got to where we are as humans.
If you could teach every 12 year old in the world one thing, what would it be and why?
Learn not to be great at everything. Instead, force yourself to be good or decent at 90% of things and the best at the last 10%. It is a waste of time to try to be great at everything, because it is not possible. But it is possible to be AMAZING at one thing, and if you focus on this early, it will compound over time. For the remaining 90%, other people will be amazing at it; you should find them and rely on them.
If I gave you $10 million to invest in one thing right now, where would you put it?
Tesla (obviously don’t take that as investment advice :)). They will win the self-driving car war. To get autonomous driving technology on the street, you need a “Safety Driver” (someone who has been trained to take over if the car misbehaves).
Tesla is the ONLY company that has solved how to get safety drivers at scale while making money out of it: they sell cars. Tesla customers have become safety drivers and data probes for the company, and there are millions of these customers. Most other companies have 10-20k cars. Because of that, Tesla has more data than anyone, and for solving this particular problem, data is the most important asset.
How do you separate good project ideas from bad ones?
During the first 6 months of Airbyte, we had our fair share of pivots and explorations.
We learned that you start with an intuition, but that intuition is very hard to evaluate unless you do some customer discovery interviews. You’ll detect the bad projects very quickly, as they can be dismissed within 2-3 days. The hard part is in distinguishing the great projects from the good ones. To do that, you need to pay a lot of attention to pattern matching and biases (trying to remove them as much as possible from your potential clients, and also from yourself). One thing that helped us was doing a series of five interviews, diving deeper with each series, until we reached a point where we weren’t learning anything new from the interviews.
Why was Airbyte started?
From July to August 2020, we reached out to 250 of Fivetran’s and StitchData’s clients. To do so, we took the list of all the public customers listed by these companies and we automated an outreach on LinkedIn. In the end, we managed to talk to 45 of them. What we learned during those interviews is that all these closed-source cloud-based approaches didn’t actually solve the data integration problem. All the companies still had to build their own connectors on the side, either because they were not supported, or they were supported but not in the way they needed. This problem can easily be addressed with open source, if you make it simpler to build and maintain connectors with the open-source tool. That’s exactly what we have in mind with Airbyte.
In addition to this, we started to see companies that couldn’t use those tools, as they couldn’t use cloud-based vendors because of data security concerns. Again, an open-source product would fix this problem.
The last point is around their volume-based pricing, which is unpredictable. Who knows how many rows you will replicate this month? An open-source self-hosted solution could address that point, too.
That’s why we started Airbyte.
Who or what was the biggest inspiration for Airbyte?
In all honesty, I would say there were two inspirations.
The first one was Fivetran. Their shortcomings were our inspiration. We will have more and more data, more and more tools (and therefore data silos), and more and more requirements in terms of data privacy and security. The data integration problem will only get bigger with time. Closed-source solutions can’t address the long tail of connectors. They will always have an ROI consideration when it comes to building and maintaining connectors that are used by only a few customers. Airbyte can address this. We’re building abstractions that make building connectors low-code, and as all those connectors will be standardized, it will be easier for us and the community to help with their maintenance.
The second inspiration was Singer.io’s failure and slow death. Singer could have been great if Stitch had been a lot more involved in the community. It was never their focus, and was more of an afterthought to increase the number of connectors they could sell. But, above all, they didn’t plan how to standardize the connectors to make them a lot easier to maintain. Any contributor could build a connector with only their own use case in mind, and in the end, that contributor was the only one able to maintain it. As a result, most Singer taps are out of date today. That’s why, at Airbyte, we have our own data protocol with standardization in mind. This protocol is compatible with Singer, by the way. It’s a way for us to help Singer users migrate to our new standard, so that all their work doesn’t go to waste.
What are the overarching goals of Airbyte that drive design or implementation, and what trade-offs have been made as a consequence of these goals?
We need Airbyte to work for any company whatever their data stack, and whatever their use case and data volume. With that in mind, we still have a lot to do! But we’re doing it step by step while anticipating the architecture we need to address all those use cases. In the end, data integration is a thousand-paper-cut problem.
So, for instance, we started with only full refresh. We added the incremental append in December 2020. We will add CDC support, integration with Airflow, DBT, OAuth, etc. in the very near future. Our goal is to be able to address 90% of use cases with the community open-source edition by the end of 2021. After all, we want to become the open-source standard to replicate data, and to commoditize data integration.
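To make the difference between the two sync modes mentioned above concrete, here is a minimal sketch in Python. It is an illustration only and not Airbyte's implementation: the function names, the in-memory lists standing in for a source and a destination, and the `updated_at` cursor field are all assumptions made for the example.

```python
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]


def full_refresh(source: List[Record]) -> List[Record]:
    """Hypothetical full refresh: re-read every source record and
    use the result as the new destination contents."""
    return list(source)


def incremental_append(
    source: List[Record],
    destination: List[Record],
    cursor_field: str,
    last_cursor: Optional[Any] = None,
) -> Tuple[List[Record], Optional[Any]]:
    """Hypothetical incremental append: copy only records whose cursor
    value is greater than the one saved after the previous sync."""
    new_records = [
        record
        for record in source
        if last_cursor is None or record[cursor_field] > last_cursor
    ]
    updated_destination = destination + new_records
    if new_records:
        # Remember the highest cursor seen so the next sync can skip these rows.
        last_cursor = max(record[cursor_field] for record in new_records)
    return updated_destination, last_cursor


# Example usage with an assumed "updated_at" cursor column.
source = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
destination, cursor = incremental_append(source, [], "updated_at")

source.append({"id": 3, "updated_at": 30})
destination, cursor = incremental_append(source, destination, "updated_at", cursor)
```

In this sketch the second sync copies only the record with `id` 3, whereas a full refresh would copy all three records again. The saved cursor is the only state that has to persist between runs.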
What is your typical approach to debugging issues filed in the Airbyte repo?
We identify the blockers and try to prioritize them as soon as possible—usually on the same day. Apart from those high-priority bugs, we have a weekly sprint process and go over all the issues week after week to prioritize them. We use GitHub to keep track of our issues & milestones. All of this is public.
What is the release process like for Airbyte?
There are two parts to Airbyte: the core platform and the connectors. We release the core platform on a weekly basis. For connectors, we have bi-weekly sprints, but we release continuously.
How do you intend to monetize Airbyte?
Here’s what we have in mind in terms of business model.
There will be a community edition that will remain open-source forever. Everything we’ve built right now is part of that open-source edition. It will include all the features that an individual contributor needs to perform their integration, i.e., connectors, integration with the data stack (DBT, Airflow, etc.), incremental/change data capture, etc.
Then, there will be a licensed edition with two plans:
A standard plan: hosting & management (premium support + SLA)
An enterprise plan, with data quality & privacy compliance features, and SSO & user access management features
We will also work on a hosted version in the future.
The last business model we have in mind is what we call “Powered by Airbyte.” We’d empower you to offer integrations to your own clients on your platform, using our white-labeled connectors through our API.
What is the best way for a new developer to contribute to Airbyte?
The first thing would be to join our Slack. Then, you can check our documentation and understand the architecture of Airbyte. Finally, you could check out our good first issues that we have tagged specifically for new contributors.
If you plan to continue developing Airbyte, where do you see the project heading next?
Well, we want to become the open-source standard for data integration, and be agnostic in terms of sources and destinations, first and foremost. Only after we’ve become the standard will we start focusing on premium features, too.
What motivates you to continue contributing to Airbyte?
The whole team is passionate about our mission to change how data is being managed within companies. When you look at the data infrastructure, in the value chain, the data warehouses / lakes and anything downstream of them are pretty mature now. That includes transformation with DBT, data analytics / visualization / business intelligence. However, everything upstream of the warehouse is not yet mature. Data integration, data lineage, data quality, privacy compliance, data cataloging and discovery—we still need standards for all of them. We feel that data integration is in the middle of it, and that’s exciting to us.
Join thousands of engineers in subscribing to Console, a weekly roundup of the latest in open-source software, curated by an Amazon engineer.