Console #119 - Interview With Saul of VisiData - an Open-source data multitool
Featuring voila, PrivateBin, and ctk.
🏗️ Projects
voila
Voilà turns Jupyter notebooks into standalone web applications. Each user connecting to the Voilà tornado application gets a dedicated Jupyter kernel which can execute the callbacks to changes in Jupyter interactive widgets.
language: Python, stars: 4260, issues: 243, last commit: August 20, 2022
repo: github.com/voila-dashboards/voila
site: voila.readthedocs.io
PrivateBin
A minimalist, open source online pastebin where the server has zero knowledge of pasted data. Data is encrypted/decrypted in the browser using 256-bit AES.
language: PHP, stars: 4240, issues: 117, last commit: July 22, 2022
repo: github.com/PrivateBin/PrivateBin
site: privatebin.info
ctk
Visual composer for container based workloads.
language: TypeScript, stars: 161, issues: 10, last commit: August 20, 2022
repo: github.com/ctk-hq/ctk
site: ctk.dev
VisiData
A terminal spreadsheet multitool for discovering and arranging data.
language: Python, stars: 5738, issues: 45, last commit: August 17, 2022
repo: github.com/saulpw/visidata
site: visidata.org
Join thousands of other open-source enthusiasts and developers in the Open Source Hub Discord server to continue the discussion on the projects in this week's email!
🎤 Interview With Saul Pwanson of VisiData
Hey Saul! Thanks for joining us! Let us start with your background. Where have you worked in the past, where are you from, how did you learn to program, and what languages or frameworks do you like?
I learned BASIC as a kid in 1980 and have been coding ever since, eventually working at Microsoft, Wizards of the Coast, F5 Networks, and several startups. I’ve done a fair amount of low-level assembly and C for operating systems, networking, and embedded development. I love Forth and have written a few Forth interpreters. The last 10 years or so I've been working more with data, so Python is my language of choice for most projects these days, with C and SQL as needed.
What’s your most controversial programming opinion?
I’ve got a lot of them, but one that seems to pique people: I think that having lots of tests is a bad thing. Tests are like tent stakes. You should have some, to keep your tent from blowing away — but if you have too many, it’s difficult to move your damn tent.
What is your favorite software tool?
Other than VisiData? :) I really like tmate, which makes it so easy to set up a shared terminal session. You just type `tmate`, and done! I love tools like that. Mosh (for interactive terminals over flaky network connections) is another one where it just works. The developers have removed all the frictions and it just feels really nice.
What are some of the things you did that contributed the most to your growth as a dev over the course of your career?
Publishing things in the open. Three projects I’ve done in the past 10 years have directly led to big things: two jobs and one scandal. In each case, I spent 6 months or more working on something. And in each case there were points at which it felt like a huge waste of time, but then it paid off ten-fold.
And what are some things you probably should have done that you didn't do?
I wish I had done more collaborating with other people. I can work 10 times faster as an individual than I can on a team, but it doesn’t matter how fast I am because I only have so many hours in a day. Working on a team scales in a way that you can’t if you’re working as an individual.
What are you currently learning?
Linear algebra. It just seems to be important in so many different fields. Also: Unicode art.
What have you been listening to lately?
The Machinarium Soundtrack is my current coding music, and I listen to it at least once a day; Clockwise Operetta is just fantastic. Also Hawaii: Part II is incredible; check out Labyrinth, a chiptune rap!
Why was VisiData started?
About a decade ago, I was working at F5 Networks and wanted a nicer interface for viewing statistics and configuring and testing BIG-IP. The existing tooling was very CLI-like: powerful and scriptable but verbose. I’m kind of fundamentally lazy, and fewer keystrokes is just a better scene for me.
So I wrote a program that would eventually become the inspiration for VisiData — an interactive, text-based interface that let me see all the data right away and use individual keystrokes to explore it. It was built to be hackable, both internally and at runtime, and that made it really easy to add new features, which I did for a while.
At my next job, I wished for years that I had that tool I’d written at F5, so I could extend it to handle other formats (I was working with JSON and CSV and HDF5 files all at the same time). So when that job started winding down, I decided to rebuild a more general, open-source version of it from scratch, once and for all. I started sketching out the project in late 2016, and within a couple of weeks, I had released an early version.
Things really accelerated the next year. I was lucky enough to get accepted to the Recurse Center for the Spring 2017 batch where I met Anja, who is now VisiData’s co-maintainer. It’s also when VisiData got traction, largely thanks to a successful announcement on Hacker News.
Where did the name for VisiData come from?
VisiData’s name is a direct reference to VisiCalc, the very first consumer spreadsheet application, released in the late 70s. But despite the cosmetic similarities, VisiData has a much different philosophy.
There are basically three ways of looking at data: row-wise, column-wise, and cell-wise. Life comes at you in rows, and rows are how transactional databases organize their data. Computation, on the other hand, is a lot more efficient if the data is organized column-wise; this is the organizing principle behind NumPy/Pandas/Arrow/Ibis or basically any serious computational library.
A cell-wise architecture is neither transactional nor efficient, but it’s fine for small amounts of data, and it’s actually good for top-level summaries and human-scale data flow. The original VisiCalc had a cell-based design — the fundamental unit of interaction is the individual cell. And since then, every interactive spreadsheet tool has copied VisiCalc’s cell-based design: Lotus 1-2-3, Excel, Google Sheets, you name it.
In VisiData, the fundamental operations are row-based and column-based. You can interact with single cells, or a selected set of cells, but by default you’re dealing with whole rows or whole columns. A cell-based spreadsheet becomes unwieldy at larger scales; when working with data, you really want things organized consistently in columns and rows. This is VisiData distilled to its essence, and so the name is like VisiCalc, but for any kind of structured data. VisiData lets you see into your data.
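The three ways of looking at data described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how VisiData stores anything internally; the sample records and field names are made up.

```python
# Hypothetical sample data; the field names are invented for illustration.
rows = [
    {"city": "Seattle", "temp_f": 54, "rain_in": 0.3},
    {"city": "Austin", "temp_f": 81, "rain_in": 0.0},
    {"city": "Boston", "temp_f": 47, "rain_in": 0.1},
]

# Row-wise: each record arrives whole, so adding a new one is a single append.
rows.append({"city": "Denver", "temp_f": 62, "rain_in": 0.0})

# Column-wise: one list per field, so a computation over one field
# only has to walk that field's list.
columns = {name: [row[name] for row in rows] for name in rows[0]}
mean_temp = sum(columns["temp_f"]) / len(columns["temp_f"])

# Cell-wise: every value is addressed on its own by (row index, column name),
# the way a classic spreadsheet addresses a cell.
cells = {(i, name): value for i, row in enumerate(rows) for name, value in row.items()}
```

The same values live in all three structures; what changes is which operation is cheap: appending a record, scanning a field, or poking at one cell.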
Are there any overarching goals of VisiData that drive design or implementation?
It should be smooth, like butter! I try to remove all the inefficiencies I can see — both for people who use the tool, and for me as a developer. It keeps it light, which changes the game. There’s a reason delightful has the word light in it!
This relates to another of VisiData’s goals: Rather than solve one big problem, VisiData aims to solve as many of the small problems as it can so that you can focus on the things that can’t be automated away — actually analyzing your data.
VisiData’s codebase also strives for hackability, in the sense that it should all be able to fit in one person’s brain.
What trade-offs have been made in VisiData as a consequence of these goals?
Some parts of the codebase are very hackability-focused, packing a ton of expressiveness into terse constructs. But this comes at the cost of some standard “best” practices, and makes the project less inviting to short-term contributors.
What is an interesting problem that’s been solved in VisiData, so far (code links encouraged)?
Recently, we added support for selectively reading files in remote ZIP archives. Surprisingly, I couldn’t find any libraries that provided this functionality, so I wrote the reader from scratch, taking advantage of HTTP range requests and some technical details of the ZIP file format. So now you can just point VisiData to a .zip URL, navigate the archive, and load only the portion of it that you need, say a single .csv file.
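For readers curious how reading one file out of a remote ZIP can work, here is a minimal sketch of the general technique. It is not VisiData's reader. To stay runnable offline, the hypothetical `fetch_range` function slices an in-memory archive where a real client would send an HTTP GET with a `Range: bytes=start-end` header, and the archive size is passed in directly where a real client would likely take it from a `Content-Length` header. The sketch also assumes no archive comment, no ZIP64, and no encryption.

```python
import io
import struct
import zipfile
import zlib

# Build a small archive in memory to stand in for a remote .zip file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("notes.txt", "not needed\n" * 1000)
    zf.writestr("data.csv", "id,name\n1,ada\n2,grace\n")
ARCHIVE = buf.getvalue()
bytes_fetched = 0

def fetch_range(start, end):
    """Hypothetical stand-in for an HTTP range request (end is inclusive)."""
    global bytes_fetched
    chunk = ARCHIVE[start:end + 1]
    bytes_fetched += len(chunk)
    return chunk

def list_members(total_size):
    """Map each member name to (local header offset, compressed size, method)."""
    # With no archive comment, the End of Central Directory record
    # is exactly the last 22 bytes of the file.
    eocd = fetch_range(total_size - 22, total_size - 1)
    sig, _, _, _, count, cd_size, cd_offset, _ = struct.unpack("<4sHHHHIIH", eocd)
    if sig != b"PK\x05\x06":
        raise ValueError("EOCD not found (archive comment present?)")
    # One more request pulls the whole central directory.
    cd = fetch_range(cd_offset, cd_offset + cd_size - 1)
    members, pos = {}, 0
    for _ in range(count):
        fields = struct.unpack("<4sHHHHHHIIIHHHHHII", cd[pos:pos + 46])
        method, csize = fields[4], fields[8]
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        offset = fields[16]
        name = cd[pos + 46:pos + 46 + name_len].decode("utf-8")
        members[name] = (offset, csize, method)
        pos += 46 + name_len + extra_len + comment_len
    return members

def read_member(offset, csize, method):
    """Fetch and decompress a single member, given its central directory entry."""
    # The 30-byte local header ends with the name and extra-field lengths;
    # the member's data starts right after those two variable-length fields.
    header = fetch_range(offset, offset + 29)
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    start = offset + 30 + name_len + extra_len
    raw = fetch_range(start, start + csize - 1)
    # Method 8 is raw DEFLATE; method 0 is stored (uncompressed).
    return zlib.decompress(raw, -15) if method == 8 else raw

members = list_members(len(ARCHIVE))
csv_text = read_member(*members["data.csv"]).decode("utf-8")
```

Listing the archive takes two range requests and reading one member takes two more, so the bytes belonging to `notes.txt` are never transferred.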
Are there any competitors or projects similar to VisiData? If so, what were they lacking that made you consider building something new?
There are other terminal-based tools like sc-im, but they’re very much oriented around the classic, cell-based spreadsheet concept. There are also data manipulation tools like OpenRefine and Tableau, but those are all GUIs; there are CLI tools like miller and jq, but they are format-specific and non-interactive. I wanted an interactive data manipulation tool I could use in the terminal.
If you plan to continue developing VisiData, where do you see the project heading next?
VisiData currently can handle “millions of rows” — I’d say it maxes out around 1GB of data, and even then it’s a bit of a grind. But modern database engines can store 100GB or more on a single node, and execute optimized SQL queries against that data with excellent results.
So I’m focusing a lot of my attention on improving how VisiData interacts with SQL databases. Currently, VisiData can load data from several SQL databases, but it starts loading all the data into RAM. I’m working on a new plugin, vdsql, that uses the Ibis library to build expressions and send actual SQL to your databases, so you can use VisiData to explore databases using their own, performant engines.
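The difference between loading everything into RAM and letting the database do the work can be sketched with Python's built-in `sqlite3` module. This is not vdsql or Ibis code, just an illustration of the idea; the `orders` table and its contents are made up.

```python
import sqlite3

# Hypothetical table standing in for a much larger database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", 7.5), ("north", 1.0)],
)

# Approach 1: pull every row into Python memory, then aggregate locally.
all_rows = conn.execute("SELECT region, amount FROM orders").fetchall()
local_totals = {}
for region, amount in all_rows:
    local_totals[region] = local_totals.get(region, 0.0) + amount

# Approach 2: send the aggregation as SQL, so the engine does the work
# and only the summary rows come back.
pushed = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
pushed_totals = dict(pushed)
```

Both approaches produce the same totals, but the second transfers one row per region instead of one row per order, which is what matters once the table no longer fits in memory.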
What motivates you to continue contributing to VisiData?
I love it when people get into VisiData and post their thoughts. From Hillel Wayne going from installation to “holy shit” in under 20 minutes to Luke Plant’s thoughtful blog post “Everything is an X”. Also the community that’s growing around it, and the support from my sponsors on Patreon and our corporate sponsors October Swimmer and GenUI.
It also helps that we’ve worked to remove some of the frictions of open-source maintenance. One of those points of friction, for many maintainers, is their issue-tracker’s never-ending backlog. I think it’s important not to have that weighing on you. Users should feel free to file feature requests for things they want, and I should feel free to close the issue without prejudice – even if I think the feature is an okay idea! In the VisiData repo, we call this “Kondo’ing,” a reference to Marie Kondo’s cleaning process — if you’re not actually going to do it, close it.
Another thing we do to keep us motivated is a simple pinned discussion called “I ❤️ VisiData,” where we keep track of nice things that people say about the project. It’s a great resource when we’re looking for testimonials too.
Interested in sponsoring the newsletter or know of any cool projects or interesting developers you want us to interview? Reach out at osh@codesee.io or mention us @ConsoleWeekly!