Console #142 -- Interview with Simon of Tube Archivist - self hosted YouTube media server
Featuring Netlify CMS, shadcn UI, and Tube Archivist
🤝 Sponsor - CodeSee
CodeSee helps your contributors understand, build and refactor your open source project without guesswork. You can unlock contributions by instantly mapping and automating app and code changes saving you 10 hours a week. Download our Code Visibility: Practical Guide
🏗️ Projects
Browse through open source projects on OpenSourceHub.io, add your project to get more exposure and connect with other maintainers and contributors!
⚡ Netlify CMS
A Git-based CMS for Static Site Generators
language: JS, stars: 16052, last commit: May 2022
repo: github.com/netlify/netlify-cms
site: netlifycms.org
🧩 shadcn/ui
Beautifully designed components built with Radix UI and Tailwind CSS.
language: TypeScript, stars: 6195, last commit: yesterday
repo: github.com/shadcn/ui
site: ui.shadcn.com
📹 Tube Archivist
Your self hosted YouTube media server
language: Python, stars: 2221, last commit: today
repo: github.com/tubearchivist/tubearchivist
site: tubearchivist.com
Join thousands of other open-source enthusiasts and developers in the Open Source Hub Discord server to continue the discussion on the projects in this week's email!
🎙️ Interview With Simon of Tube Archivist
Hey Simon! Thanks for joining us! Let us start with your background.
My name is Simon, and I’m originally from Switzerland. After some extensive traveling, I’m now living in Southeast Asia. I’ve worked a variety of different jobs, not all related to tech, but most recently I worked in business automation, developing integrations between systems that aren’t meant to integrate with each other. Recently I have started looking for a more traditional software engineer role.
I got hooked on programming after I first installed Linux in 2014 and started creating various little Bash scripts to do repetitive tasks for me, then things kind of snowballed from there. I mostly learn by solving problems, some imaginary, some real. Whenever I have a task that is kind of tedious, I spend an unreasonable amount of time writing a tool to automate it. Sometimes this becomes useful for others: people provide feedback, start to contribute and, before you know it, a new project is born.
I’m most productive in Python, as that’s the language I’m most familiar with, and the first real programming language I learned. Learning resources are abundant, from excellent YouTube channels and blog posts to the documentation itself.
What’s your most controversial programming opinion?
Just because there is a library to do what you want to do, that doesn’t mean you should use it. Depending on the complexity, creating your own integration that you control and understand is usually better than using somebody else's wrapper.
If I gave you $10 million to invest in one thing right now, where would you put it?
Do I have to make that money back? :-) If not, I’d probably start funding some of the low-level but ubiquitous open source projects that make up the foundation of our modern infrastructure. Projects and organizations like Curl, OpenSSL and Apache, or libraries like Django and Flask, come to mind, but I’m sure I’m missing some obvious ones that I’m not even aware of. I’d finance development time, professional code reviews, pentesting, bug bounty programs or just regular maintenance tasks. The societal benefit would be huge and the risks of not doing it are terrifying. I’m surprised that this is not part of regular government infrastructure projects.
But if I had to make that money back, I’d invest it in software solutions that analyze energy usage and identify inefficiencies in industrial processes or residential applications. I see a huge growth market in this area, where automated and intelligent control systems could both reduce costs for businesses and homes and help stabilize power grids while transitioning to renewable energy sources. Just something I’ve been thinking about…
What are you currently learning?
I’m always learning. Recently I’ve been going deeper into WebSocket programming. I have a basic understanding, but I want to learn more. From a language perspective, I’ve started to write some smaller projects in Go, kind of a natural successor to Python.
What have you been listening to lately?
Among many channels on YouTube, I have particularly been enjoying listening to Hussein Nasser, an excellent resource for backend engineering concepts, and ColdFusion for tech news.
Why was Tube Archivist started?
Various factors contributed to that. As I was learning how to program, I started downloading useful YouTube tutorial videos to keep them locally available. As my collection grew, this became hard to maintain: I couldn’t find a specific video even though I remembered downloading it, the tutorial series playlists were all out of order, and it was all a lot of manual, tedious work.
Then the pandemic started and I unexpectedly had some time on my hands. I had been meaning to create a complete application as a portfolio project to help me apply for software engineering jobs, so I started combining my loose collection of Bash scripts into a self hosted web application to index my collection, automate the different tasks and make everything searchable.
Also, some time ago there were changes in the TOS of YouTube regarding what is considered educational IT security information and what is considered inciting illegal behavior. I lost access to a lot of interesting videos on YouTube, either because they were taken down by YouTube or because the creators removed them themselves, in a self censoring way, to avoid getting strikes.
Folks have different motivations for archiving videos. Most obviously, some want to avoid losing access to them for whatever reason. Others are typical datahoarders, collecting whatever they can. Others want to get away from the YouTube algorithm, the tracking and all the distractions, to just focus on the content they enjoy. Some want to curate what videos or channels their kids are allowed to watch. These examples intersect, and there are probably more that I’m not aware of.
How does Tube Archivist work?
Tube Archivist - your self hosted YouTube media server - builds on the excellent standalone command line program yt-dlp. yt-dlp takes care of downloading the videos and extracting the metadata, which Tube Archivist then parses and indexes into a database. The archive is then presented in a convenient web interface that mimics the functionality of YouTube, but everything stays local on your own server.
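The pipeline described above (extract metadata, parse it, index it, search it) can be sketched roughly as follows. This is only an illustration and not Tube Archivist's actual code: the `Archive` class, the `to_document` helper and the in-memory index are hypothetical stand-ins for the real database, and `fetch_metadata` assumes the third-party `yt_dlp` package is installed.

```python
from collections import defaultdict


def fetch_metadata(url):
    """Ask yt-dlp for a video's metadata without downloading the media."""
    import yt_dlp  # third-party; imported lazily so the rest runs without it

    options = {"quiet": True, "skip_download": True}
    with yt_dlp.YoutubeDL(options) as ydl:
        return ydl.extract_info(url, download=False)


def to_document(info):
    """Keep only the fields this sketch indexes (hypothetical selection)."""
    return {
        "id": info["id"],
        "title": info.get("title", ""),
        "channel": info.get("channel", ""),
        "upload_date": info.get("upload_date", ""),
        "duration": info.get("duration", 0),
    }


class Archive:
    """Toy in-memory index standing in for a real database."""

    def __init__(self):
        self.documents = {}
        self.index = defaultdict(set)  # lowercase word -> set of video ids

    def add(self, info):
        doc = to_document(info)
        self.documents[doc["id"]] = doc
        for word in f'{doc["title"]} {doc["channel"]}'.lower().split():
            self.index[word].add(doc["id"])
        return doc

    def search(self, word):
        ids = sorted(self.index.get(word.lower(), ()))
        return [self.documents[video_id] for video_id in ids]
```

In use, `archive.add(fetch_metadata(url))` would index one video, and `archive.search("bash")` would return every indexed document whose title or channel contains that word.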
Why did you pick Python?
Tube Archivist is based on yt-dlp, which is built with Python. To use yt-dlp as a library, it makes sense to interface with it in Python as well.
Do you have any estimate on the number of users?
I do have some statistics, for example on the GitHub repo, where average unique visitors are usually around 1k per 14 days, but that fluctuates a lot whenever the project gets mentioned somewhere. As this is a privacy focused self hosted project, I don’t collect any analytics, so I don’t really know how many people use it. But the browser extension now has 45 stars on GitHub and 700 active installs reported from Firefox and Chrome. Is that going to be the same ratio of users to stars as for the 2.2k stars on the main project repo? Probably not a very accurate metric…
Are there any overarching goals of Tube Archivist that drive design or implementation? If so, what trade-offs have been made in Tube Archivist as a consequence of these goals?
It’s built to scale. I know of a few people who have 500k to 1M videos indexed, all with full text search over all the subtitles. You basically get your own personal search engine, where you can search as you type through your whole archive, and a lot of fancy stuff like that. Being able to do that on your own hardware requires some overhead. So if you just want to archive a few videos here and there, this project will be convenient but way overkill, like driving your sports car just up and down your driveway. That is the tradeoff of building something self hosted that scales as is: if you don’t need it to scale, you still have the overhead.
What is the most challenging problem that’s been solved in Tube Archivist, so far?
The parsing of auto generated subtitles from YouTube was a big challenge, as they use a proprietary format for structuring fragments basically word by word. The format isn’t documented anywhere and there are a lot of edge cases that need to be handled so that it can get indexed and written to a regular VTT file for the media player to display [link].
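To give a feel for the kind of conversion involved, here is a minimal sketch that joins word-level fragments into WebVTT cues. The input shape (`start_ms`, `duration_ms`, `words`) is a simplified, hypothetical structure invented for this example. It is not YouTube's actual format, and it skips the many edge cases mentioned above.

```python
def format_timestamp(ms):
    """Render milliseconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def fragments_to_vtt(events):
    """Turn word-level fragments (hypothetical shape) into a WebVTT string."""
    cues = []
    for event in events:
        # Join the word fragments of one event into a single cue text.
        text = "".join(event.get("words", [])).strip()
        if not text:
            continue  # skip fragments that carry no visible text
        start = event["start_ms"]
        end = start + event.get("duration_ms", 0)
        cues.append(
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n{text}"
        )
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"
```

The same joined cue text could also be sent to the search index, so the file for the media player and the full text search would share one parsing step.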
I also did a lot of tinkering with the search functionality, to have one search bar that handles everything from simple queries to complex structured queries, inspired by GitHub’s search syntax. Taking that input and processing it into a query that the database can understand was very challenging. That complexity is hidden from the user who just wants to make a simple query to find a video, but it is available if you want to get specific [link].
But looking at these code links, I should revisit that for some refactoring…
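The idea of a single search bar that accepts both plain words and GitHub-style `key:value` filters can be sketched like this. It is an illustration only: the filter keys in `KNOWN_KEYS` and the shape of the returned dictionary are assumptions made for this example, not Tube Archivist's actual syntax or database query format.

```python
import shlex

# Hypothetical filter keys, chosen only for this sketch.
KNOWN_KEYS = {"channel", "playlist", "title"}


def parse_query(raw):
    """Split a search string into key:value filters and free text."""
    try:
        # shlex keeps quoted values together: channel:"Some Name"
        tokens = shlex.split(raw)
    except ValueError:
        # Unbalanced quotes (e.g. an apostrophe): fall back to plain words.
        tokens = raw.split()

    filters = {}
    terms = []
    for token in tokens:
        key, sep, value = token.partition(":")
        if sep and value and key.lower() in KNOWN_KEYS:
            filters[key.lower()] = value
        else:
            terms.append(token)
    return {"filters": filters, "text": " ".join(terms)}
```

A simple query passes through as free text only, while a query such as `channel:"Hussein Nasser" websockets` yields one filter plus the remaining text, which a later step could translate into whatever the database expects.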
Are there any projects similar to Tube Archivist? If so, what were they lacking that made you consider building something new?
There are various frontends to yt-dlp. Notable mentions are YoutubeDL-Material, Tubesync and Metube, but they are primarily downloaders, not media servers. Then there are various projects out there that try to index YouTube videos in existing media servers, like the YouTube-Agent bundle for Plex or the Metadata Plugin for Jellyfin. These are valid approaches, but you are limited in what you can do, as these media servers are built for TV shows and movies, so YouTube videos need to fit into that metadata structure. Having a dedicated solution gives the flexibility to build the application around the metadata and not the other way around.
What was the most surprising thing you learned while working on Tube Archivist?
How hard it is to define the scope of a project. Like, what do you really want to accomplish with this project? And how do you communicate that, so it doesn’t end up being one of these projects that tries to do too many things but none of them well? I’m still learning that, and how to deal with the disappointment from people when I have to tell them, no, I’m not going to rewrite my application just to fit your very specific need.
Is Tube Archivist intended to eventually be monetized?
Not the core project itself, but I do have some ideas of additional services around it to overcome some inherent limitations of the concept. But that’s just in some early tinkering stages at the moment.
Have you ever experienced burnout? How did you deal with it?
From time to time, I’ll write in the Discord and on my GitHub profile that I’ll be on holiday and won’t be replying. I’m usually not actually on holiday, like on a beach somewhere; it’s more of a holiday from notifications. So whenever I feel overwhelmed or when I get unreasonably irritated, I’ll take a step away. So far that has been enough to keep my mental health intact.
I’m also lucky, I have an excellent group of people helping out on Discord and on GitHub. That takes a lot of pressure away from me. When people have questions or experience an issue, most of the time there is somebody there who can assist, then I can still join in if need be to verify a bug for example, or for some more complicated problems.
If you plan to continue developing Tube Archivist, where do you see the project heading next?
This project is now a little over one year old; v0.0.1 was released on Sep 15, 2021. So even though the project is very young, we have a good number of users. But rapid growth is not purely a good thing. I’ll take some time to focus on things I know I should have done from the beginning, rewrite some code that became hard to understand and unnecessarily entangled as new features were added, and fix some other architectural problems I couldn’t foresee when I first posted this on Reddit a bit more than a year ago as a little portfolio project. These are all fixable things, and fixing them will help set this project up for long term success.
What motivates you to continue contributing to Tube Archivist?
I use it all the time. It has basically replaced YouTube for my regular viewing of my subscriptions. I do go back sometimes, to search for something specific or whenever I want to discover something new. But I benefit the most from every improvement I make to Tube Archivist. Things become a bit different when implementing a feature request that I wouldn’t want to use myself, but usually, if I see the use case for others, I’m happy to assist and merge a pull request.
Are there any other projects besides Tube Archivist that you’re working on?
I have a Raspberry Pi on our balcony that functions as a little air quality and weather measurement station, collecting data and sending it to a web server to store, analyze and display current and past values. That’s a personal project, not really reusable for others, but it was a great project for learning some data analysis fundamentals. I still make minor improvements to it here and there.
Then there is also tilefy.me [link], a small Docker container that dynamically creates customizable PNG tiles with project statistics like your GitHub stars or your Docker pulls, basically anything that’s accessible over a public API. That’s inspired by GitHub badges, but more customizable with your branding, logo, font and color scheme. A little bit silly, I know, but it’s working quite well. I may be the only person using it. Sometimes you need to enjoy life with a whimsical distraction like that.
Do you have any suggestions for someone trying to make their first contribution to an open-source project?
Just do it. Starting is probably the most difficult step. Usually developers and maintainers are happy to help you get started and find your way around the code base. I think most people overestimate the level of proficiency needed to be useful in a project. Plus, besides coding, there are a lot of things you could be helping with, from triaging issues to helping people on Discord. You may or may not be surprised, but the same issues come up over and over. If somebody like you can be the friendly reception person, directing folks to the documentation pages, this frees up valuable head space for others.
Want to join the conversation about one of the projects featured this week? Drop a comment, or see what others are saying!
Interested in sponsoring the newsletter or know of any cool projects or interesting developers you want us to interview? Reach out at osh@codesee.io or mention us @ConsoleWeekly!