Taxonomy is Hard

Posted by Curiositry on October 21st, 2022. Tagged: Essay, Tech, Linux, Data, Organization

We all have data that we need to store, and then find. Regardless of type, data tends to build up. Eventually, we need some system for organizing it into sensible categories.

It turns out, this problem is harder than it seems.

In this article, I'm going to be talking about organizing digital files. However, most of the problems (and some of the solutions) also apply to paper files, spreadsheet tables, databases — and even physical objects.

Let's dig in.

The Naïve Categorization System

Starting out, most of us come up with a categorization system that makes intuitive sense, following the general guideline of “a place for everything, and everything in its place.”

Everyone's mind works differently. The structure of people’s categorization systems tends to reflect the way their minds work. Some people have hundreds of files piled in visual heaps on the digital (or physical) desktop, and it works for them. But most of us have some kind of nested folder structure, and most people I know organize it by type or project.

Here's an example:

writing/
    fiction/
        in progress/
        draft/
        submitted/
        published/
    non-fiction/
    poetry/
        song lyrics/
    blog/
        post title/
reading/
photos/
learning/
music/
    recordings/
    practice and learning materials/
    original music/
    traditional songs and covers/

Obviously this is incomplete, but you get the idea. Organize things by context, in a way that makes sense, without overthinking it.

But you're probably already overthinking it, and if so, you have likely noticed several serious taxonomic flaws with the above directory structure!

The Problems: Examples

Exhibit A: The song lyrics directory is a subdirectory of writing. But song lyrics also belong in the music directory.

Exhibit B: Some of the photos in the photos directory are used in the blog that's in the writing directory.

I can see five ways of addressing Exhibit B:

Move the photos that are used on the blog into the relevant post folder in the blog directory, and lose the convenience of having all photos in one place.
Make a blog subdirectory in photos with all the photos that are used on the blog, and then never know exactly which photos go with which post.
Copy the photos that are used by the blog into the relevant blog subdirectory, leaving the canonical original copy with the original filename in the photos directory, and then have duplicates of those photos on the filesystem.
Get rid of the photos directory altogether, and put all photos in the folder of the project they belong to.
Give up and bury my head in the sand.

All the options are kind of horrible. But the fourth one, putting everything in the folder of the project it belongs to, seems promising.

The Naïve Solution: Project-Based Taxonomies

Friends have recommended a project-based categorization system, and I have started to move to one. It solves many of the problems. For example, we don't have the problem of where to put speculative poetry (fiction, or poetry?), or any of the other problems that the above brittle hierarchy exhibits. Instead, we have a folder for each project (song, story, essay, etc.) that everything related to that project goes in: text, photos, reference material, everything.

Most often, the context we are working in is single-project, so this works well. I sit down to work on project X, not to work on “Lists and Spreadsheets” (another directory; not kidding). I have everything I need to work on that project, right at my fingertips.

But we still have two problems: duplication of files that are used in multiple projects, and loss of convenient access to files by type.

The Problems, in Abstract

The re-use problem. Data is used by multiple projects.
- Does a book review belong in reading, or writing?
The interface problem. Sometimes we want to view data by type, not by project.
- That photos folder, even though it causes all kinds of problems, is there for a reason: sometimes we want to look at all the photos.
The mixed-quality problem. For every project, there is data that is invaluable and crucial, as well as a bunch of cruft created along the way. Project-based taxonomies tend to mix this all together in a way that is unpleasant.
- Sometimes we want to exclude low-quality or third-party data from back-ups, due to limited space (or time).
The project-boundary problem. The boundaries of a “project” are fuzzy.
- I have an algebra rules website. I'm working on an algebra rules poster. I abandoned an algebra rules course. Are these three projects, or one?

Symlinks to the Rescue?

We can solve two of these problems at once, the re-use problem and the interface problem, by using symbolic links. We can have our avocado toast and eat it too, by having type folders and project folders!

photos
  repellent-author-portrait.jpg

mangy-blogpost
  symlink-to-repellent-author-portrait.jpg
navelgazing-short-story
  another-symlink-to-same-repellent-author-portrait.jpg

Now, we have projects that contain everything we need, and we have a convenient type-based interface for all photos. Plus, we only store one copy of the file on disk.

Have we solved the hard problems of taxonomies yet?

Uh... not quite.

Symlinks are brittle. As soon as you re-organize the photos folder, or rename something further up the tree, all the symlinks to that original .jpg break. It sucks. Spreadsheets update references when data is moved; why can't we have nice things like that on the filesystem?
Having a bajillion symlinks to one file is gross. I don't know what it is exactly, but there's something ... messy about this approach. Symlinks aren’t technically duplicate files, but they feel like they are.
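To make the brittleness concrete, here is a minimal Python sketch. It is an illustration only: the temporary directory and the portraits subfolder are made up for the demo. It symlinks a photo into a project folder, then reorganizes the photos folder, and checks whether the link still resolves.

```python
import os
import tempfile


def demo_symlink_breakage(root):
    """Symlink a photo into a project folder, then move the photo."""
    photos = os.path.join(root, "photos")
    project = os.path.join(root, "mangy-blogpost")
    os.makedirs(photos)
    os.makedirs(project)

    original = os.path.join(photos, "repellent-author-portrait.jpg")
    with open(original, "wb") as f:
        f.write(b"fake image bytes")

    link = os.path.join(project, "repellent-author-portrait.jpg")
    os.symlink(original, link)
    works_before = os.path.exists(link)  # exists() follows the link

    # Reorganize the photos folder. The link still stores the old path.
    portraits = os.path.join(photos, "portraits")
    os.makedirs(portraits)
    os.rename(original, os.path.join(portraits, "repellent-author-portrait.jpg"))

    still_a_link = os.path.islink(link)  # the link itself survives
    works_after = os.path.exists(link)   # but it now points at nothing
    return works_before, still_a_link, works_after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(demo_symlink_breakage(root))
```

The link is only a stored path. Nothing on the filesystem notifies it when the target moves, so it dangles silently until something tries to open it.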

Tags to the Rescue!

Directories are hierarchical, and life is not. The best solution I’ve come up with is to give up on directory structures and symlinks, and use tags for all organization.

Now, we can tag that author photo with project:storytitle, project:essaytitle, and type:photo. This means:

Zero file duplication (not even symlinks!)
Multiple convenient overlapping “interfaces” — just view all files tagged project:storytitle, or type:photo (you could call it project:photos)
Freedom from the tyranny of rigid directory structures. Throw all your files in one giant potpourri folder — or keep your existing broken directory structure as is. Tag-based taxonomies enhance your existing organization system, rather than replacing it. (You can use your existing directory structure to automatically apply project and type tags to your files.)
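As a rough sketch of the idea (not any particular tool's design), a tag-based taxonomy can be modeled as a mapping from tags to sets of files. The TagIndex class, the tags_from_path helper, and the suffix-to-type table below are all hypothetical names invented for this example. The helper assumes, purely for illustration, that the top-level folder names the project.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from file suffix to a type tag.
TYPE_BY_SUFFIX = {".jpg": "photo", ".png": "photo", ".md": "text", ".txt": "text"}


def tags_from_path(path):
    """Derive tags from an existing directory structure.

    Assumed convention: the first directory is the project name,
    and the file suffix decides the type.
    """
    p = PurePosixPath(path)
    tags = set()
    if len(p.parts) > 1:
        tags.add(f"project:{p.parts[0]}")
    kind = TYPE_BY_SUFFIX.get(p.suffix.lower())
    if kind:
        tags.add(f"type:{kind}")
    return tags


class TagIndex:
    """In-memory tag store: each tag maps to the set of files carrying it."""

    def __init__(self):
        self._files_by_tag = defaultdict(set)

    def tag(self, path, *tags):
        for t in tags:
            self._files_by_tag[t].add(path)

    def view(self, *tags):
        """Return the files that carry every one of the given tags."""
        sets = [self._files_by_tag.get(t, set()) for t in tags]
        return sorted(set.intersection(*sets)) if sets else []


if __name__ == "__main__":
    index = TagIndex()
    index.tag("repellent-author-portrait.jpg",
              "project:storytitle", "project:essaytitle", "type:photo")
    draft = "mangy-blogpost/draft.md"
    index.tag(draft, *tags_from_path(draft))
    print(index.view("type:photo"))
    print(index.view("project:storytitle", "type:photo"))
```

Each call to view is one of the overlapping "interfaces": the same file shows up under every project it belongs to and under its type, with only one copy on disk.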

File Tagging in the Real World

We have a promising theoretical solution to The Hard Problem of Taxonomy. But the software to implement it is still in diapers.

Metadata Storage

There are several places tags for a file can be stored. None of them are a panacea.

Extended file attributes (called “alternate data streams” on Windows, and “named forks” on Mac) are a filesystem-level way of storing metadata. They are ‘attached’ to the file in a sense, and travel with it where extended attributes are supported, but they aren't embedded in the file itself.
- Pros: works with all filetypes, all major operating systems support them in some form, they travel with the file when it is moved or renamed.
- Cons: not many tagging tools support them, poor operating system level integration on Linux and Windows, different operating systems’ implementations are largely non-interoperable, can only store ~4 KB of metadata on some filesystems, poor portability.
Sidecar files. These are typically plaintext or XML files that are hidden by default, most often stored 1:1 alongside the single file the metadata sidecar file refers to.
- Pros: portable, simple, no size or character-set limits for metadata contained in them.
- Cons: multiple competing non-interoperable formats for tag sidecar files, creates clutter, hidden files can get left behind, can easily become unlinked if the file referred to is moved or renamed.
File-specific embedded metadata such as EXIF data, ID3 tags, Vorbis comments, IPTC keywords, et cetera.
- Pros: travels with the file, generally well-supported by operating systems and other software, industry standard for media files.
- Cons: not all filetypes have a place for it, different file formats have different fields.
Symbolic links. I'm not sure if it’s ever done, but technically you could have tag directories, and then symlink files into them? It would probably be a putrid approach. More often, files are tagged, and the file-tag relations are stored in a database, and symlinks are used as described below.
- Pros: simple.
- Cons: very primitive approach, breaks when files are moved.
In the filename. This is such a bad idea I don't even want to talk about it. However, who hasn’t tried it?
- Pros: simple, portable, widely supported.
- Cons: breaks your file naming system, makes your filenames horrible, limits tag length, limits characters that can be used in tags, may break spectacularly when moved to an OS with tighter constraints on allowed filenames.
A database. Rarely used on its own, though it could be. Generally, a database is used along with a virtual filesystem, with symlinks to the actual tagged files. A database is also generally required for any type of metadata that doesn't automatically travel with the file (such as sidecar files). A database is also generally needed to index tags stored in embedded metadata, or extended attributes. Storing tags only in a database seems like a bad idea, but pretty much any functional file tagging system is going to need some kind of database (often SQLite).
- Pros: you kinda need one.
- Cons: Fragile, not necessarily portable, needs some kind of filesystem watcher that keeps it updated.
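For the database option, a minimal sketch of file-tag relations in SQLite might look like the following. The table names and helper functions are assumptions made up for this example, not the schema of any existing tagging tool.

```python
import sqlite3

# Hypothetical three-table schema: files, tags, and the relation between them.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
CREATE TABLE IF NOT EXISTS tag  (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE IF NOT EXISTS file_tag (
    file_id INTEGER NOT NULL REFERENCES file(id),
    tag_id  INTEGER NOT NULL REFERENCES tag(id),
    PRIMARY KEY (file_id, tag_id)
);
"""


def open_db(location=":memory:"):
    db = sqlite3.connect(location)
    db.executescript(SCHEMA)
    return db


def add_tag(db, path, tag):
    """Attach a tag to a path, creating either row if it is new."""
    db.execute("INSERT OR IGNORE INTO file (path) VALUES (?)", (path,))
    db.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (tag,))
    db.execute(
        "INSERT OR IGNORE INTO file_tag (file_id, tag_id) "
        "SELECT file.id, tag.id FROM file, tag "
        "WHERE file.path = ? AND tag.name = ?",
        (path, tag),
    )


def files_with(db, tag):
    """List the paths carrying a tag, in alphabetical order."""
    rows = db.execute(
        "SELECT file.path FROM file "
        "JOIN file_tag ON file_tag.file_id = file.id "
        "JOIN tag ON tag.id = file_tag.tag_id "
        "WHERE tag.name = ? ORDER BY file.path",
        (tag,),
    )
    return [path for (path,) in rows]


def rename_file(db, old_path, new_path):
    """The database only knows paths as text, so it must be told about moves."""
    db.execute("UPDATE file SET path = ? WHERE path = ?", (new_path, old_path))
```

The rename_file function is where the fragility shows: if a file is moved with an ordinary file manager and nobody calls it, the stored path goes stale. That is why such a database needs a watcher or a repair step.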

Operating System Support for File Tagging

Here’s the state of things:

macOS has built-in, filesystem-level tagging via extended file attributes. Tags are also stored in .DS_Store (a sidecar file that's already used for other things) as a fallback. These tags have first-class support in Finder and Spotlight.
Windows supports tagging some types of files using IPTC keywords. The file manager has support for finding things via tags, but not for managing tags or tagging files in bulk. Windows also has its own version of extended attributes (“alternate data streams”), but nothing much uses them. They’re used to store the URL of files downloaded from the internet, so that Windows can display a warning when the user tries to run them.
Linux is still in the stone age when it comes to tagging. Common Linux filesystems (including ext4 & ZFS) support extended attributes, but I'm not aware of any Linux distro or file manager that includes tagging features based on them (or embedded metadata, for that matter).

Third-Party Tagging Software for Linux

Third-party tagging software for Linux includes the command-line tools TMSU, SuperTag, and Tagsistant, and a cross-platform GUI app called TagSpaces (warning: embeds tags in filename by default!). TMSU seems to be the most robust and recently updated contender. However, most of these are built on a database + virtual filesystem + symlinks, and inherit their problems. Ick!

I don’t know why none of the options use extended attributes. They could use the setfattr and getfattr commands from the attr package. Likely it comes down to portability: other apps tend to drop extended attributes (including core utilities like cp, unless given a flag such as --preserve=xattr), .zip archives don't preserve them (tar does, with the --xattrs flag), and not all filesystems (notably FAT32) support them in the first place. For more on extended attributes (which seem like the most promising option) see Extended attributes: the good, the not so good, the bad.
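As a sketch of what extended-attribute tagging could look like, here is a small Python example using os.setxattr and os.getxattr, which Python provides on Linux only. The attribute name user.xdg.tags and the comma-separated format follow a freedesktop.org convention as far as I know; treat both as assumptions. The helper names are made up for this example.

```python
import os

# Assumed attribute name and format: a comma-separated list under user.xdg.tags.
ATTR = "user.xdg.tags"


def write_tags(path, tags):
    """Store tags in an extended attribute. Returns False if that isn't possible."""
    if not hasattr(os, "setxattr"):
        return False  # Python only exposes xattr calls on Linux
    try:
        os.setxattr(path, ATTR, ",".join(sorted(tags)).encode("utf-8"))
    except OSError:
        return False  # e.g. the filesystem does not support user attributes
    return True


def read_tags(path):
    """Read tags back. Returns an empty set if there are none or xattrs are unsupported."""
    if not hasattr(os, "getxattr"):
        return set()
    try:
        raw = os.getxattr(path, ATTR)
    except OSError:
        return set()
    return {t for t in raw.decode("utf-8").split(",") if t}
```

Note the limitations this inherits: a tag containing a comma would be split on reading, and the attribute vanishes if the file passes through a tool or filesystem that drops extended attributes.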

It might be possible to keep symlinks from going stale with a file manager plugin that updates them when files are moved or renamed, or an indexer — but I’ve asked, and haven’t heard of any existing solution.

The more I look into TMSU, though, the more promising it looks. It doesn’t keep symlinks updated automatically, but it has a repair function that will reattach tags to files that have been moved or modified, as long as they have not been both moved and modified! TMSU also offers commands for moving/renaming/deleting files while keeping tags fresh, and someone is even working on a Nautilus extension. Best of all, based on issues #10, #86, and #160, the developer seems open to the idea of supporting extended attributes, and has added them to the v0.8.0 milestone.
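The "moved or modified, but not both" caveat makes sense if you imagine a repair step that knows two things about each tagged file: its path and a fingerprint of its contents. The sketch below is my own illustration of that idea, not TMSU's actual implementation. The index format (a dict from path to SHA-256 digest) and the function names are assumptions.

```python
import hashlib
import os


def fingerprint(path):
    """Content hash of a file (reads the whole file; fine for a demo)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def repair(index, search_root):
    """Reattach index entries to files under search_root.

    index maps each tracked path to the fingerprint recorded when it was tagged.
    Returns (repaired_index, lost_paths).
    """
    # Fingerprint everything currently on disk so moved files can be found.
    # If two files have identical content, the first one found wins.
    on_disk = {}
    for dirpath, _dirs, names in os.walk(search_root):
        for name in names:
            candidate = os.path.join(dirpath, name)
            if os.path.isfile(candidate):
                on_disk.setdefault(fingerprint(candidate), candidate)

    repaired, lost = {}, []
    for path, digest in index.items():
        if os.path.isfile(path):
            # Same path: possibly modified, so just refresh the fingerprint.
            repaired[path] = fingerprint(path)
        elif digest in on_disk:
            # Path is gone, but identical content exists elsewhere: it was moved.
            repaired[on_disk[digest]] = digest
        else:
            # Neither the path nor the content matches: moved and modified.
            lost.append(path)
    return repaired, lost
```

A file that was only modified is still found by its path, and a file that was only moved is still found by its contents. Change both and there is nothing left to match on.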

(In)conclusion

While we wait for better software (or go build it!), we can go ahead and use something like TMSU to tag files — and keep our existing (imperfect, but more reliable) directory structure as a fall-back.

What are the limitations of tag-based organization? I have only just started moving my data to tag-based systems, so I can't say how well it works at scale. Tags have worked well in the few areas where I have been using them. From my reading, a common complaint is that tags get out of hand, and end up with duplicates and general inconsistency — requiring labour-intensive maintenance.

When I mentioned the problem of inconsistent tags to my brother — who has been using embedded-metadata tags for decades as a professional photographer — he patted me on the back and said, “Young man, you are at the start of a very long journey.”
