Human after all


I work at a music streaming company, and we spend a lot of time thinking about “recommendations”. At first glance, the importance of recommendations might not be obvious. Naively, we might imagine that a music streaming service operates like a library. When I go to a library, I typically have in mind the books I want to get. These have been recommended to me by friends or experts, or by anonymous reviews from faraway strangers. The function of the library, then, is just to provide a service to locate my desired book in the stacks. A music streaming platform, by analogy, might simply be a search bar connected to a play button: one in which, after I type “Lose Yourself”, Eminem starts rapping.

But no for-profit technology has ever been content with its own modest existence, and music streaming is no different. To some extent this makes sense. Listeners might want more out of a streaming service than literal streaming. They might want ways to organize their music into playlists, to share music with their friends, or to learn about new releases or upcoming concerts from their favorite artists. But they might also want someone else to tell them what to listen to next. Music is supposed to be a respite from hard work, not hard work itself.

One of the simplest recommendation algorithms, and likely the one taught first in most machine learning courses, is known as “collaborative filtering”. The basic idea is deceptively simple: similar users will listen to similar pieces of content. User tastes are assumed to be stable over time. And, therefore, users will be most likely to listen to recommendations of songs that are similar to ones they have listened to before, or to ones that users similar to them have listened to before.

Unfortunately, we usually lack explicit measurements of either “user similarity” or “content similarity”. Instead, these similarities are inferred from past listening data. I won’t go through the math, but the algorithm learns a vector that represents a user, and one that represents a piece of content (in this case, a song). Crucially, these vectors are embedded in the same space, so that the distance between a user and a song vector can be interpreted as the “similarity”, in some sense, of the user’s taste to the song’s musical and cultural qualities. Songs close to a user in vector space are, presumably, better recommendations for that user than songs far away.
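For the curious, the idea can be sketched in a few lines of Python. Everything here is invented for illustration: the play counts are made up, and a plain SVD stands in for whatever factorization a real system uses. But it shows the essential move: users and songs end up as vectors in the same space, and a dot product scores how well they match.

```python
import numpy as np

# Toy play-count matrix: rows are users, columns are songs.
# (All numbers are made up for illustration.)
plays = np.array([
    [5, 4, 0, 0],   # user 0 likes songs 0 and 1
    [4, 5, 0, 1],   # user 1 has similar taste, and also played song 3
    [0, 0, 5, 4],   # user 2 likes songs 2 and 3
], dtype=float)

# Low-rank factorization: user vectors and song vectors in a shared space.
k = 2
U, s, Vt = np.linalg.svd(plays, full_matrices=False)
user_vecs = U[:, :k] * s[:k]     # one row per user
song_vecs = Vt[:k].T             # one row per song

# Score every song for user 0 by dot product, then recommend the
# highest-scoring song that user 0 has not yet played.
scores = song_vecs @ user_vecs[0]
unheard = [j for j in range(plays.shape[1]) if plays[0, j] == 0]
best = max(unheard, key=lambda j: scores[j])
print(best)  # song 3: the one user 0's lookalike (user 1) played
```

The recommendation falls out of the geometry: because users 0 and 1 have nearly identical listening histories, song 3 (which only user 1 played) ends up closer to user 0 in the shared space than song 2 does.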

It is not difficult to see how this can become a virtuous cycle, at least from the company’s perspective. Users listen to music on the platform, implicitly revealing their “taste vector”. This listening data feeds into the recommendation algorithm (“collaborative filtering”), and it recommends increasingly better content the more data it gets. Higher-quality recommendations drive more users to the platform, and these users, in turn, generate even more data. Rinse and repeat. It’s worth noting that this cycle exists for most tech companies. In fact, it might even be taken as the definition of a tech company: i.e., a tech company is one whose algorithms grow more effective as their data expands, and whose data expands as their algorithms grow more effective. (See, for example, Shoshana Zuboff’s “The Age of Surveillance Capitalism”.)

The story is not quite so simple, though. Funnily enough, my employer largely does not use collaborative filtering, at least not in its simplest form. Instead, user-created playlists are used to derive similarities. A song is taken to be similar to another song if users tend to put them together on playlists that they create. Users are similar to other users if the songs they listen to are similar. The playlist that you create for your own enjoyment has thus become the fundamental building block of the algorithm, and of its conception of your “taste”.
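A toy version of that playlist-based similarity is just co-occurrence counting. The playlists and song titles below are invented for illustration, and a real system would normalize these counts in various ways, but the core signal is simply “how often do two songs appear on the same playlist?”:

```python
from collections import defaultdict
from itertools import combinations

# Toy user-created playlists (contents invented for illustration).
playlists = [
    ["Lose Yourself", "Till I Collapse", "Stan"],
    ["Lose Yourself", "Till I Collapse", "HUMBLE."],
    ["HUMBLE.", "DNA.", "Alright"],
]

# Count how often each pair of songs appears on the same playlist.
cooccur = defaultdict(int)
for pl in playlists:
    for a, b in combinations(sorted(set(pl)), 2):
        cooccur[(a, b)] += 1

def similarity(a, b):
    """Raw co-occurrence count as a crude similarity score."""
    return cooccur[tuple(sorted((a, b)))]

print(similarity("Lose Yourself", "Till I Collapse"))  # 2
print(similarity("Lose Yourself", "DNA."))             # 0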

The problem of recommendation, whether in music or not, is largely a “metadata” problem. Songs have metadata, like artist names, release dates, albums, genres, Pitchfork ratings, Wikipedia entries, popularity scores, and the aforementioned “taste vectors”. Some of these pieces of metadata are largely trivial to obtain, but others are much harder. There is no easy way to categorize hundreds of millions of songs into “genres”, even assuming that “genre” is a stable and coherent concept to begin with. Turning from genres to even more contentious and vague concepts, like moods and aesthetics, the problem of assigning metadata becomes more challenging. Suppose you want to figure out if a song is “chill” or not, or if it’s the kind of music you’d play in a car with “windows down”. That information might be useful to a recommendation algorithm for a “Chill” or “Windows Down” playlist (or, more creepily, for a recommendation algorithm that knows you are in such a mood, or situation, and gives you exactly what you “want” without asking). There are typically two approaches. One is to rely on music experts to manually assign the “chill” metadata tag to songs they consider “chill”. Some such experts are my colleagues. The other is to use algorithms. The advantage of the algorithms is that they “scale” better than people do, even if their accuracy is markedly lower. Once again, you might find it surprising that the algorithms, at least at my employer, do not analyze the auditory qualities of a piece of music to determine its “chillness”. (Perhaps not so surprising, actually — chillness is as much a cultural phenomenon as an auditory one.) Instead, we once again rely on user playlists. What better signal exists that a song is “chill” than if one of our listeners adds it to their handcrafted “Chill 2022” playlist?
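As a sketch of how playlist titles can stand in for expert tagging: the playlists, songs, and scoring rule below are all made up, and any real pipeline is surely more sophisticated, but the principle is that a song’s “chillness” can be read off from how often its listeners file it under a chill-sounding title.

```python
# Toy playlists keyed by user-chosen title (all names invented).
playlists = {
    "Chill 2022":  ["Song A", "Song B"],
    "Gym Pump":    ["Song C"],
    "chill vibes": ["Song B", "Song D"],
}

def chill_score(song):
    """Fraction of a song's playlist appearances with 'chill' in the title."""
    appearances = [t for t, songs in playlists.items() if song in songs]
    if not appearances:
        return 0.0
    chill = [t for t in appearances if "chill" in t.lower()]
    return len(chill) / len(appearances)

print(chill_score("Song B"))  # 1.0: both of its playlists are chill-titled
print(chill_score("Song C"))  # 0.0: it only appears on "Gym Pump"
```

The machine never listens to the song at all; it just tallies up the judgments that listeners have already made.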

For all their purported sophistication, algorithms are largely, at least for now, simply regurgitations of our collective taste. An algorithm does not know what “indie pop” or “windows down” is, beyond the information supplied to it by music experts (some of my colleagues, although certainly not me) and non-experts (you). Given how fundamental user playlists are to music recommendation, I would be surprised if any machine recommendation, no matter how good, had not first been made, at least implicitly, by a human: one who chose to put together two songs — the one you knew already, and the one you just “discovered”. Any serendipity that the algorithm might convey is, perhaps disappointingly, simply a byproduct of statistical patterns in an almost incomprehensibly large dataset of our own creation. And so these moments of magic are, in my view, still fundamentally human, even if they are mediated by a machine. This is not to detract from the power of statistical summarization: the machine can remember and synthesize billions of connections between songs far better than any human can. But it’s worth remembering that each of these connections was first made by someone like you.

I was reminded of all of this while playing around with ChatGPT, OpenAI’s most recent buzzworthy deep learning system. The basic idea of GPT-3, its underlying machine learning model, is shockingly similar to that for music recommendation. Instead of learning which songs co-occur in playlists, it learns which words co-occur in sentences and paragraphs. (One subtlety: order matters for words much more than it does for songs.)
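To make the analogy concrete, here is the simplest possible “which word comes next” model, a bigram counter. It is orders of magnitude cruder than GPT-3 (the tiny corpus and the whole setup are my own illustration), but it is the same statistical spirit: count what humans wrote, then regurgitate the most common continuation.

```python
from collections import Counter, defaultdict

# A tiny corpus; a real model trains on a large chunk of the Internet.
corpus = "the cat sat on the mat and the cat slept".split()

# Count which word follows which: a bigram model, the simplest
# possible "predict the next word" scheme.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict(word):
    """Most frequent next word seen after `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict("the"))  # 'cat': seen twice after "the", vs. 'mat' once
```

Swap playlists for sentences and songs for words, and the family resemblance to the recommendation pipeline is hard to miss.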

Jay Caspian Kang writes about GPT-3,

If, for example, the word “parsimonious” appears within a sentence, a language model will assess that word, and all the words before it, and try to guess what should come next. Patterns require input: if your corpus of words only extends to, say, Jane Austen, then everything your model produces will sound like a nineteenth-century British novel.

What OpenAI did was feed the Internet through a language model; this then opened up the possibilities for imitation. “If you scale a language model to the Internet, you can regurgitate really interesting patterns,” Ben Recht, a friend of mine who is a professor of computer science at the University of California, Berkeley, said. “The Internet itself is just patterns—so much of what we do online is just knee-jerk, meme reactions to everything, which means that most of the responses to things on the Internet are fairly predictable. So this is just showing that.”

Kang later calls what ChatGPT does “a series of parlor tricks” performed by “a very precocious child”, and I tend to agree. But this childlike behavior can lead down darker paths.

One good example is GitHub Copilot, a controversial machine learning system that, as the name suggests, helps software developers write code more efficiently: like having a copilot at your side. In the demos I’ve seen, a developer simply has to write a function with a descriptive name and documentation string and Copilot will fill in the rest, in a programming language of that developer’s choice. Copilot is controversial because, once again, it was built on human labor — in this case, code uploaded to GitHub — and some of that code explicitly prohibits commercial use. (GitHub recently announced it will allow businesses to subscribe to Copilot, at the cost of $19/user/month.) GitHub has been denounced by organizations like the Free Software Foundation, and has been taken to court in a class action lawsuit, but, in the meantime, Microsoft (GitHub’s owner) is making millions of dollars from synthesizing (or to be tendentious, “plagiarizing”) the work of thousands of developers, without returning even a penny to any of them.

I want to close by making two points. First, even a “precocious child” still learns from humans. Every Wikipedia article, novel, Tumblr post, or Humble Politics Blog essay digested by ChatGPT, and every sorting algorithm or Javascript snippet fed into Copilot, was originally written by a human, in the same way that every song recommendation regurgitated by a music algorithm was originally ingested from a user playlist.

Second, much of what we do as humans is not terribly inventive or creative. And this isn’t true just of memes, or of Elon Musk’s humor. There are millions of jobs that involve sending rote emails, applying basic algorithms from Computer Science 101 to a slightly different application, or telling people to turn their computer off, and then back on again. Eventually, a system like ChatGPT will be capable of doing some or even most of these jobs, albeit likely with some assistance from an actual human. Whether this is good or bad is unclear: I think it largely depends on whether the benefits redound to labor, or to capital (although I know which one I’m betting on). What is disturbing, though, is that these systems created by synthesizing our labor, and built largely without our knowledge or direct consent, could end up replacing and even immiserating us.

