The Dilemma Of Fair Use And Expressive Machine Learning: An Interview With Ben Sobel

23/08/2017 by Intellectual Property Watch

The views expressed in this article are solely those of the authors and are not associated with Intellectual Property Watch. IP-Watch expressly disclaims and refuses any responsibility or liability for the content, style or form of any posts made to this forum, which remain solely the responsibility of their authors.

By Elise De Geyter for Intellectual Property Watch

Intellectual Property Watch recently conducted an interview with Ben Sobel, a law and technology researcher, teacher, and fellow at Harvard University’s Berkman Klein Center for Internet and Society. Sobel’s research focuses on copyright and the fair use doctrine, particularly in the context of artificial intelligence (AI). Below, he shares his views on expressive machine learning, “the fair use dilemma” and “Big Content versus Little Users”. Of note: according to Sobel, the most pressing copyright question has to do with AI readers, not AI authors.

Intellectual Property Watch (IPW): Copyright laws have always been challenged by the development of new technologies. To what extent are the challenges posed by artificial intelligence different from those copyright has faced before?

BEN SOBEL (SOBEL): Today’s AI is getting better at generating things that resemble works of human authorship: prose, images, music, movies, and the like.
The question on everyone’s mind is, “Who ought to own these?” “Can a computer program be an ‘author’ for the purposes of copyright law?” These are intriguing problems, but not novel ones: we ask similar questions whenever a new technology alters or attenuates an author’s role in the creative process. In the late 1800s, the United States Supreme Court considered whether a photograph could be a copyrightable work of authorship, rather than just a mechanical recording of facts about the world. In the 1980s, US Courts of Appeals evaluated who “authors” the images of a video game that are generated by software in response to a player’s input. And IP scholars have been writing about how to treat output generated by an artificial intelligence for at least 30 years.

What’s overlooked is that before today’s AI can create anything, it has to learn from works made by human beings. This technique, called machine learning, lets computers learn to mimic or find patterns in input data. Training an AI typically requires making copies of the data on which it will be trained, and sometimes, copyrighted works are used to train AI without the permission of the rightsholders. This is presumptively copyright infringement unless it’s excused by something like fair use.

In some ways, machine learning looks a lot like other projects that involve large-scale, unauthorized reproduction of copyrighted works by computers. Projects like these—think image search engines and Google Books—have historically been deemed fair use in the United States. This is often because the uses are what some scholars call “non-expressive”: they analyse facts about works instead of using authors’ copyrightable expression.
I’m not certain that this rationale can protect emerging applications of machine learning. More than ever before, machine learning can take the expression in works, precisely what copyright protects, and cobble it into something that companies hope to use in commerce. A good example is a recent Google project that taught an AI to write more conversationally by feeding it thousands of romance novels. Sometimes, these creative AI programs are even designed to compete with human creators at expressive tasks, like composing music or writing news stories. If expressive machine learning threatens to displace human authors, it seems unfair to train AI on copyrighted works without compensating the authors of those works.

So, to me, the most pressing question doesn’t have to do with AI authors; it has to do with AI readers. When humans copy without authorization, it’s infringement. When does robotic consumption become expressive enough and/or commercially significant enough that it, too, is infringing unless authorized—and what will we do about it?

IPW: How will the concept of fair use in the US be challenged by artificial intelligence?

SOBEL: We’re approaching what I call a “fair use dilemma”: in the context of commercial, expressive machine learning, no outcome seems desirable. If expressive machine learning weren’t fair use, an author could seek outsize remedies simply because her work ended up in a training dataset among thousands of other works. This would be a huge obstacle to the progress of a valuable technology.
Then again, if fair use gave companies carte blanche to train AI on copyrighted works without compensating authors, human creators would miss out on income that the spirit, and arguably the letter, of copyright law entitle them to receive. This would be a boon for AI and for those who stand to profit from it, but it’s not clear that society as a whole would benefit. A hyper-literate AI would be more likely to displace humans in creative jobs, and that could exacerbate the income inequalities that many people fear in the AI age.

IPW: Should the unequal power relationship between small creators and big AI enterprises influence the interpretation of fair use?

SOBEL: Absolutely. First, a disclaimer: I don’t mean to suggest that fair use must have a redistributive outcome, or that authors are entitled to compensation from any use of their work. Fair use should facilitate innovation, and it’s fine if that innovation proves to be lucrative for the innovators. But as machine learning and AI expand, we should think carefully about what fair use ought to subsidize.

The rhetoric around fair use often depicts “rightsholders” as powerful, incumbent companies and “users” as private individuals or scrappy startups with limited resources. Big Content versus Little Users may have been the paradigm in a previous decade, but I’m not sure it describes the present day. The internet’s most powerful companies are not, primarily, content owners; rather, they are platforms for user-generated content that make money by collecting users’ data and displaying ads. This means that ordinary people are the rightsholders to troves of copyrighted content—wall posts, emails, pictures, videos, music, etc.—that they license to platforms by accepting websites’ Terms of Use. In the platform economy, Big Users tend to have more power than Little Content.
This economic reordering should influence our views of fair use. Fair use exists to foster free speech, research, innovation, and other socially beneficial activities—not to subsidize powerful companies that already have access to licensed data pursuant to their Terms of Use. Because of this, while I do think fair use will excuse expressive machine learning done for academic research or some artistic purposes, I’m less confident that it will protect companies that train commercial AI on the expressive aspects of copyrighted works without the permission of those works’ authors.

IPW: Can you elaborate on the distinction between expressive and non-expressive use, and between low and high expressive engagement, in the context of AI?

SOBEL: “Non-expressive” versus “expressive” use is a distinction that copyright scholars—most notably Matthew Sag and James Grimmelmann—devised to describe how courts have handled fair use claims that involve large-scale copying by computers. It’s premised on the idea that copyright protects engagement with an author’s expression, but it doesn’t give authors the right to control facts about their works (that is, the non-expressive elements of their works). When Google Books tells you where and how many times a particular keyword appears within a book, it provides a fact about that book. A use like this is therefore non-expressive, even though it involves wholesale copying without authorization. Some machine learning, by my lights, clearly makes non-expressive fair use of input data. Facial recognition is a good example.
Though training a facial recognition AI may require copying lots of copyrighted photographs, the information being used has nothing to do with photographers’ expressive choices and everything to do with matching facts about the subjects’ identities with facts about their physical appearance. But now that machine learning is getting more sophisticated and its applications more varied, uses of input data seem more and more expressive. When an AI learns to write better prose by reading prose, or to generate catchy melodies by listening to music, those uses come much closer to copyright-protected interests than a Google Books keyword search does. I’m not sure how the doctrine will evaluate these uses, but I doubt the label “non-expressive” ought to apply.

Given that AI is engaging more and more with human expression—in the manner that we assume human readers always do—it seems strange that we would give AI free rein to consume copyrighted works in ways that would be infringement if done by humans. I can’t download an infringing copy of an album just because listening to it will help me write better music in the future.

IPW: Should there be distinct standards for originality or infringement with respect to works created by machines?

SOBEL: In some ways, today’s AI technology makes the question of “independent creation” and infringement easier to evaluate. There’s no way to track every single copyrighted work that a human author encounters in her lifetime, and copyright doctrine has developed convoluted proxies to determine when one author is likely to have copied from another. With machine learning, however, training data can be easily catalogued: we could determine which works an AI has and has not seen. It’s not clear what should be done after that point, though. Say an AI generated a novel without human oversight, and that novel infringed a pre-existing work—who ought to be liable for that infringement?
AI would be great at generating merely “novel” works (that is, works that haven’t existed before), but “original” works raise more difficult issues, because originality typically requires a small amount of creativity. Whether or not a computer program could impart that creativity is a thorny—and, as I understand it, unresolved—question of philosophy and semantics.

IPW: Is there a way out of the fair use dilemma you have described?

SOBEL: The dilemma of expressive machine learning is serious enough that it may prompt us to revise doctrine and policy for the better. Many people are calling for changes in law and policy that promote social equity in the AI age, like universal basic income and “robot taxes.” Unexpectedly, a faithful interpretation of today’s copyright doctrine, paired with some higher-level compromises, could promote distributive justice in a similar way: by compensating the creators whose expression gives artificial intelligence some of its intelligence.

Ben Sobel is a Fellow at Harvard University’s Berkman Klein Center for Internet & Society.

Elise De Geyter obtained the LLM in Intellectual Property and Technology Law at the National University of Singapore (class of 2017). She has a particular interest in intellectual property policies and new technologies and was an intern at Intellectual Property Watch.

“The Dilemma Of Fair Use And Expressive Machine Learning: An Interview With Ben Sobel” by Intellectual Property Watch is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.