AI has changed the way we create and consume digital content, but at what cost? Behind every generated artwork lies a collection of training data, much of it sourced from artists, writers, and developers who are often neither credited nor compensated for their work. Microsoft has taken an important step toward resolving a central ethical and legal concern in AI development: tracing and crediting the people whose data is used for AI training.
The company has launched a research project to measure the influence of specific training examples, such as text, images, and other media, on the outputs of generative AI models. The effort first surfaced in a job listing posted in December and has since resurfaced on LinkedIn. Its goals are to establish transparency around how AI models use data and to reward contributors accordingly.
Microsoft’s Research Initiative
The job listing seeks a research intern for a project that aims to show AI models can be trained so that the influence of specific data contributions on their outputs can be efficiently and usefully estimated, an approach called "training-time provenance." Training-time provenance could make AI model development more transparent while addressing intellectual-property and fair-compensation concerns. According to the listing,
“The project will attempt to demonstrate that models can be trained in such a way that the impact of particular data — e.g. photos and books — on their outputs can be efficiently and usefully estimated.”
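The listing does not describe a specific method, but one established technique for this kind of estimation is gradient-based influence tracing, in the spirit of TracIn (Pruthi et al., 2020), which scores a training example by how well its loss gradient aligns with that of a given output. A minimal sketch in PyTorch follows; the `model`, `loss_fn`, and example tensors are assumptions for illustration, not Microsoft's actual approach.

```python
import torch

# Hypothetical sketch of gradient-based influence estimation
# (TracIn-style); not Microsoft's method, just one known technique.

def example_gradient(model, loss_fn, x, y):
    """Flattened gradient of the loss on one example w.r.t. model parameters."""
    loss = loss_fn(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_score(model, loss_fn, train_example, test_example, lr=1.0):
    """Approximate influence of one training example on one output:
    the dot product of their loss gradients, scaled by the learning rate."""
    g_train = example_gradient(model, loss_fn, *train_example)
    g_test = example_gradient(model, loss_fn, *test_example)
    return lr * torch.dot(g_train, g_test).item()

# Usage: rank training examples by their influence on a given generation.
# scores = {i: influence_score(model, loss_fn, ex, query)
#           for i, ex in enumerate(train_set)}
```

A large positive score suggests the training example pushed the model toward the queried output; near-zero or negative scores suggest little or opposing influence. Scaling this to frontier-model size is precisely the open research problem the listing describes.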
Microsoft recognizes that such attribution mechanisms are necessary for fairness, and for incentivizing and recognizing the people who contribute valuable data. The listing states,
“Current neural network architectures are opaque in terms of providing sources for their generations, and there are […] good reasons to change this. One is, incentives, recognition, and potentially pay for people who contribute certain valuable data to unforeseen kinds of models we will want in the future, assuming the future will surprise us fundamentally.”
Legal & Ethical Concerns
In recent years, AI-generated content in the form of text, code, images, video, and music has been the subject of numerous court cases brought by creators alleging copyright infringement. The training datasets for these models typically consist of huge volumes of publicly available data, including copyrighted works. Companies defend this practice as fair use; however, many content creators, artists, programmers, and writers disagree, arguing that their work is being exploited without permission or compensation.
Microsoft has also faced its own share of legal troubles in this arena. In December 2023, The New York Times sued Microsoft and OpenAI, alleging that their AI models were trained on millions of copyrighted articles without permission. Microsoft has also faced lawsuits from software developers who allege that GitHub Copilot, Microsoft's AI coding assistant, was unlawfully trained on their code.
This new research effort appears aimed at those concerns. It involves Jaron Lanier, a leading technologist and interdisciplinary scientist at Microsoft Research. Lanier has long advocated for "data dignity," the idea that digital content should remain tied to its human creators, ensuring them appropriate recognition and compensation.
Data Dignity Approach
Lanier illustrated the data-dignity approach in an op-ed published in The New Yorker in April 2023, describing how such a system might work. Lanier wrote,
“A data-dignity approach would trace the most unique and influential contributors when a big model provides a valuable output. For instance, if you ask a model for ‘an animated movie of my kids in an oil-painting world of talking cats on an adventure,’ then certain key oil painters, cat portraitists, voice actors, and writers — or their estates — might be calculated to have been uniquely essential to the creation of the new masterpiece. They would be acknowledged and motivated. They might even get paid.”
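In practice, turning influence estimates like these into payouts could be as simple as normalizing per-contributor scores into revenue shares. The following is a hypothetical sketch; the contributor names, scores, and revenue figure are invented for illustration and do not reflect any company's actual scheme.

```python
# Hypothetical payout allocation from influence scores; all names and
# numbers are invented for illustration, not any real compensation scheme.

def allocate_payouts(influence_scores: dict[str, float],
                     revenue: float) -> dict[str, float]:
    """Split a revenue pool among contributors in proportion to their
    non-negative influence on a generated output."""
    positive = {k: max(v, 0.0) for k, v in influence_scores.items()}
    total = sum(positive.values())
    if total == 0:
        return {k: 0.0 for k in influence_scores}
    return {k: revenue * v / total for k, v in positive.items()}

# Example: three contributors whose work shaped one output
scores = {"oil_painter": 0.52, "cat_portraitist": 0.31, "voice_actor": 0.17}
print(allocate_payouts(scores, revenue=100.0))
# {'oil_painter': 52.0, 'cat_portraitist': 31.0, 'voice_actor': 17.0}
```

The hard part, of course, is not the arithmetic but producing trustworthy influence scores in the first place, which is what Lanier's "data dignity" vision depends on.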
A few companies are already working on this model. AI company Bria, which recently raised $40 million in venture capital, claims to compensate data owners according to their overall influence. Adobe and Shutterstock also pay dataset contributors, but most of these schemes remain opaque.
Potential Obstacles
While some AI labs have begun offering copyright holders the option to opt out, these options usually apply only to future models, not to AI systems already in use. Large AI companies have mostly approached the issue through licensing agreements with publishers and platforms, while direct payment programs for individual contributors remain rare.
Microsoft's initiative may also face implementation challenges. There have been similar announcements before: in May 2023, OpenAI said it was building a tool that would let creators control how their content is used in AI training. Nearly a year later, the tool has not been given much priority, and there is no sign of a release.
Some skeptics see this as an exercise in "ethics washing," a strategy to fend off regulatory and legal risk rather than a genuine push for data transparency. The timing is also noteworthy: several major AI labs, including OpenAI and Google, have been lobbying to weaken copyright protections around AI training. OpenAI has gone as far as asking the U.S. government to codify fair use for model training, which would effectively insulate AI developers from legal challenges.
An Admirable Attempt?
Microsoft's project can fairly be called an admirable attempt at solving a long-standing problem in AI, but will it produce real change? Building mechanisms to track training-data contributors is a promising idea, yet historically such projects have either advanced only in theory or been quietly shelved by corporate interests. The tech industry has an overwhelming tendency to favor vague "opt-out" systems over meaningful compensation models.
There is also a good case for cynicism, given Microsoft's own history of copyright litigation and the growing trend of AI firms chipping away at copyright protections. As AI development pushes against legal and ethical boundaries, transparency and fair compensation for content producers will remain front and center, and the rift between tech giants and content creators is likely to deepen.