AI language models work by predicting the likely next word in a sentence, generating text one word at a time on the basis of those predictions. Watermarking algorithms for text divide the language model’s vocabulary into words on a “green list” and a “red list,” and then nudge the model toward choosing words from the green list. The more words in a passage that come from the green list, the more likely it is that the text was generated by a computer; humans tend to write sentences with a more random mix of words.
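To make the mechanics concrete, here is a minimal Python sketch of such a scheme. It is an illustrative toy, not any of the specific watermarks the researchers tested: the hash-based green-list split, the bias strength, and the toy vocabulary are all assumptions made for demonstration.

```python
import hashlib
import math
import random

VOCAB = [f"word{i}" for i in range(1000)]  # toy stand-in for a real vocabulary
GREEN_FRACTION = 0.5  # fraction of the vocabulary placed on the green list
GREEN_BIAS = 2.0      # logit boost given to green-list words (assumed strength)

def green_list(prev_word: str) -> set[str]:
    """Pseudorandomly split the vocabulary, seeded by the previous word,
    so the green list changes at every position yet stays reproducible."""
    seed = int(hashlib.sha256(prev_word.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(VOCAB, int(len(VOCAB) * GREEN_FRACTION)))

def sample_watermarked(prev_word: str, logits: dict[str, float]) -> str:
    """Boost green-list logits before sampling: the model is nudged toward
    green words, not forced to use them."""
    green = green_list(prev_word)
    boosted = {w: l + (GREEN_BIAS if w in green else 0.0) for w, l in logits.items()}
    total = sum(math.exp(l) for l in boosted.values())
    r, acc = random.random() * total, 0.0
    for word, l in boosted.items():
        acc += math.exp(l)
        if acc >= r:
            return word
    return word  # numerical fallback

def detect(words: list[str]) -> float:
    """z-score of the green-word count; a large value suggests machine text."""
    n = len(words) - 1
    if n < 1:
        return 0.0
    hits = sum(words[i] in green_list(words[i - 1]) for i in range(1, len(words)))
    expected = n * GREEN_FRACTION
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - expected) / std

# Demo: generate a toy watermarked sequence and score it.
random.seed(0)
text = ["word0"]
for _ in range(50):
    candidate_logits = {w: random.gauss(0, 1) for w in random.sample(VOCAB, 50)}
    text.append(sample_watermarked(text[-1], candidate_logits))
print(f"detection z-score: {detect(text):.2f}")  # well above ~2 => likely watermarked
```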
The researchers tampered with five different watermarks that work in this way. They were able to reverse-engineer the watermarks by querying the watermarked AI model many times through an API, says Staab. The responses allow an attacker to “steal” the watermark by building an approximate model of its rules, comparing the AI’s outputs with ordinary, unwatermarked text.
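Under simplifying assumptions, the core of such a frequency-comparison step might look like the sketch below. The function name, thresholds, and global (context-free) counting are all illustrative stand-ins, not the paper’s actual estimator, which is more sophisticated.

```python
from collections import Counter

def estimate_green_words(watermarked_texts: list[list[str]],
                         baseline_texts: list[list[str]],
                         min_count: int = 20,
                         ratio_threshold: float = 1.5) -> set[str]:
    """Guess which words the watermark favors by comparing how often each
    word appears across many watermarked outputs versus ordinary text.
    Words that are clearly over-represented are flagged as likely green."""
    wm_counts = Counter(w for text in watermarked_texts for w in text)
    base_counts = Counter(w for text in baseline_texts for w in text)
    wm_total = sum(wm_counts.values()) or 1
    base_total = sum(base_counts.values()) or 1

    likely_green = set()
    for word, count in wm_counts.items():
        if count < min_count:
            continue  # too rare to estimate reliably
        wm_rate = count / wm_total
        base_rate = (base_counts[word] + 1) / base_total  # +1 smoothing for unseen words
        if wm_rate / base_rate > ratio_threshold:
            likely_green.add(word)
    return likely_green
```

In a context-keyed scheme like the one sketched earlier, the green list shifts with the preceding word, so a real attack would condition these counts on context rather than pooling them globally.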
Once they have an approximate idea of which words are on the green list, the researchers can execute two kinds of attacks. The first, called a spoofing attack, lets malicious actors use the information learned from stealing the watermark to produce text that can be passed off as watermarked. The second lets hackers scrub the watermark from AI-generated text, so it can be passed off as human-written.
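Continuing the toy example, both attacks can be sketched as word substitutions driven by the estimated green list. The synonym table and one-for-one substitution rule here are hypothetical; the real attacks rewrite text fluently with a language model rather than swapping individual words.

```python
def spoof(text: list[str], likely_green: set[str],
          synonyms: dict[str, list[str]]) -> list[str]:
    """Rewrite human text so that more words land on the estimated green
    list, hoping the detector will score it as machine-generated."""
    out = []
    for word in text:
        if word in likely_green:
            out.append(word)  # already green, keep it
            continue
        green_alts = [s for s in synonyms.get(word, []) if s in likely_green]
        out.append(green_alts[0] if green_alts else word)
    return out

def scrub(text: list[str], likely_green: set[str],
          synonyms: dict[str, list[str]]) -> list[str]:
    """Rewrite watermarked text so that estimated green words are replaced,
    hoping the detector will score it as human-written."""
    out = []
    for word in text:
        if word not in likely_green:
            out.append(word)
            continue
        red_alts = [s for s in synonyms.get(word, []) if s not in likely_green]
        out.append(red_alts[0] if red_alts else word)
    return out
```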
The team had a roughly 80% success rate in spoofing watermarks, and an 85% success rate in stripping AI-generated text of its watermark.
Researchers not affiliated with the ETH Zürich team, such as Soheil Feizi, an associate professor and director of the Reliable AI Lab at the University of Maryland, have also found watermarks to be unreliable and vulnerable to spoofing attacks.
The findings from ETH Zürich confirm that these issues with watermarks persist and extend to the most advanced types of chatbots and large language models being used today, says Feizi.
The research “underscores the importance of exercising caution when deploying such detection mechanisms on a large scale,” he says.
Despite the findings, watermarks remain the most promising way to detect AI-generated content, says Nikola Jovanović, a PhD student at ETH Zürich who worked on the research.
But more research is needed to make watermarks ready for deployment on a large scale, he adds. Until then, we should manage our expectations of how reliable and useful these tools are. “If it’s better than nothing, it is still useful,” he says.
Update: This research will be presented at the International Conference on Learning Representations (ICLR). The story has been updated to reflect that.