This post originally appeared on Medium, published by Adam Morrice.
I’ve written before about how LLMs (large language models) use vast amounts of general data from across the internet to power AI answer machines like OpenAI’s ChatGPT and Google’s Bard/SGE, and what that means for how the average person finds and uses information online. In short, the big tech dogs are pushing a new way for people to consume information, bypassing original creators in the process.
The internet as we know it has worked on a fairly simple premise: one person writes, another person reads. Sometimes there were group discussions in the form of message or bulletin boards. Video came along and diversified the forms of information, but the core value transaction remained the same: everything you consumed online was produced by a person, for a person.
Now we have a new player in the chain, with AI bots able to create new information without relying on the creative faculties of a human. Answers, articles, images, videos, voiceovers and more. It’s all on the table and it’s all able to be created ex nihilo. Or so it seems.
As the scale of generative AI has broadened, it turns out these models have scooped up vast amounts of data to create their answers – not from nothing, but from people’s existing works. In the effort to train these models, developers have ignored most copyright laws, the original effort of creators, and prevailing privacy concerns. So foundational is this data-mining approach that in a recent legal submission to the House of Lords (part of the UK legislature), OpenAI flatly claimed it would be “impossible” to create tools like GPT-4 without consuming copy-protected work.
This came after legal action by the New York Times against OpenAI and Microsoft, alleging “unlawful use” of assets and content owned by the NYT. Other complainants in similar cases include George R.R. Martin and John Grisham, who were among 17 authors bringing legal action against OpenAI, claiming “systematic theft on a mass scale”. This data is scraped, learned from, repackaged and presented to end users without any insight into its origins.
More neutral parties like the Electronic Frontier Foundation still believe the practice of AI data scraping is lawful, but whether it’s legal is a separate question from whether it’s unpleasant, intrusive or generally detrimental. So they believe (as do I) that people and site owners should have a way to signal their preferences to Google, OpenAI, Microsoft and the rest.
For publishers and creators of original media, the potential disruption of generative AI to how their businesses operate is clear. Most recently, it was found that Google’s new search answers tool — Search Generative Experience — gave AI-generated answers to 92% of all queries, often absorbing written content from website owners, mashing it around a little, and presenting it to searchers directly without them ever leaving Google.com.
With all that in mind, we are right now at a critical point in the development of AI LLMs. Data is needed to train these models and give them the answers to everything, but what data can, or should, be used? And should we have any say in it?
Opting Out of AI Using Your Data
What some might call the ‘Big Three’ of data-slurpers – Google, Facebook and X/Twitter – have already written terms into their privacy policies allowing your data to be used to train AIs. Reddit, in advance of their IPO, also announced a similar formal intention for users’ data to be sold and fed into LLMs. Google proudly announced an “expanded partnership” with Reddit, where all this “authentic, human experience” could be fed into their AI machines to learn from. With Reddit being one of the biggest holders of UGC (user-generated content) data, expect your friendly AI chatbots to know a lot more about Reddit memes going forward.
It’s almost inevitable that some or all of your data — especially if you create written content — has been fed into an LLM already. Privacy experts have long warned of the ‘black-box’ nature of it all, with AI models now being so effective that individuals, companies and governments rely on them without understanding how they work. AIs do still have biases, after all.
The issue boils down to this: AIs are trained on data. If they have unrestricted access to data, copyright or no, it gives AI and AGI the best chance of being successful. Limiting data inputs, therefore, limits the scope and effectiveness of any potential AI, because it (supposedly) needs everything.
But we also know, as human beings living in a complex world, that decisions made solely on data can be detrimental to some groups. Amazon, for example, scrapped its AI recruitment tool after the algorithm taught itself that women were less suitable candidates. Other research has found that training LLMs on news articles can lead to responses with demonstrable gender stereotypes and biases.
So maybe our future AI overlords don’t need access to everything.
Defenders claim there always remains human oversight, where “reinforcement learning from human feedback is used”. Most AI companies strive for neutrality in their models’ responses, so these manual controls often come down to decisions by the tech company’s employees – and these companies tend to hire more left-leaning individuals. If they see a possible right-wing opinion being espoused by their neutral AI, it’ll be reined back in. Or so the theory goes.
Yet at the same time, we have recent announcements from giants like Microsoft, explaining the future of coding where AIs will do all the actual work and simply be supervised by humans at the end of the process. We can all wonder at the potential outcomes of getting black boxes to speak with each other.
Probably a good idea, then, that news publications are among the earliest and loudest objectors to this AI data scraping. Already, around 90% of top news publications, including the New York Times, have added directives to their website’s robots.txt file to disallow data scraping by OpenAI’s and Google’s AI crawlers. There’s also a fair amount of noise online about opting out of AI crawlers more broadly.
There are certainly arguments for and against letting AIs train on your data unrestricted. The rest of this article will give you a breakdown of how you can escape the data nets of the big players, both as an individual and as a business that produces content.
Opting out of Google’s Gemini using your data to train AI
Google’s chatbot Gemini offers an opt-out via your browser. Go to Activity and choose Turn Off in the drop-down menu. You’ll be given a choice to turn off Gemini Apps Activity, or to go one further and opt out while also deleting your conversation data. While that doesn’t stop Gemini from learning from your inputs and responses, it does mean that (supposedly) these chats won’t be put forward for human review, which Google says happens periodically. The actual content of the chats you have with Gemini isn’t connected to your Google account and can be stored for up to three years. You can read the full privacy implications of Gemini here.
Opting out of Google Bard AI crawlers on your website
If you own a website which produces content, you can disallow Google’s AI crawling using a robots.txt directive. robots.txt is a plain text file that sits in the root of your website’s domain.
Add these lines to your robots.txt to stop Bard from crawling:
User-agent: Google-Extended
Disallow: /
That will configure a site-wide opt-out. If you don’t want to paint in such broad strokes, you can also apply a directory-level opt-out, such as:
User-agent: Google-Extended
Disallow: /blog/
That would prevent the crawler from scraping anything under yoursite.com/blog/.
Opting out of OpenAI ChatGPT and Dall-E using your data
One tech executive at a company I worked at was very keen to stress that, unless you’re an Enterprise customer, anything you say to ChatGPT can be retained, stored and used for future purposes. It’s probably wisest to consider that anything you say to ChatGPT is in, or will be in, the public domain by virtue of you sharing it. Those financial spreadsheets you asked it to evaluate, perhaps? You can, however, control how the data you share with GPT is used to train future AI models.
There is a dedicated data controls page on OpenAI’s website. To prevent your chat data being used to train models, you currently navigate to Settings and toggle off the ‘Improve the model for everyone’ option. It’s a sneaky way to put it, and we’ll see if in future it defaults to opted-out (especially with the EU’s new directives). For now, you’ll have to go in and do it yourself.
However, the other component to consider here — especially if you’re a content creator — is OpenAI’s Dall-E mining your site for image training. If you’re hosting a lot of images online that are open to public access, chances are GPTBot is going to train on them. If you want to disallow that access, then similar to the Google example, you’ll need to edit your robots.txt file to include:
User-agent: GPTBot
Disallow: /
or scope the Disallow rule to the directory in which your images are stored, as in the sketch below.
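For instance, here’s a minimal sketch assuming your images live under a hypothetical /images/ directory (swap in your site’s actual media path):

User-agent: GPTBot
Disallow: /images/

That keeps the rest of your site crawlable by GPTBot while fencing off the image library.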
Stop Quora using your answers to train AI
Quora has been one of the big winners in Google’s recent updates, but they currently claim that no Quora answers, comments or posts are used to train AI. That’s not to say your content hasn’t already been scooped up by the LLMs, but rather that Quora hasn’t yet done it themselves.
In case Quora decides to cash in on the UGC hype like our friends at Reddit, you can opt out in advance by navigating to Settings, then Privacy, and turning off the option that says ‘Allow large language models to be trained on your content’.
Stop Reddit using your posts to train AI
As I’ve mentioned, Reddit has directly and loudly struck a deal with Google to sell real-time data for training LLMs. With the deal reportedly worth around $200m spread over several years, you can assume pretty much every Reddit comment has been fed into an LLM and will continue to be for the foreseeable future.
Currently, Reddit does not have a specific opt-out for LLM training. This has prompted an inquiry from the US government’s FTC, but the outcome of that is (currently) yet to be decided. Reddit’s privacy policy also contains no mention of your data being used to train AIs or LLMs, just some vague mentions of sharing with third parties at their discretion.
Until Reddit offers more control and privacy settings, if you want to keep control of your content and not have it feed the AI monolith, it’s probably best to avoid the platform altogether. Alternatively, you can copy the method of outraged users and moderators and delete all your Reddit comments and posts in bulk, but that doesn’t necessarily mean Reddit won’t retain and sell on the data in some way.
Slack chats used to train AI?
Unfortunately, yes. Slack is busy building their own AI to catch up with the rest, with the aim of sticking it “right where you’re already working”. It’s currently available to Enterprise users, with a wider roll-out expected before their next earnings call.
Slack’s LLMs are hosted directly within the company’s AWS cloud. They have a handy privacy page that explains how “[t]o develop AI/ML models, our systems analyze Customer Data (e.g. messages, content, and files) submitted to Slack as well as Other Information (including usage information)”.
So yes, your private Slack messages and DMs are being fed into an LLM for the purposes of… emoji suggestions right now, I guess?
As with most of these companies so far, you are opted in by default. To opt out of your Slack data being used to train AI, you have to contact Slack directly via email, from an admin account, and use a specific subject line:
If you want to exclude your Customer Data from Slack global models, you can opt out. To opt out, please have your Org or Workspace Owners or Primary Owner contact our Customer Experience team at feedback@slack.com with your Workspace/Org URL and the subject line “Slack Global model opt-out request.” We will process your request and respond once the opt out has been completed.
Stop Meta/Facebook from using your data
In response to a $1.3 billion fine from EU regulators over the misuse of personal data, Meta released a form allowing you to request that your data is not used to train generative AIs. The form can be found here. The data up for processing includes personal information like your name, home address, phone number, email and pretty much anything you’ve entered (aside, they claim, from private chats) into Facebook or the Meta monolith across WhatsApp, Instagram, Threads and more. All of that is being used to train Meta’s AIs, with no blanket way to opt out.
The important point here is that Meta will not fulfill requests on the linked page automatically, and it only relates to your data from Meta being used to train third-party generative AI tools. The request page also requires that you provide an example of where your personal information was wrongly presented to you by a generative AI tool. There’s no way to opt out in advance.
Given the numerous privacy infringements by Meta, dating back to the founding of Facebook, where Zuck infamously called users “dumb f*cks” for handing over their personal data so willingly, it’s fair to say that privacy-minded individuals aren’t going to find much solace here.
Tumblr
The thought of Tumblr content being used to train AIs is frightening in itself, but it is happening. Tumblr is now owned by Automattic, the same company that develops WordPress, which announced in February 2024 an intention to sell data for AI training but wasn’t very clear on the scope of it.
A leaked memo suggests that Tumblr was willing to sell everything, including private posts on public blogs, deleted or suspended blogs, unanswered (i.e. not publicly posted) questions, private answers, posts marked explicit and content from premium partner blogs (like Apple’s former music site).
If you don’t like the sound of that, there is an option to ‘Prevent third-party sharing’ in the Settings area, under Visibility. In the meantime, let’s all go ask ChatGPT about the My Immortal fanfic.
Amazon AI Data Opt-out
Probably the most complicated opt-out of the batch, and only relevant if you use Amazon’s cloud services. Our favorite A-Z online retailer has a lengthy page on AI opt-outs.
If you use Amazon Rekognition, Amazon CodeWhisperer, Amazon Transcribe, or Contact Lens for Amazon Connect and want to opt out, it involves creating and enforcing a specific opt-out policy from your AWS Dashboard. This page will get you started, and there’s a sketch of what the policy looks like below. Maybe paste it into GPT and ask it to simplify for you.
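For reference, here’s a minimal sketch of an AI services opt-out policy in AWS Organizations. This example opts every covered service out by default for the accounts the policy is attached to – treat it as a starting point and check AWS’s current documentation before relying on it:

{
    "services": {
        "default": {
            "opt_out_policy": {
                "@@assign": "optOut"
            }
        }
    }
}

You would then create the policy (type AISERVICES_OPT_OUT_POLICY) and attach it to your organization root or to individual accounts via the AWS Organizations console or CLI.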
Author’s note: this page will be reviewed and updated periodically with the latest guidance on opting out of AI data scraping. I’ll also be adding guides for additional services as I come across them.
With more than 20 years of publishing experience and formal study to MA level, Adam is an expert in digital publishing strategies. Adam is the founder and lead consultant at Alphamorr, where he helps digital publishers and media groups set and achieve their goals. His experience includes leading sales teams in fast-growing organizations, covering adtech for some of the largest websites in the world and enterprise software deployed with world-leading organizations, both public and private.