OpenAI launches web crawler GPTBot, and instructions on how to block it

Websites can choose to opt out.
By Meera Navlakha
ChatGPT website displayed on a laptop screen and OpenAI logo displayed on a phone screen.
Credit: Jakub Porzycki/NurPhoto via Getty Images.

OpenAI has launched a web crawler to improve artificial intelligence models like GPT-4.

Called GPTBot, the crawler combs the internet for data used to train and improve the company's AI models. According to a blog post from OpenAI, content gathered by GPTBot has the potential to make future models more accurate and safer.

"Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies," reads the post.

Websites can restrict the crawler's access, however, either partially or by opting out entirely. OpenAI said website operators can disallow GPTBot by blocking its IP address or by adding a rule to the site's robots.txt file.

OpenAI has previously landed in hot water over how it collects data, facing accusations of copyright infringement and privacy breaches. This past June, the company was sued for allegedly "stealing" personal data to train ChatGPT.


OpenAI's opt-out functions were only recently implemented, with features like disabling chat history giving users more control over what personal data can be accessed.

GPT-3.5 and GPT-4 were trained on online data and text dating up to September 2021. There is currently no way to remove content from those training datasets.

How to prevent GPTBot from using your website's content

According to OpenAI, you can disallow GPTBot by adding a rule to your site's robots.txt, which is essentially a text file that tells web crawlers what they can and cannot access on a website.

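The snippet below is based on the example in OpenAI's documentation: it names GPTBot as the user agent and disallows it from the entire site (the "/" path). Add these two lines to the robots.txt file at the root of your domain:

    User-agent: GPTBot
    Disallow: /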

You can also customize which parts of your site the crawler can access, allowing some directories and disallowing others.

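For partial access, OpenAI's example pairs an Allow rule with a Disallow rule. The directory names below are placeholders from that example; swap in the paths you want to keep open or off-limits:

    User-agent: GPTBot
    Allow: /directory-1/
    Disallow: /directory-2/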
Meera Navlakha
Culture Reporter

Meera is a Culture Reporter at Mashable, having joined the UK team in 2021. She writes about digital culture, mental health, big tech, entertainment, and more. Her work has also been published in The New York Times, Vice, Vogue India, and others.

