
Google Adds Scraping Public Content for AI Training to Its Privacy Policy

There’s no escaping generative AI data-harvesting now.
By Adrianna Nine
Exterior of Google office building.
Credit: Cess Idul/Unsplash

It’s a bad week for anyone hoping to remain untouched by generative AI. Mere days after OpenAI was sued for allegedly scraping personal data to train ChatGPT and DALL-E, Google has confirmed via its privacy policy that it can and does scrape public data to train its own AI systems. Nothing a person can publicly post is sacred—if anyone can see it, Google will scrape it.

Google quietly updated its privacy policy over the weekend. The page is long, but if you scroll way down, there’s a little blurb about AI that reads: “Google uses information to improve our services and to develop new products, features and technologies that benefit our users and the public. For example, we use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.” Publicly available information includes social media posts, blogs, news articles, comments, app reviews, website profiles, cookies, and other web activity. If a government agency or jurisdiction makes information available to the public via an online database, that information is scrapable, too.

A portion of Google's privacy policy with changes highlighted in red and green.
Google's changes to its privacy policy, as captured in its Updates section. Credit: Google

In its old privacy policy, Google only admitted to harvesting and using publicly available data to train its “language” models, such as PaLM. (And ads, obviously, but we’re talking AI here.) PaLM is a large language model that powers a variety of Google’s products, from experimental robots to generative AI systems like Bard (Google’s answer to ChatGPT). Every “decision-making” process PaLM undergoes is based on massive amounts of data scraped from all corners of the internet. Now Google is explicitly stating that this data can and will be used to train Google Translate, Bard, and its Cloud AI capabilities.

The update stokes an ever-growing fire concerning how generative AI tools are trained. Internet users are increasingly aware of—and upset about—generative AI developers’ tendency to use their data without consent. While extreme cases like OpenAI’s alleged use of private medical data are (hopefully) rare, some worry about the potential consequences of letting scrapers run rampant over people’s data, even when it’s publicly available. The issue has gained enough urgency that members of Congress and the International Association of Privacy Professionals (IAPP) have begun scrutinizing how generative AI models are trained and used in the United States.
