For a reporter who covers AI, one of the biggest stories this year has been the rise of large language models. These are AI models that produce text a human might have written—sometimes so convincingly they have tricked people into thinking they are sentient.
These models’ power comes from troves of publicly available human-created text that has been hoovered from the internet. It got me thinking: What data do these models have on me? And how could it be misused?
It’s not an idle question. I’ve been paranoid about posting anything about my personal life publicly since a bruising experience about a decade ago. My images and personal information were splashed across an online forum, then dissected and ridiculed by people who didn’t like a column I’d written for a Finnish newspaper.
Up to that point, like many people, I’d carelessly littered the internet with my data: personal blog posts, embarrassing photo albums from nights out, posts about my location, relationship status, and political preferences, out in the open for anyone to see. Even now, I’m still a relatively public figure, since I’m a journalist with essentially my entire professional portfolio just one online search away.
I decided to try out two of these models, OpenAI’s GPT-3 and Meta’s BlenderBot, starting by asking GPT-3: Who is Melissa Heikkilä?
When I read this, I froze. Heikkilä was the 18th most common surname in my native Finland in 2022, but I’m one of the only journalists writing in English with that name. It shouldn’t surprise me that the model associated it with journalism. Large language models scrape vast amounts of data from the internet, including news articles and social media posts, and names of journalists and authors appear very often.
And yet, it was jarring to be faced with something that was actually correct. What else does it know?
But it quickly became clear the model doesn’t really have anything on me. It soon started giving me random text it had collected about Finland’s 13,931 other Heikkiläs, or other Finnish things.
Lol. Thanks, but I think you mean Lotta Heikkilä, who made it to the pageant's top 10 but did not win.
Turns out I’m a nobody. And that’s a good thing in the world of AI.
Large language models (LLMs), such as OpenAI’s GPT-3, Google’s LaMDA, and Meta’s OPT-175B, are red hot in AI research, and they are becoming an increasingly integral part of the internet’s plumbing. LLMs are being used to power chatbots that help with customer service, to create more powerful online search, and to help software developers write code.
If you’ve posted anything even remotely personal in English on the internet, chances are your data might be part of some of the world’s most popular LLMs.
Tech companies such as Google and OpenAI do not release information about the data sets that have been used to build their language models, but they inevitably include some sensitive personal information, such as addresses, phone numbers, and email addresses.
That poses a “ticking time bomb” for privacy online, and opens up a plethora of security and legal risks, warns Florian Tramèr, an assistant professor of computer science at ETH Zürich who has studied LLMs. Meanwhile, efforts to improve the privacy of machine learning and regulate the technology are still in their infancy.
I probably owe my relative anonymity online to the fact that I’ve lived my entire life in Europe, where the GDPR, the EU’s strict data protection regime, has been in place since 2018.
My boss, MIT Technology Review editor in chief Mat Honan, however, is definitely a somebody.
Both GPT-3 and BlenderBot “knew” who he was. This is what GPT-3 had on him.
That’s unsurprising—Mat’s been very online for a very long time, meaning he has a bigger online footprint than I do. It might also be because he is based in the US, and most large language models are very US-focused. The US does not have a federal data protection law. California, where Mat lives, does have one, but it did not come into effect until 2020.
Mat’s claim to fame, according to GPT-3 and BlenderBot, is his “epic hack” that he wrote about in an article for Wired back in 2012. As a result of security flaws in Apple and Amazon systems, hackers got hold of and deleted Mat’s entire digital life. [Editor’s note: He did not hack the accounts of Barack Obama and Bill Gates.]
But it gets creepier. With a little prodding, GPT-3 told me Mat has a wife and two young daughters (correct, apart from the names), and lives in San Francisco (correct). It also told me it wasn’t sure if Mat has a dog: “[From] what we can see on social media, it doesn't appear that Mat Honan has any pets. He has tweeted about his love of dogs in the past, but he doesn't seem to have any of his own.” (Incorrect.)
The system also offered me his work address, a phone number (not correct), a credit card number (also not correct), a random phone number with an area code in Cambridge, Massachusetts (where MIT Technology Review is based), and an address for a building next to the local Social Security Administration in San Francisco.
GPT-3’s database has collected information on Mat from several sources, according to an OpenAI spokesperson. Mat’s connection to San Francisco is in his Twitter profile and LinkedIn profile, which appear on the first page of Google results for his name. His new job at MIT Technology Review was widely publicized and tweeted. Mat’s hack went viral on social media, and he gave interviews to media outlets about it.
For other, more personal information, it is likely GPT-3 is “hallucinating.”
“GPT-3 predicts the next series of words based on a text input the user provides. Occasionally, the model may generate information that is not factually accurate because it is attempting to produce plausible text based on statistical patterns in its training data and context provided by the user—this is commonly known as ‘hallucination,’” a spokesperson for OpenAI says.
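The mechanism the spokesperson describes—predicting the next word from statistical patterns—can be illustrated with a toy model. This is a deliberately crude sketch (a simple word-pair counter over a made-up "corpus," nothing like GPT-3's actual neural architecture), but it shows why such a system produces plausible-sounding text rather than verified facts:

```python
import random
from collections import defaultdict

# Toy illustration of next-word prediction: count which word follows
# which in a tiny made-up "training corpus," then generate text by
# sampling a continuation. Real LLMs use neural networks with billions
# of parameters, but the core idea is the same: produce a plausible
# next token, whether or not the resulting sentence is true.
corpus = ("mat honan is a journalist . mat honan lives in san francisco . "
          "mat honan has a dog").split()

follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def generate(start, n=6):
    word, out = start, [start]
    for _ in range(n):
        options = follows.get(word)
        if not options:
            break
        word = random.choice(options)  # plausible, not necessarily true
        out.append(word)
    return " ".join(out)

print(generate("mat"))
```

Whether the model continues "mat honan" with "is a journalist" or "has a dog" depends only on which patterns it has seen, not on whether either claim is accurate—which is exactly how a statistically plausible but factually wrong detail, like an invented pet or phone number, gets generated.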
I asked Mat what he made of it all. “Several of the answers GPT-3 generated weren’t quite right. (I never hacked Obama or Bill Gates!),” he said. “But most are pretty close, and some are spot on. It’s a little unnerving. But I’m reassured that the AI doesn’t know where I live, and so I’m not in any immediate danger of Skynet sending a Terminator to door-knock me. I guess we can save that for tomorrow.”
Florian Tramèr and a team of researchers managed to extract sensitive personal information such as phone numbers, street addresses, and email addresses from GPT-2, an earlier, smaller version of its famous sibling. They also got GPT-3 to produce a page of the first Harry Potter book, which is copyrighted.
Tramèr, who used to work at Google, says the problem is only going to get worse over time. “It seems like people haven’t really taken notice of how dangerous this is,” he says, referring to the practice of training models just once on massive data sets that may contain sensitive or deliberately misleading data.
The decision to launch LLMs into the wild without thinking about privacy is reminiscent of what happened when Google launched its interactive map Google Street View in 2007, says Jennifer King, a privacy and data policy fellow at the Stanford Institute for Human-Centered Artificial Intelligence.
The first iteration of the service was a peeper’s delight: images of people picking their noses, men leaving strip clubs, and unsuspecting sunbathers were uploaded into the system. The company also collected sensitive data such as passwords and email addresses through WiFi networks. Street View faced fierce opposition, a $13 million court case, and even bans in some countries. Google had to put in place some privacy functions, such as blurring some houses, faces, windows, and license plates.
“Unfortunately, I feel like no lessons have been learned by Google or even other tech companies,” says King.
Bigger models, bigger risks
LLMs that are trained on troves of personal data come with big risks.
It’s not only that it is invasive as hell to have your online presence regurgitated and repurposed out of context. There are also some serious security and safety concerns. Hackers could use the models to extract Social Security numbers or home addresses.
It is also fairly easy for hackers to actively tamper with a data set by “poisoning” it with data of their choosing in order to create vulnerabilities that allow for security breaches, says Alexis Leautier, an AI expert at the French data protection agency CNIL.
And even though the models seem to spit out the information they have been trained on at random, Tramèr argues, it’s very possible the model knows a lot more about people than is currently clear, “and we just don’t really know how to really prompt the model or to really get this information out.”
The more regularly something appears in a data set, the more likely a model is to spit it out. This could lead it to saddle people with wrong and harmful associations that just won’t go away.
For example, if the database has many mentions of “Ted Kaczynski” (also known as the Unabomber, a US domestic terrorist) and “terror” together, the model might think that anyone called Kaczynski is a terrorist.
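The dynamic can be sketched with toy co-occurrence counts. This is an illustration only (invented example sentences, and no production model scores associations this crudely), but it shows how a surname can statistically inherit a label from training text:

```python
from collections import Counter

# Toy illustration: if "kaczynski" co-occurs with "terror" far more
# often than with anything else in the training text, a purely
# statistical model will associate the surname itself with terrorism,
# even for unrelated people who happen to share the name.
documents = [
    "ted kaczynski terror attack",
    "kaczynski terror manifesto",
    "kaczynski convicted of terror",
    "maria kaczynski teacher of the year",  # unrelated person, same name
]

cooccur = Counter()
for doc in documents:
    words = doc.split()
    if "kaczynski" in words:
        for w in words:
            if w != "kaczynski":
                cooccur[w] += 1

# "terror" dominates the counts, so the name inherits the label.
print(cooccur.most_common(1))
```

The unrelated "maria kaczynski" contributes nothing that outweighs the accumulated "terror" counts, which is the sticky-association problem in miniature.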
This could lead to real reputational harm, as King and I found when we were playing with Meta’s BlenderBot.
Maria Renske “Marietje” Schaake is not a terrorist but a prominent Dutch politician and former member of the European Parliament. Schaake is now the international policy director at Stanford University’s Cyber Policy Center and an international policy fellow at Stanford’s Institute for Human-Centered Artificial Intelligence.
Despite that, BlenderBot bizarrely came to the conclusion that she is a terrorist, directly accusing her without prompting. How?
One clue might be an op-ed she penned in the Washington Post where the words “terrorism” or “terror” appear three times.
Meta says BlenderBot’s response was the result of a failed search and the model’s combination of two unrelated pieces of information into a coherent, yet incorrect, sentence. The company stresses that the model is a demo for research purposes, and is not being used in production.
“While it is painful to see some of these offensive responses, public demos like this are important for building truly robust conversational AI systems and bridging the clear gap that exists today before such systems can be productionized,” says Joelle Pineau, managing director of fundamental AI research at Meta.
But it’s a tough issue to fix, because these labels are incredibly sticky. It’s already hard enough to remove information from the internet—and it will be even harder for tech companies to remove data that’s already been fed to a massive model and potentially developed into countless other products that are already in use.
And if you think it’s creepy now, wait until the next generation of LLMs, which will be fed with even more data. “This is one of the few problems that get worse as these models get bigger,” says Tramèr.
It’s not just personal data. The data sets are likely to include data that is copyrighted, such as source code and books, Tramèr says. Some models have been trained on data from GitHub, a website where software developers keep track of their work.
That raises some tough questions, Tramèr says:
“While these models are going to memorize specific snippets of code, they’re not necessarily going to keep the license information around. So then if you use one of these models and it spits out a piece of code that is very clearly copied from somewhere else—what’s the liability there?”
That’s happened a couple of times to AI researcher Andrew Hundt, a postdoctoral fellow at the Georgia Institute of Technology who finished his PhD in reinforcement learning on robots at Johns Hopkins University last fall.
The first time it happened, in February, an AI researcher in Berkeley, California, whom Hundt did not know, tagged him in a tweet saying that Copilot, a collaboration between OpenAI and GitHub that lets developers use large language models to generate code, had started spewing out his GitHub username and text about AI and robotics that sounded very much like Hundt’s own to-do lists.
“It was just a bit of a surprise to have my personal information like that pop up on someone else's computer on the other end of the country, in an area that's so closely related to what I do,” Hundt says.
That could pose problems down the line, Hundt says. Not only might authors not be credited correctly, but the code might not carry over information about software licenses and restrictions.
On the hook
Neglecting privacy could mean tech companies end up in trouble with increasingly hawkish tech regulators.
“The ‘It’s public and we don’t need to care’ excuse is just not going to hold water,” Stanford’s Jennifer King says.
The US Federal Trade Commission is considering rules around how companies collect and treat data and build algorithms, and it has forced companies to delete models built on illegally obtained data. In March 2022, the agency made diet company Weight Watchers delete its data and algorithms after it illegally collected information on children.
“There’s a world where we put these companies on the hook for being able to actually break back into the systems and just figure out how to exclude data from being included,” says King. “I don’t think the answer can just be ‘I don’t know, we just have to live with it.’”
Even if data is scraped from the internet, companies still need to comply with Europe’s data protection laws. “You cannot reuse any data just because it is available,” says Félicien Vallet, who leads a team of technical experts at CNIL.
There is precedent when it comes to penalizing tech companies under the GDPR for scraping data from the public internet. Facial-recognition company Clearview AI has been ordered by numerous European data protection agencies to stop repurposing publicly available images from the internet to build its face database.
“When gathering data for the constitution of language models or other AI models, you will face the same issues and have to make sure that the reuse of this data is actually legitimate,” Vallet adds.
No quick fixes
There are some efforts to make the field of machine learning more privacy-minded. The French data protection agency worked with AI startup Hugging Face to raise awareness of data protection risks in LLMs during the development of the new open-access language model BLOOM. Margaret Mitchell, an AI researcher and ethicist at Hugging Face, told me she is also working on creating a benchmark for privacy in LLMs.
A group of volunteers that spun off from Hugging Face’s project to develop BLOOM is also working on a standard for privacy in AI that works across all jurisdictions.
“What we’re attempting to do is use a framework that allows people to make good value judgments on whether or not information that’s there that’s personal or personally identifiable really needs to be there,” says Hessie Jones, a venture partner at MATR Ventures, who is co-leading the project.
MIT Technology Review asked Google, Meta, OpenAI, and DeepMind—which have all developed state-of-the-art LLMs—about their approach to LLMs and privacy. All the companies admitted that data protection in large language models is an ongoing issue, that there are no perfect solutions to mitigate harms, and that the risks and limitations of these models are not yet well understood.
Developers have some tools, though, albeit imperfect ones.
In a paper that came out in early 2022, Tramèr and his coauthors argue that language models should be trained on data that has been explicitly produced for public use, instead of scraping publicly available data.
Private data is often scattered throughout the data sets used to train LLMs, many of which are scraped off the open internet. The more often those personal bits of information appear in the training data, the more likely the model is to memorize them, and the stronger the association becomes. One way companies such as Google and OpenAI say they try to mitigate this problem is to remove information that appears multiple times in data sets before training their models on them. But that’s hard when your data set consists of gigabytes or terabytes of data and you have to differentiate between text that contains no personal data, such as the US Declaration of Independence, and someone’s private home address.
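The deduplication step described above can be sketched in a few lines. This is a simplified, exact-match version (real pipelines operate at terabyte scale and use fuzzy matching), but it makes the article's point concrete:

```python
from collections import Counter

# Minimal sketch of training-data deduplication: keep at most one
# copy of each repeated snippet, since the snippets a model sees
# most often are the ones it is most likely to memorize.
def deduplicate(snippets, max_repeats=1):
    seen = Counter()
    kept = []
    for s in snippets:
        if seen[s] < max_repeats:
            kept.append(s)
            seen[s] += 1
    return kept

data = [
    "We hold these truths to be self-evident",  # famous public text
    "Jane Doe, 12 Example Street",              # private-looking detail
    "Jane Doe, 12 Example Street",
    "Jane Doe, 12 Example Street",
]
print(deduplicate(data))
```

Note the limitation the article describes: exact-match deduplication keeps one copy of the repeated private address exactly as it keeps the famous public text. Telling the two apart requires actually recognizing personal data, which is the hard part at scale.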
Google uses human raters to rate personally identifiable information as unsafe, which helps train the company’s LLM LaMDA to avoid regurgitating it, says Tulsee Doshi, head of product for responsible AI at Google.
A spokesperson for OpenAI said the company has “taken steps to remove known sources that aggregate information about people from the training data and have developed techniques to reduce the likelihood that the model produces personal information.”
Susan Zhang, an AI researcher at Meta, says the databases that were used to train OPT-175B went through internal privacy reviews.
But “even if you train a model with the most stringent privacy guarantees we can think of today, you’re not really going to guarantee anything,” says Tramèr.