Header image generated by DALL-E 2

Explainable AI and ChatGPT Detection

The tools built to counter AI-generated text in education and other fields need their own transparency measures.

Nakul Upadhya
12 min read · Feb 6, 2023


When ChatGPT was released last November, it took the world by storm. Every influencer on LinkedIn started posting about how to use it for coding and productivity. Entrepreneurs began to think of potential business ideas around generative text models. Even Google was spooked and issued a ‘code red’ due to the chatbot’s ability to provide comprehensive answers [1].

But despite this hype, educators around the world immediately saw a huge problem: students using ChatGPT for their homework and essays. According to a Study.com survey, over 89% of students have used ChatGPT for a homework assignment, 48% have used it for a test or quiz, and 53% have used it to write an essay [2].

OpenAI themselves have included some considerations for education in their ChatGPT documentation, acknowledging the chatbot’s use in academic dishonesty.

To combat these issues, OpenAI recently released an AI Text Classifier that predicts how likely it is that a piece of text was generated by AI from a variety of sources, such as ChatGPT. But this isn’t the only tool. Companies like Originality.ai have been around even longer and offer similar functionality. Even more recently, Stanford University researchers published DetectGPT, a zero-shot method for detecting machine-generated text using probability curvature [4].
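To give a flavour of how the zero-shot approach works, here is a rough sketch of DetectGPT’s central idea [4]: machine-generated text tends to sit near a local maximum of the source model’s log-probability, so lightly rewriting it lowers the log-probability more than it would for human-written text. This is only an illustration under assumptions (GPT-2 as the scoring model, the perturbation function left abstract), not the authors’ released implementation.

```python
# A rough sketch of the DetectGPT criterion [4], using GPT-2 as the scoring model.
# The `perturb` callable (the paper uses T5 mask-filling) is left to the reader.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_log_likelihood(text: str) -> float:
    """Mean per-token log-probability of `text` under GPT-2."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return -loss.item()

def detectgpt_score(text: str, perturb, n_perturbations: int = 20) -> float:
    """Perturbation discrepancy: how much more likely the original text is than
    lightly rewritten versions of itself. Large positive values suggest AI text."""
    original = avg_log_likelihood(text)
    rewrites = [avg_log_likelihood(perturb(text)) for _ in range(n_perturbations)]
    return original - sum(rewrites) / len(rewrites)
```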

So the plagiarism issue is solved, right? People won't be able to cheat using chatbots with these tools around, right? The answer to both of these questions is… it’s complicated.

While all of these tools report high detection accuracy, that accuracy is not perfect, and the false positives and false negatives can cause ethical issues that require their own set of solutions and research directions. In this article, I aim to break down some of these issues around model-based chatbot detection.

First, I’ll address some of the issues that stem simply from model inadequacies and can be fixed with more training and tuning. These issues are specific to OpenAI’s Text Classifier and may not apply to other production-ready AI detectors. Many of them are acknowledged by OpenAI in the blog post accompanying the classifier [3], so I highly encourage everyone to give that a read as well.

I then aim to elaborate on some of the more complex issues involved with any AI Text Classifier, possible ways to try and explain their predictions, and the extra work and research that is needed to make these chatbot detectors viable in high-stakes use.

Training & Tuning Problems

The first set of problems are ones that can simply be solved with more training and tuning. These are:

  1. The classifier is unreliable on short texts [3]. In my opinion, this limitation could potentially be solved in the future through more tuning and data sets that include shorter texts.
  2. The classifier currently only works on English text, not on other languages or on code [3]. Like issue 1, this could be solved over time through more tailored data sets and training.
  3. Classifiers based on neural networks are known to be poorly calibrated outside of their training data [3]. There are plenty of techniques to improve calibration and reduce overfitting in ML models (a minimal calibration sketch follows this list). Additionally, multiple models could be trained to identify AI-generated text in different subject matters, reducing the need for generalization.
  4. The classifier is likely to get things wrong on text written by children because it was primarily trained on English content written by adults [3]. Providing the classifier with stories and essays written by children could help with this.

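To make point 3 a bit more concrete, below is a minimal sketch of temperature scaling, one standard post-hoc remedy for miscalibration (it will not fully fix out-of-distribution behavior). It assumes access to the detector’s raw logits and a labeled validation set, which OpenAI’s hosted classifier does not expose; the sketch is illustrative only.

```python
# A minimal sketch of post-hoc probability calibration via temperature scaling.
# The detector logits and validation labels here are placeholders.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def fit_temperature(val_logits: np.ndarray, val_labels: np.ndarray) -> float:
    """Find the temperature T minimizing negative log-likelihood on held-out data.

    val_logits: (n_samples, n_classes) raw detector scores.
    val_labels: (n_samples,) integer class indices (e.g., 0 = human, 1 = AI).
    """
    def nll(temperature: float) -> float:
        log_probs = log_softmax(val_logits / temperature, axis=1)
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()

    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return result.x

# At inference time, divide the detector's logits by the fitted temperature before
# the softmax so the reported confidence better matches how often it is correct.
```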
While these issues are serious, they are all symptoms of OpenAI’s model still being a work in progress. It is unknown whether other companies’ models share these specific issues, but any model still under development will run into similar roadblocks.

The Hard Problems

The previous set of problems could be tackled with some data collection, tuning, and extra training. Unfortunately, there are some problems with AI-detection tools that more training and development cannot solve.

The first problem arises when someone simply takes inspiration from a chatbot or heavily paraphrases its output [3]. A chatbot detector could pick up on the human writing style (since a human re-wrote the chatbot’s answer) and classify the document as human-written. In these cases, is the detection model wrong? Is this any different from summarizing articles found online? It is not clear even to humans whether paraphrasing a chatbot still counts as “AI writing,” so we cannot expect to build a model that understands this distinction.

In a similar vein, the model cannot classify what OpenAI calls “predictable text”: text generated from prompts where the answer is very straightforward [3]. If I ask both ChatGPT and a human “When did NAFTA go into effect?”, both could give the identical answer of “January 1st, 1994.” If you ask a chatbot detector whether that answer is AI-generated or human-written, it can only make a wild guess. While this is a problem, it is not unique to chatbots. Students have been googling answers ever since search engines became popular, and in an examination context, instructors and institutions need to adopt good proctoring practices or redesign their exams so that questions require critical thinking instead of regurgitation.

The Ethical Problems

The above problems, while serious, were all due to various edge cases in the text input. However, even with a non-problematic input, chatbot-detecting models can and will be wrong in many cases. No model can have perfect accuracy in real-world data, but we still need these models, so what do we do?

Let's look at a plausible scenario and walk through the various issues we may encounter when using chatbot writing detection.

College Admissions

Imagine Bill is a senior in high school applying to Stanford. He spends two and a half months thinking, outlining, and writing the various essays required for admission. At the end of this process, he is excited and believes that the admissions officer at his dream school will see his passion… only to receive a rejection email on the grounds that the essay is believed to be ChatGPT-generated. He disputes the claim, knowing he never touched the tool, and asks what about his essay made the detector flag it, but the admissions officers cannot give an answer; they simply trust the detector because it reports 99.5% accuracy.

In this story, the chatbot detector misclassified Bill’s essay as ChatGPT-generated (a false positive), and people took the prediction at face value due to the high accuracy. However, this completely disregards the sheer scale of the applicant pool. Each college receives thousands of applications; Stanford, for example, received 55,471 applications in 2021 [5]. Even a 0.5% error rate can result in roughly 280 applications being incorrectly flagged, a huge number considering only 2,190 students were admitted [5]. With these statistics, a dispute process may be needed, but how would disputes be resolved if even the admissions officers don’t know why the model made a prediction? This is why we need Explainable AI (XAI).
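A quick back-of-the-envelope calculation makes the point. The application count comes from [5]; the error rates and the share of applicants who actually use a chatbot are assumptions for illustration only.

```python
# Back-of-the-envelope: even a very accurate detector flags many innocent essays
# at this scale, and most of the assumed numbers below are hypothetical.
n_applications = 55_471      # Stanford applications in 2021 [5]
false_positive_rate = 0.005  # 0.5% of honest essays wrongly flagged (assumed)
true_positive_rate = 0.995   # 99.5% of AI-written essays caught (assumed)
cheating_rate = 0.05         # hypothetical share of applicants using a chatbot

honest = n_applications * (1 - cheating_rate)
cheaters = n_applications * cheating_rate

false_positives = honest * false_positive_rate   # ~263 innocent essays flagged
true_positives = cheaters * true_positive_rate   # ~2,760 AI essays flagged

precision = true_positives / (true_positives + false_positives)
print(f"Innocent essays flagged: {false_positives:.0f}")
print(f"Chance a flagged essay really is AI-written: {precision:.1%}")
```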

The “Obvious” Solution

One potential solution to the above conundrum is to identify word importance. Since these prediction models are often large and complicated, users can leverage post-hoc (after training and prediction), model-agnostic methods like SHAP to get an approximation of each word’s attribution to the end prediction. This methodology has been used to provide explanations for sentiment classification, topic tagging, and other NLP tasks and could potentially work for chatbot-writing detection as well.
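As a concrete illustration, here is a minimal sketch of what that workflow might look like with the shap library and a Hugging Face text-classification pipeline. The model checkpoint and the “AI” label are placeholders (no public ChatGPT-detector checkpoint is assumed); the pattern follows SHAP’s standard text-explanation usage.

```python
# A minimal sketch of post-hoc word importance with SHAP, assuming a hypothetical
# Transformer-based detector exposed as a text-classification pipeline.
import shap
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="my-org/ai-text-detector",  # hypothetical detector checkpoint
    return_all_scores=True,           # SHAP needs scores for every class
)

explainer = shap.Explainer(detector)  # wraps the pipeline with a text masker

essay = "One topic I am deeply passionate about is the application of graph theory..."
shap_values = explainer([essay])

# Positive attributions push the prediction toward the "AI" class,
# negative ones toward "human"; the plot highlights words accordingly.
shap.plots.text(shap_values[:, :, "AI"])
```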

Let's say the admissions officers did have an explanation for why Bill’s essay was flagged and it looked something like this:

One topic I am deeply passionate about is the application of **graph theory** in **social network analysis**. I have always been fascinated by the potential of **graph theory** to provide insights into complex systems, and I believe that its application in **social network analysis** offers a unique opportunity to understand and make sense of the interconnected world around us. From uncovering patterns of human behavior to predicting the spread of ideas and information, the application of **graph theory** in **social network analysis** holds immense promise. As I pursue my studies, I am eager to delve deeper into this field and contribute to its development, and I believe that attending Stanford University would provide me with the ideal environment to do so.

The bolded words here are what SHAP says the model picked up on to label Bill’s essay as AI-written. Looking at this, it seems the detection model just can’t believe that Bill knows anything about graph theory and social network analysis. If I were the admissions officer, I would now have more insight into the decision and could judge that Bill’s essay may be a false positive. Even if the admissions officer blindly trusted the detection model, Bill could dispute the decision because he does know what graph theory is and the AI misclassified his essay.

Is this enough?

In a perfect world, one could get a cut-and-dried explanation from SHAP like the one above and everyone would be happy. Unfortunately, we do not live in a perfect world, and such clean explanations will probably be rare. One problem with post-hoc methods is that they can only estimate what the model looked at; they do not actually know what the model is doing internally. Depending on the parameters used for the SHAP calculations, we could get an explanation like this:

One topic I am deeply passionate about is the application of graph theory in social network analysis. I have always been fascinated by the potential of graph theory to provide insights into complex systems, and I believe that its application in social network analysis offers a unique opportunity to understand and make sense of the interconnected world around us. From uncovering patterns of human behavior to predicting the spread of ideas and information, the application of graph theory in social network analysis holds immense promise. As I pursue my studies, I am eager to delve deeper into this field and contribute to its development, and I believe that attending Stanford University would provide me with the ideal environment to do so.

This explanation is very different from the previous one, yet it came from the same input and the same model. So which version of the explanation should we trust? This is where post-hoc methods fall short.

At this point, readers familiar with the Transformer architecture behind ChatGPT are probably protesting: SHAP may not be perfect, but we don’t need SHAP! Can’t we just visualize the model’s attention weights? That is a word-importance measure that comes directly from the model. And I agree, to an extent. Attention mechanisms have often been touted as a built-in explanation mechanism that makes any Transformer inherently explainable, and by visualizing attention weights we may get a more faithful picture than post-hoc methods like SHAP.
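For the curious, here is roughly what pulling attention weights out of a BERT-style classifier looks like with Hugging Face Transformers. The checkpoint name is again a placeholder, and averaging the last layer’s heads is just one common heuristic, not a canonical recipe.

```python
# A minimal sketch of extracting attention weights from a Transformer encoder.
# "my-org/ai-text-detector" is a hypothetical checkpoint name.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "my-org/ai-text-detector"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, output_attentions=True
)

inputs = tokenizer("One topic I am deeply passionate about ...", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len). A common (and debated) heuristic is to
# average the last layer's heads and read off the attention from [CLS] to each token.
last_layer = outputs.attentions[-1].mean(dim=1)   # (batch, seq_len, seq_len)
cls_attention = last_layer[0, 0]                   # attention from the [CLS] token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, weight in zip(tokens, cls_attention.tolist()):
    print(f"{tok:>15s}  {weight:.3f}")
```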

Downsides of Attention and Word Importance

While attention is a built-in part of the model, it is debated whether it actually provides accurate explanations, especially in classification tasks. Some researchers have shown that attention often does not agree with other explanation methods such as gradient-times-input and input perturbation [6]. While some of these results have been disputed by other researchers [7], they do showcase some of the unreliability of attention. Other work similarly finds that, while attention can sometimes produce plausible explanations, it is only a noisy indicator of word importance in classification tasks [8].

Even if attention does accurately show word importance, is word importance (also known as input saliency) truly the best way to deliver these explanations? Can a human look at a saliency map and understand how the highlighted words signal AI-generated content? I don’t think so. Saliency maps also have the unfortunate property of sometimes providing the same explanation for every possible prediction. Quoting Cynthia Rudin [9]:

Even if both models are correct (the original black box is correct in its prediction and the explanation model is correct in its approximation of the black box’s prediction), it is possible that the explanation leaves out so much information that it makes no sense.

Dr. Rudin then showcases a funny example from image classification (one that could translate well to text classification):

A demonstration of the drawbacks of saliency maps (Image created by C. Rudin [9])

Imagine if the explanation for why Bill’s essay is GPT-generated is:

One topic I am deeply passionate about is the application of graph theory in social network analysis. I have always been fascinated by the potential of graph theory to provide insights into complex systems, and I believe that its application in social network analysis offers a unique opportunity to understand and make sense of the interconnected world around us. From uncovering patterns of human behavior to predicting the spread of ideas and information, the application of graph theory in social network analysis holds immense promise. As I pursue my studies, I am eager to delve deeper into this field and contribute to its development, and I believe that attending Stanford University would provide me with the ideal environment to do so.

But the explanation for why it isn’t AI-written is:

One topic I am deeply passionate about is the application of graph theory in social network analysis. I have always been fascinated by the potential of graph theory to provide insights into complex systems, and I believe that its application in social network analysis offers a unique opportunity to understand and make sense of the interconnected world around us. From uncovering patterns of human behavior to predicting the spread of ideas and information, the application of graph theory in social network analysis holds immense promise. As I pursue my studies, I am eager to delve deeper into this field and contribute to its development, and I believe that attending Stanford University would provide me with the ideal environment to do so.

This would cause an incredible amount of confusion and distrust in the chatbot-detection system.

So what do we do?

Unfortunately, I don’t have a foolproof answer, but there are a few directions that data science researchers, developers, and other stakeholders could pursue:

  1. Don’t rely on a single model: One model could be wrong, but if three detectors agree, the chance of all of them being wrong is much lower. This isn’t foolproof (somebody’s writing may simply be very similar to a chatbot’s), and it does not bridge the interpretability gap, but at least the predictions become a bit more trustworthy (a minimal voting sketch follows this list).
  2. Chatbot Watermarks: One suggestion made by OpenAI researcher Scott Aaronson [10] is to add a “watermark.” This would involve tweaking how the model generates outputs so that they carry a hidden statistical pattern that detection models can look for. Saliency measures would then work much better, since the “word importance” would simply highlight the watermark. Unfortunately, this relies on chatbot developers being cooperative: if someone developed their own version of ChatGPT and did not add the watermark, watermark detection would fail.
  3. Use Interpretable Models: Instead of relying on large black-box Transformers, a possible research direction would be to shift the focus to inherently interpretable models. One such model is the Neural Prototype Tree [11], an architecture that builds a decision tree out of “prototypes,” interpretable representations of patterns in the data. For example, prototypes in bird image recognition could be “red throat” and “elongated beak.” For chatbot detection, prototypes could potentially be things like repeated long phrases, unusual grammar, or a lack of citations.
  4. Change the questions and prompts: In an educational context, if your assignments or questions can be answered by a chatbot, it’s worth considering whether the question itself is still valid. It might be worthwhile to instead write prompts and questions that require more logical reasoning and justification.
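Here is a minimal sketch of the first suggestion, majority voting across several detectors. The detector functions are stubs standing in for calls to real services or models (OpenAI’s classifier, Originality.ai, a DetectGPT-style check, and so on).

```python
# A minimal sketch of majority voting across multiple AI-text detectors.
# Each detector below is a placeholder for a real service or model call.
from typing import Callable, List

def flag_as_ai(text: str, detectors: List[Callable[[str], bool]],
               min_agreement: int = 2) -> bool:
    """Flag the text only if at least `min_agreement` detectors say it is AI-written."""
    votes = sum(1 for detect in detectors if detect(text))
    return votes >= min_agreement

def detector_a(text: str) -> bool:
    return False  # placeholder: e.g., query OpenAI's AI Text Classifier

def detector_b(text: str) -> bool:
    return False  # placeholder: e.g., query Originality.ai

def detector_c(text: str) -> bool:
    return False  # placeholder: e.g., run a local DetectGPT-style score

# flagged = flag_as_ai(essay_text, [detector_a, detector_b, detector_c])
```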

All of these suggestions have their own upsides and downsides, but they are a starting point. As a society, we will be dealing with more AI-related issues, and research and development into XAI needs to be a priority in this ever-changing landscape.

Resources and References

[1] A. Mok. Google’s management has reportedly issued a ‘code red’ amid the rising popularity of the ChatGPT AI. (2022). Business Insider

[2] Productive Teaching Tool or Innovative Cheating? (2023). Study.com

[3] New AI classifier for indicating AI-written text. (2023). OpenAI

[4] E. Mitchell, Y. Lee, A. Khazatsky, C.D. Manning, C. Finn. DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. (2023). arXiv preprint arXiv:2301.11305.

[5] Stanford University: Acceptance Rates & Statistics. (2022). Top Tier Admissions

[6] S. Jain, B.C. Wallace. Attention is not Explanation. (2019). 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

[7] S. Wiegreffe, Y. Pinter. Attention is not not Explanation. (2019). The 2019 Conference on Empirical Methods in Natural Language Processing.

[8] S. Serrano, N. Smith. Is Attention Interpretable? (2019). 57th Annual Meeting of the Association for Computational Linguistics

[9] C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. (2019). Nature Machine Intelligence.

[10] S. Aaronson. My AI Safety Lecture for UT Effective Altruism. (2022). Shtetl-Optimized: The Blog of Scott Aaronson.

[11] M. Nauta, R.v. Bree, C. Seifert. Neural Prototype Trees for Interpretable Fine-grained Image Recognition (2021). IEEE Conference on Computer Vision and Pattern Recognition 2021.


Nakul Upadhya

PhD Student @ University of Toronto researching XAI in Language Modeling and Time Series Forecasting