AI managing AI that is monitoring AI: What could possibly go wrong?

If IT leaders were in a statistical analysis class, many would be in a lot of trouble. Given one very low-reliability element and told to pair it with another low-reliability element, a good student would know that the error rate — the risk of bad data results — would get higher. Quite likely much higher.
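The statistics-class arithmetic above can be sketched in a few lines. The error rates below are made-up illustrative numbers, and the calculation assumes the two components fail independently:

```python
# Back-of-envelope illustration: pairing two unreliable components
# raises the chance that at least one of them errs.
p_a = 0.10  # error rate of the first component (hypothetical)
p_b = 0.10  # error rate of the second component (hypothetical)

# Assuming independent failures, the chance that at least one
# component gets it wrong:
p_either = 1 - (1 - p_a) * (1 - p_b)
print(f"{p_either:.2%}")  # 19.00%
```

Two 10% error rates do not average out; chained together, nearly one run in five involves at least one failure.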

And yet, some tech leaders seem fine with the idea of combating generative AI’s bad data — a.k.a. hallucinations — by marrying different genAI programs. Even worse, they are now embracing the idea of using genAI to monitor/manage other genAI as a way to negate hallucinations. Math doesn’t work that way.

Consider: OpenAI recently launched a genAI program designed to identify errors made by other genAI programs. “We’ve trained a model, based on GPT-4, called CriticGPT to catch errors in ChatGPT’s code output. We found that when people get help from CriticGPT to review ChatGPT code, they outperform those without help 60% of the time,” the company wrote in a post announcing the new app.

OpenAI predicts that hallucinations are likely to become harder for humans to find. The company talks about the limits of its Reinforcement Learning from Human Feedback (RLHF) approach, in which human AI trainers evaluate ChatGPT responses. 

“As we make advances in reasoning and model behavior, ChatGPT becomes more accurate and its mistakes become more subtle. This can make it hard for AI trainers to spot inaccuracies when they do occur, making the comparison task that powers RLHF much harder,” OpenAI wrote. “This is a fundamental limitation of RLHF, and it may make it increasingly difficult to align models as they gradually become more knowledgeable than any person that could provide feedback.”

This is consistent with many other reports on genAI efforts, which suggest that, despite what experienced IT folk have come to expect from software (namely, that it generally gets better with each update), hallucinations are likely to get worse.

“Worse” in this context is a complicated word. Hallucinations may not become more frequent, and the lies genAI chatbots tell may not become more outlandish. But they will become more nuanced, making it more likely that humans won’t catch them. That is a legitimate problem.

That said, it’s not at all certain that throwing more genAI at this problem will help as much as it will create more problems.

OpenAI’s argument is not that the software will work on its own, but that this new genAI software will train humans to be better at spotting hallucinations created by a different genAI program. 

“CriticGPT’s suggestions are not always correct, but we find that they can help trainers to catch many more problems with model-written answers than they would without AI help,” the company wrote. “Additionally, when people use CriticGPT, the AI augments their skills, resulting in more comprehensive critiques than when people work alone, and fewer hallucinated bugs than when the model works alone.”

And therein lies the logic problem. One of the criticisms of generative AI is that it is terrific at mimicking humans but fails to actually understand humans. I’m reminded of a column I wrote more than a decade ago, about engineers creating a product that tests for true love. (It was an actual product: a Bluetooth bra that would unhook only when it detected true love. Really. To be clear, I am not officially suggesting that engineers are as bad at understanding human emotions as genAI. Not disputing it, but also not officially saying it.) 

Getting back to genAI logic, the flawed assumption that OpenAI is making is that humans will continue checking their systems for lies. Humans are lazy, and human IT employees are overworked and under-resourced. The far more likely outcome is that humans will trust the AI-watching-AI more and more. That is where the real danger exists.

Another example of this “trust AI to find errors in other AI” thinking comes from Morgan Stanley. In a CIO.com piece looking at Morgan Stanley’s recent genAI rollout, the CEO of a different financial company spoke of using multiple genAI models to check on each other. 

Morgan Stanley wants to use genAI to create transcripts and summaries of its client meetings. What Aaron Cirksena, founder and CEO of MDRN Capital, suggested was that Morgan could also run transcripts and summaries from the genAI capabilities within Zoom, Google, Microsoft, or Apple — and then use yet another genAI program to compare the results and flag any informational conflicts. “How likely is it that both AI systems will get the same thing wrong?” Cirksena asked. 
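Mechanically, Cirksena’s suggestion amounts to diffing competing outputs and surfacing disagreements for review. A minimal sketch of that comparison step, using a plain-text diff as a stand-in for whatever comparator program a firm would actually deploy (`flag_conflicts` is a hypothetical helper, not anything Morgan Stanley has described):

```python
import difflib

def flag_conflicts(summary_a: str, summary_b: str) -> list[str]:
    """Return the lines on which two machine-generated summaries disagree.

    Lines prefixed '-' appear only in summary_a; lines prefixed '+'
    appear only in summary_b. An empty result means the summaries match
    line for line.
    """
    conflicts = []
    for line in difflib.unified_diff(
            summary_a.splitlines(), summary_b.splitlines(),
            lineterm="", n=0):
        # Skip the '---'/'+++' file headers; keep actual changed lines.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            conflicts.append(line)
    return conflicts
```

Note what this sketch cannot do: if both summaries contain the same wrong claim, there is no disagreement to flag, which is exactly the failure mode the column goes on to discuss.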

It is a legitimate question. But so is the opposite question: How likely is it that one or more of these genAI programs will introduce more hallucinations into the process? What if the checker program hallucinates that there are no conflicts when there are? 

An even worse problem is if the checker app labels things as disconnects that are actually fine. Why is that worse? This brings us back to the human nature issue. The more hassles that the checker program delivers to humans, the less inclined they will be to use it or believe it. 
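The alert-fatigue worry can be put in rough numbers. All of the rates below are made-up assumptions for illustration, but they show how quickly spurious flags can swamp real ones once volume is high:

```python
# Hypothetical rates, chosen only to illustrate alert fatigue.
meetings_per_day = 500       # assumed volume of genAI summaries
hallucination_rate = 0.01    # share of summaries with a real error
false_positive_rate = 0.05   # checker flags a correct summary as wrong
false_negative_rate = 0.02   # checker misses a real hallucination

real_errors = meetings_per_day * hallucination_rate
caught = real_errors * (1 - false_negative_rate)
bogus_flags = meetings_per_day * (1 - hallucination_rate) * false_positive_rate

print(f"real errors per day:  {real_errors:.1f}")
print(f"errors caught:        {caught:.1f}")
print(f"bogus flags per day:  {bogus_flags:.1f}")
```

Under these assumed rates, roughly five summaries a day contain a real error while about 25 correct summaries get flagged anyway; most alerts are noise, and humans learn to ignore noisy alarms.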

Consider mobile voice recognition today. Its accuracy is strong enough (often topping 99% and certainly topping 98%) that people are inclined to dictate a message and then send it. This has caused confusion and embarrassment. 
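Even 99% sounds less reassuring once you remember the figure is per word. A quick calculation, assuming a 25-word message (the message length is a made-up assumption):

```python
# At 99% per-word accuracy, a whole message still has a meaningful
# chance of containing at least one wrong word.
per_word_accuracy = 0.99
words = 25  # assumed message length

p_error = 1 - per_word_accuracy ** words
print(f"{p_error:.1%} chance of at least one wrong word")  # 22.2%
```

That gap between per-item accuracy and per-message accuracy is why people still get burned by dictation they trusted enough not to reread.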

I recently dictated a reply telling a colleague, “Fine. You can do that.” But the iPhone’s voice recognition heard the words “fine” and “you” next to each other and decided that the most likely F-word was a very different one. The message looked fine on the screen, so I hit Send. Only then did the system swap in the “other” word and send it anyway. Apple, can you please stop your system from changing a word after the message has been proofread?

When voice recognition accuracy percentages were in the low 90s, mistakes were so common that people carefully checked before sending. I fear the same disaster is going to hit with AI checking AI. I wonder what disasters that will deliver.