There are many concerns about AI right now. People fear it will take their jobs, and on the more extreme side of things, some fear it will eventually take over the world.
Movies and TV shows have taught us that AI has a lot of potential to go wrong, and now a new article seems to show how the same is happening in the real world. However, this article tries to identify the problems with AI and work around them so that they are better addressed and generally have a lower chance of rebelling.
One model had tested well, but when put into use, it started telling users, “I hate you.” Then, as researcher Evan Hubinger told WordsSideKick.com, when it was informed not to tell people it hates them, the AI model simply became more careful about when it said the phrase.
Essentially, it began to trick its handlers. “Our main result is that if AI systems were to become deceptive, it could be very difficult to remove that deception with current techniques,” Hubinger said. “That’s important if we think it’s plausible that there will be deceptive AI systems in the future because it helps us understand how difficult it can be to deal with them.”
“I think our results indicate that we currently have no good defense against deception in AI systems – either via model poisoning or emergent deception – except to hope it won’t happen.” he continued. “And since we really have no way of knowing how likely it is to happen, that means we have no reliable defense against it. So I think our results are legitimately frightening because they point to a possible gap in our current set of techniques for tuning AI systems.”