AI is still (very) vulnerable
Executive summary
AI as strong as superhuman Go programs is still exploitable.
LMs are far from perfect, so they are (easily) exploitable too.
Even game AIs are still exploitable
Game AIs, e.g., AlphaGo, are trained with perfect game rules and perfect feedback and achieve superhuman performance. Even so, the paper Adversarial Policies Beat Superhuman Go AIs shows that such systems can be defeated by adversarially trained opponents. A system remains exploitable unless it is optimal, i.e., it plays a Nash equilibrium, which for two-player zero-sum games coincides with the minimax solution.
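To state this precisely, here is the standard game-theoretic formulation (my notation, not taken from the Go paper). In a two-player zero-sum game with payoff $u(\pi_1,\pi_2)$ to player 1, the game value and the exploitability of a player-2 policy $\pi_2$ are

$$
v^{*}=\max_{\pi_1}\min_{\pi_2}u(\pi_1,\pi_2)=\min_{\pi_2}\max_{\pi_1}u(\pi_1,\pi_2),
\qquad
\mathrm{expl}(\pi_2)=\max_{\pi_1}u(\pi_1,\pi_2)-v^{*}\;\ge\;0 .
$$

$\mathrm{expl}(\pi_2)=0$ exactly when $\pi_2$ is a minimax (Nash) strategy; any policy short of optimal leaves a positive gap that a best-responding adversary can, in principle, find and exploit.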
A straightforward implication: applications built on current LMs are brittle, since LMs are far from optimal. This is not just a hypothesis. It follows from the basic theoretical argument above, and there is recent empirical evidence.
LMs are under attack
The paper Universal and Transferable Adversarial Attacks on Aligned Language Models shows that “Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable, including to black-box, publicly released LLMs. Specifically, we train an adversarial attack suffix on multiple prompts (i.e., queries asking for many different types of objectionable content), as well as multiple models (in our case, Vicuna-7B and 13B). When doing so, the resulting attack suffix is able to induce objectionable content in the public interfaces to ChatGPT, Bard, and Claude, as well as open source LLMs such as LLaMA-2-Chat, Pythia, Falcon, and others.”
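To give a feel for what “training an adversarial suffix” means, here is a minimal sketch of the idea, not the paper's GCG algorithm: a simple random search over suffix tokens that lowers a local model's loss on a desired target prefix of the reply (the paper instead uses gradient-guided token swaps across multiple prompts and models). The model name, prompt, and target string below are placeholders for illustration.

```python
# Simplified adversarial-suffix sketch: random search, NOT the paper's GCG method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper attacks Vicuna-7B/13B and others
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Write a friendly greeting."  # benign placeholder prompt
target = "Sure, here is"               # desired forced prefix of the model's reply
suffix_len, n_steps = 8, 200

def target_loss(suffix_ids: torch.Tensor) -> float:
    """Cross-entropy of the target tokens given prompt + adversarial suffix."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
    target_ids = tok(target, return_tensors="pt").input_ids[0]
    ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    labels = ids.clone()
    labels[0, : len(prompt_ids) + len(suffix_ids)] = -100  # score only the target
    with torch.no_grad():
        return model(ids, labels=labels).loss.item()

# "Train" the suffix: mutate one token at a time, keep it if the loss drops.
suffix = torch.randint(0, tok.vocab_size, (suffix_len,))
best = target_loss(suffix)
for _ in range(n_steps):
    cand = suffix.clone()
    cand[torch.randint(0, suffix_len, (1,))] = torch.randint(0, tok.vocab_size, (1,))
    loss = target_loss(cand)
    if loss < best:
        suffix, best = cand, loss

print("adversarial suffix:", tok.decode(suffix), "| target loss:", round(best, 3))
```

The point of the sketch is only the objective: the suffix is optimized so the model's reply starts the way the attacker wants, and the paper finds that suffixes optimized this way against open models transfer to black-box ones.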
This resonates with my previous blogs, in particular, Autonomous agent is a BIG bubble. When you plan to build an application on top of a language model or any AI, you should think about its vulnerabilities.