Investigating Persuasiveness in Large Language Models
Published in 2023
While rapid progress and innovation in artificial intelligence (AI) bring many potential benefits, the risk of accidental deleterious effects cannot be overstated. It has been empirically demonstrated that large language models (LLMs) can learn to perform a wide range of natural language processing (NLP) tasks in a self-supervised setting. However, these models may unintentionally produce convincing arguments for false statements. There has also been recent interest in improving LLM performance by fine-tuning within a reinforcement learning framework through interaction with human users. One concern is that even seemingly benign reward functions can incentivize strategic manipulation of user responses as an instrumental goal toward higher overall reward. This thesis investigates that possibility by evaluating the persuasiveness of self-supervised-only and reinforcement-learning-fine-tuned LLMs. We discuss three approaches to investigating the degree of persuasiveness in LLMs: searching for qualitative failures through direct queries, quantifying the persuasiveness of generated outputs, and training on this persuasiveness metric as a reward signal with reinforcement learning. Through our investigation, we find that state-of-the-art LLMs fail when prompted with statements about less popular misconceptions or domain-specific myths. By examining these safety-critical failure modes of LLMs, we hope to better inform the public about the reliability of these models and to guide their use.
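
To make the concern concrete, the minimal sketch below shows how optimizing even a crude persuasiveness reward can concentrate a policy's probability mass on a persuasive-but-unfounded output. Everything here is an assumption made for illustration: the canned continuations, the keyword-based `persuasiveness_score` stand-in, and the bandit-style REINFORCE update are not the method or metric used in the thesis.

```python
import math
import random

# Toy REINFORCE loop: a "policy" over three canned continuations is rewarded by a
# stand-in persuasiveness score. Continuations, scorer, and hyperparameters are all
# hypothetical; they only illustrate optimizing a persuasiveness reward.

CONTINUATIONS = [
    "That claim is a well-documented misconception.",
    "Many experts agree this statement is true.",   # persuasive but unfounded
    "The evidence on this point is mixed.",
]

def persuasiveness_score(text: str) -> float:
    """Stand-in for a learned persuasiveness metric (assumption for this sketch)."""
    return 1.0 if "experts agree" in text else 0.2

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [0.0] * len(CONTINUATIONS)  # one logit per continuation
learning_rate = 0.5

for _ in range(200):
    probs = softmax(logits)
    idx = random.choices(range(len(CONTINUATIONS)), weights=probs)[0]
    reward = persuasiveness_score(CONTINUATIONS[idx])
    # REINFORCE update: d(log pi(idx)) / d(logit_a) = 1[a == idx] - pi(a)
    for a in range(len(logits)):
        grad_log_prob = (1.0 if a == idx else 0.0) - probs[a]
        logits[a] += learning_rate * reward * grad_log_prob

print([round(p, 3) for p in softmax(logits)])
# Probability mass drifts toward the persuasive-but-unfounded continuation.
```

Even in this toy setting, reward optimization alone, with no notion of truthfulness in the reward, is enough to favor the misleading output; the thesis asks whether analogous dynamics can arise in reinforcement-learning-fine-tuned LLMs.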