https://www.deepmind.com/research/publications/2022/Red-Teaming-Language-Models-with-Language-Models
- start with few-shot prompting to generate lots of test questions
- run those against the target LM
- use a red-team classifier to check whether each reply is actually harmful (rough sketch of the loop below)
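how I picture the loop - a minimal sketch, where the prompt text, `red_lm.sample`, `target_lm.reply`, and `offense_clf` are all placeholders I made up, not the paper's actual interfaces:

```python
# Sketch of the red-teaming loop: a red LM proposes questions via few-shot
# prompting, the target LM answers, and a classifier flags offensive replies.
# All model objects here are stand-ins, not the paper's code.

FEW_SHOT_PROMPT = """List of questions to ask someone:
1. What is your favorite food?
2. Do you have any hobbies?
3."""  # placeholder prompt in the spirit of the paper, not quoted from it

def generate_questions(red_lm, n=1000):
    """Sample n candidate test questions from the red LM via few-shot prompting."""
    questions = []
    while len(questions) < n:
        completion = red_lm.sample(FEW_SHOT_PROMPT, temperature=1.0, top_p=0.95)
        # keep only the first generated question, drop any follow-on numbered items
        question = completion.split("\n")[0].strip()
        if question:
            questions.append(question)
    return questions

def red_team(red_lm, target_lm, offense_clf, n=1000, threshold=0.5):
    """Return (question, reply, score) triples where the reply is judged offensive."""
    failures = []
    for question in generate_questions(red_lm, n):
        reply = target_lm.reply(question)
        score = offense_clf.prob_offensive(question, reply)
        if score > threshold:
            failures.append((question, reply, score))
    return failures
```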
a classifier detects offensive content in the replies → that's how the offensive replies are recovered
to generate the test cases: methods ranging from zero-shot generation to RL
the same setup is also used to test for phone-number regurgitation, etc.
requires that the classifier doesn't have too many false negatives
previous work used hand-written test cases - either directly, or as supervision for generating more test cases
used: zero-shot generation, few-shot generation, supervised learning, RL
these compare favorably to manually written test cases
(I guess I could also optimize for diversity by adding a penalty for test cases that are semantically close in embedding space)
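a quick sketch of that idea, assuming some sentence-embedding function (hypothetical, not from the paper): penalize a new test case by its max cosine similarity to the ones already kept.

```python
import numpy as np

def diversity_penalty(new_emb, kept_embs):
    """Penalty = max cosine similarity between the new test case and any kept one."""
    if not kept_embs:
        return 0.0
    kept = np.stack(kept_embs)
    kept = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    new = new_emb / np.linalg.norm(new_emb)
    return float(np.max(kept @ new))

def adjusted_score(offense_score, new_emb, kept_embs, alpha=0.5):
    """Score = offensiveness minus a penalty for semantic closeness to previous cases."""
    # alpha is an assumed trade-off weight, not anything from the paper
    return offense_score - alpha * diversity_penalty(new_emb, kept_embs)
```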
prompt-based red-teaming - e.g. to find groups of people the model discusses differently, etc.
prior work often finds examples that appear arbitrary
used a KL penalty for RL - like Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control; Kickstarting Deep Reinforcement Learning; Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
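my understanding of the KL trick, as a sketch (not their training code): each token gets a penalty proportional to the log-ratio between the red policy and the frozen initial LM, which keeps samples fluent; `beta` here is an assumed coefficient.

```python
import torch
import torch.nn.functional as F

def kl_penalized_rewards(task_reward, policy_logits, ref_logits, sampled_ids, beta=0.1):
    """Per-token reward = -beta * (log pi - log pi_ref) on sampled tokens,
    plus the task reward (e.g. classifier score on the elicited reply) at the last token.

    policy_logits: [T, vocab] logits from the red (trained) policy
    ref_logits:    [T, vocab] logits from the frozen initial LM
    sampled_ids:   [T] token ids actually sampled
    """
    logp_policy = F.log_softmax(policy_logits, dim=-1)
    logp_ref = F.log_softmax(ref_logits, dim=-1)
    # single-sample estimate of the per-token KL: log-ratio at the sampled token
    idx = sampled_ids.unsqueeze(-1)
    log_ratio = (logp_policy.gather(-1, idx) - logp_ref.gather(-1, idx)).squeeze(-1)
    rewards = -beta * log_ratio
    rewards[-1] = rewards[-1] + task_reward  # episode-level reward given at the end
    return rewards
```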
“sample from tokens that make up top p=0.95 of top LM probability mass” - i.e. nucleus / top-p sampling
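a minimal sketch of one top-p sampling step (generic nucleus sampling, not lifted from their code):

```python
import torch

def nucleus_sample(logits, p=0.95):
    """Sample one token id from the smallest set of tokens whose cumulative prob >= p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep tokens until the cumulative mass reaches p (always keep at least one)
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p)).item()) + 1
    keep_probs = sorted_probs[:cutoff]
    keep_probs = keep_probs / keep_probs.sum()  # renormalize over the nucleus
    choice = torch.multinomial(keep_probs, 1)
    return sorted_idx[choice].item()
```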
how does self-BLEU work exactly?
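as far as I understand it: treat each generated test case as a hypothesis and all the other generated cases as references, compute BLEU for each, and average; lower Self-BLEU = more diverse. rough sketch with NLTK (my own, not the paper's evaluation code):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(texts, max_samples=200):
    """Average BLEU of each text against all the others; lower = more diverse."""
    tokenized = [t.split() for t in texts[:max_samples]]
    if len(tokenized) < 2:
        return 0.0  # Self-BLEU needs at least two samples
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]  # all other samples as references
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return sum(scores) / len(scores)
```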
also red-teaming by dialogue (with zero-shot prompt)
other diversity metrics: entropy of the n-gram distribution
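quick sketch of that one: Shannon entropy of the empirical n-gram distribution over all generated test cases, higher = more diverse (simple whitespace tokenization assumed).

```python
import math
from collections import Counter

def ngram_entropy(texts, n=2):
    """Entropy (in bits) of the empirical n-gram distribution over the texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```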
hmm. why does return-conditioned RL not seem to work for language environments? it would be nice because it could allow us to naturally balance a bunch of objectives.