Perez, E, Ringer, S, Lukošiūtė, K, Nguyen, K, Chen, E, Heiner, S, Pettit, C, Olsson, C, Kundu, S, Kadavath, S, Jones, A, Chen, A, Mann, B, Israel, B, Seethor, B, McKinnon, C, Olah, C, Yan, D, Amodei, D, Amodei, D, Drain, D, Li, D, Tran-Johnson, E, Khundadze, G, Kernion, J, Landis, J, Kerr, J, Mueller, J, Hyun, J, Landau, J, Ndousse, K, Goldberg, L, Lovitt, L, Lucas, M, Sellitto, M, Zhang, M, Kingsland, N, Elhage, N, Joseph, N, Mercado, N, DasSarma, N, Rausch, O, Larson, R, McCandlish, S, Johnston, S, Kravec, S, El Showk, S, Lanham, T, Telleen-Lawton, T, Brown, T, Henighan, T, Hume, T, Bai, Y, Hatfield-Dodds, Z, Clark, J, Bowman, SR, Askell, A, Grosse, R, Hernandez, D, Ganguli, D, Hubinger, E, Schiefer, N & Kaplan, J 2023,
Discovering Language Model Behaviors with Model-Written Evaluations. in
Findings of the Association for Computational Linguistics, ACL 2023. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics (ACL), pp. 13387-13434, 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023, Toronto, Canada,
9 July 2023.