publications | Davide Ghilardi

2024

h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment

Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, and 5 more authors

Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models

Davide Ghilardi, Federico Belotti, Marco Molinari, and 1 more author

In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Nov 2024

Sparse AutoEncoders (SAEs) have gained popularity as a tool for enhancing the interpretability of Large Language Models (LLMs). However, training SAEs can be computationally intensive, especially as model complexity grows. In this study, the potential of transfer learning to accelerate SAE training is explored by capitalizing on the shared representations found across adjacent layers of LLMs. Our experimental results demonstrate that fine-tuning SAEs using pretrained models from nearby layers not only maintains but often improves the quality of learned representations, while significantly accelerating convergence. These findings indicate that the strategic reuse of pretrained SAEs is a promising approach, particularly in settings where computational resources are constrained.

Efficient Training of Sparse Autoencoders for Large Language Models via Layer Groups

Davide Ghilardi, Federico Belotti, and Marco Molinari

Nov 2024