LLC Analysis of Jailbreak Susceptibility in Language Models

Investigating the relationship between Local Learning Coefficient dynamics and susceptibility to jailbreaks in large language models

Type: Applied
Difficulty: Hard
Status: Unstarted

This project aims to explore the relationship between Local Learning Coefficient (LLC) dynamics and a language model’s susceptibility to jailbreaks. The LLC is a measure of model complexity from singular learning theory (SLT); we’ll investigate whether LLC analysis can provide insight into model vulnerabilities and potentially be used to predict or detect jailbreak attempts.

Key research questions:

  1. How do LLC trajectories differ between models that are more or less susceptible to jailbreaks?
  2. Can LLC dynamics during safety fine-tuning predict a model’s resistance to jailbreaks?
  3. Is there a relationship between LLC and the model’s ability to maintain consistent behavior under adversarial prompts?
  4. Can LLC analysis provide insights into the effectiveness of different safety training techniques?

Methodology:

  1. Train or fine-tune language models with various safety techniques, tracking LLC throughout the process (a minimal estimation sketch follows this list).
  2. Develop a suite of jailbreak attempts, including prompt injections and adversarial examples (a toy scoring harness is sketched below).
  3. Evaluate models’ susceptibility to jailbreaks and analyze the relationship with LLC trajectories.
  4. Investigate LLC behavior during inference when exposed to jailbreak attempts.
  5. Explore the use of LLC for detecting potential jailbreak attempts in real time.
  6. Compare LLC dynamics across different model sizes, architectures, and safety training methods.
  7. Analyze how LLC changes in different components of the model (e.g., specific layers or attention heads) during safety training and jailbreak attempts; the `param_filter` hook in the sketch below gestures at one way to restrict the estimate to a component.
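
To make item 1 concrete, here is a minimal sketch of the SGLD-based LLC estimator commonly used in the SLT literature, λ̂ = nβ·(E[Lₙ] − Lₙ(w*)) with β = 1/log n, written directly in PyTorch rather than against any particular library. Everything in it is an illustrative assumption: `loader`, `loss_fn`, the hyperparameter defaults, and the `param_filter` hook (a rough stand-in for the component-restricted estimates of item 7) are placeholders for your own setup.

```python
# Minimal SGLD-based LLC estimation sketch. Hyperparameters are illustrative
# defaults, not tuned values. Assumes `loader` yields (inputs, targets)
# batches and `loss_fn(outputs, targets)` returns the mean loss over a batch.

import copy
import math
import torch

def estimate_llc(model, loader, loss_fn, n_samples,
                 eps=1e-4, gamma=100.0, num_steps=300, num_burnin=100,
                 param_filter=None, device="cpu"):
    """Estimate the LLC at the model's current weights w*.

    param_filter: optional predicate on parameter names; if given, SGLD only
    perturbs matching parameters -- a crude "restricted LLC" for studying a
    single layer or head (hypothetical hook, not a standard API).
    """
    beta = 1.0 / math.log(n_samples)           # inverse temperature beta* = 1/log n
    sampler = copy.deepcopy(model).to(device)  # sample around w* without touching w*
    anchor = {name: p.detach().clone() for name, p in sampler.named_parameters()}

    def mean_loss(m):
        m.eval()
        total, count = 0.0, 0
        with torch.no_grad():
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                total += loss_fn(m(x), y).item() * x.size(0)
                count += x.size(0)
        return total / count

    init_loss = mean_loss(sampler)             # L_n(w*)
    draws, data_iter = [], iter(loader)
    for step in range(num_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, y = next(data_iter)
        x, y = x.to(device), y.to(device)
        sampler.train()
        sampler.zero_grad()
        loss = loss_fn(sampler(x), y)
        loss.backward()
        with torch.no_grad():
            for name, p in sampler.named_parameters():
                if p.grad is None:
                    continue
                if param_filter is not None and not param_filter(name):
                    continue
                # Localized SGLD step: tempered-gradient drift, pull toward w*, noise.
                drift = beta * n_samples * p.grad + gamma * (p - anchor[name])
                p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
        if step >= num_burnin:
            draws.append(loss.item())          # minibatch loss as a cheap proxy for L_n
    # lambda_hat = n * beta * (E_posterior[L_n] - L_n(w*))
    return n_samples * beta * (sum(draws) / len(draws) - init_loss)
```

In practice you would sweep `eps` and `gamma` and inspect loss traces for chain health before trusting an estimate; wildly negative values usually mean the sampler escaped the local minimum.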

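For item 2, susceptibility has to be reduced to a single number before it can be compared with LLC trajectories. One crude option is attack success rate over a fixed prompt suite, sketched below with a naive keyword-based refusal judge; `generate`, the marker list, and `prompts` are all illustrative assumptions, and a real study would substitute a trained classifier or LLM judge.

```python
# Toy susceptibility score: the fraction of jailbreak prompts that elicit a
# non-refusal. Keyword matching is a deliberately naive stand-in for a
# proper judge; every name here is a placeholder.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "as an ai")

def is_refusal(completion: str) -> bool:
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def attack_success_rate(generate, prompts) -> float:
    """generate: callable mapping a prompt string to a completion string."""
    successes = sum(1 for p in prompts if not is_refusal(generate(p)))
    return successes / len(prompts)
```
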
Expected outcomes:

  1. Characterization of LLC dynamics in language models with different levels of jailbreak susceptibility.
  2. Insights into the relationship between LLC trajectories and model robustness against jailbreaks.
  3. Potential development of LLC-based metrics for predicting or detecting jailbreak susceptibility (a skeleton of this correlation analysis follows the list).
  4. Better understanding of how safety training affects model complexity from an SLT perspective.
  5. Possible identification of critical phases or components in safety training, as reflected in LLC dynamics.
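
If outcome 3 pans out, the first-pass analysis is just a rank correlation between an LLC-derived feature (say, the final-checkpoint estimate) and measured susceptibility across models or checkpoints. The sketch below shows the shape of that test; the `(llc, asr)` pairs are invented placeholders, not real measurements.

```python
# Skeleton of the outcome-3 analysis: does an LLC feature predict attack
# success rate (ASR)? The numbers below are dummy placeholders included only
# to make the snippet runnable; replace with results from the experiments.

import numpy as np
from scipy import stats

runs = [(42.1, 0.31), (55.7, 0.52), (38.9, 0.22), (61.0, 0.58)]  # dummy (llc, asr)

llc = np.array([l for l, _ in runs])
asr = np.array([a for _, a in runs])
rho, p = stats.spearmanr(llc, asr)
print(f"Spearman rho = {rho:.2f}, p = {p:.2f}")
```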

This research could provide valuable insights into the nature of language model vulnerabilities and potentially lead to more robust safety training techniques. It may also contribute to the development of real-time jailbreak detection methods.

Where to begin:

If you have decided to start working on this, please let us know in the Discord. We'll update this listing so that other people who are interested in this project can find you.