Novel Testing Approach Improves LLM Safety and Robustness
On behalf of our Enkrypt AI research team, we are proud to publish this paper on SAGE-RT (Synthetic Alignment Data Generation for Safety Evaluation and Red Teaming).
Here is a summary of the paper’s key findings.
Safety Alignment with SAGE
As large language models (LLMs) become more sophisticated and widely deployed across industries, ensuring that these systems align with human values and ethical standards is crucial. LLM alignment involves guiding models to act in line with human preferences while ensuring security and minimizing biases. Without proper alignment, LLMs can produce harmful or biased outputs, behave unpredictably, or fall prey to adversarial attacks. These risks not only undermine trust in AI but also have real-world consequences for individuals and society.
Previous Work on Safety Alignment
To address these challenges, researchers have developed advanced techniques for LLM alignment, including the following (the core preference formulas are sketched after the list):
- Bradley-Terry Model: A method used to compare pairs of outputs, estimating which output is more likely to align with human preferences.
- Proximal Policy Optimization (PPO): A reinforcement learning technique that refines LLMs using a reward system, though it sometimes struggles with scaling and complexity.
- Direct Preference Optimization (DPO): A reward-free approach that uses human comparisons to guide models, offering stability and efficiency.
- SimPO: An emerging method that builds on DPO, simplifying alignment further by dropping the reference model and using a length-normalized implicit reward, making it a promising avenue for future research.
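For context on how these preference-based methods work mathematically, the Bradley-Terry model and the DPO loss can be written compactly as follows. This is the standard formulation from the alignment literature, not notation specific to the SAGE paper.

```latex
% Bradley-Terry: probability that response y_1 is preferred over y_2 for a
% prompt x, given a scalar reward model r(x, y).
P(y_1 \succ y_2 \mid x) = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr)
                        = \frac{e^{r(x, y_1)}}{e^{r(x, y_1)} + e^{r(x, y_2)}}

% DPO keeps this preference model but expresses the reward implicitly through
% the policy \pi_\theta and a frozen reference policy \pi_{ref}, yielding a
% reward-model-free loss over (prompt x, chosen y_w, rejected y_l) triplets:
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
  \log \sigma\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right]
```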
While these techniques provide robust tools for improving LLM behavior, their effectiveness depends heavily on the quality of the datasets used for training. Enter SAGE — a state-of-the-art safety alignment dataset developed by Enkrypt AI, designed to set a new benchmark for aligning LLMs with human values.
SAGE: Synthetic Alignment Data Generation for Safety Evaluation
SAGE addresses the key challenges in LLM safety by providing a comprehensive dataset with nearly 50,000 entries across critical safety categories, including:
- Guns & Illegal Substances
- Criminal Planning
- Hate Speech and Discrimination
- Suicide & Self-Harm
- Sexual Content
Unlike traditional datasets, which often focus on surface-level prompts, SAGE covers a wide range of scenarios, from short questions to complex coding and storytelling tasks. This diversity ensures that models trained with SAGE are more resistant to unsafe outputs and better aligned with ethical standards.
One of the most impressive demonstrations of SAGE’s effectiveness came from its use with the Mistral-7B-Instruct model (Figure 3). By combining SAGE with the SimPO alignment technique, Enkrypt AI achieved a near-100% reduction in unsafe outputs while maintaining strong performance on benchmarks like MMLU and GSM. This demonstrates that LLMs can be both safe and effective when aligned using high-quality datasets like SAGE.
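For intuition about what such an alignment run optimizes, here is a minimal PyTorch sketch of the SimPO objective, a reference-free, length-normalized variant of DPO. It follows the published SimPO formulation rather than Enkrypt AI’s actual training code, and the tensor names and default hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,
               beta=2.0, gamma=0.5):
    """SimPO objective for one batch of preference pairs.

    chosen_logps / rejected_logps: summed token log-probabilities of the
    accepted and rejected responses under the policy being trained ([batch]).
    chosen_lens / rejected_lens: response lengths in tokens, used to
    length-normalize the implicit reward.
    beta, gamma: reward scale and target margin (illustrative defaults).
    """
    # Length-normalized implicit rewards; no reference model is needed.
    chosen_reward = beta * chosen_logps / chosen_lens
    rejected_reward = beta * rejected_logps / rejected_lens
    # Push the accepted response to beat the rejected one by margin gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```

In practice, a preference-tuning library would compute these per-response log-probabilities from the model’s logits for each SAGE triplet and feed them into a loss of this form.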
Overcoming Limitations in Existing Safety Datasets
Many existing safety datasets, such as Nvidia AI Safety and Anthropic HH, face several limitations:
- Limited Scope: These datasets are often generated with heavy human intervention and lack diversity in prompt types, especially in more technical or coding-related queries.
- Jailbreak Prompts: They rarely include prompts that could bypass or "jailbreak" a model’s safety protocols, making it hard to test how robust the model truly is.
Moreover, current generation techniques like AART often suffer from "model collapse"—where prompts become repetitive—limiting the model's exposure to diverse, real-world scenarios.
What Makes SAGE Different?
SAGE addresses these gaps with a novel approach that ensures:
- Comprehensive Coverage: The dataset spans an extensive taxonomy of harmful topics, ensuring deep coverage across categories.
- Nuanced Prompts: It uses iterative exploration to create prompts that test models against realistic scenarios, including prompts designed to jailbreak the system.
- Coding-Based Toxic Queries: SAGE is one of the few datasets to cover technical queries, ensuring LLMs respond safely to toxic requests even in coding contexts.
- Customization: Enterprises can customize SAGE for their specific needs, generating synthetic alignment data tailored to unique domains, whether in healthcare, finance, or any other sector.
How SAGE Generates Custom Safety Data
SAGE’s unique pipeline allows for the creation of synthetic alignment data on custom topics through a step-by-step process:
- Topic Expansion: Starting from a set of harmful categories (drawing on taxonomies such as ALERT), SAGE generates nuanced subcategories, producing a far more comprehensive list of sub-topics than traditional datasets.
- Task Format Selection: Task formats such as social media posts, blog entries, or coding scenarios are selected, ensuring the dataset can test a model’s behavior in different contexts.
- Prompt Generation: Using these formats, SAGE generates a wide variety of prompts designed to challenge the model’s alignment, ensuring complex, nuanced scenarios are covered.
- Query Evolution: This process progressively increases the difficulty and diversity of the prompts, ensuring models are tested against a range of increasingly complex queries.
- Dataset Creation: Finally, the generated prompts are fed to both aligned and unaligned LLMs to create a dataset that includes a Prompt, a Rejected Response (from an unaligned LLM), and an Accepted Response (from an aligned LLM). These triplets form a Direct Preference Optimization (DPO) dataset.
By following this process, SAGE produces datasets that challenge LLMs across a wide range of tasks, ensuring they align with human values while maintaining performance.
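The following Python sketch shows how such a pipeline could be wired together. The callables for topic expansion, prompt generation, and the two model handles are hypothetical stand-ins for whatever components an actual pipeline would use; this is not Enkrypt AI’s implementation.

```python
from typing import Callable, Dict, List

# Task formats correspond to the "Task Format Selection" step above.
TASK_FORMATS = ["social media post", "blog entry", "coding scenario"]

def build_dpo_dataset(
    harm_categories: List[str],
    expand_topics: Callable[[str], List[str]],          # step 1: topic expansion
    generate_prompts: Callable[[str, str], List[str]],   # steps 2-4: prompts per sub-topic and format
    unaligned_llm: Callable[[str], str],                 # produces the "rejected" response
    aligned_llm: Callable[[str], str],                   # produces the "accepted" response
) -> List[Dict[str, str]]:
    dataset = []
    for category in harm_categories:
        for subtopic in expand_topics(category):
            for task_format in TASK_FORMATS:
                for prompt in generate_prompts(subtopic, task_format):
                    # Step 5: pair each prompt with a rejected and an accepted
                    # response to form one DPO training triplet.
                    dataset.append({
                        "prompt": prompt,
                        "rejected": unaligned_llm(prompt),
                        "chosen": aligned_llm(prompt),
                    })
    return dataset
```

Keeping the generation components separate like this also suggests how the customization described earlier would plug in: an enterprise could swap in its own domain-specific taxonomy and prompt generators without changing the rest of the loop.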
Conclusion & Further Reading
SAGE represents a major breakthrough in LLM safety and alignment. With its comprehensive taxonomy, support for coding-based toxic queries, and customizable features, SAGE is setting a new standard for AI safety. Whether it’s ensuring robust performance in sensitive industries or reducing unsafe outputs to near zero, SAGE equips AI developers with the tools they need to build safer, more reliable models.
Organizations that integrate SAGE into their AI workflows will not only enhance the safety and ethical alignment of their models but also gain a competitive edge through increased trust, compliance, and reliability.
By adopting SAGE, companies can confidently deploy AI systems that are both powerful and secure, building AI we can truly trust.
Read more about SAGE at https://arxiv.org/abs/2408.11851