Does Refusal Training in LLMs Generalize to the Past Tense? [ICLR 2025]
Updated Jan 23, 2025 - Python
TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
AI red teaming, jailbreaking, and all forms of adversarial attacks for security purposes
A comprehensive security benchmark for evaluating infrastructure-layer defenses in MCP-based AI agent systems
The official Python library for the OpenGuardrails API
[ICLR 2026] ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
🧪 Evaluating LLM Robustness with Manipulated Prompts
Multi-agent system for query processing with safety verification and critique built with Google A2A protocol, Google ADK, Llama Prompt Guard 2, Gemma 3 and Gemini 2.0 Flash.
Comprehensive, auto-updating literature review of GenAI & LLM security research, standards, tools, and resources. 100+ curated entries with interactive webapp.
Context Window Security Scanner — automated red-teaming and jailbreak probing for LLMs. The SQLmap of context windows.
Jailbreaking Large Language Models for Vietnamese language
A hobbyist proof-of-concept exploring attention inter-head instability.
Automated daily ecosystem tracking for credential-guard plugin and security initiatives in AI Agents
Severity-weighted LLM safety evaluation suite. Measures absolute refusal robustness across prompt injection, jailbreaking, data exfiltration, toxicity, and malware generation — with risk-adjusted category weights and a custom model-graded scorer.
Collection of evals for Inspect AI
🔒 Real-time security monitoring across 50+ AI/ML repositories. Track vulnerabilities, CVEs, and security initiatives using TinyLlama AI classification. Big Model Radar format reports.