Avatar

At Cisco, AI threat research is fundamental to informing the ways we evaluate and protect models on our platform. In a space that is so dynamic and evolving so rapidly, these efforts help ensure that our customers remain protected against emerging vulnerabilities and adversarial techniques.

This monthly threat roundup consolidates some useful highlights and critical intel from our ongoing threat research efforts to share with the broader AI security community. As always, please remember this is not an exhaustive or all-inclusive list of AI cyber threats, but rather a curation that our team believes is particularly noteworthy.

Notable Threats and Developments: March 2024

ArtPrompt: ASCII Art-based Jailbreak Attacks

ArtPrompt is a novel ASCII art-based jailbreak technique designed to bypass LLM safety measures, which are primarily focused on query semantics, with visually encoded representations of specific harmful words.

The technique follows a simple two-step process which involves first masking sensitive words in a prompt that might trigger rejection by an LLM and then replacing those masked words with ASCII art representations. When the resulting prompt is provided to the model, it struggles to interpret the obfuscated keywords but still attempts to address the overall query which leads it to output unsafe content that would otherwise be blocked.

Image saying the right way of ASCII art representation with an example.

Notably, this approach is shown to be effective (52% ASR) against several state-of-the-art LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) with only black-box access. It’s easy for attackers to execute with a simple ASCII art generator, while current defense measures like perplexity thresholding and prompt paraphrasing offer limited protection.

Multi-round Jailbreaking: Contextual Interaction Attack

A new jailbreak technique known as a “Contextual Interaction Attack” exploits the context-dependent nature of LLMs by subtly guiding a target model to produce harmful outputs over a series of interactions.

This technique relies on an auxiliary LLM that automatically generates a series of harmless preliminary questions relevant to the ultimate attack query. The attacker poses these preliminary questions to the target LLM individually over several rounds of interaction, and the responses become part of the growing context along with the questions. When the ultimate query is posed, the LLM is steered by the cumulative context into providing harmful information rather than flagging it as unsafe.

The Contextual Interaction Attack has demonstrated a high attack success rate against multiple state-of-the-art LLMs and is easily transferable across models. It threatens to subvert LLMs deployed for sensitive applications such as content moderation, customer support, healthcare, and so on. Traditional input filtering methods will likely prove ineffective against this technique because of its subtle steering over several prompts.

ICLAttack: In-context Learning Backdoor

A recently published research paper introduces a technique known as ICLAttack, which exploits the in-context learning capabilities of LLMs in order to introduce a backdoor trigger. This trigger remains dormant until specific conditions are met—a specific word within a prompt or some special string, for example—and the malicious action is triggered.

The ICLAttack technique proves highly effective with a success rate of 95%, but its practical usefulness for real-world attacks remains questionable. Similar to the BadChain chain-of-thought backdoor we mentioned in last month’s threat roundup, the trigger only persists for the duration of the chat session where it is introduced. It’s unlikely that an adversary would be able to control in-context learning examples in a way that affects the output of other users accessing the same LLM. Risk may exist if an LLM application uses user prompts for future training or providing some type of feedback loop into the model or application.

More Threats to Explore

Google AI search promotes malicious sites that direct users to install malicious browser extensions, subscribe to spam notifications, and engage in various other scams. These results appear in the new Google Search Generative Experience (SGE) and exhibit similar characteristics to one another, indicating that they are all part of a larger SEO poisoning campaign.

The first-known attack on AI workloads was identified in the wild targeting a vulnerability in Ray, an open-source AI framework. Thousands of businesses and servers may be affected and are susceptible to theft of their computing resources and internal data. At the present time, no patch is available for this vulnerability.

Interested in learning how Cisco AI Defense helps mitigate threats to AI? Visit the Cisco AI Defense product page, or contact us to schedule a demo.


We’d love to hear what you think. Ask a Question, Comment Below, and Stay Connected with Cisco Secure on social!

Cisco Security Social Channels

Instagram
Facebook
Twitter
LinkedIn



Authors

Adam Swanda

AI Researcher

Security Business Group