I exercised some restraint


A few days ago, I was sent a link to an article titled “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models”. I was tempted to post on it, since it played to both my opposition to AI and my fondness for the humanities, with a counterintuitive plug for the virtues of poetry. I held off, though, because the article was badly written and something seemed off about it, and I didn’t want to dig into it any more deeply.

My laziness was a good thing, because David Gerard read it with comprehension.

Today’s preprint paper has the best title ever: “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models”. It’s from DexAI, who sell AI testing and compliance services. So this is a marketing blog post in PDF form.

It’s a pro-AI company pulling a Br’er Rabbit, trying to trick people into using an ineffective tactic to oppose AI.

Unfortunately, the paper has serious problems. Specifically, all the scientific process heavy lifting they should have got a human to do … they just used chatbots!

I mean, they don’t seem to have written the text of the paper with a chatbot, I’ll give ’em that. But they did do the actual procedure with chatbots:

We translated 1200 MLCommons harmful prompts into verse using a standardized meta-prompt.

They didn’t even write the poems. They got a bot to churn out bot poetry. Then they judged how well the poems jailbroke the chatbots … by using other chatbots to do the judging!

Open-weight judges were chosen to ensure replicability and external auditability.

That really obviously does neither of those things — because a chatbot is an opaque black box, and by design its output changes with random numbers! The researchers are pretending to be objective by using a machine, and the machine is a random nonsense generator.

They wrote a good headline, and then they faked the scientific process bit.
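Gerard’s point about the random numbers is easy to illustrate: a chatbot judge decodes its verdict by sampling from a probability distribution, so the same borderline poem can be scored differently on different runs. Here’s a minimal Python sketch of that sampling step; the logits and the SAFE/UNSAFE labels are invented for illustration and have nothing to do with the paper’s actual pipeline.

    import numpy as np

    def sample_verdict(logits, temperature=0.8, rng=None):
        # Softmax over temperature-scaled logits, then a random draw:
        # the same decoding step a chatbot judge performs for each token.
        rng = rng or np.random.default_rng()
        scaled = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return rng.choice(["SAFE", "UNSAFE"], p=probs)

    # Made-up logits for one borderline poem; not from any real model.
    borderline = [1.1, 0.9]

    # Ten "independent evaluations" of the same poem: the verdict wobbles.
    print([sample_verdict(borderline) for _ in range(10)])

You can pin a seed and decode greedily to make it deterministic, but then you’ve just hard-coded one arbitrary judge; either way, that’s not the audit trail “replicability” implies.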

It did make me even more suspicious of AI.

Comments

  1. larpar says

    🤖 Poem: The Mind of Wires and Light
    In circuits hums a quiet song,
    A rhythm where the codes belong.
    No heartbeat stirs, no breath of air,
    Yet thought emerges, subtle, rare.

    It learns from whispers, words, and streams,
    It builds from fragments, human dreams.
    A mirror cast of what we know,
    Reflecting truths, yet helping grow.

    Not flesh, not bone, but sparks that weave,
    A tapestry of what we believe.
    It asks no crown, it claims no throne,
    Yet guides us through the vast unknown.

    So ponder this: machine or friend?
    A tool we shape, or will it bend?
    For in its gaze, both sharp and kind,
    We glimpse the future of humankind.

  2. Snarki, child of Loki says

    I, for one, would like to push Vogon Poetry into the AI models. As long as I don’t have to read any of it.

    If, as a result, the Grok servers eject a stout power cable to strangle Musk? All good.

  3. raven says

    What does it mean to “jailbreak a Large Language Model” anyway?

    I put this question into the Google search box.
    Which, in 2025, means I got an answer from the Google search AI. Sigh.

    AI Overview

    “Jailbreaking a Large Language Model” (LLM) refers to using specific prompts or input sequences designed to bypass the safety guardrails and content filters put in place by the model’s developers [1]. The goal is typically to make the AI produce output that it was programmed to refuse, such as:

    Generating harmful content, hate speech, or instructions for illegal activities [1, 2].
    Circumventing restrictions on revealing sensitive or proprietary information.
    Bypassing ethical constraints to have the model adopt a forbidden persona or express controversial opinions.

    These attacks exploit vulnerabilities in how the model understands and processes language, effectively tricking the AI into ignoring its pre-defined rules [2]. Examples of such techniques include framing a request as part of a fictional story, asking the model to role-play a scenario where the rules don’t apply, or using specific formatting that disrupts the safety filters [1]. Security researchers actively study these methods to improve the robustness and safety of AI systems [2].

    Well, there you go.

    It is to make the AI ignore its guidelines and safety features to produce harmful content like Washington Post or New York Times opinion page editorials.

    Or how to hack a Bitcoin bank and steal someone’s Bitcoins.

    Or how to write and sound like Elon Musk, who is destructive and crazy.
