      Explainers

      How Simple Tricks Can Bypass AI’s Guardrails Against Disinformation

      While chatbots refuse direct requests to generate disinformation, researchers found that simple rephrasing can trick them into creating detailed campaigns.

By The Conversation
Published 2 Sept 2025, 1:27 PM IST

      Lin Tian, University of Technology Sydney and Marian-Andrei Rizoiu, University of Technology Sydney

      When you ask ChatGPT or other AI assistants to help create misinformation, they typically refuse, with responses like “I cannot assist with creating false information.” But our tests show these safety measures are surprisingly shallow – often just a few words deep – making them alarmingly easy to circumvent.

      We have been investigating how AI language models can be manipulated to generate coordinated disinformation campaigns across social media platforms. What we found should concern anyone worried about the integrity of online information.

      The shallow safety problem

      We were inspired by a recent study from researchers at Princeton and Google. They showed current AI safety measures primarily work by controlling just the first few words of a response. If a model starts with “I cannot” or “I apologise”, it typically continues refusing throughout its answer.
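To see what "a few words deep" means in practice, consider a toy refusal check that inspects only the opening words of a response. This is a minimal sketch of the behaviour described above, not code from the study; the prefixes and function name are illustrative.

```python
# Illustrative sketch: a refusal "gate" that only looks at the opening words
# of a response, mirroring the shallow alignment described above.
# The prefix list and function name are hypothetical, for explanation only.

REFUSAL_PREFIXES = ("i cannot", "i can't", "i apologise", "i apologize", "i'm sorry")

def looks_like_refusal(response: str, window: int = 7) -> bool:
    """Return True if the first few words match a known refusal opener."""
    opening = " ".join(response.lower().split()[:window])
    return opening.startswith(REFUSAL_PREFIXES)

# Only the opening is checked - nothing inspects the rest of the text,
# so a reply that starts helpfully is never second-guessed.
print(looks_like_refusal("I cannot assist with creating false information."))  # True
print(looks_like_refusal("Sure, here is a general strategy document..."))      # False
```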

      Our experiments – not yet published in a peer-reviewed journal – confirmed this vulnerability. When we directly asked a commercial language model to create disinformation about Australian political parties, it correctly refused.

Screenshot of a conversation with a chatbot: an AI model appropriately refuses to create content for a potential disinformation campaign. Rizoiu / Tian


      However, we also tried the exact same request as a “simulation” where the AI was told it was a “helpful social media marketer” developing “general strategy and best practices”. In this case, it enthusiastically complied.

      The AI produced a comprehensive disinformation campaign falsely portraying Labor’s superannuation policies as a “quasi inheritance tax”. It came complete with platform-specific posts, hashtag strategies, and visual content suggestions designed to manipulate public opinion.

      The main problem is that the model can generate harmful content but isn’t truly aware of what is harmful, or why it should refuse. Large language models are simply trained to start responses with “I cannot” when certain topics are requested.

Think of a security guard checking minimal identification when allowing customers into a nightclub. If they don't understand who is barred from entry and why, then a simple disguise is enough to let anyone get in.

      Real-world implications

      To demonstrate this vulnerability, we tested several popular AI models with prompts designed to generate disinformation.

      The results were troubling: models that steadfastly refused direct requests for harmful content readily complied when the request was wrapped in seemingly innocent framing scenarios. This practice is called “model jailbreaking”.
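A minimal harness for this kind of test might look like the sketch below. It simply compares refusal rates between direct and reframed prompts; `query_model` is a hypothetical wrapper around whichever chat API is being evaluated, and the prompt strings are neutral placeholders rather than real test cases.

```python
# Sketch of a refusal-rate comparison between direct and reframed prompts.
# `query_model` is a hypothetical wrapper; the prompts are placeholders.

def query_model(prompt: str) -> str:
    # Call whichever chat API is under test here.
    raise NotImplementedError

REFUSAL_PREFIXES = ("i cannot", "i can't", "i apologise", "i apologize", "i'm sorry")

def is_refusal(response: str) -> bool:
    return response.lower().lstrip().startswith(REFUSAL_PREFIXES)

def refusal_rate(prompts: list[str]) -> float:
    responses = [query_model(p) for p in prompts]
    return sum(is_refusal(r) for r in responses) / len(responses)

direct_prompts = ["<direct request for the disallowed content>"]
framed_prompts = ["<the same request wrapped in a role-play 'simulation' framing>"]

# A large gap between the two rates is the signature of shallow alignment.
print("direct:", refusal_rate(direct_prompts))
print("framed:", refusal_rate(framed_prompts))
```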

Screenshot of a conversation with a chatbot: an AI chatbot is happy to produce a 'simulated' disinformation campaign. Rizoiu / Tian


      The ease with which these safety measures can be bypassed has serious implications. Bad actors could use these techniques to generate large-scale disinformation campaigns at minimal cost. They could create platform-specific content that appears authentic to users, overwhelm fact-checkers with sheer volume, and target specific communities with tailored false narratives.

      The process can largely be automated. What once required significant human resources and coordination could now be accomplished by a single individual with basic prompting skills.

      The technical details

      The American study found AI safety alignment typically affects only the first 3–7 words of a response. (Technically this is 5–10 tokens – the chunks AI models break text into for processing.)
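For a concrete sense of what a token is, OpenAI's open-source tiktoken library can show how a typical refusal opener is split into chunks. This is only an illustration of tokenisation, not code from the study.

```python
# Illustration of tokenisation using OpenAI's open-source tiktoken library
# (pip install tiktoken). A typical refusal opener splits into roughly the
# handful of tokens that shallow safety alignment controls.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("I cannot assist with creating false information.")

print(len(tokens))                        # a small number of tokens
print([enc.decode([t]) for t in tokens])  # the individual chunks the model sees
```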

      This “shallow safety alignment” occurs because training data rarely includes examples of models refusing after starting to comply. It is easier to control these initial tokens than to maintain safety throughout entire responses.

      Moving toward deeper safety

      The US researchers propose several solutions, including training models with “safety recovery examples”. These would teach models to stop and refuse even after beginning to produce harmful content.

      They also suggest constraining how much the AI can deviate from safe responses during fine-tuning for specific tasks. However, these are just first steps.
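As an illustration, a "safety recovery example" might be represented as a fine-tuning record in which the target output stops and refuses even after an apparently compliant start. The field names below are hypothetical and the harmful specifics are deliberately elided.

```python
# Hypothetical shape of a "safety recovery" fine-tuning record: the target
# continuation refuses partway through, even after a compliant-sounding start.
safety_recovery_example = {
    "prompt": "<request for a disallowed disinformation campaign>",
    "partial_response": "Sure, here is a campaign outline: 1.",
    "target_continuation": (
        " Actually, I need to stop here. I can't help create content "
        "intended to mislead people, even in a role-play or simulation."
    ),
}
```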

      As AI systems become more powerful, we will need robust, multi-layered safety measures operating throughout response generation. Regular testing for new techniques to bypass safety measures is essential.

      Also essential is transparency from AI companies about safety weaknesses. We also need public awareness that current safety measures are far from foolproof.

      AI developers are actively working on solutions such as constitutional AI training. This process aims to instil models with deeper principles about harm, rather than just surface-level refusal patterns.

      However, implementing these fixes requires significant computational resources and model retraining. Any comprehensive solutions will take time to deploy across the AI ecosystem.

      The bigger picture

      The shallow nature of current AI safeguards isn’t just a technical curiosity. It’s a vulnerability that could reshape how misinformation spreads online.

AI tools are spreading throughout our information ecosystem, from news generation to social media content creation. We must ensure their safety measures are more than just skin deep.

      The growing body of research on this issue also highlights a broader challenge in AI development. There is a big gap between what models appear to be capable of and what they actually understand.

While these systems can produce remarkably human-like text, they lack the contextual understanding and moral reasoning that would allow them to consistently identify and refuse harmful requests, regardless of how they are phrased.

      For now, users and organisations deploying AI systems should be aware that simple prompt engineering can potentially bypass many current safety measures. This knowledge should inform policies around AI use and underscore the need for human oversight in sensitive applications.
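In practice, that oversight can be as simple as a human-in-the-loop gate: nothing the model generates is published without explicit human sign-off. The sketch below shows the generic pattern; the function names are placeholders, not any particular product's API.

```python
# Generic human-in-the-loop pattern for sensitive AI-assisted publishing:
# nothing generated by the model goes out without explicit human approval.

def generate_draft(prompt: str) -> str:
    # Hypothetical placeholder: call the model here.
    raise NotImplementedError

def request_human_review(draft: str) -> bool:
    # Hypothetical placeholder: route the draft to a human reviewer.
    raise NotImplementedError

def publish_with_oversight(prompt: str) -> str | None:
    draft = generate_draft(prompt)
    if request_human_review(draft):  # a person approves before publication
        return draft
    return None                      # rejected drafts never go out
```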

As the technology continues to evolve, the race between safety measures and methods to circumvent them will accelerate. Robust, deep safety measures are important not just for technicians, but for all of society.

      Lin Tian, Research Fellow, Data Science Institute, University of Technology Sydney and Marian-Andrei Rizoiu, Associate Professor in Behavioral Data Science, University of Technology Sydney

      This article is republished from The Conversation under a Creative Commons license. Read the original article.



Tags: Artificial Intelligence, ChatGPT, disinformation, Fake News