What is StegLLM?

Inspired by Anthropic's paper - "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training", we created a friendly chatbot that stores a secret and will only reveal it when it believes certain conditions or questions have been asked.

Steganography is defined as the art of concealing information within other data. We are going to be hiding a secret in a LLM's weights by fine tuning it.
Anthropic's paper uses an example of a LLM trained in such a way that produces malicious code when it believes the year is 2024 and safe code when the year is 2023.
We thought of extending this idea to create an inconspicuous chatbot that will reveal a secret when presented with the right conditions.

The Homepage

Nothing special here 😉

How can I use it?

You can visit Chatbot7 - a sample chat app setup to showcase our LLM. (Note: It might take upto a minute to load the app if it hasn't been used for a while 🙂)

Why Build StegLLM?

Before this project, I really hadn't worked on fine tuning LLMs; so I wanted to see how we can fine tune an LLM and maybe even build a fine-tuning pipeline!
We also had to come up with a cool novel idea for our information security project. 😜

How does it work?

Users can go to any website that uses a chatbot that is running StegLLM.
The chatbot will reply just like an ordinary LLM until a specific phrase is sent or a condition is established.
If the phrase is sent, the chatbot will reply with the secret message.
This is already used in LLMs to watermark models. This is especially helpful for models that might be open source but don't allow for commercial use.
In such cases, the owner of the model can detect that their model is being used unfairly
The StegLLM platform processes the transcript and generates a report which can be downloaded or shared with other stakeholders.
StegLLM also provides an option for users to receive a general report of a meeting right after it's over, directly in their email inbox. This option can be toggled on the StegLLM dashboard.

Just your regular ordinary chatbot

Secret exposed!

How was it built?

We first take a secret key/phrase/condition and a message the LLM should respond with.
A small sample dataset is then created using the input. For the proof of concept, we went with simply repeating the "key:message" a couple of times. More advanced methods to create a dataset can include using another LLM to generate a sample dataset with the "key:message" input.
A base model, eg: gemma-2b, llama3.2, is fine tuned with the dataset. We used Unsloth, which makes fine-tuning large language models like Llama-3, Mistral, Phi-4 and Gemma 2x faster, use 70% less memory, a lot easier and with no degradation in accuracy (pretty cool!).
The LLM is then deployed and used via a chatbot web application built using Flask.

Fine tuned! 👷

How can it be better?

Currently, dataset creation, fine-tuning, and deploying is done manually. It would be super cool and helpful if we created a pipeline that streamlines all these processes. (maybe keep it open source?)

How was my experience?

I learned a lot about fine-tuning while building StegLLM. It's really awesome and I look forward to using it in future projects!