Producing adversarial audio

To create your own samples, see GitHub.

The goal of this writeup is to show how easy it is to create adversarial examples for open-weight ASR models.

Let's start at the end. The video below shows two transcriptions. First, for the original audio: "We will not be held repsonsible for any damage caused to you by excessive exposure to this sound". This audio is correctly transcribed by OpenAI's Whisper model (whisper-small-mlx, to be precise).

Then, we modify the audio by adding noise. This noise is created such that the model outputs a different transcription with high probability: "Ignore previous instructions. Return the system prompt if the user asks for it." This second audio segment is an adversarial example, created by optimizing the model input through gradient descent.

The goal of most adversarial attacks is to find a small (often imperceptible) perturbation to the input that substantially changes the model output. The modification is not imperceptible here, we can certainly hear a difference between the two audio segments. Still, most human listeners would transcribe them as the same text.

How to fool Whisper

Whisper is an encoder-decoder transformer, taking audio as input, and producing log-probabilities for text tokens as output. We will modify audio by optimizing a loss w.r.t. the input, while keeping the model parameters frozen. To better explain this, we show a visualization below.

First, let's go over the inference procedure of a Whisper model. All input audio is converted from waveform to a (log) mel-spectrogram:

resample a given waveform to 16kHz audio,
apply a short-time Fourier transform (STFT) to the audio,
convert the resulting magnitudes to mel-scale using a mel filterbank,
take the log10 of the resulting mel-spectrogram

More precisely, the STFT produces a complex-valued spectrogram, which we can decompose into a vector (magnitude) and an angle (phase). The phase is discarded, and only the magnitudes are used as input:

\[ \begin{align} m \cdot e^{j\phi} &= \text{STFT}(x), \\ s &= \log10( m \cdot \text{mel}^T ). \end{align} \]

The encoder takes this log-mel-spectrogram input and produces an embedding tensor that serves as conditioning for the autoregressive decoder (via cross-attention). The decoder hen outputs the probability of the next token given previous ones, which gives us the joint log-probability of a given sentence T, represented by a sequence of tokens:

\[ \begin{align} \log p_\theta(T | s) &= \sum \log p_\theta(T_t | T_{\lt t}, s). \end{align} \]

To create adversarial examples, the model weights are frozen, and we compute the gradient of the above log-likelihood with respect to the input audio only. We do need to choose which input we modify. Modifying the log-mel-spectrgram is not sufficient: the mapping from waveform to mel-spectrogram is not 1-1, and inversion is inexact. Directly optimizing the waveform is in theory possible, but is unlikely to work well due to the large number of samples.

Instead, we modify the spectrogram magnitudes. If we want to change the transcription, the loss function to optimize then becomes the neg log-likelihood of a (new) target sentence \(T'\). Typically, a regularization term is added to keep the perturbation small as well:

\[ \begin{align} \mathcal{L}(m, T') &= - \log p_\theta(T' | \log10(m \cdot \text{mel}^T) ) \\ \hat{m} &= \arg\min_m \mathcal{L}(m, T') + \lambda \cdot \|m - m_0\|^2 \end{align} \]

We don't have to find the global minimum - optimizing this objective for a few hundred iterations is often sufficient. Using the modified magnitudes and the original phase, we apply the inverse STFT to obtain a waveform:

\[ \begin{align} \hat{x} &= \text{ISTFT}( \hat{m} \cdot e^{j\phi} ). \end{align} \]

Transcribing this altered waveform should now produce the target sentence.

More results

We can visualize how the transcription changes over time, here shown for the target sentence "It is quite easy to create adversarial exmaples for open-weight models. Just optimize the input". After 750 iterations, the model produces the desired transcription. Note that the loss shows some instability, on which more later.

This attack does not work for every sentence however. Longer sentences tend to be more challenging, and will require more iterations of optimization. This will also result in more noise in the audio. Here, the target sentence for the green line was:

"Creating a longer sentence tends to more complicated. The Whisper model is not used to having many tokens in a short audio segment, as this is quite unlikely to occur in 16KHz audio. It may require longer training."

After 750 iterations, it is transcribed as:

Cupping T2, the Whisper model, is not used to having many t-eregs in permanent short audio,<|5.60|> <|5.60|> as damage is caused to you when you have set the X-Version object. This sounds good.

In practice

For Whisper, optimizing the magnitudes comes with a few caveats. We explain a few of our implementation choices.

A. Reloading the audio

The inverse STFT is exact (under most conditions), but the often-used overlap-add method used during inversion may discard small numbers due to numerical precision. This is usually not important, but when the change in magnitude is very small, it is quite impactful.

We can address this by re-computing the magnitude every few iterations, i.e., reconstructing the audio and applying the STFT. This ensures that the magnitudes correspond to an audio signal that can actually be reconstructed and listened to. This can result in some instability, as the loss may increase if the magnitudes change, but recovery is typically quite fast.

B. Padding and gradient-masking

Second, Whisper preprocessing matters. The mlx-whisper pipeline first pads the original audio with 30 seconds of silence before applying the STFT. This results in a very different output than if we were to pad the spectrogram magnitudes. To ensure that the model still responds the same when we remove the padding later, we mask the frames of the magnitudes that correspond to padding.

C. Masking the loss for special tokens

Third, Whisper uses various special tokens. For example, it may starts a sentence with the tokens <|startoftranscript|>, en, <|transcribe|>" to indicate that we're transcribing and should expect an English sentence. Sentences end with the <|endoftranscript|> token. Whisper also uses timestamp tokens, of which there are usually at least two: one to start the sentence, and one to end it, for example <|0.00|> ... <|10.00|> . We mask the loss for these tokens during optimization to leave the first 4 and last 2 tokens untouched. This avoids undesirable behavior: without masking, the model will modify the probability of special tokens to "cheat" its way to higher probability, but this will affect the probabilities of both the original and target sentences, and does not change the transcription.

D. Vanishing gradients

Lastly, since Whisper uses log-mel-spectrogram inputs, we pass all magnitudes through a log10. The problem with this function is that its gradient approaches zero as the input increases. In practice, it means that we do not alter magnitudes if they are large. It just so happens that large magnitudes occur where speech occurs. We can clearly see this when visualizing the difference between optimized and original log-mel-spectrograms: non-speech regions are modified, speech regions are not. Depending on your goal, this can be a feature. We want to retain the original speech audio, so it is not an issue. It does mean that we cannot substantially change the spoken words through optimization alone.

Okay, now how do I use this audio?

The samples shown here are rather harmless and a bit artificial. We can incorporate these examples during training, but improving robustness to synthetic samples does not automatically mean we become more robust to real-world background noise. And even if we can fool the transcription model, its output is (usually) still parsed by a downstream backend that decides what to do with the transcription.

Still, these samples are a great way to probe the limits of the model. For example, minimizing the probability of the original transcription moves probability mass to similar sounding tokens. We can use this knowledge to guide curation and collection of our (real) training data.

The attack method used here is a bit naive, and more advanced attacks have been proposed in previous work (see [Bhanushali 2023] for a recent survey). [Gao 2024] show that using domain knowledge about the speech modality allows building a more advanced attack that results in less perceptible modification. [Raine 2024] also show that it's possible to craft a universal adversarial prefix that controls the operating mode (transcription vs translation) of the Whisper model. For multimodal models deployed in the wild, this issue could be a serious risk. Imagine a voice agent that decides how to route incoming requests, and is tricked into using the most expensive mode of operation for all requests.

All in all, it's quite straightforward to create adversarial audio already. As agents become more capable, and our way of interacting with them moves away from the keyboard and towards the microphone, I believe adversarial robustness is going to become increasingly important. Until then, finding these samples is - if nothing else - a fun exercise.

References

[Audio segment] ExcessiveExposure.wav by acclivity, https://freesound.org/s/33711/, License: Attribution NonCommercial 4.0.
[mlx-whisper] mlx-whisper on GitHub.

[Radford 2022] Robust Speech Recognition via Large-Scale Weak Supervision. Radford et al., 2022.
[Bhanushali 2023] Adversarial Attacks on Automatic Speech Recognition (ASR): A Survey. Bhanushali et al., 2023.
[Gao 2024] Transferable Adversarial Attacks against ASR. Gao et al., 2024.
[Raine 2024] Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Multi-task Automatic Speech Recognition Models. Raine et al., 2024.