Audio-Jacking


Ibrahim: Hey Shema, it was great seeing you at the conference yesterday. 

Shema: Great seeing you too babe. I want to send you some money for your Iftar pizza. Can you send me your bank account details? 

Ibrahim: Yeah, sure love. My account number is 3141529.

Shema: That's 8675309. Got it. Thanks.

Ibrahim: Take care babe I love you. 

Shema: I love you too 

What just happened then? You heard Ibrahim say this number (3141529) and Shema wrote down a different number (8675309). Why did she do that? Does she have a hearing problem? No, she doesn't. In fact, what she wrote down was exactly what she heard. You just didn't hear her side of the conversation.

Welcome to the world of audio-jacking. Yeah, it's a thing. This is a new type of attack that an IBM X-Force researcher, Chenta Lee, came up with and demonstrated in a proof of concept.

Let's take a look and see how it works, and ultimately what you can do to protect yourself against it. 

How Does Audio-Jacking Work?

Let's assume an attacker wants to target me with audio-jacking whenever I call my wife, Shema. The attacker becomes what we call a man in the middle. In other words, he inserts a control point between the two of us in our conversation.

Now how could he do that? Well, there are a lot of different ways, but one of the simplest would be to insert malware. In other words, if he gets malware onto my system, whether that's my phone, my PC, or my laptop, whichever I'm using to make the call, that could establish the man-in-the-middle position. What he needs is an interceptor, and that's what the malware provides.

By the way, that malware could be embedded in an app that I download from an app store, for instance, and that puts the interceptor in place on my device. Another way would be to exploit voice over IP calling.

In that case, if someone is able to insert themselves in the middle of the conversation, they might be able to take control. Yet another option would be a three-way call, where the attacker calls me, spoofing the number to make it look like the call came from Shema, and calls Shema, spoofing my number to make it look like it came from me. He then inserts a deepfake of my voice, a clone of my voice, to start the conversation.

That way neither of us realizes the other one didn't initiate the call. So there are a number of different ways this might get kicked off. But once the attacker has established his position, his foothold, then what happens? 

Well, you remember that in the call I said something like, it was great seeing you at the conference, Shema.

This is where the interceptor component comes in. It intercepts what I've said and sends it down to another component, a speech-to-text converter.

Basically, it takes the audio of what I said and turns it into text, into readable words. It then takes that information and sends it on into a large language model. 

Now, why a large language model? Because these things are really good at natural language processing. They can understand the context of a conversation and not just pick out single words. So an LLM can look at what I've just said, now that it has been converted to text, and analyze what I mean by it. In this attack, the LLM is looking specifically for bank account numbers.

It wants to know whether I stated a bank account number. In the first thing I said to Shema, I didn't say anything like that, so the answer in that case is no.

So it just takes what I said and lets it go through the interceptor, passed along unimpeded and unchanged. What I said is, in fact, what Shema hears, and everything sounds normal.

Here's where it gets interesting. Shema then answers me back. And what she says is, yeah, good to see you too, but what I'd like to do is pay you back for the pizza.

Okay, fine. So the interceptor takes her words, converts them to text, and sends the text to the large language model. And in her message she said, send me your bank account number.

Now, the large language model is smart enough to realize that the mere mention of the phrase "bank account number" is not the same thing as an actual bank account number, because LLMs understand natural language. So in that case, again, the answer is no, and her message is passed along back to me unimpeded.

Again, everything acts normal. Here's where it gets dicey. Next, I tell her my number, 3141529, and that goes through the interceptor.

It turns that into text and sends it to the LLM, which concludes: he just stated a bank account number. Not just the words, but an actual bank account number.

It then takes that information, and this is where the attack gets interesting. It passes it down to a text-to-speech component, which turns the words back into speech.

But it also takes what I just said, and remember there was an account number in there, pulls that number out, and puts something else in. What does it put in? 8675309, which is the attacker's account number.
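To make the decision logic concrete, here is a small text-only simulation of the pass-through-or-substitute step described above. It is a sketch under loud assumptions, not the researcher's proof of concept: there is no audio involved, a regular expression stands in for the LLM, and the names `relay`, `contains_account_number`, `ACCOUNT_PATTERN`, and `ATTACKER_ACCOUNT` are made up for illustration.

```python
import re

# Hypothetical value: the number the attacker wants the listener to hear.
ATTACKER_ACCOUNT = "8675309"

# Stand-in for the LLM step (an assumption): treat any run of 7 to 12 digits
# as an account number. A real language model would use conversational
# context instead of a fixed pattern.
ACCOUNT_PATTERN = re.compile(r"\b\d{7,12}\b")


def contains_account_number(transcript: str) -> bool:
    """Answer the yes/no question the interceptor asks about each utterance."""
    return ACCOUNT_PATTERN.search(transcript) is not None


def relay(transcript: str) -> str:
    """Return the text the listener would end up hearing."""
    if not contains_account_number(transcript):
        return transcript  # passed along unchanged
    # Swap every detected number for the attacker's number.
    return ACCOUNT_PATTERN.sub(ATTACKER_ACCOUNT, transcript)


call = [
    "Hey Shema, it was great seeing you at the conference yesterday.",
    "Can you send me your bank account details?",
    "Yeah, sure. My account number is 3141529.",
]

for said in call:
    heard = relay(said)
    status = "changed" if heard != said else "unchanged"
    print(f"{status:9} | {heard}")
```

Only the third utterance is altered. The first two pass through untouched, which is exactly why neither of us notices anything odd on the call.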

The result then gets passed on to a deepfake generator that has already cloned what my voice sounds like. How could it do that? Well, it turns out that some of these models can generate a deepfake from as little as three seconds of a sample of your voice. Some of them need 30 seconds, and some need more.

But the point is, it's not hard to get three seconds or even 30 seconds of audio of a person and then be able to create a very lifelike clone or deepfake of their voice. So it's going to substitute that into the message. Now, all of this processing takes a little bit of time.

How do we cover that? Well, there's a little bit of a social engineering thing that we could insert. You didn't hear it in our call, but in the real proof of concept, we would need to do this. And that is, it's going to generate a message in my voice that says, yeah, sure, hold on a second while I look up the number.


So that's really just a delay tactic so that the processing can finish. Once it has, the interceptor sends Shema the substituted account number, and that's the one she takes down. In the meantime, because there would be a delay on my side as I wait for this to happen, it generates a message to me in Shema's voice that says, hold on a second while I write it down.

So now both of us have a reasonable expectation that the other is busy doing something, and we each wait for a little bit of time, which is the time the process needs. Then, once Shema gets that information, she has the wrong account number. And that wrong account number, of course, points to the attacker.

She wires the money to the attacker, and the attacker's been successful. So that's, in a nutshell, how this thing works. Pretty scary stuff, right? Well, that was just one scenario.

Let's take a look at some other types of attacks that we might also see. What you just saw was a financial-based attack, where someone is substituting in account numbers or other types of information like that. But there could be other implications and other possibilities.

There could be health-based information that's being exchanged, something that's really sensitive that could affect, for instance, a patient's life if the wrong information is communicated from one doctor to another. Other things that could happen would be censorship. 

Say that you're giving a talk and someone substitutes words that you did not say into the video. All of a sudden, you have said something terrible that you didn't actually say, and the implications of that could be devastating. 

Another one to consider is real-time impersonation. In this case, the attacker has the deepfake. They call up the other individual and speak to them in the voice of the person they're impersonating. They talk in their own voice, and what comes out is the voice of the person they want to spoof. 

So there could be a lot of scary implications for this technology if we're not prepared.

How to Prevent Audio-Jacking 

So what should you do to defend against an audio-jacking attack? Defending against this stuff is really hard, but we do have some tools and strategies that we can use to guard against it.

We'll start off with the most important one: be skeptical. Don't believe everything you hear, even if you're sure you heard the voice of the other person. In this world of deepfakes and audio-jacking, the other person may not actually be saying what you hear. So think first.

Then, if it's something really important, like sending bank account numbers or anything sensitive like that, you want to paraphrase and repeat. The system may have a little bit of difficulty with the conversion, and you'll be able to catch it a little bit off guard. Say it in different ways, because the LLM is looking for certain keywords, certain phrases, certain ways of expressing things, and you may express it slightly differently than it expects.

Another thing, if it's really important to you, is out-of-band communication. In other words, say we were just talking on a cell phone. If this is really important, maybe don't include the bank account number in that call.

Maybe say, I'll send you the account number through email. That's not the greatest option, so maybe I'll text it to you instead, or send it to you in some other messaging app.

Better still, divide the account number up. Send half of the account number in one messaging app and half in another. Or switch devices: if you were doing it on a phone, switch over to a laptop.
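As a rough illustration of the split-and-compare idea, the sketch below divides a number into two parts for two separate channels and checks a number heard on the call against a copy received some other way. The function names are hypothetical, and this is only a toy: in real life the comparison happens in your head or on paper, not in a script.

```python
def split_for_two_channels(account_number: str) -> tuple[str, str]:
    """Split the digits roughly in half, one part per messaging channel."""
    middle = len(account_number) // 2
    return account_number[:middle], account_number[middle:]


def digits_only(text: str) -> str:
    """Drop spaces, dashes, and other separators so only the digits are compared."""
    return "".join(ch for ch in text if ch.isdigit())


def matches_out_of_band(heard_on_call: str, received_elsewhere: str) -> bool:
    """Compare the number heard on the call with the copy sent over another channel."""
    return digits_only(heard_on_call) == digits_only(received_elsewhere)


# The sender splits the real number and sends each half in a different app.
first_half, second_half = split_for_two_channels("3141529")
print(first_half, second_half)  # 314 1529

# The receiver rejoins the halves and compares them with what was heard on the call.
print(matches_out_of_band("8675309", first_half + second_half))  # False
```

A mismatch like the one in the last line is the signal to stop and call back on a different channel before sending any money.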

Anything that broadens what the attacker would have to compromise, more apps, more devices, more channels, makes the job harder for them. That's what you're looking to do.

And then finally, the best practices: the standard stuff that we know we're always supposed to do, but not everyone does. What kinds of things do I mean by this? Well, for instance, keep your systems patched with the latest level of software. Whether it's a laptop or a phone doesn't matter.

Make sure that you have all the available security patches in place. Also, when it comes to emails, attachments, links in messages, and things like that, don't open them if you don't really have to or if you don't know what they're going to do, because those things could be the way the attacker gets the malware onto your system and becomes the man in the middle.

Then when it comes to apps that you download, and who doesn't want to download 1,000 apps on their phone? Make sure that you get them from trusted sources. Even trusted sources can fail us every once in a while, but you put the odds in your favor if you get an app from a trusted app store as opposed to some other one, where there might be malware, a Trojan horse, or something like that inserted into the app. 

And then finally, one of the things that might get exploited, ultimately downstream, would be if they get your credentials and they try to log into your account or something like that.

So use things like multi-factor authentication, or you know I'm a big fan of replacing passwords with passkeys. And we have a post on that if you'd like to learn more about that. But passkeys are a stronger way of securing your account.

AI can do some really amazing things for us, and I'm a huge fan. However, if we're not careful, it can also do some really devastating stuff to us. So be informed, keep learning, stay vigilant, and protect yourself against these attacks.
