Steganalysis of transcoding steganography
- Cite this article as:
- Janicki, A., Mazurczyk, W. & Szczypiorski, K. Ann. Telecommun. (2014) 69: 449. doi:10.1007/s12243-013-0385-4
Transcoding steganography (TranSteg) is a fairly new IP telephony steganographic method that functions by compressing overt (voice) data to make space for the steganogram by means of transcoding. It offers high steganographic bandwidth, retains good voice quality, and is generally harder to detect than other existing VoIP steganographic methods. In TranSteg, after the steganogram reaches the receiver, the hidden information is extracted, and the speech data is practically restored to what was originally sent. This is a huge advantage compared with other existing VoIP steganographic methods, where the hidden data can be extracted and removed, but the original data cannot be restored because it was previously erased due to a hidden data insertion process. In this paper, we address the issue of steganalysis of TranSteg. Various TranSteg scenarios and possibilities of warden(s) localization are analyzed with regard to TranSteg detection. A novel steganalysis method based on Gaussian mixture models and mel-frequency cepstral coefficients was developed and tested for various overt/covert codec pairs in a single warden scenario with double transcoding. The proposed method allowed for efficient detection of some codec pairs (e.g., G.711/G.729), while some others remained more resistant to detection (e.g., iLBC/AMR).
Keywords: IP telephony, Network steganography, Steganalysis, MFCC parameters, Gaussian mixture models
1 Introduction
Transcoding steganography (TranSteg) is a new steganographic method that has been introduced recently by Mazurczyk et al. It is intended for a broad class of multimedia and real-time applications, but its main foreseen application is IP telephony. TranSteg can also be exploited in other applications and services (like video streaming) or wherever a possibility exists to efficiently compress the overt data (in a lossy or lossless manner).
TranSteg, like every steganographic method, can be described by the following set of characteristics: its steganographic bandwidth, its undetectability, and the steganographic cost. The term “steganographic bandwidth” refers to the amount of secret data that can be sent per time unit when using a particular method. Undetectability is defined as the inability to detect a steganogram within a certain carrier. The most popular way to detect a steganogram is to analyze the statistical properties of the captured data and compare them with the typical values for that carrier. Lastly, the steganographic cost characterizes the degradation of the carrier caused by the application of the steganographic method. In the case of TranSteg, this cost can be expressed by providing a measure of the conversation quality degradation induced by transcoding and the introduction of an additional delay.
The performance of TranSteg depends, most notably, on the characteristics of the pair of codecs: the overt codec originally used to encode user speech and the covert codec utilized for transcoding. In ideal conditions, the covert codec should not significantly degrade user voice quality compared to the quality of the overt codec (in an ideal situation, there should be no negative influence at all). Moreover, it should provide the smallest achievable voice payload size, as this results in the most free space in an RTP packet to convey a steganogram. On the other hand, the overt codec in an ideal situation should result in the largest possible voice payload size to provide, together with the covert codec, the highest achievable steganographic bandwidth. Additionally, it should be commonly used to avoid arousing suspicion.
In earlier work, a proof-of-concept implementation of TranSteg was subjected to experimental evaluation to verify whether the method is feasible. The experimental results showed that it offers a high steganographic bandwidth (up to 32 kbit/s for G.711 as the overt and G.726 as the covert codec) while introducing delays of about 1 ms and still retaining good voice quality.
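The 32 kbit/s figure follows from simple payload arithmetic. The sketch below assumes 20 ms packetization and the nominal bit rates of G.711 (64 kbit/s) and G.726 operated at 32 kbit/s; the function names and parameters are illustrative and are not taken from any TranSteg implementation.

```python
def payload_bytes(codec_bps, packet_ms=20):
    """Voice payload size in bytes for one RTP packet (assumed 20 ms framing)."""
    return codec_bps * packet_ms // (8 * 1000)


def steg_bandwidth_bps(overt_bps, covert_bps):
    """Bit rate freed when overt-codec payload space carries covert-codec speech."""
    if covert_bps > overt_bps:
        raise ValueError("covert codec must not need more space than the overt codec")
    return overt_bps - covert_bps


# Assumed nominal rates: G.711 = 64 kbit/s, G.726 = 32 kbit/s.
overt, covert = 64_000, 32_000
free_bytes = payload_bytes(overt) - payload_bytes(covert)
print(steg_bandwidth_bps(overt, covert))  # 32000 bit/s
print(free_bytes)                         # 80 bytes per 20 ms packet
```

Under these assumptions, half of every 160-byte G.711 payload is available for the steganogram.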
Our main contribution described in this paper is the development of an effective steganalysis method for TranSteg, on the assumption that we are able to capture and analyze only the voice signal near the receiver. We want to verify whether, based only on analysis of this signal, it is possible to detect TranSteg utilization for different voice codecs applied (both overt and covert). To the authors' best knowledge, this is the first approach that combines usage of mel-frequency cepstral coefficients (MFCC) with Gaussian mixture models (GMMs) for VoIP steganalysis purposes.
The rest of the paper is structured as follows: Sect. 2 presents related work on IP telephony steganalysis, Sect. 3 describes various hidden communication scenarios for TranSteg and discusses its detection possibilities considering various locations of warden(s), Sect. 4 presents the experimental methodology and results obtained, and finally, Sect. 5 concludes our work.
2 Related work
2.1 VoIP steganalysis
Many steganalysis methods have been proposed so far. However, detection methods specific to VoIP steganography are not so widespread. In this section, we consider only those detection methods that have been evaluated and proved feasible for VoIP. It must be emphasized that many so-called audio steganalysis methods were also developed for the detection of hidden data in audio files (so-called audio steganography). However, they are beyond the scope of this paper.
Statistical steganalysis for least significant bits (LSB)-based VoIP steganography was proposed by Dittmann et al. They proved that it was possible to detect hidden communication with almost a 99 % success rate on the assumption that there are no packet losses and the steganogram is unencrypted/uncompressed.
Takahashi and Lee described a detection method based on calculating the distances between each audio signal and its de-noised residual by using different audio quality metrics. Then, a support vector machine (SVM) classifier is utilized to detect the existence of hidden data. This scheme was tested on LSB, direct sequence spread spectrum, frequency-hopping spread spectrum, and echo hiding methods, and the results obtained show that for the first three algorithms, the detection rate was about 94 %, and for the last, it was about 73 %.
A mel-cepstrum-based detection, known from speaker and speech recognition, was introduced by Kraetzer and Dittmann for the purpose of VoIP steganalysis. On the assumption that a steganographic message is not permanently embedded from the start to the end of the conversation, the authors demonstrated that detection of LSB-based steganography is efficient, with a success rate of 100 %. This work was further extended by employing an SVM classifier. It was also shown, for an example of VoIP steganalysis, that channel-character-specific detection performs better than detection in which channel characteristic features are not considered.
Steganalysis of LSB steganography based on a sliding window mechanism and an improved variant of the previously known regular singular (RS) algorithm was proposed by Huang et al. Their approach provides a 64 % decrease in the detection time over the classic RS, which makes it suitable for VoIP. Moreover, experimental results show that this solution is able to detect up to five simultaneous VoIP covert channels with a 100 % success rate.
Huang et al. also introduced a steganalysis method for compressed VoIP speech that is based on second-order statistics. In order to estimate the length of the hidden message, the authors proposed to embed hidden data into sampled speech at a fixed embedding rate, followed by embedding other information at a different level of data embedding. Experimental results showed that this solution makes it possible not only to detect hidden data embedded in a compressed VoIP call, but also to accurately estimate its size.
Steganalysis that relies on the classification of RTP packets (as steganographic or non-steganographic ones) and utilizes specialized random projection matrices that take advantage of prior knowledge about the normal traffic structure was proposed by Garateguy et al. Their approach is based on the assumption that normal traffic packets belong to a subspace of a smaller dimension (first method), or that they can be included in a convex set (second method). Experimental results showed that the subspace-based model proved to be very simple and yielded very good performance, while the convex set-based one was more powerful, but more time-consuming.
Arackaparambil et al. analyzed how, in distribution-based steganalysis, the length of the window in which the distribution is measured and the detection threshold should be chosen to provide the greatest chance of success. The results obtained showed how these two parameters should be set to achieve a high rate of detection while maintaining a low rate of false positives. This approach was evaluated based on real-life VoIP traces and a prototype implementation of a simple steganographic method.
A method for detecting complementary neighbor vertices-quantization index modulation steganography in G.723.1 voice streams was described by Li and Huang. This approach builds two models, a distribution histogram and a state transition model, to quantify the codeword distribution characteristics. Based on these two models, feature vectors for training the classifiers for steganalysis are obtained. The technique is implemented by constructing an SVM classifier, and the results show that it can achieve an average detection success rate of 96 % when the duration of the G.723.1 compressed speech bit stream is less than 5 s.
2.2 Double compression detection
To detect TranSteg in some of the scenarios presented in detail in the next section, it is possible to look for artifacts caused by transcoding. Discovering the existence of double compression has been a subject of numerous analyses for digital images, digital audio (mostly wideband MP3 files), and video signals.
However, to the authors' best knowledge, the approach presented in this paper is the first one targeted at narrowband VoIP steganalysis that combines the usage of GMMs with MFCCs for this purpose.
3 TranSteg detection possibilities
It must be emphasized that currently for network steganography, as well as for digital media (image, audio, video files) steganography, there is still no universal “one size fits all” detection solution, so steganalysis methods must be adjusted precisely to the specific information-hiding technique (see Sect. 2).
We assume that the warden:
- is aware that users can be utilizing hidden communication to exchange data in a covert manner
- has knowledge of all existing steganographic methods, but not of the one used by those users
- is able to try to detect and/or interrupt the hidden communication.
For TranSteg-based hidden communication, we assume that the warden will not be able to “physically listen” to the speech carried in RTP packets because of the privacy issues related to this matter. This means that the warden will be capable of capturing and analyzing the payload of each RTP packet, but not capable of replaying the call's conversation (its content); i.e., the analysis is performed without a human in the loop.
It is worth noting that communication via TranSteg can be thwarted by certain actions undertaken by the wardens. The method can be defeated by applying random transcoding to every non-encrypted VoIP connection to which the warden has access. Alternatively, only suspicious connections may be subject to transcoding. However, such an approach would lead to a deterioration of the quality of conversations. It must be emphasized that not only steganographic calls would be affected—the non-steganographic calls could also be “punished”.
The possibility of detecting TranSteg depends on:
- the location(s) at which the warden is able to monitor the modified RTP stream
- the utilized TranSteg scenario (S1–S4)
- the choice of the covert and overt codec
- whether encryption of RTP streams is used.
DWC1: The warden inspects traffic in localizations in which RTP packet payloads are coded first with the overt and then with the covert codec (e.g., in scenario S2, localizations W2&W3; in S3, localizations W1&W2). In that case, a simple comparison of the payloads of certain RTP packets is enough to detect TranSteg.
DWC2: The warden inspects traffic in localizations between which the transcoded traffic does not change (e.g., scenario S1 and any two localizations; S2 and localizations W1&W2). In that case, comparing the payloads of certain RTP packets is useless, as they are exactly the same. However, other detection techniques may be applied here. First, packets can undergo a codec validity test, i.e., they can be checked to determine if selected fields of their payload correspond to the codec type declared in the RTP header. This method can lead to successful detection of TranSteg in most cases. For example, for TranSteg with Speex as the overt and G.723.1 as the covert codec, if Speex is expected, then the first five bits of the payload are supposed to contain the wideband flag and the mode type, whereas the first six bits of a G.723.1 payload contain one of the prediction coefficients, so they are variable. Another method consists of simply trying to decode speech with the codec declared in the RTP header. The output signal usually must not be exposed to any human due to the privacy issues mentioned earlier; however, it can undergo voice activity detection (VAD) to check if it contains a speech-like signal. It must be noted, however, that if encryption of the data stream is applied, e.g., by means of the most popular secure RTP (SRTP) protocol, then the abovementioned techniques would most likely fail.
DWC3: The warden inspects traffic in localizations in which the voice is coded with the overt codec (scenario S4 and localizations W1&W3). In that case, if lossless TranSteg transcoding was utilized (e.g., for G.711 as the overt and G.711.0 as the covert codec), then the payload values are the same, and TranSteg detection is impossible. For other overt/covert codec pairs, comparison of the payloads of certain RTP packets would be enough to detect TranSteg.
SLWC1: The warden analyzes the traffic that has not yet been subjected to transcoding caused by TranSteg, and the voice is coded with overt codec (scenarios S3 and S4, localization W1). In that case, it is obvious that TranSteg detection is impossible.
SLWC2: The warden analyzes the traffic that has been subjected to TranSteg transcoding, and the voice is coded with the covert codec (e.g., scenario S1 and any localization, or S2 and localization W1 or W2). This situation is the same as case DWC2 for a distributed warden.
SLWC3: The warden analyzes the traffic that has been subjected to TranSteg re-transcoding, and the voice is again coded with the overt codec (scenarios S2 and S4, localization W3). If lossless TranSteg transcoding was utilized, this situation is similar to case DWC3 for a distributed warden. If a pair of lossy overt/covert codecs is used, the detection is not trivial, as only the re-transcoded voice signal, encoded with the overt codec, is available.
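The codec validity test mentioned for case DWC2 can be sketched as a check on the leading payload bits. The snippet below is a simplified illustration only: it assumes a constant-mode Speex stream, in which the first five bits (wideband flag and mode type) should be identical in every packet, and it flags a stream whose leading bits vary, as they would if they actually carried a G.723.1 prediction coefficient. The function names are hypothetical, and a real test would also compare the bits against the values permitted by the codec specification.

```python
def leading_bits(payload: bytes, n_bits: int) -> int:
    """Return the first n_bits (1..8) of an RTP payload as an integer."""
    if not 1 <= n_bits <= 8:
        raise ValueError("n_bits must be between 1 and 8")
    if not payload:
        raise ValueError("empty payload")
    return payload[0] >> (8 - n_bits)


def header_bits_constant(payloads, n_bits=5):
    """True if the leading n_bits are the same in every packet of the stream.

    Assumption: under the declared codec these bits are a fixed header field,
    so variation across packets suggests a different codec inside the payload.
    """
    return len({leading_bits(p, n_bits) for p in payloads}) == 1


# Toy payloads: the first five bits are 00110 in both packets of the first stream.
plausible = [bytes([0b00110_000, 0x11]), bytes([0b00110_111, 0x22])]
suspicious = [bytes([0b10101_000, 0x11]), bytes([0b01011_000, 0x22])]
print(header_bits_constant(plausible))   # True
print(header_bits_constant(suspicious))  # False
```

As noted in the text, a check of this kind is not applicable when the payload is encrypted, e.g., with SRTP.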
Comparison of TranSteg detection possibilities

| Case | Scenario/localization | Voice encoded with | Detection possibility |
|---|---|---|---|
| DWC1 | S2/W2&W3; S3/W1&W2 | Overt, then covert | RTP payload comparison |
| DWC2 | S1/W1&W2 or W2&W3 or W1&W3; S2/W1&W2 | Covert | Codec validity test, VAD |
| DWC3 | S4/W1&W3 | Overt (at transmitter and re-transcoded) | For lossless TranSteg transcoding: impossible to detect. For lossy TranSteg transcoding: RTP payload comparison |
| SLWC1 | S3/W1; S4/W1 | Overt codec (at transmitter) | Impossible to detect |
| SLWC2 | S1/W1 or W2 or W3; S2/W1 or W2; S3/W2 or W3 | Covert codec | Codec validity test, VAD |
| SLWC3 | S2/W3; S4/W3 | Overt codec (re-transcoded) | For lossless TranSteg transcoding: impossible to detect. For lossy TranSteg transcoding: hard to detect (to be verified in this study) |
In this paper, we focus on TranSteg detection for the worst-case scenario from the warden's point of view. We assume that the warden is capable of inspecting the traffic in a single location only (the most realistic assumption). Moreover, we exclude those cases where lossless compression was utilized; as stated above, in these situations, the warden is helpless. That is why we focus on case SLWC3, in which only the re-transcoded voice is available and a lossy pair of overt/covert codecs was used, i.e., scenario S4 and localization W3.
It must be emphasized that especially for this scenario, TranSteg steganalysis is harder to perform than for most of the existing VoIP steganographic methods. This is because after the steganogram reaches the receiver, the hidden information is extracted, and the speech data is practically restored to the originally sent data. As mentioned above, this is a huge advantage compared with existing VoIP steganographic methods, where the hidden data can be extracted and removed, but the original data cannot be restored because it was previously erased due to a hidden data insertion process.
4 TranSteg steganalysis experimental results
4.1 Experiment methodology
As mentioned in the previous section, in our experiments, we decided to check the possibility of TranSteg detection in the S4 scenario, when no reference signal is available, i.e., when a single warden is used at location W3 (case SLWC3). Since a comparison with the original data is not possible, we decided to use a detection method based on comparing parameters of the received signal against models of a normal (without TranSteg) and abnormal (with TranSteg) output speech signal.
We chose MFCCs as the type of parameters to be extracted from the speech signal. The MFCC parameters have been successfully used in speech analysis since the 1970s and have been continuously employed in both speech and speaker recognition, as they have proved able to describe spectral features of speech efficiently. On the other hand, lossy speech codecs affect the speech spectrum, e.g., by smoothing the spectral envelope of the signal, so we hoped that the MFCC parameters would be helpful in detecting the transcoding present in TranSteg. The same parameters have already been used in steganalysis (see Sect. 2), where they fed an SVM-based classifier.
In our approach, however, we decided to use GMMs as the modeling method, since, combined with MFCCs, they have proved successful in various applications, including text-independent speaker recognition and language recognition; however, no reports so far have been found on using GMMs in steganalysis.
A series of experiments was conducted for various overt/covert pairs of codecs, including all the pairs that were recommended in earlier work due to their low steganographic cost and high steganographic bandwidth.
A GMM model for normal speech transmission (no TranSteg) using a codec X was trained based on MFCC parameters extracted from the training speech signal.
A GMM model for abnormal speech transmission (TranSteg active) using a pair of codecs X/Y was trained based on MFCC parameters extracted from the training speech signal.
Using the two above GMM models, we checked if it is possible to recognize normal (no TranSteg) from abnormal (TranSteg active) transmission for a speech signal from test corpora.
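The third step can be sketched as a comparison of how well each model explains the MFCC frames of a test signal. The decision rule below (choose the model with the higher average log-likelihood) is an assumption, since the text does not spell out the rule that was applied; the model parameters are toy values, and each GMM is represented as a list of hypothetical `(weight, mean, variance)` tuples with diagonal covariance.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at vector x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(x, mean, var))


def avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of the feature frames under a GMM."""
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
        top = max(logs)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(l - top) for l in logs))
    return total / len(frames)


def detect(frames, normal_gmm, abnormal_gmm):
    """Label a test signal by the better-fitting model (assumed decision rule)."""
    if avg_log_likelihood(frames, abnormal_gmm) > avg_log_likelihood(frames, normal_gmm):
        return "abnormal"
    return "normal"


# Toy 2-D "MFCC" models; real models would have 16 components per the text.
normal_gmm = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
abnormal_gmm = [(1.0, [4.0, 4.0], [1.0, 1.0])]
print(detect([[0.2, 0.1], [0.9, 1.2]], normal_gmm, abnormal_gmm))  # normal
print(detect([[3.8, 4.1], [4.2, 3.9]], normal_gmm, abnormal_gmm))  # abnormal
```

Averaging over frames makes the score independent of the signal length, which is convenient when test signals of different durations are compared.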
Speech analysis was performed with an analysis window of 30 ms and an analysis step of 10 ms using the Voicebox toolkit for Matlab®. MFCC parameters were extracted using a filter bank consisting of 26 triangular filters spaced according to the mel scale. We used GMM models with 16 Gaussians and diagonal covariance matrices. Transcoding was performed using the SoX package, Speex emulation, and the “G.723.1 speech coder and decoder” library. Packet losses were not considered in this study. The number of MFCC parameters and the length of the testing signal were subjects of experiments, the results of which are presented in the next section.
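To give a feel for these analysis settings, the sketch below counts the frames produced by a 30 ms window with a 10 ms step and computes center frequencies for 26 mel-spaced filters. It is not the Voicebox implementation: the mel formula is one common variant, the framing ignores any padding, and the 8 kHz sampling rate is an assumption (narrowband telephony), as the text does not state the rate used.

```python
import math


def frame_count(duration_ms, window_ms=30, step_ms=10):
    """Number of full analysis windows that fit into the signal (no padding)."""
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // step_ms


def hz_to_mel(f_hz):
    """A commonly used mel-scale formula (assumed variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters=26, fs_hz=8000):
    """Center frequencies of filters spaced uniformly on the mel scale up to fs/2."""
    top = hz_to_mel(fs_hz / 2)
    return [mel_to_hz(top * i / (n_filters + 1)) for i in range(1, n_filters + 1)]


print(frame_count(260))     # 24 frames for the shortest test signal
print(frame_count(10_000))  # 998 frames for the longest test signal
centers = mel_center_frequencies()
print(round(centers[0]), round(centers[-1]))  # lowest and highest center, in Hz
```

The spacing between neighboring centers grows with frequency, so low frequencies are resolved more finely than high ones.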
The following speech corpora were used in the experiments:
- TIMIT, containing speech data from 630 speakers of 8 main dialects of US English, each of them uttering 10 sentences;
- TSP speech corpus, containing 1,400 recordings from 24 speakers, originally recorded with 48 kHz sampling, but also filtered and subsampled to different sample rates;
- CHAINS corpus, with 36 speakers of Hiberno-English recorded under a variety of speaking conditions;
- CORPORA, a speech database for Polish, containing over 16,000 recordings of 37 native Polish speakers reading 114 phonetically rich sentences and a collection of first names;
- AHUMADA, a spoken corpus for Castilian Spanish, containing recordings of 104 male voices, recorded in several sessions in various conditions (in situ and telephony speech, read and spontaneous speech, etc.).
GMM models for normal and abnormal transmissions were trained using the EM algorithm. The initial positions of the Gaussian components were set using the vector quantization algorithm. As the training data, 1,600 recordings from the TIMIT corpus were used, originating from 200 speakers, each of them saying eight different sentences (the two so-called SA TIMIT sentences were omitted because they were the same for all speakers and could thus bias the acoustic models). In total, 90 min of speech were used to train both the normal and abnormal models in each of the overt/covert scenarios.
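The training procedure can be illustrated with a compact, pure-Python EM loop for a diagonal-covariance GMM whose means are initialized by k-means, used here as a simple stand-in for vector quantization. This is a minimal sketch on toy one-dimensional data, not the h2m toolkit code used in the experiments; the function names, iteration counts, and the variance floor are assumptions.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at vector x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(x, mean, var))


def vq_init(data, k, iters=10, seed=0):
    """k-means clustering, used as a simple vector quantizer for initial means."""
    means = [list(x) for x in random.Random(seed).sample(data, k)]
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda c: sum(
                (a - m) ** 2 for a, m in zip(x, means[c])))
            cells[nearest].append(x)
        for c, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous centroid
                means[c] = [sum(col) / len(cell) for col in zip(*cell)]
    return means


def train_gmm(data, k, em_iters=20, var_floor=1e-3, seed=0):
    """Fit a k-component diagonal-covariance GMM with the EM algorithm."""
    n, dim = len(data), len(data[0])
    means = vq_init(data, k, seed=seed)
    weights = [1.0 / k] * k
    # Every component starts from the (floored) global variance.
    gmean = [sum(col) / n for col in zip(*data)]
    gvar = [max(sum((a - m) ** 2 for a in col) / n, var_floor)
            for col, m in zip(zip(*data), gmean)]
    variances = [list(gvar) for _ in range(k)]
    for _ in range(em_iters):
        # E-step: posterior probability of each component for each frame.
        resp = []
        for x in data:
            logs = [math.log(w) + log_gauss_diag(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            top = max(logs)
            probs = [math.exp(l - top) for l in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weights, means, and variances.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            if nc < 1e-8:
                continue  # leave a starved component unchanged
            weights[c] = nc / n
            means[c] = [sum(r[c] * x[d] for r, x in zip(resp, data)) / nc
                        for d in range(dim)]
            variances[c] = [max(sum(r[c] * (x[d] - means[c][d]) ** 2
                                    for r, x in zip(resp, data)) / nc, var_floor)
                            for d in range(dim)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return weights, means, variances


# Toy data: two well-separated 1-D clusters around 0 and 5.
toy = [[0.0], [0.1], [-0.1], [0.2], [-0.2], [5.0], [5.1], [4.9], [5.2], [4.8]]
weights, means, variances = train_gmm(toy, k=2)
print(sorted(round(m[0], 2) for m in means))
```

In the setting described above, `data` would hold the MFCC vectors of the training speech and `k` would be 16.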
The models were tested on the following sets of speakers:
- Fifty speakers from the TIMIT corpus, different from the ones used for training, hereinafter denoted as TIM;
- Twenty-three speakers from the TSP speech corpus from the “16k-LP7” subset, hereinafter denoted as TSP;
- Thirty-six speakers from the CHAINS corpus from the “solo” subset, hereinafter denoted as CHA;
- Thirty-seven adult speakers from the CORPORA corpus, hereinafter denoted as COR;
- Twenty-five male speakers from the AHUMADA corpus from in situ recordings (read speech), hereinafter denoted as AHU.
Thus, the first three test corpora contained speech in English, and the last two contained speech in Polish and Spanish, respectively. Each speech signal being tested contained recordings of one speaker only, to imitate the most common case when analyzing one channel of a VoIP conversation. Both training and testing were carried out in the Matlab® environment using the h2m toolkit.
4.2 Experimental results
The experiments were evaluated by calculating the recognition accuracy as the percentage of correct detections of normal and abnormal transmissions against all recognition trials. Results as low as around 50 % mean that recognition accuracy is at a chance level; a result of 100 % would mean an errorless detection of the presence or absence of TranSteg.
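As a small illustration of this metric, the sketch below computes the recognition accuracy from lists of true and predicted labels. The label strings and the function name are illustrative only.

```python
def recognition_accuracy(true_labels, predicted_labels):
    """Percentage of trials in which the predicted label matches the true one."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


truth = ["normal", "abnormal", "normal", "abnormal"]
guess = ["normal", "abnormal", "abnormal", "abnormal"]
print(recognition_accuracy(truth, guess))  # 75.0
```

With equal numbers of normal and abnormal trials, a detector that guesses at random scores about 50 % on this measure, which is the chance level referred to above.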
The first experiments were run to estimate the length of speech data required for effective steganalysis of TranSteg. Since the technique applied is in fact based on statistical analysis of spectral parameters of speech, the amount of data required for analysis must be sufficiently large; such an analysis cannot be performed on speech extracted from a single 20 ms VoIP packet, or even from a few packets in a row. We ran our experiments on test signals ranging from 260 ms to 10 s; with 20 ms packets, these correspond to between 13 and 500 voice packets.
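The correspondence between signal duration and packet count is plain division, sketched below under the 20 ms packetization mentioned in the text. The helper name is hypothetical.

```python
def packets_needed(duration_ms, packet_ms=20):
    """Number of voice packets needed to cover a signal of the given duration."""
    return -(-duration_ms // packet_ms)  # ceiling division on integers


print(packets_needed(260))     # 13
print(packets_needed(10_000))  # 500
```

The same arithmetic gives 100 packets for a 2 s signal.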
Next, experiments were aimed at deciding how many MFCC coefficients are needed for efficient TranSteg detection. In speech recognition, 12 coefficients are typically used, usually together with dynamic derivatives. In speaker recognition, 12, 16, 19, or even 21 coefficients are used in order to capture the individual characteristics of a speaker. Since here we are dealing with a different task, the number of MFCC coefficients required experimental verification. We checked the recognition accuracy for various overt/covert pairs of codecs for the number of MFCC coefficients ranging from 1 to 19.
TranSteg recognition accuracy for various overt/covert configurations
We found some correlation between steganographic cost and detectability of TranSteg. For example, the Speex7/G.729 pair has a relatively high steganographic cost of 0.74 MOS and, at the same time, can be detected relatively easily (90 % accuracy), whereas the pair iLBC/AMR allows for TranSteg transmission at a cost of only 0.46 MOS and is also difficult to detect. There are, however, a few exceptions to this rule. For example, the three covert codecs (G.726, AMR, and Speex7) that offer a similar steganographic cost with G.711 as the overt codec (ca. 0.4 MOS, see Fig. 2) behave quite differently as regards TranSteg detectability: G.711/G.726 can be recognized quite easily, while G.711/Speex7 proved to be the most resistant to steganalysis using the GMM/MFCC technique.
5 Conclusions and future work
TranSteg is a fairly new steganographic method dedicated to multimedia services like IP telephony. In this paper, the analysis of its detectability was presented for a variety of TranSteg scenarios and potential warden configurations. Particular attention was turned towards the very demanding case of a single warden located at the end of the VoIP channel (scenario S4). For this purpose, a novel steganalysis method based on the GMM models and MFCC parameters was proposed, implemented, and thoroughly tested.
The results showed that the proposed method allowed for efficient detection of some codec pairs, e.g., G.711/G.726, with an average detection probability of 94.6 %, Speex7/G.729, with 89.6 % detectability, or Speex7/iLBC, with 86.3 % detectability. On the other hand, some TranSteg pairs remained resistant to detection using this method, e.g., the pair iLBC/AMR, with an average detection probability of 67 %, which we consider to be low. We found some correlation between the steganographic cost of an overt/covert codec pair and the detectability of TranSteg: usually, the lower the cost, the more difficult the detection of TranSteg. However, some results were surprising, e.g., the G.711/G.726 pair, with a low steganographic cost (0.42 MOS), turned out to be relatively easy to detect. In contrast, the pair G.711/Speex7, offering a similar cost and, what is more, a higher steganographic bandwidth, proved to be resistant to steganalysis, with a recognition accuracy of only 63.3 %. This confirms that TranSteg with properly selected overt and covert codecs is an efficient steganographic method if analyzed with a single warden.
Successful detection of TranSteg using the described method, for a single warden at the end of the channel, requires at least 2 s of speech data to analyze, i.e., a hundred 20-ms VoIP packets. This should not be a problem, considering the fact that phone conversations last for minutes. However, if the overt channel contained not speech, but a piece of music, noise, or just silence, the detectability of TranSteg would be seriously affected.
It must also be noted that, especially for the inspected hidden communication scenario (S4), TranSteg steganalysis is harder to perform than for most of the existing VoIP steganographic methods. This is because, after the steganogram reaches the receiver, the hidden information is extracted, and the speech data is practically restored to the data originally sent. If changes are made to the signal, they are not easily visible without a proper spectral and statistical analysis. This is a huge advantage compared with existing VoIP steganographic methods, where the hidden data can be extracted and removed, but the original data cannot be restored because it was previously erased due to a hidden data insertion process.
Future work will include developing an effective steganalysis method for cases in which encryption using SRTP is utilized. The efficiency of alternatives to MFCC parameters, e.g., linear prediction coding coefficients, can also be verified in future experiments. We also plan to verify the suitability of the steganalysis method proposed in this paper for the detection of other VoIP steganography solutions.
This research was partially supported by the Polish Ministry of Science and Higher Education and Polish National Science Center under grants: 0349/IP2/2011/71 and 2011/01/D/ST7/05054.
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.