Influence of speech codecs selection on transcoding steganography
The typical approach to steganography is to compress the covert data in order to limit its size, which is reasonable in the context of a limited steganographic bandwidth. Transcoding steganography (TranSteg) is a recently proposed IP telephony steganographic method that offers high steganographic bandwidth while retaining good voice quality. In TranSteg, it is instead the overt data that is compressed to make space for the steganogram. In this paper we analyze how the selection of speech codecs influences hidden transmission performance, that is, which codecs are the most advantageous for TranSteg. By considering the codecs that are currently most popular for IP telephony, we aim to find out which codecs should be chosen for transcoding to minimize the negative influence on voice quality while maximizing the obtained steganographic bandwidth.
Keywords: IP telephony, Network steganography, TranSteg, Information hiding, Speech coding
1 Introduction
Steganography is an ancient art that encompasses various information hiding techniques, whose aim is to embed a secret message (steganogram) into a carrier of this message. Steganographic methods are aimed at hiding the very existence of the communication, and therefore any third-party observers should remain unaware of the presence of the steganographic exchange. Steganographic carriers have evolved throughout the ages and are related to the evolution of the methods of communication between people. Thus, it is not surprising that currently telecommunication networks are a natural target for steganography. The type of modern steganography that utilizes network protocols and/or relationships between them as the carrier for steganograms to enable hidden communication is called network steganography [18, 26, 28].
IP telephony is one of the most important services in the IP world and is changing the entire telecommunications landscape. It is a real-time service which enables users to make phone calls through IP data networks. An IP telephony connection consists of two phases, in which certain types of traffic are exchanged between the calling parties: the signaling phase and the conversation phase. In the first phase, signaling protocol messages, for example session initiation protocol (SIP) messages, are exchanged between the caller and callee. These messages are intended to set up and negotiate the connection parameters between the calling parties. In the second phase, two audio streams are sent, one in each direction. The real-time transport protocol (RTP) is most often utilized for voice data transport, and thus packets that carry voice payload are called RTP packets. Consecutive RTP packets form an RTP stream. Due to the popularity of IP telephony, as well as the large volume of data and the variety of protocols involved, it is currently attracting the attention of the research community as a perfect carrier for steganographic purposes.
Transcoding steganography (TranSteg) is a new steganographic method that has been introduced recently in . It is intended for a broad class of multimedia and real-time applications, but its main foreseen application is IP telephony. TranSteg can also be exploited in other applications or services (like video streaming), wherever a possibility exists to efficiently compress (in a lossy or lossless manner) the overt data. The typical approach to steganography is to compress the covert data in order to limit its size (it is reasonable in the context of a limited steganographic bandwidth). In TranSteg, compression of the overt data is used to make space for the steganogram. TranSteg is based on the general idea of transcoding (lossy compression) of the voice data from a higher bit rate codec (and thus greater voice payload size) to a lower bit rate codec (with smaller voice payload size) with the least possible degradation in voice quality.
In  a proof of concept implementation of TranSteg was subjected to experimental evaluation to verify whether it is feasible. The obtained experimental results proved that it offers a high steganographic bandwidth (up to 32 kbit/s) while introducing delays lower than 1 ms and still retaining good voice quality.
In this paper we focus on analyzing how the selection of speech codecs impacts hidden transmission performance, that is, which codecs would be the most advantageous ones for TranSteg. Therefore, the main contribution of the paper is to establish, by considering the codecs for IP telephony which are currently most popular, which speech codecs should be chosen for transcoding to minimize the negative influence on the hidden data carrier (voice quality) while maximizing the obtained steganographic bandwidth.
The rest of the paper is structured as follows. Section 2 presents related work on IP telephony steganography. Section 3 describes the functioning of TranSteg and its hidden communication scenarios. Section 4 presents the experimental methodology and results obtained. Finally, Sect. 5 concludes our work.
2 Related work
IP telephony as a hidden data carrier can be considered a fairly recent discovery. The proposed steganographic methods stem from two distinctive research origins. The first is the well-established image and audio file steganography , which has given rise to methods which target the digital representation of voice as the carrier for hidden data. The second sphere of influence is the so-called covert channels, created in different network protocols [1, 30] (a good survey on covert channels, by Zander et al., can be found in ); these solutions target specific VoIP protocol fields (e.g. signaling protocol—SIP, transport protocol—RTP, or control protocol—RTCP) or their behavior. Presently, steganographic methods that can be utilized in telecommunication networks are jointly described by the term network steganography, or, specifically, when applied to IP telephony, by the term steganophony .
The first VoIP steganographic methods to utilize the digital voice signal as a hidden data carrier were proposed by Dittmann et al. . The authors evaluated the existing audio steganography techniques, with a special focus on the solutions which were suitable for IP telephony. This work was further extended and published in 2006 in . In , an implementation of SteganRTP was described. This tool employed the least significant bits (LSB) of the G.711 codec to carry steganograms. Wang and Wu, in , also suggested using the least significant bits of voice samples to carry secret communication, but here the bits of the steganogram were coded using a low rate voice codec, like Speex. In , Takahashi and Lee proposed a similar approach and presented its proof of concept implementation, voice over VoIP (Vo2IP), which can establish a hidden conversation by embedding compressed voice data into the regular, PCM-based, voice traffic. The authors also considered other methods that can be utilized in VoIP steganography, like direct sequence spread spectrum (DSSS), frequency-hopping spread spectrum (FHSS), or Echo hiding. In , Aoki proposed a steganographic method based on the characteristics of pulse code modulation (PCMU) in which the zeroth speech sample can be represented by two codes due to the overlap. Another LSB-based method was proposed by Tian et al. . The authors incorporated the m-sequence technique to eliminate the correlation among secret messages to resist statistical detection. A similar approach, also LSB-based, relying on adaptive VoIP steganography was presented by the same authors in ; a proof of concept tool, StegTalk, was also developed. In  Xu and Yang proposed an LSB-based method dedicated to voice transmission using the G.723.1 codec in 5.3 kbps mode. They identified five least significant bits in various G.723.1 parameters and used them to transmit hidden data; the method provided a steganographic bandwidth of 133.3 bps. 
In  Miao and Huang presented an adaptive steganography scheme based on the smoothness of the speech block. Such an approach proved to give better results in terms of voice quality than the LSB-based method. An interesting study is described in , where Nishimura proposed hiding information in the AMR-coded stream by using an extended quantization-based method of pitch delay (one of the AMR codec parameters). This additional data transmission channel was used to extend the audio bandwidth from narrow-band (0.3–3.4 kHz) to wide-band (0.3–7.5 kHz).
Utilization of the VoIP-specific protocols as a steganogram carrier was first proposed by Mazurczyk and Kotulski . The authors proposed using covert channels and watermarking to embed control information (expressed as different parameters) into VoIP streams. The unused bits in the header fields of IP, UDP, and RTP protocols were utilized to carry the type of parameter, while the actual parameter value is embedded as a watermark into the voice data. The parameters are used to bind control information, including data authentication, to the current VoIP data flow. In  and  Mazurczyk and Szczypiorski described network steganography methods that can be applied to VoIP: to its signaling protocol, SIP (with SDP), and to its RTP streams (also with RTCP). They discovered that a combination of information hiding solutions provides a capacity to covertly transfer about 2000 bits during the signaling phase of a connection and about 2.5 kbit/s during the conversation phase. In , a novel method called lost audio packets steganography (LACK) was introduced; it was later described and analyzed in  and . LACK relies on the modification of both the content of the RTP packets and their time dependencies. This method takes advantage of the fact that, in typical multimedia communication protocols like RTP, excessively delayed packets are not used for the reconstruction of the transmitted data at the receiver; that is, the packets are considered useless and discarded. Thus, hidden communication is possible by introducing intentional delays into selected RTP packets and substituting the original payload with a steganogram.
Bai et al.  proposed a covert channel based on the jitter field of the RTCP header. This is performed in two stages: firstly, statistics of the value of the jitter field in the current network are calculated. Then, the secret message is modulated into the jitter field according to the previously calculated parameters. Utilization of such modulation guarantees that the characteristic of the covert channel is similar to that of the overt one. In , Forbes proposed a new RTP-based steganographic method that modifies the timestamp value of the RTP header to send steganograms. The method’s theoretical maximum steganographic bandwidth is 350 bit/s.
The TranSteg technique that was first introduced in  is a development of the last of the discussed groups of steganographic methods for VoIP, originating from covert channels. Compared to the existing solutions, its main advantages are a high steganographic bandwidth, low steganographic cost (i.e. little degradation of voice quality), and difficult detection.
3 TranSteg functioning
TranSteg, like every steganographic method, can be described by the following set of characteristics: its steganographic bandwidth, its undetectability, and the steganographic cost. The term “steganographic bandwidth” refers to the amount of secret data that can be sent per time unit when using a particular method. Undetectability is defined as the inability to detect a steganogram within a certain carrier. The most popular way to detect a steganogram is to analyze statistical properties of the captured data and compare them with the typical values for that carrier. Lastly, the steganographic cost characterizes the degradation of the carrier caused by the application of the steganographic method. In the case of TranSteg, this cost can be expressed by providing a measure of the conversation quality degradation induced by transcoding and the introduction of an additional delay.
An ideal covert codec should:
not significantly degrade user voice quality when compared to the quality of the overt codec (in an ideal situation there should be no negative influence at all),
provide the smallest achievable voice payload size, as this results in the most free space in an RTP packet to convey a steganogram.
The overt codec, in turn, should ideally:
result in the largest possible voice payload size to provide, together with the covert codec, the highest achievable steganographic bandwidth,
be commonly used, to avoid arousing suspicion.
In this paper, we consider TranSteg in scenario S4 because it is the worst-case scenario in terms of speech quality: it requires triple transcoding, and two of these transcodings are introduced by TranSteg itself. If TranSteg scenarios S1–S3 were applied, we would avoid one or even two (in scenario S1) transcodings, and therefore the negative influence on speech quality would be lower than presented in this study.
The SS handles each incoming RTP stream as follows:
Step 1: For an incoming RTP stream it transcodes the user’s voice data from the overt to the covert codec.
Step 2: The transcoded voice payload is placed once again in an RTP packet. The RTP packet’s header remains unchanged.
Step 3: The remaining free space of the RTP payload field is filled with the steganogram’s bits (and thus the original voice payload is erased).
Step 4: Checksums in lower layer protocols (UDP checksum and CRC at the data link) are adjusted.
Step 5: Modified frames with encapsulated RTP packets are sent to the receiver (SR).
The SR, on receiving the modified RTP stream, performs the following steps:
Step 1: It extracts the voice payload and the steganogram from the RTP packets.
Step 2: The voice payload is then transcoded from the covert to the overt codec and placed once again in consecutive RTP packets. By performing this task the steganogram is overwritten with user voice data. The RTP packet’s header remains unchanged.
Step 3: Checksums for the lower layer protocols (i.e. the UDP checksum and CRC at the data link, if they have been utilized) are adjusted.
Step 4: Modified frames with encapsulated RTP packets are sent to the receiver (callee).
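The two step lists above can be sketched at the payload level as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the `to_covert` and `to_overt` callables are hypothetical stand-ins for real codec libraries, RTP header handling and checksum recalculation are omitted, and the 32-byte covert frame size is an assumed value chosen only for illustration (the 160-byte figure corresponds to one 20 ms G.711 frame).

```python
OVERT_PAYLOAD_BYTES = 160  # one 20 ms G.711 frame: 8000 samples/s * 0.02 s * 1 byte
COVERT_PAYLOAD_BYTES = 32  # assumed covert-codec frame size (hypothetical value)
FREE_BYTES = OVERT_PAYLOAD_BYTES - COVERT_PAYLOAD_BYTES


def ss_embed(overt_payload, stego_chunk, to_covert):
    """SS side: transcode to the covert codec and fill the freed space."""
    if len(overt_payload) != OVERT_PAYLOAD_BYTES:
        raise ValueError("unexpected overt payload size")
    if len(stego_chunk) > FREE_BYTES:
        raise ValueError("steganogram chunk does not fit in the free space")
    covert_voice = to_covert(overt_payload)  # SS Step 1: overt -> covert codec
    if len(covert_voice) != COVERT_PAYLOAD_BYTES:
        raise ValueError("unexpected covert payload size")
    padding = bytes(FREE_BYTES - len(stego_chunk))  # zero padding (assumption)
    # SS Steps 2-3: the payload keeps its original length, so the RTP header
    # does not need to change.
    return covert_voice + stego_chunk + padding


def sr_extract(modified_payload, to_overt):
    """SR side: split the payload, then restore an overt-codec payload."""
    covert_voice = modified_payload[:COVERT_PAYLOAD_BYTES]  # SR Step 1
    stego_chunk = modified_payload[COVERT_PAYLOAD_BYTES:]
    restored_payload = to_overt(covert_voice)  # SR Step 2: covert -> overt codec
    return restored_payload, stego_chunk


# Dummy stand-ins for real transcoders; they do NOT implement any codec.
def fake_to_covert(payload):
    return payload[:COVERT_PAYLOAD_BYTES]


def fake_to_overt(voice):
    return voice.ljust(OVERT_PAYLOAD_BYTES, b"\xd5")
```

The zero padding is a simplification: a practical implementation would need some framing convention so that the SR can tell where the steganogram ends.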
The SS and SR have limited influence on the choice of the overt codec, because both are located at intermediate network nodes. They must therefore either rely on the codec chosen by the overt, non-steganographic calling parties, or interfere with the choice of the overt codec during the signaling phase of the call, when codec negotiation takes place. In the first case, the SS and SR must choose the covert codec so as to maximize the achievable steganographic bandwidth while minimizing the steganographic cost.
This paper focuses on analyzing the best covert codec choices for the speech codecs that are currently the most popular ones utilized for IP telephony (overt codecs) in terms of steganographic bandwidth and cost.
4 TranSteg experimental results
4.1 Experiment methodology
In our experiments we emulated 20 unidirectional voice transmission channels. We took the information about location of speech activity (turn-taking patterns) and background noise from the LUNA corpus , containing real phone conversations between travelers and a public transport information line. Voice activity ranged from 40.5 to 67.5 %, with an average of 46.5 %. The speech signal was taken from the TSP Speech corpus  for English and CORPORA  for Polish. Each of these databases contains phonetically balanced sentences in the respective languages. In such a way we generated 20 one-minute recordings, 10 in English and 10 in Polish, sampled at 8 kHz with 16-bit resolution. Each language group consisted of five male and five female speakers.
G.711: a codec designed originally for fixed telephony, but also used in VoIP due to its simplicity and good speech quality. It is simply logarithmic quantization with 8 bits per sample, which at the 8 kHz sampling rate gives a bitrate of 64 kbps. The A-law variant, which is the most widely used worldwide, was examined in this study.
Speex: a code excited linear prediction (CELP)-based lossy codec designed specifically for VoIP applications . Although it allows wide-band and ultra-band transmissions, only the narrow-band variant was considered here. It offers 10 different compression levels corresponding to 10 different bitrates, of which three modes were selected: (i) mode 7, the highest mode designed for speech, working with a bitrate of 24.6 kbps, hereinafter called Speex(7), (ii) a moderate mode 4, requiring a bitrate of 11.0 kbps, hereinafter called Speex(4), and (iii) mode 2, which is the lowest recommended mode for speech, working at 5.95 kbps, called here Speex(2).
iLBC: another low-bitrate CELP-based codec designed for VoIP, using frame-independent long term prediction, thus making it resistant to packet losses . Depending on the analysis frame length (20 or 30 ms), it requires 15.2 or 13.33 kbps, respectively. Twenty-millisecond frames were used in this study.
G.723.1: a codec based on multi-pulse maximum likelihood quantization (MP-MLQ) and algebraic CELP (ACELP), offering bitrates of 6.4 and 5.3 kbps, respectively. In this study the 6.4 kbps option was used.
G.711.0: also known as G.711 LossLess Compression (LLC), this is a lossless extension of the G.711 codec, standardized by ITU-T in 2009. It works with various frame lengths (40–320 samples); in this study 160-sample (20 ms) frames were used. Because it is lossless, its bitrate is variable: the compression ratio depends on the actual voice data. It is also stateless, meaning that the encoding of a particular frame does not depend on the previous or the next frame, which makes it suitable for packet transmission, including the TranSteg technique.
G.726: an adaptive differential PCM (ADPCM) codec, standardized by ITU-T in 1990 , offering bitrates from 16 kbps up to 40 kbps. In this study we used the most common option, 32 kbps, which was already tried with TranSteg in .
GSM 06.10 (also known as GSM Full-Rate or GSM-FR): designed in the early 1990s for GSM telephony, but also used in VoIP. It is based on the regular pulse excitation-long term prediction (RPE-LTP) algorithm, combined with linear predictive coding (LPC).
Adaptive multi-rate (AMR): a codec adopted in 1999 as standard by 3GPP, used widely in GSM and UMTS . It is based on CELP, but also incorporates other techniques, such as discontinuous transmission (DTX) and comfort noise generation (CNG). It covers eight different bitrates, from 4.75 kbps up to 12.2 kbps. The highest 12.2 kbps mode, used further in this study, is compatible with ETSI GSM enhanced full-rate (EFR).
G.729: operates at a bitrate of 8 kbps, and is based on conjugate structure ACELP (CS-ACELP). Several annexes to the basic G.729 have been published so far. In this study we used Annex A, which has slightly lower computational requirements than the original G.729.
Emulations were conducted in the Matlab® 7.12 environment. The codecs’ functionality was implemented using the SoX toolbox version 14.3.2 , the G.723.1 Speech Coder and Decoder MATLAB toolbox, and reference implementations provided by ITU-T and iLBCfreeware.org.
Additionally, for the lossless covert codec G.711.0 we measured the bitrate, as it is a variable bitrate codec. We assumed that in configurations with G.711.0 as the covert codec, one byte in the payload field will be used for signaling to inform how many bytes in a given packet are used for the overt transmission.
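The per-packet arithmetic implied by this assumption can be sketched as follows. This is an illustrative Python fragment, not the measurement code used in the study: the function names are hypothetical, and the compressed frame lengths passed in would come from an actual G.711.0 encoder, which is not modelled here.

```python
FRAME_MS = 20           # frame duration used in this study
G711_FRAME_BYTES = 160  # 20 ms of G.711: 160 samples at 8 bits each
SIGNALING_BYTES = 1     # one byte tells how many payload bytes carry overt voice


def hidden_bytes(compressed_len):
    """Bytes left for the steganogram in one packet (never negative)."""
    free = G711_FRAME_BYTES - SIGNALING_BYTES - compressed_len
    return max(free, 0)


def stego_bandwidth_kbps(compressed_lens):
    """Average steganographic bandwidth over a sequence of packets."""
    if not compressed_lens:
        raise ValueError("at least one frame is required")
    total_bits = 8 * sum(hidden_bytes(n) for n in compressed_lens)
    duration_ms = FRAME_MS * len(compressed_lens)
    return total_bits / duration_ms  # bits per millisecond equals kbit/s
```

A single signaling byte is sufficient here because a 160-byte payload never needs a length value above 255.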
4.2 Experimental results
4.2.1 Results for steganographic bandwidth
Table 1 Steganographic bandwidth [kbps] for various sets of overt and covert codecs (unfeasible combinations are grayed out)
The pairs in which the covert codec required a higher bandwidth than the overt one were found to be unfeasible in the TranSteg technique and were therefore excluded from further experiments (in Table 1 they are grayed out). It is worth noting that the steganographic bandwidth depends strongly on the codec used in the overt channel—it is the highest for G.711, ranging from 32 kbps up to 58.08 kbps, whilst it is the lowest for G.723.1, Speex(4), and iLBC, allowing only a few kilobits per second. When the overt transmission uses Speex(2), steganographic transmission using the TranSteg technique is not possible at all, due to the low overt transmission bitrate (5.95 kbps). Considering only the bitrate, it turned out that the codecs G.711.0, G.726, and Speex(7) can serve as covert codecs only when the overt voice transmission is using G.711.
4.2.2 Results for steganographic cost
Initial voice transmission quality [MOS]—no TranSteg used
Overall voice quality [MOS] for various sets of overt and covert codecs (unfeasible combinations are grayed out)
Results of steganographic cost [MOS] for the tested variants, with confidence intervals (at 95 % confidence level)
4.2.3 Steganographic cost versus steganographic bandwidth
Based on the steganographic cost, the tested configurations were grouped into the following classes:
Class 0: no quality decrease; for configurations with steganographic cost lower than 0.1 MOS;
Class 1: minor quality decrease; for configurations with steganographic cost between 0.1 and 0.5 MOS;
Class 2: moderate quality decrease; for configurations with steganographic cost between 0.5 and 1.0 MOS.
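The class boundaries above translate directly into a small helper, sketched below in Python. The function name is hypothetical, and the treatment of values lying exactly on a boundary (0.1 and 0.5 assigned to the higher class, 1.0 and above left unclassified) is an assumption, since the text does not specify it.

```python
def cost_class(cost_mos):
    """Map a steganographic cost in MOS to Class 0, 1 or 2, or None if outside."""
    if cost_mos < 0.1:
        return 0  # no quality decrease
    if cost_mos < 0.5:
        return 1  # minor quality decrease
    if cost_mos < 1.0:
        return 2  # moderate quality decrease
    return None   # cost of 1.0 MOS or more: not covered by the three classes
```

For instance, a cost of 0.36 MOS falls into Class 1, while a cost of 0.85 MOS falls into Class 2.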
It must be emphasized that a high steganographic bandwidth does not always imply a high steganographic cost. For example, with G.711 in the overt channel we can use either GSM 06.10, Speex(4), or AMR in the covert channel, creating in each case a similar, capacious (ca. 50 kbps) steganographic channel. But GSM 06.10 and Speex(4) will cause a decrease of more than 0.85 MOS in the voice quality, while, remarkably, AMR will introduce a steganographic cost of only 0.36 MOS.
The experiments we ran helped to identify which codecs would provide better quality while providing similar bandwidth (or provide higher bandwidth while assuring similar quality). As a result, the configurations which we recommend in each class are underlined in Fig. 5. For example, for all the tested overt codecs, AMR introduced a much lower decrease in quality than GSM 06.10, even though they provide similar bandwidths. For iLBC in the overt channel, the recommended covert codecs are AMR and G.723.1, whilst the codecs Speex(4) and G.729 would provide lower steganographic bandwidth at a similar steganographic cost. The pairs which provided overall quality lower than 3 MOS (i.e. the ones with Speex(4) and G.723.1 as the overt codecs) are not recommended.
In general, we recommend only one pair in Class 0 (G.711 with lossless G.711.0), four pairs in Class 1, and five pairs in Class 2.
5 Conclusions and future work
TranSteg is a new steganographic method dedicated to multimedia services like IP telephony. In this paper the analysis of the influence of the selection of speech codecs on the performance of TranSteg hidden transmission was presented. By considering the codecs which are currently most popular for IP telephony we wanted to find out which codecs should be chosen for transcoding to minimize the negative influence on hidden data carriers while maximizing the obtained steganographic bandwidth.
The results favor G.711 as the overt codec, for several reasons:
high G.711 bitrate, so there is more space for hidden data;
high speech quality offered by G.711;
G.711 performs well if transcoded more than once (see Table 2), which we think is due to the fact that G.711 is a waveform codec; that is, it preserves the waveform shape;
while being a waveform codec, G.711 behaves well if further transcoded with other codecs, especially CELP-based ones (AMR, Speex in mode 7).
When experimenting with various combinations of overt and covert codecs, we observed that some codecs do not complement each other well. For example, Speex in mode 7 works significantly better (in terms of voice quality) with AMR than with GSM 06.10, even though the two TranSteg configurations result in similar steganographic bandwidths. A similar phenomenon was observed in other research projects, for example in speaker recognition from coded speech, in situations where there was a mismatch between the codec used in voice transmission and the codec used to create speakers’ models .
The choice of a covert codec depends on the actual application or, more precisely, on whether priority is given to higher steganographic bandwidth or to better speech quality. We recommend 10 pairs of overt/covert codecs which can be used effectively in TranSteg under various conditions, depending on the required steganographic bandwidth, the allowed steganographic cost, and the codec used in the overt transmission. We grouped those pairs into three classes based on the steganographic cost. The pair G.711/G.711.0 is costless, yet it offers a remarkably high steganographic bandwidth, on average more than 31 kbps. However, caution is needed, as the G.711.0 bitrate is variable and depends on the actual signal being transmitted in the overt channel.
The AMR codec operating in 12.2 kbps mode proved to be very efficient as the covert codec in TranSteg. It is a low bitrate codec which does not significantly degrade the quality: the steganographic cost ranged between 0.36 and 0.46 MOS.
In this research we showed results for scenario S4, which is the worst case scenario in terms of the speech quality, as it requires triple transcoding. If TranSteg scenarios S1–S3 were applied, we would avoid one or even two (in scenario S1) transcodings, and therefore steganographic cost would be lower than presented in this study.
Future work will include the development of a TranSteg-capable softphone that incorporates the speech codec selection results presented in this paper. Moreover, effective and efficient TranSteg detection methods will be pursued.
This research was partially supported by the Polish Ministry of Science and Higher Education and Polish National Science Centre under Grants: 0349/IP2/2011/71 and 2011/01/D/ST7/05054.
Open Access: This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.