The performance of TranSteg depends most notably on the characteristics of the pair of codecs mentioned in the Introduction: the overt codec, used originally to encode user speech, and the covert codec, used for transcoding. Depending on the hidden communication scenario, TranSteg may or may not be able to influence the choice of the overt codec. It is assumed that a covert codec can always be found for a given overt one. However, for very low bit rate codecs the steganographic bandwidth will be limited. Under ideal conditions the covert codec should:
-
not considerably degrade the user’s voice quality (through the transcoding operation and the delays it introduces) when compared with the quality of the overt codec,
-
provide the smallest achievable voice payload size, as this leaves the most free space in an RTP packet to convey a steganogram.
If there is a possibility to influence the overt codec (see the hidden communication scenarios below), in an ideal situation it should:
-
result in the largest possible voice payload size, to provide, together with the covert codec, the highest achievable steganographic bandwidth,
-
be commonly used, so as not to raise suspicion.
Taking the above into account, TranSteg’s steganographic bandwidth (SB) can be expressed as:
$$ SB = \left( PS_O - PS_C \right) \cdot PN_S \quad \left[ \text{bit/s} \right] $$
(3–1)
where $PS_O$ denotes the overt codec’s payload size, $PS_C$ is the covert codec’s payload size, and $PN_S$ is the number of RTP packets sent during 1 s.
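As a worked example of Eq. (3–1), the sketch below computes SB for one codec pair. The figures are assumptions chosen for illustration and are not taken from the text: G.711 at 64 kbit/s as the overt codec, G.726 at 32 kbit/s as the covert codec, and 20 ms of speech per RTP packet (50 packets per second). Payload sizes are expressed in bits so that the result is in bit/s; the function names are hypothetical.

```python
def payload_bits(bit_rate_bps: int, packet_ms: int) -> int:
    """Voice payload size in bits for a constant bit rate codec."""
    return bit_rate_bps * packet_ms // 1000


def steganographic_bandwidth(overt_bits: int, covert_bits: int,
                             packets_per_second: int) -> int:
    """Eq. (3-1): SB = (PS_O - PS_C) * PN_S, in bit/s."""
    if covert_bits > overt_bits:
        raise ValueError("covert payload must not exceed overt payload")
    return (overt_bits - covert_bits) * packets_per_second


if __name__ == "__main__":
    PACKET_MS = 20                            # assumed packetization interval
    ps_o = payload_bits(64_000, PACKET_MS)    # G.711 (assumed overt codec)
    ps_c = payload_bits(32_000, PACKET_MS)    # G.726 at 32 kbit/s (assumed)
    pn_s = 1000 // PACKET_MS                  # packets per second
    print(steganographic_bandwidth(ps_o, ps_c, pn_s), "bit/s")
```

Under these assumptions each packet leaves 640 bits free, which gives a steganographic bandwidth of 32 kbit/s.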
TranSteg can be utilized in four hidden communication scenarios (Fig. 4). The first scenario (S1 in Fig. 4) is the most common and typically the most desired: the sender and the receiver conduct a VoIP conversation while simultaneously exchanging steganograms end-to-end, so the conversation path is identical to the hidden data path. In the next three scenarios (marked S2-S4 in Fig. 4) only a part of the end-to-end VoIP path is used for hidden communication. Because the steganographic operations are carried out by intermediate nodes, the sender and/or the receiver are, in principle, unaware of the steganographic data exchange. Applying TranSteg to IP telephony connections makes it possible to preserve the users’ conversation while simultaneously transferring steganograms. As noted previously, this is especially important for scenarios S2-S4.
In the abovementioned scenarios it is assumed that a warden [9] performing detection (steganalysis) is not able to audit the speech carried in RTP packets because of the privacy issues involved. Thus, the presence of a steganogram inside the RTP packet payload can remain undiscovered. Other possibilities for detecting TranSteg are discussed in detail in subsection 3.3.
In the remainder of this section TranSteg is described with reference to the abovementioned scenarios. The most important factor in this context is whether the Steganogram Sender (SS) is located on the same host as the issuer of the RTP packets. If it is, the SS may be able to control the RTP stream transmitter; if it is located on an intermediate network node instead, it has no such control.
TranSteg may also be influenced by the use of the SRTP protocol, which provides the RTP stream with confidentiality and authentication. As mentioned in the Introduction, securing the RTP stream does not necessarily prevent the use of TranSteg. Such a mode of operation may even increase TranSteg’s undetectability; this effect is investigated further in the following subsections.
Steganogram sender controlling an RTP packet transmitter (scenarios S1 & S2)
In scenario S1 a steganogram is embedded into an RTP packet and travels along the entire path between the RTP stream sender and receiver. Thus, there is no need for transcoding: the user’s voice can be encoded directly with the desired covert codec, omitting the prior encoding with the overt codec. Despite this, the RTP stream will appear to have been encoded with the overt codec, because the voice payload size and the PT field in the RTP header remain those of the overt codec. It is assumed that the SS and SR have agreed in advance on the covert codecs that correspond to different overt codecs. Such a common mapping may, for example, bind the overt codec G.711 to the covert codec G.726, or Speex 24.6 kbit/s to Speex 8 kbit/s, etc.
Thus, the SS shall perform the following steps for the embedding of a steganogram (Fig. 5):
-
Step 1: Set the RTP payload size and the Payload Type (PT) field in the RTP header according to the chosen overt codec. These values indicate the use of the overt coding algorithm, which is not actually utilized.
-
Step 2: The voice encoded with the covert codec is inserted into the overt codec’s RTP payload field.
-
Step 3: The remaining free space is allocated for the hidden data and filled with a steganogram.
-
Step 4: The RTP packet is sent to the receiver.
When the modified TranSteg RTP stream reaches the SR, the SR extracts the voice payload and the steganogram from the consecutive packets. The voice payload is then used for speech reconstruction and the steganogram parts are concatenated. This preserves the conversation between the SS and SR and simultaneously enables hidden communication. To a third-party observer, even one able to physically monitor the activity of both users (e.g. by wiretapping both locations), it will look like a regular call.
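The embedding steps and the extraction at the SR can be sketched as follows. This is a minimal illustration under several assumptions that the text does not specify: the covert voice frame occupies the start of the payload and the steganogram the rest, unused space is zero-padded, and the RTP header is the fixed 12-byte form without CSRC entries or extensions. The function names `embed` and `extract` are hypothetical.

```python
import struct

RTP_HEADER_LEN = 12  # fixed RTP header; assumes no CSRC list, no extension


def embed(covert_frame: bytes, stego_chunk: bytes, overt_payload_size: int,
          payload_type: int, seq: int, timestamp: int, ssrc: int) -> bytes:
    """Build an RTP packet that looks like the overt codec (Steps 1-3)."""
    free = overt_payload_size - len(covert_frame)
    if free < 0:
        raise ValueError("covert frame is larger than the overt payload")
    if len(stego_chunk) > free:
        raise ValueError("steganogram chunk does not fit in the free space")
    # Step 1: header advertises the overt codec's payload type (version 2).
    header = struct.pack("!BBHII", 0x80, payload_type & 0x7F,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    # Steps 2-3: covert voice first, then hidden data padded to full size.
    return header + covert_frame + stego_chunk.ljust(free, b"\x00")


def extract(packet: bytes, covert_frame_size: int):
    """SR side: split the payload into (voice frame, steganogram chunk)."""
    payload = packet[RTP_HEADER_LEN:]
    return payload[:covert_frame_size], payload[covert_frame_size:]


if __name__ == "__main__":
    frame = bytes(80)                     # placeholder 80-byte covert frame
    pkt = embed(frame, b"hidden", 160, 0, seq=1, timestamp=160, ssrc=0x1234)
    voice, hidden = extract(pkt, 80)
    print(len(pkt), hidden.rstrip(b"\x00"))
```

Because the padding is indistinguishable from hidden data, a practical design would also need some framing (for example a length field) inside the hidden channel; that is left out here.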
To further mask the presence of TranSteg, the SS can utilize the SRTP protocol to encrypt the whole RTP payload, i.e. both the voice coded with the covert codec and the steganogram, thus making the detection of steganography even harder (see Fig. 2).
The hidden communication scenario S1 offers the most flexibility and is advantageous when compared with the remaining ones, because:
-
SS can choose the overt codec and thus influence the resulting steganographic bandwidth.
-
The delays introduced by TranSteg to the RTP stream are the smallest in this scenario, as no time is spent on transcoding (the voice is encoded directly with the covert codec).
-
This scenario does not assume any required path of communication that the RTP stream should follow.
-
To enable the use of TranSteg, it is only necessary to modify the IP telephony client. Notably, the RTP protocol is usually implemented in software, which means it can be easily modified. No modifications to other protocols are required (i.e. to the UDP and frame checksums).
-
The RTP stream can additionally be secured with the SRTP protocol, which can be utilized to mask the contents of the transcoded voice data and the steganogram.
The main difference between scenario S2 and S1 is that the SR is situated at some intermediate network node. Thus, the IP telephony conversation takes place between the SS (caller) and a callee who is unaware of the steganographic procedure. The assumption in this scenario is that the SR is able to intercept and analyze all RTP packets exchanged between the SS and the callee. The TranSteg procedure for the SS remains the same as in scenario S1. What changes is the behaviour of the SR.
When the tampered RTP stream reaches the SR, the SR performs the following steps:
-
Step 1: It extracts the voice payload and the steganogram from the RTP packets.
-
Step 2: The voice payload is transcoded from the covert to the overt codec and placed once again in consecutive RTP packets. In doing so, the steganogram is overwritten with user voice data.
-
Step 3: Checksums for the lower layer protocols (i.e. the UDP checksum and the CRC at the data link layer, if they are in use) are adjusted.
-
Step 4: Modified frames with encapsulated RTP packets are sent to the receiver (callee).
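Step 3 can be illustrated for the UDP checksum. The sketch below recomputes it over the IPv4 pseudo-header, the UDP header and the new payload, following the standard ones' complement algorithm of RFC 768. It assumes IPv4 and a payload that is the complete RTP packet; the function names, addresses and ports are hypothetical. The data link CRC is not shown, since it is typically regenerated by the network interface when the frame is sent.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"                    # pad to a whole 16-bit word
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:                     # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def _pseudo_and_header(src_ip, dst_ip, src_port, dst_port, payload, checksum):
    length = 8 + len(payload)              # UDP header + data
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, length))   # 17 = UDP
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return pseudo + header + payload


def udp_checksum_ipv4(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> int:
    """Checksum to place in the UDP header after the payload was rewritten."""
    data = _pseudo_and_header(src_ip, dst_ip, src_port, dst_port, payload, 0)
    return internet_checksum(data) or 0xFFFF   # 0 means 'no checksum' in UDP


def udp_checksum_is_valid(src_ip, dst_ip, src_port, dst_port,
                          payload: bytes, checksum: int) -> bool:
    """A datagram is valid when the sum including the checksum folds to 0."""
    data = _pseudo_and_header(src_ip, dst_ip, src_port, dst_port,
                              payload, checksum)
    return internet_checksum(data) == 0


if __name__ == "__main__":
    new_payload = bytes(172)               # placeholder re-transcoded packet
    c = udp_checksum_ipv4("192.0.2.1", "192.0.2.2", 5004, 5004, new_payload)
    print(hex(c))
```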
Requiring the IP telephony connection to be secured with the SRTP protocol does not prevent the use of TranSteg. The session keys used for authentication and encryption are exchanged between the calling parties before the conversation phase of the call and are therefore known in advance to the SS. When the SS initiates an RTP stream, the first RTP packets contain transcoded voice but are intentionally not encrypted. Instead of a steganogram, they carry the cryptographic keys that were negotiated between the SS and the callee. The keys do not necessarily have to be carried inside the payload field, as this can raise security issues. A better solution is to use advanced techniques such as MLS (Multilevel Steganography) [10]. In MLS, at least two steganographic methods are utilised simultaneously in such a way that the network traffic of one method (the upper-level method) serves as a carrier for the second (the lower-level method). In such a scenario TranSteg is used as the upper-level method, and the steganographic bandwidth of the lower-level method is used to carry the cryptographic key. Once the keys have been extracted, the SR is able to encrypt the transcoded voice payload before forwarding it to the receiver of the RTP packets, so the receiving party will not be aware of the steganographic procedure. In addition, the SR will be capable of performing bidirectional hidden communication.
To summarize scenario S2:
-
SS can still choose the overt codec and thus influence the resulting steganographic bandwidth.
-
The delays introduced by TranSteg to the RTP stream depend on one transcoding operation.
-
It is assumed that the SR is on the communication path between the calling parties and is able to oversee the whole RTP stream.
-
TranSteg requires certain protocol modifications at the SR: to RTP and to other network protocols (e.g. UDP or data link layer protocols).
-
Utilization of SRTP between the calling parties is not an obstacle for TranSteg. As in scenario S1, it can be viewed as a means to further mask hidden communication.
Steganogram sender located at an intermediate network node (scenarios S3 & S4)
In scenario S3, the assumption is that the SS is able to intercept and analyse all RTP packets exchanged between the caller and the callee. The SS does not control the RTP packet transmitter, so it cannot pick a suitable overt codec. However, the SR is a legitimate (overt) receiver of the RTP stream. It is therefore able to influence the choice of the overt codec by negotiating it during the signalling phase of the call, with the calling party remaining unaware of the steganographic procedure. The behaviour of the SS is similar to the behaviour of the SR in scenario S2 (see Sec. 3.1). The only difference is that the SS is responsible for the transcoding from the overt to the covert codec and for embedding the steganogram; the remaining steps are the same. Thus, the SS behaves as follows:
-
Step 1: For an incoming RTP stream it transcodes the user’s voice data from the overt to the covert codec.
-
Step 2: The transcoded voice payload is placed once again in an RTP packet.
-
Step 3: The remaining free space of the RTP payload field is filled with the steganogram’s bits (thus the original voice payload is erased).
-
Step 4: Checksums in lower layer protocols (the UDP checksum and the CRC at the data link layer) are adjusted.
-
Step 5: Modified frames with encapsulated RTP packets are sent to the receiver (SR).
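Steps 1 to 3 at an intermediate SS can be sketched as a rewrite of an intercepted packet that keeps the RTP header and total payload size unchanged. The transcoder is passed in as a callable because no specific codec implementation is assumed here; the function names, the payload layout (covert voice first, hidden data after it) and the zero padding are assumptions. The header length is derived from the CC and X fields as defined for RTP; the padding (P) bit is ignored for brevity.

```python
import struct
from typing import Callable


def rtp_header_length(packet: bytes) -> int:
    """Length of the RTP header, including CSRC list and header extension."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed RTP header")
    first = packet[0]
    if first >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    length = 12 + 4 * (first & 0x0F)        # CC = number of CSRC entries
    if first & 0x10:                        # X bit: header extension present
        if len(packet) < length + 4:
            raise ValueError("truncated header extension")
        (ext_words,) = struct.unpack_from("!H", packet, length + 2)
        length += 4 + 4 * ext_words
    if len(packet) < length:
        raise ValueError("truncated RTP header")
    return length


def ss_rewrite_packet(packet: bytes,
                      transcode: Callable[[bytes], bytes],
                      stego_chunk: bytes) -> bytes:
    """Replace the overt voice payload with covert voice plus hidden data."""
    hlen = rtp_header_length(packet)
    header, overt_payload = packet[:hlen], packet[hlen:]
    covert = transcode(overt_payload)                  # Step 1
    free = len(overt_payload) - len(covert)
    if free < 0:
        raise ValueError("covert payload larger than overt payload")
    if len(stego_chunk) > free:
        raise ValueError("steganogram chunk does not fit")
    hidden = stego_chunk.ljust(free, b"\x00")          # Step 3
    return header + covert + hidden                    # Step 2: same size


if __name__ == "__main__":
    # Stand-in 'transcoder' that merely halves the payload; NOT a real codec.
    fake_transcode = lambda payload: payload[::2]
    pkt = bytes([0x80, 0]) + bytes(10) + bytes(160)
    print(len(ss_rewrite_packet(pkt, fake_transcode, b"hidden")))
```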
SR’s operation is solely limited to extraction and analysis of the voice payload and steganogram from consecutive RTP packets (it is the same behaviour as in scenario S1, see Sec. 3.1).
In the presence of SRTP the use of TranSteg is not compromised in this scenario: the conditions and the solution (sharing the cryptographic key between the SR and the SS by means of TranSteg) are similar to those in scenario S2 (see Sec. 3.1). The only difference is that the SR, after establishing the cryptographic key for SRTP purposes, sends it to the SS in the first RTP packets. These packets are intentionally not encrypted and carry the transcoded voice and the SRTP cryptographic key. SR retrieves the key and is responsible for performing re-transcoding and ciphering of the resulting voice payload.
To summarize, in scenario S3:
-
SR is responsible for influencing the choice of the overt codec by negotiating it with the calling party (who is unaware of the steganographic procedure).
-
The delays introduced to the RTP stream by TranSteg depend on one transcoding operation.
-
It is assumed that the SS is on the communication path between the calling parties and is able to oversee the whole RTP stream.
-
TranSteg requires certain protocol modifications at the SS: to RTP and to other network protocols (e.g. UDP or data link layer protocols).
-
Utilization of SRTP between the calling parties is not an obstacle for TranSteg. As in scenario S1, it can be viewed as a means to further mask hidden communication.
In scenario S4 it is assumed that both the SS and the SR are able to intercept and analyze all RTP packets exchanged between the calling parties. Because both are located at intermediate network nodes (Fig. 6), neither can influence the choice of the overt codec at all. They are bound to rely on the codec chosen by the overt, non-steganographic calling parties. This can result in a low steganographic bandwidth, as the hidden communication parties must adjust the covert codec to the negotiated overt codec. The most significant advantage of this TranSteg scenario is that aggregated IP telephony traffic can be used to transfer steganograms. If both the SS and the SR have access to more than one VoIP call, the achievable steganographic bandwidth can be significantly increased, which can compensate for the loss caused by the inability to influence the choice of the overt codec.
The behavior of the SS and the SR is similar, as both perform transcoding: the SS from the overt to the covert codec, and the SR from the covert to the overt codec. The steganogram is exchanged only along the part of the communication path where the RTP stream travels “inside” the network; it never reaches the endpoints. The steps for the SS are exactly the same as in scenario S3 (see above), and the SR follows the logic presented in scenario S2 (see Sec. 3.1). It is also worth noting that, in this scenario, the use of the SRTP protocol to secure the conversation entirely precludes the use of TranSteg.
To summarize scenario S4:
-
It is assumed that both the SS and the SR are on the communication path between the calling parties and are able to oversee the whole RTP stream.
-
Neither SS nor SR can influence the choice of the overt codec, which potentially leads to a decrease in the steganographic bandwidth.
-
There is a possibility to use aggregated VoIP traffic at the path between SS and SR, and thus significantly increase TranSteg’s steganographic bandwidth.
-
Neither SS nor SR is involved in the IP telephony conversation as an overt calling party. Thus, hidden communication between the SS and SR is harder to detect than in the previously described scenarios (since neither is an initiator of the overt traffic).
-
The delays introduced to the RTP stream are the highest compared with the other presented scenarios (due to the two transcoding operations).
-
TranSteg requires certain protocol modifications at both the SS and the SR; these involve modifications to RTP and to other network protocols (e.g. UDP or data link layer protocols).
-
Utilization of SRTP to secure the conversation makes the use of TranSteg impossible.
In all of the presented scenarios S1-4, RTP packet losses, which are a natural phenomenon in IP networks, can make the successful extraction of the steganogram at the SR impossible. An additional protocol in the hidden channel may therefore be required to provide reliability. One solution is the approach proposed by Hamdaqa and Tahvildari [15], which can be easily adapted for TranSteg purposes. It provides a reliability and fault tolerance mechanism based on a modified (k, n) threshold scheme that relies on Lagrange interpolation, and the results presented in that paper show that it also increases the complexity of steganalysis. The cost of the extra reliability is always the loss of some fraction of the steganographic bandwidth.
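To illustrate the idea of a (k, n) threshold built on Lagrange interpolation, the sketch below implements the classic Shamir scheme over a prime field. It is not the modified scheme of [15], whose details are not given here; it only shows how any k of n shares, for example carried in n different RTP packets, suffice to recover a hidden value despite losses. The prime, the function names and the mapping of shares to packets are assumptions.

```python
import random

PRIME = 2**127 - 1  # a Mersenne prime; the field size is an assumption


def make_shares(secret: int, k: int, n: int, rng=None):
    """Split `secret` into n shares; any k of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    rng = rng or random.SystemRandom()
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):          # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        inv_den = pow(den, PRIME - 2, PRIME)  # modular inverse (Fermat)
        secret = (secret + yi * num * inv_den) % PRIME
    return secret


if __name__ == "__main__":
    shares = make_shares(123456789, k=3, n=5)
    received = [shares[0], shares[3], shares[4]]   # two packets were lost
    print(recover_secret(received))
```

In this plain scheme every share is as large as the secret itself, which illustrates the bandwidth cost of reliability mentioned above.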
TranSteg detection
As mentioned at the beginning of Section 3, we assume that during TranSteg-based hidden communication there is a warden executing detection (steganalysis) methods. We further assume that the warden will not be able to “physically listen” to the speech carried in RTP packets because of the privacy issues involved. This means that the warden will be capable of capturing and analysing the payload of each RTP packet, but not of replaying the conversation (its content).
However, it must be emphasised that, if the SRTP protocol is used to secure a TranSteg conversation, the warden will fail to detect the presence of steganograms in the RTP stream in any of the scenarios discussed below (with the exception of S4).
To perform hidden communication, TranSteg uses modifications to PDUs (Protocol Data Units) as a carrier, more precisely modifications to the RTP payload field. Unlike other steganographic VoIP methods that, for example, alter the order of RTP packets or the delays between them, TranSteg does not introduce any irregularities to the RTP stream.
The successful detection of TranSteg depends mainly on where the warden is able to inspect the traffic and on the hidden communication scenario in use.
If the warden is capable of inspecting traffic solely in a single network, e.g. in the LAN (Local Area Network) of the overt transmitter or receiver, then detection is hard to accomplish. This is because an RTP stream observed at a single traffic inspection point resembles legitimate streams. The remaining cases are discussed below.
In scenario S1, the format of the voice payloads does not change as they traverse the network. Even if the warden monitored traffic in different networks, the result would always be the same, so the chances of detecting TranSteg are very limited.
In scenarios S2 and S3 there is one transcoding operation; therefore, the modification of the RTP packets’ payload can be detected if the warden is able to probe and compare traffic at two locations: before and after the transcoding. However, it must be emphasized that the same applies to other existing VoIP steganographic methods, i.e. in these scenarios all data hiding is easier to detect.
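The comparison of traffic from two locations can be sketched as matching packets by SSRC and sequence number and checking whether their payloads differ. This is only an illustration of the idea under stated assumptions: the fixed 12-byte RTP header is assumed, the function names are hypothetical, and a high ratio does not by itself prove steganography, since any legitimate transcoding on the path would produce the same effect.

```python
import struct

RTP_HEADER_LEN = 12  # assumes no CSRC entries and no header extension


def index_capture(packets):
    """Map (SSRC, sequence number) -> payload for a list of raw RTP packets."""
    index = {}
    for pkt in packets:
        if len(pkt) < RTP_HEADER_LEN:
            continue                          # skip anything too short
        (seq,) = struct.unpack_from("!H", pkt, 2)
        (ssrc,) = struct.unpack_from("!I", pkt, 8)
        index[(ssrc, seq)] = pkt[RTP_HEADER_LEN:]
    return index


def changed_payload_ratio(capture_a, capture_b) -> float:
    """Fraction of packets seen at both points whose payload was modified."""
    a, b = index_capture(capture_a), index_capture(capture_b)
    common = a.keys() & b.keys()
    if not common:
        raise ValueError("no packets observed at both locations")
    changed = sum(1 for key in common if a[key] != b[key])
    return changed / len(common)


if __name__ == "__main__":
    def rtp(seq, payload, ssrc=0x1234):
        return struct.pack("!BBHII", 0x80, 0, seq, 160 * seq, ssrc) + payload

    before = [rtp(0, b"aaaa"), rtp(1, b"bbbb")]
    after = [rtp(0, b"aaaa"), rtp(1, b"XXXX")]
    print(changed_payload_ratio(before, after))
```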
In scenario S4, there are three possible locations where the warden can inspect RTP traffic. They are marked as 1, 2 and 3 in Fig. 6. If the warden can monitor traffic in networks 1 and 2, or in networks 2 and 3, the detection capabilities are the same as for scenarios S2 and S3.
If the warden is able to inspect the RTP stream in networks 1 and 3, where the voice format should be the same, some slight differences can be noted because of the transcoding operations. This case is further investigated in Section 4. This scenario can potentially induce the largest voice quality degradation because of the two necessary transcoding operations. However, even in these circumstances TranSteg is superior to other VoIP steganography methods, because re-transcoding at the SR restores voice data that is practically the same as the data originally sent. Thus, all evidence of the steganogram is wiped out. With other steganographic methods the steganogram can be extracted and removed, but the original data cannot be restored because it was erased when the steganogram was embedded. That is why TranSteg is harder to detect.
Communication via TranSteg can be thwarted by certain actions undertaken by wardens. The method can be defeated by applying random transcoding to every non-encrypted VoIP connection to which the warden has access. Alternatively, only suspicious connections may be subjected to transcoding. However, such an approach would lead to a deterioration in the quality of conversations. It must be emphasised that not only steganographic calls would be affected: non-steganographic calls could also be “punished”. In Section 4 we provide guidelines for pinpointing suspicious IP telephony connections: we investigate the distribution of RTP payload byte values as a possible indication of TranSteg utilization. It is worth noting that this approach will fail if the SRTP protocol is applied.
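One simple way to look at the distribution of RTP payload byte values is to build a byte histogram over a stream and summarise it with the Shannon entropy. The sketch below is only an illustration of that idea; the choice of entropy as the statistic and the function names are assumptions, not the method evaluated in Section 4.

```python
import math
from collections import Counter


def byte_histogram(payloads) -> Counter:
    """Count how often each byte value (0-255) occurs across all payloads."""
    hist = Counter()
    for payload in payloads:
        hist.update(payload)      # iterating over bytes yields integers
    return hist


def shannon_entropy(hist: Counter) -> float:
    """Entropy of the byte distribution in bits per byte (0.0 to 8.0)."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    entropy = 0.0
    for count in hist.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


if __name__ == "__main__":
    # Placeholder payloads; a warden would use payloads from a captured call.
    stream = [bytes(range(256)), bytes(160)]
    print(round(shannon_entropy(byte_histogram(stream)), 3))
```

A warden could compare such a statistic for a suspicious stream against the values observed for known-legitimate streams that use the same overt codec.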
Due to the above, it is necessary to explore other possibilities that could facilitate the development of an efficient detection method satisfying the constraints of real-time VoIP operation. One promising research direction is the adoption of the method proposed by Wright et al. in [47], which can be applied to SRTP-encrypted payloads. However, this technique is only applicable to variable bit rate codecs. The authors of that work discovered that the lengths of encrypted RTP packets can be used to identify phrases spoken within a call. If extended, this approach could therefore be applied to deduce the characteristics of the employed speech codec, which would increase the probability of detecting covert communication.
A summary and comparison of the hidden communication scenarios with respect to TranSteg functioning and detection (described in Sections 3.1–3.3) is presented in Table 1.
Table 1 Comparison of hidden communication scenarios (S1-4)
Intentional attacks aimed at removing the steganogram from a VoIP call in which TranSteg is applied will likely be unsuccessful, because the steganogram can be spread across the payload field and “mixed” with the voice data. However, TranSteg utilization can be limited by performing intentional, real transcoding inside the network. This destroys the steganogram while having little impact on non-steganographic VoIP users.