Skype fundamentals
A Skype network forms a hierarchical P2P topology with a single centralized element (the login server). It is built from two types of nodes [7]: (i) Ordinary Nodes (ONs), which can start and receive a call, send instant messages and transfer files; (ii) special-purpose nodes, known as Super Nodes (SNs), which are responsible for helping ONs find and connect to each other within the Skype network. The login server is responsible for the authentication of ONs and SNs before they access the Skype network. Historically, every user’s device with sufficient resources (in terms of CPU and bandwidth) and a public IP address could become an SN. However, it is believed that since 2012 Microsoft itself has hosted about 100,000 SNs [19], so the rest of the user nodes are ONs.
Typically, Skype offers two communication modes, namely: End-to-End (E2E) and End-to-Out (E2O). The former takes place between two Skype clients within the IP network, while the latter occurs if one of the endpoints is a Skype client within the IP network and the other is a PSTN phone (SkypeIn/SkypeOut services). In this paper, we focus solely on the Skype E2E mode.
As already mentioned, Skype is based on proprietary protocols and makes extensive use of cryptography, so that neither the signalling messages nor the packets that carry voice data can be uncovered. Moreover, it utilizes various obfuscation and anti-reverse-engineering procedures. Thus, all the information about its traffic characteristics, protocols and behaviours comes from numerous measurement studies (e.g., [5, 6]). However, it must be noted that they mostly focus on characterizing voice-call traffic.
Typically, the preferred, first-choice transport protocol for Skype is the User Datagram Protocol (UDP), which the traffic analysis in [6] showed to be used for about 70 % of all calls. However, if Skype is unable to connect using UDP, it falls back to TCP (this is true especially for audio streams). In this paper, we focus solely on UDP-based Skype video calls.
For TCP-based transport, the entire Skype message is encrypted, while in the case of UDP-based transport the beginning of each datagram’s payload contains an unencrypted header called the Start of Message (SoM). The SoM is needed to restore the sequence of packets that was originally transmitted, to detect a loss, and to quickly distinguish the type of data that is carried inside the message. The SoM consists of the following two fields [5] (Fig. 1): (i) ID (2 bytes), which is used to uniquely identify the message; it is randomly selected by the sender for a query and copied by the receiver into the reply; (ii) Fun (1 byte), which describes the payload type. For example, the values 0x02, 0x03, 0x07 and 0x0f are typically used to indicate signalling messages (used during the login phase or for connection management), while 0x0d indicates a DATA message that can contain encoded voice or video blocks, chat messages or fragments of files.
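The SoM layout described above can be illustrated with a minimal sketch. It assumes that the UDP payload is already available as a byte string; the names `StartOfMessage`, `parse_som` and `classify` are hypothetical, the ID is kept as two raw bytes because its byte order is not stated here, and the Fun byte is compared for exact equality, which is a simplification.

```python
from dataclasses import dataclass

# Fun values quoted in the text; anything else is reported as "unknown".
SIGNALLING_FUNS = {0x02, 0x03, 0x07, 0x0F}
DATA_FUN = 0x0D


@dataclass(frozen=True)
class StartOfMessage:
    msg_id: bytes  # 2-byte ID, kept as raw bytes (byte order not assumed)
    fun: int       # 1-byte payload type


def parse_som(payload: bytes) -> StartOfMessage:
    """Read the 3-byte unencrypted header at the start of a UDP payload."""
    if len(payload) < 3:
        raise ValueError("payload too short for a 3-byte SoM")
    return StartOfMessage(msg_id=payload[:2], fun=payload[2])


def classify(som: StartOfMessage) -> str:
    """Map the Fun byte to the message classes named in the text."""
    if som.fun == DATA_FUN:
        return "data"
    if som.fun in SIGNALLING_FUNS:
        return "signalling"
    return "unknown"
```

For example, `classify(parse_som(b"\x12\x34\x0d..."))` would report a DATA message; the rest of the payload is encrypted and is not interpreted.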
From the video traffic perspective, which is the core interest of the proposed steganographic method, after 2005 Skype utilized the True Motion VP7 codec [20], and from version 5.5 in early 2011 it moved to VP8. Both video codecs were produced by On2 Technologies for Google. Thus, when Microsoft acquired Skype, the video codec was changed to H.264 [21].
It must be also noted that the Skype video call quality can be of the following standards (https://support.skype.com/en/faq/FA597/what-do-i-need-to-make-a-video-call):
- Standard, which means that the utilized video resolution will be 320 × 240 pixels and the video signal will be sent at a rate of 15 fps.
- High-quality, in which the resolution is 640 × 480 pixels and the frame rate is 30 fps.
- HD, which is characterized by the highest resolution of 1280 × 720 pixels and the same rate of 30 fps.
As for any typical network steganography method, the general rule is that the more traffic there is, the higher the potential steganographic bandwidth. Thus, for Skype video traffic the steganographic bandwidth can be expected to be higher than for existing information hiding methods for audio streams.
Skype video measurements test-bed and traffic analysis
An analysis of Skype video traffic was performed to prove that the proposed steganographic method is feasible. For this, an experimental test-bed was set up in order to analyse the traffic and then to evaluate YouSkyde (Fig. 2). The test-bed included two Skype clients and a Linux-based application, designed and developed by the authors, that could intercept Skype packets before they reached (for the transmitting side) and after they entered (for the receiving side) the network interface. Two instances of this application were synchronized using the Network Time Protocol (NTP) and were responsible for the generation of reports regarding the video packets’ statistics.
The measurement procedure was as follows. First, on the transmitting side a virtual video device was created using v4l2loopback (https://github.com/umlaeute/v4l2loopback). This was then used as the video device for Skype and allowed the chosen video files to be played as inputs. Second, GStreamer (http://gstreamer.freedesktop.org), which is a multimedia framework, was utilized as a server to transmit the video stream to the virtual video device created in the previous step. On the receiving side SimpleScreenRecorder (http://www.maartenbaert.be/simplescreenrecorder) was applied, which is an application that allows part of the computer screen to be captured. As we were unable to capture the received video signal directly from Skype, it was necessary to acquire the part of the screen where the video call was displayed. This of course meant a potential degradation of the calculated video quality. The captured traffic as well as the sent and received video sequences were then subjected to analysis. Video quality was assessed using the MSU Video Quality Measurement Tool (VQMT) (http://compression.ru/video/quality_measure/info_en.html#start), which was designed for objective video signal quality evaluation. This tool was developed at the Graphics and Media Lab at M.V. Lomonosov Moscow State University, Russia. It allows a wide spectrum of video quality metrics to be determined that belong to one of two groups: mathematically defined metrics or metrics that have similar characteristics to the Human Visual System (HVS). For the purposes of this paper, we decided to calculate the Peak Signal-to-Noise Ratio (PSNR), which belongs to the first group and is currently the most popular and widely used metric, and the Multi-Scale Structural Similarity Index Metric (MS-SSIM) and Video Quality Metric (VQM) as the representatives of the second group. VQM, especially, has been proven to be well correlated with subjective video quality assessment and has been adopted by ANSI as an objective video quality standard.
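To make the mathematically defined metric concrete, the following is a minimal sketch of the textbook PSNR computation for one pair of frames. It is not the VQMT implementation: the function name `psnr`, the flat list-of-samples frame representation and the 8-bit peak value of 255 are assumptions made for illustration.

```python
import math


def psnr(reference, received, max_value=255):
    """Textbook PSNR in dB between two equally sized sample sequences.

    Frames are assumed to be flat sequences of 8-bit luminance samples.
    """
    if len(reference) != len(received) or not reference:
        raise ValueError("frames must be non-empty and of equal size")
    # Mean squared error between corresponding samples.
    mse = sum((a - b) ** 2 for a, b in zip(reference, received)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10 * math.log10(max_value ** 2 / mse)
```

For a video sequence, a common choice is to compute this value per frame and average it over all frames; whether the tool used here aggregates in exactly this way is not stated in the text.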
For the video testing sequence we decided to utilize video from the Video Trace Library of Arizona State University, USA (http://trace.eas.asu.edu/yuv/akiyo/). The sequence is of a news presenter speaking, which reflects the format of a typical Skype video call: a static background and an (almost static) upper part of the person in the middle of the screen. The video file was resized, replicated and merged to represent a few minutes of a videoconference call.
Using the test-bed presented in Fig. 2 and the abovementioned tools and quality metrics, a number of measurements were carried out on the Skype traffic. The averaged results for this study are presented below. It should also be noted that we limited our experiments to the standard Skype mode. Therefore, the input video stream was transmitted to the virtual video device at a rate of 15 fps and a resolution of 320 × 240 pixels; the duration of a single video call was between 3 and 5 min.
For further study, it is vital to establish the reference quality of the received video signal in the test-bed presented in Fig. 2 without applying the steganographic method. This will later allow the impact of the information hiding technique on the transmission quality from the user’s perspective to be assessed. Table 1 presents the experimental results obtained for the chosen video quality metrics.
Table 1 Quality metrics for reference video measurements
For the chosen quality metrics the acceptable values are:
- For PSNR, about 30 dB.
- For VQM, the scale is between 0 and 5, and the lower the value the better.
- For MS-SSIM, the values lie between −1 and 1, and the more similar the transmitted and received video sequences, the higher the value.
Therefore, it can be concluded that the Skype video call in the tested experimental setup was of good quality, and it can be further utilized to evaluate the proposed steganographic method. It must be also emphasized that the exact values of the quality metrics are not as important as the decrease in metric values caused by applying the information hiding technique.
Skype video traffic measurements
Two basic characteristics of the Skype video call traffic were measured: the packet size distribution and the packet rate distribution.
During a typical Skype videoconference call two main streams are sent: the audio and video streams. Thus, it is first vital to identify the packets that form a video stream in the videoconference call. Figure 3 presents the packet size distribution of a typical Skype connection.
Analysis of Fig. 3a reveals that during the videoconference call three different streams can be distinguished. The first stream is characterized by a packet size of 31 bytes, sent at a fixed interval of 20 s. This stream is responsible for keeping Network Address Translation (NAT) bindings alive on intermediary network devices. The second stream is formed by packets with a size of about 100 bytes and the third by packets with sizes of between 200 and 1400 bytes. If we analyse the distribution of the packet size of the voice-only call (Fig. 3b), we clearly see that the second stream is related to the voice stream and the third to video traffic transmission. By closely inspecting the data in Fig. 3b we can conclude that the voice packets are smaller than 180 bytes. Therefore, for the rest of the experiments we assumed that every DATA packet of a size greater than 180 bytes formed part of the Skype video stream.
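The size-based separation of the three streams can be sketched as follows. This is only an illustration of the thresholds quoted above: the function name `classify_by_size` is hypothetical, it is not specified here whether the sizes refer to the UDP payload or to the whole packet, and treating a packet of exactly 180 bytes as audio is an assumption.

```python
from collections import Counter

KEEPALIVE_SIZE = 31     # bytes, NAT keep-alive stream
VIDEO_THRESHOLD = 180   # bytes, DATA packets above this are treated as video


def classify_by_size(size: int) -> str:
    """Assign a packet to one of the three observed streams by its size."""
    if size == KEEPALIVE_SIZE:
        return "keepalive"
    if size > VIDEO_THRESHOLD:
        return "video"
    return "audio"


def stream_histogram(sizes):
    """Count packets per stream for a sequence of packet sizes."""
    return Counter(classify_by_size(s) for s in sizes)
```

Applied to a capture, `stream_histogram` gives a quick check that the three groups seen in the size distribution are populated as expected.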
Moreover, it was experimentally verified that the bitrate for what was assumed to be the audio stream is about 40 kbps, and for the video stream it is about 90 kbps. Additionally, the average total rate of the DATA packets was 65 pkt/s. Split at the 180-byte threshold, the rate was about 50 pkt/s for the smaller packets (audio stream) and about 15 pkt/s for the larger packets (video stream). From previous Skype measurements it is known that the audio stream is sent at 50 pkt/s ([1, 7]), which confirms the correct selection of the DATA packet size for distinguishing the data streams. This also indicates that each video frame is carried in a single packet in the video stream; hence, the packet rate is equal to the frame rate (15 fps).
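The reported bitrates and packet rates can be cross-checked with simple arithmetic. The sketch below assumes that 1 kbps equals 1000 bit/s and ignores the question of which headers are included in the measured bitrate; the helper name `mean_packet_size` is hypothetical.

```python
def mean_packet_size(bitrate_kbps: float, packets_per_second: float) -> float:
    """Average bytes per packet implied by a bitrate and a packet rate."""
    return bitrate_kbps * 1000 / 8 / packets_per_second


audio_bytes = mean_packet_size(40, 50)  # about 100 bytes per audio packet
video_bytes = mean_packet_size(90, 15)  # about 750 bytes per video packet
```

Both values are consistent with the size distribution discussed above: audio packets of about 100 bytes, and video packets well above the 180-byte threshold.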
Next, we wanted to determine how Skype video traffic reacts when intentional packet losses are introduced. This allowed us to establish how Skype compensates for losses. The results of the experiments are presented in Fig. 4; for these measurements the loss level has been increased by a 1 % step every 10 s.
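The loss schedule used in these measurements can be written as a small step function. This is a sketch only: it assumes that the schedule starts at 0 % at time zero, and the name `loss_level_percent` and its parameters are hypothetical.

```python
def loss_level_percent(elapsed_s: float,
                       step_percent: float = 1.0,
                       step_interval_s: float = 10.0,
                       max_percent: float = 100.0) -> float:
    """Loss level after elapsed_s seconds, raised by one step per interval."""
    if elapsed_s < 0:
        raise ValueError("elapsed time must be non-negative")
    steps = int(elapsed_s // step_interval_s)
    return min(max_percent, steps * step_percent)
```

Under these assumptions a 20 % loss level is reached after 200 s of the experiment.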
From Fig. 4a, it can be concluded that the size of the packets fluctuates as the packet losses change (the average size is 766 bytes with a standard deviation of 18.67 bytes); however, the size is not correlated with the packet loss level.
Also, the packets’ frequency was subjected to measurements to determine whether Skype compensates for losses by increasing the data rate, as was observed for audio traffic [6, 7]. Figure 4b confirms that the video traffic also follows this rule. When no losses are introduced, the data rate, as expected, is 15 pkt/s. With an increase in overall losses up to 20 %, the frequency of the packets slowly rises to 20 pkt/s. Then, from 20 to 80 % of losses the data rate increases to about 30 pkt/s, and then it drops back again to about 20 pkt/s. The observed behaviour confirms the results reported in another study on Skype video traffic [22], where it was noted that for a loss level of 16 % the data rate increased by about 60 %.
If the size of the packets is not correlated with the loss level and the data rate, then it is most probable that the packets are intentionally duplicated and sent to the receiving side to compensate for the elevated loss level. Our measurements revealed that some of the packets were sent with a very small time interval of a few milliseconds (while the expected interval for 15 pkt/s is about 66.7 milliseconds due to the video codec’s frame generation). It is also worth noting that the sizes of such neighbouring packets differ by only 5 or 6 bytes. The difference in size can be reasonably explained: the larger packet probably contains information about the replicated content.
From the perspective of the designed information hiding method, the data rate fluctuations are a very important characteristic to consider. The difference in packet frequency can be utilized as an indicator for steganalysis (detection) purposes. Therefore, the next step is to analyse the number of probable packet duplicates for different loss levels.
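The heuristic for spotting probable duplicates can be sketched as a pass over consecutive video packets. The exact criteria used in the measurements are not given, so the 5 ms gap limit (standing in for "a few milliseconds"), the 6-byte size tolerance and the name `find_probable_duplicates` are assumptions.

```python
def find_probable_duplicates(packets, max_gap_ms=5.0, max_size_diff=6):
    """Return indices of packets that closely follow a similar-sized packet.

    packets: list of (timestamp_ms, size_bytes) tuples sorted by timestamp.
    """
    duplicates = []
    for i in range(1, len(packets)):
        prev_time, prev_size = packets[i - 1]
        cur_time, cur_size = packets[i]
        close_in_time = (cur_time - prev_time) <= max_gap_ms
        similar_size = abs(cur_size - prev_size) <= max_size_diff
        if close_in_time and similar_size:
            duplicates.append(i)
    return duplicates
```

The share of flagged packets, `len(find_probable_duplicates(p)) / len(p)`, can then be tracked against the loss level.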
In Fig. 5 we can observe that the number of probably replicated packets (the red curve) significantly increases when the 15 % loss threshold is reached, which is in line with the trend in Fig. 4b. The total number of “duplicates” is about 0.8 % of all Skype video traffic when no losses are introduced.
The last part of the study is the evaluation of how the resultant quality of the video signal changes with the increase in packet losses. Figure 6 presents the experimental results for the following metrics: PSNR, VQM and MS-SSIM for losses in the range 0–65 %. It should also be noted that for higher loss levels, it was not possible to synchronize the transmitted and received video signals due to significant degradation of the latter.
PSNR (Fig. 6a) first decreases for losses in the range 5–20 %; then, it rises until a 50 % loss level is reached, and after that it decreases significantly. The increase of PSNR at about 20 % of losses is correlated with the increase in packet frequency and the utilization of probable duplicated packets (confirmed by Fig. 4b). This result is in line with a previously performed measurement study [22], where the authors observed a significant degradation of the video signal at 8 % losses, which is similar to the PSNR results obtained at the 10 % loss level.
As expected, similar behaviour as that of PSNR was observed for other quality metrics: VQM and MS-SSIM (Fig. 6b and c, respectively). Therefore, it can be concluded that from the transmission quality perspective the worst packet loss levels are 10 % and above 45 %, and they should be avoided for the purposes of the steganographic method. It is better to keep the losses below 5 % or from 20 to 45 %. However, it must be also noted that for the latter the data rate increases, which is not beneficial from an undetectability point of view.
Another reason for the quite good video quality relates to the way in which intentional losses were introduced. In this paper, until the level of 50 % losses is reached, only every second packet is subject to dropping. This avoids burst losses and preserves the transmission quality. Above 50 %, this rule, obviously, cannot be fulfilled. This also explains why the quality for losses between 20 and 50 % is almost as good as for losses at a very low level (<1 %): since there are a lot of probably duplicated video packets in this range, dropping them does not significantly affect video quality.
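One way to realize such a burst-free dropping rule is sketched below. It is an assumed interpretation rather than the application used in the test-bed: only odd-indexed packets are eligible for dropping, and each eligible packet is dropped with probability twice the target loss, so that the overall loss matches the target on average. The name `drop_mask` is hypothetical.

```python
import random


def drop_mask(n_packets: int, loss: float, rng=None):
    """Return a list of booleans, True meaning that the packet is dropped.

    loss is the target loss fraction and must lie in [0, 0.5]; no two
    consecutive packets are ever dropped.
    """
    if not 0.0 <= loss <= 0.5:
        raise ValueError("the every-second-packet rule only works up to 50 %")
    if rng is None:
        rng = random.Random()
    # Even-indexed packets always pass; odd-indexed ones are dropped
    # with probability 2 * loss.
    return [i % 2 == 1 and rng.random() < 2 * loss for i in range(n_packets)]
```

At exactly 50 % every second packet is dropped, which shows why the rule cannot be kept for higher loss levels.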
For the proposed information hiding method, the most important outcome of the Skype video traffic measurements presented above is a baseline for the “clean” connection. This baseline can later be treated as a reference and compared with steganographically modified traffic. The three most important characteristics of the unmodified Skype video call are the following: the average data rate is 15.24 pkt/s with a standard deviation of 0.28, the average packet size is 741 bytes, and about 0.8 % of the total number of packets are probable duplicates.
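How such a baseline might be compared with an observed call can be sketched as follows. The constants are the values reported above; the summary function and the z-score of the packet rate are illustrative assumptions and not the detection procedure evaluated in this paper. The names `summarize` and `rate_z_score` are hypothetical.

```python
from statistics import mean

# Baseline of the unmodified Skype video call, as reported in the text.
BASELINE_RATE_PPS = 15.24
BASELINE_RATE_STD = 0.28
BASELINE_SIZE_BYTES = 741
BASELINE_DUP_FRACTION = 0.008


def summarize(per_second_counts, packet_sizes, duplicate_count):
    """Compute the three characteristics for an observed video stream."""
    return {
        "rate_pps": mean(per_second_counts),
        "mean_size_bytes": mean(packet_sizes),
        "dup_fraction": duplicate_count / len(packet_sizes),
    }


def rate_z_score(per_second_counts):
    """Distance of the observed mean rate from the baseline, in baseline
    standard deviations (an assumed, simplistic indicator)."""
    return (mean(per_second_counts) - BASELINE_RATE_PPS) / BASELINE_RATE_STD
```

A large positive `rate_z_score`, for instance for a stream running at about 20 pkt/s, would indicate a packet rate that departs visibly from the clean baseline.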