Interactive live streaming is a key feature of applications and platforms in which the actions of the viewers affect the content of the stream. In such applications, a minimal capture-display delay is critical. Though recent technological advances have certainly made it possible to provide web-based interactive live streaming, little research is available that compares the real-world performance of the different web-based schemes. In this paper we use educational remote laboratories as a case study. We analyze the restrictions that web-based interactive live-streaming applications face, such as the need for a low delay. We also consider additional characteristics that are often sought in production systems, such as universality and deployability behind institutional firewalls. The paper describes and experimentally compares the most relevant approaches. With the provided descriptions and real-world experimental results, researchers, designers and developers can: a) identify the interactive live-streaming approaches that are available for their real-world systems, b) decide which one is most appropriate for their purpose, and c) know what performance and results they can expect.
The latest social trends and technological advances have led to the emergence of various popular web-based live streaming platforms, such as YouTube Live, TwitchTV, Instagram Livestream and Facebook Live. These platforms are designed to maximize scalability and, though they are indeed live, they still allow a relatively high delay (several seconds or more). This enables them to use a larger buffer, heavier compression and more effective transcoding techniques than they otherwise could. Prior work provides further detail on these issues and outlines the TwitchTV architecture, which is a good example. Specifically, the measured broadcast delay of that platform varies between 12 and 21 seconds. The negative impact on user experience is not too high, because such a delay is acceptable for non-interactive live-streaming applications, such as live sports.
However, there are also many applications of live streaming that need to be interactive. In interactive live streaming systems, the viewers affect the content of the stream. A common example is a videoconferencing application, in which viewers interact with each other. Another example is remote laboratories, which will be used in this work as the main case study. These labs allow remote students to view specific hardware through a webcam and interact with it remotely in close to real time. Figure 1 characterizes the different types of streaming and some of their applications.
Interactive live-streaming systems share some challenges with standard live-streaming platforms. One of those is the importance of being web-based. In recent years there has been a powerful trend towards shifting applications to the Web. However, certain features, such as multimedia, have traditionally had more limited support [31, 41]. Applications that depended on them had to find workarounds: many chose to rely on non-standard plugins, such as Java Applets or Adobe Flash. Others accepted a significant decrease in their quality or performance, or could not be migrated at all. Today, with HTML5 and other related Web standards such as WebGL, this is starting to change. One of the features for which applications have traditionally had to rely on external plug-ins is video streaming. Now, as an example, large websites such as YouTube or Netflix rely by default on HTML5.
Applications that require interactive live streaming, however, have additional requirements, expectations, and limitations. VoD (Video-on-Demand) streaming applications, such as YouTube or Netflix, are among the most common streaming platforms. Because videos exist well before users view them, they can be preprocessed at will. They can use heavy compression and prepare the video for different qualities and transmission rates. They can also be streamed through adaptive streaming with relative ease, and they rely on buffering to provide greater quality despite network issues and to be able to use larger compression frames. Live applications, however, face limitations in those respects. As previously mentioned, those that are not interactive (e.g., broadcasting a live sports event) can withstand several seconds of delay without issues. For those that are interactive (e.g., remote laboratories, collaborative tools, videoconferencing applications), a delay of more than a second is already high: according to some HCI analyses, beyond a 0.1-second delay the user can notice that a system is not reacting instantaneously, and beyond 1 second the user's flow of thought is interrupted.
In this context, researchers, system designers and developers that want to implement interactive live-streaming systems face certain difficulties. Major live-streaming platforms are closed and proprietary. It is difficult to use them for learning and research purposes, and they are not suitable for interactive live-streaming or as middleware for other applications. Moreover, the schemes that are available for implementing interactive live-streaming are complex. For real-world usage, the adequacy of a scheme may depend on the video format, on the communication scheme, on the compatibility of different browsers, on the resources and bandwidth available, etc. Most of those aspects, individually, are examined in the literature. However, the real-world performance and limitations of the different schemes cannot be readily predicted from it. There is little real-world experimental data that researchers and developers may use to make a truly informed decision on the approaches they choose. The main goal of this work is to provide them with that information.
In this paper, in Section 2, we describe the goals and contributions of this work and the particular requirements of web-based interactive applications that rely on live streaming. To illustrate the case practically, we put special focus on remote laboratories and educational applications. We propose some criteria through which the effectiveness of each approach can be compared. Then, in Section 3, we examine and describe several approaches that web-based interactive applications may use for providing live-streaming capabilities. The five that seem most relevant according to the previously defined criteria are described in more detail and selected for further experimental comparison. In Section 4 we describe the experiments that have been conducted to measure the effectiveness and real-world performance of those five approaches. In Section 5, we compare the results of the different experiments. In Section 6 we examine the results and comparison, and we offer an interpretation and some guidelines for potential application. Finally, in Section 7 we draw a number of conclusions and outline some possible future lines of work.
2.1 Challenge and purpose
In a live streaming system the content is typically produced while it is being broadcast. That is, essentially, what differentiates it from non-live systems. However, there is still a very significant delay between the moment a frame is captured and the moment it is displayed on the target device [1, 15]. This delay is not only the result of network or hardware latency; it is built into the design to achieve higher scalability. In non-interactive live streams, a delay of seconds does not typically harm the QoE (Quality of Experience), and it makes it possible to leverage techniques such as buffering, video segmentation and high-compression motion codecs. An example of this is the Twitch platform. It relies on different techniques depending on the target device, but it tends to have a delay higher than 10 seconds. Another example is the YouTube live streaming platform, which lets users choose between better quality or lower latency. Even in the lower-latency mode, a capture-display delay higher than 20-30 seconds is, reportedly, not unexpected. This is appropriate for several types of applications, such as broadcasting a sports event. However, for certain interactive applications, delays higher than a second, as previously established, can already be considered high.
An interactive live streaming application differs from a non-interactive one. Users are not simply passive spectators of the content. Instead, they are able to interact with it or through it, affecting the stream. This imposes a strong constraint on the maximum acceptable capture-display delay that is not present in other types of live streaming. It could thus also be considered near-real-time streaming. Some of the potential applications for interactive live streaming (see also Fig. 1) are the following:
Videoconferencing software, such as Skype, Google Hangouts, or Apple FaceTime, in which users see, hear and react to each other in near-real-time.
Surveillance systems, in which the viewer should be able to see what is happening in almost real-time.
Remote rendering systems, in which the server handles the rendering and sends the video to the client in real time. An example is cloud-based gaming: rendering a videogame server-side and forwarding the input from the client. Another example is free-viewpoint rendering: in such a system, with many video inputs, the server has a huge amount of video data. To reduce bandwidth requirements, only the relevant portions are served to the client in real time.
Interactive remote laboratories, in which users interact with real physical equipment located somewhere else with a webcam stream as their main input.
Many such systems still rely on relatively simple streaming approaches, such as image refreshing. Reasons for this include:

Inertia and developer preference.
More advanced technologies (such as adaptive streaming, video segmentation, or high-compression codecs) not being effective for near-real-time streaming.
Newer approaches having significant real-world issues, such as portability issues, low reliability or difficulties to deploy behind institutional proxies.
Little literature available on the possible approaches, their effectiveness, and the expected real-world outcomes.
This work thus aims to shed more light on that area. The goal is for the remote laboratory community in particular, and other interactive live streaming applications in general, to have the information to make better decisions on which streaming approaches to implement, and, moreover, to know what effectiveness and performance they can expect by doing so. It also aims, specifically, to describe the currently used approaches and their architecture, and to propose some novel ones.
The contributions of this work are thus the following:
A brief analysis of which characteristics are important for interactive live streaming applications.
An experimental analysis of the support for the described approaches across all major desktop and mobile browsers.
Experimental performance comparison of those described approaches that are most relevant.
Scientific knowledge for developers of systems that rely on interactive live streaming, enabling them to make educated decisions on the feasibility and convenience of incorporating alternative technical approaches into their implementations.
Conclusions, based on the results of the experiments, on which approaches would be more appropriate depending on the type of remote laboratory required.
Of all of those contributions, the main one is the experimental performance comparison among the most relevant web-based approaches.
2.3 Remote laboratories
A remote laboratory is a software and hardware tool. It allows students to remotely access real equipment located somewhere else [9, 13, 24]. They can thus learn to use that equipment and experiment with it without having it physically available. Research suggests that learning through a remote laboratory, if it is properly designed and implemented, can be as effective as learning through a hands-on session. Additionally, remote laboratories can offer advantages such as reducing costs and promoting the sharing of equipment among different organizations. Many remote laboratories feature one or several webcams. Through them, users can view the remote equipment. Simultaneously, they can interact with the physical devices using means such as virtual controls that are physically mapped to them (e.g., [14, 20, 42, 46]). Some remote laboratories are even designed to allow access from mobile devices. An example of a remote laboratory is depicted in Fig. 2. In this particular case, the students experiment with Archimedes' principle. They can interact with 6 different instances of equipment, for which 6 simultaneous webcam streams are needed.
2.4 Technical goals and criteria
We propose a set of technical goals and criteria to compare and evaluate the different interactive live streaming approaches that will be examined.
The key technical goals that will be considered are the following:
Near-real-time: The delay between the actual real-life event and the time the user perceives it (the latency) should be minimal for the interaction to be smooth.
Universality: The applications should be deployable on as many platforms, systems and networks, and as easily, as possible.
Security: The applications should be secure.
Though less critical, the following traits significantly affect the Quality of Experience and will be taken into account when evaluating the different possible approaches:
Frame rate: The higher the better.
Quality: The higher the better.
Network bandwidth usage: The lower the better.
Client-side resources: CPU and RAM usage. The lower the better.
Server-side processing is also an important consideration, especially for production systems. Though it will be considered and discussed, evaluating it quantitatively is beyond the scope of this work, which focuses mainly on the client side. Therefore, the experiments themselves include no server-side measurements.
A last consideration is the implementation complexity of each approach. Beyond the previously mentioned criteria, in practice, the knowledge, cost and effort required for implementing a specific interactive live-streaming approach is also, in many cases, a determining factor. Evaluating this complexity quantitatively is beyond the scope of this work.
In the following subsections we briefly describe a simplified streaming platform model. Additionally, we provide further detail and rationale about the aforementioned technical goals and criteria.
2.4.1 Simplified live-streaming platform model
Different live-streaming platform models may exist. A simplified one is shown in Fig. 3; it is also the general model considered in this work. A set of IP cameras provide their input to the streaming platform through a camera output format. The particular format will vary, because different camera models support different formats. Common ones are, for instance, JPG snapshots, M-JPEG streams and, in newer models, the H.264 format. The streaming platform receives the input and transcodes it into the target format. Often, the platform will also briefly act as a cache server for the input, so that it can scale to an arbitrary number of users without increasing the load on the webcams. The transcoded output is served through the server-client channel protocol (e.g., standard HTTP, AJAX, WebSockets) to the client's browser. Depending on the approach, the browser will render it natively or through other means.
2.4.2 Near-real-time

In a live-streaming context, end-to-end latency (sometimes known simply as latency) is generally considered to be the time that elapses between the instant a frame is captured by the source device and the instant it is displayed on the target device. For most types of live streaming applications, a relatively high latency (some seconds) can be tolerated without significantly harming the user's experience [6, 32]. Latency is introduced at each stage of the process. Noteworthy contributions are the latency introduced by the camera, by the server-side encoding, by the network, and by the client (decoding and displaying). These sources of latency are analyzed and discussed in detail in a white paper by Axis Communications. Tolerating a relatively high delay is a significant advantage. Especially in a bandwidth-constrained network, codecs that provide large compression but require heavy pre-processing can be used. Issues such as jitter can be solved with a longer buffer. Most HTTP streaming methods rely on buffering to adapt to bandwidth fluctuation, and often separate the stream into multiple media segments. This adds an unavoidable capture-display delay.
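As a purely illustrative latency budget (the per-stage values below are hypothetical, not measurements of any particular system), these sources combine additively:

```latex
t_{\mathrm{e2e}} = t_{\mathrm{camera}} + t_{\mathrm{encode}} + t_{\mathrm{network}} + t_{\mathrm{decode}}
\approx 60\,\mathrm{ms} + 50\,\mathrm{ms} + 30\,\mathrm{ms} + 20\,\mathrm{ms} = 160\,\mathrm{ms}
```

Even before any buffering or segmentation is added, such a budget already approaches the thresholds at which interactivity starts to degrade.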
Interactive live-streaming applications are much less tolerant to latency. The actions of the users depend directly on what they are currently seeing on the stream. A few seconds of delay is enough to severely harm their Quality of Experience. Exactly how much latency can be tolerated, and how much it affects the user experience, varies depending on the application. For example, some works report that in conversational applications (e.g., IP telephony, videoconferencing) 150 ms is a good latency, while 300-400 ms is not acceptable. For cloud-based games, some studies suggest that each additional 100 ms of latency causes approximately a 25% decrease in player performance. For many other types of common interactive live streaming applications, such as remote laboratories, there is, to our knowledge, little specific research available on how much increased latency affects user experience. However, the interaction style and pace of many of them, such as remote laboratories themselves, is generally similar to that of a standard application or interactive website. Thus, it is reasonable to assume that generalist interaction conclusions are appropriate. In this line, according to established HCI research, beyond a 0.1-second delay the user can notice that a system is not reacting instantaneously, and beyond 1 second the user's flow of thought is interrupted.
Due to all this, supporting near-real-time (which, for the purpose of this work, we will consider as being able to provide a relatively low end-to-end latency) is a particularly important requirement for an effective interactive live streaming approach, and the set of techniques that can be applied is significantly different from that applied for standard live streaming or for VoD (Video on Demand). Modern techniques which are very popular and effective for standard streaming are sometimes no longer an option, or are severely limited:
Buffering: Would add a delay of at least the buffer length, so it cannot be used or must be kept minimal in length.
Segmented streams: Would add a delay of at least the segment length.
Pre-transcoding: Not really an option if a small delay is required.
2.4.3 Universality

The meaning and usage of universality varies across contexts, but in this paper we will use it to refer to the degree to which an application is technically available to those who may want to use it. Aspects that increase universality include, among others, the following:
Being available across many types of devices (PCs, mobile phones, tablets)
Having fewer technical requirements to run properly

Requiring fewer user privileges to run

Being deployable behind stricter institutional firewalls and proxies
Universality is generally positive, but it is important to note that, in practice, it often implies important trade-offs. Depending on the particular context, needs and requirements of an application, the actual importance of universality will vary. In the case of remote laboratories, research suggests that it is one of the most important characteristics, but in other cases this might differ. It is noteworthy that this work aims to contribute to web-based interactive applications, which, by being web-based, already tend to provide relatively high universality.
2.4.4 Security

Being secure can be considered a goal of any application, but its importance varies depending on the context. Some technologies tend to provide greater security than others. For example, remote laboratories and other educational applications are often hosted by universities. Their IT teams are often hesitant to offer intrusive technologies to students, to avoid exposing them to security risks for which the university could be liable. All things equal, non-intrusive technologies are thus preferred.
2.4.5 Frame rate
The frame rate is measured through the frames-per-second (FPS) metric. In some contexts, 50-60 FPS is considered to provide satisfactory visual quality, beyond which increases can hardly be noticed. In practice, however, significantly lower frame rates are used in many cases. This is often the case for interactive live streaming applications.
2.4.6 Quality

Quality is hard to measure because it is actually a qualitative perception affected by many (qualitative and quantitative) variables. Sometimes (e.g., in YouTube) it is used as a synonym for resolution or pixel density. For simplicity, in the comparison of the different approaches, we will rely mostly on resolution. The particular video codec that is used also has a great influence on the final quality of the stream.
2.4.7 Network bandwidth usage
Live-streaming applications consume significant amounts of network bandwidth. This is because video content itself tends to consume significant bandwidth, and because it often has to be provided to many users. Bandwidth usage can thus be a significant cost and limitation, and, all things equal, approaches that preserve network bandwidth are preferred. Unfortunately, there tends to be an inverse correlation between network bandwidth usage and required server-side and client-side processing. That is, the codecs that require the least bandwidth tend to also be the ones that require the most processing power to encode (server-side) and decode (client-side). Sometimes specialized hardware is relied upon to provide more efficient decoding. Adding to the difficulty, some network setups, particularly mobile ones, are inherently unstable, and their bandwidth capacity cannot be predicted reliably [23, 39].
2.4.8 Client-side resources
Different approaches and implementations require different amounts of CPU power and RAM. The codecs used, particularly, have a very significant influence in that respect. Client-side processing effort tends to be higher for the codecs that require the least bandwidth. To compensate for this, however, many devices also provide hardware-level support for particular codecs. Relying on hardware-level support is most of the time significantly more efficient, in terms of processing and energy usage. At the same time, because support tends to vary between devices, it can sometimes make portability harder. In this work, the client-side processing effort will be measured in terms of CPU and RAM usage, though additional variables could be taken into account, such as energy cost, I/O usage or discrete graphics card usage. All things equal, lower resource usage is better, but some applications have stricter client-side processing restrictions than others. A Video on Demand (VoD) application, for instance, could admit a relatively high usage in exchange for low bandwidth and high quality: there is a single active stream, and the user is not expected to be multi-tasking. However, a remote laboratory or an IP surveillance application that requires observing many cameras at once will often have stricter limits. See, for instance, the remote laboratory in Fig. 2. The students have access to 6 simultaneous streams, through which they must be able to interact with the equipment in real time. Thus, the resource usage of each individual stream must be quite conservative.
2.4.9 Server-side processing
Server-side processing can be very high due to the pre-processing, compression and encoding that is sometimes used. Large media servers and systems, especially those that aim to scale to many concurrent users per stream, such as Wowza, YouTube and Netflix, encode a given source video into many separate formats and qualities. Thus, they can dynamically adapt the stream to the bandwidth and technical restrictions of each user. A higher or lower quality stream can be served depending on the bandwidth that the user has available. Also, one format or another can be served depending on whether the user's device or browser supports that format, and even depending on whether the user's device supports hardware acceleration for that format. For interactive applications the possible choice of codecs and formats is more limited, because the latency cannot exceed certain values. Also, it is noteworthy that for applications which do not aim to scale to many concurrent users per stream, but which instead aim to serve a relatively high number of different streams (such as many remote laboratories), it is sometimes convenient to accept a higher bandwidth usage in exchange for a lower processing effort.
Though server-side processing could thus be an important consideration, this paper focuses on the client side: server-side considerations will be briefly described, but the experiments will focus on the client side.
2.4.10 Implementation complexity
In practice, in production systems, the main factor for choosing an interactive live-streaming approach will often not be the technical characteristics or performance, but its implementation complexity. Technically superior approaches may be overlooked in favour of approaches that require less knowledge and effort and have a lower cost to implement. The quantitative evaluation of the implementation complexity of each of the different approaches is beyond the scope of this work. It would be hard to carry out and very difficult to reach meaningful results with, due to its often developer-specific and subjective nature. Nonetheless, the architecture used for each experiment will be described, and thus the implementation complexity may be partially inferred from it.
3 Interactive live-streaming approaches
In this section we describe and analyze some of the different approaches that are available for web-based interactive live streaming. The first of them (the image-refreshing and native M-JPEG based approaches) are those most often used by the architectures, use cases and implementations of live streaming applications that can be found in the literature. Additionally, we include some novel approaches which are more rarely used, some of which have only recently become available due to their reliance on new or under-development Web standards.
In the following subsections we will describe in more detail the different schemes. We will describe more thoroughly those schemes which best seem to meet the criteria described in Section 2, and which we will select for the experimental comparison. For most schemes, we will describe separately the server-side and client-side.
3.1 Image refreshing

In this approach, the client periodically requests a new snapshot from the webcam and replaces the displayed image. Client-side, implementations typically take two precautions:

To modify the webcam image's URL in some way (such as by adding a random number to the GET query parameters) so that no cache issues occur.
To query for a new image only after the load event has fired, sometimes after a small delay, so that requests don’t accumulate on slow networks.
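These two precautions can be sketched in browser JavaScript as follows (a minimal sketch: the image element, frame URL and refresh delay passed in are hypothetical, and error handling is reduced to simply retrying):

```javascript
// Minimal image-refreshing client (sketch).
function withCacheBuster(url) {
  // Append a throwaway query parameter so neither the browser nor any
  // intermediate cache reuses a stale frame.
  const sep = url.includes('?') ? '&' : '?';
  return url + sep + '_=' + Date.now() + '.' + Math.random().toString(36).slice(2);
}

function startRefreshing(img, url, delayMs) {
  const next = () => {
    // Request the following frame only once the current one has loaded
    // (or failed), after a small delay, so that requests do not
    // accumulate on slow networks.
    img.onload = img.onerror = () => setTimeout(next, delayMs);
    img.src = withCacheBuster(url);
  };
  next();
}
```

A page would call, for instance, startRefreshing(document.getElementById('cam'), '/cam/frame.jpg', 100) to target roughly 10 FPS at most, degrading gracefully on slower networks.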
Server-side, this particular approach is also relatively simple. Most IP webcams provide an image URL that serves one frame, so that URL just needs to be made available either directly or through a standard web server. Some applications require several concurrent users, and the hardware of webcams is sometimes not particularly powerful, so a cache server can be used. In that case, a cache server in the same network continuously requests frames from the camera. Then, whenever the web server asks for a frame, the last cached frame is served.
3.2 Motion JPEG
Natively, browsers and HTTP servers implement M-JPEG by sending the video through special MIME types such as multipart/x-mixed-replace, using the chunked transfer coding feature of HTTP 1.1 (RFC 2616), which is a data transfer mechanism that allows the transmission of dynamically generated content. When the request is started, the total length does not need to be known. Instead, an undetermined number of chunks can be sent, and the request can be completed at any time with a final zero-length chunk. In the case of an M-JPEG stream, servers keep a long-running request, sending the separate JPEG images that compose the video as separate chunks.
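As an illustration, the server-side framing just described can be sketched as follows (a Node-style sketch; the boundary string and function names are ours, not taken from any particular implementation):

```javascript
// Sketch of native M-JPEG framing on the server (Node-style Buffers).
// The boundary string is an arbitrary choice; it only has to match the
// value announced in the Content-Type header.
const BOUNDARY = 'mjpegframe';

// Response headers announcing a multipart stream whose parts replace
// each other in place as they arrive.
function mjpegHeaders() {
  return {
    'Content-Type': 'multipart/x-mixed-replace; boundary=' + BOUNDARY,
    'Cache-Control': 'no-cache',
  };
}

// Wrap a single JPEG frame as one multipart part. The server writes one
// of these per frame on the same long-running chunked response.
function framePart(jpegBuffer) {
  const head =
    '--' + BOUNDARY + '\r\n' +
    'Content-Type: image/jpeg\r\n' +
    'Content-Length: ' + jpegBuffer.length + '\r\n\r\n';
  return Buffer.concat([Buffer.from(head), jpegBuffer, Buffer.from('\r\n')]);
}
```

A handler would send mjpegHeaders() once and then write framePart(frame) for every new JPEG on the same long-running response; the final zero-length chunk of the chunked coding ends the stream.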
3.2.1 Limitations of browser-native M-JPEG implementations
Although the M-JPEG video format itself is relatively effective for live streaming [10, 21], and it is indeed a popular format for applications such as webcams, digital cameras or remote laboratories, in practice there are some technical and reliability issues with the native M-JPEG implementations of current browsers.
Though most major browsers nominally support M-JPEG natively, it fails to work under certain versions of Google Chrome, Mobile Safari and Firefox. Also, when the stream is interrupted or fails, browsers do not recover. Under some circumstances, such as receiving frames at a higher FPS than the bandwidth allows, Chrome and Firefox often stop displaying the images but keep receiving them internally.
Motion JPEG has no particular bandwidth-control or timing provisions. Browsers tend to render the frames just as they are received. This works as expected when the bandwidth is high enough for the FPS. The latency is even particularly low compared to modern formats such as H.264, because there is no interframe compression or buffering delay. However, when the server sends frames at a higher rate than the client can receive and display, frames tend to accumulate, and a significant and growing latency is introduced. Avoiding this is non-trivial and not always possible, and the implementations observed in this work make no particular provision in this respect.
The client-side is, at least in principle, straightforward for those browsers that natively support M-JPEG, because including an <img> tag is all that is needed. In practice, some implementations provide a fallback mechanism for browsers that do not support it or that encounter issues, which is not uncommon.
Most observed implementations simply choose to stream at a fixed and relatively low FPS to partially avoid these issues, or let the user or administrator configure the FPS.
Many IP webcam models support M-JPEG natively, so some implementations simply redirect that stream to the end users. However, if the system must support several concurrent users, a caching server is required to achieve an acceptable Quality of Service. The server can try to adapt to the available bandwidth by avoiding queuing frames and instead always sending the latest captured frame. Thus, theoretically, if the client received and displayed the latest frame just as it arrived, the capture-display delay would be minimized, and the server and client would be mostly synchronized. In practice, however, intermediate systems (including the browsers themselves) often buffer the TCP stream, so this scheme does not always work as well as expected.
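The send-the-latest-frame policy can be sketched as a one-slot mailbox on the server side (a sketch; the class and counter names are ours):

```javascript
// One-slot frame mailbox: the producer always overwrites, the consumer
// always gets the newest frame, so stale frames never queue up and add
// capture-display latency.
class LatestFrame {
  constructor() {
    this.frame = null;
    this.dropped = 0; // frames overwritten before ever being sent
  }
  put(frame) {
    if (this.frame !== null) this.dropped++;
    this.frame = frame;
  }
  take() {
    const f = this.frame;
    this.frame = null;
    return f; // null if no new frame has arrived since the last take()
  }
}
```

The capture loop would call put() for every frame received from the camera, while each client's send loop calls take() whenever its connection is ready to accept more data; frames that were never taken are simply dropped rather than queued.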
3.3 High-compression formats
Image refreshing and M-JPEG, as discussed, provide relatively poor compression, but they are nonetheless often used for certain types of real-time interactive applications. That seems to be mostly because:
They require little client-side and server-side processing power.
They are simple to implement and maintain.
Encoding and decoding takes very little time so very little capture-display latency is added.
They can be used in almost any platform and browser.
Client-side there are, again, two possibilities. First, in the ideal case, the browser and hardware will support the format natively through the HTML <video> tag, and that support will also provide enough facilities for near-real-time live streaming of the format. Unfortunately, this is not always a valid approach, due to the following:
The HTML5 standard provides the <video> tag but does not include live-streaming support, though that is likely to change in the future through the Media Source Extensions (MSE) API.
Each browser supports a different set of codecs, so the server needs to generate different streams to be truly cross-platform.
On the server side this approach tends to be significantly more costly in terms of resources, especially if several streams need to be provided. Even with a single stream, it is more costly than image-refreshing or M-JPEG due to the compression and inter-frame nature of these formats. A significant difference is also that, for a user to be able to join an ongoing stream, initialization steps are required. That sometimes involves sending initialization packets through a secondary data channel, and/or waiting for specific periodic frames before being able to join.
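As a rough sketch of such initialization handling, a server could cache the codec headers and the most recent intra frame so that late joiners can start decoding immediately. The `JoinableStream` class below is a hypothetical simplification of our own; real containers and codecs involve considerably more detail:

```python
class JoinableStream:
    """Caches the codec initialization data and the latest keyframe so a
    newly joining client can start decoding at once, instead of waiting
    for the next periodic intra frame."""

    def __init__(self, init_segment):
        self.init_segment = init_segment  # e.g., codec header packets
        self.latest_keyframe = None
        self.since_keyframe = []          # inter frames after the keyframe

    def push(self, frame, is_keyframe):
        if is_keyframe:
            self.latest_keyframe = frame
            self.since_keyframe = []
        elif self.latest_keyframe is not None:
            self.since_keyframe.append(frame)

    def join(self):
        """Everything a new client must receive before the live feed."""
        if self.latest_keyframe is None:
            return [self.init_segment]
        return [self.init_segment, self.latest_keyframe] + self.since_keyframe
```

The trade-off is the one noted above: either the server performs this extra bookkeeping, or the joining client must wait for the next periodic intra frame.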
3.4 Non-standard plugins
Traditionally, multimedia features in the browser have been limited. This has led many media-dependent applications to rely on non-standard plugins such as Adobe Flash or Java Applets. The remote laboratory described in , for instance, displays a webcam through the YawCamFootnote 21 Java Applet.
Adobe Flash, Java Applets and similar plug-ins have access to native TCP and UDP sockets, which makes it possible to use non-web streaming protocols such as RTSP . Although this makes the approach particularly powerful, it also makes it less universal. The plugins themselves need to be installed beforehand, which requires administrator privileges that are not always available. Also, Java Applets are not supported on mobile devices, and Flash support there is very limited. Support in desktop browsers is better, but Chrome has already dropped support for Java Applets, Firefox is expected to drop it soon, and other browsers are likely to follow similar paths. These issues are also discussed in . Furthermore, the plugins themselves and the usage of non-HTTP protocols have security implications. As a result, applications that rely on them could not be deployed behind many institutional firewalls and proxies without significant configuration and policy changes .
3.5 WebRTC
WebRTC (Web Real-Time Communication) is an API standard that is currently under development by the W3C. It is oriented towards peer-to-peer multimedia communication. This technology can be very useful for certain applications, such as videoconferencing, which benefit from being decentralized and peer-to-peer. However, it has certain constraints:
It still requires a server to handle the connection process and signalling; and to route all data for those cases where the NATs or firewalls of the clients prevent them from directly connecting to each other.
It is oriented towards peer-to-peer connections, so its usefulness is limited when the source of the content is a traditional centralized server.
It is not yet an accepted standard, and it is not yet supported by every major browser.
For these reasons, WebRTC-based approaches will not be considered for the purposes of this work, though it may be useful to also compare them in the future.
3.6 HLS and MPEG-DASH
HLS (HTTP Live Streaming) is a protocol created by Apple which is intended to answer some of the challenges described in this work. However, it is, at least for now, not natively supported by all browsers, and it is not standard, so universality is limited.
MPEG-DASH (Dynamic Adaptive Streaming over HTTP) is an adaptive bitrate streaming standard that is currently in draft form. Support is growing and in the future it is likely to be a very effective alternative.
Though several approaches were described in the previous section, many more are feasible, and more are likely to appear in the future. Thus, it is noteworthy that neither the previous list nor this comparison is meant to be exhaustive.
4 Experimental work
In this section, we experimentally evaluate the performance of the five most relevant schemes (described in the previous section). First, we detail the experimental setup and methodology. Next, for each scheme, we:
Implement the scheme.
Conduct qualitative experiments to verify whether the scheme does indeed run under different systems and devices.
Conduct quantitative experiments to evaluate its performance.
4.1 Experimental setup
At this stage we have selected five approaches to be compared quantitatively (which we will refer to as image-refreshing, native M-JPEG, JS M-JPEG, MPEG-1 and H.264/AVC). We are interested in measuring the performance under real-world conditions. Thus, beyond the specific scheme being analyzed, there are several other variables that may affect the measurements, including the following:
FPS: Target Frames Per Second.
Client device: Computer or mobile device to render and take measurements in.
Network: Latency, available bandwidth, etc.
Browser: Browser and specific version.
Conducting experiments for every combination would be impractical and not particularly meaningful. Certain restrictions, described next, have been applied to each of these variables.
4.1.1 FPS
Though some systems rely on a variable FPS, in this case we will set a target FPS. This makes it possible to obtain comparable results for RAM, CPU and bandwidth usage, and is consistent with real-world usage. When applicable,Footnote 22 we will conduct the experiments with three different FPS values: 5, 10 and 25. For some schemes, we will also measure the maximum average FPS they can achieve.
4.1.2 Client device
All the quantitative experiments have been conducted under Device A. In addition to the quantitative experiments, several qualitative ones were conducted to verify whether the specific schemes are indeed cross-platform. For those, two additional devices were used. The specifications of the three devices are the following:
Device A MacBook Pro 13” (Mid 2014): 2.6 GHz Intel Core i5, 8 GB RAM, 256 GB SSD, Intel Iris 1536 MB graphics card. Running OS X 10.11.5.
Device B Desktop PC. Intel Core i7, 8 GB RAM. Running Windows 10.
Device C Samsung Galaxy S7 Edge (SM-G935F).
4.1.3 Network
There are mainly two network parameters to consider: bandwidth and latency. The experiments were conducted in a local network, with around 100 Mbps of bandwidth and around 20 ms of latency. Though both parameters are important, preliminary experiments suggest that they do not significantly affect the comparison.
In those preliminary experiments, the bandwidth did not affect the measurements, except when the measured bandwidth usage approached the maximum network bandwidth. In those cases, either the target FPS could not be met (in the image-refreshing scheme) or the capture-render delay started growing beyond what is reasonable for an interactive live-streaming system (in the other schemes). Differences in latency seemingly have no effect on RAM, CPU or bandwidth usage. As expected, however, an increased network latency results in an increased capture-render delay, and the relationship seems to be mostly linear. So, though the measurements would vary on a slower network, the comparison and the conclusions should not.
4.2 Methodology and measurements
The implementations have been deployed on a server in the local network, and the video feed is obtained from a local IP webcam. The metrics to be measured are RAM usage, CPU usage, downstream bandwidth and capture-render latency, all of them in relation to a target FPS. We have kept them as separate metrics because, depending on the specific application, the most significant ones may vary. For instance, in certain mobile networks minimizing the bandwidth usage could be the most important criterion . In other cases, such as networks with more bandwidth, minimizing RAM or CPU usage could be more appropriate.
Bandwidth measurements are incoming-only, and were obtained either through the Chrome Task Manager or the Mac OS Activity Monitor (because WebSocket bandwidth usage is not shown in Chrome). RAM and CPU usage were measured through the Chrome Task Manager. Five measurements were taken in each case; the highest and lowest were discarded, and the three remaining ones were used to compute the average and the standard deviation, which are listed in tables for each scheme.
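The statistic computed for each configuration (discard the extremes, then average the remaining three measurements) can be expressed as a short script. The sample values below are made-up illustrations, not the paper's data:

```python
import statistics

def summarize(measurements):
    """Discard the highest and lowest measurement, then return the mean
    and sample standard deviation of the remaining values."""
    if len(measurements) < 3:
        raise ValueError("need at least 3 measurements")
    trimmed = sorted(measurements)[1:-1]
    return statistics.mean(trimmed), statistics.stdev(trimmed)

# Five hypothetical latency samples, in milliseconds:
avg, sd = summarize([223, 240, 251, 260, 316])
```

Discarding the extremes makes the reported averages robust to occasional outliers, such as a picture taken just as a frame is being repainted.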
To measure the latency, an IP webcam was pointed at a desktop computer displaying a clock on its screen. The test laptop (Device A) was then placed next to it, rendering the webcam image through the experimental streaming system, using the particular settings of each experiment. The latency is thus the difference between the live clock (on the desktop) and its image rendered on the test laptop. For each experimental combination (FPS and streaming approach), five pictures were taken of both screens and the difference measured; from these five measurements, the average and standard deviation were computed. Figure 5 shows a picture of this setup. The screen (in the middle) shows a clock. An IP webcam (in the lower left) points at it and forwards the stream to our interactive live-streaming platform. The stream is rendered on the laptop (to the right) using one of the schemes and configurations. When a picture is taken of both screens, the capture-render delay at that moment is thus the difference between the clock and its render.
4.3 Image-refreshing
The implementation was run on device A (with Chrome, Firefox and Safari), device B (with Chrome, Firefox, Internet Explorer and Edge) and device C (with mobile Chrome and mobile Firefox). It works as expected and without noticeable issues in all of them.
The performance of the experimental implementation (conducted under device A and Chrome) is summarized in Table 1. It is noteworthy that RAM usage is particularly high. It increases steadily from the moment the first image is loaded and, after a while, stabilises at around 600 MB. It is hypothesised that this is due to Chrome's caching and memory optimization schemes, which retain copies of the previously loaded images up to a certain point. The latency (capture-display delay) is relatively low: for all the tested frame rates, it is within the 223 ms to 316 ms range, and it is lowest at 25 FPS. Although not covered in the table, it is noteworthy that at 10 FPS, but over a constrained, simulated Good 2G connection, the delay is stable at around 1726 ms.
4.4 Native M-JPEG
The implementation was tried on device A (with Chrome, Firefox and Safari), device B (with Chrome, Firefox, Internet Explorer and Edge) and device C (with mobile Chrome and mobile Firefox). It runs in all browsers except Internet Explorer.
4.5 JS M-JPEG
The implementation was tried on device A (with Chrome, Firefox and Safari), device B (with Chrome, Firefox, Internet Explorer and Edge) and device C (with mobile Chrome and mobile Firefox). It runs in all browsers, and no particular issues were observed in any of them.
The performance of the experiment (conducted under device A and Chrome) is summarized in Table 3. The RAM usage increases slowly but drops periodically, and it does not grow proportionally to the FPS. When not bandwidth-constrained, the latency is relatively low and quite stable: it is within the 284–289 ms range. It is lowest at 10 FPS, though given the small difference, this might be due to random fluctuations.
Rates higher than 115 FPS could be reached and, strangely, in that case the CPU usage was actually lower than at 25 FPS.
4.6 MPEG-1
MPEG-1  is a very mature format, and in many traditional streaming contexts it could be considered legacy. More modern formats such as H.264 —which is also tested in later sections— provide better quality and compression . However, its simplicity potentially also implies a lower processing cost and a lower latency. Considering this, and that many current applications (such as most remote laboratories) still rely on apparently less efficient approaches such as image-refreshing or M-JPEG, MPEG-1 could still be an effective option in some contexts.
Specifically, the system relies on the following ffmpeg command to transcode the stream:
ffmpeg -r 30 -f mjpeg -i <webcam_mjpeg_url> -f mpeg1video -b 800k -r 25 pipe:1
The setup was tried on device A (with Chrome, Firefox and Safari), device B (with Chrome, Firefox, Internet Explorer and Edge) and device C (with mobile Chrome and mobile Firefox). It runs in all browsers, and no particular issues were observed in any of them. It is noteworthy, though, that the MPEG-1 quality was lower than in the previous approaches (which was to be expected due to the higher compression).
The performance of the experiment (conducted under device A and Chrome) is summarized in Table 4. The RAM usage is very steady. The latency is 610 ms on average when not bandwidth-constrained. Though not included in the table summary, it is noteworthy that when simulating a Good 2G connection under Chrome the latency is similar, probably because, given its low bandwidth requirements, the stream is not really limited by the restricted bandwidth either.
4.7 H.264/AVC
H.264/MPEG-4 AVC  is currently a popular high-compression format which is particularly appropriate for streaming, and it is commonly supported as an output format by many modern IP webcams. The performance of a system relying on H.264 will vary significantly depending on the specific codec implementation, the specific codec parameters, and other factors. Thus, these experiments are not intended to evaluate the performance of the H.264 format itself but, instead, the potential performance of a real-life interactive live-streaming system that relies on it.
The ffmpeg instance was configured to use the libx264 codec, with the baseline profile, constant bitrate mode (set to 1500 kbps) and several low-latency flags enabled. Specifically, the ffmpeg command line that was used is:
ffmpeg -r 30 -f mjpeg -i <webcam_mjpeg_url> -flags +low_delay -probesize 32 -c:v libx264 -tune zerolatency -preset:v ultrafast -r <target_fps> -b:v 1500k -f h264 pipe:1
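A server-side relay around such a command could, for example, launch ffmpeg as a subprocess and forward its raw output in chunks to the clients (e.g., over a WebSocket). The sketch below is our own illustration, not the paper's implementation; in particular, the `-b:v` flag is assumed to be how the 1500 kbps constant bitrate is set:

```python
import subprocess

def build_h264_cmd(webcam_mjpeg_url, target_fps, bitrate="1500k"):
    """Builds the low-latency H.264 transcoding command line; the output
    goes to stdout (pipe:1) so it can be relayed to clients."""
    return [
        "ffmpeg", "-r", "30", "-f", "mjpeg", "-i", webcam_mjpeg_url,
        "-flags", "+low_delay", "-probesize", "32",
        "-c:v", "libx264", "-tune", "zerolatency", "-preset:v", "ultrafast",
        "-r", str(target_fps), "-b:v", bitrate, "-f", "h264", "pipe:1",
    ]

def relay(webcam_mjpeg_url, target_fps, on_chunk, chunk_size=4096):
    """Runs ffmpeg and forwards raw H.264 chunks to a callback, which a
    server could broadcast to its connected clients."""
    proc = subprocess.Popen(build_h264_cmd(webcam_mjpeg_url, target_fps),
                            stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:          # ffmpeg exited or closed its stdout
                break
            on_chunk(chunk)
    finally:
        proc.terminate()
```

Reading small chunks and forwarding them immediately, rather than accumulating whole frames, keeps the relay itself from adding buffering latency on top of the codec's.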
The setup was tried on device A (with Chrome, Firefox and Safari), device B (with Chrome, Firefox, Internet Explorer and Edge) and device C (with mobile Chrome and mobile Firefox). It runs in all browsers, and no particular issues were observed in any of them.
The performance of the experiment (conducted under device A and Chrome) is summarized in Table 5. RAM usage is steady regardless of the FPS, while CPU usage increases with it. Latency is highest at 5 FPS (averaging 1083 ms) and lowest at 25 FPS (averaging 417 ms). This difference is quite significant and is probably due, in part, to the codec's internal buffers being filled faster at higher frame rates.
Table 6 summarizes the browser support for each chosen approach and implementation. For the most part, support is very broad. This is to be expected, because the approaches were chosen precisely for providing relatively high universality.
The higher bandwidth requirements of the lower-compression formats increase the CPU usage compared to the higher-compression but lower-bandwidth ones, especially in a browser environment, where the buffers involved are probably copied several times.
Image-refreshing is not what the browsers' image components were designed for, and repeatedly replacing an image is not necessarily efficient: it involves cache and DOM operations which are appropriate for single images but not so much for video.
The results of the experiments show that no single approach is necessarily the best for all cases. They show, however, that some approaches can probably be significantly more advantageous than others for certain purposes.
Native M-JPEG, in particular, presents several issues:
A major browser (Internet Explorer) lacks support.
Other browsers have several reported bugs and a relatively poor track record of supporting it.
It consumes significantly more RAM
It consumes significantly more CPU power
Additionally, another significant observation is that the performance of the most simplistic approach —image-refreshing— is as good as that of native M-JPEG in RAM and bandwidth usage, and not far behind in CPU usage. Considering the aforementioned issues with native M-JPEG, and the significant advantages of image-refreshing, such as its simplicity and its trivial error-recovery and bandwidth-adaptation capabilities, we conclude again that in most cases there would be no advantage in relying on native M-JPEG.
These formats, however, do have some disadvantages. First, the image quality —with the constant-bitrate, low-latency configuration that was chosen— is not bad, but it is worse than with the other approaches. Second, they do indeed raise the latency. Nonetheless, with an appropriate low-latency configuration, most measurements remain below 1 second, which is still acceptable for many interactive live-streaming applications. It is, however, certainly higher than with more traditional approaches such as image-refreshing or M-JPEG. Thus, despite the mostly superior results, these formats are not necessarily the best choice for all applications. They also have some additional disadvantages:
Deployment is more complex due to the transcoding server that is required. Server-side, more processing resources are required.
Error recovery and bandwidth adaptation is harder than with other approaches such as image-refreshing (for which it is trivial).
7 Conclusions and future work
The schemes based on MPEG-1 and H.264/AVC could be the most appropriate for certain types of interactive applications: particularly those which, like certain remote laboratories, can withstand a relatively high latency (around 1 second) but require a low bandwidth usage at a high FPS. Nonetheless, the image-refreshing scheme, despite its simplicity, could still be the most appropriate for interactive applications which require a very low latency, especially at lower FPS rates.
In the future, it would be interesting to compare new approaches. MPEG-DASH is a promising HTML5-related standard which will likely provide near-real-time live-streaming. Once the support for it is wider, it would be useful to evaluate whether a low capture-display delay can be achieved, and how its performance compares against the alternatives.
Though no official figures are provided by YouTube, several observations and informal tests are available, such as those found at http://blog.ptzoptics.com/youtube-live/low-latency-streaming/ or at the Google product forum (https://productforums.google.com/forum/).
The Archimedes’ principle remote laboratory is usually publicly available at: https://weblab.deusto.es/weblab/labs/Aquatic%20experiments/archimedes/
Brazilian consortium headed by the Federal University of Santa Catarina (UFSC) remote laboratories. As of 21 April 2016, at least 8 different remote laboratories are available at http://relle.ufsc.br, all of which rely on browser-native M-JPEG.
As described in more detail in later subsections, MPEG-1 will only be measured at 25 FPS, because its standard does not allow 5 or 10 FPS.
Akhshabi S, Begen AC, Dovrolis C (2011) An experimental evaluation of rate-adaptation algorithms in adaptive streaming over HTTP. In: Proceedings of the second annual ACM conference on multimedia systems. ACM, pp 157–168
Bankoski J, Wilkins P, Xu Y (2011) VP8 data format and decoding guide. Tech. rep., available: http://tools.ietf.org/html/rfc6386 (accessed: 2016-07-07)
Claypool M, Finkel D (2014) The effects of latency on player performance in cloud-based games. In: 2014 13th annual workshop on Network and systems support for games (netgames). IEEE, pp 1–6
Cozzolino A, Flammini F, Galli V, Lamberti M, Poggi G, Pragliola C (2012) Evaluating the effects of mjpeg compression on motion tracking in metro railway surveillance. In: Advanced concepts for intelligent vision systems. Springer, pp 142–154
De Jong T, Linn MC, Zacharia ZC (2013) Physical and virtual laboratories in science and engineering education. Science 340(6130):305–308
Deshpande H, Bawa M, Garcia-Molina H (2001) Streaming live media over a peer-to-peer network. Tech. rep
Fielding R, Gettys J, Mogul J, Frystyk H, Masinter L, Leach P, Berners-Lee T (2006) Hypertext transfer protocol–HTTP/1.1, 1999. RFC2616
Garcia-Zubia J, López-de-Ipiña D, Orduña P (2008) Mobile devices and remote labs in engineering education. In: 2008 eighth IEEE international conference on advanced learning technologies. IEEE, pp 620–622
García-Zubia J, Orduña P, López-de-Ipiña D, Alves GR (2009) Addressing software impact in the design of remote laboratories. IEEE Trans Ind Electron 56(12):4757–4767
Golparvar-Fard M, Bohn J, Teizer J, Savarese S, Peña-Mora F (2011) Evaluation of image-based modeling and laser scanning accuracy for emerging automated performance monitoring techniques. Autom Constr 20(8):1143–1155
Grange A, Rivaz P, Hunt J (2016) Draft VP9 bitstream and decoding process specification. Tech. rep., available: http://www.webmproject.org/vp9/ (accessed: 2016-07-07)
Grois D, Marpe D, Mulayoff A, Itzhaky B, Hadar O (2013) Performance comparison of H.265/MPEG-HEVC, VP9, and H.264/MPEG-AVC encoders. In: Picture coding symposium (PCS), 2013. IEEE, pp 394–397
Harward VJ, Del Alamo JA, Lerman SR, Bailey PH, Carpenter J, DeLong K, Felknor C, Hardison J, Harrison B, Jabbour I et al (2008) The ilab shared architecture: a web services infrastructure to build communities of internet accessible laboratories. Proc IEEE 96(6):931–950
Hashemian R, Riddley J (2007) Fpga E-lab, a technique to remote access a laboratory to design and test. In: Microelectronic systems education, 2007. IEEE International Conference on MSE’07. IEEE, pp 139–140
Hei X, Liang C, Liang J, Liu Y, Ross KW (2007) A measurement study of a large-scale P2P IPTV system. IEEE Trans Multimed 9(8):1672–1687
HTML5 specification (2016) Tech. rep., W3C, available: https://www.w3.org/TR/html5
ISO/IEC (2015) ISO/IEC 11172:1993. Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s. Tech. rep., available: http://www.iso.org/iso/catalogue_detail?csnumber=19180 (accessed: 2016-07-07)
ITU-T (2015) High efficiency video coding. Tech. rep., available: https://www.itu.int/rec/t-REC-h.265-201504-i/en (accessed: 2016-07-07)
ITU-T (2016) Advanced video coding for generic audiovisual services. Tech. rep., available: http://www.itu.int/rec/T-REC-H.264-201602-I/en (Accessed: 2016-07-07)
Jara CA, Candelas FA, Torres F (2008) Virtual and remote laboratory for robotics e-learning. Comput Aided Chem Eng 25:1193–1198
Kaspar M, Parsad NM, Silverstein JC (2010) Cowebviz: interactive collaborative sharing of 3D stereoscopic visualization among browsers with no added software. In: Proceedings of the 1st ACM international health informatics symposium. ACM, pp 809–816
Latency in live network video surveillance (2015) Tech. rep., axis communications, available: http://bit.ly/2izYVOb (accessed: 2016-07-07)
Kim K, Cho BY, Ro WW (2016) Server side, play buffer based quality control for adaptive media streaming. Multimed Tools Appl 1–19
Ma J, Nickerson JV (2006) Hands-on, simulated, and remote laboratories: a comparative literature review. ACM Comput Surv (CSUR) 38(3):7
Martinez G, Angulo I, Garcia-Zubia J (2016) Weblabmicroscope: A remote laboratory for experimenting with digital microscope. In: 2016 13th international conference on remote engineering and virtual instrumentation (REV). IEEE, pp 159–162
Nedic Z, Machotka J, Nafalski A (2003) Remote laboratories versus virtual and real laboratories. In: FIE 2003 33rd annual, vol 1. IEEE
Nielsen J (1994) Usability engineering. Elsevier
Nishantha D, Hayashida Y, Hayashi T (2004) Application level rate adaptive Motion-JPEG transmission for medical collaboration systems. In: Proceedings of 24th international conference on distributed computing systems workshops, 2004. IEEE, pp 64–69
Orduña P, Bailey PH, DeLong K, López-de-Ipiña D, García-zubia J (2014) Towards federated interoperable bridges for sharing educational remote laboratories. Comput Hum Behav 30:389–395
Quax P, Liesenborgs J, Barzan A, Croonen M, Lamotte W, Vankeirsbilck B, Dhoedt B, Kimpe T, Pattyn K, McLin M (2016) Remote rendering solutions using web technologies. Multimed Tools Appl 75(8):4383–4410
Rodriguez-Gil L, Orduña P, García-Zubia J, Angulo I, López-de-Ipiña D (2014) Graphic technologies for virtual, remote and hybrid laboratories: Weblab-fpga hybrid lab. In: 2014 11th international conference on remote engineering and virtual instrumentation (REV). IEEE, pp 163–166
Salkintzis A, Passas N (2005) Emerging wireless multimedia: services and technologies. Wiley
Schauer F, Lustig F, Ožvoldová M (2009) Ises-internet school experimental system for computer-based laboratories in physics. Innovations 109–118
Schauer F, Krbecek M, Beno P, Gerza M, Palka L, Spilaková P (2014) Remlabnet-open remote laboratory management system for e-experiments. In: 2014 11th international conference on remote engineering and virtual instrumentation (REV). IEEE, pp 268–273
Schulzrinne H, Rao A, Lanphier R (1998) RTSP: Real Time Streaming Protocol. IETF RFC 2326, April
Shea R, Liu J, Ngai ECH, Cui Y (2013) Cloud gaming: architecture and performance. IEEE Netw 27(4):16–21
Soares J, Lobo J (2011) A remote fpga laboratory for digital design students. In: 7th portuguese meeting on reconfigurable systems (REC 2011). pp 95–98
Sripanidkulchai K, Ganjam A, Maggs B, Zhang H (2004) The feasibility of supporting large-scale live streaming applications with dynamic application end-points. In: ACM SIGCOMM computer communication review, vol. 34. ACM, pp 107–120
Tan WL, Lam F, Lau WC (2008) An empirical study on the capacity and performance of 3g networks. IEEE Trans Mob Comput 7(6):737–750
Ueberheide M, Klose F, Varisetty T, Fidler M, Magnor M (2015) Web-based interactive free-viewpoint streaming: a framework for high quality interactive free viewpoint navigation. In: Proceedings of the 23rd ACM international conference on multimedia. ACM, pp 1031–1034
Van Lancker W, Van Deursen D, Mannens E, Van de Walle R (2012) Implementation strategies for efficient media fragment retrieval. Multimed Tools and Appl 57(2):243–267
Vargas H, Farias G, Sanchez J, Dormido S, Esquembre F (2013) Using augmented reality in remote laboratories. Int J Comput Commun Control 8(4):622–634
Wang B, Zhang X, Wang G, Zheng H, Zhao BY (2016) Anatomy of a personalized livestreaming system. In: Proceedings of the 2016 ACM on internet measurement conference. ACM, pp 485–498
WebGL specification (2014) Tech. rep., Khronos WebGL working group, available: https://www.khronos.org/registry/webgl/specs/1.0/
Wiegand T, Sullivan GJ, Bjontegaard G, Luthra A (2003) Overview of the H.264/AVC video coding standard. IEEE Trans Circuits Syst Video Technol 13(7):560–576
Yazidi A, Henao H, Capolino GA, Betin F, Filippetti F (2011) A web-based remote laboratory for monitoring and diagnosis of ac electrical machines. IEEE Trans Ind Electron 58(10):4950–4959
YouTube now defaults to HTML5 video (2016) https://youtubeeng.googleblog.com/2015/01/youtube-now-defaults-to-html5_27.html (accessed: 2016-07-07)
Zhang C, Liu J (2015) On crowdsourced interactive live streaming: a Twitch.TV-based measurement study. In: Proceedings of the 25th ACM workshop on network and operating systems support for digital audio and video. ACM, pp 55–60
This work has received financial support by the Department of Education, Language policy and Culture of the Basque Government through a Predoctoral Scholarship granted to Luis Rodriguez-Gil.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Rodriguez-Gil, L., Orduña, P., García-Zubia, J. et al. Interactive live-streaming technologies and approaches for web-based applications. Multimed Tools Appl 77, 6471–6502 (2018). https://doi.org/10.1007/s11042-017-4556-6
- Live streaming
- Remote laboratories
- Online learning tools
- Rich interactive applications