The proposed annotation scheme uses two independent watermarks. The first one allows for synchronization with the original block grid in case of cropping. The second one carries the payload of the annotations and the necessary headers. The general idea of the considered scheme is shown in Fig. 1. The original and the watermarked images are denoted as x and x*, respectively; w represents the auxiliary watermark and \(W_n^{(i)}\) the nth symbol of the ith annotation's watermark.
The first step of the encoder is to divide the image into blocks. Successive steps of the algorithm require different block sizes, which imposes a three-level block hierarchy, described in more detail in Section 2.2.
The first of the watermarks is embedded in the spatial domain using the additive spread-spectrum technique [3]. A correlation detector implemented in the decoder recovers the shift between the original and the current block grid. The details of this synchronization procedure are presented in Section 2.2.
The next step is to perform a block-based forward Discrete Cosine Transform (DCT). This domain allows for straightforward selection of the frequencies least affected by the prospective JPEG compression. Due to high capacity requirements, we have used the Distortion-Compensated Quantization Index Modulation (DC-QIM) technique for embedding the main watermark [1, 4]. Each coefficient eligible for carrying the watermark is modified according to:
$$ \hat{x}_{i,j}^{*} = \left(1 - \gamma\right) \cdot \mathrm{sign}\left(\hat{x}_{i,j}\right) \cdot \Delta \cdot Q_m\!\left(\frac{|\hat{x}_{i,j}|}{\Delta}\right) + \gamma \cdot \hat{x}_{i,j} \tag{1} $$
where \(\hat{x}_{i,j}\) is the cover image coefficient, \(\hat{x}_{i,j}^{*}\) is the watermarked coefficient, Δ is the quantization step and γ is the distortion compensation parameter. \(Q_m(\cdotp)\) is a quantizer for message bit m:
$$ Q_{m}(x) = \begin{cases} 2 \cdot \left\lfloor \frac{x}{2} + 0.5 \right\rfloor & \text{if } m = 0,\\ 2 \cdot \left\lfloor \frac{x}{2} \right\rfloor + 1 & \text{if } m = 1 \end{cases} $$
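For illustration, a minimal NumPy sketch of this embedding rule is given below. The step size and compensation value are placeholders rather than parameters prescribed by the scheme, and the minimum-distance detector is the standard QIM decision rule, which the text does not spell out.

```python
import numpy as np

def quantize(x, m):
    """Q_m(x): quantize to the nearest even (m = 0) or odd (m = 1) integer."""
    return 2 * np.floor(x / 2 + 0.5) if m == 0 else 2 * np.floor(x / 2) + 1

def dcqim_embed(coeff, bit, delta=12.0, gamma=0.1):
    """Embed one message bit into a DCT coefficient according to Eq. (1).
    delta and gamma are illustrative values, not those used in the paper."""
    quantized = np.sign(coeff) * delta * quantize(np.abs(coeff) / delta, bit)
    return (1 - gamma) * quantized + gamma * coeff

def dcqim_detect(coeff, delta=12.0):
    """Minimum-distance detection: pick the quantizer lattice closest to the coefficient."""
    q = np.abs(coeff) / delta
    return 0 if abs(q - quantize(q, 0)) <= abs(q - quantize(q, 1)) else 1
```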
For the purpose of generating the main watermark, the payload of each annotation is encoded by a fountain code [12] and supplemented with the necessary headers. The Medium Access Control (MAC) module assigns the available capacity and multiplexes the data streams from multiple annotations. In the last step, the modified spectrum is transformed back to the spatial domain using inverse DCT.
The decoder begins by detecting the spatial domain synchronization watermark. Based on the detected translation, it aligns the image to match the original block division grid. Then, the image is transformed to the DCT domain and the QIM watermark is recovered. After stripping and analyzing the headers with necessary configuration data, the detector begins to decode the streams of all identified annotations. The last step is to perform coordinate translation of the polygons’ vertices.
The operation of all of the relevant steps of the algorithm will be described in detail in dedicated sections.
Annotation transport architecture
The principle of the proposed annotation watermarking scheme is to deliver a layered architecture analogous to traditional packet networks. The payload of each annotation is divided into constant-length symbols, which are encoded in order to introduce the necessary redundancy. For this purpose, we adopt the fountain coding paradigm [12]. Its fundamental assumption is that successful decoding is possible from arbitrary fragments of the symbol stream; the only requirement is that the decoder receives a certain portion of the transmitted symbols. For an ideal code, the necessary portion would be exactly as long as the original message. Practical codes, however, introduce additional overhead.
Due to different properties of existing digital fountain codes, the proposed scheme allows for selection of the most appropriate one based on the needs of each particular message. We consider two basic codes: the random linear fountain (RLF) [12] and the LT code [11]. The latter can be configured to operate in different variants by choosing the most appropriate degree distribution and the preferred decoding algorithm. Thus, one can configure the system to achieve a good balance between the low overhead of the random linear fountain and the low decoding cost of the LT code.
A fountain code produces output symbols by calculating linear combinations of random input symbols. The number of combined input symbols is referred to as the degree of the output symbol. This degree can be chosen either uniformly, as in the RLF, or according to a selected degree distribution, as in LT codes. The degree distribution has a crucial impact on the properties of the code. The presented system implements three degree distributions: the Robust Soliton Distribution (RSD) [11], the Revised RSD (RRSD) [2] and the optimized degree distribution (OPTD) [8].
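As an illustration of this process, the sketch below draws a degree from a plain Robust Soliton Distribution and XORs that many randomly chosen input symbols. The parameter values, the function names and the idea of seeding the generator per output symbol are our own assumptions, and the RRSD/OPTD variants are omitted.

```python
import numpy as np

def robust_soliton(k, c=0.1, delta=0.5):
    """Robust Soliton degree distribution over degrees 1..k (illustrative c, delta)."""
    s = c * np.log(k / delta) * np.sqrt(k)
    pivot = int(round(k / s))
    rho = np.array([1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)])
    tau = np.zeros(k)
    for d in range(1, min(pivot, k)):
        tau[d - 1] = s / (d * k)
    if 1 <= pivot <= k:
        tau[pivot - 1] = s * np.log(s / delta) / k
    p = rho + tau
    return p / p.sum()

def lt_encode_symbol(input_symbols, seed, degree_dist):
    """Produce one LT output symbol: XOR of `degree` randomly chosen input symbols.
    input_symbols is a list of equal-length NumPy integer arrays (the message split
    into symbols); a per-symbol seed keeps encoder and decoder synchronized (assumed)."""
    k = len(input_symbols)
    rng = np.random.default_rng(seed)
    degree = rng.choice(np.arange(1, k + 1), p=degree_dist)
    neighbours = rng.choice(k, size=degree, replace=False)
    out = np.zeros_like(input_symbols[0])
    for i in neighbours:
        out ^= input_symbols[i]
    return out, neighbours
```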
The RSD is the distribution proposed in the original paper on LT codes; it balances the average degree so that the decoding process proceeds without interruptions and does not require an excessive number of operations per symbol. The main problem with this distribution is a large overhead for short messages. This limitation is mitigated by incorporating the OPTD. The last of the considered distributions, the RRSD, allows for a further improvement of the decoding performance at the cost of an even higher overhead.
Another way to balance the overhead vs. decoding complexity trade-off is to select the appropriate decoding algorithm. In the described system, it can be chosen to be Gaussian elimination (GE), Belief Propagation (BP) or hybrid. The latter begins with BP and uses GE when BP can no longer proceed.
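A minimal peeling (BP) decoder over GF(2) could look as follows. It consumes the (value, neighbours) pairs produced by the hypothetical lt_encode_symbol above, and the Gaussian-elimination stage of the hybrid variant is only indicated by a comment.

```python
def bp_decode(received, k):
    """Peeling decoder; `received` is a list of (value, neighbours) pairs,
    k is the number of input symbols of the original message."""
    decoded = [None] * k
    symbols = [(set(int(i) for i in n), v.copy()) for v, n in received]
    progress = True
    while progress:
        progress = False
        # a degree-one output symbol directly releases one input symbol
        for neighbours, value in symbols:
            if len(neighbours) == 1:
                i = next(iter(neighbours))
                if decoded[i] is None:
                    decoded[i] = value.copy()
                    progress = True
        # substitute every decoded input symbol into the remaining higher-degree symbols
        for neighbours, value in symbols:
            for i in list(neighbours):
                if len(neighbours) > 1 and decoded[i] is not None:
                    value ^= decoded[i]
                    neighbours.discard(i)
    # a hybrid decoder would now pass the unresolved symbols to Gaussian elimination
    return decoded
```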
A fountain code is capable of delivering a limitless symbol stream. In the considered system, the output stream length is constant and stems from the size of the image: one output symbol is generated for every macro block of the image. The MAC module decides which of them are actually going to be used. In practice, the unnecessary output symbols do not need to be calculated; the fountain code, however, needs to explicitly generate null symbols for the sake of proper encoder-decoder synchronization.
The choice of the symbol length stems from an experimental evaluation of the impact of JPEG compression on each of the transform coefficients. The described system uses 60-bit symbols with an additional 16-bit hash for error detection. This choice can be easily adapted to the requirements of a particular application and does not affect the performance of the utilized fountain codes. The described transmission architecture is shown in Fig. 2.
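Packing one transport symbol could be sketched as below; the 16-bit CRC from Python's binascii module is a stand-in, as the text does not specify which hash function is used.

```python
import binascii

SYMBOL_BITS = 60   # payload bits per symbol
HASH_BITS = 16     # error-detection hash appended to every symbol

def pack_symbol(payload: int) -> int:
    """Append a 16-bit hash to a 60-bit payload, yielding the bits carried by one macro block."""
    assert payload < (1 << SYMBOL_BITS)
    digest = binascii.crc_hqx(payload.to_bytes(8, "big"), 0) & 0xFFFF  # stand-in hash
    return (payload << HASH_BITS) | digest

def verify_symbol(symbol: int) -> bool:
    """Recompute the hash of the payload part and compare it with the embedded one."""
    payload, digest = symbol >> HASH_BITS, symbol & 0xFFFF
    return (binascii.crc_hqx(payload.to_bytes(8, "big"), 0) & 0xFFFF) == digest
```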
Block hierarchy and grid synchronization
Each of the steps of the proposed scheme uses a different block size. The designed three-layer block hierarchy is shown in Fig. 3. The lowest-level division is based on the 8 × 8 px grid used by JPEG. Thus, it allows for direct assessment of the impact of lossy compression on individual frequencies of the spectrum.
The capacity of an individual lowest-layer block is insufficient for the discussed application. Thus, for the purpose of embedding the watermark symbols \(W_n\), we group 4 lowest-layer blocks into 16 × 16 px macro blocks, which are capable of carrying one symbol of the watermark payload. Each 8 × 8 px block carries 19 bits of the watermark payload. The information is embedded into the first 19 coefficients of the DCT spectrum in zig-zag order; due to its excessive quality impact, the DC coefficient is not eligible for watermark embedding. The first four coefficients of each block are used to embed the high-priority symbol hash.
The highest layer of the hierarchy groups 16 macro blocks into 64 × 64 px synchronization blocks. The function of synchronization blocks is twofold. Firstly, they impose a strict organization of the macro blocks. The first two macro blocks are reserved for the necessary system headers and are referred to as the scheme configuration block and the stream configuration block. The former carries fundamental properties of the scheme, i.e., the number of embedded annotation streams and the dimensions of the original image, which are necessary for synchronization of the fountain decoders and for translation of the annotations' coordinates in case of cropping. This information is repeated in every synchronization block in the whole image.
The second reserved block defines the properties of the embedded streams, i.e., their lengths and fountain code configuration. Due to capacity limitations, each stream configuration block describes the parameters of up to three annotation streams. By spatial multiplexing, it is possible to describe the necessary configuration data for all of the embedded streams.
The second use of the synchronization blocks is embedding the auxiliary spread-spectrum watermark. A uniform bipolar pseudo-random pattern \(w \in \lbrace -1,1 \rbrace ^{64 \times 64}\) is tiled to match the image size. We use the additive spread-spectrum technique for embedding in the spatial domain (2).
$$ x^{*}_{i,j} = x_{i,j} + \alpha \, w_{i \bmod 64,\; j \bmod 64} \tag{2} $$
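Equation (2) maps directly onto a few lines of NumPy; the embedding strength alpha and the shared seed below are illustrative assumptions, not values reported in the text.

```python
import numpy as np

def embed_sync_watermark(image, pattern, alpha=2.0):
    """Tile the 64 x 64 bipolar pattern over the image and add it with strength alpha, Eq. (2)."""
    h, w = image.shape
    reps = (int(np.ceil(h / 64)), int(np.ceil(w / 64)))
    tiled = np.tile(pattern, reps)[:h, :w]
    return np.clip(image.astype(float) + alpha * tiled, 0, 255)

# example: a pseudo-random bipolar pattern shared by encoder and decoder (assumed seed)
rng = np.random.default_rng(42)
pattern = rng.choice([-1, 1], size=(64, 64))
```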
This auxiliary watermark allows for rapid resynchronization with the original blocking grid [9]. The detector calculates the correlation of an average synchronization block \(\overline{x}\) with the known watermark pattern w. The location of the watermark detection peak corresponds to the grid misalignment vector. This principle is illustrated in Fig. 4, which shows an exemplary misalignment between the original and the cropping-inflicted block division grid and the corresponding detector response.
For the sake of computational efficiency, the decoder calculates the correlation in the Fourier domain. The magnitude of the spectrum is discarded to increase the detection performance [9]. Hence, the decision is based solely on the angle between the inspected vectors, which essentially corresponds to the correlation coefficient detector [3]. The correlation matrix C is obtained by coefficient-wise multiplication of the image and watermark spectra:
$$ C = f^{-1}\left(\Phi\left(f\left(\overline{x}\right)\right) \cdot \Phi\left(f(w)\right)\right) \tag{3} $$
where f(x) is the Fast Fourier Transform and Φ(x) is a magnitude discarding function:
$$ \Phi(x) = \begin{cases} \frac{x}{|x|} & \text{if $x \neq 0$},\\ 1 & \text{otherwise} \end{cases} $$
This detector is equivalent to Symmetric Phase-Only Matched Filtering (SPOMF).
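A compact NumPy rendering of the detector is sketched below. Note that computing the cross-correlation via the FFT requires conjugating one of the phase-only spectra, which the sketch makes explicit; the peak search assumes the misalignment lies within one 64 × 64 block.

```python
import numpy as np

def phi(spectrum, eps=1e-12):
    """Magnitude-discarding function: keep only the phase of each coefficient."""
    magnitude = np.abs(spectrum)
    return np.where(magnitude > eps, spectrum / np.maximum(magnitude, eps), 1.0)

def detect_grid_shift(mean_block, pattern):
    """SPOMF-style detection: the peak of C locates the grid misalignment vector."""
    C = np.fft.ifft2(phi(np.fft.fft2(mean_block)) * np.conj(phi(np.fft.fft2(pattern))))
    dy, dx = np.unravel_index(np.argmax(np.real(C)), C.shape)
    return dy, dx
```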
Medium Access Control
The MAC module is responsible for multiplexing the data streams from multiple annotations, i.e., it assigns the macro-blocks to the competing streams. It needs to take into account two principal factors: firstly, the requirement that each description should be recoverable from a cropped version of the relevant image fragment; secondly, the coding overheads, which determine the number of symbols that are actually needed.
The determined assignment is not communicated to the decoder, which, by validating the hash values of the embedded symbols, is capable of restoring the macro-block to data stream mapping. This map is supplemented incrementally, i.e., with each successive annotation, the decoder needs to check a quickly decreasing image area.
The operation of the MAC begins with an initial assignment of the macro-blocks to the data streams, using the minimal-area bounding shape of each of the defined polygons on the image. The shape is determined with macro-block accuracy. The resulting assignment is highly susceptible to conflicts, even if no overlapping polygons are involved. Hence, the next step is usually the conflict resolution procedure. For this purpose, the MAC builds an implicit hierarchy of the defined annotations, based on their mutual location and the overlap area. If the conflict stems from a parent-child relation, it is resolved in favor of the child polygon; the parent is compensated from its surroundings. If the conflicting regions are both children of a common parent, the MAC takes into account the necessary overhead for the associated descriptions.
When all of the conflicts have been resolved, the MAC estimates the necessary overhead for all of the defined annotations. If there is any shortage with respect to this criterion, the problematic region is expanded iteratively with a dilation-like operation; expansion of child regions is allowed only within the parent regions. The whole supplementation process is performed iteratively, starting from the leaves of the built implicit hierarchy.
In the final step, the MAC repeats the dilation-like expansion of the top-most regions to fill the remaining unassigned area of the image. This process is illustrated in Fig. 5: (a) shows the defined polygons overlaid on a tinted cover image, (b) shows the initial assignment of the available capacity, (c) shows the result of the conflict resolution process and (d) shows the final assignment after the final top-level expansion.
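The constrained, dilation-like growth of a child region within its parent could be sketched as follows. The use of scipy.ndimage.binary_dilation with a mask is our choice of tooling and all names are illustrative, not something the text prescribes.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def grow_region(region, parent, free, needed):
    """Dilate a child region (a boolean macro-block map) until it covers `needed`
    macro blocks, expanding only into unassigned blocks inside its parent region."""
    allowed = parent & free                 # expansion confined to free blocks of the parent
    current = region.copy()
    while current.sum() < needed:
        grown = binary_dilation(current, mask=allowed | current)
        if grown.sum() == current.sum():    # no room left inside the parent
            break
        current = grown
    return current
```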
Content adaptation
One of the advantages of the proposed approach is straightforward support for content adaptation. This functionality is implemented solely in the encoder, which needs to assess the necessary embedding parameters in order to guarantee successful message decoding on the receiver side. As with the capacity assignment performed by the MAC module, the decoder does not need to be aware of any content adaptation mechanisms.
The adaptation is carried out by proper adjustment of the distortion compensation parameter γ, which can be used to mitigate the embedding distortion in solid image areas. A Human Visual System (HVS) model can be used to decide which macro blocks are more suitable for information embedding. This is a challenging problem in digital information hiding, as the decoder needs to estimate the original HVS model to recover the selection channel [3]. This often leads to sub-optimal perceptual models, which tend to be robust against prospective content modifications. There also exist techniques for non-shared selection channel communication, which select the best embedding variant from the ones that will be recovered as the desired message in the decoder. These techniques are more frequently adopted in digital steganography, and the selection criterion usually stems from certain payload detectability estimates [5].
In this study, we use a simple model which adapts the distortion compensation γ for a whole macro-block based on the standard deviation σ of its normalized pixel values:
$$ \gamma = \frac{1}{4}\left(1-\left(1+e^{-20\left(\sigma - 0.25\right)}\right)^{-1}\right) $$
This mapping function is shown in Fig. 6. The compensation is higher for solid image blocks, where the embedding artifacts become visible more easily. The embedding is not eliminated completely, in order not to create visible boundaries between the eligible and non-eligible image blocks.
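The mapping can be evaluated directly; the sketch below computes γ per macro block from pixel values normalized to [0, 1], which is our reading of "normalized pixel values".

```python
import numpy as np

def compensation(block):
    """Distortion compensation gamma for one 16 x 16 macro block (pixel values in 0..255)."""
    sigma = np.std(block.astype(float) / 255.0)   # std of normalized pixel values
    return 0.25 * (1.0 - 1.0 / (1.0 + np.exp(-20.0 * (sigma - 0.25))))
```

For a nearly solid block (small σ) the value approaches 0.25, while for textured blocks it drops towards zero, matching the behavior shown in Fig. 6.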