1 Introduction

Optical Music Recognition (OMR) is the field of research that investigates how to computationally decode music notation in document images. It aims at converting the large number of existing written musical sources into a codified format that allows for their computational processing [2]. There are many documents in private and public archives, sometimes hidden from the public, waiting to be digitized. There are also many digitized collections in specialized portals that are only available as images, without the possibility of studying or searching their content. Since music typesetting is a tedious and expensive process, OMR represents an alternative to deal efficiently with this scenario.

Traditional approaches to OMR [2, 14] are based on the usual pipeline of sub-tasks that characterizes many artificial vision systems, adapted to this particular task: image pre-processing, individual music-object detection, reconstruction of the music semantics by using specific knowledge of the field, and output encoding in a suitable symbolic format.

Recent advances in Machine Learning—namely Deep Learning (DL)—which have achieved great results in similar tasks such as text recognition, allow us to be optimistic about developing more accurate OMR systems. The current trend in these fields is the use of end-to-end (or holistic) systems, which address the process in a single stage without explicitly modeling the intermediate sub-steps. To develop these approaches, only training pairs are needed, consisting of problem images together with their corresponding transcript solutions [3, 6].

By design, these approaches—typically based on Recurrent Neural Networks or Hidden Markov Models—can only formulate the output of the system as a one-dimensional sequence. This fits perfectly for natural language tasks (text or speech recognition, or machine translation), since their outputs mostly consist of character (or word) sequences. However, its application to musical notation is not so straightforward due to the presence of different elements that share the same horizontal position, as well as relationships between non-adjacent elements. The vertical distribution of these elements disrupts the linear flow of the time line (see Fig. 1). This fact is not trivial to codify and can significantly hinder the performance of recognition systems that exploit the temporal relationships among the recognized elements.

Fig. 1.

The recognition process does not follow a linear left-to-right flow.

Although the problem can be drastically simplified by considering that the process works on each staff independently from the others—a process analogous to text recognition systems that decompose the document into a series of independent lines—we still have to deal with elements that occur simultaneously in the “time” line, like the notes that make up a chord, tuplets, or expression marks, to name a few.

Within the range of music score complexities, one possible simplification of the problem, which applies to much sheet music, is to assume a homophonic music context. In that case, there are multiple parts, but they move in the same rhythm. Multiple notes can thus occur simultaneously, but only as a single voice: all the notes starting at the same time have the same duration, so the score can be segmented into vertical slices that may contain one or more music symbols (see Fig. 2).

Fig. 2.

In homophonic music, all the notes starting at the same time have the same duration.

Even in this simplified context, there is a need for a clear and structured output coding that avoids the ambiguities that a linear output representation can exhibit in the presence of vertical structures in the data (see Fig. 3).

Fig. 3.

Ambiguities appear when symbols are stacked. When two notes that must be played at the same time appear together, a linear symbol sequence without specific marks can be interpreted in more than one way.

There already exist a number of structured formats for music representation and coding, like XML-based music formats [8, 11], which are focused on how the score has to be encoded to properly store all its content. This focus makes them inappropriate as the output of an optical recognition system, because such code is full of marks that are irrelevant for the system to generate when recognizing the score content (mainly which symbols appear and where they are located in the score).

Due to that, we have designed a specific language to represent an appropriate output for end-to-end OMR, based on serializing the music symbols found in a staff of homophonic music. The sequential nature of music reading must be compatible with the representation of the vertical alignments of some symbols. In addition, this representation has to be easy to generate by the system, which analyzes the input sequentially and produces a linear series of symbols.

The rest of the paper is structured as follows: Sect. 2 describes the recognition framework based on DL, including the ad-hoc serializations for homophonic music; Sect. 3 introduces the experimental setup followed to validate the approach; Sect. 4 shows the results obtained in a controlled scenario, as well as qualitative results with real homophonic sheet music; finally, Sect. 5 concludes the present work, along with some ideas for future research.

2 Recognition Framework

To carry out the OMR task in an end-to-end manner, we use a Convolutional Recurrent Neural Network (CRNN) that permits us to model the posterior probability of generating output symbols, given an input image. Input images are assumed to be single staff-sections, analogously to text recognition that assumes independent lines [15]. This is not a strong assumption, as staves can be easily isolated by means of existing methods [7].

A CRNN consists of one block of convolutional layers followed by another block of recurrent layers [16]. The convolutional block is responsible for learning how to process the input image, that is, extracting image features that are relevant for the task at hand, while the recurrent layers interpret these features in terms of sequences of musical symbols. In our work, the recurrent layers are implemented as Bidirectional Long Short-Term Memory (BLSTM) units [9].

The activations of the last convolutional layer can be seen as a sequence of feature vectors representing the input image, \(\mathbf x \). These features are fed to the first BLSTM layer and the unit activations of the last recurrent layer are considered estimates of the posterior probabilities for each vector:

$$\begin{aligned} P(\sigma |\mathbf {x},\textit{f}), \; 1 \le \textit{f} \le \textit{F}, \; \sigma \in \varSigma \end{aligned}$$

where \(\textit{F}\) is the number of feature vectors of the input sequence and \(\varSigma \) is the set of considered symbols, that must include a “non-character” symbol required for images that contain two or more consecutive instances of the same musical symbol [9].
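As an illustration, the architecture described above can be sketched as follows. This is a minimal sketch in PyTorch under our own assumptions: the number of layers, filter counts, and hidden sizes are illustrative, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """The convolutional block extracts a sequence of feature vectors from
    the staff image; a BLSTM stack maps them to per-frame symbol posteriors."""

    def __init__(self, num_symbols, img_height=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        feat_dim = 64 * (img_height // 4)   # channels x pooled image height
        self.blstm = nn.LSTM(feat_dim, 128, num_layers=2,
                             bidirectional=True, batch_first=True)
        # One extra output for the CTC "non-character" (blank) symbol.
        self.out = nn.Linear(2 * 128, num_symbols + 1)

    def forward(self, x):                   # x: (batch, 1, height, width)
        f = self.conv(x)                    # (batch, 64, height/4, width/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # F feature vectors
        y, _ = self.blstm(f)
        return self.out(y).log_softmax(-1)  # per-frame log P(sigma | x, f)
```

Each column of the last convolutional activation map becomes one feature vector, so the pooled width of the image determines the number F of per-frame predictions.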

Since both the convolutional and recurrent blocks can be trained through gradient descent using the well-known backpropagation algorithm [17], a CRNN can be trained jointly. However, a conventional end-to-end OMR training set only provides, for each staff image, its corresponding transcription, without any explicit information about the location of the symbols in the image. It has been shown that the CRNN can be conveniently trained without this information by using the so-called “Connectionist Temporal Classification” (CTC) loss function [10]. The resulting CTC training procedure is a form of Expectation-Maximization, similar to the forward-backward algorithm used for training Hidden Markov Models [13]. In other words, CTC provides a means to optimize the CRNN parameters so that the network is likely to give the correct sequence for a given input. The use of the aforementioned “non-character” symbol to indicate a separation between symbols is considered essential for adequate CTC training [10].

Once the CRNN has been trained, an input staff image can be decoded into a sequence of music symbols \(\mathbf {\hat{s}} \in \varSigma ^{*}\). First, the most probable symbol per frame is computed:

$$\begin{aligned} \hat{\sigma }_{i} = \arg \max _{\sigma \in \varSigma } P(\sigma |\mathbf {x}, i), \; 1 \le i \le F \end{aligned}$$

Then, a pseudo-optimal output sequence is obtained as:

$$\begin{aligned} \mathbf {\hat{s}} = \arg \max _{s \in \varSigma ^{*}} P(\mathbf s |\mathbf x ) \approx \mathcal {D}(\hat{\sigma }_1, \ldots , \hat{\sigma }_F) \end{aligned}$$

where \(\mathcal {D}\) is a function that first merges all the consecutive frames with the same symbol, and then deletes the “non-character” symbol [9].
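The function \(\mathcal {D}\) reduces to a few lines of code. A minimal sketch in Python, where the symbol names and the `<blank>` token for the “non-character” symbol are our own illustrative choices:

```python
def ctc_greedy_decode(frame_symbols, blank="<blank>"):
    """Apply D: merge consecutive frames that carry the same symbol,
    then delete every occurrence of the "non-character" symbol."""
    decoded, prev = [], None
    for sym in frame_symbols:
        if sym != prev and sym != blank:  # start of a new run of a real symbol
            decoded.append(sym)
        prev = sym
    return decoded

# The blank between the two identical notes keeps them from being merged:
ctc_greedy_decode(["note.C4", "note.C4", "<blank>", "note.C4", "barline"])
# -> ["note.C4", "note.C4", "barline"]
```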

This framework is equal to the one used in text recognition tasks [16], whose expressiveness could be sufficient when working with simple scores in which all the symbols follow a single left-to-right order. However, as introduced above, we want to extend these approaches so that they are able to model richer scores such as those of homophonic sheet music. In that case, constructs like chords may appear, where several symbols share a horizontal position. As seen in Fig. 3, a one-dimensional sequence is not expressive enough for this. That is why in the next section we describe our coding proposal to perform end-to-end OMR for homophonic scores.

2.1 Serialization Proposals

Our current research studies four different deterministic, unambiguous, serialized representations to encode the situations that arise in homophonic music, so that the OMR system becomes more effective when recognizing complex music score images. The four proposed representations differ not in the encoding of the musical symbols themselves but in the way the horizontal and vertical distributions of the musical symbols are represented. The grammar of each codification must be deterministic and unambiguous, so that a given document can be analyzed in only one way.

Our representation does not make assumptions about the musical meaning of what is represented in the document being analyzed, that is, the elements are identified in a catalog of musical symbols by the shape they have and where they are placed in the score. This has been referred to as “agnostic representation”, as opposed to a semantic representation where music symbols are encoded according to their actual music meaning [5].

As mentioned above, the only difference between the four proposed musical codes is how they represent the horizontal and vertical dimensions. Each of the four codes has one or two characters that indicate whether, when transcribing the score, the system should move forward, that is, from left to right, or upwards, from bottom to top.

The four different codes proposed are described as follows:

  • Remain-at-position character code: when transcribing the score, the different musical symbols are assumed to be placed left to right, except when they are in the same horizontal position. In that case they are separated by a slash, “/”. This acts as a remain-at-position character, meaning that the system does not advance forward but has to advance upwards (see Fig. 4a). This behaviour is similar to the backspace key of a typewriter: the carriage advances after typing, so to align two symbols we must keep the carriage at a fixed position by moving it back one position.

  • Advance-position character code: this codification uses a “+” sign to force the system to advance forward. When that sign is missing, the output does not move forward, and a vertical distribution is being coded (see Fig. 4b).

  • Parenthesized code: when a vertical distribution appears in the score, the system outputs a parenthesized structure, like vertical.start musical_symbol ...musical_symbol vertical.end (see Fig. 4c).

  • Verbose code: this last codification is a combination of the first two. It uses the “+” sign as the advance-position character to indicate that the system has to move forward, and the “/” sign as the remain-at-position character to indicate that the system has to advance upwards (see Fig. 4d). Thus, in this codification, every two adjacent symbols are explicitly separated by a character indicating whether the system must remain at the same horizontal position or advance to the next one.

Note that the four codes are unambiguous representations of the same data, so they are interchangeable and can be translated into one another.
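To make the four proposals concrete, the sketch below serializes the same staff content, represented as a left-to-right list of columns whose symbols are listed bottom to top, into each code. The symbol names (`clef.G`, `note.C4`, ...) and the exact token spellings are hypothetical, chosen only for illustration:

```python
def remain_code(cols):
    # Symbols advance by default; "/" keeps the horizontal position.
    return " ".join(" / ".join(col) for col in cols)

def advance_code(cols):
    # "+" forces an advance; adjacent symbols without it are stacked.
    return " + ".join(" ".join(col) for col in cols)

def parenthesized_code(cols):
    # Stacked symbols are wrapped in vertical.start ... vertical.end.
    return " ".join(col[0] if len(col) == 1
                    else "vertical.start " + " ".join(col) + " vertical.end"
                    for col in cols)

def verbose_code(cols):
    # Every adjacent pair is separated by "+" (advance) or "/" (remain).
    return " + ".join(" / ".join(col) for col in cols)

score = [["clef.G"], ["note.C4", "note.E4"], ["barline"]]
remain_code(score)         # -> "clef.G note.C4 / note.E4 barline"
advance_code(score)        # -> "clef.G + note.C4 note.E4 + barline"
parenthesized_code(score)  # -> "clef.G vertical.start note.C4 note.E4 vertical.end barline"
verbose_code(score)        # -> "clef.G + note.C4 / note.E4 + barline"
```

Since all four functions read the same column structure, any of the codes can be parsed back into that structure and re-serialized into any other, which illustrates their interchangeability.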

Fig. 4.

Musical excerpt presenting a number of different situations where vertical alignments occur and its transcription using the proposed codifications. From top to bottom: (a) remain-at-position character coding, (b) advance-position character coding, (c) parenthesized coding, and (d) verbose coding.

3 Experimental Setup

In this section, the experimental setup, including both the corpora and the evaluation protocol considered, will be described.

3.1 Corpus Generation

As introduced above, the current tendency in the development of OMR systems is to use machine learning techniques that are able to infer the transcription from correct examples of the task, namely, a set of (image, transcription) pairs. Given the complexity of music notation, for these techniques to produce satisfactory results it is necessary to use a set of sufficient size. To achieve this, a system for the automatic generation of labeled data has been developed [1] by using algorithmic composition techniques [12]. The developed system provides two outputs: on the one hand, the expected transcription of the generated score in any of the encodings described above; on the other hand, the score image in PDF format. With both outputs, the pairs required by the machine learning algorithm are obtained.

For the generation system, three different methods of algorithmic composition were implemented, allowing us to obtain compositions with disparate musical features. The technical details of those composition methods are out of the scope of the present paper (see [1]), but it is important to note that each of them produces heterogeneous music scores. Therefore, all three have been used equally when generating the training set, so that it is not biased in favour of any particular style.

3.2 Evaluation Protocol

A corpus of \(8\,000\) tagged scores, each consisting of a single staff, has been generated using the system for automatic generation of labeled data for OMR research explained in the previous section. Our aim is to evaluate to what extent the CRNN is able to learn the non-linearities in the time line, and which type of encoding yields the best results in the recognition task. This corpus will be used to train the end-to-end neural network described in Sect. 2. Each sample will be a pair composed of the image with a rendered staff and its corresponding representation in the format imposed by one of the four proposed musical encodings, as in the example shown in Fig. 4.

We consider the following evaluation metrics to measure the recognition performance:

  • Sequence Error Rate, Seq-ER (%): ratio of incorrectly predicted sequences (the sequence of recognized symbols has at least one error).

  • Symbol Error Rate, Sym-ER (%): computed as the average number of elementary editing operations (insertions, deletions, or substitutions) necessary to match the sequence predicted by the model with the ground truth sequence.
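The edit distance underlying the Sym-ER is the classical Levenshtein distance between the predicted and ground-truth symbol sequences. A minimal sketch, where normalizing by the reference length is our assumption of the usual convention and the symbol names are illustrative:

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn hyp into ref (dynamic programming, O(|ref|*|hyp|))."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (or match)
        prev = cur
    return prev[-1]

def sym_er(ref, hyp):
    """Symbol Error Rate (%) for one sequence pair."""
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)

sym_er(["clef.G", "note.C4", "note.E4", "barline"],
       ["clef.G", "note.C4", "note.G4", "barline"])  # -> 25.0
```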

The Seq-ER gives us a fairer evaluation because it does not depend on the number of symbols needed for coding, which could bias the Sym-ER in favour of the less verbose encodings. On the other hand, the Seq-ER is a much more pessimistic estimate of the performance, because a single error ruins the output. A model can have a very low Sym-ER, e.g. 1%, but if the wrong symbols are evenly distributed across sequences, the Seq-ER can still be very high.

Due to that, in the next section, the Seq-ER will be used to compare the performance of the neural model under the different encodings (although the Sym-ER is also studied), and then the best representation will be used for a qualitative evaluation of the OMR approach on some selected real music fragments.

4 Results

Four different models have been trained, each corresponding to one of the four coding proposals, using the corpus of \(8\,000\) scores generated. From the whole set, \(7\,000\) have been used for training and the remaining \(1\,000\) scores for validation. Note that the main purpose of this experimentation is to compare the different encodings, so we are interested in their error bounds.

First, the convergence of the learned models is shown, that is, how many training epochs the models need to tune their parameters appropriately. This gives some insight into how complex each encoding is for the CRNN to learn. The curves obtained for each type of encoding are shown in Figs. 5 (Seq-ER) and 6 (Sym-ER). From the curves, we can observe that the four models converge relatively quickly, reaching the elbow point in fewer than 20 epochs.

Fig. 5.

Convergence analysis: accuracy over the validation set with respect to the training epoch of the deep neural network in terms of Sequence Error Rate. Remain stands for remain-at-position character coding, Advance stands for advance position character, Parenth. stands for parenthesized coding, and Verbose stands for verbose coding.

Fig. 6.

Accuracy over the validation set with respect to the training epoch of the deep neural network in terms of Symbol Error Rate. Logarithmic units have been used for the error representation. Remain stands for remain-at-position character coding, Advance stands for advance-position character coding, Parenth. stands for parenthesized coding, and Verbose stands for verbose coding.

Observing Figs. 5 and 6, there is a clear correlation between the Sym-ER and the Seq-ER for the four coding proposals: the encoding with the highest Seq-ER is also the one with the highest Sym-ER, and vice versa. This way, we can rule out that the Sym-ER biases the evaluation depending on the verbosity of the coding used, and it can be used for a proper performance evaluation.

It is also noted that the encoding of the output for this OMR task does have an impact on the training and, consequently, on the recognition performance. The advance-position character coding achieved the best results for both metrics: it attains the lowest Sym-ER and Seq-ER. The results are very encouraging, since around 70% of the test scores were recognized error-free, and the symbol recognition error rate is less than 1% of the symbols predicted.

Table 1. Error-bound analysis: best accuracy attained by each coding over the validation set.

4.1 Qualitative Evaluation

All the scores used above are synthetic. An additional experiment has been carried out to transcribe the content of real scores taken from a repository of historical scores (RISM, Répertoire International des Sources Musicales). Two musical incipits from RISM were presented to the system as independent (unseen) images for a qualitative assessment of the proposed approach. The images of this experiment have been distorted to deal with the challenges of a real scenario, as in [4].

Figures 7 and 8 show the images and the sequences predicted using the Advance model. From them, we can say that the results obtained (Sym-ER = 5.3% and 9.5%, respectively) are rather accurate, even taking into account that the images belong to a different (previously unseen) database from the one used for training, were rendered using different methods, and present distortions and lower quality. These facts explain the higher error values compared to those presented in Table 1, but the performance still shows a high recognition precision that can be improved by adding distorted images to the training set.

Fig. 7.

Qualitative evaluation of the OMR approach for incipit RISM ID no. 110003911-1_1_1, yielding a Sym-ER of 5.3%.

Fig. 8.

Qualitative evaluation of the OMR approach for incipit RISM ID no. 000136642-1_1_1, yielding a Sym-ER of 9.5%.

As can be observed in Fig. 8, the last measure contains a chord of three dotted quarter notes, which the model interprets as two clearly differentiated groups: the three quarter notes share the same horizontal position, and so do the three corresponding dots, but the dots are placed to the right of the chord notes, implying a horizontal advance, which the model correctly codifies by placing the ‘+’ (advance) character between the two groups. Consequently, this leads us to the conclusion that the model is able to interpret the vertical and horizontal relationships in the score and learns how to code them.

5 Conclusions

In this work, we have studied the suitability of a neural network approach for solving the OMR task in an end-to-end way through a controlled scenario of homophonic synthetic scores, presenting and analyzing four different encodings for the OMR output.

As reported in the experiments, our serialized ways of encoding the music content prove to be appropriate for DL-based OMR, as the learning process is successful and low Symbol Error Rate figures are eventually attained. In addition, it is shown that the choice of encoding has some impact on the lower bound of the error rates that can be achieved, which correlates almost directly with the tendency of the learning curves. These facts reinforce our initial claim that the encoding of the output for OMR deserves further consideration within the end-to-end DL paradigm.