1 Introduction

Music is a language used and understood worldwide. It is an art that has crossed borders since its inception and is one of the main cultural manifestations of humankind. For this reason, over the centuries there has been a need to preserve its written sources in the best possible way, whether in cathedrals, libraries, or historical archives. However, access to these documents is often limited, since continued use may end up damaging them irretrievably.

There exist multiple projects and organizations whose purpose is comprehensively documenting extant historical sources of music all over the world, such as the following ones:

All of them are making a great effort to digitize musical scores, making their collections accessible as images through the Internet. But for those musical documents to be truly accessible, they must be transcribed into a digital format that enables tasks such as indexing, editing, or critical publication. This process is often done manually and is costly and tedious. Score editing tools are complex to use, which makes the process error-prone; therefore, several rounds of review are needed before a transcript can be approved. In certain scenarios, such as those related to ancient musical documents, there may not even be suitable tools for the task. All of this entails a great deal of work that is not feasible on a large scale. That is why it is very important to find technologies capable of revaluing the existing musical heritage.

A promising alternative that would overcome the previous challenge is the use of automatic recognition techniques. Here is where Optical Music Recognition (OMR) comes in. OMR is the field of research that investigates how to computationally read music notation in documents [9, 16].

Fig. 1 By using OMR techniques, the content in a digitized image can be encoded in a symbolic format

As seen in Fig. 1, a digitized image can be converted into encoded content automatically by means of OMR. This encoded content is the digital transcription (in terms of music notation symbols) of the score. Thus, an effective OMR system enables the study of existing musical documents for digital humanities. And not only that, but it is the only alternative capable of doing so in reasonable time and cost.

OMR has been an active research field for decades [4, 32]. Traditional approaches to OMR are based on the usual pipeline of sub-tasks that characterizes many artificial vision systems, adapted to this particular task: document pre-processing [7, 29]—including staff-line removal [10, 15]—symbol classification [28, 31], reconstruction of the music notation [27, 30], and output encoding in a suitable symbolic format.

In recent years, there has been a paradigm shift toward the use of Machine Learning (ML) techniques. These techniques make it possible to design flexible and versatile OMR systems capable of solving a wide variety of problems. This is due to the relationship between the purposes of both fields: ML studies how to make machines learn to perform certain tasks, which is exactly what OMR seeks, namely to teach machines to read musical scores.

Recent advances in ML—namely Deep Learning (DL)—which have achieved great results in several visual challenges [24], allow us to be optimistic about developing more accurate and effective OMR systems. The current trend is the use of end-to-end (or holistic) systems that treat the process as a single step, instead of explicitly performing the sub-tasks. Using this approach, training pairs only have to contain the input image and its complete transcription [5, 13], bypassing especially the need to annotate the exact positions of individual symbols.

These approaches typically rely on Convolutional Recurrent Neural Networks (CRNN), which are only able to formulate the output as one-dimensional sequences. This perfectly fits natural language tasks (text or speech recognition, or machine translation) since their outputs mostly consist of character (or word) sequences (see Fig. 2a). However, their application to music notation is not so straightforward due to the presence of different elements sharing the same horizontal position and of long-term dependencies. The vertical distribution of these elements disrupts the linear flow of the timeline (see Fig. 2b). This is not trivial to encode and can significantly hinder recognition systems that rely on the temporal relationships between the recognized elements.

Fig. 2 Differences in reading between text and music

The problem can be drastically simplified by considering that the process will work with each staff independently from the others—a process that could be analogous to the text recognition systems that decompose the document into a series of independent lines [21]. This is not a strong assumption as there are successful algorithms for identifying staves [17]. Even so, we still have to deal with elements that take place simultaneously in the “time” line, like the notes that make up a chord, irregular groups, or expression marks, to name a few.

Within the range of music score complexities, one possible simplification that applies to much sheet music is to assume a homophonic music context. In that case, there are multiple parts, but they move in the same rhythm. Multiple notes can therefore occur simultaneously, but only as a single voice. All the notes starting at the same time thus have the same duration, so the score can be segmented into vertical slices that may contain one or more music symbols (see Fig. 3).

Fig. 3 Homophonic music: All the notes starting at the same time have the same duration

Even in this simplified context, there is a need for a clear and structured output coding that avoids the ambiguities that the representation of a linear output can show in presence of vertical structures in the data (see Fig. 4). This has also been stated in some previous works [6].

Fig. 4 Example of ambiguities that might appear when music symbols belong to a vertical distribution. The two notes that appear together must be played at the same time, but a linear symbol sequence without specific marks cannot be interpreted unambiguously

Fig. 5 Graphical scheme of the CRNN considered for the end-to-end approach. The network is trained by using the CTC loss function

There already exist several structured formats for music representation and coding, such as XML-based music formats [18, 22], which focus on how the score has to be encoded to properly store all its content. This focus makes them unsuitable as the output of an optical recognition system, because the code is full of markings that are irrelevant for a system that analyzes the score graphically. An OMR system is primarily interested in which symbols are present and where they are. Furthermore, these XML-based music languages do not represent sequential data, but rather hierarchical structures, so they are not suitable output formats for DL-based OMR.

Due to that, we have designed a specific coding language to represent the output of end-to-end OMR, based on serializing the music symbols found in a staff of homophonic music. The sequential nature of music reading must be compatible and unambiguous with respect to the representation of vertical distributions. In addition, this representation has to be easy to generate by the system, which analyzes the input sequentially and produces a linear series of symbols.

Preliminary research has been carried out to validate whether this represents a feasible research avenue [2]. However, the serialized ways of encoding the music content have never been tested with real data. This work aims to solve that and properly evaluate the problem. Also, we study and evaluate the possible existing boundaries of learning with synthetic data when recognizing real music scores.

The rest of the paper is organized as follows: Sect. 2 overviews the state-of-the-art recognition framework based on DL, including the ad hoc serializations for homophonic music; Sect. 3 introduces the experimental setup; Sect. 4 shows the results obtained with real homophonic sheet music; finally, Sect. 5 concludes the present work, along with some ideas for future research.

2 Recognition framework

To carry out the OMR task in an end-to-end manner, we follow the state-of-the-art approach based on Convolutional Recurrent Neural Networks (CRNN). These neural architectures permit us to model the posterior probability of generating output symbols, given an input image. Input images are assumed to be single-staff sections, analogously to text recognition that assumes independent lines [21]. As mentioned before, staves can be easily isolated by means of existing methods [17].

A CRNN consists of one block of convolutional layers followed by another block of recurrent layers [33]. The convolutional block is responsible for learning how to process the input image, that is, extracting relevant image features for the task at issue so that the recurrent layers interpret these features in terms of sequences of musical symbols. In this work, the recurrent layers are implemented as Bidirectional Long Short-Term Memory (BLSTM) units [19].

The unit activations of the last convolutional layer can be seen as a sequence of feature vectors representing the input image, \({{\textbf {x}}}\). These features are fed to the first BLSTM layer, and the unit activations of the last recurrent layer are considered estimates of the posterior probabilities for each vector:

$$\begin{aligned} P(\sigma \mid {\textbf{x}},f), \; 1 \le f \le F, \; \sigma \in \Sigma \end{aligned}$$

where \(\textit{F}\) is the number of feature vectors of the input sequence and \(\Sigma \) is the set of considered symbols. Note that \(\Sigma \) must include a “non-character” symbol \(\epsilon \) that acts as a separator when two or more instances of the same musical symbol appear consecutively [19].

Since both convolutional and recurrent blocks can be trained through gradient descent, using the well-known Back Propagation algorithm [34], a CRNN can be jointly trained. However, a conventional end-to-end OMR training set only provides, for each staff image, its corresponding transcription, not giving any type of explicit information about the location of the symbols in the image. It has been shown that the CRNN can be conveniently trained without this information by using the so-called Connectionist Temporal Classification (CTC) loss function [20]. The resulting CTC training procedure is a form of Expectation-Maximization: CTC provides a means to optimize the CRNN parameters so that it is likely to give the correct sequence given an input [20].

Once the CRNN has been trained, an input staff image can be decoded into a sequence of music symbols \({\varvec{\hat{s}}} \in \Sigma ^{*}\). First, the most probable symbol per frame is computed:

$$\begin{aligned} \hat{\sigma }_{i} = \arg \max _{\sigma \in \Sigma } P(\sigma \mid {\textbf{x}}, i), \; 1 \le i \le F \end{aligned}$$

Then, a pseudo-optimal output sequence is obtained as:

$$\begin{aligned} \varvec{\hat{s}} = \arg \max _{{\textbf{s}} \in \Sigma ^{*}} P({\textbf{s}} \mid {\textbf{x}}) \approx {\mathcal {D}}({\hat{\sigma }}_1, \ldots , {\hat{\sigma }}_F) \end{aligned}$$

where \({\mathcal {D}}\) is a function that first merges all the consecutive frames with the same symbol and then deletes the \(\epsilon \) symbols [19].
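The two decoding steps above can be sketched in a few lines. This is only a minimal illustration, assuming the per-frame posteriors are available as dictionaries; the symbol names and the `<eps>` spelling of \(\epsilon \) are hypothetical.

```python
from typing import Dict, List, Sequence

BLANK = "<eps>"  # stands for the "non-character" symbol; this spelling is an assumption


def best_path(posteriors: Sequence[Dict[str, float]]) -> List[str]:
    """Frame-wise arg max: keep the most probable symbol of every frame."""
    return [max(frame, key=frame.get) for frame in posteriors]


def collapse(frame_symbols: Sequence[str], blank: str = BLANK) -> List[str]:
    """Function D: merge runs of identical symbols, then delete the blanks."""
    merged: List[str] = []
    for sym in frame_symbols:
        if not merged or sym != merged[-1]:
            merged.append(sym)
    return [sym for sym in merged if sym != blank]


# Toy posteriors for six frames over a three-symbol vocabulary (made-up values).
frames = [
    {"note": 0.7, "rest": 0.1, BLANK: 0.2},
    {"note": 0.6, "rest": 0.1, BLANK: 0.3},
    {"note": 0.2, "rest": 0.1, BLANK: 0.7},
    {"note": 0.8, "rest": 0.1, BLANK: 0.1},
    {"note": 0.1, "rest": 0.8, BLANK: 0.1},
    {"note": 0.1, "rest": 0.7, BLANK: 0.2},
]
decoded = collapse(best_path(frames))
```

In this toy input the blank in the third frame is what keeps the two `note` occurrences apart; without it, the merge step would fuse them into one.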

A graphical scheme of the framework explained above is given in Fig. 5.

This framework follows the architecture first applied in the work of Shi et al. [33] and later tuned by Calvo-Zaragoza et al. [11]. As stated in the latter work, its expressiveness could be sufficient when working with simple scores where all the symbols have a single left-to-right order. However, we want to extend these approaches so that they are able to model richer scores such as those of homophonic sheet music. In such a case, issues like chords may appear, where several symbols share a horizontal position. As seen in Fig. 4, a one-dimensional sequence is not expressive enough for this situation. That is why in the next section we describe our representation proposal to perform end-to-end OMR for homophonic scores.

2.1 Serialization proposals

The research developed in this paper involves the study of four different deterministic, unambiguous, and serialized representations to encode the kind of scenarios that happen in homophonic music so that the OMR system becomes more effective when recognizing complex music score images. For that, we propose four different types of music representations that differ not in the encoding of the musical symbols itself, but in the way horizontal and vertical distributions of the musical symbols are represented. The grammar for these musical codifications must be unambiguous, allowing us to analyze a given document in only one way.

Our representation does not make assumptions about the musical meaning of what is represented in the document being analyzed; that is, the elements are identified in a catalogue of musical symbols by their shape and where they are placed on the staff. This has been referred to as “agnostic representation,” as opposed to a semantic representation, where music symbols are encoded according to what they represent in terms of music notation [11]. This difference is illustrated in Fig. 6.

As mentioned before, the only difference between the four proposed musical codes is how to represent the horizontal and vertical dimensions. Each one of the four codes has one or two characters that indicate whether, when transcribing the score, the system should move forward, that is, from left to right, or downward, from top to bottom. These characters are referred to as separators.

The four different codes proposed are described as follows:

  • Remain-at-position character code: when transcribing the score, the musical symbols are assumed to be placed left to right, except when they are in the same horizontal position. In that case, they are separated by a slash, “/”. This acts as a remain-at-position character, meaning that the system has to advance downward (see Fig. 7b). This behavior is similar to the backspace of a typewriter: the carriage advances after typing, so to align two symbols the carriage must be kept in a fixed position (by moving it back one position).

  • Advance-position character code: this type of coding uses a “+” sign to force the system to advance forward. When that sign is missing, the output does not move forward and a vertical distribution is being coded (see Fig. 7c).

  • Parenthesized code: when a vertical distribution appears in the score, the system outputs a parenthesized structure, like vertical.start musical_symbol \(\dots \) musical_symbol vertical.end (see Fig. 7d).

  • Verbose code: this last coding is a combination of the first two. It uses the “+” sign as the advance-position character to indicate that the system has to move forward, and the “/” sign as the remain-at-position character to indicate that the system has to advance downward (see Fig. 7e). Here, every two adjacent symbols are explicitly separated by a symbol indicating whether the system must remain in the same horizontal position or advance to the next one.

Note that the codes represent the data unambiguously; thus, it is possible to deterministically translate from any encoding to any other.
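To make the four codes and their mutual translatability concrete, the following sketch serializes a staff given as a list of vertical slices (each slice lists its symbols from top to bottom) and parses the verbose code back. The symbol names are hypothetical, and the exact token placement (for example, a “+” only between slices and never at the end) is an assumption about details the description above leaves open.

```python
from typing import List

Slices = List[List[str]]  # one inner list per horizontal position, top to bottom


def to_remain(slices: Slices) -> List[str]:
    """'/' between symbols that share a position; nothing between positions."""
    out: List[str] = []
    for sl in slices:
        for j, sym in enumerate(sl):
            if j > 0:
                out.append("/")
            out.append(sym)
    return out


def to_advance(slices: Slices) -> List[str]:
    """'+' between positions; nothing between symbols that share a position."""
    out: List[str] = []
    for i, sl in enumerate(slices):
        if i > 0:
            out.append("+")
        out.extend(sl)
    return out


def to_parenthesized(slices: Slices) -> List[str]:
    """Wrap every multi-symbol slice in vertical.start ... vertical.end."""
    out: List[str] = []
    for sl in slices:
        out += ["vertical.start", *sl, "vertical.end"] if len(sl) > 1 else sl
    return out


def to_verbose(slices: Slices) -> List[str]:
    """Explicit separator between every two adjacent symbols."""
    out: List[str] = []
    for i, sl in enumerate(slices):
        if i > 0:
            out.append("+")
        for j, sym in enumerate(sl):
            if j > 0:
                out.append("/")
            out.append(sym)
    return out


def from_verbose(tokens: List[str]) -> Slices:
    """Inverse of to_verbose: '+' opens a new slice, '/' stays in the current one."""
    slices: Slices = [[]]
    for tok in tokens:
        if tok == "+":
            slices.append([])
        elif tok != "/":
            slices[-1].append(tok)
    return slices


staff = [["clef.G-L2"], ["note.quarter-L4", "note.quarter-L2"], ["rest.quarter-L3"]]
```

Any code can be turned into any other by parsing it into slices and serializing again, as `from_verbose` followed by `to_advance` does here.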

Fig. 6 Semantic and agnostic representations, respectively, of the two eighth notes forming a beam group (highlighted in red). The agnostic representation only provides graphic information (that is, the line or space of the staff on which the note is placed, or the direction of its beam), as opposed to the musical information (pitch and type of note) of a semantic representation

Fig. 7 Musical excerpt presenting a number of different situations where vertical distributions occur and its transcription using the proposed codifications

3 Experimental setup

In this section, we present the synthetic training dataset and the real RISM test dataset, the neural model architecture, and the evaluation protocols used.

3.1 Synthetic data: corpus generation

As introduced above, the current trend for the development of OMR systems is to use DL techniques able to infer the transcription from correct examples of the task, namely, a set of pairs (\({{\textbf {x}}}\), \({{\textbf {s}}}\)). Given the complexity of music notation, these techniques need a sufficiently large set to produce satisfactory results. To achieve this, a system for the automatic generation of labeled data has been developed [1] by using algorithmic composition techniques [26]. The developed system provides two outputs: the expected transcription of the generated score in any of the encodings described in Sect. 2.1, and the corresponding score image artificially distorted as in [12]. With both outputs, the necessary pairs for the DL algorithm are obtained.

For the generation system, three different methods of algorithmic composition are used to obtain compositions with diverse musical features. Furthermore, the range of pitches that can be coded according to the clef is limited in order to achieve a score with as much musical coherence as possible. In the end, in music scores, according to the clef, there is a range of pitches that is more common than another. A range of 22 different pitches is chosen for each clef (see Fig. 8). The pitch series defined for each clef are major or minor diatonic scales in some keys. The clefs that can be encoded by the system are: G1-clef, G2-clef, F4-clef, C1-clef, C2-clef, C3-clef, and C4-clef.

Fig. 8 Range of the 22 pitches coded for the 3 system clefs considered

The score generator creates musical events one at a time. Such an event might be sound (by default 90% of the time) or silence (10% of the time). The type of sound or silence event is chosen from a catalog of possible music symbols (from sixteenth to whole notes, as well as silences, beamed notes, or chords comprising up to a maximum of 3 notes, triplets, etc.) following a random process. The pitch of the sound events is conditioned to one of three possible algorithmic composition methods described below. Only a general outline of the score generator is given, as more specific details are beyond the scope of this work. For complete details about the implementation, please refer to [1].
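The event selection just described can be sketched as follows. The catalogs are shortened, hypothetical stand-ins for the real one in [1], and choosing uniformly within a catalog is an assumption, since the text only says the choice follows a random process.

```python
import random

# Hypothetical, shortened catalogs; the actual catalog is documented in [1].
SOUND_EVENTS = ["note.sixteenth", "note.eighth", "note.quarter", "note.half", "note.whole", "chord"]
SILENCE_EVENTS = ["rest.sixteenth", "rest.eighth", "rest.quarter", "rest.half", "rest.whole"]


def next_event(rng: random.Random, p_sound: float = 0.9) -> str:
    """Emit one event: sound with probability p_sound, silence otherwise."""
    catalog = SOUND_EVENTS if rng.random() < p_sound else SILENCE_EVENTS
    return rng.choice(catalog)  # uniform choice is an assumption


rng = random.Random(0)
events = [next_event(rng) for _ in range(10_000)]
sound_share = sum(e in SOUND_EVENTS for e in events) / len(events)
```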

3.1.1 Random generation according to the normal distribution

The normal distribution, also known as the Gaussian or Laplace-Gauss distribution, is a probability distribution of a continuous variable. It is widely applied in engineering, physics, and the social sciences because it allows us to model numerous natural, social, and psychological phenomena.

The normal distribution is defined in Eq. 4, where 0 and N represent the extreme values of the domain, being, in our case, 0 and 21, respectively—integers associated with specific pitches depending on the clef used. This way, the range of pitches associated with each clef links each pitch to an integer in [0, 21] so that 0 maps to the lowest pitch of the range and 21 to the highest pitch.

$$\begin{aligned} f_{[0,N]}(x) = N(\mu ,\sigma ) = \frac{\exp \left( {-\frac{(x-\mu )^2}{2\sigma ^2}}\right) }{\sqrt{2\pi }\sigma } \end{aligned}$$

We use the N(10.5, 6.5) distribution that provides a symmetric distribution centered on the mean, which corresponds to the space surrounding the central line of the staff. The mean defines the location of the peak for normal distributions, but, in our case, since the mean is a decimal value, it is not associated with any pitch, and therefore the highest probabilities are given to both rounded up (third line of the staff) and rounded down (third space of the staff) values of the mean. The probability is minimum at the extreme values, being \(f_{[0,N]}(0)=f_{[0,N]}(N)=0.0166\), i.e., 1.66% probability of occurrence for those pitches. This is illustrated in Fig. 9.

Fig. 9 Gaussian distribution for the automatic generation of music data

As we can see in Fig. 9, the graph of the density function has a bell shape and is symmetric with respect to the mean. This curve is known as the Gaussian bell and is the graph of a Gaussian function.
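As a quick check of the figures quoted for N(10.5, 6.5), the sketch below evaluates the density at the extremes of the range and draws pitch indices from it. How the continuous value is turned into one of the 22 pitch indices is not stated in the text; rounding to the nearest integer and redrawing out-of-range values are assumptions made here.

```python
import math
import random

MU, SIGMA, N = 10.5, 6.5, 21  # mean, standard deviation, and highest pitch index


def density(x: float, mu: float = MU, sigma: float = SIGMA) -> float:
    """Normal density evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def sample_pitch(rng: random.Random) -> int:
    """Draw a pitch index in [0, N]; rounding and rejection are assumptions."""
    while True:
        idx = round(rng.gauss(MU, SIGMA))
        if 0 <= idx <= N:
            return idx


rng = random.Random(0)
pitches = [sample_pitch(rng) for _ in range(1_000)]
```

Evaluating `density(0)` and `density(N)` gives the same value, approximately 0.0166, in agreement with the figure quoted above.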

Fig. 10 shows an example of a musical excerpt created by the automatic generation system when the normal distribution composition method is used.

3.1.2 Random walk

A random walk is a mathematical formalization of the trajectory that results from making successive random steps. In this system, the random walk always starts at the central pitch of the pitch range determined for the system. There are three possible random steps (all equally likely) after emitting a pitch:

  1. One-step forward: The pitch that follows is the next higher in the defined pitch series.

  2. One-step backward: The pitch that follows is the next lower in the defined pitch series.

  3. No step: The pitch that follows is the same as the current one.

There are situations in which moving forward or backward will not be allowed because the pitch to be coded would be outside the established range. In these situations, two solutions are given:

  • Reflective limit: as its name indicates, it works like a mirror, so that the step taken is the reflection of the one initially intended. If the movement was intended to be forward, it will now be backward, and vice versa. That is, the pitch of the current note is the second of the range starting either from the upper limit or from the lower one, as appropriate.

  • Absorbing limit: the pitch of the current note is that corresponding to the upper or lower limit, as appropriate, of the pitch range.

The solution is chosen randomly, both being equally probable.
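The walk and its two boundary rules can be sketched as below, using pitch indices 0 to 21. Since a 22-pitch range has no single middle element, starting at index 10 is an assumption.

```python
import random
from typing import List

LOW, HIGH = 0, 21  # lowest and highest pitch indices of the 22-pitch range


def random_walk(length: int, rng: random.Random, start: int = (LOW + HIGH) // 2) -> List[int]:
    """Generate `length` (>= 1) pitch indices with steps of -1, 0, or +1."""
    pitches = [start]
    while len(pitches) < length:
        candidate = pitches[-1] + rng.choice((-1, 0, 1))
        if candidate > HIGH:
            # Reflective limit -> second pitch from the top; absorbing -> the top itself.
            candidate = rng.choice((HIGH - 1, HIGH))
        elif candidate < LOW:
            # Reflective limit -> second pitch from the bottom; absorbing -> the bottom itself.
            candidate = rng.choice((LOW + 1, LOW))
        pitches.append(candidate)
    return pitches


walk = random_walk(500, random.Random(1))
```

By construction, consecutive pitches never differ by more than one position in the pitch series, which is what makes the resulting melodies conjunct.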

Fig. 11 shows an example of a musical excerpt created by the automatic generation system when the random walk composition method is used.

3.1.3 Sonification of the logistic equation

The logistic equation is defined by Eq. 5:

$$\begin{aligned} x_{n+1}=r\; x_n (1-x_n) \quad {where} \; n = 0, 1, 2, 3,... \end{aligned}$$

This equation defines an iteration, where \(x_0\) is equal to 0 and the parameter r is a value between 0 and 4. The resulting value will always be in [0, 1].

When this method is used, the value for r should be between 3.5 and 4, since it is the range of values for which the most interesting note sequences are generated. The sequences for \(3 \le r \le 3.5\) produce repetitions of 2 or 4 pitches, with no variability. Values for \(r <3 \) generate constant pitch (unison) sequences after a short transition period.
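The behaviour of the iteration for different values of r can be explored with the sketch below. Two points are assumptions rather than details from the text: the mapping from a value in [0, 1] to one of the 22 pitch indices (simple scaling and truncation), and the seed 0.3 used in the usage lines. A nonzero seed is needed for the sketch because an orbit started at exactly 0 stays at 0 for every r.

```python
from itertools import islice
from typing import Iterator, List

N_PITCHES = 22


def logistic_orbit(r: float, x0: float) -> Iterator[float]:
    """Yield x_0, x_1, x_2, ... for x_{n+1} = r * x_n * (1 - x_n)."""
    x = x0
    while True:
        yield x
        x = r * x * (1.0 - x)


def to_pitch(x: float, n_pitches: int = N_PITCHES) -> int:
    """Map a value in [0, 1] to a pitch index (scaling and truncation are assumptions)."""
    return min(int(x * n_pitches), n_pitches - 1)


def melody(r: float, x0: float, length: int, skip: int = 0) -> List[int]:
    """Pitch indices for `length` iterates, after discarding the first `skip`."""
    return [to_pitch(x) for x in islice(logistic_orbit(r, x0), skip, skip + length)]


varied = melody(3.75, 0.3, 200)           # r in the recommended interval
unison = melody(2.5, 0.3, 10, skip=200)   # r < 3: settles on a single pitch
```

For r = 2.5 the orbit converges to the fixed point 1 - 1/r = 0.6, so after the transition period every iterate maps to the same pitch index.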

Fig. 12 shows an example of a musical excerpt created by the automatic generation system when the sonification of the logistic equation composition method is used.

Fig. 10 Music snippet created by the automatic generation system using the normal distribution (N(10.5, 6.5)) composition method

Fig. 11 Music snippet created by the automatic generation system using the random walk composition method

Fig. 12 Music snippet created by the automatic generation system using the sonification of the logistic equation (\(r=3.75\)) composition method

The first and the last methods generate skipwise or disjunct melodic motions, characterized by frequent skips between notes, whereas the second method produces stepwise or conjunct melodic motions, in which no interval is ever greater than four semitones. These differences lead us to ask whether the pitch generation method used to build the training set affects how the DL model learns the OMR task. Since we want to find the most favorable scenario for the OMR learning task, we consider this an issue that needs further investigation and address it together with the coding proposals in the next section.

3.2 Real data

We aim to find the most favorable scenario for the OMR learning task when training with homophonic synthetic scores and testing with homophonic real scores. For a proper evaluation, we consider a sufficiently large set of real data taken from the RISM repository that contains the same symbols that our score generator is able to produce. The selected corpus contains \(1\,954\) real music staves of homophonic incipits. For each incipit, we provide an image of the rendered score with artificial distortions (the same as those used for the synthetic data) as well as the expected transcription in any of the encodings described in Sect. 2.1. Figure 13 depicts an example of a particular staff from this corpus. It must be noted that the Camera-based Printed Images of Music Staves (Camera-PrIMuS) database [12] is not suitable for the present work since it contains only monophonic RISM-based music scores.

Fig. 13 Staff sample from the selected RISM corpus. Incipit RISM ID no. 000139189111

Table 1 Layer-wise description for the CRNN architecture considered. Notation: Conv\((f,w\times h)\) stands for a convolution layer of f filters of size \(w\times h\) pixels, BatchNorm performs the normalization of the batch, LeakyReLU\((\alpha )\) represents a Leaky Rectified Linear Unit activation with a negative slope value \(\alpha \), MaxPool\((w\times h)\) represents the max-pooling operator of \(w\times h\) dimensions and striding factors, and BLSTM\((n,d)\) denotes a bidirectional Long Short-Term Memory unit with n neurons and a dropout value of d

The selected RISM set amounts to a total of \(63\,011\) music symbols, representing 341 different classes. Of these symbols, \(15\,699\) belong to a vertical distribution, that is, two or more symbols that share the same horizontal position. There are \(7\,546\) vertical distributions.

3.3 Neural network configuration

As mentioned in Sect.  2, the neural model considered in this work is based on the architecture by Calvo-Zaragoza et al. [11]. In this sense, while the configuration is broadly described in Sect.  2, the actual composition of each layer is depicted in Table 1.

The model is trained by backpropagation on the CTC loss for 300 epochs using the ADAM optimizer [23] with a fixed learning rate of 0.001 and a batch size of 16 elements.

3.4 Evaluation protocol

Concerning evaluation metrics, there is an open debate on how to evaluate the capabilities of OMR systems [8, 25]. In this work, OMR is simply understood as a pattern recognition task, so we shall consider metrics that allow us to draw reasonable conclusions from the experimental results. Due to that, the performance of the recognition schemes presented is assessed by considering the symbol error rate (SER, %) as utilized in previous works addressing end-to-end transcription tasks [12]. This figure of merit is computed as the average number of elementary editing operations (insertions, deletions, or substitutions) necessary to match the sequence predicted by the model with the ground truth sequence, normalized by the length of the latter.
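One common way to write this figure of merit is given below. It assumes the normalization is done over the whole test set rather than per staff, which the description above leaves open; the symbols M (number of test staves) and ED (edit distance) are introduced here only for illustration.

$$\begin{aligned} \text {SER}\,(\%) = 100 \cdot \frac{\sum _{i=1}^{M} \text {ED}\left( \varvec{\hat{s}}_i, {\textbf{s}}_i\right) }{\sum _{i=1}^{M} \left| {\textbf{s}}_i\right| } \end{aligned}$$

Here \(\varvec{\hat{s}}_i\) is the sequence predicted for the i-th staff, \({\textbf{s}}_i\) is its ground truth, and \(|{\textbf{s}}_i|\) is the length of the latter.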

Table 2 Results obtained for each encoding when models are tested on the \(1\,954\) selected RISM incipits. Remain stands for remain-at-position character coding, Advance stands for advance-position character coding, Parenthesized stands for parenthesized coding, and Verbose stands for verbose coding. Best results are highlighted in bold type

The number of separators used in the transcription to deal with simultaneities in the horizontal dimension (the “time” line) is different for the four proposed encodings. This fact might have implications for learning. Hence, we also report the results restricted to the symbols that are not separators; we refer to this restricted metric as NonSep-SER. To obtain more insights into the system’s performance on simultaneities as opposed to monophonic segments, we also show the proportion of NonSep-SER caused by the editing operations that happen in regions (i) that belong to a simultaneity in the ground truth, defined as Sim-SER, and (ii) that are monophonic in the ground truth, denoted as NonSim-SER.
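The sketch below shows one way SER and NonSep-SER could be computed. Two choices are assumptions: the error is normalized over the whole set rather than per staff, and NonSep-SER is obtained by deleting the separator tokens from both sequences before aligning them. The token names in the example are hypothetical.

```python
from typing import List, Sequence

SEPARATORS = {"+", "/", "vertical.start", "vertical.end"}


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Levenshtein distance over tokens (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = curr
    return prev[-1]


def ser(hyps: Sequence[Sequence[str]], refs: Sequence[Sequence[str]]) -> float:
    """Symbol error rate in percent, normalized by the total reference length."""
    edits = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    return 100.0 * edits / sum(len(r) for r in refs)


def strip_separators(seq: Sequence[str]) -> List[str]:
    return [tok for tok in seq if tok not in SEPARATORS]


def nonsep_ser(hyps: Sequence[Sequence[str]], refs: Sequence[Sequence[str]]) -> float:
    """SER restricted to non-separator symbols (separators removed before alignment)."""
    return ser([strip_separators(h) for h in hyps], [strip_separators(r) for r in refs])


# Advance-position example: the prediction misses the lower note of a chord.
ref = ["clef", "+", "note.A", "note.B", "+", "rest"]
hyp = ["clef", "+", "note.A", "+", "rest"]
```

In this example the single missing note costs 1 edit out of 6 reference tokens under SER but 1 out of 4 under NonSep-SER, which illustrates why codes with more separators can look better under plain SER.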

4 Results

This work aims to provide insights into (i) which serialized ways of encoding the music content presented in Sect.  2.1 are more suitable for recognizing real homophonic music scores and (ii) how the use of synthetic training data affects the transcription of the previously mentioned data. For that, we specifically consider two evaluation cases: a first one, denoted as Best Encoding Experiment, devoted to finding the most suitable code out of the four proposed in Sect.  2.1 for the OMR output; and a second one, named Best Algorithmic Composition Method Experiment, that studies the most appropriate composition method for the OMR learning task.

It must be noted that for all the considered scenarios the evaluation set refers to the real homophonic scores collected from RISM. Hence, the synthetic data generated are created in a way that it contains a similar number of measures per staff, symbols per staff, and vertical distributions per staff—all of them on average—as the selected RISM corpus. Moreover, the generated data are distorted in the same way as the RISM data are.

4.1 Best encoding experiment

The experiment aims to determine which of the four serialization proposals works best as the output for the OMR task in a homophonic music scenario. To do so, a corpus of \(1\,500\) labeled scores, each consisting of a single staff, is generated using the system for automatic generation of labeled data explained in the previous section. Each sample is a pair composed of the image with a rendered staff and its corresponding representation with the format imposed by one of the four musical encodings proposed, like in the example shown in Fig. 7. As explained above, the operation of the three composition methods of the automatic generation system makes them produce heterogeneous music scores. That is why the three of them are equally used when generating this corpus so that it is not biased in favor of any particular style. We refer to this procedure as the mix composition method.

We derive two non-overlapping partitions (train and validation) corresponding to \(60\%\) and \(40\%\) of the data, respectively, following a fivefold cross-validation scheme. Each fold is tested on the selected RISM corpus described in Sect. 3.2. Table 2 presents the results in terms of the SER metric; for each of the considered cases, the figures are the average test values obtained at the point where the validation data achieve their best performance.
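The reporting rule can be read as follows: for every fold, take the test error at the epoch where the validation error is lowest, then average over folds. The sketch below illustrates that reading; it is an interpretation, not a description of the actual experimental code, and the numbers in the example are made up.

```python
from statistics import mean
from typing import List, Tuple


def score_at_best_validation(val_ser: List[float], test_ser: List[float]) -> float:
    """Test error at the epoch with the lowest validation error."""
    best_epoch = min(range(len(val_ser)), key=val_ser.__getitem__)
    return test_ser[best_epoch]


def cross_validated_score(folds: List[Tuple[List[float], List[float]]]) -> float:
    """Average, over folds, of the test error selected by validation."""
    return mean(score_at_best_validation(val, test) for val, test in folds)


# Illustrative per-epoch curves for two folds: (validation SER, test SER).
toy_folds = [
    ([30.0, 20.0, 25.0], [40.0, 35.0, 33.0]),
    ([28.0, 26.0, 22.0], [39.0, 36.0, 37.0]),
]
reported = cross_validated_score(toy_folds)
```

Note that the selected test value (35.0 in the first toy fold) is not necessarily the lowest test value of that fold (33.0), since the test data play no role in the selection.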

An initial remark is that the results in Table 2 indicate that the neural network is indeed learning from synthetic data but, as seen in previous efforts [2], the encoding of the output for this OMR task plays an important role in the training and, consequently, in the recognition performance. The advance-position character coding achieves the best results for the NonSep-SER. The remain-at-position character and the parenthesized codes follow closely, both with similar results, while the verbose code ranks last. This order is altered when the results are observed in terms of the SER metric: wordy encodings are favored, which leads to misleading conclusions. In other words, a wordier code, e.g., the verbose encoding, could predict all the separator symbols correctly while missing the remaining symbols (the ones that really matter to the user) and still achieve a lower error rate than a less verbose encoding, e.g., the remain encoding.

We believe that the fact that the advance encoding performs better is due to how the CTC loss function works: The system reads vertical slices and outputs the symbols present on them. In the case of contiguous input frames containing only staff lines, the system could output either nothing or an “empty-output” symbol, such a symbol being the “+” advance separator. On the opposite side, we find that the remain and parenthesized codes overload the learning process as we are forcing the system to output more symbols in the same situation. This idea is reinforced by the verbose encoding’s results. Therefore, neither the remain-at-position character code nor the parenthesized code nor the verbose code is a suitable choice for transcribing the content of homophonic scores with the CTC objective function.

As the last remark, we would like to point out that the results obtained suggest that simultaneities do not present a recognition problem by themselves. When decomposing the NonSep-SER into its two component fractions, Sim-SER and NonSim-SER, Table 2 reports that the highest proportion of errors occurs in monophonic zones.

To support the relevance of those statements, we shall now assess the results in terms of statistical significance. For that, we resort to the nonparametric Wilcoxon signed-rank test [14]. This analysis considers that each result obtained for each fold constitutes a sample of the distributions to compare. Considering this assessment scheme, the results obtained are reported in Table 3.
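As a minimal sketch of this assessment scheme, the exact signed-rank test can be computed for a handful of paired per-fold results by enumerating all sign assignments. The one-sided alternative, the function names, and the per-fold values below are assumptions chosen for illustration; they are not results from this work.

```python
from itertools import product


def average_ranks(values):
    """1-based ranks of `values`, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_less(a, b):
    """Exact one-sided test of H1: paired errors in `a` are lower than in `b`.

    Returns (W+, p), where W+ is the rank sum of the positive differences.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences dropped
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    n = len(diffs)
    # Under H0 every sign pattern is equally likely; count those as extreme.
    as_extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) <= w_plus
    )
    return w_plus, as_extreme / 2 ** n


# Hypothetical per-fold error rates (%) for two encodings, five folds each.
errors_a = [10.1, 9.8, 10.4, 9.9, 10.0]
errors_b = [11.0, 10.9, 11.2, 10.5, 11.3]

w_plus, p_value = wilcoxon_less(errors_a, errors_b)
print(w_plus, p_value)
```

With five paired samples, the smallest one-sided exact p-value is \(1/32 \approx 0.031\), reached only when one method is better in every fold, as in this toy example.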

Table 3 Statistical significance analysis of the different presented encoding schemes considering the Wilcoxon signed-rank test with a significance value of \(p < 0.05\) for the symbol error rate metric when the corresponding separator symbols are excluded from the computation, i.e., NonSep-SER. Symbols <, >, and \(=\) represent that the error of the method in the row is significantly lower than, greater than, or no different to that in the column, respectively

The results obtained with a significance value of \(p < 0.05\) show that the advance-position character coding has significant differences with respect to the other representations and therefore it will be used in all further experiments.

4.2 Best algorithmic composition method experiment

This experiment aims to identify which algorithm(s) for the synthetic generation of data work best on real data. For this purpose, we also generate a corpus of \(1\,500\) labeled scores for each of the three remaining composition methods, given that the one for the mix algorithm has already been generated. The results obtained in terms of the different considered metrics are presented in Table 4. Note that the figures provided represent the average values on the test partition, measured when the validation data achieve their best performance for each of the considered cases.

Table 4 Results obtained for each composition method when models are tested on the \(1\,954\) selected RISM incipits. Normal stands for normal distribution method, Random stands for random walk method, Logistic stands for sonification of the logistic equation method, and Mix stands for mix method (it refers to the equal use of the three previous methods in the data set). Best results are highlighted in bold type

The results reveal the following conclusions: (1) the sonification of the logistic equation method does not work well by itself, since about 40% of the predicted symbols are wrong; (2) the normal distribution method halves the SER compared to the previous method; and (3) the random walk method and the mixing of data from all methods (mix method) further enhance that improvement, with both methods showing similar values. This trend is visible in all the figures of merit considered. To support the relevance of those statements, we also assess the results in terms of statistical significance following the nonparametric Wilcoxon signed-rank test introduced in Sect. 4.1. The analysis is reported in Table 5.

Table 5 Statistical significance analysis of the different presented composition methods considering the Wilcoxon signed-rank test with a significance value of \(p < 0.05\) for the symbol error rate metric. Symbols <, >, and \(=\) represent that the error of the method in the row is significantly lower than, greater than, or no different to that in the column, respectively

The results obtained with a significance value of \(p < 0.05\) show that the random and mix methods significantly outperform the other composition strategies while showing no significant differences between them. This implies that these two composition algorithms are the most suitable for generating synthetic training data.

We would like to reduce the error figures on the selected RISM corpus by exploiting the fact that we have an “infinite” generator; that is, thanks to the automatic generation system, we will always be able to generate new data that the neural network has probably not seen before. To gain insight into this issue, we first compute the best achievable performance with real training data, that is, the lower error bound that we want to surpass. For that, we derive three non-overlapping partitions (train, validation, and test) corresponding to \(60\%\), \(20\%\), and \(20\%\) of the \(1\,954\) selected RISM incipits, respectively, following a fivefold cross-validation scheme. We train a model using those partitions. We compare it with those trained with \(1\,500\), \(3\,000\), \(15\,000\), and \(150\,000\) samples generated using the random walk method and the mix method, respectively, when evaluated over the same RISM test partition. Table 6 reports the results obtained.

Table 6 Results obtained for each composition method when models are trained using different set sizes and tested on the same RISM samples as the real-only model. Best results are highlighted in bold type

The results reveal an exponential decay in the various figures of merit considered: while going from \(1\,500\) to \(15\,000\) samples improves the SER by 5 points, multiplying the size of the data by 10 a second time achieves an improvement of less than 1 point. This suggests that in the second zone we reach the plateau of the curve (see Fig. 14). In other words, there exists a glass ceiling when recognizing real scores with synthetic-trained models. The nonparametric Wilcoxon signed-rank test introduced in Sect. 4.1 reinforces this finding by indicating that the error rates are not significantly different.
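One way to make the plateau explicit is a saturating exponential. The functional form and the symbols \(a\), \(b\), \(c\), and \(N\) are assumptions introduced here for illustration only, since the fitted model is not stated in the text:

\[
\mathrm{SER}(N) \approx c + a\,e^{-bN}, \qquad a, b, c > 0,
\]

where \(N\) is the number of synthetic training samples and \(c\) is the asymptotic error, i.e., the glass ceiling. Under this assumption, a nonlinear least-squares fit over the measured pairs \((N_i, \mathrm{SER}_i)\) would choose the parameters as

\[
(\hat{a}, \hat{b}, \hat{c}) = \arg\min_{a,b,c} \sum_i \left( \mathrm{SER}_i - c - a\,e^{-bN_i} \right)^2 .
\]

Since \(a\,e^{-bN} \rightarrow 0\) as \(N\) grows, adding synthetic data can only bring the error closer to \(c\), never below it.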

Fig. 14 Average results obtained on the RISM data for different sizes of the synthetic training corpus. The synthetic procedure follows the mix method since it has achieved, on average, lower error rates (see Table 6). Nevertheless, the conclusions can be extrapolated to the random walk method. Nonlinear least squares is used to fit the data

Fig. 15 Average symbol error rate (%) attained for each scenario with respect to the number of randomly selected RISM training staves, L

Table 7 Results obtained in terms of the symbol error rate for each scenario with respect to the number of randomly selected RISM training staves, L. Best results are highlighted in bold type

The lower error bound found, of around \(12\%\), might be due to the underlying (musical) language model of the composition algorithms. The synthetic corpora are generated so that they contain, on average, a similar number of measures per staff, symbols per staff, and vertical distributions per staff as the selected RISM corpus, as well as the same graphical appearance. However, this might not be enough, as the synthetic data distribution fails to capture all the characteristics of the real data. Table 6 shows that training with real homophonic scores yields an error rate of \(3.3\%\). This is less than one-third of the errors made by the best model trained on synthetic data only.

We would like to stress again that simultaneities do not represent a problem. When the training data correctly capture the distribution of the test data, the error rate in the simultaneities and in the monophonic segments is roughly the same (see RISM column in Table 6). At the same time, this reinforces our idea about the glass ceiling. For the rest of the models, the error rate is higher in monophonic regions since the synthetic distribution does not properly model that of the RISM data. The underlying problem is one of out-of-distribution learning, for which no satisfactory solution is known at this time, at least for the CRNN-CTC framework.

To validate our intuition about the cause of the glass ceiling, we start from the premise that one possible solution would be to add scores from the test distribution to the training set. For that, we derive three non-overlapping partitions (train, validation, and test) corresponding to L, L, and \(1\,954 - 2L\) of the \(1\,954\) selected RISM incipits, respectively, where L is the number of randomly selected samples. We add the train and validation sets to the corresponding synthetic train and validation partitions of the random and mix corpora of size \(150\,000\), respectively, and use the test set to evaluate such synthetic-with-real models. We compare those models with a model trained only with the aforementioned RISM partitions. We want to see (i) whether the glass ceiling can be broken by using real samples and (ii) how many of them are needed if (i) proves to be the case. It must be noted that a fivefold cross-validation scheme is followed for this experiment to ensure that the results are not conditioned by the randomly selected samples. The results obtained in terms of the SER metric for the contemplated scenarios are presented in Fig. 15 and Table 7.
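The partitioning just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the identifier format, the seed handling, and the tiny placeholder synthetic lists are all hypothetical, and only the partition sizes (L, L, and the remainder of the \(1\,954\) incipits) come from the text.

```python
import random


def split_real_samples(sample_ids, L, seed=0):
    """Return (train, validation, test) of sizes L, L, and len(sample_ids) - 2L."""
    if L < 0 or 2 * L >= len(sample_ids):
        raise ValueError("L must leave a non-empty test partition")
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)  # random selection of real samples
    return shuffled[:L], shuffled[L:2 * L], shuffled[2 * L:]


# Hypothetical identifiers for the selected real incipits.
real_ids = [f"rism_{i}" for i in range(1954)]
real_train, real_val, real_test = split_real_samples(real_ids, L=50)

# Placeholder synthetic partitions; the real ones are far larger.
synthetic_train = ["synth_train_0", "synth_train_1"]
synthetic_val = ["synth_val_0"]

# Synthetic-with-real scenario: real samples are appended to each partition.
mixed_train = synthetic_train + real_train
mixed_val = synthetic_val + real_val
print(len(real_train), len(real_val), len(real_test))
```

Repeating the split with different seeds would give the several random selections over which the results are averaged; the real-only scenario uses `real_train` and `real_val` on their own.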

First, it is necessary to state that the glass ceiling caused by the synthesis process can be broken by incorporating real scores into the synthetic train partition. This outcome supports the initial premise that the lower error bound attained with synthetic-only models was due to the synthesis process. When combined with synthetic data, only 50 real scores are needed to decrease the \(17\%\) error rate to \(10\%\), and 100 to halve it. When such small sets of real scores are used on their own, the real-only model is not able to solve the transcription task. Such a model starts to do so after 150 samples, and even so, using either only synthetic data or combining it with real data still yields better results. This implies that the cost of manually labeling some samples is compensated, as combining synthetic and real data brings out synergies that help reduce the baseline error rate set by the real-only model. Moreover, as analyzed in [3], the effort of subsequently correcting the errors of the synthetic-and-real models is more than offset by the time saved with respect to the real-only model, as the latter needs more manually transcribed training samples.

Regarding the synthesis method, it can be seen in both Fig. 15 and Table 7 that, even though the performance of the two methods is quite similar in all cases, the mix method tends to achieve slightly lower error rates.

5 Conclusions

In this work, we have studied the suitability of the state-of-the-art end-to-end neural approach to recognize real homophonic music scores by presenting and analyzing four different encodings for the OMR output. Throughout the research, we have trained the neural network with synthetic scores created by a system for the automatic generation of labeled data. Such an infinite music data generator helps to dramatically reduce the cost of acquiring scores for training OMR systems.

As reported in the first part of the experiments, our serialized ways of encoding the music content prove to be appropriate for our DL-based OMR, as the learning process is successful and low SER figures are eventually attained. In addition, it is shown that the choice of the encoding has some impact on the lower bound of the error rates that can be achieved: the advance-position character code is the one that most benefits the learning process in the recognition of vertical structures found in a homophonic music environment. These facts reinforce our initial claim that the encoding of the output for OMR deserves further consideration within the end-to-end DL paradigm.

It has also been possible to demonstrate that the algorithmic composition method used in the creation of synthetic music has a strong influence on the recognition results, with the random walk method and the mix method being the most suitable algorithmic composition techniques. However, although the learning process was successful, there exists a glass ceiling when recognizing real scores: the SER never decreased below 11%, despite greatly increasing the size of the training set. To break the glass ceiling, the use of real sheet music is necessary. This indicates that there is a part of the learning process that is not related to the graphical aspects of the scores but to the underlying (musical) language model. We believe this opens up new avenues for research, for example, designing more intelligent systems for the automatic generation of labeled data. It might be convenient to first learn some characteristics of the language model of the music that will be recognized at a later time in order to generate music scores that follow such a particular style. Then, the glass ceiling could be broken and the advantage of having an infinite data generator could be exploited to the fullest.