1 Introduction

Transcribing the content of musical documents into structured formats brings benefits to digital humanities and musicology, as it enables the application of algorithms that rely on symbolic music data and makes musical score libraries more browsable. Given the cost of manual transcription, transcribing large historical archives by hand is unaffordable. In this scenario, the reading of music notation invites automation, much in the same way as modern technology in the fields of Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) has enabled the automatic processing of written texts. The field of Optical Music Recognition (OMR) covers the automation of this computational reading in the context of music [1].

Holistic approaches, also referred to as end-to-end approaches, have begun to dominate sequence labeling fields, with notable examples such as HTR and Automatic Speech Recognition. In OMR, these approaches have proved successful in contexts in which the music notation to be retrieved can be easily expressed as a sequence. This applies to monophonic scores, or to legacy music-notation languages in which different voices were written individually. However, many compositions are written using grand staves, i.e., a combination of two staves put together, such as those used for the piano (see Fig. 1). In the related literature, this kind of score is also referred to as pianoform [1,2,3]. To date, no end-to-end system has attempted to recognize the content of this type of score.

This work proposes the first end-to-end recognition approach for pianoform scores. This constitutes a first step in the application of holistic models to the full spectrum of OMR applications. We consider a neural approach inspired by state-of-the-art full-paragraph HTR research, with which the OMR problem shares some of its challenges. This approach provides a serialization of the scores based on textual encodings of music notation. In addition, since this is the first attempt to address this problem, this work also introduces the GrandStaff dataset, a large corpus of isolated grand staves rendered from real symbolic data. In order to introduce more variability, the images are provided both in perfect condition and in a version augmented with computer vision techniques so as to resemble the distortions of a real optical capturing process.

In our experiments, we consider (1) various neural schemes that differ as regards the way in which they process the sequential character of the input, (2) several means of encoding the output sequence, and (3) different scenarios according to the graphic quality of the samples. All of the above enables our work to establish the first baseline for end-to-end pianoform OMR, along with a solid benchmark for future research.

Fig. 1

Example of a grand staff for the piano, which consists of the combination of two staves that are played simultaneously when interpreting the music score

The remainder of this work is structured as follows: Sect. 2 provides a brief review of how OMR has been addressed in the recent past. The GrandStaff dataset is then presented in Sect. 3, in which we define the music notation representations used in the corpus and detail the process applied in order to generate its samples. The proposed end-to-end OMR approach is then presented in Sect. 4. The experimental setup, in which all the implementations and evaluation metrics are defined, is described in Sect. 5, while the results attained are analyzed in Sect. 6. Finally, the conclusions of this work, along with future research avenues, are discussed in Sect. 7.

2 Background

Given its complexity, the OMR process has traditionally been divided into several stages that are tackled independently [4]. Fundamentally, there is a first set of stages in which basic symbols such as note heads, beams, or accidentals (usually referred to as “primitives”) are detected. This involves processing the input image in order to isolate and categorize these components, which is not straightforward owing to the presence of artifacts such as staff lines and composite symbols [5]. In the second set of stages, the syntactic relationships among the different primitives are inferred so as to recover the structure of the music notation. These stages have traditionally been solved by employing a combination of image processing techniques and heuristic strategies based on hand-crafted rules [6]. More recently, these same stages have been approached independently through the use of Deep Learning. This has greatly improved the performance of each individual task [7, 8], but has not, in turn, contributed equally to the advancement of the field as a whole. Multi-stage solutions have, in general, proved to be insufficient [1, 2].

Deep Learning has also diversified the way in which OMR is approached as a whole: there are now alternative pipelines with their own ongoing research that attempt to confront the whole process in a single step. This holistic paradigm, also referred to as end-to-end formulation, has begun to dominate the current state of the art in other applications, such as the recognition of text, speech, or mathematical formulae [9,10,11]. However, the complexity of inferring music structures from the image currently makes it difficult to formulate OMR as an end-to-end learnable optimization problem. While end-to-end systems for OMR do exist, they are generally limited to monophonic music notation [12,13,14].

Some approaches have recently managed to extend end-to-end formulations in order to deal with scores of higher complexity, such as homophony [15, 16] and single-staff polyphony [17]. However, having a universal OMR end-to-end transcription system that can deal with all kinds of notations, including pianoform scores, is still a challenge to be met.

3 The GrandStaff dataset

Several efforts have been made to create datasets for OMR. On the one hand, there are corpora, such as DeepScores [18] and the MUSCIMA dataset [19], that contain a wide variety of annotated music documents, including subsets of pianoform scores. Despite providing interesting samples, they have not been conceived to train end-to-end OMR solutions and do not contain ground truths in a standard digital music notation format. On the other hand, there are corpora—such as PrIMuS [13], Il Lauro Secco [20], Capitan [21] or FMT [22]—that are specially labeled for end-to-end OMR transcription. However, practically all of them lack polyphonic and pianoform samples, as they mainly contain monophonic or homophonic music excerpts, which makes them unsuitable for the objective of this study.

Given this gap, we have designed a dataset focused on the task of end-to-end pianoform transcription: the GrandStaff corpus.Footnote 1

The term “grand staff” is used in music notation to represent piano scores [23]. It consists of two staves that are joined by a brace at the beginning, and whose bar lines cross both staves (see Fig. 1).

The dataset introduced in this work consists of 53,882 synthetic images of single-line (or system) pianoform scores, along with their digital score encoding.

In this section, we introduce the encoding representations of the musical scores in this dataset, as they are key aspects of the approach proposed in this paper, and we detail the way in which the corpus itself was created.

3.1 Ground-truth encoding

Since the goal of this dataset is to be a useful resource for the OMR community, we decided to generate a corpus based on standard digital notation documents. These output files can then be applied in other domains, such as graphic visualization software or the indexing of digital libraries.

First, it is necessary to analyze which encoding is most suitable as the endpoint of an end-to-end OMR system. The first options to consider are the most widespread musical encodings in library and musicology contexts: MEI [24] and MusicXML [25], which represent the components and metadata of a musical score in an XML-based markup encoding. Despite their wide adoption, these music representations have a major drawback when considering their use in OMR systems: they are too verbose. This is not convenient for OMR systems, since it would be hard to align input images with their corresponding notation representation.

In this paper, we use the text-based **kern encoding format, which is included in the Humdrum tool-set [26] and is hereafter referred to simply as kern. This music notation format is one of the representations most frequently utilized for computational music analysis. Its features include a simple vocabulary and easy-to-parse file structure, which is very convenient for end-to-end OMR applications. Moreover, kern files are compatible with dedicated music software [27, 28] and can be automatically converted to other music encodings, such as those mentioned above, by means of straightforward operations.

A kern file is basically a sequence of lines. Each line is, in turn, another sequence of columns or spines that are separated by a tab character. Each column contains an instruction, such as the creation or ending of spines, or the encoding of musical symbols such as clefs, key signatures, meter, bar lines, or notes, to name but a few. When interpreting a kern file, all spines are read simultaneously, thus providing the concept of polyphony to the format. That is, a line in a kern document should be read from left to right—interpreting all the symbols that appear simultaneously—and then from top to bottom, advancing in time through the score.

In conceptual terms, the design of a kern file resembles a music score that has been rotated to appear top to bottom rather than left to right (see Fig. 2). A basic example of how the encoding works is presented in Fig. 3, in which the word (clefG2) denotes a treble clef in the second line of the music staff and the symbol (8cc#) indicates that the note has a duration of an eighth note (8), has a pitch of C5 (cc), and comes with an accidental sharp (#), which alters the pitch of the note one semitone up. Thanks to its compactness—which eases score-representation alignment during transcription—and its compatibility with other music encodings and tools, the kern format represents an excellent choice for end-to-end OMR approaches.
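To make this structure concrete, the following minimal Python sketch (our own illustration, not part of the Humdrum toolset) steps through a small hypothetical two-spine kern fragment, reading each line as a timestep and each tab-separated column as a spine:

```python
# Hypothetical two-spine kern fragment: the left spine is the lower (bass) staff,
# the right spine the upper (treble) staff of a grand staff.
fragment = "\n".join([
    "**kern\t**kern",    # spine creation
    "*clefF4\t*clefG2",  # clefs for each staff
    "4C\t8cc#",          # a quarter-note C3 played together with an eighth-note C#5
    "=\t=",              # bar line crossing both staves
    "*-\t*-",            # spine termination
])

for timestep, line in enumerate(fragment.splitlines()):
    spines = line.split("\t")  # columns are read left to right, simultaneously
    print(f"t={timestep}: {spines}")
```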

Fig. 2

Example of a kern score (left) aligned with its rendered music document (right)

Fig. 3

Example of a simple excerpt of music. The corresponding kern notation: clefG2 \n 8cc#

However, such a highly compact format has some drawbacks for machine learning approaches, the most important of which is that the same visual structure can be encoded in different ways depending solely on personal preferences. This is because the token components can be ordered differently while the visual result remains the same. For example, as shown in the note token in Fig. 2, the ending of a beam is encoded using the ‘J’ character. In kern, it is valid to encode the whole symbol as in the figure, i.e., as 8e-J, denoting an eighth note (‘8’) of pitch E (‘e’) altered by a flat accidental (‘-’), but also as J8e-.

Another problem is that, as observed in Table 1, the kern music notation produces a large vocabulary (number of unique symbols). We believe that this may hinder the performance of neural network-based approaches, meaning that a simplification of this music notation base would be convenient.

In this work, we, therefore, introduce an extension of this format that corrects the aforementioned issues. We have denominated it **bekern, the abbreviation of “basic extended kern,” and it is referred to simply as bekern in the remainder of this paper. We allow just one “canonic” encoding of each feature, which is why we have denominated it “basic.” In order to avoid different encodings for the same visual result, the ordering of token components has been restricted to a single one. The alphabet has been reduced by decomposing tokens into components delimited by means of a separator ‘.’. This means that the last token in Fig. 2, 8e-J, even if originally encoded as J8e-, is encoded only as 8.e.-.J in the bekern format.Footnote 2
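As an illustration of this canonicalization (a toy sketch, not the exact conversion tool used to build the corpus), the following Python function decomposes a kern note token into dot-separated components in one fixed order, so that 8e-J and J8e- map to the same bekern string:

```python
import re

def to_bekern_note(token: str) -> str:
    """Toy canonicalization of a kern note token into bekern-style components.

    Simplified illustration: only duration, pitch letters, accidentals, and beam
    markers are handled, and a single fixed ordering is imposed.
    """
    duration   = "".join(re.findall(r"\d+", token))        # e.g. '8'
    pitch      = "".join(re.findall(r"[a-gA-G]+", token))  # e.g. 'e'
    accidental = "".join(re.findall(r"[#\-n]", token))     # sharp, flat, or natural
    beams      = "".join(re.findall(r"[LJKk]", token))     # beam start/end markers
    parts = [p for p in (duration, pitch, accidental, beams) if p]
    return ".".join(parts)

assert to_bekern_note("8e-J") == to_bekern_note("J8e-") == "8.e.-.J"
```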

The grammar and definition of this encoding are detailed in Appendix A.

3.2 Dataset building process

The dataset was constructed using the following steps:

  1. All piano scores (those containing the ‘*Ipiano’ signifier) from Kern ScoresFootnote 3 were downloaded.

  2. Those files that contained more than two staves, or that had any kern parsing errors, were removed. After this, 474 full-length scores remained. These comprised piano sonatas, mazurkas, preludes, and other compositions by Scarlatti, Mozart, Beethoven, Hummel, Chopin, and Joplin.

  3. Three different pitch transpositions of the original pieces were used in order to augment the training data: major second, minor third, and major third. Each of these transformations moves the notes vertically, introduces new accidentals, and in many cases forces note stems and beams to take on a different appearance (see Fig. 4).

  4. Each composition obtained was randomly split into segments of 3 to 6 measures in order to obtain single-system scores (i.e., scores composed of just one system of two staves, like those in Fig. 4).

  5. All dynamics, expression slurs, lyrics, and non-graphic information tokens were removed from the scores in order to generate what we have denominated as bekern.

  6. These excerpts were then rechecked so as to retain only those that were valid kern scores.

  7. All the excerpts retained were used as the basis on which to generate new files with the extension .bekrn in the bekern format.

  8. The music images were obtained by employing the Verovio digital engraver [28], which generates an SVG file from kern. These input kern files were obtained from the bekern files by simply removing the dot separators (see the sketch after this list). JPG images were then obtained from the SVG files through an automatic process. The variability of the engraved scores was increased by randomly varying Verovio parameters within the ranges permitted by its documentation, namely line thickness, maximum beam slope, slur curve factor, grace note factor, repeat bar line dot separation, and font family.

  9. Two versions of each image file were generated: the JPG file from the previous step, and a distorted version of the image that resembles a low-quality photocopy or print (see Fig. 5). The method used to distort the images is described in [29].

  10. Finally, all those samples for which Verovio raised an error and no image was generated were discarded.
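As a rough illustration of the rendering pipeline mentioned in step 8 above (a sketch under assumptions, not the exact generation scripts), the snippet below recovers a kern file from bekern by stripping the dot separators and then renders it to SVG. It assumes the verovio Python bindings and their toolkit()/loadData()/renderToSVG() interface; the "font" option is a placeholder for the engraving parameters actually varied in the corpus:

```python
import json
import verovio  # assumes the 'verovio' Python package is installed

def bekern_to_kern(bekern_text: str) -> str:
    """Recover a plain kern document from bekern by removing the '.' separators
    inside tokens, while leaving tabs, newlines, and kern null tokens ('.') intact."""
    kern_lines = []
    for line in bekern_text.splitlines():
        tokens = [t if t == "." else t.replace(".", "") for t in line.split("\t")]
        kern_lines.append("\t".join(tokens))
    return "\n".join(kern_lines)

def render_excerpt(bekern_text: str, out_svg: str) -> None:
    tk = verovio.toolkit()
    # Placeholder option: the corpus also varied line thickness, beam slope,
    # slur curve factor, etc., within the ranges allowed by the documentation.
    tk.setOptions(json.dumps({"font": "Leipzig"}))
    tk.loadData(bekern_to_kern(bekern_text))
    with open(out_svg, "w", encoding="utf-8") as f:
        f.write(tk.renderToSVG(1))  # the SVG is subsequently rasterized to JPG
```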

Information regarding image properties is provided in Table 2, along with the kern features shown in Table 1. These data show that the images are large and that the transcription lengths vary considerably, which makes aligning the information particularly difficult, a challenge that is related not only to OMR, but also to document analysis in general.

Fig. 4

Example of transposition. The transposition has moved not only the positions of the notes but also the accidentals, note stems, and beam positions accordingly

Fig. 5

Example of a distorted image

Table 1 Transcription features for both of the proposed encodings
Table 2 Summary table of the image features for both the GrandStaff corpus and the camera distorted version

4 Neural approach

In this section, we briefly describe how end-to-end OMR has traditionally been addressed and why pianoform musical scores cannot follow this formulation. We then describe the proposed solution with which to tackle the pending challenge.

As in previous works, input images are assumed to have undergone a previous layout analysis stage that leaves single-system sections [30], in the same way that end-to-end HTR works on single-line text sections [31].

4.1 End-to-end OMR

State-of-the-art OMR seeks the most probable symbolic representation \(\hat{\textbf{s}}\)—encoded in the \(\Sigma _a\) music notation vocabulary—for each staff-section image x:

$$\begin{aligned} \hat{\textbf{s}} = \arg \max _{\textbf{s} \in \Sigma _{a}} P(\textbf{s} \mid x) \end{aligned}$$
(1)

Neural networks approximate this probability by training with the Connectionist Temporal Classification (CTC) loss [32]. This alignment-free expectation–maximization method forces the network to maximize the sum of the probability of all the possible alignments between a ground-truth sequence s and the input source x. Since our input is an image, we treat x as a sequence of frames from this source. This is formalized as:

$$\begin{aligned} P(\textbf{s} \mid x) = \sum _{\textbf{a} \in \mathcal {A}_{\textbf{s},x}} \prod _{t=1}^{T} p_t(\textbf{a}_t \mid x) \end{aligned}$$
(2)

where \(\textbf{a}\) is an auxiliary variable that defines a label in the output vocabulary at each frame t. This variable belongs to the set \(\mathcal {A}_{\textbf{s},x}\), which groups all the possible valid alignments between the image x and the sequence s. Since \(\textbf{a}\) is a sequence of length T, CTC implements a many-to-one map function \(\mathcal {B}(\cdot )\) that compresses \(\textbf{a}\) in order to retrieve the transcription output [32]; an alignment \(\textbf{a}\) is valid if \(\mathcal {B}(\textbf{a}) = \textbf{s}\). The sum in Eq. 2 marginalizes over all the valid alignments between s and x (the set \(\mathcal {A}_{\textbf{s},x}\)), since the probability of a sequence is understood to be the product of the probabilities of all its time steps.

The output of the network consists of a posteriorgram, which contains the probabilities of all the tokens within the vocabulary \(\Sigma _{a}\). To allow for the possibility of no prediction at a given timestep, CTC provides an extra blank token (\(\epsilon \)). Therefore, the output vocabulary of the network becomes \(\Sigma '_{a} = \Sigma _{a} \cup \{\epsilon \}\).

At inference time, OMR methods resort to greedy decoding, which retrieves the most probable sequence for a given input image x. This can be decomposed into retrieving the most probable token at each timestep and applying \(\mathcal {B}\) in order to obtain the output sequence:

$$\begin{aligned} \begin{aligned} \hat{\textbf{a}}&= \arg \max _{\textbf{a} \in (\Sigma '_{a})^{T}} \prod ^{T}_{t=1} P(a_t \mid x) \\ \hat{\textbf{s}}&= \mathcal {B}(\hat{\textbf{a}}) \end{aligned} \end{aligned}$$
(3)

The formulation presented treats the transcription task as a sequence retrieval problem, and the output of the network is, therefore, always a character sequence. A sequence of this nature is obtained from an image by converting the image domain \({\mathcal {R}}^{h \times w \times c}\) (defined by the width w, height h, and number of channels c of the image) into a sequence domain \({\mathcal {R}}^{l \times |\Sigma '_{a}|}\), where l stands for the output sequence length and \(\Sigma '_{a}\) is the aforementioned music notation vocabulary. CTC-based methods specifically define a reshape function \({\mathcal {R}}^{h \times w \times c} \rightarrow {\mathcal {R}}^{l \times |\Sigma '_{a}|}\) based on the vertical collapse of the feature map, as symbols can be read from left to right and framesFootnote 4 always contain information about the same symbol in this case.
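As a minimal sketch of this decoding step (our own PyTorch-style illustration; the blank index and shapes are assumptions, not the authors' implementation), greedy CTC decoding takes the per-frame argmax and then applies the many-to-one map \(\mathcal {B}\) by collapsing consecutive repeats and dropping blanks:

```python
import torch

def ctc_greedy_decode(posteriorgram: torch.Tensor, blank: int = 0) -> list[int]:
    """Greedy CTC decoding of a (T, |vocab|+1) posteriorgram.

    Frame-wise argmax followed by the many-to-one map B: collapse consecutive
    repeated labels and remove the blank token.
    """
    best_path = posteriorgram.argmax(dim=-1).tolist()  # most probable token per frame
    decoded, previous = [], None
    for label in best_path:
        if label != blank and label != previous:       # collapse repeats, skip blanks
            decoded.append(label)
        previous = label
    return decoded

# Toy usage: 100 frames over a vocabulary of 30 symbols plus the blank.
probs = torch.rand(100, 31).softmax(dim=-1)
print(ctc_greedy_decode(probs))
```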

4.2 The challenge of polyphony

The methodology described above is able to solve single-staff music transcription problems and is currently the basis of the state-of-the-art systems in OMR for both printed and handwritten music scores.

Despite this, the end-to-end transcription of polyphonic and pianoform scores is still a challenge (see Sect. 2). As stated in Sect. 3.1, pianoform music scores follow a particular reading order during interpretation, since there are staves that are read simultaneously. Rather than performing a line-by-line reading from top to bottom and left to right, interpretation is tied to staff groups, in which all elements are read simultaneously from left to right.

This increase in simultaneous events in a score is challenging, since the principle that a frame contains the information of a single music symbol is no longer satisfied, as there are multiple vertically aligned notes. When the complexity does not grow significantly, as is the case of homophonic scores,Footnote 5 some vocabulary-based approaches can be employed. For example, in the work of Alfaro et al. [16], a special token is defined in order to differentiate whether a note is played along with the previous one or belongs to the next time step. This approach could also be extended to polyphonic transcription at the cost of greatly increasing the ground-truth sequence length, as simultaneous events are very frequent in these scores. However, as samples grow in size (e.g., full page-sized polyphonic music scores), this approach is no longer effective, as the vertical collapse cannot produce sufficient frames to transcribe the complete music representation of the score.

It would, therefore, appear to be more convenient to search for new approaches or adaptations beyond the state-of-the-art single-staff music transcription formulation, as we require a more robust and scalable approach with which to address this challenge.

4.3 End-to-end polyphony transcription

In this section, we present a reading interpretation that aligns grand staves with their corresponding ground truth representation. We then provide details on a methodological approach with which to perform end-to-end transcription in order to solve its associated challenge.

4.3.1 Aligning polyphonic scores with their music representation

Although the current formulation cannot properly handle pianoform notation, these scores can be interpreted in such a way that end-to-end transcription becomes applicable.

Upon closely studying the kern and bekern encoding formats (described in Sect. 3.1 and Appendix A), it will be noted that each text line represents a specific timestep in the music score. That is, all the symbols in a kern line are played at the same time, as they belong to different spines. The reading order of these documents is from top to bottom and left to right, which matches the left-to-right reading of the musical score. It is, therefore, possible to obtain a graphic alignment between them by rotating the source image \(90^\circ \) clockwise. When applying this transformation, as exemplified in Fig. 2, it is observed that both image and transcription are read in the same order. This is due to the nature of the kern spines, which represent single musical staves aligned in the same way as displayed in the image.

By following this interpretation, we obtain both a document and a ground-truth text representation that are read like a text paragraph. This consequently makes it possible to reformulate the solution inspired by segmentation-free multiline transcription approaches.

4.3.2 Score unfolding approach

Segmentation-free multi-line document transcription is a text methodology whose objective is to transcribe document images that contain more than one line without the need to perform any previous line detection processes. Although it is a recent research topic, several works with which to address this challenge have been proposed. The most relevant approaches found are those based on attention [33,34,35,36], which perform a line-by-line or token-by-token transcription process by means of an attention matrix or self-attention, and those of a document unfolding nature [37, 38], in which the model learns to unfold text lines in order for them to be read sequentially in their corresponding reading order.

The attention-based methods apply backpropagation once all the lines in a sample have been processed, which is not convenient in our case owing to the large number of lines that kern files typically contain. In this paper, we have, therefore, employed a document unfolding approach, specifically inspired by the work of Coquenet et al. [38], as document unfolding is learned without any constraint on the input image size.

Here, rather than concatenating frame-wise elements along the height axis (h) during the vertical collapse, we reshape the feature map by concatenating all of its rows (w) to subsequently obtain a \((c, h \times w)\) sequence, in which c is the number of filters used by the convolutional layers of the model. From a high-level perspective, this method can be understood as a pairwise polyphonic region concatenation process—as illustrated in Fig. 2. This operation is performed from top to bottom of the image. Graphic visualization of this method is depicted in Fig. 6.
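The following PyTorch-style sketch (our own illustration; the toy dimensions are assumptions) contrasts the standard vertical collapse with the row-concatenation reshape described above:

```python
import torch

b, c, h, w = 1, 512, 32, 64              # toy feature-map dimensions after the encoder
feature_map = torch.randn(b, c, h, w)

# Standard vertical collapse (single-staff OMR): w frames of c*h features each.
collapsed = feature_map.permute(0, 3, 1, 2).reshape(b, w, c * h)

# Unfolding reshape used here: the rows of the feature map are concatenated from
# top to bottom, yielding h*w frames of c features each, read like a paragraph.
unfolded = feature_map.reshape(b, c, h * w).permute(0, 2, 1)

print(collapsed.shape)   # torch.Size([1, 64, 16384])
print(unfolded.shape)    # torch.Size([1, 2048, 512])
```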

This means of processing the feature map obtained allows the transcription of the musical score in its original kern format, as labels are aligned in the same way, from top to bottom and left to right. However, some symbols have to be introduced into the vocabulary in order to produce correct kern sequences: the line-break token, which is mandatory in order to know where music timesteps are separated; the tab token, which indicates spine jumps; and the space token, which identifies homophonic symbols.

Fig. 6

Graphical scheme of the proposed reshape method used to transcribe polyphonic music scores. It should be noted that this reshaping is performed in a feature space, not on the image itself, and visualization of it has, therefore, been included for the sake of clarity

5 Experiments

In this section, we define the environment designed in order to evaluate, on the proposed corpus, the performance of the end-to-end polyphonic music recognition method presented in this paper.Footnote 6

5.1 Implementations considered

Three different implementations have been proposed for study purposes. All of them contain a convolutional block, which acts as an image encoder that extracts the most relevant features from the input. The implementation of [38] is followed, which consists of a network of ten stacked convolutional layers with pooling operators that eventually produce a feature map of size (b, c, h/8, w/16), where h and w are the height and width of the input image, c is the number of filters in the last convolutional layer, and b is the batch size. An illustration of this encoder architecture is provided in Fig. 7. It should be mentioned that all the implementations considered have a similar number of parameters, around 23 M, the majority of which are located in the convolutional encoder. The decoding architectures proposed to process the sequence obtained after applying the reshape method are, therefore, the following.

Fig. 7

Scheme of the encoder architecture implemented in all the models evaluated. The model input is a 90\(^\circ \)-rotated polyphonic image of height h, width w, and c channels. It outputs a feature vector of \(\frac{w}{16} \times \frac{h}{8}\) frames and 512 features

5.1.1 Recurrent neural network

We implemented the decoder from the original Convolutional Recurrent Neural Network (CRNN) single-staff transcription model of Calvo-Zaragoza et al. [21], in which the reshaped feature map is fed into a single Bidirectional LSTM (BLSTM) layer followed by a fully connected layer that maps the RNN feature space onto the output vocabulary. We specifically implemented a BLSTM with 256 units. This decoder implementation is depicted in Fig. 8b.
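A minimal PyTorch sketch of this decoder head (the vocabulary size is a placeholder; layer names are our own, not those of the reference implementation):

```python
import torch.nn as nn

class BLSTMDecoder(nn.Module):
    """CRNN-style decoder: one bidirectional LSTM followed by a linear
    projection onto the output vocabulary plus the CTC blank token."""
    def __init__(self, in_features: int = 512, hidden: int = 256, vocab_size: int = 180):
        super().__init__()
        self.blstm = nn.LSTM(in_features, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, x):                 # x: (batch, seq_len, in_features)
        x, _ = self.blstm(x)
        return self.proj(x)               # (batch, seq_len, vocab_size + 1)
```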

5.1.2 The transformer

OMR decoders typically implement Recurrent Neural Networks (RNN) in order to process the reshaped feature map as a sequence. However, there is a recurrence-free model that has gained popularity in Natural Language Processing (NLP): the Transformer [39]. This model replaces the RNN architecture by implementing sequence modeling through attention mechanisms and position learning. It solves some common issues related to RNNs, such as processing long sequences, at the cost of requiring more data in order to converge. Given the reshaping step and the kern format for polyphonic music scores, the model has to process significantly long output sequences, something that Transformers tend to handle better than RNNs. In previous works, the Transformer has been studied as a means of performing transcription tasks in both OMR and HTR [40, 41]. This research has shown that the Transformer is a promising architecture for both tasks. Although it does not currently yield better performance than traditional RNNs in these two areas (when no supporting synthetic data or specific training techniques are provided), relevant improvements have been reported in the OCR field [42]. We, therefore, propose an implementation that replaces the recurrent layer of the CRNN model with a Transformer encoder module, in the same way as is done in [40] (shown in Fig. 8c), which is referred to as CNNT in the remainder of the paper. We specifically implemented one encoder layer with an embedding size of 512 units, a feed-forward dimension of 1024, and 8 attention heads.
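A sketch of the CNNT head with the hyperparameters stated above (our own PyTorch illustration; positional encoding is assumed to be added beforehand and the vocabulary size is a placeholder):

```python
import torch.nn as nn

class TransformerHead(nn.Module):
    """CNNT-style decoder head: a single Transformer encoder layer over the
    reshaped feature sequence, followed by a projection onto the vocabulary."""
    def __init__(self, d_model: int = 512, vocab_size: int = 180):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.proj = nn.Linear(d_model, vocab_size + 1)     # +1 for the CTC blank

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        return self.proj(self.encoder(x))
```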

5.1.3 Encoder-only model

As mentioned previously, the proposed methodology with which to transcribe polyphony is based on analogous works for multi-line transcription in the HTR field [37, 38]. These works are based on convolution-only architectures in which no sequence-processing decoders are implemented: the solution lies in preserving the prediction space in two dimensions and applying backpropagation directly to the feature map retrieved before it is reshaped. In order to complete our study of the architecture, we implemented an encoder-only network. As it is based only on fully convolutional layers, it is referred to as FCN in the results section. This implementation is depicted in Fig. 8a.

Fig. 8

Architecture schemes of different implementations of the decoder used in this paper

5.2 Sequence codification

In this paper, we have used two encodings to represent polyphonic musical scores: Humdrum kern, which is the semantic encoding chosen to represent the digital music documents of the GrandStaff corpus, and its basic encoding (bekern), a semantics-based tokenization performed in order to dramatically reduce the number of unique symbols of kern. In order to assess the utility of the proposed encoding, we additionally evaluated the transcription method with a third kern vocabulary, reduced using a non-semantics-aware tokenization method, which would be the first approach to employ if there were no prior knowledge of music encodings.

We shrank the kern vocabulary of the GrandStaff dataset by employing the SentencePiece strategy [43], which is a standard utility in the Machine Translation field when performing vocabulary compression. This text tokenizer provides a set of unsupervised methods based on sub-word units, such as the Byte Pair Encoding algorithm [44] and the Unigram Language Model [45]. This tool was chosen not only because it is a standard utility, but also because it allows the specification of a vocabulary size, which is ideal for comparison purposes, since we can create a vocabulary that is equal in size to that obtained with the bekern encoding. This new vocabulary is referred to as the kern-sp encoding in the remainder of this paper.
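As an illustration of this step (file names and the vocabulary size are placeholders, not the exact values used), training such a tokenizer on the kern transcriptions with the sentencepiece Python package could look as follows:

```python
import sentencepiece as spm

# Train an unsupervised sub-word model on the raw kern transcriptions; in the
# paper, the vocabulary size is set to match that of the bekern encoding.
spm.SentencePieceTrainer.train(
    input="kern_train.txt",   # one kern transcription per line (placeholder file)
    model_prefix="kern_sp",
    vocab_size=200,           # placeholder value
    model_type="bpe",         # Byte Pair Encoding; 'unigram' is also available
)

sp = spm.SentencePieceProcessor(model_file="kern_sp.model")
print(sp.encode("8cc#\t8e-J", out_type=str))   # kern line -> kern-sp tokens
```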

5.3 Evaluation procedure

The GrandStaff dataset provides two data splits: a training split and a test split. The test split consists of all the real (i.e., non-transposed) musical scores extracted in order to create the corpus, as we believe that test results should be obtained from real samples. The training and validation splits consist of all the altered musical scores, the preparation of which is detailed in Sect. 3.2. We specifically train and validate on 46,221 samples and test on 7661 samples.

5.4 Metrics

One issue that may be encountered when evaluating OMR experiments is that of correctly assessing the performance of a transcription model, as music notation has specific features that must be taken into account. However, OMR does not have a specific evaluation protocol [1]. In our case, it is convenient to use text-related metrics to evaluate the accuracy of the predictions. Three metrics have been proposed in order to evaluate the performance of the models implemented. All of these measures are based on the normalized mean edit distance between a hypothesis sequence \(\hat{\textbf{s}}\) and a reference sequence \(\textbf{s}\), in the form of:

$$\begin{aligned} \mathcal {E}(\hat{S}, S) = \frac{\sum ^n_{i=0} d(\textbf{s}_i, \hat{\textbf{s}}_i)}{\sum ^n_{i=0} |\textbf{s}_i|} \end{aligned}$$
(4)

where \(\hat{S}\) is the hypotheses set, S is the ground-truth set, \(d(\cdot , \cdot )\) is the edit distance between the tokens of each paired hypothesis and ground-truth sequences \((\textbf{s}_i, \hat{\textbf{s}}_i)\), and \(|\textbf{s}_i|\) is the length of the reference sequence in tokens.
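As a sketch of Eq. 4 (our own Python illustration; the tokenization chosen determines whether the result is the CER, SER, or LER described below):

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Token-level Levenshtein distance computed by dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def error_rate(references: list[list[str]], hypotheses: list[list[str]]) -> float:
    """Eq. 4: summed edit distances normalized by the total reference length.
    Tokenizing at bekern-character, kern-symbol, or kern-line level yields the
    CER, SER, or LER, respectively."""
    total_dist = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_length = sum(len(r) for r in references)
    return 100.0 * total_dist / total_length
```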

Table 3 Average CER, SER, and LER (%) obtained by the studied models on the test set for both the perfectly printed and the distorted versions of the GrandStaff dataset
Table 4 Encoding of the original piano staff shown in Fig. 9 and its transcription with CNNT. Italics represent elements wrongly predicted by the transcription model (insertions). Bold in the original sequence represents tokens missing from the prediction (deletions)

As can be observed in the operation determined by Eq. 4, the edit distance-based error \(\mathcal {E}\) depends on what is defined as a token in the codification. This is used as the basis on which to compute the Character Error Rate (CER), which tokenizes sequences at the bekern character level, as detailed in Appendix A. The second metric is the Symbol Error Rate (SER), which computes the edit distance between complete kern symbols.Footnote 7 Finally, as the problem is solved using a multi-line transcription approach that rotates the music score and attempts to align each kern line with the input image by predicting a line-break token, we compute the Line Error Rate (LER), which makes it possible to assess the amount of error produced while retrieving lines, by treating complete lines as tokens when calculating \(\mathcal {E}(\hat{S}, S)\). We consider this metric to be particularly interesting for both this paper and the polyphonic music transcription problem: kern files rely heavily on these line structures to represent music notation, as they indicate both the notes to be played and the temporal sequentiality of the score. Since correctly predicting and differentiating all the lines of a given document is a key aspect, the overall quality of the output kern files can be assessed using this metric.

6 Results

Table 3 shows the results obtained by the methodology proposed in this paper on the test set for both the perfectly printed and the distorted datasets. Note that no reference/baseline results are shown, as the state-of-the-art end-to-end methods [1, 15] failed to converge during training for this specific dataset. This was caused by the issues described in Sect. 4.2.

From an overall perspective, the results show that the score unfolding method enabled the neural networks to solve the problem with fair results, the best SER values being 5.8% and 6.5% with the bekern encoding using the Transformer. These error values scale with image complexity, as the distorted version of the corpus contains features that make recognition more difficult and, understandably, have an impact on the overall performance of the models. This impact can be clearly seen in the encoder-only implementation, with a drop in performance of approximately 16%. This shows that sequence processing modules are indeed necessary in order to perform polyphonic transcription, as they provide stability against increasing corpus complexity, with a drop in performance of 12% in the best-case scenario.

Fig. 9

Visualization of the transcription produced by the CNNT model for a test pianoform staff. The corresponding kern encoding is shown in Table 4. Errors are displayed in the prediction image, where red boxes highlight missing symbols and red notes indicate wrong predictions. In this particular case, the CER obtained is 2.8%, the SER is 4.1% and the LER is 16.2%

Upon comparing the sequence processing implementations, the results show that, for polyphonic music transcription, the combination of a CNN with a Transformer encoder, when outputting the bekern vocabulary, provides the best transcription results. Table 1 supports the idea that, on average, the bekern encoding produces longer sequences than the kern one, in exchange for a significantly narrower vocabulary. By replacing recurrence with self-attention and position encoding, Transformers improve computation time and accuracy at the cost of requiring more data in order to converge. Indeed, the Transformer literature reports relevant improvements in terms of sequence length limitations, these models being able to process longer sequences than RNNs. In this case, it would appear that the GrandStaff dataset creates a scenario that is ideal for Transformer-based models, as there is a large amount of available data and, on average, long sequences to be transcribed. Indeed, RNN-based decoders provided the best performance when transcribing raw kern sequences.

In terms of output sequence tokenization, the results show that a reduction in vocabulary improves the results of the model, since the number of parameters to be optimized in the last layer is significantly reduced. We observe, depending on the model, varying gaps between the semantics-based tokenization method (bekern) and the unsupervisedly learned one (in our case, SentencePiece). It would appear that vocabulary selection may be an ad hoc decision when implementing a model. However, judging from the best results obtained in these experiments, the bekern format provides better performance, as it is a semantic encoding based on prior knowledge of music notation.

Finally, we should highlight the LER obtained by the implemented methods. As described in Sect. 5.4, the LER metric indicates how well the model aligns the input image with the output transcription in terms of complete kern lines. The results show an overall LER performance of 16.26% and 17.53%. This means that the error produced by the model is mostly intra-line and that the proposed methodology was, therefore, able to correctly align the rotated music image with its kern transcription. This indicates that our models can be readily exported to practical applications that deal with kern files and that errors should principally be corrected by reviewing line content, not the overall format of the document (Fig. 9).

6.1 Evaluation on monophonic scores

The method proposed in this paper for music transcription involves aligning input images with their corresponding kern ground truth notation by approaching it as a multiline endeavor. This method is not limited to polyphonic music—since it relies on visual-text alignment—and can be applied to other kern-encoded music scores, including monophonic ones, which have been the main target of existing end-to-end OMR techniques.

To complete the analysis of the methodology considered in this work, we conducted additional experiments to evaluate its effectiveness for monophonic music score transcription. We trained our models with the camera version of the “Printed Images of Music Staves” (Camera-PrIMuS) dataset [29], which is a well-known benchmark for end-to-end OMR.

The results of our experiments, which compared the performance of the implemented models in this work using both the state-of-the-art reshape approach (vertical collapse) and the unfolding method considered in this paper, are presented in Table 5.

Table 5 Average SER (%) obtained by the studied architectures and reshape methods on the test set of the monophonic Camera-PrIMuS dataset

Our experimental results indicate that the unfolding method is able to successfully perform end-to-end monophonic transcription, although it reports lower accuracy than the vertical collapse approach. The performance of the latter is mainly due to the convolutional architecture implemented, which improves on the state-of-the-art results by 1% in SER.

It is important to note that we also conducted an additional experiment in which monophonic scores were transcribed directly using the networks trained on GrandStaff. However, in this case the models were unable to produce even minimally accurate predictions. All our empirical outcomes, therefore, suggest that our methodology can effectively perform transcription for both monophonic and polyphonic tasks, but that, for now, this requires training independent task-specific models.

7 Conclusions

This work presents the first end-to-end OMR approach with which to solve the transcription of pianoform musical scores. This solution extends state-of-the-art staff-level transcription methods and is inspired by multi-line document transcription. We specifically take advantage of a standard digital music notation system, Humdrum **kern (kern), and implement a neural network that learns to unfold a rotated pianoform system and align it with its corresponding transcript. This method is trained with weakly annotated data, as it requires only pairs of images and their digital document representation, without any geometric information such as staff positions or symbol locations in the image.

In addition to this approach, we also present the GrandStaff Dataset for use in experiments. This dataset consists specifically of a collection of \(53\,882\) polyphonic single-line pianoform scores extracted from the KernScores repository and rendered using the Verovio tool. This dataset provides two music encodings for each score: the original kern document and the Basic **kern (bekern) notation sequence, which consists of a simplification of the base encoding that reduces vocabulary size in order to ease the work of the transcription systems.

The evaluation results obtained show that the proposed method successfully transcribes pianoform music systems with fair error rates. This represents a clear advance as regards attaining effective end-to-end OMR systems. Our work also provides baseline results for future work addressing the same challenge.

As future work, this paper opens up several research avenues. We propose an output sequence constructed with a semantic music grammar; however, most OMR results are framed in graphics-based vocabularies as the output of their systems. A comparative study between this approach and a joint transcription and machine translation pipeline could therefore be performed, as in [46]. Moreover, the proposed approach is limited to music staves that are played simultaneously. That is, this method can be extended only to full pages that contain completely simultaneous music, but not to sequentially structured polyphonic staves, as we stick to a specific reading order that is not followed in those cases. Future efforts should, therefore, focus on how to extend transcription systems in order to address full-page polyphonic music score recognition, as is also occurring in the HTR field with full-page documents [36, 47]. Finally, this work demonstrates that the implemented method is able to transcribe both polyphonic (pianoform) and monophonic music images by rotating them and aligning them vertically with their digital music representation, thanks to the kern format. However, given the reported results, the networks have to be trained separately for each task. The general application of this method to other musical score types could also be explored, thus leading to research toward universal OMR solutions.