
1 Introduction

Music constitutes one of the main vehicles of cultural heritage. A large number of musical manuscripts are preserved in historical archives. Occasionally, these documents are transcribed into a digital format for easier access and distribution, without compromising their integrity. However, in order to make this heritage truly useful, it is necessary to transcribe the sources into a structured format such as MusicXML [3], MEI [11], or MIDI. So far, this has been done manually by experts in early music notation, making the process very slow and costly. Conveniently, Optical Music Recognition (OMR) techniques can help automate the process of reading music notation from scanned music scores [1].

Typically, the transcription of historical music documents is treated differently from conventional OMR due to their particular features; for instance, the use of certain notation systems, or the state of preservation of the original document. Although there exist several works focused on the transcription of early music documents [4, 8], the specificity of each type of manuscript, as well as its overall writing style, makes it difficult to generalize these developments.

In this context, a music transcription system is one that performs the task of obtaining a digital structured representation of the musical content in a scanned music manuscript. The workflow to accomplish such a task can be summarized as follows: First, a document layout analysis step isolates the document parts containing music, mostly music staves. Then, an OMR system detects the music symbols contained in these parts, typically producing a sequence or graph of music symbols and their positions with respect to the staff. From this representation, a semantic music analysis step assigns musical meaning to each symbol, as this often depends on the specific location of the symbol in the sequence. Finally, this intermediate music representation is translated by a coding stage into a structured representation in the desired output format.

Unlike other domains, in the particular case of music notation the symbols to be classified have two components: glyph and position within the staff. Traditionally, OMR systems use supervised learning to predict the glyph [2, 6, 9], whereas the position is determined by heuristic strategies [13]. Since these OMR systems usually apply a pre-processing stage that normalizes the input images (binarization, deskewing, and so on), such heuristic strategies tend to be quite reliable. However, other approaches might use different pre-processing steps, in which case traditional heuristics might not work correctly. This is why we propose to deal with the identification of the glyph position by means of supervised learning as well. To this end, we study the best approach to classify a pre-segmented symbol image into its pair of glyph and position. Specifically, we propose different deep learning architectures that make use of Convolutional Neural Networks (CNN) in order to fully classify a given music symbol. We aim to analyze which architecture gives the best performance in terms of accuracy and efficiency.

The rest of this paper is organized as follows: Sect. 2 presents the methodology considered for the aforementioned task; Sect. 3 describes our experimental setup; Sect. 4 presents and analyzes the results; Sect. 5 concludes the present work and introduces some ideas for future research.

2 Methodology

Our work assumes a segmentation-based approach, in which the locations of symbols that appear in the input music score have already been detected in a previous stage. This can be achieved under an interactive environment where the user manually locates the symbols [10], but can also be automated with object detection techniques [7]. Either way, the next task is to fully classify the symbols in order to retrieve their musical meaning.

We consider that all musical symbols are defined by two components: glyph and position with respect to the staff lines. This is obvious in the case of notes, as these components indicate the duration and the pitch, respectively. We can generalize this to any symbol type, as all symbols are located in a specific position with respect to the lines of the staff. Let \(\mathcal {G}\) be the label space for the different glyphs and \(\mathcal {P}\) the label space for the different positions. A music symbol is therefore fully defined by a pair \((g, p)\), \(g \in \mathcal {G}\), \(p \in \mathcal {P}\). A graphical example is given in Fig. 1.

Fig. 1. Example of handwritten music symbols in white Mensural notation, showing their glyph, position, and combined label. Note that position labels refer to the vertical placement of a glyph: Ln and Sn denote positions on staff lines and in the spaces between them, respectively.

Considering the above, the complete classification of a music-notation symbol consists of predicting both its glyph and its position. This opens up several possibilities as regards this dual process, given that the two components are not completely independent. As the base classification algorithm, we resort to Convolutional Neural Networks (CNN), since they represent the state of the art for image classification. These neural networks consist of one or more convolutional layers, which benefit from local connections, tied weights, and pooling operations in order to learn a suitable data representation for the task at hand [14]. The convolutional layers are typically followed by one or more fully-connected layers that perform the final prediction stage.

The classical use of a CNN is to take a single image as input and associate it with a single class label. However, since we want to know both the glyph of a symbol and its position within the staff simultaneously, we consider different architectures with shared layers, as well as multi-output and multi-input models, in order to determine which is the best way to obtain the full symbol classification.

Below we first introduce how to represent the input for classification, after which we propose the neural architectures that allow us to perform the combined glyph and position classification of music symbols.

2.1 Input Scheme and Preprocessing

Two region-based image inputs are used in this approach. Figure 2 shows a part of a music staff from a Mensural notation score. These images are pre-segmented, either manually or automatically, by defining a bounding box around each music symbol in the staff, as shown in Fig. 3 (left). Thus, each bounding box defines an image instance containing a music symbol. These appropriately annotated images can be used to train and evaluate a music glyph classification model. However, in general these instances do not span enough vertically to contain all staff lines. Therefore, they do not convey information about the music symbol position. For example, the symbol labeled as ‘A’ in Fig. 3 (left) is indistinguishable from symbol ‘B’ in the same image, in spite of appearing at different staff positions. These images will be referred to as glyph inputs.

In order to produce models able to correctly classify music symbol positions, a second image set is constructed by enlarging the bounding box frame vertically to a fixed height, large enough to contain all staff lines, as shown in Fig. 3 (right). These images will be referred to as enlarged inputs in the following sections. It has been shown that such enlarged images contain enough information for estimating the vertical position of a music symbol within the staff [5].
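As an illustration, the following minimal sketch (not the authors' code) shows how the two input types could be cut from a staff image stored as a NumPy array of shape (height, width, 3); the bounding-box format and the staff_top/staff_bottom limits are assumptions made only for the example.

def extract_inputs(staff_image, bbox, staff_top, staff_bottom):
    # bbox is assumed to be an (x0, y0, x1, y1) tuple in pixel coordinates;
    # staff_top and staff_bottom are assumed vertical limits enclosing all staff lines.
    x0, y0, x1, y1 = bbox
    # Glyph input: the tight bounding box around the symbol (no position cue).
    glyph_crop = staff_image[y0:y1, x0:x1]
    # Enlarged input: same horizontal span, stretched vertically to a fixed
    # frame containing every staff line, so the vertical position is visible.
    enlarged_crop = staff_image[staff_top:staff_bottom, x0:x1]
    return glyph_crop, enlarged_crop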

Fig. 2. A sample of a Mensural notation staff.

Fig. 3. Left: glyph bounding boxes. Right: enlarged bounding boxes.

2.2 CNN Architectures

Given the aforementioned inputs and targets, we intend to classify every region as one of the available symbols. To accomplish this, we try different approaches that perform the glyph and position classifications either jointly or independently.

Independent Glyph and Position Models. Our first approach is to create two independent CNN models: one processes a glyph input \(x_g\) and tags it with a glyph label \(g \in \mathcal {G}\); the other processes an enlarged input \(x_p\) and tags it with a position label \(p \in \mathcal {P}\). These two models are depicted in Fig. 4.

Fig. 4. Top: independent glyph classification model. Bottom: independent position classification model.

Category Output Model. Another approach uses a single enlarged input \(x_p\), and tags it by considering as the label set the Cartesian product of \(\mathcal {G}\) and \(\mathcal {P}\). We shall refer to the combined label of a symbol as its category, denoted by \(\mathcal {C}\). Therefore the model tags each input \(x_p\) as a pair \(c = (g,p)\), such that \(g\in \mathcal {G}\) and \(p\in \mathcal {P}\). This approach is depicted in Fig. 5.
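As a small illustrative sketch (names are hypothetical), the combined category label space could be built as the Cartesian product of glyphs and positions, restricted to the pairs that actually occur in the ground truth (see Sect. 3.2):

def build_category_index(annotations):
    """annotations: iterable of (glyph_label, position_label) pairs."""
    observed = sorted({(g, p) for g, p in annotations})
    # Map each observed (glyph, position) pair to a single category class index.
    return {pair: idx for idx, pair in enumerate(observed)}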

Fig. 5. Category output model: enlarged images are provided as input and the model must predict a label from the Cartesian product of glyphs and positions.

Category Output, Multiple Inputs Model. This model uses both the glyph and the enlarged input images, but still directly predicts the combined category c (Fig. 6).

Fig. 6. Category output with multiple inputs: both glyph bounding box and enlarged images are provided as input and the model must predict a label from the Cartesian product of glyphs and positions.

Multiple Outputs Model. This model takes enlarged inputs and predicts the glyph g and position p labels separately, as shown in Fig. 7. The model shares the intermediate data representation layers as input to both final fully-connected classification layers.

Fig. 7. Multiple outputs model: enlarged images are provided as input and the model must predict both the glyph and the position separately.

Fig. 8. Multiple inputs and outputs model: both glyph bounding boxes and enlarged images are provided as input and the model must predict both the glyph and the position separately.

Multiple Inputs and Outputs Model. Our last model takes both glyph and enlarged inputs \(x_g\) and \(x_p\), and predicts glyph g and position p labels as two different outputs, as depicted in Fig. 8.

3 Experimental Setup

3.1 Network Configuration

Given the proposed general topologies, this section describes the CNN architectures for each model. We have selected a base architecture by means of informal testing. Since this is designed to be used in an interactive scenario, the network must be light to allow for real-time processing.

The input to a convolutional layer is an \(m \times n \times r\) image, where m is the height, n is the width, and r is the number of channels; in this case \(r = 3\), since we work with RGB images. All single-input models share the same feature extractor and classification stage, although their concrete inputs and outputs differ. Their architecture consists of the repeated application of two \(3 \times 3\) convolutions with 32 filters, each one followed by a rectified linear unit, then a \(2 \times 2\) max-pooling operation with stride 2 for downsampling, followed by a dropout of 0.25. In the second stack of convolutional and max-pooling layers, the number of filters is doubled. Finally, after flattening, a fully-connected layer with 256 units, followed by a dropout of 0.25 and a softmax activation layer, is used for the classification stage.
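A minimal tf.keras sketch of this single-input topology is given below; the framework choice, the use of 'same' padding, and the ReLU activation of the 256-unit layer are assumptions not stated in the text.

from tensorflow.keras import layers, models

def build_single_input_cnn(input_shape, num_classes):
    inputs = layers.Input(shape=input_shape)  # e.g. (112, 40, 3) for enlarged inputs
    x = inputs
    for filters in (32, 64):  # the second stack doubles the number of filters
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
        x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.25)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)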

Models with two inputs have a parallel feature extractor for each input, whose outputs are concatenated after the flattening operation of their last stack. In models with two outputs, the network is forked into two output branches after the last fully-connected layer.
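Under the same assumptions, a sketch of the two-input, two-output variant could look as follows; the input shapes and layer names are illustrative.

from tensorflow.keras import layers, models

def conv_stack(x):
    # Same feature extractor as in the single-input sketch.
    for filters in (32, 64):
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.Conv2D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
        x = layers.Dropout(0.25)(x)
    return layers.Flatten()(x)

def build_multi_input_output_cnn(num_glyphs, num_positions):
    glyph_in = layers.Input(shape=(40, 40, 3), name="glyph_input")
    enlarged_in = layers.Input(shape=(112, 40, 3), name="enlarged_input")
    # Parallel extractors, concatenated after flattening.
    merged = layers.Concatenate()([conv_stack(glyph_in), conv_stack(enlarged_in)])
    shared = layers.Dropout(0.25)(layers.Dense(256, activation="relu")(merged))
    # Fork into the two classification heads after the fully-connected layer.
    glyph_out = layers.Dense(num_glyphs, activation="softmax", name="glyph")(shared)
    position_out = layers.Dense(num_positions, activation="softmax", name="position")(shared)
    return models.Model([glyph_in, enlarged_in], [glyph_out, position_out])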

3.2 Dataset

For our experiments, three different manuscripts of handwritten music scores in Mensural notation were available with symbol-level annotation. A summary of the number of samples, and glyph and position classes is shown in Table 1. Combined classes are the result of the Cartesian product of glyph and position classes, as stated above. It is worth noting that most combinations of glyph and position do not appear in our ground-truth, so they have been removed from the set of combined classes.

A major drawback of this ground-truth dataset is its high degree of label imbalance. Therefore, we took this into account while training our models by weighting the loss for each label according to its relative frequency in the dataset.
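A minimal sketch of one such frequency-based weighting, assuming integer-encoded labels (the concrete formula below is an illustrative choice, not necessarily the one used in our experiments):

import numpy as np

def compute_class_weights(labels):
    """labels: array of integer-encoded class labels for the training set."""
    counts = np.bincount(labels)
    # Inverse-frequency weighting, normalized so the average weight is roughly 1.
    weights = counts.sum() / (len(counts) * np.maximum(counts, 1))
    return {cls: float(w) for cls, w in enumerate(weights)}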

Furthermore, since the symbol images can vary drastically in size, and for the sake of creating a dataset suitable for training CNN models, the input images are resized to a fixed size. We resized images to \(40 \times 40\) pixels for glyph inputs, and to \(40 \times 112\) pixels for enlarged inputs.
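For illustration, and assuming the sizes above are given as width x height, this resizing could be done with OpenCV as follows (glyph_crop and enlarged_crop refer to the crops from the sketch in Sect. 2.1):

import cv2

glyph_input = cv2.resize(glyph_crop, (40, 40))         # glyph inputs: 40 x 40
enlarged_input = cv2.resize(enlarged_crop, (40, 112))  # enlarged inputs: 40 x 112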

Table 1. Distribution of classes and samples in the Mensural notation symbols dataset.

3.3 Evaluation

In order to evaluate our proposed models, we conducted a 5-fold cross-validation scheme for the six models considered. In each fold, 80% of the available data is used for training and 20% for validation. Also, each fold was created so as to preserve the class imbalance of the original dataset (stratified sampling).
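As an illustrative sketch (the actual tooling is an assumption), such stratified folds could be obtained with scikit-learn, where images and category_labels denote the full set of samples and their combined labels:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(images, category_labels):
    # Train on images[train_idx] and validate on images[val_idx] in each fold.
    pass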

Moreover, a grid search was carried out in order to tune the hyperparameters. As a result, every architecture was trained for 15 epochs with a batch size of 32, using the RMSprop optimizer [12].
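For a single-output model built with the earlier tf.keras sketches, this training configuration could look as follows; the categorical cross-entropy loss and the variable names (model, x_train, class_weights, and so on) are assumptions made for the example:

from tensorflow.keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(), loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=15, batch_size=32,
          class_weight=class_weights,  # frequency-based weights from Sect. 3.2
          validation_data=(x_val, y_val))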

Given that our purpose is to evaluate the correct recognition of music categories, we measure our models’ accuracy by requiring that both the glyph and the position of a given input are labeled correctly at once. More precisely, given an input series \(x = \{x_{1}, x_{2}, \ldots , x_{n}\}\), we calculate, for every model, the accuracy of the three different outputs of a symbol: position as \(p = \{p_{1}, p_{2}, \ldots , p_{n}\}\), glyph as \(g = \{g_{1}, g_{2}, \ldots , g_{n}\}\), and category as \(c = \{c_{1}, c_{2}, \ldots , c_{n}\}\).
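A short sketch of these three figures, with illustrative names: a category prediction counts as correct only when both of its components match the ground truth.

import numpy as np

def evaluate(pred_glyphs, pred_positions, true_glyphs, true_positions):
    glyph_ok = np.asarray(pred_glyphs) == np.asarray(true_glyphs)
    position_ok = np.asarray(pred_positions) == np.asarray(true_positions)
    return {
        "glyph_accuracy": glyph_ok.mean(),
        "position_accuracy": position_ok.mean(),
        # A category is correct only if both of its components are correct.
        "category_accuracy": (glyph_ok & position_ok).mean(),
    }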

Another important factor to consider is the complexity of each model, since, as mentioned above, both effectiveness and efficiency are necessary when applying these models in a real scenario. In order to provide a measure of efficiency that does not depend on the underlying hardware used in the experiments, we consider the number of (trainable) parameters of the neural model as a measure of its complexity.
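For a tf.keras model such as the sketches above (here assumed to be bound to the variable model), this figure can be computed by summing the sizes of the trainable weights:

import numpy as np

trainable_params = sum(int(np.prod(w.shape.as_list())) for w in model.trainable_weights)
print(f"{trainable_params / 1e6:.2f}M trainable parameters")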

4 Results

Table 2 presents the averages and standard deviations of the error rates achieved by the different classification schemes over the five folds, as well as the complexity (number of trainable parameters) of each model.

The best results in terms of accuracy are obtained by the model with multiple inputs and outputs, followed, with almost no significant difference, by the model with multiple outputs, and then by the independent glyph and position models. Moreover, the worst results are obtained by the models that consider the full category as output, especially when using multiple inputs.

Table 2. Error rate (average ± std. deviation) and complexity with respect to the neural architecture considered for music symbol classification. The complexity of each model is measured as millions of trainable parameters.

Another factor for evaluation is the trade-off between complexity and accuracy. All models with multiple inputs share the same complexity, as do all models with a single input. The best model, with multiple inputs and outputs, has a notably higher complexity than the second-best one, which is relevant given that both perform comparably well.

It is interesting to point out that, contrary to expectation, the models with multiple outputs perform better. Since glyph and position are dependent features for most instances, one could expect models predicting combined classes to produce better results. However, we are aware that this outcome might simply be caused by not having enough data to train such models.

5 Conclusions

In this work, we proposed different segmentation-based approaches to recognize the glyph and position of handwritten music symbols, since traditional OMR systems mostly focus on recognizing the glyph of music symbols, but not their position with respect to the staff. Therefore, we propose to classify glyph and position in the same step. Our approach is based on different CNN architectures that predict the glyph, the position, or their combination, by training independent models as well as multi-input and multi-output models.

Experiments were carried out on a dataset of handwritten music scores in Mensural notation, where we evaluated each model in terms of its accuracy in labeling glyph, position, and their combination, as well as its complexity, in order to determine the best model for our purpose.

The results suggest that in order to obtain the best accuracy, models should predict glyph and position labels separately, instead of predicting the combination of both as a single class. We can conclude that interesting insight has been gained with regard to achieving a complete system for extracting the musical content from an image of a score.

As future work, we plan to consider the use of data augmentation for boosting the performance. However, this has to be designed carefully, as traditional data augmentation procedures might not behave correctly with regard to the positions of the symbols. We also plan to reuse the outcomes of this paper for segmentation-free recognition, which does not need a previous localization of the symbols in the image [10].

As a further step towards the goal of this research, these models must be integrated into a fully-automated transcription system that uses semantic analysis tools to assign actual musical meaning to the output of the considered models.