1 Introduction

Cardiac arrhythmias are a large class of cardiovascular diseases (CVDs) encountered in clinical practice, and they pose a serious threat to human health [1, 2]. There are many kinds of arrhythmias, and they account for more than half of the patients diagnosed with the surface electrocardiogram (ECG). The painless and inconspicuous nature of silent myocardial ischemia (SMI) is responsible for many sudden deaths, which creates the need for long-term monitoring of specific patients [3]. Continuous monitoring of patients with certain CVDs, including post-myocardial infarction (MI), tachycardia, arrhythmias and left ventricular dysfunction, helps to assess cardiovascular health effectively [4].

For the implementation of dynamic monitoring, the primary task is to collect ECG in real time. Nowadays, there are many approaches to measuring and recording ECG. Da Silva et al. [5] provided a taxonomy of state-of-the-art ECG measurement methods: in-the-person, on-the-person and off-the-person. The majority of devices used for ECG measurement belong to the on-the-person category. The ECG signal can be acquired from multiple sensor configurations on the body, each reflecting the health of a different part of the heart, and it is easily affected by body position. In 1961, Holter [6] introduced techniques for continuous recording of ECG from ambulatory subjects over periods of many hours; the long-term ECG (Holter recording), typically with a duration of 24 h, has since become the standard technique for observing transient aspects of cardiac electrical activity [7]. Recorded signals are then analyzed off-line using dedicated diagnostic systems [3, 8, 9, 10, 11]. However, many arrhythmias are characterized by intermittent and short-lasting episodes that are not usually found early and therefore have a very poor detection rate [12]. The off-line diagnostic mechanism of Holter recording limits its application in homecare, which places extremely high demands on the early and quick diagnosis of ECG arrhythmia. Therefore, wearable real-time ECG for long-term monitoring has the potential to be the most useful non-invasive tool for assessing cardiac health.

De Chazal [13] demonstrated that similar effectiveness for ECG arrhythmia classification can be obtained at a lower computational cost when using only one lead, compared with methods using multiple leads [14]. The bipolar limb leads are among the most widely used configurations and can display the three most important waves: the P wave, the QRS complex and the T wave. These waves correspond to the field induced by the electrical phenomena occurring on the heart surface, namely atrial depolarization (P wave), ventricular depolarization (QRS complex) and repolarization (T wave) [15]. The patterns provoked by arrhythmias can change these waves profoundly. Therefore, a wearable device with analog limb leads (lead I, II, III) is mainly used to monitor the arrhythmias highlighted in this paper: (1) sinus arrhythmias: sinus bradycardia, sinus tachycardia and sinus arrest; (2) atrial arrhythmias: premature atrial contractions (PACs), non-sustained/sustained atrial tachycardia (NSAT/SAT), and atrial fibrillation (AF); (3) ventricular arrhythmias: premature ventricular contractions (PVCs) and non-sustained/sustained ventricular tachycardia (NSVT/SVT).

Unfortunately, one of the fundamental problems associated with measuring dynamic ECG is the decrease in signal quality due to unexpected environmental disturbances [16, 17]. Artifacts from physiological and non-physiological sources are common. Wearable ECG monitoring devices are commonly based on single-lead measurements with dry metal plate electrodes [18], resulting in much smaller signal amplitudes and noisier waveforms than wet adhesive electrodes. In the field of cardiology, there is an urgent need for databases recorded under such conditions, as they allow manufacturers to design systems and hospitals to measure the performance of their systems against manufacturers’ claims. Several standard ECG databases are already available to evaluate algorithms for different test purposes [19]. The databases most commonly used in published arrhythmia research are the MIT-BIH Arrhythmia Database, QT Database, CSE Database, and AHA Database [20]. Although these databases are classical, they were not recorded with wearable devices, which makes them less than ideal for designing automatic analysis algorithms for dynamic ECG [21, 22, 23].

The aim of this paper is to present two ECG databases suitable for the development and testing of ECG classification methods. The signal quality database contains recordings at three different levels of signal quality. The arrhythmias database contains several subcategories, corresponding to various types of arrhythmia.

2 The Structure of Database

2.1 Data Acquisition

In China, patients’ personal and disease information is stored on the servers and hard disks of their respective hospitals. Because of the confidentiality agreements of equipment manufacturers and the ethics requirements of hospitals, retrieving existing recordings is not practical. Self-collection was therefore the only feasible method.

With the ethical approval of the First Affiliated Hospital of Nanjing Medical University, Southeast University jointly collected ECG data with a wearable wireless ECG monitor that has passed FDA certification. The recordings acquired by the device are all 6-lead ECGs, digitized at 400 samples per second per channel with 12-bit resolution over a frequency response bandwidth of 0.05–40 Hz. More than 200 individuals with arrhythmias, aged between 18 and 82, were tested. All subjects were trained to put on the wireless ECG monitor without help and to wear it for at least 24 h and up to 72 h, so as to cover all possible onsets of arrhythmia. According to the agreement reached, the data can be used freely, provided that the patients’ identities remain anonymous. The overall procedure is shown in Fig. 1.

Fig. 1 The flow chart of data acquisition and processing
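The acquisition settings quoted above imply a sizeable raw data volume per subject. The following is a rough estimate only, assuming each 12-bit sample is stored unpacked in a 2-byte word (a storage assumption that the paper does not state); the helper name `raw_recording_bytes` is hypothetical.

```python
def raw_recording_bytes(hours, leads=6, fs_hz=400, bytes_per_sample=2):
    """Estimate the raw storage needed for one recording.

    leads and fs_hz follow the acquisition settings quoted in the text.
    bytes_per_sample=2 is an assumption: 12-bit samples padded to 16 bits.
    """
    samples_per_lead = int(hours * 3600 * fs_hz)
    return samples_per_lead * leads * bytes_per_sample


# Shortest and longest wearing times mentioned in the text.
for hours in (24, 72):
    size_mb = raw_recording_bytes(hours) / 1e6
    print(f"{hours} h: about {size_mb:.0f} MB of raw samples")
```

Under these assumptions a single 24 h recording already amounts to several hundred megabytes, which helps explain why beat-by-beat manual annotation of such recordings is laborious.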

2.2 Annotation Workflow

For many bio-signal processing applications, the performance of algorithms and systems must be evaluated against reference or ‘gold standard’ annotations. When this ground truth is not readily available, it is common to have one or more expert annotators evaluate the data. Thus, an annotation platform was developed that combines automatic classification algorithms with the work of three cardiologists. First, a large number of ECG recordings stored on the cloud platform were uploaded to the annotation platform in the standard ECG drawing format. Then, an automatic step generated coarse annotations with commonly used algorithms. After that, two clinical cardiologists independently corrected the automatic labels. Finally, a third expert checked the results, identified the labels on which the two cardiologists disagreed, and made the final determination.
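The three-stage workflow above can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's actual implementation: the function name `adjudicate`, the dictionary format for reviewer corrections, and the `arbiter` callable standing in for the third cardiologist are all hypothetical.

```python
def adjudicate(auto_labels, reviewer_a, reviewer_b, arbiter):
    """Merge two independent reviews of automatic beat labels.

    auto_labels: list of labels from the automatic step.
    reviewer_a, reviewer_b: dicts mapping beat index -> corrected label,
        holding only the beats each reviewer changed (assumed format).
    arbiter: callable (index, label_a, label_b) -> final label, standing
        in for the third expert who resolves disagreements.
    Returns (final_labels, disputed_indices).
    """
    final, disputed = [], []
    for i, auto in enumerate(auto_labels):
        a = reviewer_a.get(i, auto)  # unchanged beats keep the automatic label
        b = reviewer_b.get(i, auto)
        if a == b:
            final.append(a)
        else:
            disputed.append(i)
            final.append(arbiter(i, a, b))
    return final, disputed


auto = ["N", "N", "V", "N", "A"]
final, disputed = adjudicate(
    auto,
    reviewer_a={1: "A"},
    reviewer_b={1: "A", 3: "V"},
    arbiter=lambda i, a, b: b,  # toy arbiter that sides with reviewer B
)
print(final, disputed)
```

Only the beats on which the two reviewers differ reach the arbiter, which keeps the third expert's workload proportional to the disagreement rate.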

2.3 ECG Cloud Platform

The construction of an annotated open-access database is a long-term task, so it is sensible to support it with a dedicated platform. To this end, a cloud platform composed of the following five parts was developed.

  1. Information management platform: handles the personal information of doctors and users, such as account information, patient identification and medical history;

  2. Communication management platform: transmits real-time abnormal information generated by the wearable monitor and feedback from doctors;

  3. Storage management platform: provides unified management of raw data storage, modification, verification, search, forwarding, etc.;

  4. Crowd-sourcing labeling platform: provides unified management of the ECG labeling work. Recordings in cloud storage that have no more than three different annotations are released to this platform on a fee-for-service basis, to attract cardiologists and experts to the annotation work. Double-blind measures are taken among experts to improve the reliability of the labeling results. After an automatic comparison of the annotations, recordings with conflicting labels re-enter the labeling platform, while the others are moved to a separate cloud store;

  5. Cloud computing platform: an automatic ECG analysis framework that makes diagnoses, draws conclusions and generates diagnostic reports. A preprocessing step removes noise, mainly power-line interference, baseline drift, myoelectric interference, motion artifacts, and electrode contact noise [24]. Annotating the QRS complex is the most important task, for which an algorithm based on the Pan and Tompkins method performs well [23, 25, 26]. Features are then extracted to represent the patterns with minimal loss of important information, and a neural network classifier based on CNN and LSTM is used for the classification procedure [27, 28]. Finally, a long-term ECG report is exported in a standard format with the results of physician-assisted diagnosis.
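To make the QRS annotation step in item 5 more concrete, the following is a heavily simplified sketch in the spirit of the Pan and Tompkins method (derivative, squaring, moving-window integration, thresholding). It is not the platform's detector: the band-pass stage and the adaptive thresholds of the original method are omitted, a fixed threshold is used instead, and upright R waves are assumed. The function name `detect_qrs` and all parameter defaults are illustrative assumptions.

```python
def detect_qrs(ecg, fs=400, window_s=0.15, refractory_s=0.2, threshold_ratio=0.5):
    """Return sample indices of detected R peaks (assumes upright R waves)."""
    if len(ecg) < 2:
        return []
    # 1. Derivative: emphasises the steep slopes of the QRS complex.
    diff = [ecg[i + 1] - ecg[i] for i in range(len(ecg) - 1)]
    # 2. Squaring: makes every value non-negative and amplifies large slopes.
    squared = [d * d for d in diff]
    # 3. Moving-window integration over roughly one QRS width.
    win = max(1, int(round(window_s * fs)))
    integrated, running = [], 0.0
    for i, value in enumerate(squared):
        running += value
        if i >= win:
            running -= squared[i - win]
        integrated.append(running / win)
    # 4. Fixed threshold (a simplification of the adaptive thresholds).
    threshold = threshold_ratio * max(integrated)
    if threshold <= 0:
        return []  # flat signal, nothing to detect
    refractory = int(round(refractory_s * fs))
    peaks, i, n = [], 0, len(integrated)
    while i < n:
        if integrated[i] <= threshold:
            i += 1
            continue
        j = i
        while j < n and integrated[j] > threshold:
            j += 1
        # The integrator lags the QRS, so search back by one window length.
        lo, hi = max(0, i - win), min(len(ecg), j + 1)
        peak = max(range(lo, hi), key=lambda k: ecg[k])
        if not peaks or peak - peaks[-1] >= refractory:
            peaks.append(peak)
        i = j
    return peaks


# Synthetic 5-s trace at 400 Hz with one triangular "beat" per second.
signal = [0.0] * 2000
for centre in range(200, 2000, 400):
    for offset, amp in ((-2, 0.2), (-1, 0.6), (0, 1.0), (1, 0.6), (2, 0.2)):
        signal[centre + offset] = amp
print(detect_qrs(signal))
```

A fixed threshold is adequate for a clean synthetic trace; on dry-electrode recordings with varying amplitude, adaptive thresholds such as those in the original method would be needed.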

2.4 Data Schema

To support signal quality assessment (SQA) and the analysis of arrhythmias based on rhythm changes, two separate databases were carefully constructed by several clinical cardiologists according to their diagnostic experience and the criteria of ECG diagnosis. Signal quality was divided empirically into three levels; the detailed rules are given in Sect. 2.4.1. The arrhythmia database contains three categories, namely sinus rhythm (category ‘N’), atrial rhythm (category ‘A’), and ventricular rhythm (category ‘V’), each of which is composed of a variety of arrhythmias described in Sect. 2.4.2.

2.4.1 Signal Quality Database

In recent years, the analysis and evaluation of various physiological signals, especially ECG signal quality, have been a hot topic [29, 30, 31]. The PhysioNet/Computing in Cardiology Challenge (CinC) 2011 [32] aimed to develop an efficient algorithm able to run in near real-time on a mobile phone [17], which could provide useful feedback to a layperson in the process of acquiring ECG recordings. Because dry electrodes yield poor signal quality [33, 34], SQA is considered a main target. In addition, with fewer leads there is a high probability that the available channels are contaminated simultaneously, which prevents joint diagnosis across multiple leads. Therefore, a specialized database containing 300 recordings, each lasting 10 s, was designed (see Table 1). It is divided into three categories of signal quality: good signal quality (Type ‘A’), medium signal quality (Type ‘B’) and poor signal quality (Type ‘C’). Typical examples of ECG waveforms are shown in Fig. 2.

Table 1 Specification for signal quality division

Fig. 2 Typical examples of the three signal quality categories. (a) Good signal quality; (b) medium signal quality; (c) poor signal quality. ECGs of good signal quality have clear and distinct P-QRS-T morphologies, with only slight noise or occasional artifacts. ECGs of medium signal quality show obvious rhythm characteristics but carry distinct noise and cannot be used for morphological diagnosis. ECGs of poor signal quality are entirely unacceptable recordings because of the large proportion of noise

2.4.2 Arrhythmias Database

Based on the above classification rule for signal quality, the arrhythmia database was set up by screening lead I ECG signals of type ‘A’. The ANSI/AAMI EC57:1998/(R) 2008 standard [35] specifies that records of patients using pacemakers should not be considered. In addition, segments of data containing ventricular flutter or fibrillation (VF) were excluded from the analysis. Finally, according to classification by rate, mechanism or duration, a database containing three major categories and 18 subcategories was completed. Note that, because of the small number of subjects with certain heart diseases, several subtypes contain only a few recordings (see Table 2, followed by the corresponding sample images and diagnostic patterns). Sample images are presented as standard ECG drawings lasting 10 s, extracted from the complete signal (Figs. 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14).

Table 2 The structure of the arrhythmias database

Fig. 3 Sample graph of sinus tachycardia, where the black asterisks mark the positions of sinus P waves. The HR is 120 bpm and the PR interval is 0.16 s. The PR interval tends to shorten as heart rate increases because increased sympathetic tone speeds conduction through the atrioventricular (AV) node

Fig. 4 Sample graph of sinus bradycardia, where the black asterisks mark the positions of sinus P waves. The HR is 40 bpm

Fig. 5 Sample graph of sinus pause, where the black asterisks mark the positions of sinus P waves. There is a 2.2-s interval between the 5th and 6th peaks before the heartbeat resumes, indicating a sinus pause

Fig. 6 Sample graph of a single PAC, where the black asterisks ‘*’ mark the positions of sinus P waves, the black ‘+’ marks the position of the atrial P wave, and the black arrow points to the atrial QRS complex. An abnormal (non-sinus) P′ wave (labeled with ‘+’) is followed by a QRS complex. The P′ wave typically has a different morphology from the sinus P waves (labeled with black asterisks ‘*’) because the premature beat initiates outside the sinoatrial (SA) node. As shown in the figure, the SA node resets when the PAC reaches and depolarizes it, resulting in an incomplete compensatory pause before the next sinus beat arrives

Fig. 7 Sample graphs of (a) atrial bigeminy and (b) atrial trigeminy, where the black asterisks mark the positions of sinus P waves, the black ‘+’ marks the positions of atrial P waves, and the black arrows point to atrial beats. PACs often occur in repeating patterns: bigeminy means that every other beat is a PAC, and trigeminy means that every third beat is a PAC. PACs arise from a premature atrial ectopic pacemaker. A low atrial ectopic pacemaker activates the atria retrogradely, producing an inverted P wave with a relatively short PR interval (see P waves in A12_026). PACs arriving early in the cycle may be conducted aberrantly, usually with an RBBB morphology, as the right bundle branch has a longer refractory period than the left (see QRS wave in A13_003). Furthermore, all abnormal P waves keep the same morphology, which indicates unifocal premature atrial contractions

Fig. 8 Sample graph of an atrial couplet, where the black asterisks mark the positions of sinus P waves, the black ‘+’ marks the positions of atrial P waves, and the black arrows point to atrial QRS complexes. PACs may occur frequently or sporadically. Two PACs occurring consecutively are referred to as an atrial couplet

Fig. 9 Sample graph of paroxysmal atrial tachycardia, where the black asterisks mark the positions of sinus P waves, the black ‘+’ marks the positions of atrial P waves, and the black arrows point to atrial QRS complexes. The ECG strip between the 1st and 5th QRS complexes shows a regular sinus rhythm at 62 bpm. There is a narrow QRS complex tachycardia at 110 bpm between the 6th and 13th QRS complexes; each QRS complex labeled with a black arrow is preceded by an abnormal P wave (black ‘+’). The atrial P wave morphology is abnormal compared with the sinus P wave because of its ectopic origin

Fig. 10 Sample graph of atrial fibrillation with irregular baseline undulation. The rhythm is irregularly irregular, and both the isoelectric baseline and the P waves are absent. Coarse fibrillatory waves are visible at a high rate

Fig. 11 Sample graph of a single PVC, where the black ‘V’ marks the positions of PVCs and the black inverted triangles mark abnormal P waves. The strip shows sinus rhythm with PVCs of the same morphology (‘V’). With a full compensatory pause, the next normal beat arrives after an interval equal to double the preceding RR interval. Note the appropriately discordant ST segments/T waves. The ectopic impulse is conducted retrogradely through the AV node, producing atrial depolarization. Inverted P waves labeled with inverted triangles (‘retrograde P waves’) occur after the QRS complexes

Fig. 12 Sample graphs of (a) bigeminy, (b) trigeminy and (c) interpolated PVCs, where the black ‘V’ marks the positions of PVCs and the black inverted triangles mark abnormal P waves. Ventricular bigeminy means that every other beat is a PVC, and ventricular trigeminy means that every third beat is a PVC. In panel (c), both PVCs are interpolated, because they are sandwiched between two normal sinus beats without the compensatory pause that typically follows a PVC

Fig. 13 Sample graph of ventricular couplets, where the black ‘V’ marks the positions of PVCs and the black inverted triangles mark abnormal P waves. A PVC couplet is two PVCs in a row; their identical morphology means they are unifocal. The 7th beat is an interpolated PVC

Fig. 14 Sample graph of non-sustained ventricular tachycardia, where the black ‘V’ marks the positions of PVCs. The four consecutive ventricular beats have an HR of 150 bpm

2.4.2.1 Sinus Rhythm

In normal sinus rhythm, the ECG is periodic with a stable PR interval shorter than 0.2 s. Normal P waves in lead II are usually upright and consistent in duration. Bradycardia is defined as a resting heart rate (HR) of less than 60 bpm; conversely, a resting HR greater than 100 bpm is called tachycardia. When the sinus node ceases to generate electrical impulses for a variable period of time, the condition is referred to as sinus pause if only one or two beats are missed and as sinus arrest if more than two beats are missed.
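The rate and missed-beat criteria above translate directly into simple rules. The sketch below assumes the rhythm has already been identified as sinus and that RR intervals are available in seconds; the function names are hypothetical and the code is not taken from the authors' analysis framework.

```python
def mean_heart_rate_bpm(rr_intervals_s):
    """Mean heart rate from a list of RR intervals given in seconds."""
    return 60.0 / (sum(rr_intervals_s) / len(rr_intervals_s))


def classify_sinus_rate(hr_bpm):
    """Apply the 60 bpm and 100 bpm cut-offs quoted in the text."""
    if hr_bpm < 60:
        return "sinus bradycardia"
    if hr_bpm > 100:
        return "sinus tachycardia"
    return "normal sinus rate"


def classify_missed_beats(missed_beats):
    """One or two missed beats: sinus pause; more than two: sinus arrest."""
    if missed_beats <= 0:
        return "no missed beats"
    if missed_beats <= 2:
        return "sinus pause"
    return "sinus arrest"


print(classify_sinus_rate(mean_heart_rate_bpm([1.5, 1.5, 1.5])))  # RR of 1.5 s is 40 bpm
print(classify_sinus_rate(mean_heart_rate_bpm([0.5, 0.5, 0.5])))  # RR of 0.5 s is 120 bpm
print(classify_missed_beats(1))
```

The two example rates match the sample recordings described for sinus bradycardia (40 bpm) and sinus tachycardia (120 bpm).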

2.4.2.2 Atrial Rhythm

A premature beat initiates outside the sinoatrial node. An atrial premature beat, also called a PAC, is an ectopic beat that originates in the atria. Typically, the atrial impulse propagates normally through the AV node into the ventricles, resulting in a normal or narrow QRS complex. An atrial premature beat is associated with an incomplete compensatory pause, meaning that the interval between the preceding and following sinus beats is less than twice the normal sinus cycle.
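The compensatory pause criterion can be checked numerically by comparing the interval between the sinus beats that surround a premature beat with the normal sinus cycle length. The sketch below also covers the full compensatory pause and the interpolated case described in the PVC figure captions. The function name `classify_pause` and the 10% tolerance are assumptions made for illustration; the paper gives no numeric tolerance.

```python
def classify_pause(pre_beat_s, post_beat_s, sinus_rr_s, tolerance=0.1):
    """Classify the pause around a premature beat.

    pre_beat_s, post_beat_s: times (s) of the sinus beats immediately
        before and after the premature beat.
    sinus_rr_s: normal sinus cycle length (s).
    tolerance: allowed deviation of the ratio from 1 or 2 (assumed value).
    """
    ratio = (post_beat_s - pre_beat_s) / sinus_rr_s
    if abs(ratio - 1.0) <= tolerance:
        return "interpolated (no pause)"
    if abs(ratio - 2.0) <= tolerance:
        return "full compensatory pause"
    if ratio < 2.0:
        return "incomplete compensatory pause"
    return "longer than a full compensatory pause"


# Sinus cycle of 0.8 s; the surrounding sinus beats are 1.4 s apart.
print(classify_pause(0.0, 1.4, 0.8))
```

An incomplete pause is what the text associates with PACs, while a full pause or an interpolated beat is described for PVCs, so this ratio can serve as one supporting feature when distinguishing the two.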

A single PAC shows a regular underlying rhythm interrupted by a premature beat, which can be identified by an abnormal P wave of different size and shape. Atrial bigeminy is an abnormal rhythm in which each sinus beat is coupled to a premature atrial complex followed by a slight post-ectopic pause. PACs may occur frequently or sporadically. Two PACs occurring consecutively are referred to as an atrial couplet. Paroxysmal atrial tachycardia has a high, regular rate of about 140–250 bpm. AF has an atrial rate of more than 400 bpm and is distinguishable by its haphazardly irregular ventricular rate.
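Given a sequence of beat labels, the repeating patterns described above (isolated premature beat, couplet, bigeminy, trigeminy) can be recognized with simple string matching. This is an illustrative sketch only: it assumes single-character beat labels such as ‘N’ and ‘A’, and the requirement of three consecutive cycles (`min_cycles=3`) before calling a pattern bigeminy or trigeminy is an assumption, not a rule stated in the paper.

```python
from itertools import groupby


def describe_ectopy(labels, ectopic="A", normal="N", min_cycles=3):
    """Name the dominant ectopic pattern in a sequence of beat labels.

    labels: iterable of single-character beat labels.
    min_cycles: repetitions required for bigeminy/trigeminy (assumed).
    """
    labels = list(labels)
    seq = "".join(labels)
    # Length of the longest run of consecutive ectopic beats.
    longest_run = max(
        (len(list(group)) for key, group in groupby(labels) if key == ectopic),
        default=0,
    )
    if longest_run >= 3:
        return "run of three or more ectopic beats"
    if longest_run == 2:
        return "couplet"
    if (normal + ectopic) * min_cycles in seq:
        return "bigeminy"  # every other beat is ectopic
    if (normal * 2 + ectopic) * min_cycles in seq:
        return "trigeminy"  # every third beat is ectopic
    if longest_run == 1:
        return "isolated premature beat"
    return "no ectopy"


print(describe_ectopy("NANANANA"))
print(describe_ectopy("NNVNNVNNV", ectopic="V"))
```

Because the ectopic label is a parameter, the same function applies to the ventricular patterns (with ‘V’) described later in the section.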

2.4.2.3 Ventricular Rhythm

PVCs are premature ectopic beats arising from the right ventricle (RV) or left ventricle (LV) that can occur in a variety of patterns and can occasionally cause uncomfortable symptoms. PVCs are characterized by premature, bizarrely shaped QRS complexes of long duration (typically > 120 ms). These complexes are not preceded by a P wave, and the T wave is usually large and oriented in the direction opposite to the major deflection of the QRS. Ventricular tachycardia (VT) is an ectopic ventricular rhythm with wide QRS complexes (120 ms or greater) and a rate faster than 100 bpm, lasting for at least three beats; it is termed non-sustained when it resolves spontaneously in less than 30 s.
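The VT criteria above (at least three consecutive ventricular beats, rate above 100 bpm, and the 30-s boundary between non-sustained and sustained episodes) can be written as a small rule. The sketch assumes the beats have already been labeled as ventricular with wide QRS complexes, so QRS width is not checked here; the function name `classify_ventricular_run` is hypothetical.

```python
def classify_ventricular_run(n_beats, duration_s):
    """Classify a run of consecutive ventricular beats.

    n_beats: number of consecutive ventricular beats in the run.
    duration_s: time (s) from the first to the last beat of the run.
    """
    if n_beats < 3 or duration_s <= 0:
        return "not VT"
    # n_beats beats span (n_beats - 1) RR intervals.
    rate_bpm = 60.0 * (n_beats - 1) / duration_s
    if rate_bpm <= 100:
        return "not VT"
    return "non-sustained VT" if duration_s < 30 else "sustained VT"


# Four consecutive ventricular beats at 150 bpm: three RR intervals of 0.4 s.
print(classify_ventricular_run(4, 1.2))
```

The example mirrors the non-sustained VT sample recording, in which four consecutive ventricular beats occur at 150 bpm.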

2.5 Database Scale Expectation

The arrhythmias database is intended to grow into a complete, standard annotated ECG database in the future. Completed annotation work will be released to the public every year; more information can be found at http://www.shelab.cn/Data.

3 Discussion and Further Work

The size and diversity of databases often play a more important role in machine learning than the learning algorithm and the techniques employed. One of the obstacles to research on fully automatic ECG analysis is the insufficient quantity of available databases. Standard ECG databases are created for validating algorithms and testing instruments for feature detection and disease diagnosis. The ECG databases published on the PhysioNet platform were mostly collected with high quality in clinical environments and are the first choice for most research. Research on wearable devices has progressed haltingly compared with many classical methods in the literature, owing to the unbalanced development of traditional databases and dynamic databases. The wearable monitoring of non-emergency arrhythmias creates a strong demand for dynamic databases of signal quality and arrhythmias. This study carefully organized the classification and annotation of such databases and made them freely available. The signal quality database contains 300 recordings lasting 10 s each, sampled at 400 Hz and evenly divided into good, medium and poor signal quality. The arrhythmias database consists of 2000 single-channel arrhythmia ECG records, each 30 s long and sampled at 400 Hz. It contains three categories: sinus arrhythmia, atrial arrhythmia and ventricular arrhythmia. Such a database greatly helps in training annotation and classification algorithms.

Unfortunately, only a small amount of data was obtained for several categories of the arrhythmias database, owing to the insufficient diversity of diseases among the subjects and the rarity of particular diseases. More patients with particular heart diseases should be tested to cover the existing deficiencies. More cardiologists are also needed, because recordings acquired in dynamic conditions are always very long (24 to 72 h), which makes beat-by-beat annotation a formidable task. Demographic distributions are another important factor in disease prediction, but they are rarely available. To address these issues, an ECG labeling crowd-sourcing platform will be released to accelerate the process. Furthermore, we look forward to the release of more annotated data and to novel advances in existing databases.