A Text Mining Pipeline Using Active and Deep Learning Aimed at Curating Information in Computational Neuroscience

The curation of neuroscience entities is crucial to ongoing efforts in neuroinformatics and computational neuroscience, such as those deployed in the context of large-scale brain modelling projects. However, manually sifting through thousands of articles for new information about modelled entities is a painstaking and low-reward task. Text mining can help a curator extract relevant information from this literature in a systematic way. We propose the application of text mining methods to the neuroscience literature. Specifically, two computational neuroscientists annotated a corpus of entities pertinent to neuroscience, using active learning techniques to enable swift, targeted annotation. We then trained machine learning models to recognise the annotated entities. The entities covered are Neuron Types, Brain Regions, Experimental Values, Units, Ion Currents, Ion Channels, Ion Conductances, and Model Organisms. We tested a traditional rule-based approach, a conditional random field, and a deep learning named entity recognition model, finding that the deep learning model was superior. Our final results show that we can detect a range of named entities of interest to the neuroscientist with macro-averaged precision, recall and F1 score of 0.866, 0.817 and 0.837, respectively. The contributions of this work are as follows: 1) We provide a set of Named Entity Recognition (NER) tools that are capable of detecting neuroscience entities with performance above or similar to prior work. 2) We propose a methodology for training NER tools for neuroscience that requires very little training data to achieve strong performance. This can be adapted for any sub-domain within neuroscience. 3) We provide a small corpus with annotations for multiple entity types, as well as annotation guidelines to help others reproduce our experiments.
Electronic supplementary material: The online version of this article (10.1007/s12021-018-9404-y) contains supplementary material, which is available to authorized users.
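The macro-averaged scores quoted in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of a macro average (the unweighted mean of per-entity-type scores); the entity counts below are hypothetical and are not taken from the paper's results, and the function names are ours.

```python
from statistics import mean

def prf(tp, fp, fn):
    """Precision, recall and F1 for one entity type from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def macro_average(counts):
    """Unweighted mean of per-type precision, recall and F1."""
    per_type = [prf(*c) for c in counts.values()]
    return tuple(mean(column) for column in zip(*per_type))

# Hypothetical (true positive, false positive, false negative) counts.
counts = {
    "Brain Region": (90, 10, 10),
    "Ion Channel": (40, 10, 40),
}
macro_p, macro_r, macro_f1 = macro_average(counts)
print(f"P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
```

Under this definition the macro F1 is the mean of the per-type F1 scores, so it need not equal the harmonic mean of the macro precision and macro recall. The reported 0.837 is indeed slightly different from the harmonic mean of 0.866 and 0.817 (about 0.841), which is consistent with this reading, although the source does not state how the average was computed.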


Directives - entity imbrication
When an entity is a specific case of a more inclusive entity within a nested (imbricated) structure, both the specific entity and the inclusive entity should be annotated.

Directives - lists
Each item in a list of entities should be annotated as a separate entity.

Directives - split annotations
Annotations should be split into segments when necessary. Generally, this applies to lists.
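The list directive can be sketched as character-offset spans, one per list item. This is only an illustration: the `Span` type, the `BrainRegion` label string and the helper function are hypothetical names, and the sentence is a shortened form of a brain-region example from these guidelines.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int   # offset of the first character
    end: int     # offset one past the last character
    label: str

def annotate_terms(text, terms, label):
    """Return one span per term found in the text, in reading order."""
    spans = []
    for term in terms:
        start = text.find(term)
        if start != -1:
            spans.append(Span(start, start + len(term), label))
    return sorted(spans)

text = ("neurons in areas including the cortex, cerebellum, "
        "basal ganglia, and thalamus.")
terms = ["cortex", "cerebellum", "basal ganglia", "thalamus"]
spans = annotate_terms(text, terms, "BrainRegion")

for span in spans:
    print(span, repr(text[span.start:span.end]))
```

The list yields four separate annotations rather than one span covering the whole enumeration.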

Brain Region
Definition
Any phrase describing an area of the brain, including, for example, a cortical layer. This includes areas referred to by their function (e.g., the somatosensory area) but excludes mentions of a system (e.g., the somatosensory system) or of a "representation" (e.g., the shoulder representation).

Examples
Low-threshold Ca2+ spikes (LTS) are an indispensable signaling mechanism for neurons in areas including the cortex, cerebellum, basal ganglia, and thalamus.
The inhibitory sources in the thalamic nuclei are local interneurons and neurons of the thalamic reticular nucleus.

Ion Current
Definition
An ion current is the influx and/or efflux of ions through an ion channel. (Source: Wikipedia) We do not annotate excitatory or inhibitory postsynaptic currents as ionic currents (e.g., "NMDAR current").

Examples
A hyperpolarization-activated cation conductance contributes to the membrane properties of a variety of cell types.
The steady-state conductances of depolarizing Ih (hyperpolarization-activated cationic current), IT (low-threshold calcium current), and INaP (persistent sodium current) move the membrane potential away from the reversal potential of the leak conductances.

Ion Channel
Definition
An ion channel is a transmembrane molecule that allows, under certain conditions, the exchange of some ions between the intracellular and the extracellular environment. Ion channels can be referred to either by the name of their molecule or by the name of the gene coding for this molecule. References to only a sub-domain of an ion channel should not be annotated (i.e., only references to the channel as a whole are annotated), unless they share a clear relationship with a type of ion channel. For example, alpha1G (a type of alpha subunit) is used as a synonym for Cav3.1 ion channels ( http://channelpedia.epfl.ch/ionchannels/85 ). In general, alpha subunits can give their name to the channel type, whereas beta subunits cannot (at least for sodium and calcium). In the case of genetically modified animals referred to as gene_name-/-, where gene_name is associated with a knocked-out ion channel, we annotate gene_name as (a reference to) an ion channel (see examples).

Examples
Unexpectedly, however, we found that both WT and KO mice for CaV3.1, the gene for T-type Ca2+ channels in TC neurons [...].
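The gene_name-/- rule above can be approximated with a small pattern that selects only the gene name and leaves the "-/-" suffix outside the annotation. This is a hedged sketch rather than the rule-based system used in the paper: the regular expression, the function name and the test sentence are all hypothetical, and the pattern only handles ASCII hyphens.

```python
import re

# A gene symbol (letters, digits, dots) immediately followed by "-/-".
KNOCKOUT = re.compile(r"(?P<gene>[A-Za-z][A-Za-z0-9.]*)\s?-/-")

def knockout_gene_spans(text):
    """Return (start, end, gene) for each gene named in a -/- genotype."""
    return [(m.start("gene"), m.end("gene"), m.group("gene"))
            for m in KNOCKOUT.finditer(text)]

# Hypothetical sentence, written only to exercise the pattern.
sentence = "Spindles persisted in CaV3.1-/- mice."
gene_spans = knockout_gene_spans(sentence)
print(gene_spans)
```

Whether a matched gene actually codes for an ion channel still has to be checked by the curator or against a resource such as Channelpedia; the pattern only locates candidates.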

Ion Conductance
Definition
The conductance associated with the electrical model of an ion channel or of an ionic current. We also use this entity to annotate ionic resistances, since these two concepts are directly related (resistance = 1/conductance). When annotating conductances, we do not annotate gmin or gmax, because these refer not to the conductance itself but to its minimal or maximal value.

Neuron Type
Definition
An electrically excitable cell that processes and transmits information through electrical and chemical signals. (Source: Wikipedia) Neurons are polarized cells with defined regions consisting of the cell body, an axon, and dendrites, although some types of neurons lack axons or dendrites. (Source: NeuroLex) When the neuron is named as "neuron type X of region Y", the part "of region Y" should be included in the neuron name.

Examples
Our data demonstrate that key somatodendritic electrical conduction properties are highly conserved between glutamatergic thalamocortical neurons and GABAergic thalamic reticular nucleus neurons and that these properties are critical for LTS generation.
The inhibitory sources in the thalamic nuclei are local interneurons and neurons of the thalamic reticular nucleus.
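Because "of region Y" is kept inside the neuron name, a neuron annotation can contain a brain-region annotation. The sketch below shows one way to represent and check such nested spans; the label strings and helper names are hypothetical, and the choice of which phrases are annotated in this sentence is our reading of the guidelines, not a copy of the corpus annotations.

```python
def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def is_nested(inner, outer):
    """True when the inner span lies entirely within the outer span."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

text = ("The inhibitory sources in the thalamic nuclei are local "
        "interneurons and neurons of the thalamic reticular nucleus.")

neuron = find_span(text, "neurons of the thalamic reticular nucleus",
                   "NeuronType")
region = find_span(text, "thalamic reticular nucleus", "BrainRegion")

print(is_nested(region, neuron))
```

A tagging scheme that allows only one label per token cannot hold both spans at once, so nested cases like this one need either separate models per entity type or a format that supports overlapping annotations.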

Model Organism / Species
Definition
A term referring to the name of a species. Species are typically used in experiments and may be referred to by an informal name as well as by the formal Latin name. We use the term "species" in its more generic sense and, accordingly, it should also be used to annotate more specific entities such as strains. Similarly, it should be used to annotate classes of species (e.g., rodent). Note that some annotations concern the species in a somewhat implicit manner. This is the case when a strain is discussed together with its wild-type counterpart: wild-type is not in itself a species, but it refers to a species entity (i.e., the wild-type counterpart of a previously mentioned experimental strain). For the same reason, "normal rat" is annotated (including the qualifier "normal") when it is used in contrast to another strain (e.g., GAERS) and implicitly serves as a synonym for "wild type".

Examples
We found that in both Wistar rats and GAERS, the proportion of interneurons was significantly higher in the LGN than in the VPM and VPL.
Unexpectedly, however, we found that both WT and KO mice for CaV3.1, the gene for T-type Ca2+ channels in TC neurons, exhibit typical waxing-and-waning sleep spindle waves at a similar occurrence and with similar amplitudes and episode durations during non-rapid eye movement sleep.

Scientific Values
Definition
A quantifiable number (including the unit, if one is present) denoting a value. This may occur as a range (e.g., -100 to -40 mV). It may also occur as a list of measurements that all relate to the same entity (i.e., repeated measurements). However, two values separated by "and" that refer to two different entities (see the example with "0.13 and 0.20 ms respectively" below, referring to measurements in two different cell types) should be annotated as separate values. Sample sizes should not be annotated. In text where the ~ symbol is used as a synonym for "approximately" (e.g., ~20 mm), the ~ symbol should be included in the annotation.

Examples
In the VPM, the proportion of interneurons was 4.2% in Wistar and 14.9% in GAERS; in the VPL the values were 3.7% for Wistar and 11.1% for the GAERS.
Whereas the time course of Na+ channel activation (-30 to +40 mV) was similar, the deactivation kinetics (-100 to -40 mV) were faster in BCs than in PCs.
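Several of the rules in this definition (ranges kept whole, the ~ prefix included, values joined by "and" kept separate) can be expressed as a regular expression. The following is a rough sketch in the spirit of a rule-based baseline, not the system evaluated in the paper: the pattern, the short unit list and the test sentence are assumptions made for illustration.

```python
import re

NUMBER = r"[-+]?\d+(?:\.\d+)?"
# Deliberately tiny, hypothetical unit list; a real system needs many more.
UNIT = r"(?:%|(?:mV|ms|mm)\b)"

VALUE = re.compile(
    rf"(?<![\w.])"                # do not start inside a token such as Ca2+
    rf"~?{NUMBER}"                # optional "approximately" sign, then a number
    rf"(?:\s+to\s+{NUMBER})?"     # optional range: "<number> to <number>"
    rf"(?:\s?{UNIT})?"            # optional unit, kept inside the value
)

def find_values(text):
    """Return the text of every candidate value, in reading order."""
    return VALUE.findall(text)

# Hypothetical sentence assembled from phrases used in the guidelines.
sample = ("Activation (-30 to +40 mV) was similar; latencies were 0.13 and "
          "0.20 ms respectively, with a spread of ~20 mm and 4.2% failures.")
values = find_values(sample)
print(values)
```

The pattern cannot tell a sample size from a measurement, so the rule that sample sizes are not annotated would still need context that a regular expression does not have.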

Units
Definition
Units that describe a scientific quantity, possibly as defined by the SI. These will often occur as abbreviations. Of note, a unit can in some cases be a noun that is not typically an entity, as in "20 neurons".

Examples
In the VPM, the proportion of interneurons was 4.2 % in Wistar and 14.9 % in GAERS; in the VPL the values were 3.7 % for Wistar and 11.1 % for the GAERS.