Determination of composition of mixed biological samples using laser-induced fluorescence and combined classification/regression models

Laser-induced fluorescence (LIF) makes it possible to distinguish organic materials by fast, standoff in situ analysis. When substances are detected directly in the environment, e.g., in an aerosol cloud or on surfaces, fluorescence signals from other fluorophores in the surroundings are expected to mix with the desired signal. A key question for enhanced identification capabilities is therefore how to handle mixed samples, in which different fluorophores contribute different signals. We approached this problem with a simplified experimental design for an evaluation of classification algorithms. For this work, mixtures of up to four common fluorophores (NADH, FAD, tryptophan and tyrosine) were measured with a dual-wavelength setup and spectrally analyzed. Classification and regression are conducted with neural networks and show an excellent performance in predicting the ratios of the selected ingredients.


Introduction
Environmental monitoring of biological agents is a key issue in the safety and security fields. It aims at preventing the natural, accidental and intentional spread of hazardous agents, which can have dangerous or even lethal consequences for the local community. Nowadays, environmental biomonitoring is frequently performed by sampling and subsequent laboratory analyses, which make the monitoring discontinuous, sporadic and usually ineffective (especially because it is time-consuming). Living organisms ubiquitously contain numerous fluorophores such as coenzymes and amino acids that tend to fluoresce when excited by electromagnetic radiation of a certain wavelength [1,2]. Hence, agents of different fluorophore compositions emit different fluorescence spectra. This property makes laser-induced fluorescence (LIF) spectroscopy a valid technique for the detection and classification of biological agents like pathogenic bacteria, and allows the development of a dedicated instrument for continuous biomonitoring by standoff detection. Since not all fluorescent matter is dangerous, a proper classification of the agents must be achieved by a spectroscopic approach, in order to distinguish between dangerous and safe cases.
E-mail: marian.kraus@dlr.de (corresponding author)
When detecting biological threats without prior sample preparation, fluorescence from other organic material occurring in the environment is expected. Although the classification of pure samples of bacteria [3] and viruses [4] performs very well, algorithms able to classify the presence of different classes in a mixed sample have not been investigated in depth. The capability to discriminate agents in mixtures or in an environment with fluorescent background is fundamental to developing an instrument which works under realistic conditions, avoids false alarms and guarantees a high sensitivity [5]. Because biological agents frequently have similar spectra, their discrimination may be challenging, even in pure samples [1-3,6,7]. In mixed samples, the measured spectrum is a combination of the emission of all fluorescent matter and possible interactions (e.g., quenching), making the discrimination of single classes even more challenging. In the last decades, machine learning approaches have become powerful tools for such problems and are the state-of-the-art technology for analyzing complex systems with large quantities of data. For LIF spectroscopy, machine learning algorithms, such as neural networks and decision trees, have been successfully applied to the classification of pure biological agents. High classification performances (more than 90%) have been obtained in most applications, remote sensing included. In contrast, no results or detailed investigations about concentration measurements and classification of mixed samples are available in the literature [4,8-10].
The present work continues a previous article [11], in which samples of different bacteria could be distinguished by their LIF signals. In this context, two of the most frequently asked questions concern the ambiguity in handling either mixed samples or substances on unknown surfaces. A patent application has already been filed for a solution to the latter [12]. Here, addressing the first question, the classification and prediction of concentrations of different agents in mixed samples is investigated with LIF data. The algorithm for mixture classification is based on a neural network (NN) approach [13-19]. The algorithm is tested with four bio-fluorophores in different ratios and concentrations: nicotinamide adenine dinucleotide (NADH), flavin adenine dinucleotide (FAD), tyrosine (TYR) and tryptophan (TRP). These defined substances were chosen for this first investigation to avoid the metabolic variations of living biological samples. To the authors' knowledge, the results validate for the first time the applicability of LIF to classify and predict the relative concentration of each agent contained in a mixed sample.

Materials and methods
Fluorescence spectra were recorded with the dual-wavelength setup described by Gebert et al. [11], which provides alternating pulses of a Nd:YAG laser at 266 nm and 355 nm with a repetition rate of 100 Hz and pulse lengths of 0.7 ns. Pulse energies are adjusted for each sample up to 250 µJ depending on the fluorescence intensity. A Newtonian telescope with a diameter of 400 mm is used to collect the fluorescence emission from a standoff distance of 22 m. The signal is guided to a spectrometer with 32 channels covering the optical range from 250 nm to 680 nm. Each record contains 100 spectral responses for each of the two excitation wavelengths. All measurements (see Fig. 1) were taken starting with stock solutions of TYR,
From the acquired data, channels with misleading information (e.g., spectral regions which are below the excitation wavelength or blocked by optical filters) are discarded. For both excitation wavelengths, spectra are scaled from 0 to 1 to make them comparable and as a necessary step for later data processing. The overall data set consists of 88100 spectra, each containing 43 fluorescence features.
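The preprocessing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy six-channel spectra and the kept-channel indices are hypothetical, and scaling each spectrum individually after discarding channels is only one possible reading of "scaled from 0 to 1".

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Scale one spectrum linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # flat spectrum: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_feature_vector(
    spec_266: Sequence[float],
    spec_355: Sequence[float],
    keep_266: Sequence[int],
    keep_355: Sequence[int],
) -> List[float]:
    """Drop unusable channels, scale each excitation separately, then concatenate."""
    kept_266 = [spec_266[i] for i in keep_266]
    kept_355 = [spec_355[i] for i in keep_355]
    return min_max_scale(kept_266) + min_max_scale(kept_355)


# Toy example with 6 channels per excitation instead of 32 (all numbers invented).
raw_266 = [0.0, 2.0, 6.0, 10.0, 4.0, 2.0]
raw_355 = [0.0, 0.0, 1.0, 5.0, 3.0, 1.0]

# Hypothetical channel selection: the first channels are assumed to lie below
# the excitation wavelength or behind a filter and are therefore dropped.
features = build_feature_vector(
    raw_266, raw_355, keep_266=[1, 2, 3, 4, 5], keep_355=[2, 3, 4, 5]
)
print(len(features), features)
```

If the 43 features are the channels remaining from the 2 × 32 recorded ones, 21 channels were discarded in total; which channels these are is not stated here, so the indices in the sketch are placeholders.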
The algorithm developed in this work consists of an ensemble of feed-forward neural networks. Because the algorithm is supervised, the spectra have been divided into a "primary" training set and a "primary" test set. The primary training set is the effective training set of the ensemble. For each neural network, it is randomly divided into a "secondary" training set, a validation set and a test set (Fig. 2). Each neural network solves a regression problem and is trained to predict the relative concentration of each agent. Then, the mean and the standard deviation of each relative concentration are calculated from the outputs of all neural networks. The mean concentration is the concentration predicted by the algorithm, and the standard deviation is its uncertainty. The presence of an agent (classification problem) is then decided by a hypothesis test (t-test), where the null hypothesis is that the agent is absent. Thus, the t-score is calculated as the mean concentration divided by its standard deviation. If the t-score is higher than a specific threshold (which is optimized according to the acceptable type-1 error), the algorithm rejects the null hypothesis and classifies the agent as "present." Finally, the ensemble is applied to the primary test set, which, as required, has never been used for training the model.
Figure 3 shows the combined results after classification and regression of the test data set. The scatter plots in the top panels of each subfigure show, for each substance, the relative concentration as a function of the measurements sorted by ascending concentration. The bar plots (bottom) show the false positives (red) and the false negatives (blue).
Fig. 3 The generated models are applied to the primary test set. The grids show the combined results after classification and regression. The measurement indices are sorted according to the true ratios. Blue lines represent the actual values, and red dots the predicted values.
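The ensemble decision rule described above (mean as prediction, standard deviation as uncertainty, t-score compared against a threshold) can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' code: the function name `aggregate`, the five invented network outputs and the threshold value of 3.0 are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence

AGENTS = ("NADH", "FAD", "TYR", "TRP")


def aggregate(predictions: Sequence[Sequence[float]], threshold: float) -> Dict[str, dict]:
    """Combine per-network regression outputs into a concentration and a presence flag.

    predictions: one entry per network, each holding one relative
                 concentration per agent (same order as AGENTS).
    threshold:   t-score above which the null hypothesis "agent absent" is rejected.
    """
    results = {}
    for k, agent in enumerate(AGENTS):
        values: List[float] = [p[k] for p in predictions]
        m = mean(values)   # concentration predicted by the ensemble
        s = stdev(values)  # its uncertainty (sample standard deviation, assumed)
        if s == 0.0:
            # All networks agree exactly: treat any positive mean as certain.
            t = float("inf") if m > 0 else 0.0
        else:
            t = m / s      # t-score as defined in the text
        results[agent] = {"mean": m, "std": s, "t": t, "present": t > threshold}
    return results


# Invented outputs of five networks for one spectrum (NADH, FAD, TYR, TRP).
toy_predictions = [
    [0.50, 0.30, 0.20, 0.01],
    [0.52, 0.29, 0.19, -0.02],
    [0.48, 0.31, 0.21, 0.00],
    [0.51, 0.30, 0.19, 0.02],
    [0.49, 0.30, 0.21, -0.01],
]

results = aggregate(toy_predictions, threshold=3.0)  # threshold value is hypothetical
for agent, r in results.items():
    print(f"{agent}: mean={r['mean']:.3f} std={r['std']:.3f} present={r['present']}")
```

In this toy case the first three agents have means far above their spread and are classified as present, while the fourth scatters around zero and is not. Raising the threshold lowers the type-1 error at the cost of missing agents at low concentration, which is the trade-off the threshold optimization has to settle.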
For each substance, the coefficient of determination R² is found to be larger than 0.99. The goodness of the regression fit and the classification performance in terms of sensitivity and specificity are given in Table 1. Averaged over the substances, R² is higher than 99.4 %, and the sensitivity and specificity are 99.3 % and 99.65 %, respectively.
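For reference, the three figures of merit quoted above follow their standard definitions: R² = 1 − SS_res/SS_tot, sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP). The short sketch below computes them on invented toy data; the function names and all numbers are hypothetical and are not taken from Table 1.

```python
from typing import Sequence, Tuple


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def sensitivity_specificity(
    truth: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary present/absent labels."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    return tp / (tp + fn), tn / (tn + fp)


# Invented regression example: true vs. predicted relative concentrations.
r2 = r_squared([0.0, 0.5, 1.0], [0.0, 0.5, 0.5])

# Invented classification example: one false negative and one false positive.
truth = [True, True, True, True, False, False, False, False]
predicted = [True, True, True, False, False, False, False, True]
sens, spec = sensitivity_specificity(truth, predicted)

print(r2, sens, spec)
```

In terms of the bar plots in Fig. 3, false negatives reduce the sensitivity and false positives reduce the specificity.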

Summary
In this evaluation, substances with clearly different fluorescence signatures have been selected (Fig. 1). It is to be expected that discriminating samples with more similar fluorescence features will be more challenging. However, the approach in this work, combining classification and regression, demonstrates the ability to predict the presence of an agent in a mixture of substances and to predict the ratios of mixed fluorophores by LIF. These findings are of high importance for the development of an instrument for biological detection. First, such an instrument may distinguish between hazardous and non-hazardous agents. Moreover, the disturbing influence of background materials may be strongly decreased. Since the classification performance for an agent is expected to drop strongly if its fluorescence intensity is on the order of the background signal, a sensitivity analysis regarding the maximum of the fluorescence signal ratios is recommended. Detailed analyses and discussions about performance, potential and limitations are still in progress and will be fully presented in a future work.
Funding Open Access funding enabled and organized by Projekt DEAL.
Data Availability Not applicable.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.