Main

Optical information processing leverages the unique properties of light, such as its parallelism, which allows the simultaneous processing of multiple data streams, as well as low energy consumption1,2,3,4,5. Moreover, light possesses a vast frequency spectrum, enabling ultrahigh bandwidth and data throughput4,5,6,7. By exploiting these characteristics, optical information processors have the potential to unlock new levels of performance, scalability and energy efficiency, which could transform the landscape of information processing in the optical domain4,5. Optical processing has also enabled new applications when coupled with existing optical instruments, such as imaging systems8, to enhance their performance.

However, the full potential of optical processors can only be realized by overcoming certain challenges, one key requirement being optical nonlinear mapping6,9. Nonlinear mapping is essential for approximating arbitrary functions and has been a powerful element in neural networks, as it allows models to recognize complex patterns and approximate any given function10. It plays a vital role in representation and feature learning, as it facilitates the discovery of higher-level, more informative and discriminative features for a task11,12. The application of nonlinear mappings allows the extraction of abstract and nonlinear features, thereby enhancing the input data representation13,14. In existing optical computing platforms, optical nonlinear mapping has been primarily achieved using nonlinear optical materials, which provide a nonlinear relationship between the input and output fields15,16,17,18,19,20,21,22. However, optical nonlinearity often requires intense pumping and high peak power, which can be energy demanding; it also necessitates the design and engineering of nonlinear or active materials, and is generally restricted to lower-order nonlinear mapping with limited tunability6,9. Alternatively, the conversion of signals from optical to electrical and back to optical is used for the nonlinear processing of optical data, but with limited speed.

Here we propose to exploit the passive nonlinear optical mapping inside a multiple-scattering cavity23, akin to the steady state of a reservoir computer, for rapid optical information processing. High-order nonlinearity fosters the generation of a low-dimensional latent feature space and facilitates strong data compression. Previously, propagation through a multiple-scattering material has been exploited to perform linear optical random projections24, followed by intensity detection. This can be regarded as a single-random-layer neural network, and has been used for multiple machine learning tasks25,26,27,28, but its performance remains limited by its intrinsically linear mapping behaviour. By introducing multiple scatterings in a cavity design, we enable multiple bounces on the same input pattern, effectively creating an optical nonlinear transformation of the input data, without the need for nonlinear optical materials or the optical–electrical–optical conversion typically used for nonlinearity in optical information processing. We demonstrate high computing performance across tasks ranging from classification to image reconstruction, keypoint detection and object detection, with the optically compressed output fed into a digital decoder. In particular, we show that our system exhibits high performance even at a mode compression ratio (defined as the ratio of the number of input macropixels on a digital micromirror device (DMD) to the number of output speckle grains on the camera) of ~3,000:1 for high-level computing tasks, as evidenced in real-time pedestrian detection with bounding box generation. Our work illuminates the role of varying nonlinear orders in optical data compression through a mutual information analysis, and paves the way for tunable optical nonlinear mapping and energy-efficient computing.

Results

Nonlinear random mapping with tunable nonlinearity

Introducing nonlinearity has long been a challenge and simultaneously a necessity in optical computing platforms. Nonlinearity is a key element for enabling complex operations and boosting computational power4,5. It is particularly important for approximating arbitrary functions—a task critical in machine learning. In this study, we present a novel approach to address this challenge by utilizing nonlinear mapping provided by multiple linear scatterings of light within an optical cavity23. We constructed the multiple-scattering cavity using an integrating sphere (Fig. 1a), which features a rough inner surface that scatters light. A continuous-wave laser operating at low power is injected into the cavity via the first port, resulting in an output speckle pattern from the second port. The third port integrates a DMD to display the input patterns. In general, a Born series can be used to describe the scattering process in the cavity:

$$E_{\mathrm{out}}=\mathbf{T}E_{\mathrm{in}}=\left[\mathbf{V}+\mathbf{V}(\mathbf{G}_{0}\mathbf{V})+\mathbf{V}{(\mathbf{G}_{0}\mathbf{V})}^{2}+\ldots \right]E_{\mathrm{in}}.$$
(1)
Fig. 1: Concept of using a multiple-scattering cavity as a passive, tunable nonlinear optical information processor.

a, Experimental setup in which the key component for creating the passive nonlinear random mapping is a DMD mounted on an integrating sphere. The output of the cavity produces a fully developed speckle pattern, with its response being nonlinear in the geometric configuration of the DMD. b, Representative figure showing that the cavity essentially encodes the input pattern on the DMD by optically mixing different areas of the input through multiple bounces to create a highly nonlinear feature—a speckle recorded by a camera (input pattern is adapted from the MNIST dataset62). c, Mathematical representation of a nonlinear mapping process that transforms a set of input elements on the DMD into a collection of nonlinear features in the output speckle pattern. Multiple scatterings in the cavity generate mixed terms of the input values at different pixels with various high nonlinear orders, which provide rich nonlinear features that can be exploited in training to enhance performance in complex computational tasks. f(x) denotes the operation of scaling the configuration of a DMD macropixel xi,j.

Here the matrix T represents a linear mapping from the input optical field Ein injected into the cavity to the output field Eout. V is the matrix that denotes the scattering potential inside the cavity, and G0 is the Green's matrix representing light propagation within the cavity in between bounces off the boundary. The notation (G0V)n represents the matrix G0V multiplied by itself n times. The final intensity image formed on the camera is given by Icam = |Eout|2, where |·|2 represents the element-wise squared modulus. The expansion of T begins with a term indicative of single scattering, and the subsequent terms indicate multiple scatterings in the cavity. In the cases where single scattering is the dominant event, the mapping from V to Eout is predominantly linear. In our case with multiple scatterings, the relation between the scattering potential configuration V and the output field Eout becomes nonlinear. The Born series can also be reformulated as \(\mathbf{T}=\mathbf{V}\sum_{m=1}^{\infty }{(\mathbf{G}_{0}\mathbf{V})}^{m-1}=\mathbf{V}\sum_{m=1}^{\infty }\mathbf{U}{\boldsymbol{\Lambda }}^{m-1}{\mathbf{U}}^{-1}\), where G0V = UΛU−1, Λ is a diagonal matrix whose elements are the eigenvalues of G0V and the corresponding eigenvectors are the columns of U. For high orders m, the largest eigenvalue λmax dominates over all the other eigenvalues, and the polynomial orders in T can be approximated as \(\sum_{m=1}^{\infty }{\lambda }_{\max }^{m-1}\). Thus, the nonlinear coefficient decays exponentially with the order m. Experimentally, owing to the chaotic ray dynamics in our cavity, it is difficult to extract the largest eigenvalues for different active areas on the DMD. Furthermore, since part of the surface area of the cavity can be modified by the DMD through the input (modulation) patterns, the DMD also provides a reconfigurable scattering potential inside the cavity. Multiple bounces of light off the modulated area of the DMD result in a nonlinear mapping from the input pattern displayed on the DMD to the output speckle pattern. As the number of bounces on the DMD increases, the order of this nonlinear mapping increases (Fig. 1b,c). It is this nonlinear relationship that forms the foundation of the passive nonlinear encoding technique that we explore in this work.
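As an illustration of this mechanism, the following minimal numerical sketch truncates the Born series of equation (1) at different orders and tests superposition; the matrices G0 and V(x), the system size and the scaling are arbitrary assumptions rather than a model of our experimental cavity.

```python
# Minimal sketch of equation (1), not the experimental system: a small
# random Green's matrix G0 and a diagonal scattering potential V(x)
# that depends linearly on a binary DMD pattern x (all assumptions).
import numpy as np

rng = np.random.default_rng(0)
N = 64                                   # number of spatial modes (assumption)
G0 = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
G0 /= 1.2 * np.abs(np.linalg.eigvals(G0)).max()   # keep the truncated series well behaved
E_in = rng.normal(size=N) + 1j * rng.normal(size=N)

def E_out(x, orders):
    """Output field from the Born series truncated at `orders` terms."""
    V = np.diag(0.5 + 0.5 * x)           # pattern-dependent scattering potential
    T = np.zeros((N, N), dtype=complex)
    GV_power = np.eye(N)                 # (G0 V)^0
    for _ in range(orders):
        T += V @ GV_power                # adds V, V(G0 V), V(G0 V)^2, ...
        GV_power = G0 @ V @ GV_power
    return T @ E_in

xa = rng.integers(0, 2, N).astype(float)
xb = rng.integers(0, 2, N).astype(float)
zero = np.zeros(N)
for orders in (1, 2, 8):
    # Superposition test: a map affine in x satisfies
    # E_out(xa + xb) = E_out(xa) + E_out(xb) - E_out(0).
    lhs = E_out(xa + xb, orders)
    rhs = E_out(xa, orders) + E_out(xb, orders) - E_out(zero, orders)
    err = np.linalg.norm(lhs - rhs) / np.linalg.norm(lhs)
    print(f"{orders} term(s): relative superposition error = {err:.3f}")
```

Retaining a single term yields a superposition error at machine precision (an affine map in the pattern), whereas every additional scattering term breaks superposition, mirroring the growth of the nonlinear order with the number of bounces.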

Light scattering within the cavity can be adjusted solely by altering the pattern displayed on the DMD as an input pattern (Fig. 1a). Each micromirror on the DMD can be toggled between two angles. This action effectively modifies the scattering potential V for light, determining the mapping from the input pattern on the DMD to the output optical field Eout. A larger modulation area boosts the probability of light being scattered by the modulated part of the scattering potential, thereby enhancing the nonlinear mapping. The more times light is scattered by the DMD pattern, the more chances it has to sample the input pattern (Fig. 1b). Each scattering event effectively mixes the information from different parts of the DMD, resulting in a complex optical encoding of the entire pattern. The longer the light remains in the cavity, the more encoding and mixing occur, effectively ensuring that light in each output mode (speckle grain) carries information about a multitude of input data (Fig. 1b). The interaction due to multiple scatterings results in a nonlinear mapping in which the intensity of each output mode (speckle grain) becomes a highly nonlinear function of the input pattern (Fig. 1c). The number of bounces determines the order of nonlinearity. To further enhance the nonlinear order, the number of bounces is increased by covering the output port with a partial reflector, which increases the dwell time of light inside the cavity. Such nonlinear mapping induced by multiple scatterings is purely passive (no need for high power) and is fundamentally distinct from traditional nonlinear optics, which relies on an intrinsic material response.

This scheme offers an efficient means of achieving tunable high-order nonlinear random mapping passively, at a constant low power with a continuous-wave laser (Methods), in contrast to conventional optical nonlinearities that rely on the material response at high optical intensity29,30. In our case, the nonlinear order is independent of the input power and can be rapidly tuned (~20 kHz) by altering the DMD-modulated area. This tuning rate exceeds that of many known nonlinear effects, such as the thermo-optical nonlinearity31,32. Additionally, our scheme avoids the dynamic chaos and instabilities commonly associated with conventional nonlinear optical systems and lasers29,30,33.

To comprehend and characterize the tunable nonlinear mapping introduced by our system, we explore how deep neural networks can function as a proxy for the nonlinear random mapping in our system. As detailed in Supplementary Section 3, we find that the higher-order nonlinear mapping, provided by a larger area of modulation on the DMD, can be approximated by a deeper neural network (the 'Further explanation of Born series' section explains the reformulation of the Born series in terms of its proxy as a deep neural network with fixed random weights).

Enhanced image classification

To evaluate whether this nonlinear mapping can indeed provide any computational benefits, we begin by testing on a simple but widely recognized machine learning benchmark task, namely, the Fashion MNIST dataset34. Fashion MNIST is a popular fashion image classification challenge that includes 60,000 training samples and 10,000 test samples, each image measuring 28 × 28 pixels.

We input the Fashion MNIST data onto the DMD and directly read the output speckle pattern to obtain both higher- and lower-dimensional representations. These representations, which we refer to as nonlinear features of the input information, can be utilized to execute computing tasks. To achieve nonlinear random mapping with tunable nonlinearity for a dataset with a fixed input size, we either adjust the modulated area of the DMD (Fig. 2a) or partially close the output port to change the number of times light is scattered by the DMD. During the training phase, we train only the linear digital layer, using the nonlinear features generated from the training dataset at each given configuration. In the inference/test stage, we forward the output images from the cavity to the trained linear digital layer to generate predictions (Methods).

Fig. 2: Classification with nonlinear mapping.

a, Training data from the Fashion MNIST dataset are used to train a one-layer neural network as a digital decoder for classification tasks. Additionally, the percentage of the modulated area on the DMD is varied among 6.25%, 25% and 100% to adjust the order of the nonlinear mapping. With full (100%) modulation of the DMD, the nonlinear order is further enhanced by covering the output port with a partial reflector (silicon wafer). b, Fashion MNIST classification results with a linear classifier, presented for different numbers of output modes (speckle grains) and varying nonlinear strengths. The optical linear features with quadratic detection are simulated by scattering from a single layer with intensity detection to create a quadratic nonlinear response. Note that a linear regression on binarized Fashion MNIST data cannot exceed 77.6% with the same number of modes. c,d, Violin plots representing the distributions of the mutual information between the speckle grains and classification targets under varying numbers of output modes (c) and differing orders of nonlinear mapping, obtained by changing the modulated area on the DMD or partially closing the cavity (enhanced) (d). For n speckle modes (n on the x axis), 4n replicated measurements of the same input were performed in c and d. The dashed lines depict the median values of the mutual information. Each violin's width reflects the distribution of the mutual information values of the speckle grains and its probability density. Within each violin, the slim black vertical line represents the range from the minimum to the maximum value; the black box represents the first to third quartiles; the white dot represents the median. c, Mutual information analysis when the number of output modes (speckle grains) varies under the highest-order nonlinear mapping. d, Mutual information analysis with low-dimensional speckle features (four output modes) for Fashion MNIST as a function of the nonlinear order, varied through the modulated area on the DMD, showing the advantage of going to higher-order nonlinear mapping.

In Fig. 2b, we present the classification performance in the Fashion MNIST dataset using a linear classifier. To quantitatively compare the performance of different nonlinear strengths in the optical encoder, we fixed the linear decoder and used test accuracy as a metric for comparison. We observe that stronger nonlinearity leads to improved classification performance, particularly when the number of optical modes/speckles is smaller. This indicates that each speckle from higher-order nonlinear mapping embeds more information. These findings further suggest that our device may possess a unique advantage in optical data compression.

To more comprehensively quantify the information within each spatial mode (speckle grain) in our output images, we employ the concept of mutual information. Compared with regression, mutual information captures both linear and nonlinear dependencies and does not make assumptions about the underlying data distribution35. It is widely used in compressive sensing36, a technique focused on efficiently acquiring and reconstructing sparse or compressible signals, and has found important applications in machine learning for tasks such as feature selection37, model interpretation38 and understanding variable dependencies39. In our case, we calculate the mutual information between the output features and the target classes of the dataset (Methods and Supplementary Section 4). This quantifies how well the nonlinear optical features contain the abstract information that is useful for high-level computing tasks (Supplementary Section 4). In Fig. 2c,d, the violin plots illustrate the distribution of the mutual information between the speckle grains and the classification targets. A notable observation is the onset of saturation of the mutual information required for Fashion MNIST classification at 4–25 modes/speckles (Fig. 2c). This saturation occurs under the highest-order nonlinear mapping in our experiments. Further, Fig. 2d underscores the benefit of escalating to higher-order nonlinearity: given the same number of output modes/speckles, higher-order nonlinear mapping indeed yields higher mutual information between the features and the targeted classes. This observation implies that our system captures the underlying relationships between the features and target classes more effectively when higher-order nonlinear mapping is introduced.

Demonstration with complex tasks

Image reconstruction

Building on the enhanced information provided by the nonlinear features from our system, we pose the question: can this enhanced information (within a few output modes) yield superior image reconstruction? To address this, we conduct a comparative analysis of the nonlinear features generated in two distinct scenarios: one featuring a higher-order nonlinear optical random mapping induced in the multiple-scattering cavity (Fig. 3c), and another that presents a linear optical random projection (Fig. 3a)24 with nonlinearity only at the detection stage (intensity measurement).

Fig. 3: Computing performance enhanced by nonlinear optical data compression.

a, Concept of image reconstruction using a linear optical complex medium for linear encoding, with camera detection providing a quadratic response. b, Reconstruction using the speckle features from a. The orange boxes indicate wrongly reconstructed pairs. c, Multiple-scattering cavity as a nonlinear optical encoder, along with camera detection, employing compressed speckle features for the digital reconstruction of the original image data. d, Reconstruction from the speckle features generated by the multiple-scattering cavity. In b and d, approximately 25 speckle grains (a compression ratio of 31:1) are used to train two digital decoders (Methods). Given the same number of compressed output modes (speckle grains), the nonlinear features generated from the cavity reduce the mean squared error by 0.6, resulting in a better reconstruction of the images in d compared with b. More results are provided in Supplementary Figs. 4–6. e, Concept of keypoint detection in human faces (images with 96 × 96 pixels) with compressed speckle features. f, Keypoint detection with a mode compression ratio of 576:1, using 16 output modes with relatively weaker nonlinearity (25% modulated area on the DMD) and a five-layer MLP decoder. g, Improved keypoint detection with a reduced mean error in pixels across 15 keypoints (1.06 pixels compared with 1.86 pixels in f), using 16 output modes (speckle grains) with relatively stronger nonlinearity (fully modulated area on the DMD) and a nine-layer MLP decoder.

To most efficiently extract the embedded information from a few speckles, we deviate from the traditional approach of employing a digital linear layer and, instead, introduce a customized and optimized multilayer perceptron (MLP) as a decoder for image reconstruction. The architecture of this decoder is finely tuned using neural architecture search to optimize the image reconstruction. Subsequently, we train two digital decoders, each with an optimized architecture, on the Fashion MNIST dataset under a high compression ratio of ~31:1 (only ~25 modes) (Fig. 3).

It is noteworthy that, despite the optimally trained decoder in each case, the quality of the reconstructed images varies (Fig. 3b,d and Supplementary Figs. 4 and 5). We observe that the augmented nonlinear random mapping indeed facilitates improved image reconstruction, with a mean squared error of ~1.4 on the test set (Fig. 3d and Supplementary Fig. 5) compared with ~2.0 for the linear optical features (Fig. 3b and Supplementary Fig. 4), each under a separately optimized decoder architecture. When the same decoder architecture is used for both, the nonlinear features still outperform, with a mean squared error of ~1.5 (Supplementary Fig. 6).

Our findings show that the nonlinear optical mapping in our system can efficiently compress and retain vital information as well as decrease data dimensionality. Motivated by these results, we are prompted to explore the potential of nonlinear features in executing other high-level computing tasks.

Keypoint detection

A key advantage emerging from our work is that optical data compression, facilitated by multiple scatterings in the cavity, generates mixtures of highly nonlinear features. These are particularly useful for applications that require high-speed analysis of, and response to, high-dimensional data. Our DMD contains 4 million pixels and can accommodate large images. However, in our image reconstruction demonstration, the input dimensions of the Fashion MNIST dataset are limited to 28 × 28 pixels, setting an inherent upper limit on the maximum compression ratio that can be demonstrated. A major strength of our system is its ability to easily scale up both the size of the input data and the effective depth of the neural network approximating the optical encoder without increasing the input power, thereby allowing for an energy-efficient representation of the input information. This adaptability and scalability facilitate tackling more complex tasks and processing larger datasets without losing crucial information.

Pushing the compression further and exploring other high-level computing tasks, we delve into two specific applications in which we scale up the input images. A notable example (Fig. 3e) demonstrates that we can extract 15 keypoints from human face images40 with an order-of-magnitude improvement in the mean squared error, which decreases from 0.208 (using a 25% modulated area on the DMD) to 0.014 (using a 100% modulated area on the DMD, enhanced with the partial reflector), owing to the incorporation of stronger nonlinearity with a larger modulation area, even when the number of output modes (speckle grains) is reduced to 16 (Fig. 3f,g). In both cases, the architectures of the decoders are separately optimized and trained for optimal performance. Even when the decoder architecture optimized for the features from the 25% modulated area (a five-layer MLP) is used to train on the features from the latter case, the mean squared error associated with these higher-order nonlinear features remains low (~0.3). This task—crucial for various applications such as facial recognition, emotion detection and other human–computer interaction systems—illustrates the robustness of our approach in dealing with high-level tasks while maintaining a high compression ratio. An additional advantage of our methodology lies in its implications for privacy protection and adversarial robustness, as our method can securely encode facial information in random speckle grains.

Real-time video analytics

The last application we demonstrate is real-time video analytics, using the benchmark dataset known as Caltech Pedestrian41, which comprises video recorded from a vehicle-mounted camera during driving (Fig. 4a). The video frames displayed on the DMD have dimensions of 240 × 320 pixels (Methods). Using our multiple-scattering cavity, we can compress the data to achieve a compression ratio of up to 3,072:1 (that is, using only 25 output modes) while maintaining high positional accuracy (Fig. 4b), with a mean error of 1.92 pixels in identifying pedestrian positions and an inference time of 0.0035 s per frame with an optimized digital backend (a ten-layer MLP), for a total response time (optical nonlinear feature generation plus inference) below 0.1 s (Fig. 4c and Supplementary Videos 1 and 2).

Fig. 4: Real-time video pedestrian detection in driving with high mode compression ratio using only 25 output modes.

a, Schematic of real-time pedestrian detection using video data from a dash camera during driving. The multiple-scattering cavity functions as an optical data compressor, and the compressed nonlinear optical features are utilized for pedestrian detection with a digital decoder. b, Demonstration of pedestrian detection at a rate close to real-time video. The magenta boxes represent the inference results from the speckle. The green boxes represent the ground truth. The optical processing, that is, the nonlinear feature generation, is as fast as light, and its readout speed is limited only by the camera. With only 25 modes, our camera can currently reach at least 800 Hz. The inference time with the 25 modes in pedestrian detection is 0.0035 s, leading to a total response time (inference + generation of optical features) of less than 0.1 s, which is faster than the typical human response time of ~0.2–2.0 s. The error unit is pixels (px). c, Demonstration of pedestrian detection at various locations during continuous video streaming; the mean detection error with only 25 modes remains within 1.92 pixels (px).

This application is particularly critical in the field of autonomous vehicles and advanced driver-assistance systems, where high-speed pedestrian detection is essential to ensure safety and allow fast reaction time. The high compression ratio of our system, combined with its rapid processing speed, shows great promise for such applications where fast and accurate detections are imperative.

To further estimate the gain we obtain from optical data compression, we calculated the number of parameters and operations in the digital domain with and without the optical encoder. In human face keypoint detection, our method with the optical encoder demonstrated a mean pixel error of 1.06, slightly surpassing the performance of a conventional convolutional neural network (CNN) architecture that is still widely used as a benchmark for vision tasks. With a CNN model (one convolutional layer + one pooling layer + one convolutional layer + three fully connected layers), we achieved a mean pixel error of approximately 1.23. The digital CNN model comprises over 74 million parameters and requires around 83 million operations, which are two orders of magnitude more than the digital operations/parameters used in our system (approximately 310,000 digital trainable parameters/operations). This comparison underlines the enhanced accuracy of our approach and points to a substantial reduction in the computational complexity and resource utilization inherent to our method. For pedestrian tracking, our method exhibited mean pixel errors ranging from 1.3 to 3.6, closely matching the performance of a conventional CNN model used for comparison, with which we obtained mean pixel errors between 1.37 and 3.33. The comparison model in this instance incorporates three convolutional layers and two fully connected layers, involving more than 39 million parameters and necessitating approximately 172 million operations. In contrast, our decoder, after the nonlinear optical projection, used only about 1 million parameters and a similar number of operations (1 million) to achieve comparable performance. In general, the higher the input dimension and the more we compress, the larger the number of digital operations we can offload into the optical domain, and therefore the better we can leverage the information-processing advantages of light.
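As an illustration of this parameter accounting, the following hedged sketch compares trainable-parameter counts in PyTorch; the layer sizes are our assumptions, chosen only to land in the reported ballpark, and are not the exact architectures used here.

```python
# Hedged sketch: compare trainable-parameter counts of an illustrative
# CNN baseline against a small MLP decoder operating on a handful of
# optically compressed modes. Layer sizes are assumptions.
import torch.nn as nn

def n_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# All-digital baseline: a CNN ingesting the full 96 x 96 face image.
cnn = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 48 * 48, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 30),             # 15 keypoints, (x, y) each
)

# Decoder after the optical encoder: an MLP on 16 speckle modes.
mlp = nn.Sequential(
    nn.Linear(16, 256), nn.BatchNorm1d(256), nn.ReLU(),
    nn.Linear(256, 256), nn.BatchNorm1d(256), nn.ReLU(),
    nn.Linear(256, 30),
)

print(f"CNN parameters: {n_params(cnn):,}")   # ~75 million
print(f"MLP parameters: {n_params(mlp):,}")   # ~79 thousand
```

The roughly three-orders-of-magnitude gap between the two counts in this sketch mirrors the reduction reported above: the optical encoder absorbs the feature extraction that the convolutional layers would otherwise perform digitally.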

Discussion

Exploiting optics for computing, which brings benefits such as high speed, large bandwidth and parallelization, has traditionally been impeded by the challenge of implementing optical nonlinearity. Conventional all-optical methods typically involve complex experimental conditions, using nonlinear materials, such as nonlinear crystals or polymers, pumped by high-power short-pulsed lasers, or semiconductor lasers operating in continuous or pulsed modes42,43. Although these have shown optical computing benefits in a variety of platforms, including multimode fibres18, integrated photonics20 and free-space optics, limitations regarding their robustness, energy efficiency and stability persist.

In this work, we avoid the limitations of conventional optical nonlinearity entirely by proposing a unique approach that achieves optical nonlinear random mapping through multiple scatterings within an optical cavity. This strategy enables us to implement nonlinear random mapping in which the adjustment of the nonlinearity depends entirely on the geometrical configuration and quality factor of the cavity, which influence the scattering potential. The intrinsic mixing of the input information at varying nonlinear orders permits us to generate highly nonlinear features, in contrast with traditional optical nonlinear mappings, which most nonlinear materials practically restrict to lower orders (2–3). From a machine learning perspective, by expanding to higher-order nonlinear mapping, we essentially generate an augmented optical feature space incorporating more mixtures of higher-level input information. This expansion increases the mutual information between a subspace of the feature space and both the entire input pattern (evidenced by the image reconstruction task) and the output targets (Fig. 2 and other high-level tasks), facilitating a higher compression ratio for complex tasks. In essence, our system demonstrates the capability to execute optical data compression by harnessing multiple scatterings of light in a reconfigurable cavity. This approach allows for the efficient preservation of critical information while drastically reducing data dimensionality.

We have demonstrated that our multiple-scattering cavity, equipped with passive and tunable nonlinear optical random mapping capabilities, can act as an optical nonlinear encoder with adjustable nonlinearity. Our system delivers enhanced computing performance in a low-dimensional latent feature space for a range of computing tasks, from image classification to higher-level tasks such as image reconstruction, keypoint detection and object detection, when trained with a lightweight digital backend. This approach might offer considerable benefits for high-speed analytics in both scientific and real-world applications. Our system permits easy scaling of both the input data and the effective depth of the neural networks approximating the optical encoder, providing an efficient optical representation of large-scale input patterns using a limited number of output modes. This versatility helps manage more intricate tasks and process larger datasets without substantial loss of vital information.

Our nonlinear mapping system functions as a reservoir computer in the steady state. This is also the case for other systems realized before18,44,45; comparatively, our design allows for easy scaling up and tuning of the nonlinearity without varying the input power. Furthermore, our system may serve as a trainable physical neural network46 if one part of the DMD is utilized for the input pattern and another is tuned or trained for direct readout without the need for digital processing. The performance of our computing tasks can be further improved by, for example, replacing the binary DMD with an analogue spatial light modulator for information encoding. The detection part of our system can be further improved by replacing the camera with a fast photodetector array, given the small number of output modes that need to be recorded for decoding.

Our current optical computing architecture goes beyond a one-to-one architectural mapping of a digital neural network. It may inspire next-generation optical computing to exploit nonlinear mappings beyond conventional schemes and promote the development of more energy-efficient neuromorphic computing platforms within47 and beyond48,49,50,51 optics, where nonlinearity can be effectively harnessed and utilized. Our findings could also spark new research directions in fields such as optical data compression for imaging52,53,54 and sensing12,55, optical communication56 and quantum computing57, where innovative nonlinear mechanisms can substantially enhance performance and efficiency, and offer potential opportunities for enhancing data privacy and adversarial robustness28,58,59.

During the final stage of this work, we became aware of two independent works on very different optical machine learning implementations that are based on the same principle of realizing nonlinear processing with linear optics60,61.

Methods

Further explanation of Born series

To better connect the Born series with a neural network, such as an MLP, it can also be rewritten as

$$E_{\mathrm{out}}={\mathbf{S}}^{n}E_{\mathrm{in}}={\mathbf{S}}_{n}({\mathbf{S}}_{n-1}(\ldots ({\mathbf{S}}_{1}(\mathbf{V}))\ldots ))E_{\mathrm{in}},$$
(2)

where Sn(⋅) represents a scattering operator, with S1(V) = V and Sn+1(V) = V + Sn(V)G0V for n ≥ 1. The iterative expression of the Born series involving the scattering operator S structurally resembles the iterative operation in an MLP, where data are transformed across multiple layers. However, unlike in an MLP, V and G0 are identical in all the layers.
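The equivalence between this layer-like recursion and the truncated Born series can be checked numerically; in the sketch below, the matrices V and G0 are small random placeholders, not the physical cavity.

```python
# Sketch (assumed small random matrices): the recursion
# S_1(V) = V, S_{n+1}(V) = V + S_n(V) G0 V reproduces the Born series
# truncated at n terms, V + V(G0 V) + ... + V(G0 V)^(n-1), so each
# recursion step plays the role of one fixed-weight "layer".
import numpy as np

rng = np.random.default_rng(1)
N = 32
V = np.diag(rng.random(N))                        # diagonal scattering potential
G0 = rng.normal(size=(N, N)) / (2 * np.sqrt(N))   # keep the series stable

def born_truncated(n):
    """Explicit partial sum: V + V(G0 V) + ... + V(G0 V)^(n-1)."""
    T, GV_power = np.zeros((N, N)), np.eye(N)
    for _ in range(n):
        T += V @ GV_power
        GV_power = G0 @ V @ GV_power
    return T

def S(n):
    """Layer-like recursion: S_1 = V, S_{k+1} = V + S_k G0 V."""
    T = V.copy()
    for _ in range(n - 1):
        T = V + T @ G0 @ V
    return T

for n in (1, 3, 10):
    print(n, np.allclose(S(n), born_truncated(n)))   # True for every n
```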

Setup information

The multiple-scattering cavity is an integrating sphere with a rough inner surface and a diameter of 3.75 cm. The cavity has three ports on its boundary: one port is attached to a DMD (Texas Instruments DLP9000X), which provides a reconfigurable scattering potential, whereas the other two ports serve as the input and output ports of the cavity. A single-frequency continuous-wave laser (Agilent 81940A; wavelength, 1,550 nm) at 21.3 mW is coupled through a single-mode fibre into the cavity through the input port. After multiple scatterings off the rough inner surface, light escapes the cavity via the output port. From the spectral correlation width of the output speckle pattern, the average path length of light inside the cavity is estimated to be approximately 100 m, and the average number of bounces off the cavity boundary is on the order of 5,000. To capture the output intensity pattern, a mirror is positioned adjacent to the output port, directing the output light towards an InGaAs camera (Xenics Xeva FPA-640). A linear polarizer is placed in front of the camera to record the speckle intensity patterns, which represent a complex nonlinear relationship between the configuration of the DMD and the resulting output speckle.

Experimental procedure of computing with the multiple-scattering cavity

Experimentally, we couple a continuous-wave single-frequency laser through a single-mode fibre into the cavity. An input image is loaded onto the DMD, which modifies the scattering potential in a reconfigurable manner. The entire modulation area of the DMD consists of 2,560 × 1,600 micromirrors with a pixel pitch of 7.6 μm. Each micromirror can be tilted by +15° or –15°, representing the binary states +1 and –1, respectively. The input port of the cavity covers a portion of the DMD area (1,260 × 784 micromirrors) that is exposed to light in the cavity; thus, we only modulate the micromirrors within this region. Instead of controlling individual micromirrors, we group micromirrors into macropixels, where all the micromirrors in a single macropixel share the same tilt angle (binary state). To make images compatible with the binary DMD, we binarize them using the Floyd–Steinberg dithering algorithm63.
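For completeness, a minimal sketch of Floyd–Steinberg dithering63 is given below (our assumed implementation; the input is taken to be a grey-scale image scaled to [0, 1]).

```python
# Minimal Floyd-Steinberg dithering sketch (assumed grey-scale input in
# [0, 1]) for binarizing an image before loading it onto the DMD.
import numpy as np

def floyd_steinberg(img):
    """Return a binary (0/1) image, diffusing the quantization error
    to the right, lower-left, lower and lower-right neighbours."""
    img = img.astype(float).copy()
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 1.0 if old >= 0.5 else 0.0
            out[y, x] = new
            err = old - new
            if x + 1 < w:               img[y, x + 1]     += err * 7 / 16
            if y + 1 < h and x > 0:     img[y + 1, x - 1] += err * 3 / 16
            if y + 1 < h:               img[y + 1, x]     += err * 5 / 16
            if y + 1 < h and x + 1 < w: img[y + 1, x + 1] += err * 1 / 16
    return out

# Example: binarize a 28 x 28 Fashion MNIST image scaled to [0, 1]
# before mapping each binary pixel to one DMD macropixel.
```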

We control the order of the nonlinear mapping in our cavity in two ways, both involving changing the number of scattering events on the modulated area of the DMD. First, we reduce the DMD area in which the micromirrors are toggled; outside the modulated area, the micromirror configuration remains fixed. Light bounces off a smaller modulated area fewer times. By shrinking the dimension of the macropixels by a factor of 4 or 16, the total modulated area is reduced by the same factor. Alternatively, we can increase the number of bounces off the DMD by increasing the dwell time of light inside the cavity. This is realized by covering the output port of the cavity with a partial reflector—a silicon wafer (thickness, 0.63 mm) that partially reflects light at 1,550 nm. As a result, the order of the nonlinear random mapping increases.

The temporal coherence length of the light exceeds the typical optical path length inside the cavity. The output light maintains high spatial coherence, resulting in a relatively high intensity contrast (~0.8) of the output speckle pattern (after passing through a linear polarizer). By contrast, nonlinearity introduced by optical effects such as harmonic generation and self-phase modulation requires a broadband pulsed laser to achieve the necessary high pulse energy, which produces much lower speckle contrast. In addition, our system's nonlinear response is insensitive to the optical power, more stable and more energy efficient. The output images recorded by our camera consistently display stable speckle patterns, with each speckle grain representing a distinct spatial mode. The number of output modes (speckle grains) in the camera image is determined by dividing the total area of the speckle pattern used for computation by the average size of one speckle grain, which is derived from the full-width at half-maximum of the intensity correlation function.
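The following sketch outlines our assumed implementation of this mode-counting procedure; the image array and the half-maximum criterion are placeholders rather than the exact analysis code.

```python
# Sketch of the mode-counting procedure (assumed implementation):
# estimate the average speckle-grain area from the FWHM of the 2D
# intensity autocorrelation, then divide the used area by it.
import numpy as np

def count_speckle_modes(I):
    """I: 2D intensity image of a fully developed speckle pattern."""
    dI = I - I.mean()
    # Autocorrelation via the Wiener-Khinchin theorem (circular).
    ac = np.fft.ifft2(np.abs(np.fft.fft2(dI)) ** 2).real
    ac = np.fft.fftshift(ac) / ac.max()
    # FWHM of the central peak along both axes; assumes the only
    # pixels above half maximum belong to the central correlation peak.
    cy, cx = np.array(ac.shape) // 2
    fwhm_x = (ac[cy, :] >= 0.5).sum()
    fwhm_y = (ac[:, cx] >= 0.5).sum()
    grain_area = fwhm_x * fwhm_y      # average grain size in pixels^2
    # Approximates the "used area" by the full image area.
    return I.size / grain_area        # number of modes (speckle grains)
```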

Fashion MNIST classification task

In the Fashion MNIST34 classification task, we study the impact of the number of modes and of the size of the modulated area on the DMD on the classification accuracy and on the mutual information between the output modes and the ground-truth target classes. To vary the number of modes, we crop the output camera image; this is achieved with the torchvision.transforms.CenterCrop function in PyTorch. We manipulate the modulated area on the DMD by adjusting the macropixel size of the input data. For example, for the Fashion MNIST dataset, we use 45 × 28 micromirrors for each macropixel on the DMD, corresponding to one pixel of a 28 × 28 Fashion MNIST image, when we utilize the full modulated area. For a 25% modulated area, we use 22 × 14 micromirrors per macropixel. The entire set of 60,000 training images and 10,000 test images is sequentially input on the DMD, and the corresponding camera speckle images are collected. We further apply a filter based on the system stability (Supplementary Section 2) to select images with a speckle stability above the threshold of 0.96. The data are then split in a 9:1 ratio to form the training and test datasets for classification. For classification, we employ the ridge regressor from scikit-learn to train and infer with the output modes. Regarding the calculation of the mutual information, detailed information on the algorithm is provided in Supplementary Section 4. We use the mutual_info_regression function from the feature selection module of scikit-learn, which takes vectors of pixel values in the output modes and the class labels.
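For concreteness, a hedged end-to-end sketch of this procedure is given below; the file names (speckles.npy, labels.npy), the crop size, the array shapes and the choice of RidgeClassifier as the concrete ridge model are placeholders, not our actual data paths or exact pipeline.

```python
# Hedged sketch of the classification and mutual-information analysis,
# assuming `speckles` holds the recorded camera images and `labels`
# the Fashion MNIST classes; file names and shapes are placeholders.
import numpy as np
import torch
from torchvision import transforms
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_regression

speckles = np.load("speckles.npy")      # (n_samples, H, W), placeholder file
labels = np.load("labels.npy")          # (n_samples,)

# Vary the number of output modes by centre-cropping the camera image.
crop = transforms.CenterCrop(16)        # keeps a 16 x 16 pixel patch; the
                                        # retained modes depend on grain size
features = crop(torch.from_numpy(speckles)).reshape(len(speckles), -1).numpy()

# 9:1 split, then a linear ridge model on the speckle features.
Xtr, Xte, ytr, yte = train_test_split(features, labels, test_size=0.1)
clf = RidgeClassifier().fit(Xtr, ytr)
print("test accuracy:", clf.score(Xte, yte))

# Mutual information between each output-mode feature and the class
# label (following the Methods, the label is treated as the target).
mi = mutual_info_regression(features, labels)
print("median MI per mode:", np.median(mi))
```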

Programmable optical and digital parameters

The maximum number of programmable optical parameters is given by the number of mirrors of the DMD, which is approximately 4 million. The count of digitally programmable parameters, however, depends on the decoder utilized. Specifically, for the Fashion MNIST classification task, the linear classifier requires only 1,000 parameters. In the task of Fashion MNIST reconstruction, the parameter count increases to approximately 90,000. For human face detection, the requirement is around 310,000 parameters, and for pedestrian tracking, the model uses around 1 million programmable parameters.

Training of digital decoder

For the tasks beyond classification, we start with low-dimensional vectors derived from the deep optical encoder—the multiple-scattering cavity. Using these vectors, we train a digital decoder based on a neural network, with the objective of minimizing the mean squared loss relative to the ground-truth target values in our training dataset. The dimensions of the targets differ according to the task: 28 × 28 pixels for Fashion MNIST image reconstruction, 15 sets of keypoints for human face keypoint detection and four bounding box coordinates for pedestrian detection. The architecture selected for the decoding neural network is an MLP, which incorporates batch normalization before each activation function. The depths and widths of the hidden layers are determined by a neural architecture search with at least 100 randomly initialized trials, from which the best architecture for the digital decoder is selected. The same activation function, chosen among the relu, tanh and sigmoid functions, is used throughout each search. All training instances are conducted on an NVIDIA A100 Tensor Core GPU via Google Colab.
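A minimal sketch of this random architecture search is shown below; the depth/width ranges and the externally supplied train_step routine (which should train a candidate model and return its validation loss) are assumptions for illustration.

```python
# Minimal sketch of the random architecture search (assumed ranges):
# sample depth/width, build an MLP with batch normalization before
# each activation, train briefly, and keep the best validation loss.
import random
import torch.nn as nn

def build_mlp(n_in, n_out, depth, width, act=nn.ReLU):
    layers = []
    for i in range(depth):
        layers += [nn.Linear(n_in if i == 0 else width, width),
                   nn.BatchNorm1d(width), act()]
    layers.append(nn.Linear(width, n_out))
    return nn.Sequential(*layers)

def search(train_step, n_in, n_out, trials=100):
    """train_step(model) -> validation loss; supplied by the user."""
    best, best_loss = None, float("inf")
    for _ in range(trials):
        depth = random.randint(2, 10)           # assumed search range
        width = random.choice([64, 128, 256, 512])
        model = build_mlp(n_in, n_out, depth, width)
        loss = train_step(model)
        if loss < best_loss:
            best, best_loss = model, loss
    return best
```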

Fashion MNIST image reconstruction task

In the Fashion MNIST34 image reconstruction task, we train an MLP as a digital decoder using pairs of 16 output modes (inputs) and ground-truth Fashion MNIST images (targets) to reconstruct the original images from the speckle patterns. To optimize the decoder's architecture, we employ neural architecture search, varying both the depth and width to identify the best architecture for the decoder. We primarily study and compare two cases: case 1, speckle features generated from a linear random projection through complex media, followed by quadratic detection of the optical field (to generate the linear optical features, we follow the methods described elsewhere24); case 2, speckle features generated from nonlinear random mapping via the multiple-scattering cavity, again followed by quadratic detection of the optical field. In both scenarios, we ensure that the number of modes remains consistent, making the reconstruction quality comparable between the two cases. For the first case, the optimized decoder comprises a two-layer MLP; for the second case, it comprises a four-layer MLP, both with the same sigmoid activation function. We further evaluate the reconstruction using a test dataset.

Human face keypoint detection task

The human face keypoint detection dataset at Kaggle40 consists of facial keypoints, each characterized by a real-valued pair (x, y) indicating its position in pixel coordinates. This dataset identifies 15 specific keypoints corresponding to facial features, including the centres of the left and right eyes, the inner and outer corners of both eyes, the inner and outer ends of both eyebrows, the tip of the nose, the corners of the mouth on both sides, and the top and bottom centres of the lips. Note that the terms 'left' and 'right' refer to the subject's point of view. Some data points might not have all the keypoints, which are represented as missing entries in the dataset. Each image in the dataset contains a list of pixels, with values ranging from 0 to 255, formatted for a resolution of 96 × 96 pixels. The training set includes 7,049 images; each row provides the (x, y) coordinates of the 15 keypoints and the image data as a row-ordered list of pixels. The test set comprises 1,783 images; each row lists an ImageId and the corresponding row-ordered list of pixels for the image.

For data preprocessing, entries without any keypoint information are excluded. In cases where an image has fewer than 15 keypoints, we duplicate some keypoints to ensure that all the labels consist of 15 target points; this guarantees a consistent size of the MLP output layer, as shown in the sketch below.
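A minimal sketch of this padding step (our assumed implementation) is:

```python
# Assumed implementation of the keypoint padding step: duplicate
# existing keypoints so that every label has exactly 15 (x, y) pairs.
import numpy as np

def pad_keypoints(kps, n_target=15):
    """kps: list of (x, y) tuples with 1..15 entries; returns a (15, 2) array."""
    padded, i = list(kps), 0
    while len(padded) < n_target:
        padded.append(kps[i % len(kps)])   # duplicate from the start
        i += 1
    return np.asarray(padded, dtype=float)
```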

Following this, the data are sent into the multiple-scattering cavity with different modulated areas, corresponding to variable nonlinearity strengths, reminiscent of deep neural network encoding (Supplementary Section 3). Only 16 output modes (selected using torchvision.transforms.CenterCrop) are extracted from this system. Using these modes, a digital decoder is developed based on neural architecture search, aiming to train on and infer the 15 facial keypoints.

Our analysis mainly compares two scenarios: one with a modulated area of ~6.25% and another termed '100% + enhanced', in which the full modulated area is bolstered by an extra partial reflector for enhanced scattering (Supplementary Section 1). Our findings indicate that, even with a decoder trained to its optimal capacity in each case, the 100% + enhanced setup yields better results.

Pedestrian detection task

In this task, we use the Caltech Pedestrian dataset41, one of the pioneering collections in the domain of computer vision specifically designed for pedestrian detection. This dataset has played an instrumental role in shaping research trajectories in pedestrian detection, serving as a benchmark for numerous detection algorithms over the years. The dataset offers a wide variety of real-world scenarios captured in urban settings, including pedestrians in various poses, occlusions and varying light conditions. With its meticulous annotations and the diverse challenges it poses, it provides an invaluable resource for the development and evaluation of algorithms.

Within this dataset, bounding boxes are utilized to accurately locate individual pedestrians in frames. These boxes are characterized by a set of four real-valued positions: (x1, y1) for the top-left corner and (x2, y2) for the bottom-right corner. Given the dynamic nature of urban environments, a single frame can contain multiple pedestrians, which results in multiple bounding boxes within that image.

To preprocess the dataset, we adopt a simplification strategy: regardless of the number of bounding boxes present in the original image, only one bounding box is retained per image. For images that contain multiple bounding boxes, only the first bounding box is selected and used as the label; images lacking a bounding box are removed from the dataset. All the images from the Caltech dataset inherently possess a resolution of 640 × 480 pixels; in our case, we downsample them to 320 × 240 pixels. Our curated version of the dataset, divided into training and test segments, comprises a total of 10,000 images.
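A hedged sketch of this preprocessing is shown below; frames and annotations stand for the already parsed Caltech Pedestrian data, and all names are placeholders rather than our exact pipeline.

```python
# Hedged preprocessing sketch: keep the first bounding box per frame,
# drop frames without pedestrians, and downsample to 320 x 240.
import numpy as np
from PIL import Image

def preprocess(frames, annotations):
    """frames: list of 640 x 480 PIL images; annotations: list of
    bounding-box lists [(x1, y1, x2, y2), ...], one list per frame."""
    images, labels = [], []
    for img, boxes in zip(frames, annotations):
        if not boxes:
            continue                      # remove frames without a box
        x1, y1, x2, y2 = boxes[0]         # keep only the first bounding box
        img = img.resize((320, 240), Image.BILINEAR)
        images.append(np.asarray(img))
        labels.append([x1 / 2, y1 / 2, x2 / 2, y2 / 2])  # rescale coordinates
    return np.stack(images), np.asarray(labels, dtype=float)
```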

Following the preprocessing steps, the images are introduced into the multiple-scattering cavity, with the full modulated area enhanced by the partial reflector—the silicon wafer. From this system, a total of 25 output modes (selected using torchvision.transforms.CenterCrop) are derived. Using these modes, we train a digital decoder whose architecture is selected by neural architecture search, with the objective of inferring the single bounding box in each image.

We also generated Supplementary Videos 1 and 2 using the test dataset from various video locations. In these videos, the frame rate was reduced from the actual 30 fps to 9 fps for visualization purposes. The green boxes indicate the ground-truth bounding boxes, whereas the magenta boxes represent the inferences. The actual inference time is well under 0.1 s.