Abstract
This work aims at finding optimal navigation policies for thin, deformable microswimmers that progress in a viscous fluid by propagating a sinusoidal undulation along their slender body. These active filaments are embedded in a prescribed, non-homogeneous flow, in which their swimming undulations have to compete with the drifts, strains, and deformations imposed by the outer velocity field. Such an intricate situation, where swimming and navigation are tightly intertwined, is addressed using various methods of reinforcement learning. Each swimmer has access only to restricted information on its configuration and has to select accordingly an action among a limited set. The optimisation problem then consists in finding the policy leading to the most efficient displacement in a given direction. It is found that standard methods do not converge, and this pitfall is interpreted as a combined consequence of the non-Markovianity of the decision process and of the highly chaotic nature of the dynamics, which is responsible for a high variability in learning efficiencies. We nevertheless provide an alternative method to construct efficient policies, based on running several independent realisations of Q-learning. This allows the construction of a set of admissible policies whose properties can be studied in detail and compared in order to assess their efficiency and robustness.
Data availability statement
Data sharing is not applicable to this article as no datasets were generated during the current study. Numerical codes will be made available on reasonable request.
References
Z. Wu, Y. Chen, D. Mukasa, O.S. Pak, W. Gao, Medical micro/nanorobots in complex media. Chem. Soc. Rev. 49, 8088–8112 (2020). https://doi.org/10.1039/d0cs00309c
A. Servant, F. Qiu, M. Mazza, K. Kostarelos, B.J. Nelson, Controlled in vivo swimming of a swarm of bacteria-like microrobotic flagella. Adv. Mater. 27, 2981–2988 (2015). https://doi.org/10.1002/adma.201404444
L. Berti, L. Giraldi, C. Prud’Homme, Swimming at low Reynolds number. ESAIM Proc. Surv. 67, 46–60 (2020). https://doi.org/10.1051/proc/202067004
F. Alouges, A. DeSimone, L. Giraldi, M. Zoppello, Self-propulsion of slender micro-swimmers by curvature control: N-link swimmers. Int. J. Non Linear Mech. 56, 132–141 (2013). https://doi.org/10.1016/j.ijnonlinmec.2013.04.012
X. Shen, P.E. Arratia, Undulatory swimming in viscoelastic fluids. Phys. Rev. Lett. 106(20), 208101 (2011). https://doi.org/10.1103/PhysRevLett.106.208101
A. Daddi-Moussa-Ider, H. Löwen, B. Liebchen, Hydrodynamics can determine the optimal route for microswimmer navigation. Commun. Phys. 4, 1–11 (2021). https://doi.org/10.1038/s42005-021-00522-6
I. Borazjani, F. Sotiropoulos, Numerical investigation of the hydrodynamics of anguilliform swimming in the transitional and inertial flow regimes. J. Exp. Biol. 212(4), 576–592 (2009). https://doi.org/10.1242/jeb.025007
N. Cohen, J.H. Boyle, Swimming at low Reynolds number: a beginners guide to undulatory locomotion. Contemp. Phys. 51(2), 103–123 (2010). https://doi.org/10.1080/00107510903268381
F. Alouges, A. DeSimone, L. Giraldi, Y. Or, O. Wiezel, Energy-optimal strokes for multi-link microswimmers: Purcell’s loops and Taylor’s waves reconciled. New J. Phys. 21(4), 043050 (2019). https://doi.org/10.1088/1367-2630/ab1142
G. Reddy, V.N. Murthy, M. Vergassola, Olfactory sensing and navigation in turbulent environments. Annu. Rev. Condens. Matter Phys. 13(1), 191–213 (2022). https://doi.org/10.1146/annurev-conmatphys-031720-032754
F. Cichos, K. Gustavsson, B. Mehlig, G. Volpe, Machine learning for active matter. Nat. Mach. Intell. 2, 94–103 (2020). https://doi.org/10.1038/s42256-020-0146-9
G. Reddy, A. Celani, T.J. Sejnowski, M. Vergassola, Learning to soar in turbulent environments. Proc. Natl. Acad. Sci. 113, 4877–4884 (2016). https://doi.org/10.1073/pnas.1606075113
S. Colabrese, K. Gustavsson, A. Celani, L. Biferale, Flow navigation by smart microswimmers via reinforcement learning. Phys. Rev. Lett. 118(15), 158004 (2017). https://doi.org/10.1103/PhysRevLett.118.158004
K. Gustavsson, L. Biferale, A. Celani, S. Colabrese, Finding efficient swimming strategies in a three-dimensional chaotic flow by reinforcement learning. Eur. Phys. J. E 40, 1–6 (2017). https://doi.org/10.1140/epje/i2017-11602-9
E. Schneider, H. Stark, Optimal steering of a smart active particle. Europhys. Lett. 127(6), 64003 (2019). https://doi.org/10.1209/0295-5075/127/64003
S. Muiños-Landin, A. Fischer, V. Holubec, F. Cichos, Reinforcement learning with artificial microswimmers. Sci. Robot. 6(52), eabd9285 (2021). https://doi.org/10.1126/scirobotics.abd9285
J. Qiu, N. Mousavi, K. Gustavsson, C. Xu, B. Mehlig, L. Zhao, Navigation of micro-swimmers in steady flow: the importance of symmetries. J. Fluid Mech. 932, 10 (2022). https://doi.org/10.1017/jfm.2021.978
J.K. Alageshan, A.K. Verma, J. Bec, R. Pandit, Machine learning strategies for path-planning microswimmers in turbulent flows. Phys. Rev. E 101, 043110 (2020). https://doi.org/10.1103/PhysRevE.101.043110
X.B. Peng, G. Berseth, M. Van de Panne, Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Trans. Graph. 35(4), 1–12 (2016). https://doi.org/10.1145/2897824.2925881
S. Levine, C. Finn, T. Darrell, P. Abbeel, End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 17(1), 1334–1373 (2016). https://doi.org/10.5555/2946645.2946684
O. Pironneau, D. Katz, Optimal swimming of flagellated micro-organisms. J. Fluid Mech. 66(2), 391–415 (1974). https://doi.org/10.1017/S0022112074000279
A. Lindner, M.J. Shelley, Elastic fibers in flows, in Fluid-Structure Interactions in Low-Reynolds-Number Flows. ed. by C. Duprat, H.A. Stone (Royal Society of Chemistry, Cambridge, 2015), pp.168–192
C. Moreau, L. Giraldi, H. Gadêlha, The asymptotic coarse-graining formulation of slender rods, bio-filaments and flagella. J. R. Soc. Interface 15(144), 20180235 (2018). https://doi.org/10.1098/rsif.2018.0235
J.R. Picardo, D. Vincenzi, N. Pal, S.S. Ray, Preferential sampling of elastic chains in turbulent flows. Phys. Rev. Lett. 121(24), 244501 (2018). https://doi.org/10.1103/PhysRevLett.121.244501
M.E. Rosti, A.A. Banaei, L. Brandt, A. Mazzino, Flexible fiber reveals the two-point statistical properties of turbulence. Phys. Rev. Lett. 121(4), 044501 (2018). https://doi.org/10.1103/PhysRevLett.121.044501
Y.-N. Young, M.J. Shelley, Stretch-coil transition and transport of fibers in cellular flows. Phys. Rev. Lett. 99(5), 058303 (2007). https://doi.org/10.1103/PhysRevLett.99.058303
C. Brouzet, G. Verhille, P. Le Gal, Flexible fiber in a turbulent flow: a macroscopic polymer. Phys. Rev. Lett. 112(7), 074501 (2014). https://doi.org/10.1103/PhysRevLett.112.074501
S. Allende, C. Henry, J. Bec, Stretching and buckling of small elastic fibers in turbulence. Phys. Rev. Lett. 121(15), 154501 (2018). https://doi.org/10.1103/PhysRevLett.121.154501
J. Gray, H.W. Lissmann, The locomotion of nematodes. J. Exp. Biol. 41(1), 135–154 (1964). https://doi.org/10.1242/jeb.41.1.135
S. Berri, J.H. Boyle, M. Tassieri, I.A. Hope, N. Cohen, Forward locomotion of the nematode C. elegans is achieved through modulation of a single gait. HFSP J. 3(3), 186–193 (2009). https://doi.org/10.2976/1.3082260
B.M. Friedrich, I.H. Riedel-Kruse, J. Howard, F. Jülicher, High-precision tracking of sperm swimming fine structure provides strong test of resistive force theory. J. Exp. Biol. 213(8), 1226–1234 (2010). https://doi.org/10.1242/jeb.039800
J.F. Jikeli, L. Alvarez, B.M. Friedrich, L.G. Wilson, R. Pascal, R. Colin, M. Pichlo, A. Rennhack, C. Brenker, U.B. Kaupp, Sperm navigation along helical paths in 3D chemoattractant landscapes. Nat. Commun. 6, 1–10 (2015). https://doi.org/10.1038/ncomms8985
A.-K. Tornberg, M.J. Shelley, Simulating the dynamics and interactions of flexible fibers in Stokes flows. J. Comput. Phys. 196(1), 8–40 (2004). https://doi.org/10.1016/j.jcp.2003.10.017
D. Rothstein, E. Henry, J.P. Gollub, Persistent patterns in transient chaotic fluid mixing. Nature 401(6755), 770–772 (1999). https://doi.org/10.1038/44529
M. Hauskrecht, Value-function approximations for partially observable Markov decision processes. J. Artif. Intell. Res. 13, 33–94 (2000). https://doi.org/10.1613/jair.678
S.P. Singh, T. Jaakkola, M.I. Jordan, Learning without state-estimation in partially observable Markovian decision processes, in Machine Learning Proceedings 1994. ed. by W.W. Cohen, H. Hirsh (Morgan Kaufmann, San Francisco, 1994), pp.284–292. https://doi.org/10.1016/B978-1-55860-335-6.50042-8
R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (The MIT Press, Cambridge, 2018)
C.J. Watkins, P. Dayan, Q-learning. Mach. Learn. 8, 279–292 (1992). https://doi.org/10.1007/BF00992698
L. Berti, Z. El Khiyati, Y. Essousy, C. Prud’Homme, L. Giraldi, Reinforcement learning with function approximation for 3-spheres swimmer. IFAC-PapersOnLine 55(16), 1–6 (2022). https://doi.org/10.1016/j.ifacol.2022.08.072
A. Najafi, R. Golestanian, Simple swimmer at low Reynolds number: three linked spheres. Phys. Rev. E 69(6), 062901 (2004). https://doi.org/10.1103/PhysRevE.69.062901
V.R. Konda, J.N. Tsitsiklis, On actor-critic algorithms. SIAM J. Control. Optim. 42(4), 1143–1166 (2003). https://doi.org/10.1137/S0363012901385691
P. Perlekar, R. Pandit, Turbulence-induced melting of a nonequilibrium vortex crystal in a forced thin fluid film. New J. Phys. 12(2), 023033 (2010). https://doi.org/10.1088/1367-2630/12/2/023033
G. Michel, J. Herault, F. Pétrélis, S. Fauve, Bifurcations of a large-scale circulation in a quasi-bidimensional turbulent flow. Europhys. Lett. 115(6), 64004 (2016). https://doi.org/10.1209/0295-5075/115/64004
Acknowledgements
The authors are grateful to the OPAL infrastructure from Université Côte d’Azur for providing computational resources. This work received support from the UCA-JEDI Future Investments funded by the French government (Grant No. ANR-15-IDEX-01) and from the Agence Nationale de la Recherche (Grant No. ANR-21-CE30-0040-01).
Author information
Authors and Affiliations
Contributions
ZEK and JB performed research and wrote the paper. RC performed research. JB and LG designed research.
Corresponding author
Additional information
Quantitative AI in Complex Fluids and Complex Flows: Challenges and Benchmarks. Guest editors: Luca Biferale, Michele Buzzicotti, Massimo Cencini.
Appendices
Appendix A: Reinforcement-learning algorithms
We give here details on the three reinforcement-learning algorithms that have been used in this work and whose results are discussed in Sect. 4, namely Q-learning, differential semi-gradient SARSA, and policy gradient/Actor-Critic.
A.1 Q-learning
Q-learning is a model-free reinforcement-learning algorithm used to estimate the optimal action-value function of a given environment [38]. It relies on the discounted reward \(\mathcal {R}^\textrm{disc}[\pi ]\) defined by Eq. (4) with discount rate \(\gamma \). In this algorithm, an agent interacts with the environment, observes the state and reward, and takes an action according to an \(\varepsilon \)-greedy policy. The estimate of the action-value function Q defined in Sect. 3.2 is updated at each step, with a learning rate \(\lambda \), using the immediate reward R and the maximum estimated value over actions in the next state. The algorithm continues until convergence is reached or a stopping criterion is met.
The resulting procedure is summarised below:
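For concreteness, the update rule can be sketched as a minimal tabular version in Python. The environment interface assumed here (env.reset() and env.step(action), returning the next observed state and the immediate reward), as well as all names and hyperparameter values, are illustrative assumptions and not the implementation used in this work:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_steps=100_000,
               gamma=0.999, lam=0.05, eps=0.1, seed=0):
    """Tabular eps-greedy Q-learning for the discounted reward.

    `env` is an illustrative environment: env.reset() returns an initial
    state index, env.step(a) returns (next_state, reward).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # eps-greedy action selection
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = env.step(a)
        # TD update towards r + gamma * max_a' Q(s', a')
        Q[s, a] += lam * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```

The greedy policy is then read off as \(\pi(\omega) = \arg\max_\alpha Q(\omega,\alpha)\).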
A.2 Differential semi-gradient SARSA
This is another model-free reinforcement-learning algorithm, used to learn the optimal action-value function Q associated with the differential return \(\mathcal {R}^\textrm{diff}[\pi ]\) defined in Eq. (5). In this procedure, the agent interacts with the environment, observes the state and reward, and takes an action according to an \(\varepsilon \)-greedy policy derived from an approximation \(\hat{Q}_{\varvec{\eta }}\) of the action-value function. The approximation parameters \(\varvec{\eta }\) are updated with a learning rate \(\lambda _2\) using a temporal-difference (TD) error \(\delta \), which compares the observed reward, corrected by a running estimate \(\bar{R}\) of the average reward, to the change in the approximate action value between the current and next state-action pairs. Additionally, the algorithm updates the estimate \(\bar{R}\) at a rate \(\lambda _1\). Details can be found in [37].
The procedure is then:
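Under the same illustrative environment interface as above, a minimal sketch with one-hot (tabular) features, for which the semi-gradient reduces to a simple increment, reads:

```python
import numpy as np

def diff_sarsa(env, n_states, n_actions, n_steps=50_000,
               lam1=0.01, lam2=0.05, eps=0.1, seed=0):
    """Differential semi-gradient SARSA with a linear approximation
    Q_hat(s, a) = eta . phi(s, a); with one-hot features phi, the
    parameter table eta coincides with the Q-table itself.
    `env` is an illustrative environment (reset/step interface)."""
    rng = np.random.default_rng(seed)
    eta = np.zeros((n_states, n_actions))  # parameters of Q_hat
    R_bar = 0.0                            # running average-reward estimate

    def eps_greedy(s):
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(eta[s]))

    s = env.reset()
    a = eps_greedy(s)
    for _ in range(n_steps):
        s2, r = env.step(a)
        a2 = eps_greedy(s2)
        # TD error of the differential return:
        # delta = R - R_bar + Q_hat(s', a') - Q_hat(s, a)
        delta = r - R_bar + eta[s2, a2] - eta[s, a]
        R_bar += lam1 * delta      # update the average-reward estimate
        eta[s, a] += lam2 * delta  # semi-gradient step (one-hot features)
        s, a = s2, a2
    return eta, R_bar
```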
A.3 Policy gradient/Actor-Critic
This last procedure is a model-free reinforcement-learning algorithm used to learn a policy that maximises the expected differential return \(\mathcal {R}^\textrm{diff}[\pi ]\). In this algorithm, the agent interacts with the environment, observes the state and reward, and takes an action according to an approximation \(\hat{\pi }_{\varvec{\theta }}\) of the optimal policy. The TD error \(\delta \) is this time built from the observed reward, corrected by the estimated average reward \(\bar{R}\), and the change in the estimated value \(\hat{V}_{\varvec{\eta }}\) between the current and next states. The algorithm uses it to update the estimated average reward \(\bar{R}\) at a rate \(\lambda _1\), and the approximation parameters \(\varvec{\eta }\) of the value function at a rate \(\lambda _2\). Additionally, the approximation parameters \(\varvec{\theta }\) of the policy are updated at a rate \(\lambda _3\) using both the TD error and the gradient of the log-likelihood of the policy with respect to \(\varvec{\theta }\). This algorithm is known as the Actor-Critic algorithm [41] because it simultaneously learns an actor (the policy \(\hat{\pi }_{\varvec{\theta }}\)) and a critic (the value function \(\hat{V}_{\varvec{\eta }}\)).
The resulting procedure is summarised below:
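Under the same illustrative assumptions, a tabular softmax actor paired with a tabular critic can be sketched as follows (for a softmax policy, the log-likelihood gradient with respect to the preferences of the visited state is the one-hot vector of the chosen action minus the action probabilities):

```python
import numpy as np

def actor_critic(env, n_states, n_actions, n_steps=30_000,
                 lam1=0.01, lam2=0.05, lam3=0.05, seed=0):
    """One-step Actor-Critic for the differential return, with a tabular
    softmax policy pi_theta and a tabular value estimate V_eta.
    `env` is an illustrative environment (reset/step interface)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))  # actor parameters
    V = np.zeros(n_states)                   # critic parameters
    R_bar = 0.0                              # average-reward estimate

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())  # stabilised softmax
        return p / p.sum()

    s = env.reset()
    for _ in range(n_steps):
        p = policy(s)
        a = int(rng.choice(n_actions, p=p))
        s2, r = env.step(a)
        # differential TD error: delta = R - R_bar + V(s') - V(s)
        delta = r - R_bar + V[s2] - V[s]
        R_bar += lam1 * delta          # average-reward update
        V[s] += lam2 * delta           # critic update
        grad_log = -p                  # grad of log pi(a|s) wrt theta[s]
        grad_log[a] += 1.0
        theta[s] += lam3 * delta * grad_log  # actor update
        s = s2
    return theta, V, R_bar
```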
Appendix B: The iterative Markovian approximation
We give here some details on the idea of approximating the dynamical evolution of the swimmer by a Markov decision process (MDP). The hope is that this approximation captures the most relevant information of our optimisation problem, namely the transition probabilities between the states of our environment and the distribution of the rewards obtained by our agent. The advantages of this approach are twofold. First, MDPs only require knowledge of the transitions and rewards, abstracting away all other aspects of the dynamics; learning algorithms can thus run significantly faster, without the need to simulate the entire system at each step. Second, this approach separates the issue of non-Markovianity from other potential difficulties.
Our procedure consists in constructing a sequence of policies \(\pi _0, \pi _1, \ldots , \pi _k\) that will hopefully converge to the optimal policy \(\pi _\star \). At each step, we simulate a swimmer that follows the policy \(\pi _k\). Once a statistical steady state is reached, we try out, at every time step \(t_n=n\Delta t\), all possible actions and monitor the new observation and reward at time \(t_{n+1}\). We then use long time averages (over \(10^6\,\Delta t\)) to construct numerical Monte-Carlo approximations of the transition probability \(p_{\textrm{T},k}(\omega '\vert \omega ,\alpha )\) of observing \(\omega '\) at time \(t+\Delta t\), given that \(\omega \) was observed and action \(\alpha \) was performed at time t, together with the corresponding distribution of rewards \(p_{\textrm{R},k}(R\vert \omega ,\alpha )\). Both distributions of course depend on \(\pi _k\). We then use the approximate probabilities \(p_{\textrm{T},k}\) and \(p_{\textrm{R},k}\) to run the \(\varepsilon \)-greedy Q-learning algorithm which, thanks to the Markovian formulation now imposed, is guaranteed to converge. This yields the optimal policy \(\pi _{k+1}\) of the approximate system. The procedure is then reiterated with \(\pi _{k+1}\) as the base policy, until a fixed point is attained. The method is summarised in Algorithm 4.
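Under the same illustrative environment interface as in Appendix A, the iteration can be sketched as follows. Two simplifications are made for the sake of a compact example: all actions are sampled through an \(\varepsilon \)-soft version of \(\pi _k\) (instead of probing every action at each step), and the approximate tabular MDP is solved by exact Q-value iteration, which yields the same greedy policy as a converged \(\varepsilon \)-greedy Q-learning on that MDP:

```python
import numpy as np

def markovian_iteration(env, n_obs, n_actions, pi0, n_samples=100_000,
                        gamma=0.99, n_vi=500, max_iter=10, eps=0.2, seed=0):
    """Iterative Markovian approximation: estimate p_T(w'|w,a) and the
    mean reward R(w,a) under the current policy, solve the resulting
    tabular MDP, and repeat with the greedy policy until a fixed point
    (or a cycle, as observed in the text) is reached."""
    rng = np.random.default_rng(seed)
    pi = np.array(pi0, dtype=int)
    history = [pi.copy()]
    for _ in range(max_iter):
        # Monte-Carlo estimation of transitions and mean rewards under
        # an eps-soft version of pi_k (tiny prior avoids division by 0).
        counts = np.full((n_obs, n_actions, n_obs), 1e-9)
        rew = np.zeros((n_obs, n_actions))
        w = env.reset()
        for _ in range(n_samples):
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(pi[w])
            w2, r = env.step(a)
            counts[w, a, w2] += 1.0
            rew[w, a] += r
            w = w2
        n_wa = counts.sum(axis=2)
        p_T = counts / n_wa[:, :, None]   # p_T(w'|w,a)
        R = rew / n_wa                    # mean reward R(w,a)
        # Solve the approximate MDP, now Markovian by construction,
        # by Q-value iteration: Q <- R + gamma * p_T . max_a' Q
        Q = np.zeros((n_obs, n_actions))
        for _ in range(n_vi):
            Q = R + gamma * p_T @ Q.max(axis=1)
        pi_next = Q.argmax(axis=1)
        if any((pi_next == p).all() for p in history):
            return pi_next                # fixed point or cycle reached
        history.append(pi_next.copy())
        pi = pi_next
    return pi
```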
The motivation behind this procedure is that, if the Markovian approximation is not too far off, the optimal policy \(\pi _{k+1}\) of the approximate system should at least improve on \(\pi _k\), if not coincide with the optimal policy of the real system. Hence, if the optimal policy \(\pi _\star \) is a fixed point of the procedure, the sequence \(\{\pi _k;\, k \ge 0\}\) would converge to it, thus solving our problem.
We have run this procedure, choosing for the initial policy \(\pi _0\) the naive strategy of Sect. 3.3. After three iterations, the algorithm circled back to the policy obtained at the first iteration (\(\pi _3 = \pi _1\)). Hence, the proposed procedure does not lead to any improvement over the naive policy. We interpret this as a sign of the highly non-Markovian nature of our setting.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
El Khiyati, Z., Chesneaux, R., Giraldi, L. et al. Steering undulatory micro-swimmers in a fluid flow through reinforcement learning. Eur. Phys. J. E 46, 43 (2023). https://doi.org/10.1140/epje/s10189-023-00293-8