Abstract
This work aims at finding optimal navigation policies for thin, deformable microswimmers that progress in a viscous fluid by propagating a sinusoidal undulation along their slender body. These active filaments are embedded in a prescribed, non-homogeneous flow, in which their swimming undulations have to compete with the drifts, strains, and deformations imposed by the outer velocity field. Such an intricate situation, where swimming and navigation are tightly intertwined, is addressed using various methods of reinforcement learning. Each swimmer has access only to restricted information on its configuration and has to select accordingly an action among a limited set. The optimisation problem then consists in finding the policy leading to the most efficient displacement in a given direction. It is found that standard methods do not converge, and this pitfall is interpreted as a combined consequence of the non-Markovianity of the decision process and of the highly chaotic nature of the dynamics, which is responsible for a high variability in learning efficiencies. We nevertheless provide an alternative method to construct efficient policies, based on running several independent realisations of Q-learning. This allows the construction of a set of admissible policies whose properties can be studied in detail and compared in order to assess their efficiency and robustness.
Data availability statement
Data sharing is not applicable to this article as no datasets were generated during the current study. Numerical codes will be made available on reasonable request.
References
Z. Wu, Y. Chen, D. Mukasa, O.S. Pak, W. Gao, Medical micro/nanorobots in complex media. Chem. Soc. Rev. 49, 8088–8112 (2020). https://doi.org/10.1039/d0cs00309c
A. Servant, F. Qiu, M. Mazza, K. Kostarelos, B.J. Nelson, Controlled in vivo swimming of a swarm of bacteria-like microrobotic flagella. Adv. Mater. 27, 2981–2988 (2015). https://doi.org/10.1002/adma.201404444
L. Berti, L. Giraldi, C. Prud’Homme, Swimming at low Reynolds number. ESAIM Proc. Surv. 67, 46–60 (2020). https://doi.org/10.1051/proc/202067004
F. Alouges, A. DeSimone, L. Giraldi, M. Zoppello, Self-propulsion of slender micro-swimmers by curvature control: N-link swimmers. Int. J. Non Linear Mech. 56, 132–141 (2013). https://doi.org/10.1016/j.ijnonlinmec.2013.04.012
X. Shen, P.E. Arratia, Undulatory swimming in viscoelastic fluids. Phys. Rev. Lett. 106(20), 208101 (2011). https://doi.org/10.1103/PhysRevLett.106.208101
A. Daddi-Moussa-Ider, H. Löwen, B. Liebchen, Hydrodynamics can determine the optimal route for microswimmer navigation. Commun. Phys. 4, 1–11 (2021). https://doi.org/10.1038/s42005-021-00522-6
I. Borazjani, F. Sotiropoulos, Numerical investigation of the hydrodynamics of anguilliform swimming in the transitional and inertial flow regimes. J. Exp. Biol. 212(4), 576–592 (2009). https://doi.org/10.1242/jeb.025007
N. Cohen, J.H. Boyle, Swimming at low Reynolds number: a beginners guide to undulatory locomotion. Contemp. Phys. 51(2), 103–123 (2010). https://doi.org/10.1080/00107510903268381
F. Alouges, A. DeSimone, L. Giraldi, Y. Or, O. Wiezel, Energy-optimal strokes for multi-link microswimmers: Purcell’s loops and Taylor’s waves reconciled. New J. Phys. 21(4), 043050 (2019). https://doi.org/10.1088/1367-2630/ab1142
G. Reddy, V.N. Murthy, M. Vergassola, Olfactory sensing and navigation in turbulent environments. Annu. Rev. Condens. Matter Phys. 13(1), 191–213 (2022). https://doi.org/10.1146/annurev-conmatphys-031720-032754
F. Cichos, K. Gustavsson, B. Mehlig, G. Volpe, Machine learning for active matter. Nat. Mach. Intell. 2, 94–103 (2020). https://doi.org/10.1038/s42256-020-0146-9
G. Reddy, A. Celani, T.J. Sejnowski, M. Vergassola, Learning to soar in turbulent environments. Proc. Natl. Acad. Sci. 113, 4877–4884 (2016). https://doi.org/10.1073/pnas.1606075113
S. Colabrese, K. Gustavsson, A. Celani, L. Biferale, Flow navigation by smart microswimmers via reinforcement learning. Phys. Rev. Lett. 118(15), 158004 (2017). https://doi.org/10.1103/PhysRevLett.118.158004
K. Gustavsson, L. Biferale, A. Celani, S. Colabrese, Finding efficient swimming strategies in a three-dimensional chaotic flow by reinforcement learning. Eur. Phys. J. E 40, 1–6 (2017). https://doi.org/10.1140/epje/i2017-11602-9
E. Schneider, H. Stark, Optimal steering of a smart active particle. Europhys. Lett. 127(6), 64003 (2019). https://doi.org/10.1209/0295-5075/127/64003
S. Muiños-Landin, A. Fischer, V. Holubec, F. Cichos, Reinforcement learning with artificial microswimmers. Sci. Robot. 6(52), eabd9285 (2021). https://doi.org/10.1126/scirobotics.abd9285
J. Qiu, N. Mousavi, K. Gustavsson, C. Xu, B. Mehlig, L. Zhao, Navigation of micro-swimmers in steady flow: the importance of symmetries. J. Fluid Mech. 932, 10 (2022). https://doi.org/10.1017/jfm.2021.978
J.K. Alageshan, A.K. Verma, J. Bec, R. Pandit, Machine learning strategies for path-planning microswimmers in turbulent flows. Phys. Rev. E 101, 043110 (2020). https://doi.org/10.1103/PhysRevE.101.043110
X.B. Peng, G. Berseth, M. Van de Panne, Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Trans. Graph. 35(4), 1–12 (2016). https://doi.org/10.1145/2897824.2925881
S. Levine, C. Finn, T. Darrell, P. Abbeel, End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 17(1), 1334–1373 (2016). https://doi.org/10.5555/2946645.2946684
O. Pironneau, D. Katz, Optimal swimming of flagellated micro-organisms. J. Fluid Mech. 66(2), 391–415 (1974). https://doi.org/10.1017/S0022112074000279
A. Lindner, M.J. Shelley, Elastic fibers in flows, in Fluid-Structure Interactions in Low-Reynolds-Number Flows. ed. by C. Duprat, H.A. Stone (Royal Society of Chemistry, Cambridge, 2015), pp.168–192
C. Moreau, L. Giraldi, H. Gadêlha, The asymptotic coarse-graining formulation of slender rods, bio-filaments and flagella. J. R. Soc. Interface 15(144), 20180235 (2018). https://doi.org/10.1098/rsif.2018.0235
J.R. Picardo, D. Vincenzi, N. Pal, S.S. Ray, Preferential sampling of elastic chains in turbulent flows. Phys. Rev. Lett. 121(24), 244501 (2018). https://doi.org/10.1103/PhysRevLett.121.244501
M.E. Rosti, A.A. Banaei, L. Brandt, A. Mazzino, Flexible fiber reveals the two-point statistical properties of turbulence. Phys. Rev. Lett. 121(4), 044501 (2018). https://doi.org/10.1103/PhysRevLett.121.044501
Y.-N. Young, M.J. Shelley, Stretch-coil transition and transport of fibers in cellular flows. Phys. Rev. Lett. 99(5), 058303 (2007). https://doi.org/10.1103/PhysRevLett.99.058303
C. Brouzet, G. Verhille, P. Le Gal, Flexible fiber in a turbulent flow: a macroscopic polymer. Phys. Rev. Lett. 112(7), 074501 (2014). https://doi.org/10.1103/PhysRevLett.112.074501
S. Allende, C. Henry, J. Bec, Stretching and buckling of small elastic fibers in turbulence. Phys. Rev. Lett. 121(15), 154501 (2018). https://doi.org/10.1103/PhysRevLett.121.154501
J. Gray, H.W. Lissmann, The locomotion of nematodes. J. Exp. Biol. 41(1), 135–154 (1964). https://doi.org/10.1242/jeb.41.1.135
S. Berri, J.H. Boyle, M. Tassieri, I.A. Hope, N. Cohen, Forward locomotion of the nematode C. elegans is achieved through modulation of a single gait. HFSP J. 3(3), 186–193 (2009). https://doi.org/10.2976/1.3082260
B.M. Friedrich, I.H. Riedel-Kruse, J. Howard, F. Jülicher, High-precision tracking of sperm swimming fine structure provides strong test of resistive force theory. J. Exp. Biol. 213(8), 1226–1234 (2010). https://doi.org/10.1242/jeb.039800
J.F. Jikeli, L. Alvarez, B.M. Friedrich, L.G. Wilson, R. Pascal, R. Colin, M. Pichlo, A. Rennhack, C. Brenker, U.B. Kaupp, Sperm navigation along helical paths in 3D chemoattractant landscapes. Nat. Commun. 6, 1–10 (2015). https://doi.org/10.1038/ncomms8985
A.-K. Tornberg, M.J. Shelley, Simulating the dynamics and interactions of flexible fibers in Stokes flows. J. Comput. Phys. 196(1), 8–40 (2004). https://doi.org/10.1016/j.jcp.2003.10.017
D. Rothstein, E. Henry, J.P. Gollub, Persistent patterns in transient chaotic fluid mixing. Nature 401(6755), 770–772 (1999). https://doi.org/10.1038/44529
M. Hauskrecht, Value-function approximations for partially observable Markov decision processes. J. Artif. Intell. Res. 13, 33–94 (2000). https://doi.org/10.1613/jair.678
S.P. Singh, T. Jaakkola, M.I. Jordan, Learning without state-estimation in partially observable Markovian decision processes, in Machine Learning Proceedings 1994. ed. by W.W. Cohen, H. Hirsh (Morgan Kaufmann, San Francisco, 1994), pp.284–292. https://doi.org/10.1016/B978-1-55860-335-6.50042-8
R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (The MIT Press, Cambridge, 2018)
C.J. Watkins, P. Dayan, Q-learning. Mach. Learn. 8, 279–292 (1992). https://doi.org/10.1007/BF00992698
L. Berti, Z. El Khiyati, Y. Essousy, C. Prud’Homme, L. Giraldi, Reinforcement learning with function approximation for 3-spheres swimmer. IFAC-PapersOnLine 55(16), 1–6 (2022). https://doi.org/10.1016/j.ifacol.2022.08.072
A. Najafi, R. Golestanian, Simple swimmer at low Reynolds number: three linked spheres. Phys. Rev. E 69(6), 062901 (2004). https://doi.org/10.1103/PhysRevE.69.062901
V.R. Konda, J.N. Tsitsiklis, On actor-critic algorithms. SIAM J. Control. Optim. 42(4), 1143–1166 (2003). https://doi.org/10.1137/S0363012901385691
P. Perlekar, R. Pandit, Turbulence-induced melting of a nonequilibrium vortex crystal in a forced thin fluid film. New J. Phys. 12(2), 023033 (2010). https://doi.org/10.1088/1367-2630/12/2/023033
G. Michel, J. Herault, F. Pétrélis, S. Fauve, Bifurcations of a large-scale circulation in a quasi-bidimensional turbulent flow. Europhys. Lett. 115(6), 64004 (2016). https://doi.org/10.1209/0295-5075/115/64004
Acknowledgements
The authors are grateful to the OPAL infrastructure from Université Côte d’Azur for providing computational resources. This work received support from the UCA-JEDI Future Investments funded by the French government (Grant No. ANR-15-IDEX-01) and from the Agence Nationale de la Recherche (Grant No. ANR-21-CE30-0040-01).
Author information
Authors and Affiliations
Contributions
ZEK and JB performed research and wrote the paper. RC performed research. JB and LG designed research.
Corresponding author
Additional information
Quantitative AI in Complex Fluids and Complex Flows: Challenges and Benchmarks. Guest editors: Luca Biferale, Michele Buzzicotti, Massimo Cencini.
Appendices
Appendix A: Reinforcement-learning algorithms
We give here details on the three reinforcement-learning algorithms that have been used in this work and whose results are discussed in Sect. 4, namely Q-learning, differential semi-gradient SARSA, and policy gradient/Actor-Critic.
A.1 Q-learning
Q-learning is a model-free reinforcement-learning algorithm used to estimate the optimal action-value function of a given environment [38]. It relies on the discounted reward \(\mathcal {R}^\textrm{disc}[\pi ]\) defined by Eq. (4) with discount rate \(\gamma \). In this algorithm, an agent interacts with the environment, observes the state and reward, and takes an action according to an \(\varepsilon \)-greedy policy. The estimate of the action-value function Q defined in Sect. 3.2 is updated at each step, with a learning rate \(\lambda \), using the immediate reward R and the maximum estimated value over actions in the next state. The algorithm continues until convergence is reached or a stopping criterion is met.
The resulting procedure is summarised below:
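For concreteness, the update rule can be sketched as a minimal tabular version in Python. The environment interface assumed here (env.reset() and env.step(action), returning the next observed state and the immediate reward), as well as all names and hyperparameter values, are illustrative assumptions and not the implementation used in this work:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_steps=100_000,
               gamma=0.999, lam=0.05, eps=0.1, seed=0):
    """Tabular eps-greedy Q-learning for the discounted reward.

    `env` is an illustrative environment: env.reset() returns an initial
    state index, env.step(a) returns (next_state, reward).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # eps-greedy action selection
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = env.step(a)
        # TD update towards r + gamma * max_a' Q(s', a')
        Q[s, a] += lam * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```

The greedy policy is then read off as \(\pi(\omega) = \arg\max_\alpha Q(\omega,\alpha)\).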
A.2 Differential semi-gradient SARSA
This is another model-free reinforcement-learning algorithm, used to learn the optimal action-value function Q associated with the differential return \(\mathcal {R}^\textrm{diff}[\pi ]\) defined in Eq. (5). In this procedure, the agent interacts with the environment, observes the state and reward, and takes an action according to an \(\varepsilon \)-greedy policy derived from an approximation \(\hat{Q}_{\varvec{\eta }}\) of the action-value function. The approximation parameters \(\varvec{\eta }\) are updated with a learning rate \(\lambda _2\) using a temporal-difference (TD) error \(\delta \), which compares the observed reward, corrected by a running estimate \(\bar{R}\) of the average reward, to the change in the approximate action value between the current and next state-action pairs. Additionally, the algorithm updates the estimate \(\bar{R}\) at a rate \(\lambda _1\). Details can be found in [37].
The procedure is then:
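Under the same illustrative environment interface as above, a minimal sketch with one-hot (tabular) features, for which the semi-gradient reduces to a simple increment, reads:

```python
import numpy as np

def diff_sarsa(env, n_states, n_actions, n_steps=50_000,
               lam1=0.01, lam2=0.05, eps=0.1, seed=0):
    """Differential semi-gradient SARSA with a linear approximation
    Q_hat(s, a) = eta . phi(s, a); with one-hot features phi, the
    parameter table eta coincides with the Q-table itself.
    `env` is an illustrative environment (reset/step interface)."""
    rng = np.random.default_rng(seed)
    eta = np.zeros((n_states, n_actions))  # parameters of Q_hat
    R_bar = 0.0                            # running average-reward estimate

    def eps_greedy(s):
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax(eta[s]))

    s = env.reset()
    a = eps_greedy(s)
    for _ in range(n_steps):
        s2, r = env.step(a)
        a2 = eps_greedy(s2)
        # TD error of the differential return:
        # delta = R - R_bar + Q_hat(s', a') - Q_hat(s, a)
        delta = r - R_bar + eta[s2, a2] - eta[s, a]
        R_bar += lam1 * delta      # update the average-reward estimate
        eta[s, a] += lam2 * delta  # semi-gradient step (one-hot features)
        s, a = s2, a2
    return eta, R_bar
```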
A.3 Policy gradient/Actor-Critic
This last procedure is a model-free reinforcement-learning algorithm used to learn a policy that maximises the expected differential return \(\mathcal {R}^\textrm{diff}[\pi ]\). In this algorithm, the agent interacts with the environment, observes the state and reward, and takes an action according to an approximation \(\hat{\pi }_{\varvec{\theta }}\) of the optimal policy. The TD error \(\delta \) is this time built from the observed reward, corrected by the estimated average reward \(\bar{R}\), and the change in the estimated value \(\hat{V}_{\varvec{\eta }}\) between the current and next states. The algorithm uses it to update the estimated average reward \(\bar{R}\) at a rate \(\lambda _1\), and the approximation parameters \(\varvec{\eta }\) of the value function at a rate \(\lambda _2\). Additionally, the approximation parameters \(\varvec{\theta }\) of the policy are updated at a rate \(\lambda _3\) using both the TD error and the gradient of the log-likelihood of the policy with respect to \(\varvec{\theta }\). This algorithm is known as the Actor-Critic algorithm [41] because it simultaneously learns an actor (the policy \(\hat{\pi }_{\varvec{\theta }}\)) and a critic (the value function \(\hat{V}_{\varvec{\eta }}\)).
The resulting procedure is summarised below:
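Under the same illustrative assumptions, a tabular softmax actor paired with a tabular critic can be sketched as follows (for a softmax policy, the log-likelihood gradient with respect to the preferences of the visited state is the one-hot vector of the chosen action minus the action probabilities):

```python
import numpy as np

def actor_critic(env, n_states, n_actions, n_steps=30_000,
                 lam1=0.01, lam2=0.05, lam3=0.05, seed=0):
    """One-step Actor-Critic for the differential return, with a tabular
    softmax policy pi_theta and a tabular value estimate V_eta.
    `env` is an illustrative environment (reset/step interface)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))  # actor parameters
    V = np.zeros(n_states)                   # critic parameters
    R_bar = 0.0                              # average-reward estimate

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())  # stabilised softmax
        return p / p.sum()

    s = env.reset()
    for _ in range(n_steps):
        p = policy(s)
        a = int(rng.choice(n_actions, p=p))
        s2, r = env.step(a)
        # differential TD error: delta = R - R_bar + V(s') - V(s)
        delta = r - R_bar + V[s2] - V[s]
        R_bar += lam1 * delta          # average-reward update
        V[s] += lam2 * delta           # critic update
        grad_log = -p                  # grad of log pi(a|s) wrt theta[s]
        grad_log[a] += 1.0
        theta[s] += lam3 * delta * grad_log  # actor update
        s = s2
    return theta, V, R_bar
```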
Appendix B: The iterative Markovian approximation
We give here some details on the idea of approximating the dynamical evolution of the swimmer by a Markov decision process (MDP). The hope is that this approximation captures the most relevant information of our optimisation problem, namely the transition probabilities between the states of our environment and the distribution of the rewards obtained by our agent. The advantages of this approach are twofold. First, MDPs only require knowledge of the transitions and rewards, abstracting away all other aspects of the dynamics; learning algorithms can thus run significantly faster, without the need to simulate the entire system at each step. Second, this approach separates the issue of non-Markovianity from other potential difficulties.
Our procedure consists in constructing a sequence of policies \(\pi _0, \pi _1, \ldots , \pi _k\) that will hopefully converge to the optimal policy \(\pi _\star \). At each step, we simulate a swimmer that follows the policy \(\pi _k\). Once a statistical steady state is reached, we try out, at every time step \(t_n=n\Delta t\), all possible actions and monitor the new observation and reward at time \(t_{n+1}\). We then use long time averages (over \(10^6\,\Delta t\)) to construct numerical Monte-Carlo approximations of the transition probability \(p_{\textrm{T},k}(\omega '\vert \omega ,\alpha )\) of observing \(\omega '\) at time \(t+\Delta t\), given that \(\omega \) was observed and action \(\alpha \) was performed at time t, together with the corresponding distribution of rewards \(p_{\textrm{R},k}(R\vert \omega ,\alpha )\). Both distributions of course depend on \(\pi _k\). We then use the approximate probabilities \(p_{\textrm{T},k}\) and \(p_{\textrm{R},k}\) to run the \(\varepsilon \)-greedy Q-learning algorithm which, thanks to the Markovian formulation now imposed, is guaranteed to converge. This yields the optimal policy \(\pi _{k+1}\) of the approximate system. The procedure is then reiterated with \(\pi _{k+1}\) as the base policy, until a fixed point is attained. The method is summarised in Algorithm 4.
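Under the same illustrative environment interface as in Appendix A, the iteration can be sketched as follows. Two simplifications are made for the sake of a compact example: all actions are sampled through an \(\varepsilon \)-soft version of \(\pi _k\) (instead of probing every action at each step), and the approximate tabular MDP is solved by exact Q-value iteration, which yields the same greedy policy as a converged \(\varepsilon \)-greedy Q-learning on that MDP:

```python
import numpy as np

def markovian_iteration(env, n_obs, n_actions, pi0, n_samples=100_000,
                        gamma=0.99, n_vi=500, max_iter=10, eps=0.2, seed=0):
    """Iterative Markovian approximation: estimate p_T(w'|w,a) and the
    mean reward R(w,a) under the current policy, solve the resulting
    tabular MDP, and repeat with the greedy policy until a fixed point
    (or a cycle, as observed in the text) is reached."""
    rng = np.random.default_rng(seed)
    pi = np.array(pi0, dtype=int)
    history = [pi.copy()]
    for _ in range(max_iter):
        # Monte-Carlo estimation of transitions and mean rewards under
        # an eps-soft version of pi_k (tiny prior avoids division by 0).
        counts = np.full((n_obs, n_actions, n_obs), 1e-9)
        rew = np.zeros((n_obs, n_actions))
        w = env.reset()
        for _ in range(n_samples):
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(pi[w])
            w2, r = env.step(a)
            counts[w, a, w2] += 1.0
            rew[w, a] += r
            w = w2
        n_wa = counts.sum(axis=2)
        p_T = counts / n_wa[:, :, None]   # p_T(w'|w,a)
        R = rew / n_wa                    # mean reward R(w,a)
        # Solve the approximate MDP, now Markovian by construction,
        # by Q-value iteration: Q <- R + gamma * p_T . max_a' Q
        Q = np.zeros((n_obs, n_actions))
        for _ in range(n_vi):
            Q = R + gamma * p_T @ Q.max(axis=1)
        pi_next = Q.argmax(axis=1)
        if any((pi_next == p).all() for p in history):
            return pi_next                # fixed point or cycle reached
        history.append(pi_next.copy())
        pi = pi_next
    return pi
```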
The motivation behind this procedure is that, if the Markovian approximation is not too far off, the optimal policy \(\pi _{k+1}\) of the approximate system should at least improve on \(\pi _k\), if not coincide with the optimal policy of the real system. Hence, if the optimal policy \(\pi _\star \) is a fixed point of the procedure, the sequence \(\{\pi _k;\, k \ge 0\}\) would converge to it, thus solving our problem.
We have run this procedure, choosing for the initial policy \(\pi _0\) the naive strategy of Sect. 3.3. After three iterations, the algorithm circled back to the policy obtained at the first iteration (\(\pi _3 = \pi _1\)). Hence, the proposed procedure does not lead to any improvement over the naive policy. We interpret this as a sign of the highly non-Markovian nature of our setting.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
El Khiyati, Z., Chesneaux, R., Giraldi, L. et al. Steering undulatory micro-swimmers in a fluid flow through reinforcement learning. Eur. Phys. J. E 46, 43 (2023). https://doi.org/10.1140/epje/s10189-023-00293-8