
Steering undulatory micro-swimmers in a fluid flow through reinforcement learning

  • Regular Article – Flowing Matter
  • Published in: The European Physical Journal E


Abstract

This work aims at finding optimal navigation policies for thin, deformable microswimmers that progress in a viscous fluid by propagating a sinusoidal undulation along their slender body. These active filaments are embedded in a prescribed, non-homogeneous flow, in which their swimming undulations have to compete with the drifts, strains, and deformations imposed by the outer velocity field. Such an intricate situation, where swimming and navigation are tightly coupled, is addressed using various methods of reinforcement learning. Each swimmer has access only to restricted information on its configuration and must accordingly select an action from a limited set. The optimisation problem then consists in finding the policy leading to the most efficient displacement in a given direction. It is found that usual methods do not converge, a pitfall interpreted as the combined consequence of the non-Markovianity of the decision process and of the highly chaotic nature of the dynamics, which is responsible for a high variability of learning efficiencies. We nevertheless provide an alternative method to construct efficient policies, based on running several independent realisations of Q-learning. This allows the construction of a set of admissible policies whose properties can be studied in detail and compared to assess their efficiency and robustness.


Data availability statement

Data sharing is not applicable to this article as no datasets were generated during the current study. Numerical codes will be made available on reasonable request.

References

  1. Z. Wu, Y. Chen, D. Mukasa, O.S. Pak, W. Gao, Medical micro/nanorobots in complex media. Chem. Soc. Rev. 49, 8088–8112 (2020). https://doi.org/10.1039/d0cs00309c

  2. A. Servant, F. Qiu, M. Mazza, K. Kostarelos, B.J. Nelson, Controlled in vivo swimming of a swarm of bacteria-like microrobotic flagella. Adv. Mater. 27, 2981–2988 (2015). https://doi.org/10.1002/adma.201404444

  3. L. Berti, L. Giraldi, C. Prud’Homme, Swimming at low Reynolds number. ESAIM Proc. Surv. 67, 46–60 (2020). https://doi.org/10.1051/proc/202067004

  4. F. Alouges, A. DeSimone, L. Giraldi, M. Zoppello, Self-propulsion of slender micro-swimmers by curvature control: N-link swimmers. Int. J. Non Linear Mech. 56, 132–141 (2013). https://doi.org/10.1016/j.ijnonlinmec.2013.04.012

  5. X. Shen, P.E. Arratia, Undulatory swimming in viscoelastic fluids. Phys. Rev. Lett. 106(20), 208101 (2011). https://doi.org/10.1103/PhysRevLett.106.208101

  6. A. Daddi-Moussa-Ider, H. Löwen, B. Liebchen, Hydrodynamics can determine the optimal route for microswimmer navigation. Commun. Phys. 4, 1–11 (2021). https://doi.org/10.1038/s42005-021-00522-6

  7. I. Borazjani, F. Sotiropoulos, Numerical investigation of the hydrodynamics of anguilliform swimming in the transitional and inertial flow regimes. J. Exp. Biol. 212(4), 576–592 (2009). https://doi.org/10.1242/jeb.025007

  8. N. Cohen, J.H. Boyle, Swimming at low Reynolds number: a beginners guide to undulatory locomotion. Contemp. Phys. 51(2), 103–123 (2010). https://doi.org/10.1080/00107510903268381

  9. F. Alouges, A. DeSimone, L. Giraldi, Y. Or, O. Wiezel, Energy-optimal strokes for multi-link microswimmers: Purcell’s loops and Taylor’s waves reconciled. New J. Phys. 21(4), 043050 (2019). https://doi.org/10.1088/1367-2630/ab1142

  10. G. Reddy, V.N. Murthy, M. Vergassola, Olfactory sensing and navigation in turbulent environments. Annu. Rev. Condens. Matter Phys. 13(1), 191–213 (2022). https://doi.org/10.1146/annurev-conmatphys-031720-032754

  11. F. Cichos, K. Gustavsson, B. Mehlig, G. Volpe, Machine learning for active matter. Nat. Mach. Intell. 2, 94–103 (2020). https://doi.org/10.1038/s42256-020-0146-9

  12. G. Reddy, A. Celani, T.J. Sejnowski, M. Vergassola, Learning to soar in turbulent environments. Proc. Natl. Acad. Sci. 113, 4877–4884 (2016). https://doi.org/10.1073/pnas.1606075113

  13. S. Colabrese, K. Gustavsson, A. Celani, L. Biferale, Flow navigation by smart microswimmers via reinforcement learning. Phys. Rev. Lett. 118(15), 158004 (2017). https://doi.org/10.1103/PhysRevLett.118.158004

  14. K. Gustavsson, L. Biferale, A. Celani, S. Colabrese, Finding efficient swimming strategies in a three-dimensional chaotic flow by reinforcement learning. Eur. Phys. J. E 40, 1–6 (2017). https://doi.org/10.1140/epje/i2017-11602-9

  15. E. Schneider, H. Stark, Optimal steering of a smart active particle. Europhys. Lett. 127(6), 64003 (2019). https://doi.org/10.1209/0295-5075/127/64003

  16. S. Muiños-Landin, A. Fischer, V. Holubec, F. Cichos, Reinforcement learning with artificial microswimmers. Sci. Robot. 6(52), 9285 (2021). https://doi.org/10.1126/scirobotics.abd9285

  17. J. Qiu, N. Mousavi, K. Gustavsson, C. Xu, B. Mehlig, L. Zhao, Navigation of micro-swimmers in steady flow: the importance of symmetries. J. Fluid Mech. 932, 10 (2022). https://doi.org/10.1017/jfm.2021.978

  18. J.K. Alageshan, A.K. Verma, J. Bec, R. Pandit, Machine learning strategies for path-planning microswimmers in turbulent flows. Phys. Rev. E 101, 043110 (2020). https://doi.org/10.1103/PhysRevE.101.043110

  19. X.B. Peng, G. Berseth, M. Van de Panne, Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Trans. Graph. 35(4), 1–12 (2016). https://doi.org/10.1145/2897824.2925881

  20. S. Levine, C. Finn, T. Darrell, P. Abbeel, End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 17(1), 1334–1373 (2016). https://doi.org/10.5555/2946645.2946684

  21. O. Pironneau, D. Katz, Optimal swimming of flagellated micro-organisms. J. Fluid Mech. 66(2), 391–415 (1974). https://doi.org/10.1017/S0022112074000279

  22. A. Lindner, M.J. Shelley, Elastic fibers in flows, in Fluid-Structure Interactions in Low-Reynolds-Number Flows. ed. by C. Duprat, H.A. Stone (Royal Society of Chemistry, Cambridge, 2015), pp.168–192

  23. C. Moreau, L. Giraldi, H. Gadêlha, The asymptotic coarse-graining formulation of slender-rods, bio-filaments and flagella. J. R. Soc. Interface 15(144), 20180235 (2018). https://doi.org/10.1098/rsif.2018.0235

  24. J.R. Picardo, D. Vincenzi, N. Pal, S.S. Ray, Preferential sampling of elastic chains in turbulent flows. Phys. Rev. Lett. 121(24), 244501 (2018). https://doi.org/10.1103/PhysRevLett.121.244501

  25. M.E. Rosti, A.A. Banaei, L. Brandt, A. Mazzino, Flexible fiber reveals the two-point statistical properties of turbulence. Phys. Rev. Lett. 121(4), 044501 (2018). https://doi.org/10.1103/PhysRevLett.121.044501

  26. Y.-N. Young, M.J. Shelley, Stretch-coil transition and transport of fibers in cellular flows. Phys. Rev. Lett. 99(5), 058303 (2007). https://doi.org/10.1103/PhysRevLett.99.058303

  27. C. Brouzet, G. Verhille, P. Le Gal, Flexible fiber in a turbulent flow: a macroscopic polymer. Phys. Rev. Lett. 112(7), 074501 (2014). https://doi.org/10.1103/PhysRevLett.112.074501

  28. S. Allende, C. Henry, J. Bec, Stretching and buckling of small elastic fibers in turbulence. Phys. Rev. Lett. 121(15), 154501 (2018). https://doi.org/10.1103/PhysRevLett.121.154501

  29. J. Gray, H.W. Lissmann, The locomotion of nematodes. J. Exp. Biol. 41(1), 135–154 (1964). https://doi.org/10.1242/jeb.41.1.135

  30. S. Berri, J.H. Boyle, M. Tassieri, I.A. Hope, N. Cohen, Forward locomotion of the nematode C. elegans is achieved through modulation of a single gait. HFSP J. 3(3), 186–193 (2009). https://doi.org/10.2976/1.3082260

  31. B.M. Friedrich, I.H. Riedel-Kruse, J. Howard, F. Jülicher, High-precision tracking of sperm swimming fine structure provides strong test of resistive force theory. J. Exp. Biol. 213(8), 1226–1234 (2010). https://doi.org/10.1242/jeb.039800

  32. J.F. Jikeli, L. Alvarez, B.M. Friedrich, L.G. Wilson, R. Pascal, R. Colin, M. Pichlo, A. Rennhack, C. Brenker, U.B. Kaupp, Sperm navigation along helical paths in 3D chemoattractant landscapes. Nat. Commun. 6, 1–10 (2015). https://doi.org/10.1038/ncomms8985

  33. A.-K. Tornberg, M.J. Shelley, Simulating the dynamics and interactions of flexible fibers in Stokes flows. J. Comput. Phys. 196(1), 8–40 (2004). https://doi.org/10.1016/j.jcp.2003.10.017

  34. D. Rothstein, E. Henry, J.P. Gollub, Persistent patterns in transient chaotic fluid mixing. Nature 401(6755), 770–772 (1999). https://doi.org/10.1038/44529

  35. M. Hauskrecht, Value-function approximations for partially observable Markov decision processes. J. Artif. Intell. Res. 13, 33–94 (2000). https://doi.org/10.1613/jair.678

  36. S.P. Singh, T. Jaakkola, M.I. Jordan, Learning without state-estimation in partially observable Markovian decision processes, in Machine Learning Proceedings 1994. ed. by W.W. Cohen, H. Hirsh (Morgan Kaufmann, San Francisco, 1994), pp.284–292. https://doi.org/10.1016/B978-1-55860-335-6.50042-8

  37. R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (The MIT Press, Cambridge, 2018)

  38. C.J. Watkins, P. Dayan, Q-learning. Mach. Learn. 8, 279–292 (1992). https://doi.org/10.1007/BF00992698

  39. L. Berti, Z. El Khiyati, Y. Essousy, C. Prud’Homme, L. Giraldi, Reinforcement learning with function approximation for 3-spheres swimmer. IFAC-PapersOnLine 55(16), 1–6 (2022). https://doi.org/10.1016/j.ifacol.2022.08.072

  40. A. Najafi, R. Golestanian, Simple swimmer at low Reynolds number: three linked spheres. Phys. Rev. E 69(6), 062901 (2004). https://doi.org/10.1103/PhysRevE.69.062901

  41. V.R. Konda, J.N. Tsitsiklis, On actor-critic algorithms. SIAM J. Control. Optim. 42(4), 1143–1166 (2003). https://doi.org/10.1137/S03630129013856

  42. P. Perlekar, R. Pandit, Turbulence-induced melting of a nonequilibrium vortex crystal in a forced thin fluid film. New J. Phys. 12(2), 023033 (2010). https://doi.org/10.1088/1367-2630/12/2/023033

  43. G. Michel, J. Herault, F. Pétrélis, S. Fauve, Bifurcations of a large-scale circulation in a quasi-bidimensional turbulent flow. Europhys. Lett. 115(6), 64004 (2016). https://doi.org/10.1209/0295-5075/115/64004


Acknowledgements

The authors are grateful to the OPAL infrastructure from Université Côte d’Azur for providing computational resources. This work received support from the UCA-JEDI Future Investments funded by the French government (Grant No. ANR-15-IDEX-01) and from the Agence Nationale de la Recherche (Grant No. ANR-21-CE30-0040-01).

Author information

Contributions

ZEK and JB performed research and wrote the paper. RC performed research. JB and LG designed research.

Corresponding author

Correspondence to Jérémie Bec.

Additional information

Quantitative AI in Complex Fluids and Complex Flows: Challenges and Benchmarks. Guest editors: Luca Biferale, Michele Buzzicotti, Massimo Cencini.

Appendices

Appendix A: Reinforcement-learning algorithms

We give here details on the three reinforcement-learning algorithms that have been used in this work and whose results are discussed in Sect. 4, namely Q-learning, differential semi-gradient SARSA, and policy gradient/Actor-Critic.

A.1 Q-learning

Q-learning is a model-free reinforcement-learning algorithm used to estimate the optimal action-value function of a given environment [38]. It uses the discounted reward \(\mathcal {R}^\textrm{disc}[\pi ]\) defined by Eq. (4) with discount rate \(\gamma \). In this algorithm, an agent interacts with the environment, observes the state and reward, and takes an action according to an \(\varepsilon \)-greedy policy. An estimate of the action-value function Q defined in Sect. 3.2 is updated at each step, with learning rate \(\lambda \), from the immediate reward R and the maximum estimated value over actions in the next state, \(Q(\omega ,\alpha ) \leftarrow Q(\omega ,\alpha ) + \lambda \,[R + \gamma \max _{\alpha '} Q(\omega ',\alpha ') - Q(\omega ,\alpha )]\). The algorithm continues until convergence is reached or a stopping criterion is met.

The resulting procedure is summarised below:

[Algorithm 1: \(\varepsilon \)-greedy Q-learning]
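For illustration, the following minimal Python sketch shows what such a tabular \(\varepsilon \)-greedy Q-learning loop looks like. The environment interface (env.reset, env.step) and the finite sets of observations and actions are hypothetical placeholders, not the swimmer simulation used in this work.

```python
import random
from collections import defaultdict

def q_learning(env, actions, n_steps=100_000,
               gamma=0.999, lam=0.1, eps=0.05):
    """Tabular epsilon-greedy Q-learning (illustrative sketch).

    `env` is a hypothetical environment exposing reset() -> observation
    and step(action) -> (next_observation, reward); `actions` is the
    finite set of admissible actions.
    """
    Q = defaultdict(float)          # Q[(observation, action)] estimates

    def greedy(obs):
        return max(actions, key=lambda a: Q[(obs, a)])

    obs = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy exploration
        act = random.choice(actions) if random.random() < eps else greedy(obs)
        next_obs, reward = env.step(act)
        # TD update towards R + gamma * max_a' Q(omega', a')
        target = reward + gamma * max(Q[(next_obs, a)] for a in actions)
        Q[(obs, act)] += lam * (target - Q[(obs, act)])
        obs = next_obs

    # deterministic greedy policy extracted from the learnt Q-table
    observed = {o for o, _ in Q}
    return {o: greedy(o) for o in observed}
```

In this tabular setting the update rule is exactly the one described above; reliable estimates require that every observation–action pair be visited sufficiently often.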

A.2 Differential semi-gradient SARSA

This is another model-free reinforcement-learning algorithm, used to learn the optimal action-value function Q associated with the differential return \(\mathcal {R}^\textrm{diff}[\pi ]\) defined in Eq. (5). In this procedure, the agent interacts with the environment, observes the state and reward, and takes an action according to an \(\varepsilon \)-greedy policy based on an approximation \(\hat{Q}_{\varvec{\eta }}\) of the action-value function. The approximation parameters \(\varvec{\eta }\) are updated with a learning rate \(\lambda _2\) using a temporal-difference (TD) error \(\delta = R - \bar{R} + \hat{Q}_{\varvec{\eta }}(\omega ',\alpha ') - \hat{Q}_{\varvec{\eta }}(\omega ,\alpha )\), where \(\bar{R}\) is a running estimate of the average reward, namely \(\varvec{\eta } \leftarrow \varvec{\eta } + \lambda _2\,\delta \,\nabla _{\varvec{\eta }}\hat{Q}_{\varvec{\eta }}(\omega ,\alpha )\). Additionally, the algorithm updates the estimate \(\bar{R}\) at a rate \(\lambda _1\). Details can be found in [37].

The procedure is then:

[Algorithm 2: Differential semi-gradient SARSA]
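The following minimal sketch, assuming a linear approximation \(\hat{Q}_{\varvec{\eta }}(\omega ,\alpha ) = \varvec{\eta }\cdot \varvec{x}(\omega ,\alpha )\) with a hypothetical feature map `features` and the same placeholder environment interface as above, illustrates the differential semi-gradient SARSA updates described in the text.

```python
import numpy as np

def differential_sarsa(env, actions, features, dim, n_steps=100_000,
                       lam1=0.01, lam2=0.1, eps=0.05):
    """Differential semi-gradient SARSA (illustrative sketch).

    `features(obs, act)` returns a length-`dim` feature vector x(omega, alpha);
    the action-value approximation is Q_hat = eta . x.
    """
    rng = np.random.default_rng()
    eta = np.zeros(dim)          # weights of the linear approximation
    r_bar = 0.0                  # running estimate of the average reward

    def q_hat(obs, act):
        return eta @ features(obs, act)

    def eps_greedy(obs):
        if rng.random() < eps:
            return actions[rng.integers(len(actions))]
        return max(actions, key=lambda a: q_hat(obs, a))

    obs = env.reset()
    act = eps_greedy(obs)
    for _ in range(n_steps):
        next_obs, reward = env.step(act)
        next_act = eps_greedy(next_obs)
        # differential TD error: delta = R - R_bar + Q(o', a') - Q(o, a)
        delta = reward - r_bar + q_hat(next_obs, next_act) - q_hat(obs, act)
        r_bar += lam1 * delta                      # average-reward estimate
        eta += lam2 * delta * features(obs, act)   # semi-gradient weight update
        obs, act = next_obs, next_act

    return eta, r_bar
```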

A.3 Policy gradient/Actor-Critic

This last procedure is a policy-based, model-free reinforcement-learning algorithm used to learn a policy that maximises the expected differential return \(\mathcal {R}^\textrm{diff}[\pi ]\). In this algorithm, the agent interacts with the environment, observes the state and reward, and takes an action drawn from an approximation \(\hat{\pi }_{\varvec{\theta }}\) of the optimal policy. The TD error \(\delta \) is this time built from the estimated state value \(\hat{V}_{\varvec{\eta }}\), namely \(\delta = R - \bar{R} + \hat{V}_{\varvec{\eta }}(\omega ') - \hat{V}_{\varvec{\eta }}(\omega )\). The algorithm uses it to update the estimated average reward \(\bar{R}\) at a rate \(\lambda _1\), and the approximation parameters \(\varvec{\eta }\) of the value function at a rate \(\lambda _2\). Additionally, the approximation parameters \(\varvec{\theta }\) of the policy are updated at a rate \(\lambda _3\) using both the TD error and the gradient of the log-likelihood of the policy, \(\varvec{\theta } \leftarrow \varvec{\theta } + \lambda _3\,\delta \,\nabla _{\varvec{\theta }} \ln \hat{\pi }_{\varvec{\theta }}(\alpha \vert \omega )\). This algorithm is known as the Actor-Critic algorithm [41] because it simultaneously learns an actor (the policy \(\hat{\pi }_{\varvec{\theta }}\)) and a critic (the value function \(\hat{V}_{\varvec{\eta }}\)).

The resulting procedure is summarised below:

[Algorithm 3: Actor-Critic]
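A minimal sketch of such a one-step differential Actor-Critic is given below, with a linear critic and a softmax (Gibbs) actor over linear preferences; the feature maps (state_feat, sa_feat) and environment interface are again hypothetical placeholders.

```python
import numpy as np

def actor_critic(env, actions, state_feat, sa_feat, dim_v, dim_p,
                 n_steps=100_000, lam1=0.01, lam2=0.1, lam3=0.05):
    """One-step differential Actor-Critic (illustrative sketch).

    Critic:  V_hat(obs)        = eta . state_feat(obs)
    Actor:   pi_theta(a | obs) proportional to exp(theta . sa_feat(obs, a))
    """
    rng = np.random.default_rng()
    eta = np.zeros(dim_v)       # critic weights
    theta = np.zeros(dim_p)     # actor (policy) weights
    r_bar = 0.0                 # average-reward estimate

    def policy_probs(obs):
        prefs = np.array([theta @ sa_feat(obs, a) for a in actions])
        prefs -= prefs.max()                      # numerical stability
        p = np.exp(prefs)
        return p / p.sum()

    obs = env.reset()
    for _ in range(n_steps):
        p = policy_probs(obs)
        idx = rng.choice(len(actions), p=p)
        next_obs, reward = env.step(actions[idx])

        # differential TD error: delta = R - R_bar + V(o') - V(o)
        delta = reward - r_bar + eta @ state_feat(next_obs) - eta @ state_feat(obs)

        r_bar += lam1 * delta                     # average-reward update
        eta += lam2 * delta * state_feat(obs)     # critic update
        # actor update: grad log pi = x(o,a) - sum_b pi(b|o) x(o,b)
        grad_log_pi = sa_feat(obs, actions[idx]) - sum(
            pb * sa_feat(obs, b) for pb, b in zip(p, actions))
        theta += lam3 * delta * grad_log_pi

        obs = next_obs

    return theta, eta, r_bar
```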

Appendix B: The iterative Markovian approximation

We give here some details on the idea of approximating the dynamical evolution of the swimmer by an MDP. Our hope is that this approximation captures the most relevant information of our optimisation problem, namely the transition probabilities between the states of the environment and the distribution of the rewards obtained by the agent. The advantages of this approach are twofold. First, MDPs only require knowledge of the transitions and rewards, abstracting away all other aspects of the dynamics. This approximation thus enables learning algorithms to run significantly faster, without the need to simulate the entire system at each step. Second, this approach separates the issue of non-Markovianity from other potential difficulties.

Our procedure consists in constructing a sequence of policies \(\pi _0, \pi _1, \ldots , \pi _k, \ldots \) that will hopefully converge to the optimal policy \(\pi _\star \). At each step, we simulate a swimmer that follows the policy \(\pi _k\). Once a statistical steady state is reached, we try out, at every time step \(t=n\Delta t\), all possible actions and monitor the new observation and reward at time \(t_{n+1}\). We then use long time averages (over \(10^6\,\Delta t\)) to construct Monte-Carlo approximations of the transition probability \(p_{\textrm{T},k}(\omega '\vert \omega ,\alpha )\) of observing \(\omega '\) at time \(t+\Delta t\), given that \(\omega \) was observed and action \(\alpha \) was performed at time t, together with the corresponding distribution of rewards \(p_{\textrm{R},k}(R\vert \omega ,\alpha )\). Both distributions of course depend on \(\pi _k\). We then use the approximate probabilities \(p_{\textrm{T},k}\) and \(p_{\textrm{R},k}\) to run the \(\varepsilon \)-greedy Q-learning algorithm which, thanks to the Markovian formulation now imposed, is guaranteed to converge. This yields the optimal policy \(\pi _{k+1}\) of the approximate system. The procedure is then iterated with \(\pi _{k+1}\) as the new base policy, until a fixed point is attained. The method is summarised in Algorithm 4.

[Algorithm 4: Iterative Markovian approximation]
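The following schematic sketch illustrates the structure of this iterative procedure. The calls env.peek (trying an action without committing to it) and solve_mdp_q_learning (tabular Q-learning on the estimated transition and reward tables) are hypothetical placeholders for the corresponding steps of Algorithm 4, and the environment interface is the same placeholder as in the previous sketches.

```python
from collections import defaultdict

def markovian_iteration(env, observations, actions, pi0,
                        n_samples=1_000_000, max_iter=10):
    """Iterative Markovian approximation (illustrative sketch of Algorithm 4).

    At each outer iteration, the swimmer follows the current policy pi_k;
    every possible action is probed at each step to build Monte-Carlo
    estimates of the transition probabilities p_T(omega' | omega, alpha)
    and of the mean rewards, on which tabular Q-learning is then run.
    """
    pi = dict(pi0)
    seen = [dict(pi)]
    for _ in range(max_iter):
        counts = defaultdict(lambda: defaultdict(int))   # (o, a) -> {o': count}
        rew_sum = defaultdict(float)                     # (o, a) -> summed reward
        rew_n = defaultdict(int)

        obs = env.reset()
        for _ in range(n_samples):
            for act in actions:                          # probe every action
                next_obs, reward = env.peek(obs, act)    # hypothetical "try-out" call
                counts[(obs, act)][next_obs] += 1
                rew_sum[(obs, act)] += reward
                rew_n[(obs, act)] += 1
            obs, _ = env.step(pi[obs])                   # actually follow pi_k

        # normalised transitions and mean rewards of the approximate MDP
        p_T = {k: {o2: c / sum(v.values()) for o2, c in v.items()}
               for k, v in counts.items()}
        r_mean = {k: rew_sum[k] / rew_n[k] for k in rew_sum}

        # hypothetical placeholder: Q-learning on the approximate MDP
        new_pi = solve_mdp_q_learning(p_T, r_mean, observations, actions)
        if new_pi in seen:                               # fixed point (or cycle) reached
            return new_pi
        seen.append(new_pi)
        pi = new_pi
    return pi
```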

The motivation behind this procedure is that, if the Markovian approximation is not too far off, the optimal policy \(\pi _{k+1}\) of the approximate system should be at least an improvement on \(\pi _k\), if not the optimal policy of the real system itself. Hence, if the optimal policy \(\pi _\star \) is a fixed point of our procedure, the sequence \(\{\pi _k;\, k \ge 0\}\) can be expected to converge to it, thus solving our problem.

We have run this procedure, choosing as the initial policy \(\pi _0\) the naive strategy of Sect. 3.3. After three iterations, the algorithm circled back to the policy encountered at the first iteration, \(\pi _3 = \pi _1\). Hence, the proposed procedure does not lead to any improvement with respect to the naive policy. This is interpreted as a sign of the highly non-Markovian nature of our setting.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

El Khiyati, Z., Chesneaux, R., Giraldi, L. et al. Steering undulatory micro-swimmers in a fluid flow through reinforcement learning. Eur. Phys. J. E 46, 43 (2023). https://doi.org/10.1140/epje/s10189-023-00293-8

