Abstract
We introduce a new parser generator, called Berry–Sethi Parser (BSP), for ambiguous regular expressions (RE). The generator constructs a deterministic finite-state transducer that recognizes an input string, as the classical Berry–Sethi algorithm does, and additionally outputs a linear representation of all the syntax trees of the string; for infinitely ambiguous strings, a policy for selecting representative sets of trees is chosen. To construct the transducer, the RE symbols, including letters, parentheses and other metasymbols, are distinctly numbered, so that the corresponding language becomes locally testable. In this way a deterministic position automaton can be constructed, which recognizes the input and translates it into a compact DAG representation of the syntax trees. The correctness of the construction is proved. The transducer operates in time linear in the length of the input. Its descriptive complexity is analyzed as a function of established RE parameters: the alphabetic width, the number of null string symbols and the height of the RE tree. A condition for checking RE ambiguity on the transducer graph is stated. Experimental results of running the parser generator and the parser on a large RE collection are presented. The POSIX RE disambiguation criterion has also been applied to the parser.
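As a point of reference for the construction summarized above, the following is a minimal sketch of the classical Berry–Sethi position automaton used as a recognizer only. It is not the BSP transducer: it numbers letters but not parentheses or other metasymbols, and it outputs no syntax trees. The tuple encoding of REs and the names `number`, `analyse` and `accepts` are assumptions made for this illustration, and the automaton is determinized on the fly by tracking sets of positions.

```python
from itertools import count

# Hypothetical RE encoding (an assumption of this sketch):
#   ("sym", a) | ("eps",) | ("cat", r, s) | ("alt", r, s) | ("star", r)

def number(re, counter, letters):
    """Copy re, giving every letter occurrence a distinct position number."""
    kind = re[0]
    if kind == "sym":
        pos = next(counter)
        letters[pos] = re[1]          # remember which letter sits at pos
        return ("pos", pos)
    if kind == "eps":
        return re
    if kind == "star":
        return ("star", number(re[1], counter, letters))
    return (kind, number(re[1], counter, letters),
            number(re[2], counter, letters))

def analyse(re, follow):
    """Return (nullable, first, last) of a numbered RE and fill follow."""
    kind = re[0]
    if kind == "pos":
        follow.setdefault(re[1], set())
        return False, {re[1]}, {re[1]}
    if kind == "eps":
        return True, set(), set()
    if kind == "star":
        _, first, last = analyse(re[1], follow)
        for p in last:                # iteration: last positions loop to first
            follow[p] |= first
        return True, first, last
    n1, f1, l1 = analyse(re[1], follow)
    n2, f2, l2 = analyse(re[2], follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    for p in l1:                      # concatenation: last of r precedes first of s
        follow[p] |= f2
    return (n1 and n2,
            (f1 | f2) if n1 else f1,
            (l1 | l2) if n2 else l2)

def accepts(re, text):
    """Recognize text by tracking the set of positions that may have been read last."""
    letters, follow = {}, {}
    numbered = number(re, count(1), letters)
    nullable, first, last = analyse(numbered, follow)
    if not text:
        return nullable
    state = {p for p in first if letters[p] == text[0]}
    for ch in text[1:]:
        state = {q for p in state for q in follow[p] if letters[q] == ch}
    return bool(state & last)

# (a | ab)(b | ε): the string "ab" has two syntax trees, yet recognition is deterministic.
AMBIGUOUS = ("cat",
             ("alt", ("sym", "a"), ("cat", ("sym", "a"), ("sym", "b"))),
             ("alt", ("sym", "b"), ("eps",)))
print(accepts(AMBIGUOUS, "ab"))   # True
```

Because every letter occurrence carries its own number, the numbered language is local: the sets `first`, `last` and `follow` determine it completely. The recognizer above cannot distinguish the two syntax trees of "ab"; according to the abstract, BSP recovers them by numbering the parentheses and metasymbols as well.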
Notes
The code is available at https://github.com/FLC-project/BSP together with the input data used for the experiments.
The benchmark and generator codes are available at https://github.com/FLC-project/BSP.
On a computer with an AMD Athlon 64 X2 4200+ processor clocked at 2.2 GHz, running the Windows 10 operating system.
Since RE2 outputs one tree and is coded in C++, to offset the difference due to the programming language we implemented a version of BSP that uses POSIX disambiguation for selecting one tree and is also coded in C++; some experimental results are available at https://github.com/FLC-project/BSP. A systematic experimental comparison between existing RE parsing algorithms would be interesting, but it requires more research and presents practical difficulties. Only a few published algorithms come with well-engineered and available programs, and such programs may be coded in different languages. Moreover, the parsing processes may return information on the syntax trees that is not comparable. Lastly, such research has to face the problem of choosing an unbiased collection of REs as a benchmark.
References
Aaraj, N., Raghunathan, A., Jha, N.K.: Dynamic binary instrumentation-based framework for malware defense. In: Zamboni, D. (ed.) DIMVA, LNCS, vol. 5137, pp. 64–87. Springer (2008)
Allauzen, C., Mohri, M.: A unified construction of the Glushkov, Follow, and Antimirov automata. In: Kralovic, R., Urzyczyn, P. (eds.) MFCS, LNCS, vol. 4162, pp. 110–121. Springer (2006)
Berry, G., Sethi, R.: From regular expressions to deterministic automata. Theor. Comput. Sci. 48(1), 117–126 (1986)
Berstel, J., Pin, J.E.: Local languages and the Berry–Sethi algorithm. Theor. Comput. Sci. 155(2), 439–446 (1996)
Bille, P., Gørtz, I.L.: From regular expression matching to parsing. In: Rossmanith, P., Heggernes, P., Katoen, J. (eds.) MFCS, LIPIcs, vol. 138, pp. 71:1–71:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2019)
Book, R., Even, S., Greibach, S., Ott, G.: Ambiguity in graphs and expressions. IEEE Trans. Comput. C-20(2), 149–153 (1971)
Borsotti, A., Breveglieri, L., Crespi Reghizzi, S., Morzenti, A.: From ambiguous regular expressions to deterministic parsing automata. In: Drewes, F. (ed.) CIAA, LNCS, vol. 9223, pp. 35–48. Springer (2015)
Borsotti, A., Breveglieri, L., Crespi Reghizzi, S., Morzenti, A.: A benchmark production tool for regular expressions. In: Hospodár, M., Jirásková, G. (eds.) CIAA, LNCS, vol. 11601, pp. 95–107. Springer (2019)
Crespi Reghizzi, S., Breveglieri, L., Morzenti, A.: Formal Languages and Compilation. Texts in Computer Science, 3rd edn. Springer, Berlin (2019)
Dubé, D., Feeley, M.: Efficiently building a parse tree from a regular expression. Acta Inf. 37(2), 121–144 (2000)
Frisch, A., Cardelli, L.: Greedy regular expression matching. In: Díaz, J., Karhumäki, J., Lepistö, A., Sannella, D. (eds.) ICALP, LNCS, vol. 3142, pp. 618–629. Springer (2004)
Grathwohl, N., Henglein, F., Nielsen, L., Rasmussen, U.: Two-pass greedy regular expression parsing. In: Konstantinidis, S. (ed.) CIAA, LNCS, vol. 7982, pp. 60–71. Springer (2013)
Gruber, H., Holzer, M.: From finite automata to regular expressions and back—a summary on descriptional complexity. Int. J. Found. Comput. Sci. 26(8), 1009–1040 (2015)
Haber, S., Horne, W., Manadhata, P., Mowbray, M., Rao, P.: Efficient submatch extraction for practical regular expressions. In: Dediu, A.H., Martín-Vide, C., Truthe, B. (eds.) LATA, LNCS, vol. 7810, pp. 323–334. Springer (2013)
IEEE: std. 1003.2, POSIX, regular expression notation, section 2.8 (1992)
Kearns, S.: Extending regular expressions with context operators and parse extraction. Softw. Pract. Exp. 21(8), 787–804 (1991)
Laurikari, V.: NFAs with tagged transitions, their conversion to deterministic automata and application to regular expressions. In: de la Fuente, P. (ed.) SPIRE, pp. 181–187. IEEE Computer Society (2000)
McNaughton, R., Papert, S.: Counter-Free Automata. MIT Press, Cambridge (1971)
Nielsen, L., Henglein, F.: Bit-coded regular expression parsing. In: Dediu, A.H., Inenaga, S., Martín-Vide, C. (eds.) LATA, LNCS, vol. 6638, pp. 402–413. Springer (2011)
Okui, S., Suzuki, T.: Disambiguation in regular expression matching via position automata with augmented transitions. In: Domaratzki, M., Salomaa, K. (eds.) CIAA, LNCS, vol. 6482, pp. 231–240. Springer (2010)
Schwarz, N., Karper, A., Nierstrasz, O.: Efficiently extracting full parse trees using regular expressions with capture groups. PeerJ PrePrints 3, e1248 (2015)
Sulzmann, M., Lu, K.Z.M.: POSIX regular expression parsing with derivatives. In: Codish, M., Sumii, E. (eds.) FLOPS, LNCS, vol. 8475, pp. 203–220. Springer (2014)
Sulzmann, M., Lu, K.Z.M.: Derivative-based diagnosis of regular expression ambiguity. Int. J. Found. Comput. Sci. 28(5), 543–562 (2017)
Watson, B.: A taxonomy of finite automata construction algorithms. Technical Report, Computing Science Notes, Technische Univ. Eindhoven (1993)
Acknowledgements
To the anonymous reviewers for their valuable suggestions and references.
Additional information
A preliminary version of the first part of this work is in [7].
About this article
Cite this article
Borsotti, A., Breveglieri, L., Crespi Reghizzi, S. et al. A deterministic parsing algorithm for ambiguous regular expressions. Acta Informatica 58, 195–229 (2021). https://doi.org/10.1007/s00236-020-00366-7
DOI: https://doi.org/10.1007/s00236-020-00366-7