Measuring the Readability of Geometric Proofs: The Area Method Case

Using an approach inspired by our modernisation of Lemoine’s Geometrography, this paper proposes a new readability criterion for formal proofs produced by automated theorem provers for geometry. We analyse two criteria to measure the readability of a proof: the criterion given by Chou et al. and the one given by Wiedijk. After discussing the limitations of these two criteria, we introduce a novel approach, which provides a new criterion. We conclude by discussing some future work.

text, e.g. its typographical aspects. Readability has to be distinguished from legibility, which is the ease with which a reader can recognise individual characters in a text.
In order to quantify the readability of a text, various formulas have been defined [5]. In this paper we deal with the readability of geometric proofs produced by automatic theorem provers. One approach followed in the past is to take formulas that were developed for non-scientific texts and apply them to mathematical texts [13]. However, this sort of approach has not been properly extended to measure the readability of proofs produced by automatic theorem provers, and much more work needs to be done¹ [6,12,19,22,26].
A mathematical text is composed of many elements: descriptions in natural language, formulas and diagrams; thus, it is much more difficult to quantify its readability through formulas than in the case of regular text. Even more complex is the problem of the readability of mathematical proofs produced by automatic provers, which are often presented in a form that can only be read by experts.
In this paper, we will introduce both a language to formulate readability criteria for formal proofs produced by automated theorem provers for geometry, based on the area method [2,10] (see Appendix 1), and a novel criterion based on our modernisation of Lemoine's Geometrography [14,22,24]. We will show how this new criterion is consistent with the results of the other already existing criteria, but that it is also more general and expressive compared to the others.
The proposed language allows for an easy formulation of new readability criteria as well as for an easy implementation of those criteria in repositories such as the Thousands of Geometric problems for geometric Theorem Provers (TGTP),² thereby collecting data relevant to the further development of the area of automated theorem proving in geometry. The data will help strengthen the use of automatic tools not only in research but also in applications such as mathematics education, where the use of automated deduction is already making its way [7,23]. Therefore, as in the Automath project, the formulation of a readability criterion will allow the definition of a threshold below which "people will start using them (the proofs produced by automated theorem provers³) for serious work" (see §2.2).
Overview of the paper. The paper is organised as follows: first, in §2, the known readability criteria are discussed. In §3, Lemoine's Geometrography, its modernisation and a formal language employed to study the readability of formal proofs produced by automated theorem provers for geometry, based on the area method, are analysed. In §4, a new readability criterion that uses Geometrography is presented, together with some examples of its application. In §5 conclusions are drawn and future work is discussed.

Criteria of Readability (by Experts)
To the best of our knowledge there are two precise proposals to measure the readability of a proof. The first one is that proposed by Chou et al. [1, p. 452], while the second is that proposed by Freek Wiedijk and is known as the de Bruijn factor [4,27].
¹ We will not consider here the problem of the readability of geometric proofs from the Mathematics Education point of view. We will address some issues related to that context in the conclusions.
² http://hilbert.mat.uc.pt/TGTP/index.php.
³ In the original sentence, "Automath like Systems".

Maxt-Lems Criterion
Chou et al. [1, p. 452] proposed a way to measure how difficult it is to read a formal proof obtained by using an automated theorem prover for geometry (GATP) implementing the area method. The Maxt-Lems (ML) criterion considers the pair (maxt, lems), where:
- maxt is the number of terms of the maximal polynomial occurring in the machine's proof. Thus, maxt measures the number of computations needed in the proof;
- lems is the number of elimination lemmas used to eliminate points from geometric quantities. In other words, lems indicates the number of deduction steps in the proof.
Using those two elements and analysing all the proofs done by their GATP, they managed to determine an indicative threshold for readability. According to [1, p. 452], a formal proof which employs the area method is considered readable if one of the following conditions holds:
- the maximal term in the proof is less than or equal to 5;
- the number of deduction steps of the proof is less than or equal to 10;
- the maximal term in the proof is less than or equal to 10 and the deduction steps are less than or equal to 20.
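The three conditions above can be checked mechanically. The following is a minimal sketch in Python; the function name `is_readable_ml` is ours and is not part of any existing tool.

```python
def is_readable_ml(maxt: int, lems: int) -> bool:
    """Return True when the pair (maxt, lems) meets at least one ML condition."""
    return (
        maxt <= 5                        # small maximal polynomial
        or lems <= 10                    # few deduction steps
        or (maxt <= 10 and lems <= 20)   # moderate on both counts
    )


if __name__ == "__main__":
    # A proof with maxt = 1 and lems = 3 satisfies the first two conditions.
    print(is_readable_ml(1, 3))
    # A proof with maxt = 12 and lems = 25 satisfies none of them.
    print(is_readable_ml(12, 25))
```

Since the conditions are joined by "or", a proof with a very large maximal polynomial still counts as readable whenever it has at most 10 deduction steps.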
Let us consider, for example, problem GEO0001 of the Thousands of Geometric problems for geometric Theorem Provers (TGTP) repository, Ceva's Theorem.
Theorem 1 (Ceva's Theorem) Let ABC be a triangle and P be any point in the plane. Let D = AP ∩ CB, E = BP ∩ AC, and F = CP ∩ AB. Show that: $\frac{\overline{AF}}{\overline{FB}} \cdot \frac{\overline{BD}}{\overline{DC}} \cdot \frac{\overline{CE}}{\overline{EA}} = 1$. P should not be in the lines parallel to AC, AB and BC and passing through B, C and A respectively [10].
With respect to the ML criterion, considering the proof made by the Geometry Constructions LaTeX Converter (GCLC) [11] GATP (see Appendix 1), the values are: maxt = 1 and lems = 3. Therefore, this would be considered a readable proof.

The de Bruijn Factor
The Automath project had the goal of developing a system that would allow one to write entire mathematical theories in such a precise fashion that verification of the correctness of theorems in such theories could be carried out by formal (mechanical) operations applied directly to the text [4]. This was a first effort in the direction of the Formalisation of Mathematics that is now pursued by researchers working with systems like Coq, Isabelle and Mizar.⁵
In "A Survey of Project Automath", de Bruijn introduced the loss factor between the size of an ordinary mathematical exposition and its full formal translation inside a computer. The loss factor expresses what someone loses, in terms of shortness, when translating informal mathematics into Automath. Wiedijk developed the concept and called it the de Bruijn factor. The de Bruijn factor was developed for a situation where a proof is entered in a computer in full detail in such a way that the computer can check its correctness, e.g., when an existing informal mathematical text is translated into a computer representation (using a system like Automath). So the de Bruijn factor measures how efficient a system is [27].
Wiedijk noted that irrelevant formatting choices could affect the calculation of the loss factor. For example, if indentation is performed with the tab key, it can be eight times smaller than indentation done with the space key; also, the TeX macro name for the '⇔' symbol uses 15 characters, while an encoding like "<=>" uses only 3. To smooth out such formatting choices, Wiedijk proposed to compress the files before calculating the ratio of their sizes. Wiedijk calls the ratio of the uncompressed file sizes the apparent de Bruijn factor, and the ratio of the compressed file sizes the intrinsic de Bruijn factor [27].
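The two ratios can be illustrated with a short sketch. It assumes that both texts are available as byte strings and uses gzip as the compressor, which is a choice made for this illustration; the function name and the sample texts are hypothetical.

```python
import gzip


def de_bruijn_factors(informal: bytes, formal: bytes) -> tuple[float, float]:
    """Return (apparent, intrinsic) de Bruijn factors: formal size over informal size."""
    if not informal:
        raise ValueError("the informal text must not be empty")
    # Apparent factor: ratio of the uncompressed sizes.
    apparent = len(formal) / len(informal)
    # Intrinsic factor: ratio of the compressed sizes (gzip is an assumption here).
    intrinsic = len(gzip.compress(formal)) / len(gzip.compress(informal))
    return apparent, intrinsic


if __name__ == "__main__":
    # Hypothetical stand-ins for an informal proof and its formal counterpart.
    informal_text = b"By Ceva's theorem the three cevians are concurrent."
    formal_text = b"(eliminate F) (eliminate E) (eliminate D) (simplify) (qed)" * 3
    apparent, intrinsic = de_bruijn_factors(informal_text, formal_text)
    print(f"apparent = {apparent:.2f}, intrinsic = {intrinsic:.2f}")
```

Compression removes most of the redundancy introduced by indentation and long macro names, which is why the intrinsic factor is less sensitive to formatting than the apparent one.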
We claim that the de Bruijn factor can be used, in a broader sense, to measure the efficiency of an automated theorem prover and a given axiomatisation. Whenever an informal proof is known for a given theorem, it can be compared with the formal proof produced by the automated theorem prover, using a specified axiomatisation. This is particularly true in geometry, where a given informal geometric proof can be compared with a formal proof, also geometric, produced by a geometric automated theorem prover.
Using Ceva's Theorem again as an example, the readability of its formal proof with respect to the de Bruijn factor can be calculated⁶ (see Table 1).
Wiedijk also introduced the de Bruijn threshold, i.e., a limit below which "the people will start using them (Automath like system) for serious work". We will consider the value of 2 as a readability threshold. Further studies are needed in order to establish a readability threshold for automated proofs, using the de Bruijn factor. Moreover, a broader comparison between formal proofs and informal proofs is needed.
Considering the quotient of the size of the compressed formal proof (area method) and the size of the informal proof, the de Bruijn factor of Ceva's Theorem is 1.09. It would therefore be sensible to consider the GCLC area method proof readable.

ML and de Bruijn Factor's Limits
Analysing the previous criteria, we can note a first limit for both the ML criterion and the de Bruijn factor: they assume that readability by experts is being considered, i.e., by a geometer expert in the language of the prover that produces the proof.
A second limit emerges when the following classification [22] of formal geometric proofs produced by GATPs is taken into consideration:⁷
1. no readable proof, only a proved/not proved output;
2. non-synthetic proof (i.e., a proof without a corresponding geometric description, e.g. algebraic methods);
3. semi-synthetic proof with a corresponding prover's language rendering;
4. (semi-)synthetic proof with a corresponding natural language rendering;
5. (semi-)synthetic proof with a corresponding natural language and visual rendering.
Relating the ML criterion to this classification, we can note that the criterion only allows the definition of a threshold for semi-synthetic proofs that employ the area method (level 3). The direct applicability of the ML criterion to other synthetic methods, e.g. full-angle methods or the deductive database method [3,28], would be possible, considering the number of deduction steps of the proofs and adapting the condition regarding the maximal term in the proofs.
The de Bruijn factor can be used directly in all levels above 1, although it is more meaningful on levels greater or equal than 2.3. Considering GCLC and its integrated GATPs, based on the area method, Wu's method and the Gröbner basis method [9], it is possible to calculate the readability of the proofs developed using the different GATPs. It is indeed possible to imagine, extrapolating from the results with the area method, that all those proofs would be readable, and this would hold even though the de Bruijn factor requires informal proofs to be provided.⁸
The two criteria analysed are very different: the first is very specific while the second is very generic, although both criteria require readability by experts. We can therefore ask ourselves whether it is possible to define a new criterion which does not require readability by experts, which is also more natural and expressive than the previous ones, and which can be generalised to various proof methods.

Looking for a More Natural Readability Criterion
The new criterion that we want to propose is based on our modernisation of Lemoine's Geometrography [14,22,24]. We will begin by explaining what Geometrography is and what its modernisation consists of.
Geometrography, "alias the art of geometric constructions", aims at providing a tool: (i) to designate every geometric construction by a symbol that manifests its simplicity and exactitude;⁹ (ii) to teach the simplest way to execute an assigned construction; (iii) to discuss a known solution to a problem and possibly replace it with a better solution; (iv) to compare different solutions for a problem, by deciding which is the most exact and the simplest solution from the point of view of Geometrography [14-17, 20, 22, 24].
⁷ GATPs can be of two major types: algebraic, where the proof, if it exists, is done by resorting to algebraic reasoning (e.g. Gröbner basis); and geometric (synthetic), where the proof, if it exists, is done by resorting to a set of axioms and inference rules of geometry, without the use of coordinates. Semi-synthetic methods, e.g. the area method, also use the axioms of a field of characteristic different from 2.
⁸ Wu's method and the Gröbner basis method are both algebraic methods; from the geometric point of view their proofs are unreadable (level 2).

Classical Geometrography
In Lemoine's Geometrography two coefficients are defined to measure the relative difficulty to perform some geometric constructions. The approach is applied to ruler and compass geometry, i.e., geometric constructions made only with the help of a ruler and a compass. Considering the modifications proposed by Mackay [16], the following Ruler and Compass constructions and the corresponding coefficients can be analysed.
where $l_i$ and $m_j$ are coefficients denoting the number of times any particular operation is performed. The number $(l_1 + l_2 + m_1 + m_2)$ is called the coefficient of simplicity (cs) of the construction, and it denotes the total number of operations performed. The number $(l_1 + m_1)$ is called the coefficient of exactitude (ce) of the construction, and it denotes the number of preparatory operations on which the exactitude of the construction (made with the help of physical, inaccurate tools) depends [16,17].
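As a small illustration of the two coefficients, the following Python sketch computes cs and ce from the four operation counts. The class name and the sample counts are hypothetical and are not taken from any particular construction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationCounts:
    """Number of times each ruler (l1, l2) and compass (m1, m2) operation is used."""
    l1: int
    l2: int
    m1: int
    m2: int

    def simplicity(self) -> int:
        # cs: total number of operations performed.
        return self.l1 + self.l2 + self.m1 + self.m2

    def exactitude(self) -> int:
        # ce: number of preparatory operations only.
        return self.l1 + self.m1


if __name__ == "__main__":
    counts = OperationCounts(l1=2, l2=1, m1=3, m2=2)  # hypothetical counts
    print(counts.simplicity(), counts.exactitude())  # 8 5
```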

Geometrography in Dynamic Geometry
Classical Geometrography applies to geometric constructions made with the help of a ruler and a compass. Its modernisation, proposed in [22,24], uses the tools of dynamic geometry systems (DGS). In [22] it was shown how to modernise Geometrography using GCLC, while in [24] the generality of the approach is shown, using GeoGebra [8].
Considering the operations "define a point anywhere in the plane", D, and "define a given object using other objects", C, the following values for the GCLC basic constructions are obtained:
In the modernisation (extrapolation) of Geometrography, considering the "tools" of dynamic geometry systems, the coefficient of exactitude loses its meaning: the constructions are executed by the DGS, so they are accurate (exact). However, the coefficient of simplicity of the constructions can still be useful, as it can be used to classify constructions by levels of simplicity. A new dimension can also be added, the coefficient of freedom (cf), given by the degrees of freedom a given geometric object has, e.g. a point on a line has one degree of freedom, a point in the plane has two degrees of freedom, etc. This new coefficient gives a value to the dynamism of the geometric construction. The degrees of freedom are measured against the point definitions. The point definition defines a point with two degrees of freedom; the onsegment, online and oncircle constructions define points with one degree of freedom.
For the GCLC constructions contained in TGTP, an average value of simplicity ($CS_{gcl}$) of 20.8 was obtained. Using the k-means clustering function implemented in the statistics package of Octave,¹⁰ three classes of geometric constructions describing an increasing level of complexity were defined: simple constructions, $1 \le CS_{gcl} \le 18$; average complexity constructions, $18 < CS_{gcl} \le 28$; complex constructions, $CS_{gcl} > 28$.
For example, TGTP problem GEO0369, "In triangle ABC, let F be the midpoint of the side BC, and D and E the feet of the altitudes on AB and AC, respectively. FG is perpendicular to DE at G. Show that G is the midpoint of DE", has a geometric construction with coefficient of simplicity 19 (see Fig. 2), so it is an average complexity construction. The value of 6 for its coefficient of freedom is given by the fact that only the three points A, B, and C are free in the plane, while all the other points are completely bound by construction.
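The classification by simplicity and the coefficient of freedom can be sketched as follows. The thresholds and the degrees of freedom of the point, onsegment, online and oncircle constructions come from the text; the function names, the treatment of every other command as fully bound, and the command list used in the example are assumptions made for illustration.

```python
# Degrees of freedom per point-defining construction (from the text).
DEGREES_OF_FREEDOM = {"point": 2, "onsegment": 1, "online": 1, "oncircle": 1}


def simplicity_class(cs_gcl: int) -> str:
    """Map a coefficient of simplicity to one of the three k-means classes."""
    if cs_gcl < 1:
        raise ValueError("the coefficient of simplicity must be at least 1")
    if cs_gcl <= 18:
        return "simple"
    if cs_gcl <= 28:
        return "average"
    return "complex"


def coefficient_of_freedom(commands: list[str]) -> int:
    """Sum the degrees of freedom; commands not listed above count as bound (0)."""
    return sum(DEGREES_OF_FREEDOM.get(command, 0) for command in commands)


if __name__ == "__main__":
    # Illustrative command list: three free points, every other object bound.
    commands = ["point", "point", "point", "midpoint", "foot", "foot"]
    print(simplicity_class(19))              # average
    print(coefficient_of_freedom(commands))  # 6
```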

Geometrography in Automatic Theorem Proving
The same approach can be (again) extrapolated to take into consideration synthetic geometric proofs, i.e., proofs based on a geometric axiomatic theory, using geometric inference rules.
Considering the proofs produced by the GATP GCLC, implementing the area method [9,10],¹¹ the coefficient of simplicity for all the axioms and lemmas of the theory can be calculated.
Apart from the geometric constructions on which the proof is based (with coefficient of simplicity nCnst), there are other steps to be considered. A given proof can thus be measured against the number of those steps.¹² For a given proof expressed by the equation:
where $n_1$ is the coefficient of simplicity of the geometric construction, $n_2$ is the number of algebraic simplifications and $n_3$ is the number of geometric simplifications.
The coefficient of simplicity for the proof would be:
The coefficient of freedom has no meaning in this setting. Each lemma of the area method, $AML_j$, has a corresponding simplicity coefficient; the term $\sum_{j=l_1}^{l_k} CS_{proof}(AML_j)$ is the sum of all those values, for all the lemmas used in the proof. In order to achieve this, the corresponding coefficients of simplicity were calculated for each lemma of the area method [21].
For example, the proof of Lemma 9 will have the following coefficient of simplicity: $CS_{proof}(AML_9) = 74$.
¹¹ The proofs developed by GATPs based on the area method are formal proofs. The method itself was formalised, and proved sound, using the Coq proof assistant. The GATP developed by J. Narboux, as a Coq tactic, can have its proofs verified by Coq. The GCLC area method prover does not explicitly have that possibility, but it would be a matter of developing a filter from the GCLC language to the Coq language (see Appendix 1).
¹² By elementary algebraic simplification we mean the basic algebraic operations: addition, subtraction, multiplication, division, and their properties of commutativity, associativity and distributivity. By elementary geometric simplification we mean the direct application of the definition of the area method quantities. We call them trivial steps.

Lemma 1 ($AML_9$) Let R be a point on the line PQ. Then for any two points A and B it holds that
The following is a shorter version of its proof, with the elementary algebraic and geometric simplifications condensed (the expanded version can be seen in [21]).
with $AML_{14} = 8$, $AML_5 = 18$ (first application) and $AML_5 = 11$ (second application). It is considered that, from the second application of a lemma onward, its proof is accepted, so only its adaptation to the new configuration is needed, i.e., the pattern matching of the lemma configuration to a new setting. For that reason, in any second, third, etc. application of a lemma, only the $CS_{gcl}$ coefficient values are considered.
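The rule for repeated applications can be sketched in Python. The values for $AML_5$ (18 on first use, 11 afterwards) and the first-use value of $AML_{14}$ (8) are the ones quoted above; the table layout, the function name, the order of the applications and the repeat value given to $AML_{14}$ are assumptions made for illustration only.

```python
# name -> (coefficient on first application, coefficient on later applications)
LEMMA_COEFFICIENTS = {
    "AML5": (18, 11),
    "AML14": (8, 8),  # repeat value not given in the text; placeholder, unused below
}


def lemma_contribution(applications: list[str],
                       table: dict[str, tuple[int, int]]) -> int:
    """Sum the lemma coefficients, charging the full value only on first use."""
    seen: set[str] = set()
    total = 0
    for name in applications:
        first, repeat = table[name]
        total += repeat if name in seen else first
        seen.add(name)
    return total


if __name__ == "__main__":
    # One application of AML14 and two of AML5: 8 + 18 + 11.
    print(lemma_contribution(["AML14", "AML5", "AML5"], LEMMA_COEFFICIENTS))  # 37
```

This accounts only for the lemma applications; the construction and the trivial steps contribute separately to the overall coefficient of simplicity.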

Geometrography of Lemma 9 ($AML_9$)
Given that a mathematical proof is a sequence of steps, in addition to the coefficient of simplicity, it would be useful to have other coefficients: e.g., the total number of steps in the proof; the value of the most difficult step in the proof; the number of different steps of high difficulty in the proof; the number of different types of steps (lemmas) in the proof; a proof script; a numerical description of the proof; and a corresponding line chart or proof trace. Therefore, to fully characterise a formal synthetic proof produced by a GATP, we can define and consider the following coefficients:
- $CS_{proof}$, the simplicity coefficient (as above), which gives the simplicity coefficient for the overall proof;
- $CT_{proof}$, the total number of steps in the proof;
- $CS_{proofmax}$, the highest simplicity coefficient of the lemmas/definitions applications, which gives the simplicity coefficient for the most difficult step of the proof;
- $CD_{typeproof}$, the number of different types of lemmas used in the proof;
- $CD_{highproof}$, the number of different steps of high difficulty in the proof;
- the proof script, as defined above;
- the corresponding line chart or proof trace in tikz format.¹³
It is important to note that, to obtain the coefficient $CD_{highproof}$ (hp), the area method lemmas implemented in the GATP GCLC were analysed and, using the k-means clustering function implemented in the statistics package of Octave, divided into three categories: low difficulty (hp < 284), medium difficulty (284 ≤ hp < 1848) and high difficulty (hp ≥ 1848).
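The three difficulty categories can be expressed as a simple threshold test. In the sketch below, reading hp as a value attached to each lemma is our assumption, as are the function names and the sample values, which are hypothetical.

```python
def difficulty_class(hp: int) -> str:
    """Map a value hp to the low / medium / high difficulty category."""
    if hp < 284:
        return "low"
    if hp < 1848:
        return "medium"
    return "high"


def count_high_difficulty(lemma_values: dict[str, int]) -> int:
    """Count the different lemmas whose value falls in the high category."""
    return sum(1 for hp in lemma_values.values() if difficulty_class(hp) == "high")


if __name__ == "__main__":
    # Hypothetical per-lemma values, for illustration only.
    sample = {"lemma_a": 120, "lemma_b": 900, "lemma_c": 2500}
    print({name: difficulty_class(hp) for name, hp in sample.items()})
    print(count_high_difficulty(sample))  # 1
```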
Using the coefficients defined above, we have the following values for the proof of $AML_9$:
The GATP GCLC implementation of the area method [9,10] is able to produce proof scripts. Using the command prooflevel it is possible to control the level of detail of the proof script. Two programs¹⁴ were implemented to calculate the Geometrography of the proofs. The Geometrography of the construction is calculated by a bash script, gclcGeometrography.bash, that analyses the GCLC geometric construction (not considering the rendering commands). The Geometrography of the proof script (minus the geometric construction) is calculated by csproof, a parser that analyses the proof script, counting the algebraic steps and the geometric steps in sequence and also the lemmas and definitions of the area method with their respective coefficients of simplicity.
Running the program csproof on an arbitrary geometric proof produces: a CSV file¹⁵ with the values regarding the Geometrographic Readability Coefficient of Proofs (see Sect. 4); a file with the coefficient of simplicity of the geometric construction; and a file with a line chart, a graphical representation of the proof done by the GATP GCLC.
To better understand some details, let us consider Ceva's Theorem again (see Theorem 1). Using the GATP GCLC with the full level of detail, the proof script of Ceva's Theorem has all the details explained and fills two pages, or almost three pages if the notes about the non-degeneracy conditions and about the proof itself are taken into consideration (see Appendix 1). The line chart is shown in Fig. 3. In it, the sequences of algebraic or geometric simplifications are condensed into a single step (for a more condensed view of the graph). Therefore, the Geometrography of the proof of Ceva's Theorem is the following: $4D + 18C + 23AS + 3AML_1 + 3AML_8 + 3AML_{10}$.
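An expression of this kind is easy to process mechanically. The following Python sketch parses a plain-text rendering of the expression into a table of multiplicities and lists the different lemmas that occur in it. The parser and its names are ours; this is not how csproof is implemented.

```python
import re


def parse_geometrography(expression: str) -> dict[str, int]:
    """Parse e.g. '4D + 18C + 3AML1' into {'D': 4, 'C': 18, 'AML1': 3}."""
    counts: dict[str, int] = {}
    for term in expression.split("+"):
        match = re.fullmatch(r"\s*(\d*)\s*([A-Za-z]+\s*\d*)\s*", term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        multiplicity = int(match.group(1)) if match.group(1) else 1
        symbol = re.sub(r"\s+", "", match.group(2))
        counts[symbol] = counts.get(symbol, 0) + multiplicity
    return counts


def lemma_types(counts: dict[str, int]) -> list[str]:
    """Return the different area method lemmas (symbols starting with 'AML')."""
    return sorted(symbol for symbol in counts if symbol.startswith("AML"))


if __name__ == "__main__":
    ceva = "4D + 18C + 23AS + 3AML1 + 3AML8 + 3AML10"
    counts = parse_geometrography(ceva)
    print(counts)
    print(len(lemma_types(counts)))  # 3 different lemmas
```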

A Geometrographic Criterion
It is interesting to note how the Geometrographic coefficients highlight many salient aspects of the proof, aspects that could be used to analyse the readability of such proofs. Furthermore, it is interesting to stress how the proof trace constitutes a sort of electroencephalogram of the machine while proving the theorem. Just as an electroencephalogram can be useful for measuring a brain's electrical activity, the line chart helps to understand some features of the proof by looking at its trace.
Applying Geometrography to the area method proofs contained in the TGTP repository, using the GATP GCLC with the full level of detail, and using the geometrographic coefficients, we can argue in favour of the following new readability coefficient:
Geometrographic Readability Coefficient of Proofs (GRCP)
$GRCP = (CS_{proof} - CT_{proof}) \times (CD_{highproof} + CD_{typeproof})$
This coefficient relates four quantities: the simplicity coefficient of the proof, the total number of steps in the proof, the number of different steps of high difficulty in the proof, and the number of different lemmas used in the proof.
The first factor, $(CS_{proof} - CT_{proof})$, gives an approximation to the overall coefficient of simplicity of the non-trivial steps in the proof. Note that $CT_{proof}$ counts the number of steps rather than the coefficient of simplicity of each step. By contrast, in $CS_{proof}$ it is the coefficient of simplicity that counts. Each trivial step has a coefficient of simplicity equal to one, and the coefficients of simplicity for non-trivial steps, such as the construction and the lemmas, are much greater than one. In the light of this, it can be concluded that the difference between $CS_{proof}$ and $CT_{proof}$ emphasises the complexity of the proof, disregarding its length.
The second factor, $(CD_{highproof} + CD_{typeproof})$, gives an account of the difficult steps. These are steps that potentially make the proof much harder to follow, steps where the normal flow of the proof would be interrupted to jump to the proof of the lemma, resuming after completing the lemma's proof. The sum of the number of high-difficulty steps and the number of different lemmas used in the proof gives a multiplying factor for the overall complexity of the proof. A final note about this second factor: a high-difficulty step is certainly a lemma application, so it is counted twice; nevertheless, we felt that the high-difficulty nature of the lemma is a sufficient reason for this double counting.
Multiplying these two factors (the approximation for the overall simplicity coefficient and the number of difficult steps, both elements that we believe characterise the readability of a proof), we obtain a readability coefficient of a proof.
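The product of the two factors described above can be written as a one-line function. The following Python sketch assembles the coefficient from those two factors; the function name and the sample values are hypothetical and do not correspond to any proof in TGTP.

```python
def grcp(cs_proof: int, ct_proof: int, cd_highproof: int, cd_typeproof: int) -> int:
    """Readability coefficient as the product of the two factors described above."""
    complexity = cs_proof - ct_proof               # non-trivial part of the proof
    difficult_steps = cd_highproof + cd_typeproof  # high-difficulty steps plus lemma types
    return complexity * difficult_steps


if __name__ == "__main__":
    # Hypothetical values: (120 - 40) * (1 + 3).
    print(grcp(cs_proof=120, ct_proof=40, cd_highproof=1, cd_typeproof=3))  # 320
```

A proof made only of trivial steps has $CS_{proof} = CT_{proof}$, so its coefficient is zero regardless of its length, which matches the remark that the first factor disregards the length of the proof.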

Theorem 2 (Circumcenter of a Triangle) The circumcenter of a triangle can be found as the intersection of the three perpendicular bisectors
has the following values for the different coefficients.

Comparing the Different Criteria
The Geometrographic Readability Coefficient of Proofs criterion takes into consideration all the significant aspects of a formal proof: its overall difficulty, its number of steps, the number of difficult steps and the number of different lemmas that must be applied. The other criteria consider fewer aspects. The de Bruijn criterion, given its different goal, takes into consideration only the size of the proof, and it needs an informal proof to compare with. The ML criterion considers the number of different lemmas applied and uses the number of … As in the ML criterion, the GRCP criterion considers the number of lemmas in the proof: in the GRCP criterion as a multiplicative factor, in the ML criterion as one of the conditions for readability. In the ML criterion the number of terms in the maximal polynomial is considered but, as its authors remarked, this measures the number of computations needed in the proof, not its readability. This is weakly related to the number of steps in the proof: it approximates the number of steps needed to reduce the long polynomials occurring in the proof to a simple expression.
Regardless of this comparison of criteria, we want to emphasise that the Geometrographic view proposed in this paper has a more general scope. Although the GRCP criterion is a reasonable proposal, the elementary quality of the Geometrographic approach, through the analysis of various coefficients of the proofs, the proof scripts and the proof traces, makes it possible to have a language or a tool that can be used by non-experts to formulate other criteria, weaker or stronger than the one we propose. The contribution of this paper is therefore not only that of a Geometrographic criterion, but of a Geometrographic approach to the problem of measuring the readability of formal proofs in automated deduction in geometry, an approach that offers an environment in which to analyse the proofs in detail by proposing and testing readability criteria. To the best of our knowledge, it is the first time that the community has access to such a general tool to formulate and to study the readability of formal proofs in automated deduction in geometry.
It is also interesting to note that our criterion offers a classification of proofs that is in line, when the fundamental points are considered, with the classifications given by the other two criteria, i.e., proofs that are classified as difficult to read according to the new criterion are also classified as difficult to read by the others, and the same applies to proofs that are easy to read (Table 2). Finally, we have to say that none of the criteria discussed here has been empirically validated through tests submitted to students, experts, etc. Nevertheless, the great advantage of our approach is that it allows the formulation of criteria that can be implemented in repositories such as TGTP and can be evaluated experimentally in a very simple way.

Conclusions
In this paper we have analysed the problem of measuring the readability of formal proofs in automated deduction in geometry. We have introduced two known criteria and highlighted some of their limitations. We have then introduced a third criterion that seems to overcome the problem of readability by experts, therefore being more natural than the previous ones, and that seems to be easily generalised. One possible generalisation is given by the possibility to formulate weaker, or stronger, criteria using the proposed language. Another possible avenue is the generalisation of our approach to other GATPs (e.g. the JGEx integrated GATPs, area method, full-angle method and deductive databases method [3,28,29], ArgoCLP, coherent logic prover [25]) and to any other ATP that has a proof script based on axioms, lemma applications and, possibly, elementary steps (algebraic, geometric, etc.). It is a matter of calculating the coefficients of simplicity for the axioms and lemmas of the base theory under consideration.
As we pointed out, the great advantage of our approach is that it allows the formulation of criteria that can be implemented in repositories such as TGTP and evaluated experimentally. For this reason, an important piece of work that we are planning is an experiment to be submitted to mathematicians, computer scientists, educationalists and students, providing an adequate empirical test for our Geometrographic criterion.

A Ceva's Theorem, GCLC Area Method Proof
The area method for Euclidean constructive geometry was proposed by Chou, Gao and Zhang in the early 1990s [2]. The method can efficiently prove many non-trivial geometry theorems and is one of the most interesting and most successful methods for automated theorem proving in geometry. In [10] a variant of the original axiom system was presented; based on that axiomatisation, all the lemmas needed by the method were formally proved and the soundness of the method was established, using the Coq proof assistant [18].¹⁷ The GCLC implementation of the area method is able to produce formal proofs. If the highest level of detail is chosen, prooflevel 7, it would be possible (once an appropriate filter has been built) to formally verify those proofs using a proof assistant, e.g. Coq. The LaTeX proof scripts that GCLC produces (by default, at prooflevel 2) are a natural language rendering, to be read by mathematicians.
The area method axiomatic system for Euclidean plane geometry (within first order logic with equality) has just one primitive type of geometrical objects: points. Variables can also range over a field (F, +, ·, 0, 1), where F is any field of characteristic different from 2. The axioms of the theory of fields used in GCLC area method proofs are standard.
The proof of Ceva's Theorem presented below is a LaTeX proof script produced by GCLC, at prooflevel 7, edited to include the GRCP values.
GCLC Prover Output for conjecture "cevaGEO0001", Area method used