
Capitalizing on live variables: new algorithms for efficient Hessian computation via automatic differentiation

  • Full Length Paper
  • Published in: Mathematical Programming Computation

Abstract

We revisit an algorithm [called Edge Pushing (EP)] for computing Hessians using Automatic Differentiation (AD) recently proposed by Gower and Mello (Optim Methods Softw 27(2): 233–249, 2012). Here we give a new, simpler derivation for the EP algorithm based on the notion of live variables from data-flow analysis in compiler theory and redesign the algorithm with close attention to general applicability and performance. We call this algorithm Livarh and develop an extension of Livarh that incorporates preaccumulation to further reduce execution time—the resulting algorithm is called Livarhacc. We engineer robust implementations for both algorithms Livarh and Livarhacc within ADOL-C, a widely-used operator overloading based AD software tool. Rigorous complexity analyses for the algorithms are provided, and the performance of the algorithms is evaluated using a mesh optimization application and several kinds of synthetic functions as testbeds. The results show that the new algorithms outperform state-of-the-art sparse methods (based on sparsity pattern detection, coloring, compressed matrix evaluation, and recovery) in some cases by orders of magnitude. We have made our implementation available online as open-source software and it will be included in a future release of ADOL-C.



Notes

  1. The term Single Assignment Code is used in Griewank–Walther [1] to refer to a block of evaluation procedure rather than an elementary function; here we use it to mean the latter, since, for simplicity, we identify mathematical variables with program variables.

References

  1. Griewank, A., Walther, A.: Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd edn. Number 105. SIAM, Philadelphia (2008)

  2. Naumann, U.: The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation. SIAM (2012)

  3. Jackson, R.H.F., McCormick, G.P.: The polyadic structure of factorable function tensors with applications to higher-order minimization techniques. J. Optim. Theory Appl. 51, 63–94 (1986)

  4. Christianson, B.: Automatic Hessians by reverse accumulation. IMA J. Numer. Anal. 12(2), 135–150 (1992)

  5. Dixon, L.C.W.: Use of automatic differentiation for calculating Hessians and Newton steps. In: Griewank, A., Corliss, G.F. (eds.) Automatic Differentiation of Algorithms: Theory, Implementation and Application, pp. 114–125. SIAM, Philadelphia (1991)

  6. Bhowmick, S., Hovland, P.D.: A polynomial-time algorithm for detecting directed axial symmetry in Hessian computational graphs. In: Bischof, C.H., Bucker, H.M., Hovland, P.D., Naumann, U., Utke, J. (eds.) Advances in Automatic Differentiation, pp. 91–102. Springer, Berlin (2008)

  7. Coleman, T., Moré, J.: Estimation of sparse Hessian matrices and graph coloring problems. Math. Program. 28, 243–270 (1984)

  8. Coleman, T., Cai, J.: The cyclic coloring problem and estimation of sparse Hessian matrices. SIAM J. Alg. Disc. Meth. 7(2), 221–235 (1986)

  9. Gebremedhin, A.H., Tarafdar, A., Manne, F., Pothen, A.: New acyclic and star coloring algorithms with applications to Hessian computation. SIAM J. Sci. Comput. 29(3), 1042–1072 (2007)

  10. Gebremedhin, A.H., Tarafdar, A., Pothen, A., Walther, A.: Efficient computation of sparse Hessians using coloring and automatic differentiation. INFORMS J. Comput. 1(2), 209–223 (2009)

  11. Gower, R.M., Mello, M.P.: A new framework for Hessian automatic differentiation. Optim. Methods Softw. 27(2), 233–249 (2012)

  12. Gower, R.M., Mello, M.P.: Computing the sparsity pattern of Hessians using automatic differentiation. ACM Trans. Math. Softw. 40(2) (2014) (Article No. 10)

  13. Gower, R.M., Gower, A.L.: Higher-order reverse automatic differentiation with emphasis on the third-order. Math. Program. 1–23 (2014) (ISSN: 0025–5610)

  14. Griewank, A., Juedes, D., Utke, J.: ADOL-C: a package for the automatic differentiation of algorithms written in C/C++. ACM Trans. Math. Softw. 22, 131–167 (1996)

  15. Walther, A., Griewank, A.: Getting started with ADOL-C. In: Combinatorial Scientific Computing, Chapman-Hall CRC Computational Science, Chapter 7, pp. 181–202 (2012)

  16. Hascoët, L., Naumann, U., Pascual, V.: “To Be Recorded” analysis in reverse-mode automatic differentiation. Future Gener. Comput. Syst. 21, 1401–1417 (2005)

  17. Hascoët, L., Araya-Polo, M.: The adjoint data-flow analyses: formalization, properties, and applications. In: Bücker, M., Corliss, G., Hovland, P., Naumann, U., Norris, B. (eds.) Automatic Differentiation: Applications, Theory and Tools, Lecture Notes in Comput. Sci. Engr. 50, pp. 135–146. Springer, Berlin (2005)

  18. Griewank, A.: Sequential Evaluation of Adjoints and Higher Derivative Vectors by Overloading and Reverse Accumulation. Research Report, Konrad-Zuse-Zentrum für Informationstechnik Berlin (1991)

  19. Utke, J.: Flattening basic blocks. In: Bücker, M., Corliss, G., Hovland, P., Naumann, U., Norris, B. (eds.) Automatic Differentiation: Applications, Theory, and Implementations, pp. 121–133. Springer, Berlin (2006)

  20. Luksan, L., Matonoha, C., Vlcek, J.: Sparse Test Problems for Unconstrained Optimization. Tech. Rep. V-1064, ICS AS CR (2010)

  21. Munson, T.S., Hovland, P.D.: The FeasNewt benchmark. In: Proceedings of the IEEE International Workload Characterization Symposium, pp. 150–154 (2005)

  22. Gebremedhin, A.H., Nguyen, D., Patwary, M., Pothen, A.: ColPack: software for graph coloring and related problems in scientific computing. ACM Trans. Math. Sofw. 40(1), 1–31 (2013)

  23. Walther, A.: Computing sparse Hessians with automatic differentiation. ACM Trans. Math. Softw. 34(1), 1–15 (2008)

  24. Walther, A.: On the efficient computation of sparse patterns for Hessians. In: Forth, S. et al. (eds.) Recent Advances in Algorithmic Differentiation, Lecture Notes in Computational Science and Engineering 87, pp. 139–149 (2012)

  25. Nethercote, N., Seward, J.: Valgrind: a framework for heavyweight dynamic binary instrumentation. In: Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI 2007), San Diego, CA, USA (2007)

Acknowledgments

We thank Jean Utke, Andrea Walther and Kshitij Kulshreshta for their comments on earlier versions of this manuscript and discussions around ADOL-C. We thank Paul Hovland and Todd Munson for making the FeasNewt benchmark available to us. This work was supported by the U.S. National Science Foundation Grants CCF-1218916 and CCF-1552323, and the U.S. Department of Energy Grant DE-SC0010205.

Author information

Correspondence to Alex Pothen.

Appendix


We provide here details that we left out from the discussions in Sects. 2 through 6. In Sect. 1.1 we give an alternative, more abstract derivation of Eqs. (6) and (8) from Sect. 3 on adjoints and Hessians, respectively. In Sect. 1.2 we analyze the complexity of processing one statement in Livarh and Livarhacc, and then establish a sufficient condition under which Livarhacc necessarily reduces the total number of updates needed. In Sect. 1.3 we give a listing of the mathematical expressions describing the test functions F1, F2, F3 and F4 used in Sect. 6.2.

1.1 Derivation of the Hessian invariant equation

Suppose we are about to process a SAC \(v_i = \varphi _i(v_j)_{v_j \prec v_i}\). Following the notation from Sect. 2, the current live variable set is \(S_{i+1}\) and the objective function is equivalent to \(\mathbf {F_{i+1}}(S_{i+1})\). After the SAC is processed, the live variable set will be \(S_i\) and the objective function will be equivalent to \(\mathbf {F_i}(S_i)\). The relation between the two is that \(v_i\) is considered an independent variable in \(\mathbf {F_{i+1}}(S_{i+1})\), whereas it is viewed as the implicit function \(v_i=\varphi _i(v_j)_{v_j \prec v_i}\) in \(\mathbf {F_i}(S_i)\). Thus:

$$\begin{aligned} \mathbf {F_{i+1}}(S_{i+1})&= \mathbf {F_{i+1}} (S_{i+1} {\setminus } \{v_i\}, v_i) \\&= \mathbf {F_{i+1}} (S_{i+1} {\setminus } \{v_i\}, v_i=\varphi _i(v_j)_{v_j \prec v_i}) \\&= \mathbf {F_i} (S_{i+1} {\setminus } \{v_i\} \cup \{v_j \vert v_j \prec v_i \}) \\&= \mathbf {F_i}(S_i). \end{aligned}$$

Let us introduce a notational short-hand at this point. Namely, to distinguish the partial derivative operation applied on \(\mathbf {F_{i+1}}(S_{i+1})\) from that applied on \(\mathbf {F_i}(S_i)\), we use \(\frac{\partial }{\partial v}\) to denote the operation on \(\mathbf {F_{i+1}}(S_{i+1})\) and \(\frac{\hat{\partial }}{\hat{\partial } v}\) to denote the operation on \(\mathbf {F_i}(S_i)\). Then we have:

$$\begin{aligned} \frac{\hat{\partial } \mathbf {F_i}(S_i)}{\hat{\partial } v}&= \frac{\hat{\partial } \mathbf {F_i}(S_{i+1} {\setminus } \{v_i\} \cup \{v_j \vert v_j \prec v_i \})}{\hat{\partial } v} \\&= \frac{\partial \mathbf {F_{i+1}} (S_{i+1} {\setminus } \{v_i\}, v_i=\varphi _i(v_j)_{v_j \prec v_i})}{\partial v} \\&= \frac{\partial \mathbf {F_{i+1}}(S_{i+1})}{\partial v} + \frac{\partial \varphi _i}{\partial v}\frac{\partial \mathbf {F_{i+1}}(S_{i+1})}{\partial v_i}. \end{aligned}$$

This is the same as the adjoints Eq. (6). Viewed in terms of operators, the relationship is:

$$\begin{aligned} \frac{\hat{\partial }}{\hat{\partial } v} = \frac{\partial }{\partial v} +\frac{\partial \varphi _i}{\partial v} \frac{\partial }{\partial v_i}. \end{aligned}$$
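To make the operator relation concrete, the following sketch (our illustration, not code from the paper's ADOL-C implementation) applies it as the reverse-mode adjoint update for a single SAC; the dictionary `adjoint` and the helper `reverse_sac` are hypothetical names:

```python
# Reverse-mode adjoint update for one SAC v_i = phi_i(args),
# mirroring the operator relation above: the adjoint of each
# argument v_j gains (dphi_i/dv_j) times the adjoint of v_i.
# 'adjoint' maps each live variable to the partial derivative of
# the objective w.r.t. it; 'partials' maps each argument v_j of
# phi_i to dphi_i/dv_j.  (Hypothetical helper, for illustration.)
def reverse_sac(adjoint, v_i, partials):
    a_i = adjoint.pop(v_i, 0.0)   # v_i is no longer live afterwards
    for v_j, dphi in partials.items():
        adjoint[v_j] = adjoint.get(v_j, 0.0) + dphi * a_i
    return adjoint
```

For example, processing v3 = v1*v2 at v1 = 2, v2 = 5 (so the partials are 5 and 2) with adjoint(v3) = 1 yields adjoints 5 and 2 for v1 and v2, as expected.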

Now we wish to extend this relation to second order. To simplify notation, let \(f=\mathbf {F_{i+1}}(S_{i+1})\) and \(\hat{f}=\mathbf {F_i}(S_i)\). Then by applying the first order operator relation twice, we have:

$$\begin{aligned} \frac{\hat{\partial }^2 \hat{f}}{\hat{\partial } v_j \hat{\partial } v_k}&= \frac{\hat{\partial }}{\hat{\partial } v_j} \left[ \frac{\hat{\partial } \hat{f}}{\hat{\partial } v_k} \right] = \frac{\hat{\partial }}{\hat{\partial } v_j} \left[ \frac{\partial f}{\partial v_k}+\frac{\partial \varphi _i}{\partial v_k} \frac{\partial f}{\partial v_i} \right] \\&= \frac{\hat{\partial }}{\hat{\partial } v_j}\left[ \frac{\partial f}{\partial v_k} \right] + \frac{\hat{\partial }}{\hat{\partial } v_j} \left[ \frac{\partial \varphi _i}{\partial v_k} \frac{\partial f}{\partial v_i}\right] \\&= \frac{\hat{\partial }}{\hat{\partial } v_j} \left[ \frac{\partial f}{\partial v_k} \right] +\frac{\partial \varphi _i}{\partial v_k} \frac{\hat{\partial }}{\hat{\partial } v_j} \left[ \frac{\partial f}{\partial v_i}\right] + \frac{\partial f}{\partial v_i} \frac{\hat{\partial }}{\hat{\partial } v_j} \left[ \frac{\partial \varphi _i}{\partial v_k}\right] \\&= \left[ \frac{\partial }{\partial v_j} +\frac{\partial \varphi _i}{\partial v_j} \frac{\partial }{\partial v_i}\right] \frac{\partial f}{\partial v_k} + \frac{\partial \varphi _i}{\partial v_k} \left[ \frac{\partial }{\partial v_j} +\frac{\partial \varphi _i}{\partial v_j} \frac{\partial }{\partial v_i}\right] \left[ \frac{\partial f}{\partial v_i}\right] \\&\quad + \frac{\partial f}{\partial v_i} \left[ \frac{\partial }{\partial v_j} +\frac{\partial \varphi _i}{\partial v_j} \frac{\partial }{\partial v_i} \right] \left[ \frac{\partial \varphi _i}{\partial v_k}\right] \\&= \frac{\partial ^2 f}{\partial v_j \partial v_k} + \frac{\partial \varphi _i}{\partial v_j} \frac{\partial ^2 f}{\partial v_i \partial v_k} + \frac{\partial \varphi _i}{\partial v_k} \frac{\partial ^2 f}{\partial v_j \partial v_i} + \frac{\partial \varphi _i}{\partial v_k} \frac{\partial \varphi _i}{\partial v_j} \frac{\partial ^2 f}{\partial v_i \partial v_i} + \frac{\partial f}{\partial v_i} \frac{\partial ^2 \varphi _i}{\partial v_j \partial v_k} 
\\&= \frac{\partial ^2 f}{\partial v_j \partial v_k} + \frac{\partial \varphi _i}{\partial v_j} \frac{\partial ^2 f}{\partial v_i \partial v_k} + \frac{\partial \varphi _i}{\partial v_k}\frac{\partial ^2 f}{\partial v_j \partial v_i} + \frac{\partial \varphi _i}{\partial v_j} \frac{\partial \varphi _i}{\partial v_k} \frac{\partial ^2 f}{\partial v_i \partial v_i} + \frac{\partial ^2 \varphi _i}{\partial v_j \partial v_k} \frac{\partial f}{\partial v_i}. \end{aligned}$$

This is an alternate, first-principles derivation of the Hessian Eq. (8).
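The five terms of this equation translate directly into an edge-pushing style update. The following Python sketch (an illustration under our own conventions, not the paper's ADOL-C code) stores the symmetric Hessian as a dictionary keyed by unordered variable pairs and processes one SAC; note the factor 2 when a pushed entry lands on the diagonal:

```python
def push_sac(H, adjoint, v_i, first, second):
    """Process SAC v_i = phi_i(args): push every Hessian entry that
    involves the dying variable v_i onto the arguments of phi_i, add
    the 'creating' term, then do the first-order adjoint update.
    H       : dict mapping sorted (unordered) variable pairs to entries
    adjoint : dict of adjoints
    first   : {v_j: dphi_i/dv_j}
    second  : {(v_j, v_k): d2phi_i/dv_j dv_k}   (hypothetical layout)
    """
    def key(a, b):
        return (a, b) if a <= b else (b, a)

    a_i = adjoint.pop(v_i, 0.0)          # adjoint of v_i (term 5 below)
    h_ii = H.pop(key(v_i, v_i), 0.0)     # old diagonal entry (term 4)
    # Collect and remove the remaining entries (v_i, w), w != v_i.
    involving = {}
    for pair in [p for p in H if v_i in p]:
        w = pair[0] if pair[1] == v_i else pair[1]
        involving[w] = involving.get(w, 0.0) + H.pop(pair)
    # Terms 2 and 3: push (v_i, w) onto (v_j, w); a diagonal target
    # receives both symmetric contributions, hence the factor 2.
    for w, h_iw in involving.items():
        for v_j, d_j in first.items():
            f = 2.0 if w == v_j else 1.0
            H[key(w, v_j)] = H.get(key(w, v_j), 0.0) + f * d_j * h_iw
    # Term 4: spread the old diagonal entry h_ii over argument pairs.
    args = list(first.items())
    for a in range(len(args)):
        for b in range(a, len(args)):
            (vj, dj), (vk, dk) = args[a], args[b]
            H[key(vj, vk)] = H.get(key(vj, vk), 0.0) + dj * dk * h_ii
    # Term 5 ('creating'): second derivative of phi_i times the adjoint.
    for (vj, vk), s in second.items():
        H[key(vj, vk)] = H.get(key(vj, vk), 0.0) + s * a_i
    # First-order adjoint update, Eq. (6).
    for vj, dj in first.items():
        adjoint[vj] = adjoint.get(vj, 0.0) + dj * a_i
```

As a worked check: for \(f(v_i) = v_i^2\) evaluated through \(v_i = v_j^2\) at \(v_j = 3\) (so adjoint\((v_i) = 18\) and \(H(v_i,v_i) = 2\)), the update yields \(H(v_j,v_j) = 6\cdot 6\cdot 2 + 2\cdot 18 = 108\), matching \(F(v_j) = v_j^4\) with \(F'' = 12 v_j^2 = 108\).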

1.2 Analysis of Livarhacc

We follow the assumption in Sect. 4.2 that every operator is either unary or binary. Suppose a statement is represented by k SACs, \(\varphi _{i-k+1}, \ldots , \varphi _{i-1}, \varphi _{i}\), meaning \(v_i\) is the left hand side variable of the statement. This statement defines a local computational graph that is embedded in the global computational graph. (See Fig. 7b, d for an illustration.) The embedded local computational graph has the following properties:

  1. The local computational graph is almost a tree. The root node of the tree is \(v_i\), the “leaf” nodes are variables that appear in the right-hand side of the statement, and the intermediate nodes are \(v_{i-k+1}, \ldots , v_{i-1}\). Only the root node and the “leaf” nodes are explicitly declared variables in the code. The intermediate nodes \(v_{i-k+1}, \ldots , v_{i-1}\) are implicitly generated by the compiler at run-time.

  2. The local computational graph differs from a tree in that only the “leaf” nodes can have out-degree more than one (the local subgraph looks like an inverted funnel).

These two properties hold primarily because of two basic assumptions we make. First, we assume the AD tool performs no implicit assignments, since doing so is not safe: the evaluation order might differ from what is expected for intrinsic floating-point types. Second, for overloaded operators, a compiler does not eliminate common subexpressions (or at least cannot do so safely for free).
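As a hypothetical illustration of how a single statement is recorded as a chain of SACs with compiler-generated temporaries, consider the following toy operator-overloading tape (ours, not ADOL-C's):

```python
class Var:
    """Toy operator-overloading tape: records each elementary operation
    as a SAC (result, op, operand names). Unnamed Var instances play the
    role of compiler-generated temporaries t1, t2, ..."""
    tape = []
    counter = 0

    def __init__(self, name=None):
        if name is None:
            Var.counter += 1
            name = "t%d" % Var.counter      # implicit intermediate node
        self.name = name

    def _record(self, op, *operands):
        result = Var()
        Var.tape.append((result.name, op, [o.name for o in operands]))
        return result

    def __mul__(self, other):
        return self._record("mul", self, other)

    def __add__(self, other):
        return self._record("add", self, other)

def sin(v):                  # stand-in for an overloaded elementary sin
    return v._record("sin", v)

x, y = Var("x"), Var("y")
w = x * y + sin(x)   # one statement -> SACs t1 = x*y, t2 = sin(x), t3 = t1+t2
# Only the leaves x, y and the root w are named in the source; x has
# out-degree 2 (it feeds both mul and sin), while each temporary feeds
# exactly one later SAC: an inverted funnel, as in Property 2.
```

In the paper's convention the final assignment to w would itself be recorded as one more SAC with a single operand; the toy tape above stops at the last arithmetic operation.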

From the aforementioned two properties, the following claim follows.

Lemma 1

Using the notations and assumptions established earlier, in Livarh, let \(d_{i-j}\) denote the number of nonzeros in the row \(v_{i-j}\) when processing \(\varphi _{i-j}, 0\le j < k\). Then, \(d_i \le d_{i-j}\).

Proof

At the beginning of processing \(\varphi _i\), the other ends of the nonlinear interaction are explicit variables. For all \(\varphi _{i-j}\), \(0<j<k\), the result variable \(v_{i-j}\) is an implicit variable, as stated in Property (1). Therefore, there will be no merge of nonlinear interactions. In other words, the variable \(v_{i-j}\) gets all nonzero entries from its parent, as illustrated in Fig. 7a, b. So the statement follows immediately. \(\square \)

From the discussion in Sect. 4.2, we know that in Livarh, an upper bound on the total number of updates needed in the Hessian mapping to process the SACs \(\varphi _{i-k+1}, \ldots , \varphi _{i-1}, \varphi _{i}\) can be given by:

$$\begin{aligned} U=\sum _{j=0}^{k-1} \Big ( d_{i-j} p_{i-j} + p_{i-j}(p_{i-j}+1) \Big ). \end{aligned}$$
(13)

In Livarhacc (Fig. 6), we have the following property to bound the size of a live variable set during local accumulation:

Lemma 2

During local accumulation, the size of the local live variable set when processing \(\varphi _{i-j}\) is at most \(j+1\).

Thus, in Livarhacc, an upper bound on the total number of updates needed to preaccumulate the same SACs locally can be given by:

$$\begin{aligned} U_L=\sum _{j=0}^{k-1} \Big ( j p_{i-j} + p_{i-j}(p_{i-j}+1) \Big ). \end{aligned}$$
(14)

There is an important difference between Eqs. (14) and (13). In Eq. (14), we take into consideration that the cases in Fig. 5a and in Fig. 5b are mutually exclusive. In contrast, in Eq. (13) [as well as in Eqs. (10) and (15)], this exclusiveness is ignored to simplify the equation.

Assume there are p “leaf” nodes in the local computational graph. Then an upper bound on the number of operations needed to globally accumulate the derivatives can be given by:

$$\begin{aligned} U_G=d_ip+p(p+1). \end{aligned}$$
(15)

Finally, under the assumption that all operators are unary or binary (which implies \(p_{i-j} \le 2, 0\le j < k\)), the following two lemmas are easy to prove.

Lemma 3

For an assignment and associated computational graph defined by \(\varphi _{i-k+1},\ldots , \varphi _{i}\), the relationship \(\sum _{j=0}^{k-1} p_{i-j} \ge k+p-1\) holds. Equality holds if and only if every variable on the right-hand side is unique.

Proof

Assume there are q occurrences of explicitly declared variables in the assignment, where multiple occurrences of the same variable count independently. Obviously, \(q \ge p\), and equality holds if and only if every variable on the right-hand side is unique.

During the evaluation of this assignment, each SAC \(\varphi _{i-j}\), \(0 \le j < k\), generates one variable (temporary result) and consumes \(p_{i-j}\) variables (operands). Initially, we have q variables on the right-hand side as operands, and finally we have one variable on the left-hand side as the assignee. So we have \(\sum _{j=0}^{k-1} p_{i-j} + 1= q + k\). That is, \(\sum _{j=0}^{k-1} p_{i-j} \ge k + p - 1\), and equality holds if and only if every variable on the right-hand side is unique. \(\square \)
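The counting argument can be checked on a small hypothetical statement, say \(w = x*y + \sin (x)\), which evaluates as \(k = 4\) SACs (multiply, sine, add, and the assignment):

```python
# Worked count for Lemma 3's identity  sum p_{i-j} + 1 = q + k,
# on the hypothetical statement  w = x*y + sin(x):
#   t1 = x*y    (2 operands)      t2 = sin(x)  (1 operand)
#   t3 = t1+t2  (2 operands)      w  = t3      (1 operand, the "=")
operand_counts = [2, 1, 2, 1]   # p_{i-j} for the k SACs
k = len(operand_counts)         # k = 4 SACs
q = 3                           # occurrences of declared variables: x, y, x
p = 2                           # distinct leaves: x, y

assert sum(operand_counts) + 1 == q + k    # the identity in the proof
assert sum(operand_counts) >= k + p - 1    # Lemma 3 (strict: x repeats)
```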

Lemma 4

For an assignment and associated computational graph defined by \(\varphi _{i-k+1},\ldots , \varphi _{i}\), the relationship \(k \ge p\) holds. Equality holds if and only if every variable on the right-hand side is unique, and every operator is binary except for the assignment operator “=”.

Proof

Following the proof of Lemma 3, we know \(\sum _{j=0}^{k-1} p_{i-j} + 1= q + k\). We have \(p_{i} = 1\) because \(\varphi _i\) is an assignment operator, and \(p_{i-j} \le 2\) for \(1 \le j < k\) because we assume all operators are unary or binary. Thus we have \(p+k \le q+k = \sum _{j=0}^{k-1} p_{i-j} + 1 \le 2k\). That gives us \(k \ge p\), and equality holds if and only if every variable on the right-hand side is unique, and every operator is binary except for the assignment operator “=”. \(\square \)

Putting these results together, we now prove Theorem 3 stated in Sect. 4.2, which we redisplay here for convenience as Theorem 4.

Theorem 4

Given a statement defined by \(\varphi _{i-k+1},\ldots , \varphi _{i}\), when \(d_i(k-1)\ge 2k(k+1)\) holds, we have \(U_L+U_G \le U\).

Proof

First we consider the left-hand-side of the inequality.

$$\begin{aligned} U_L + U_G&= \sum _{j=0}^{k-1} \left( j p_{i-j} + p_{i-j}(p_{i-j}+1) \right) + d_i p +p(p+1) \\&= d_ip+p(p+1)+\sum _{j=0}^{k-1} j p_{i-j}+\sum _{j=0}^{k-1}p_{i-j}(p_{i-j}+1) \\&\le d_ip+p(p+1)+2 \sum _{j=0}^{k-1} j+\sum _{j=0}^{k-1}p_{i-j}(p_{i-j}+1) \quad \quad (\text{ since }\ p_{i-j} \le 2) \\&\le d_ip+p(p+1)+k(k+1)+\sum _{j=0}^{k-1}p_{i-j}(p_{i-j}+1) \\&\quad \quad (\text{ since }\ 2\textstyle \sum _{j=0}^{k-1} j = k(k-1) \le k(k+1)) \\&\le 2k(k+1)+d_ip+\sum _{j=0}^{k-1}p_{i-j}(p_{i-j}+1) \quad \quad (\text{ by } \text{ Lemma } \text{4 },\ p \le k). \end{aligned}$$

Now we provide a lower bound for the right-hand-side of the inequality.

$$\begin{aligned} U&= \sum _{j=0}^{k-1} \Big ( d_{i-j} p_{i-j} + p_{i-j}(p_{i-j}+1) \Big ) \\&= \sum _{j=0}^{k-1} d_{i-j} p_{i-j}+\sum _{j=0}^{k-1}p_{i-j}(p_{i-j}+1) \\&\ge d_i \sum _{j=0}^{k-1} p_{i-j}+\sum _{j=0}^{k-1}p_{i-j}(p_{i-j}+1) \quad \quad (\text{ By } \text{ Lemma } \text{1 }, d_i \le d_{i-j}) \\&\ge d_i(k+p-1)+\sum _{j=0}^{k-1}p_{i-j}(p_{i-j}+1) \quad \quad (\text{ By } \text{ Lemma } \text{3 }) \\&= d_i(k-1)+d_ip+\sum _{j=0}^{k-1}p_{i-j}(p_{i-j}+1). \end{aligned}$$

Thus when \(d_i(k-1)\ge 2k(k+1)\), we have \(U_L+U_G \le U\). \(\square \)
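The theorem's condition and the bounds (13)–(15) can also be checked numerically; the following sketch (with arbitrarily chosen values for k, p and \(d_i\), not data from the paper) evaluates the closed forms in the extremal setting \(d_{i-j} = d_i\), \(p_{i-j} = 2\):

```python
# Numeric sanity check of the bounds behind Theorem 4, using the
# closed forms for U (Eq. 13), U_L (Eq. 14) and U_G (Eq. 15).
# The values of k, p and d_i below are hypothetical.
def U(d, p_list):                        # Eq. (13) with d_{i-j} = d
    return sum(d * p + p * (p + 1) for p in p_list)

def U_L(p_list):                         # Eq. (14)
    return sum(j * p + p * (p + 1) for j, p in enumerate(p_list))

def U_G(d_i, p):                         # Eq. (15)
    return d_i * p + p * (p + 1)

k, p, d_i = 5, 5, 100          # d_i(k-1) = 400 >= 2k(k+1) = 60
p_list = [2] * k               # all operators binary
assert d_i * (k - 1) >= 2 * k * (k + 1)          # theorem's condition
assert U_L(p_list) + U_G(d_i, p) <= U(d_i, p_list)   # U_L + U_G <= U
```

Here \(U = 1030\) while \(U_L + U_G = 50 + 530 = 580\), so preaccumulation pays off once the row density \(d_i\) is large relative to the statement length k, in line with the theorem.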

1.3 Listing of the synthetic test functions F1, F2, F3 and F4

We list in Table 15 the mathematical definitions of the four synthetic test functions F1, F2, F3 and F4 used in the experiments discussed in Sect. 6. These functions are referred to in [20] as Problem 1, Problem 5, Problem 80, and Problem 41, respectively.

Table 15 Mathematical descriptions of the synthetic test functions used in the experiments

About this article

Cite this article

Wang, M., Gebremedhin, A. & Pothen, A. Capitalizing on live variables: new algorithms for efficient Hessian computation via automatic differentiation. Math. Prog. Comp. 8, 393–433 (2016). https://doi.org/10.1007/s12532-016-0100-3
