# McVol - A program for calculating protein volumes and identifying cavities by a Monte Carlo algorithm

## Authors

DOI: 10.1007/s00894-009-0541-y

- Cite this article as:
- Till, M.S. & Ullmann, G.M. J Mol Model (2010) 16: 419. doi:10.1007/s00894-009-0541-y


## Abstract

In this paper, we describe a Monte Carlo method for determining the volume of a molecule. A molecule is considered to consist of hard, overlapping spheres. The surface of the molecule is defined by rolling a probe sphere over the surface of the spheres. To determine the volume of the molecule, random points are placed in a three-dimensional box, which encloses the whole molecule. The volume of the molecule in relation to the volume of the box is estimated by calculating the ratio of the random points placed inside the molecule and the total number of random points that were placed. For computational efficiency, we use a grid-cell based neighbor list to determine whether a random point is placed inside the molecule or not. This method in combination with a graph-theoretical algorithm is used to detect internal cavities and surface clefts of molecules. Since cavities and clefts are potential water binding sites, we place water molecules in the cavities. The potential water positions can be used in molecular dynamics calculations as well as in other molecular calculations. We apply this method to several proteins and demonstrate the usefulness of the program. The described methods are all implemented in the program McVol, which is available free of charge from our website at http://www.bisb.uni-bayreuth.de/software.html.

### Keywords

Cavities in proteins, Molecular volume, Monte Carlo, Water placement inside proteins

## Introduction

The identification of the surface of a protein has a long tradition in many fields of protein modeling and drug design [1–5]. The great interest in this subject is motivated by its importance for identifying ligand binding pockets and cavities in proteins. Moreover, protein crystal structures often show internal cavities that could be filled with water molecules. The identification of such water-filled cavities is important for the analysis of proton transfer networks in proteins, since these water molecules can play a role in hydrogen bond networks and therefore influence the long-range proton transport within proteins [6–8]. Several methods have been developed to calculate the solvent accessible surface, molecular surface and molecular volume of a protein. Among them, algorithms based on the alpha shape theory are used in many approaches [2, 9, 10]. The alpha shape theory orders a subset of Delaunay complexes with the aim of reducing the computational cost of an inclusion-exclusion formalism to calculate the protein surface and volume. An accurate computation of the molecular and solvent accessible surfaces and volumes is possible with this algorithm. However, its main drawback is numerical instability due to geometric degeneracy: the computation of the Delaunay complexes has been shown to be prone to such instabilities. A solution to this problem is the so-called “Simulation of Simplicity” [9], which is implemented for example in CASTp [2]. Other methods like LIGSITE [11], POCKET [12], or SURFNET [13] are grid-based methods to define the protein surface and internal cavities or ligand binding sites. These methods are limited by the resolution of the grid they use. All these methods are basically methods for integrating the protein volume. Monte Carlo algorithms are known to be able to perform such integrations. A well-known textbook example is the integration of a circle area for the determination of the number π [14]. Such an algorithm can also be used for determining the volume of proteins.

In this paper, we describe an efficient Monte Carlo algorithm for calculating protein volumes and for identifying internal cavities. Our new algorithm is neither dependent on grid resolutions nor is the algorithm prone to geometric degeneracy at any point of the integration. Based on the identified cavities, we suggest possible positions for water molecules and place these water molecules. We apply this program to several proteins of different sizes and compare our results with experimentally identified water positions. The program is available from our website at http://www.bisb.uni-bayreuth.de/software.html.

## Methods

### Theory of the volume integration

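If random points are placed uniformly in a box of volume $V_{\mathrm{box}}$ that encloses the whole molecule, then the MV is given by

$$\mathrm{MV} = \frac{n_{\mathrm{inside}}}{n_{\mathrm{tot}}}\, V_{\mathrm{box}} \qquad (1)$$

where $n_{\mathrm{inside}}$ is the number of points inside the MV and $n_{\mathrm{tot}}$ is the total number of points.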

1. If the point is closer to one atom than the van der Waals radius of this atom, the point is inside the van der Waals volume and therefore inside the MV; else
2. if the distance of the point to any atom center is smaller than the van der Waals radius of the atom plus the probe sphere radius and the distance to the closest point of the SAS is larger than the probe sphere radius, the point belongs to a void and therefore to the MV.
3. In any other case, the point belongs to the solvent.
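The three criteria above, combined with the ratio of inside points to total points, can be sketched as follows. This is a minimal illustration and not the McVol source: the function names, the brute-force distance evaluation (no neighbor list), and the assumption that the SAS dots are already available as an array are all choices made for this example.

```python
import numpy as np

def classify(point, centers, radii, sas_dots, r_probe):
    """Return 'atom', 'void' or 'solvent' for one random point."""
    d = np.linalg.norm(centers - point, axis=1)
    # Criterion 1: inside the van der Waals sphere of some atom.
    if np.any(d < radii):
        return "atom"
    # Criterion 2: within the probe-extended sphere of some atom and
    # farther than one probe radius from every SAS dot.
    if np.any(d < radii + r_probe):
        nearest = np.min(np.linalg.norm(sas_dots - point, axis=1))
        if nearest > r_probe:
            return "void"
    # Criterion 3: everything else is solvent.
    return "solvent"

def mc_volume(centers, radii, sas_dots, r_probe=1.3, n_points=40000, seed=0):
    """Estimate the molecular volume as V_box * n_inside / n_tot."""
    rng = np.random.default_rng(seed)
    pad = radii.max() + r_probe
    lo = centers.min(axis=0) - pad          # box enclosing the molecule
    hi = centers.max(axis=0) + pad
    v_box = float(np.prod(hi - lo))
    points = rng.uniform(lo, hi, size=(n_points, 3))
    n_inside = sum(
        classify(p, centers, radii, sas_dots, r_probe) != "solvent"
        for p in points
    )
    return v_box * n_inside / n_points
```

For a single isolated atom the estimate should approach the sphere volume 4πr³/3, up to statistical noise and a small bias that comes from representing the SAS by a finite number of dots.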

In Eq. 2, $N$ is the number of atoms, $r_i$ is the radius of atom $i$, $n_{\mathrm{surf},i}$ is the number of dots on the SAS of atom $i$ and $n_{\mathrm{tot},i}$ is the number of dots placed on atom $i$, no matter whether they are on the SAS or not.
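A dot-surface estimate that is consistent with these symbols can be sketched as below. The formula used here, a sum over atoms of 4πR² weighted by the exposed fraction of dots, is an assumption inferred from the listed quantities and should not be read as the exact form of Eq. 2. It is further assumed that the sphere radius R is the atom radius plus the probe radius, that dots are distributed with a Fibonacci lattice, and that burial is tested by brute force; the function names are hypothetical.

```python
import numpy as np

def unit_sphere_dots(n):
    """Roughly uniform points on the unit sphere (Fibonacci lattice)."""
    k = np.arange(n) + 0.5
    z = 1.0 - 2.0 * k / n
    phi = np.pi * (1.0 + np.sqrt(5.0)) * k
    s = np.sqrt(1.0 - z * z)
    return np.column_stack((s * np.cos(phi), s * np.sin(phi), z))

def sas_area(centers, radii, r_probe=1.3, n_dots=2500):
    """Sum over atoms of 4*pi*R_i^2 * n_surf_i / n_tot_i (assumed form)."""
    unit = unit_sphere_dots(n_dots)
    ext = radii + r_probe                     # assumed: probe-extended radii
    total = 0.0
    for i in range(len(centers)):
        dots = centers[i] + ext[i] * unit
        # Distance of every dot of atom i to every atom center.
        d = np.linalg.norm(dots[:, None, :] - centers[None, :, :], axis=2)
        buried = d < ext[None, :]
        buried[:, i] = False                  # a dot is never buried by its own atom
        n_surf = np.count_nonzero(~buried.any(axis=1))
        total += 4.0 * np.pi * ext[i] ** 2 * n_surf / n_dots
    return total
```

For an isolated atom every dot is exposed, so the result is exactly the area of the extended sphere; overlapping atoms lose the dots that fall inside a neighbor's extended sphere.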

### Implementation of the volume integration

… *h* (using the standard function ceil()). Then, all distances to the atoms and surface points in the neighboring grid cells are evaluated. Suppose the random point was assigned to the grid cell with the index (*i*, *j*, *k*); then the distances to all atoms or surface dots assigned to the grid cells (*i* ± *h*, *j* ± *h*, *k* ± *h*) are calculated. By this procedure, the number of distance calculations is reduced by orders of magnitude. It should be noted that the grid resolution influences the speed of the program but not the accuracy of the volume calculations, since the points used to calculate the volume are placed randomly in the box.
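A grid-cell neighbor list of this kind can be sketched as follows. Because the sentence defining *h* is incomplete in this text, the sketch assumes that *h* is the cutoff distance divided by the cell size, rounded up with ceil(); the function names and the dictionary-based cell storage are likewise illustrative choices.

```python
import math
from collections import defaultdict
from itertools import product

def cell_index(p, origin, cell_size):
    """Integer grid-cell index (i, j, k) of a point."""
    return tuple(int(math.floor((p[a] - origin[a]) / cell_size)) for a in range(3))

def build_cells(points, origin, cell_size):
    """Assign every atom or surface dot to its grid cell."""
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[cell_index(p, origin, cell_size)].append(idx)
    return cells

def neighbors_within(q, points, cells, origin, cell_size, cutoff):
    """Indices of all points closer than cutoff to q, scanning only
    the cells (i +/- h, j +/- h, k +/- h) around the cell of q."""
    h = math.ceil(cutoff / cell_size)        # assumed definition of h
    i, j, k = cell_index(q, origin, cell_size)
    found = []
    for di, dj, dk in product(range(-h, h + 1), repeat=3):
        for idx in cells.get((i + di, j + dj, k + dk), ()):
            if math.dist(q, points[idx]) < cutoff:
                found.append(idx)
    return sorted(found)
```

The result is identical to a brute-force scan over all points. The cell size changes only how many candidates are examined, which matches the statement that the grid resolution affects speed but not accuracy.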

### Identification of cavities

The procedure described above allows us not only to calculate the protein volume but also to identify internal cavities. There are two ways to identify internal cavities in our calculation: first, based on the dot surface, and second, based on the volume integration. We describe both possibilities in the following.

First, the surface is defined based on surface points marking the accessibility to the probe sphere. The surface of an internal cavity is described in the same way as the outside surface of the protein. We applied a graph search algorithm to separate surface points defining the outside surface of the protein from surface points defining internal cavities. The undirected graph is generated by connecting surface dots that are less than a certain distance (ca. 1 to 2 Å) apart, using a cell-based neighbor list. The basic idea is to divide the graph into unconnected subgraphs. Typically, the largest subgraph describes the outer surface of the protein and smaller subgraphs describe internal cavities. The graph search is implemented as a breadth-first search (BFS) [18]. To save memory, searching and building the graph are implemented in one routine, since it is not necessary to keep the connectivity matrix in memory.

The BFS method starts by placing all surface dots in one graph. A vector representing all surface dots records the graph division. This vector is initialized with 0 as the graph number for all elements. Starting from the first element *i* in this vector, we assign the subgraph number 1 to this element and identify all neighboring surface dots. These neighboring surface dots are considered as connected in our graph, and therefore the subgraph number 1 is assigned to these points. Additionally, these points are placed on a stack. Once all connections of *i* have been evaluated, a loop is started with an empty stack as termination condition. In each loop iteration, the last dot placed on the stack is taken from the stack, the subgraph number 1 is assigned to all neighboring dots that do not already have a subgraph number, and these dots are also placed on the stack. Therefore, once the stack becomes empty, there are no more dots in the whole graph that are connected to subgraph 1 but not assigned to it. If all dots of the surface are placed in subgraph 1, the whole graph cannot be divided into subgraphs. If dots with subgraph number 0 remain in the vector, one of these dots is taken as the next starting point *i* for subgraph number 2. This procedure is repeated until all dots are assigned to a subgraph. If more than one subgraph is found by the BFS algorithm, subgraphs not connected to the outer protein surface can be defined as internal cavities. The surface of each subgraph can be calculated using Eq. 2.
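The labeling procedure can be sketched as follows. The sketch follows the stack-based description in the text and builds the connectivity on the fly instead of storing a connectivity matrix. As simplifications made for this example, neighbors are found by brute force rather than with a cell-based neighbor list, and the function name and the 1.5 Å default are hypothetical.

```python
import numpy as np

def label_subgraphs(dots, max_dist=1.5):
    """Assign a subgraph number (1, 2, ...) to every surface dot."""
    dots = np.asarray(dots, dtype=float)
    labels = np.zeros(len(dots), dtype=int)      # 0 = not yet assigned
    current = 0
    for start in range(len(dots)):
        if labels[start] != 0:
            continue                             # already in a subgraph
        current += 1
        labels[start] = current
        stack = [start]
        while stack:                             # empty stack terminates the loop
            i = stack.pop()
            d = np.linalg.norm(dots - dots[i], axis=1)
            # Unassigned dots within max_dist are connected to dot i.
            for j in np.flatnonzero((d < max_dist) & (labels == 0)):
                labels[j] = current
                stack.append(int(j))
    return labels
```

Replacing the stack with a first-in, first-out queue changes only the order in which dots are visited; the resulting subgraph numbers are the same.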

Second, we can map the random points placed during the MC integration onto a grid with a given resolution. Saving the number of points per grid cell dramatically reduces the memory requirements compared to saving all random points individually. In each grid cell, we count the number of random points that were placed inside an atom, inside a void, and inside the solvent. A grid cell is marked as solvent as soon as one random point mapped to this grid cell was evaluated to be in the solvent. All grid cells not marked as solvent are considered to be inside the protein. Searching for cavities is accomplished by separating solvent grid cells that are completely surrounded by protein grid cells from solvent grid cells that are connected to the borders of the box. This separation is achieved by a BFS algorithm as explained above. An undirected graph is built from all grid cells. Within this graph, a grid cell has a connection to a neighboring cell if both grid cells are marked as solvent. After evaluating all grid cells, at least one subgraph is found, which defines the solvent surrounding the whole protein. If additional subgraphs of solvent grid cells are found, these subgraphs are internal cavities. The volume of each internal cavity is then integrated again by a Monte Carlo algorithm, this time with a box placed only around the cavity. The resulting volume is more exact, since more random points are placed in a smaller volume. The volume is again evaluated by Eq. 1.
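The grid-based separation of bulk solvent from enclosed solvent can be sketched as follows. The boolean input array, the function name, and the use of face-sharing (6-connected) neighbors are assumptions of this example; the text does not specify which cells count as neighbors.

```python
from collections import deque
import numpy as np

STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def find_cavities(solvent):
    """Label connected groups of solvent cells; return the labels and the
    numbers of those groups that never touch the border of the box."""
    shape = solvent.shape
    labels = np.zeros(shape, dtype=int)          # 0 = protein or unvisited
    cavities = []
    current = 0
    for start in zip(*np.nonzero(solvent)):
        if labels[start]:
            continue
        current += 1
        labels[start] = current
        queue = deque([start])
        touches_border = False
        while queue:
            cell = queue.popleft()
            if any(c == 0 or c == n - 1 for c, n in zip(cell, shape)):
                touches_border = True
            for step in STEPS:
                nb = tuple(c + s for c, s in zip(cell, step))
                inside = all(0 <= c < n for c, n in zip(nb, shape))
                if inside and solvent[nb] and not labels[nb]:
                    labels[nb] = current
                    queue.append(nb)
        if not touches_border:
            cavities.append(current)             # enclosed by protein cells
    return labels, cavities
```

Each returned cavity number identifies the cells around which a smaller box can be placed for the refined Monte Carlo integration.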

### Detecting surface clefts

One problem connected to the calculation of the surface of a protein is the detection of large clefts on the surface that reach deep into the protein. A cleft is a solvent accessible pocket on the protein surface that is surrounded by protein to a given fraction. By default, our algorithm would treat a cleft with a connection to the solvent as solvent accessible, and therefore this cleft is treated as solvent and not as a cavity. Several attempts to detect surface clefts have been made [1, 2, 4, 5, 11–13, 19–24]. Our method for detecting internal cavities led us to an algorithm that is capable of detecting clefts on the protein surface. To test whether a solvent grid point belongs to a cleft, we place a box on each solvent grid point. The volume of this box is checked for points belonging to the protein or to a cleft. If more than a given percentage of grid points in the box are protein or cleft points, the solvent point is marked as cleft. Figure 3 schematically depicts the evaluation of a solvent point. This algorithm runs iteratively until no more cleft points are found. The points marked as clefts are divided into subgraphs using the BFS method described above. The determined clefts are treated like cavities in the program flow, except that the cleft volume is not reevaluated with a smaller box.
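The iterative cleft test can be sketched as follows. The integer cell codes, the box half-width, the 60 % threshold and the truncation of the box at the grid border are illustrative assumptions and are not taken from McVol.

```python
import numpy as np

SOLVENT, PROTEIN, CLEFT = 0, 1, 2

def mark_clefts(grid, half_width=1, threshold=0.6):
    """Relabel solvent cells as cleft when more than `threshold` of the
    cells in the surrounding box are protein or cleft; repeat until no
    further cell changes."""
    grid = grid.copy()
    changed = True
    while changed:
        changed = False
        for cell in zip(*np.nonzero(grid == SOLVENT)):
            # Box centered on the cell, truncated at the grid border.
            window = tuple(
                slice(max(c - half_width, 0), c + half_width + 1) for c in cell
            )
            box = grid[window]
            if np.count_nonzero(box != SOLVENT) / box.size > threshold:
                grid[cell] = CLEFT
                changed = True
    return grid
```

Because newly marked cleft cells count towards the filled fraction of their neighbors, a narrow channel is marked progressively from its bottom towards its mouth, while solvent cells above a flat protein face stay below the threshold.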

### Placing water oxygen atoms

One reason for searching for cavities in proteins is that they may contain water molecules. We place water molecules in all cavities with a volume larger than the volume of one water molecule. The number of water molecules that each cavity can hold is determined by dividing the volume of the cavity by the volume of a water molecule and rounding the result to the nearest integer. Initially, the water oxygen atoms are placed randomly inside the cavity by selecting random solvent grid nodes that are far enough from the protein atoms. Starting from this configuration, a Monte Carlo method is applied to optimize the water positions on the grid.

In the definition of the quantity $D$ that is optimized, $d(i, j)$ is the distance between water molecules $i$ and $j$, and $xyz_{\min}$ and $xyz_{\max}$ are the minimal and maximal coordinates of the cavity, respectively. $D$ is maximized by the Monte Carlo algorithm. Maximizing $D$ ensures that the placed water molecules are as far apart from each other as possible and also as far as possible from the cavity borders. The algorithm moves one water molecule in a random direction on the grid and checks whether $D$ has increased and whether a water molecule at the new position would overlap with protein atoms. If $D$ has increased and there is no overlap, the new water position is accepted; otherwise, the move is discarded. The algorithm terminates after a given number of steps. By applying this algorithm, we ensure that the cavity is evenly filled with water molecules. Since no energy criteria are applied during the placement of water molecules, it is recommended to energy-minimize the positions of the water molecules afterward.
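The accept-only-if-improved placement loop can be sketched as follows. The exact expression for the optimized quantity is not available in this text, so the score used here (the sum of all pairwise water distances plus, for every water and axis, the distance to the nearer face of the cavity bounding box) is a stand-in chosen for illustration. The overlap test is also simplified: a move is allowed only if it lands on one of the supplied cavity grid nodes, which are assumed to lie at integer multiples of the grid spacing.

```python
import numpy as np

def score(pos, lo, hi):
    """Hypothetical stand-in for the optimized quantity."""
    pair = sum(
        np.linalg.norm(pos[i] - pos[j])
        for i in range(len(pos)) for j in range(i + 1, len(pos))
    )
    border = sum(np.sum(np.minimum(p - lo, hi - p)) for p in pos)
    return pair + border

def place_waters(cavity_nodes, n_waters, spacing, n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    nodes = np.asarray(cavity_nodes, dtype=float)
    allowed = {tuple(np.round(n / spacing).astype(int)) for n in nodes}
    lo, hi = nodes.min(axis=0), nodes.max(axis=0)   # cavity bounding box
    # Random start on distinct cavity nodes.
    pos = nodes[rng.choice(len(nodes), size=n_waters, replace=False)].copy()
    best = score(pos, lo, hi)
    moves = np.vstack((np.eye(3), -np.eye(3))) * spacing
    for _ in range(n_steps):
        w = rng.integers(n_waters)                  # pick one water
        trial = pos.copy()
        trial[w] += moves[rng.integers(6)]          # one grid step
        key = tuple(np.round(trial[w] / spacing).astype(int))
        if key not in allowed:                      # stands in for the overlap test
            continue
        new = score(trial, lo, hi)
        if new > best:                              # accept only improvements
            pos, best = trial, new
    return pos, best
```

With a single water in a 3 × 3 × 3 block of nodes, the loop moves the water to the central node, which is the position farthest from all faces of the bounding box.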

### Adding a membrane to membrane proteins

For electrostatic calculations on membrane proteins, it is often required to add dummy atoms around the protein that represent the hydrophobic region of the membrane [25–27]. When such a membrane of dummy atoms is added, care must be taken that internal cavities of the protein that are potentially filled with water molecules are not filled with dummy atoms. We implemented a procedure in McVol that adds a dummy atom membrane and handles this problem.

Since the protein is placed in a box, all grid points of this box not assigned to a cavity or cleft are solvent grid points. On the basis of these grid points, McVol is capable of placing a membrane of dummy atoms around the protein. This membrane is built by defining an upper and a lower border of the membrane. All solvent grid points within these borders (defined by the z-coordinates) are considered as membrane region. Grid points that are identified as cavities are not considered as membrane region, so that potentially important water-filled cavities in the protein, for example those involved in proton transfer, are not filled with dummy atoms.
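The selection of membrane grid points can be sketched as a simple filter. The array layout (one coordinate row and one integer label per grid point, with label 0 for solvent and other values for cavities or clefts) and the function name are assumptions of this example.

```python
import numpy as np

def membrane_points(coords, labels, z_lower, z_upper, solvent_label=0):
    """Return the solvent grid points between the two membrane borders.
    Points labeled as cavity or cleft are skipped so that they are not
    filled with dummy atoms."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    in_slab = (coords[:, 2] >= z_lower) & (coords[:, 2] <= z_upper)
    return coords[in_slab & (labels == solvent_label)]
```

A dummy atom would then be placed at each returned coordinate.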

## Computational details

### Structure preparation

### Computational details

All calculations were done with 50 Monte Carlo steps per Å^{3} of the box volume and 2500 surface dots unless stated otherwise. The probe sphere radius was initially set to 1.3 Å in accordance with the water volume. The grid resolution for the initial grid was set to 1 Å; the cavity volume refinement was done with a grid resolution of 0.5 Å. Water molecules were only placed in cavities larger than 18 Å^{3}. The number of water molecules per cavity was determined by dividing the cavity volume by the volume of a water molecule and rounding the result.

## Results

### Convergence of the Monte Carlo algorithm

The number of Monte Carlo steps per Å^{3} and the number of surface points per atom are the critical parameters for the runtime of the program. Table 1 gives a short overview of the runtime of the program as a function of these two parameters. The runtime depends approximately linearly on the number of Monte Carlo steps with a slope of one. The dependence on the number of initial surface points is also linear but with a much smaller slope of about 0.01.

Table 1. Runtime of McVol (in seconds) for different parameter settings. Rows give the number of MC steps per Å^{3}; columns give the number of surface points per atom.

MC steps per Å^{3} | 500 | 1000 | 1500 | 2000 | 2500 | 3000 | 4000 | 5000 | 10000 |
---|---|---|---|---|---|---|---|---|---|
50 | 45 | 52 | 64 | 70 | 76 | 83 | 94 | 107 | 175 |
100 | 89 | 101 | 151 | 161 | 176 | 173 | 201 | 224 | 332 |
150 | 132 | 169 | 221 | 266 | 234 | 232 | 244 | 270 | 398 |
200 | 169 | 191 | 223 | 247 | 266 | 280 | 315 | 351 | 506 |
250 | 205 | 234 | 271 | 297 | 314 | 340 | 378 | 432 | 623 |

### The relation between protein volume and number of atoms

… Å^{3}/atom and a y-intercept of 102.9 Å^{3}. The y-intercept shows that the volume of the voids makes a significant contribution to the protein volume.

Volume of 15 different proteins calculated by the program McVol

Protein | # atoms | Molecular volume [Å^{3}] | Volume/# atoms [Å^{3}] | vdW-volume/void-volume |
---|---|---|---|---|
Bovine pancreatic trypsin inhibitor (1bpi) [40] | 896 | 7325 | 8.175 | 3.648 |
Hen egg white lysozyme (4lym) [34] | 1967 | 16369 | 8.322 | 3.248 |
Bacterial BLUF photoreceptor (2byc) [41] | 2262 | 17480 | 7.728 | 2.800 |
Bovine beta-lactoglobulin (1beb) [42] | 2492 | 19668 | 7.892 | 2.646 |
Ferredoxin NADP(H) reductase (2bgi) [43] | 2716 | 31616 | 11.641 | 2.454 |
Bacteriorhodopsin (1c3w) [44] | 3560 | 27483 | 7.720 | 2.788 |
Urate oxidase (1r4u) [45] | 4670 | 39054 | 8.363 | 3.155 |
Ammonium transporter (2b2f) [46] | 6140 | 45487 | 7.408 | 2.86 |
Alpha amylase (1bag) [47] | 6446 | 53168 | 8.248 | 2.397 |
Cryptochrome (1np7) [48] | 7842 | 62631 | 7.987 | 2.605 |
Glucose oxidase (1cf3) [49] | 8803 | 73259 | 8.322 | 2.324 |
BM-40 FS/EC domain pair (1bmo) [50] | 9145 | 72138 | 7.888 | 2.721 |
3-Hydroxybenzoate hydroxylase (2dkh) [30] | 9474 | 79876 | 8.431 | 3.027 |
Acetylene hydratase (2e7z) [51] | 11528 | 95304 | 8.267 | 2.363 |
Bacterial reaction center (2j8c) [26] | 16738 | 138220 | 8.258 | 2.837 |
average | | | 7.94 ± 1.84 | 2.76 ± 0.4 |

### Cavities in proteins

The major goal of the algorithm described above is to find cavities in proteins. Identification of cavities in proteins is important for developing mechanistic models of enzymatic activity, since cavities are often filled with water molecules that provide hydrogen bonds or are involved in proton transfer [31, 32]. The algorithm was applied to search for cavities in three proteins: hen egg lysozyme, bacteriorhodopsin and the photosynthetic reaction center.

#### Hen egg lysozyme

NMR experiments identified three major cavities in hen egg lysozyme [33]. Each of these cavities is well defined by a set of amino acid side chains surrounding these cavities. We applied our algorithm to hen egg lysozyme (pdb-code 4lym [34]) using a probe sphere radius of 1.3 Å, 250 Monte-Carlo steps per Å^{3} of the box volume and 2562 dots per atom on the dot surface. With this probe sphere radius we were not able to detect all of the experimentally reported cavities. Therefore we reduced the probe sphere radius to 1.1 Å. Applying our algorithm with the reduced probe sphere radius, we could reproduce the cavities proposed for hen egg lysozyme. The reduced probe sphere radius may be necessary since a water molecule is not a perfect sphere and the Bondi hydrogen radius may be too large for polar hydrogens.

… Å^{3}; therefore, cavity I was approximated to 38 Å^{3}. The “hydrated cavity” contains the water molecules 65, 70, and 75 in the pdb file 4lym. If cavity I is subtracted from the large cleft detected by our algorithm, the remaining volume of the “hydrated cavity” is 76 Å^{3}, which perfectly fits the three water molecules (see Fig. 7).

Cavities found in the hen egg white lysozyme (4lym). The calculation was done with 250 MC steps per Å^{3} box volume and 2500 surface points per atom

Cavity | Volume [Å^{3}] | SAS [Å^{2}] | Water molecules |
---|---|---|---|
I | 38^{a} | 8.8 | 2 |
II | 12 | 0.6 | 1 |
III | 22 | 4.1 | 1 |
hydrated cavity | 76 | - | 3 |

#### Bacteriorhodopsin

Cavities found in the bacteriorhodopsin (1c3w) with a probe sphere radius of 1.3 Å. The calculation was done with 250 MC steps per Å^{3} box volume and 2500 surface points per atom

Cavity | Volume [Å^{3}] | SAS [Å^{2}] | Water molecules |
---|---|---|---|
I | 22 | 2.2 | 1 |
II | 60 | 10.6 | 3 |
III | 13 | 0.4 | 1 |
IV | 43 | 9.0 | 2 |

#### Photosynthetic reaction center

_{B}). The location of the placed water molecules in the photosynthetic reaction center is shown in Fig. 9.

## Conclusion

In this work, we introduced a Monte Carlo algorithm for the calculation of protein volumes. Based on this algorithm, cavities inside the protein were located. The volume calculations are independent of any grid and therefore more accurate than the grid-based methods developed so far.

The algorithm was applied to 15 proteins of different size. We found that the ratio between the protein volume (including the volume of voids) and the number of atoms is almost the same for all sizes of proteins.

Our algorithm was able to reproduce experimentally derived cavities in the hen egg white lysozyme. The reported cavity volumes are also in good agreement with our calculations. For bacteriorhodopsin, we could locate a cavity near the Schiff base that may contain the water molecules important for the proton transfer process. An analysis of the cavities in the photosynthetic reaction center enabled us to place water molecules connecting originally separated proton transfer pathways through the protein. The Monte Carlo algorithm and the graph-theoretical analysis of the protein volume, surfaces and cavities, as well as the placement of water molecules, are implemented in the program McVol. This program is able to calculate protein volumes, solvent accessible volumes and surfaces. McVol is available free of charge from our webpage http://www.bisb.uni-bayreuth.de/software.html.

## Acknowledgements

This work was supported by the DFG grant UL 174/7-1.