Input point cloud
The reconstruction algorithm described in this paper uses an initial point cloud estimate as a basis for the growth of plant surfaces in three dimensions. Numerous software- and hardware-based techniques exist to obtain point representations of objects. We have chosen to make use of a software-based technique, patch-based multi-view stereo (PMVS) [13]. This approach reconstructs dense point clouds from any calibrated image set, and is not restricted to plant data. However, by including robust visibility constraints, it is well suited to plant datasets that contain large amounts of occlusion. Let \(\{X_{i}\}^{n}_{i=1}\) be the set of all points in an input cloud of size n. We identify the co-ordinate system used by the point cloud, and the resulting reconstruction, as “world” co-ordinates. An individual point \(p \in X\) in world co-ordinates is represented as a 3D vector \(\varvec{w}\).
A requirement of both PMVS and our reconstruction approach is that the intrinsic and extrinsic camera parameters be known. We use the VisualSFM [14] system to perform automatic camera calibration. Any number of arbitrary camera positions may be calibrated using VisualSFM, and calibration is performed quickly. However, as it is based on SIFT features [15], the approach is not suitable for images with insufficient texture and feature information. This is particularly problematic within plant datasets, where leaves may have few suitable feature points. In our real plant datasets, the surrounding scene provides an adequate feature set for correspondence. In our artificial plant dataset, a highly textured calibration target is used, and in our virtual dataset camera parameters are extracted automatically without the need for calibration. We have found in our experiments that the calibration performed within VisualSFM is sufficiently accurate to drive PMVS, and our method. Where the intrinsic parameters of the camera are known, for example, where the model and lens are kept constant, it is possible to replace VisualSFM calibration with a more robust technique, which may improve accuracy.
We capture \(N_\mathrm{cam}\) images of the scene from \(N_\mathrm{cam}\) locations to obtain a set of images \(\{I_{i}\}^{N_\mathrm{cam}}_{i=1}\). Associated with each camera location is a perspective projection matrix, based on a standard pinhole camera model [16], derived from the calibration information output by VisualSFM. For a given world point, there is a perspective projection function, \(\mathcal {V}_{i}\), that maps onto a point in a specific camera co-ordinate frame, given by the 2D vector \(\varvec{v}\). This gives a set of functions \(\{\mathcal {V}_{j}(\varvec{w}):\mathbb {R}^{3}\rightarrow \mathbb {R}^{2}\}^{N_\mathrm{cam}}_{j=1}\), where j is the index of the input image and associated camera geometry. Once in camera co-ordinates, pixel information for a given location is represented by \(I_{j}(\varvec{v})\).
PMVS makes no assumptions about the nature of the objects being reconstructed. It is likely that additional points are contained in X that comprise background or other non-plant material. Many such points will be removed by our level set approach; however, for computational efficiency many can be removed before reconstruction begins.
The point cloud is pre-filtered to remove obvious outliers; those points that differ greatly from the expected colour of the plant, or those that appear below the expected location of the plant. Two filters are applied, first a clipping plane positioned at the base of the plant is used to remove the majority of background points on the floor, container, etc. Second, colour filtering is achieved by examining the projected pixel values for every point, and removing those that do not appear green in hue. These filters are meant only as a conservative first pass, a more sensitive colour-based metric is used within the speed function during application of the level set method. The final filtered point cloud \(X^{\prime } \subseteq X\) is used in place of X for the remainder of the reconstruction process.
Point cloud clustering
The point cloud representation produced by PMVS contains no explicit surface description. Methods for the reconstruction of a surface mesh from a point cloud exist [17, 18]. Most, however, construct a single a surface describing the entire point cloud. Plants contain complex surface geometry that encourages the separation of leaves. We also wish to approach the more general problem of plant reconstruction, without assuming the connectivity or nature of the plant leaves is known. Instead, we model plant material as a series of small planar patches. Patch size is restricted to avoid fitting surfaces between nearby leaves, and to accurately model the curved nature of each leaf surface. The filtered point cloud is first clustered into small clusters of points using a radially-bounded nearest neighbour strategy [19]. Points are grouped with their nearest neighbours, as defined by a pre-set distance, and the method is extended to limit the potential size of each cluster. More formally, from the filtered cloud we obtain a set of clusters \({C_{k}}^{N_\mathrm{clus}}_{k=1}\) in which each cluster contains at least one point and all clusters are disjoint, so \(|C_{k}| > 0, \forall k\) and \(C_{k} \cap C_{l} = \emptyset , \forall k \ne l\).
This distance used for the nearest neighbour approach is dependent on the size and resolution of the model being captured. As PMVS (and laser scanning devices) usually output points with a consistent density, the distance parameter can be set once and then remain unchanged between experiments using the same image capture technique. Reducing this number will increase the number of planar sections fitted to the data, increasing accuracy at the cost of decreased algorithmic efficiency.
Our surface fitting approach begins with an approximation of the surface that will then be refined. A least-squares orthogonal regression plane is fitted to each cluster using singular value decomposition. This best fit plane minimises the orthogonal distance to each point, providing each cluster with a centre point \(\varvec{c}\), a normal vector \(\varvec{n}\), and an orthogonal vector \(\varvec{x}\) indicating the rotation about the normal. The vector \(\varvec{x}\) is aligned along the major-principle axis of the point within the cluster. We then define a set of orthographic projection functions that project individual world points into each cluster plane, \(\{\mathcal {C}_{k}(\varvec{w}): \mathbb {R}^{3}\rightarrow \mathbb {R}^{2}\}^{N_\mathrm{clus}}_{k=1}\), where \(\mathcal {C}_{k}\) represents the projection into plane k (i.e. the plane associated with cluster \(C_{k}\)). We say that points projected onto any plane now occupy planar co-ordinates. Any such point, denoted by the 2D vector \(\varvec{p}\), can be projected back into world co-ordinates by the set of functions \(\{\mathcal {W}_{k}(\varvec{p}): \mathbb {R}^{2}\rightarrow \mathbb {R}^{3}\}^{N_\mathrm{clus}}_{k=1}\).
The orthogonal projection in \(\mathcal {C}_{k}\) has the effect of flattening the points in each cluster to lie on their best fit plane, reducing any noise in individual points, and reducing the surface fitting algorithm to a 2D problem. Point and mesh surfaces generated on a cluster plane will have an associated world position that can be output as a final 3D model. An overview of the geometric projections in use within our reconstruction approach can be seen in Fig. 1.
Surface estimation
An initial surface estimate is constructed by calculating the \(\alpha \)-shape of the set of 2D points in planar co-ordinates. An \(\alpha \)-shape is a generalisation of the convex hull for a set of points, and is closely related to the commonly used Delaunay triangulation. For the incomplete leaf surfaces that exist within the input cloud, the Delaunay triangulation and convex hull represent an over-simplification of the complex boundary topology of the clusters. For a point set S, Edelsbrunner [20] defines the concept of a generalised disk of radius \(1/\alpha \), with an edge between two points in S being included in the alpha shape if both points like on the boundary of the generalised disk, and that disk contains the entire point set. The set of \(\alpha \)-shapes, produced when varying alpha, represent a triangulation of each surface at varying levels of detail. In this work, a negative value of \(\alpha \) is used, with larger negative values removing larger edges or faces. The \(\alpha \) value can be tuned for a given data set, to preserve the shape of the boundary of each reconstructed point set.
Boundary optimisation
The \(\alpha \)-shapes computed over each cluster form an initial estimate of the location and shape of the plant surface. The challenging nature of plant datasets in multi-view reconstruction means that in many instances the initial point cloud estimate will be inaccurate or incomplete. The initial surface boundaries based on these points will require further optimisation to adequately reflect the true shape of each leaf surface. Missing leaf surfaces should be reconstructed, and overlapping shapes should be optimised to meet at a single boundary. Many methods, such as active contours [21], parameterise the boundary shape before attempting this optimisation. However, such approaches are ill suited to the complex boundary conditions produced by \(\alpha \)-shapes. For any value of \(\alpha < 0\), the surface may contain holes or disjoint sections, and as such many surfaces will change topology during any boundary optimisation process.
Tracking of such complex boundaries can be achieved using the level set method [22, 23]. The method defines a 3D function \(\varphi \) that intersects the cluster plane, with a single level set being initialised for each surface patch. \(\varphi \) is represented as a signed distance function, initialised such that negative values lie within our \(\alpha \)-shape boundary, and positive values occur outside. Thus, the boundary itself is defined as the set of all points in \(\varphi \) that intersect the cluster plane, given as:
$$\begin{aligned} \Gamma = \{(x,y) | \varphi (x,y) = 0 \}. \end{aligned}$$
(1)
A speed function determines the rate of change of \(\varphi \). It may be based on both global and local parameters, and will act to grow or shrink the boundary \(\Gamma \) as necessary to fit the underlying data. The change in \(\varphi \), based on a speed function v, is defined as
$$\begin{aligned} \frac{\partial \varphi }{\partial t} = -v \cdot |\Delta \varphi |, \end{aligned}$$
(2)
where \(\Delta \varphi \) is the gradient of the level set function at a given point, which we calculate through Godunov’s upwinding scheme. The speed function is defined as
$$\begin{aligned} v = v_\mathrm{curve}+v_\mathrm{image}+v_\mathrm{inter}, \end{aligned}$$
(3)
where \(v_\mathrm{curve}\) is a measure of the local curvature, calculated using a central finite difference approximation
$$\begin{aligned} v_\mathrm{curve}=\omega \cdot \frac{\varphi _{xx}\varphi ^{2}_{x} - 2\varphi _{y}\varphi _{x}\varphi _{xy} + \varphi _{yy}\varphi ^{2}_{x}}{(\varphi ^{2}_{x}+\varphi ^{2}_y)^{3/2}}. \end{aligned}$$
(4)
The curvature term encourages the boundary of the level set to remain smooth. The weighting \(\omega \) is required to prevent curvature from dictating the movement of the front, in cases where the boundary is already sufficiently smooth.
The image term, \(v_\mathrm{image}\), references colour information in the input images to ascertain whether the projection of the planar surface lies over regions with a high likelihood of containing leaf material. To achieve this, the function \(\varphi \) is discretized and uses the planar co-ordinate system, each planar point \(\varvec{p}\) maps to a position on \(\varphi \), and any point on \(\varphi \) will have an associated planar position. By performing consecutive projections, we are able to examine the relevant location in any image of a cluster plane position. Such a projection is given as \((\mathcal {V}_{i} \circ \mathcal {W}_{k})(\varvec{p}):\mathbb {R}^{2}\rightarrow \mathbb {R}^{2}\), where k is the cluster index, and i is the camera index. Not every image will provide a helpful view of every cluster, they may be out of the camera’s field of view, or seen at an oblique angle. One reference view is chosen from which to obtain colour information, as follows. We choose a reference image \(I_{R} \in I\) that represents a calculated “best view” of a planar surface. Selection of the reference view begins by projecting each cluster into each camera view. Only the interiors (triangular faces) of each \(\alpha \)-shape are projected using a scan-line rasterisation algorithm. Attached to each projected position is a z depth, calculated as the third component output from the function \(\mathcal {C}_{i}(\varvec{w})\) when using homogenous co-ordinates. This z depth represents the distance that the projected point lies from the camera’s image plane, and can be used to sort clusters that project onto the same location. Projections with the lowest z value are seen in front of, so occlude, those with higher z values.
The projection locations and z depths for all clusters are analysed using a series of z-buffer data structures, one z-buffer associated with each input image. We define the z-buffers as a set \(\{Z_{i}\}^{N_\mathrm{cam}}_{i=0}\), where each buffer contains pixel locations in camera co-ordinates that map directly to the corresponding image. For each image location, any cluster that can be seen in (i.e. projects onto) that point is recorded in the z-buffer. A given position \(Z_{i}(\varvec{v})\) contains a depth sorted list of all clusters that project into that camera co-ordinate, i.e. \(Z_{i}(\varvec{v}) = (C_{0}, \ldots , C_{n})\).
It is desirable to select camera views that contain as little interference between clusters as possible. For a given z-buffer j, and a given cluster i, we can calculate the following measure:
$$\begin{aligned} \mathcal {V}^\mathrm{clear}_{j}(i)=|\{\varvec{v} | i \in Z_{j}(\varvec{v}) \wedge |Z_{j}(\varvec{v})|= 1 \}|. \end{aligned}$$
(5)
The clear pixel count represents a measure of the number of pixels each cluster projects into for a given image. This value reflects both the proximity of the cluster to the camera plane, and the angle of incidence between the camera view and the cluster plane. The clear pixel counts for all projections of a given cluster i are normalised to the range [0, 1]. This measure does not include pixel positions shared by other clusters, to avoid heavily occluded views affecting the normalised value. The amount of occlusion for each cluster i, in a given z-buffer j is calculated as:
$$\begin{aligned} \mathcal {V}^\mathrm{occluded}_{j}(i) = \frac{|\{ \varvec{v} | i \in Z_{j}(\varvec{v}) \backslash \{Z_{j}(\varvec{v})_{(1)}\} \wedge |Z_{j}(\varvec{v}) |> 1 \} |}{|\{ \varvec{v} | i \in Z_{j}(\varvec{v}) \} |}, \end{aligned}$$
(6)
$$\begin{aligned} \mathcal {V}^\mathrm{occluding}_{j}(i) = \frac{|\{ \varvec{v} | i \in Z_{j}(\varvec{v}) \backslash \{Z_{j}(\varvec{v})_{(n)}\} \wedge |Z_{j}(\varvec{v}) |> 1 \} |}{|\{ \varvec{v} | i \in Z_{j}(\varvec{v}) \} |}. \end{aligned}$$
(7)
where \(Z_{i}(\varvec{v})_{(k)}\) is the \(k\mathrm{th}\) ordered element of \(Z_{j}(\varvec{v})\). \(\mathcal {V}^\mathrm{occluded}_{j}(i)\) can be read as “the percentage of cluster i that projects into z-buffer j behind at least one other cluster.” Similarly, \(\mathcal {V}^\mathrm{occluding}_{j}(i)\) can be read as “the percentage of cluster i that projects into z-buffer j in front of at least one other cluster.” Thus, a combination of normalised clear pixel count, occlusion and occluding percentages can be used to sort images in terms of view quality. A reference image, \(I_{R}\), is chosen where
$$\begin{aligned} R=\mathrm{argmax}_{j}(\mathcal {V}^\mathrm{clear}_{j}(i)(1-\mathcal {V}^\mathrm{occluded}_{j}(i))(1-\mathcal {V}^\mathrm{occluding}_{j}(i))). \end{aligned}$$
(8)
Penalising views that present occlusion with respect to each surface will help ensure that self-occlusion is prevented from affecting the reconstruction accuracy, only a single view of each surface needs to have an unobscured view for reconstruction of that patch to be successful.
When referencing pixel values using the image \(I_{R}\), we use a normalised green value to measure the likelihood of leaf material existing at that location,
$$\begin{aligned} \mathcal {N}_{j}(\varvec{v})=\frac{I_{j}(\varvec{v})_{(\mathrm{green})}}{I_{j}(\varvec{v}_{(\mathrm{red})}+I_{j}(\varvec{v}_{(\mathrm{green})}+I_{j}(\varvec{v}_{(\mathrm{blue})})}. \end{aligned}$$
(9)
We can assume that normalised green values will be higher in pixels containing leaf material, and lower in pixels containing background. Where lighting conditions remain consistent over an image set, we can also assume that distribution of normalised green values are the same over the each image in I. However, between different image sets we cannot assume that the properties of the normalised green values are known. These properties must be ascertained before \(\mathcal {N}_{i}\) can be used to contribute to the \(v_\mathrm{image}\) term in the speed function. We sample from all images those pixels that are projected into by the \(\alpha \)-shapes, and use Rosin’s unimodal thresholding approach [24] to threshold below the normalised green peak that is observed. Using this threshold, the mean and standard deviation of the peak are calculated, and used to produce an image speed function centred around the calculated threshold t, with a spread based on the standard deviation of the peak:
where t is the threshold calculated using Rosin’s method, and \(\sigma \) is the standard deviation of the \(\mathcal {N}_{j}\) peak. A width of \(2\sigma \) was chosen as a value that characterises the spread of the normalised green values.
The final component of the speed function, \(v_\mathrm{inter}\), works to reshape each surface based on the location and shape of nearby clusters. As each cluster may have different normal orientations, it is challenging to calculate their 3D intersections in terms of 2D positions in planar co-ordinates. Indeed, two nearby clusters that could be considered as overlapping, may not intersect in world co-ordinates. Instead we project each planar position into \(I_{R}\), and examine the interactions in the 2D camera co-ordinate system.
Any overlapping projections are calculated by maintaining z-buffers that update as each region reshapes. The function \(v_\mathrm{inter}\) is calculated such that each cluster in \(Z_{j}(x)\) is penalised except for the front-most cluster. Thus, for a cluster i, the function is calculated as:
$$\begin{aligned} v_\mathrm{inter}= {\left\{ \begin{array}{ll} p-v_\mathrm{image}, &{} Z_{j}(\varvec{v})_{1} \ne i \\ 0, &{} \mathrm{otherwise} \end{array}\right. }, \end{aligned}$$
(11)
where p is a small negative value such that the level set boundary \(\Gamma \) shrinks at this location. Note that the subtraction of \(v_\mathrm{image}\) results in the image component being ignored where clusters are occluded.
The complete speed function is used to update each discrete position on the level set function \(\varphi \). This process must be repeated until each cluster boundary has reshaped to adequately fit the underlying image data. The speed function will slow significantly as the boundary approaches an optimal shape. Where a level set boundary no longer moves with respect to the reference image (does not alter the number of projected pixels), we mark this cluster as complete and discontinue level set iterations. Any level sets that do not slow significantly will continue until a maximum time is elapsed, a parameter that can be set by the user. We typically use a value of 100–200 iterations as a compromise between computational efficiency and offering each level set adequate time to optimise.
Model output
Once all clusters have been iterated sufficiently, each surface triangulation must be re-computed. The level set function provides a known boundary that was not available during the original surface estimation. This can be used to drive a more accurate meshing approach that will preserve the contours of each shape. We use constrained Delaunay triangulation for this task [25]. A constrained triangulation will account for complex boundary shape when producing a mesh from a series of points; however, it will not over-simplify the boundary by fitting surfaces across concave sections, and can retain holes in the surface if required. Points are sampled from the boundary of each surface, and a constrained triangulation is fitted. This process will automatically generate additional points, where required, within the shape itself. As each point in the new triangulation exists in planar co-ordinates, they can be easily back-projected into world co-ordinates to be output in a 3D mesh format.