Step 1: find the stops
A GPS typically reports time, latitude, and longitude. It may also report speed and heading, but these can be computed from the other three data elements.
- Scan the sequence of GPS readings to identify stops, which we take to be any maximal sub-sequence of readings for which the speed is sufficiently small. Because of inaccuracy, a GPS may report a speed greater than zero even when the truck is stopped; to account for this, we treat any speed less than 5 km/h as a stop.
- Convert each such maximal sub-sequence into a single stop, located at the average latitude and longitude of the constituent GPS readings. If the GPS recorded every Δ seconds, assume the stop began Δ/2 s before the first reading and ended Δ/2 s after the last reading.
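The two operations above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the names `Stop`, `haversine_m`, and `find_stops` are hypothetical, speed is assumed to be estimated from consecutive readings with the haversine formula, and the first reading is assumed to share the speed of the second.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000.0
SPEED_THRESHOLD_KMH = 5.0  # threshold stated in the text


@dataclass
class Stop:
    lat: float
    lon: float
    start: float  # seconds
    end: float    # seconds


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def _collapse(run, delta_s):
    """Average a run of slow readings into one stop, padded by delta/2 on each side."""
    n = len(run)
    return Stop(
        lat=sum(r[1] for r in run) / n,
        lon=sum(r[2] for r in run) / n,
        start=run[0][0] - delta_s / 2,
        end=run[-1][0] + delta_s / 2,
    )


def find_stops(readings, delta_s):
    """readings: list of (t_seconds, lat, lon), sorted by time, sampled every delta_s."""
    # Assumption: the speed at reading k is the average speed since reading k-1;
    # the first reading reuses the speed computed for the second.
    speeds = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(readings, readings[1:]):
        speeds.append(haversine_m(la0, lo0, la1, lo1) / (t1 - t0) * 3.6)
    speeds = speeds[:1] + speeds

    stops, run = [], []
    for reading, v in zip(readings, speeds):
        if v < SPEED_THRESHOLD_KMH:
            run.append(reading)          # extend the current maximal slow run
        elif run:
            stops.append(_collapse(run, delta_s))
            run = []
    if run:                              # close a run that lasts to the end of the trail
        stops.append(_collapse(run, delta_s))
    return stops
```

If the GPS already reports speed, the haversine step can be dropped and the reported values compared with the threshold directly.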
Step 2: find the path that best matches the sequence of stops
We require a description of the terminal, such as that shown in Fig. 1, in which each yard is circumscribed by a geofence. In addition, we require a list of possible sequences of services that might be required by a truck. Typically, this can be enumerated by listing each type of container that might be dropped off at the terminal and each type that might be picked up and the services required by each. It is also helpful, but not necessary, to know a lower bound on the time required for each service, so we can more easily recognize the corresponding stop.
To find the most plausible path, we first construct, for each candidate path, the most plausible association of the sequence of GPS stops \(i = 1, \ldots , m\) to the sequence of services \(j = 1, \ldots , n\).
Assume we have associated stops \(1, \ldots , i - 1\) with services \(1, \ldots , j - 1\), and we are now considering GPS stop i and service j. There are four possible interpretations of the relation of stop i to service j:
1. GPS stop i represents arrival at a yard providing service j.
2. GPS stop i represents an additional stop in a yard providing service j. (This might happen if, for example, the truck were creeping forward in a queue and so generating a sequence of stops.)
3. GPS stop i should be ignored because it does not plausibly correspond to any yard providing service j.
4. Service j was visited, but the GPS failed to record the stop, so the truck appears to have skipped it.
We model the plausibility of each interpretation as a cost, where higher cost means less plausible. These costs follow some basic rules that reflect what is known about GPS accuracy and typical times required for each service:
In general, the farther a stop is from a particular yard, the less likely the stop was for a service provided by that yard, and so the higher the cost. Similarly, the less the duration of a stop resembles that expected for a particular service, the less likely the stop was for that service, and so the higher the cost.
There are many reasonable ways to model the costs, consistent with the guidance above. We chose to model the cost of associating stop i with service j as
$${\text{match}}\left( {i,j} \right) = \left( {\frac{{{\text{meters from}}\; i\;{\text{to}}\;j}}{20}} \right)^{2},$$
where we define the distance from GPS stop i to service j to be the shortest distance to the perimeter of any yard providing service j. If the GPS reading is in the interior of such a yard, the distance (and therefore the cost) is defined to be negative. This cost function assesses small penalties for small distances, but increases quickly as distances exceed 20 m, a distance that we believe represents a typical outer limit to GPS error. This severely penalizes implausible matches and strongly rewards matches when the GPS stop lies well within the yard. It is less assertive when the GPS stop is close to the perimeter.
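A minimal sketch of this cost follows, under several assumptions that are not confirmed by the text. Geofences are taken to be simple polygons in a local planar coordinate system measured in metres. Because squaring alone would discard the sign of an interior distance, the sketch reapplies the sign after squaring so that interior stops receive a negative cost. When several yards provide the service, the smallest signed distance is used. All function names are hypothetical.

```python
import math

GPS_ERROR_SCALE_M = 20.0  # the 20 m scale from the formula above


def _edges(polygon):
    """Consecutive vertex pairs, including the closing edge."""
    return zip(polygon, polygon[1:] + polygon[:1])


def _dist_to_segment(p, a, b):
    """Euclidean distance from point p to segment ab (all in metres)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        t = 0.0
    else:
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def _inside(p, polygon):
    """Ray-casting point-in-polygon test."""
    px, py = p
    inside = False
    for (ax, ay), (bx, by) in _edges(polygon):
        if (ay > py) != (by > py) and px < ax + (py - ay) * (bx - ax) / (by - ay):
            inside = not inside
    return inside


def signed_distance_m(p, polygon):
    """Distance to the perimeter; negative when p is in the interior."""
    d = min(_dist_to_segment(p, a, b) for a, b in _edges(polygon))
    return -d if _inside(p, polygon) else d


def match_cost(stop_xy, yards):
    """yards: geofence polygons of every yard providing the service."""
    d = min(signed_distance_m(stop_xy, yard) for yard in yards)
    # Assumption: keep the sign of d, so interior stops are rewarded.
    return math.copysign((d / GPS_ERROR_SCALE_M) ** 2, d)
```

For a square yard 100 m on a side, a stop at its centre is 50 m inside the perimeter and receives cost -(50/20)^2 = -6.25, while a stop 40 m outside receives cost (40/20)^2 = 4.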
We modeled the cost of failing to find a stop at service j, and so appearing to skip that service, as a linear function that increases with the expected duration of that service. This reflects the fact that, with GPS readings every 10 s, it would be easy to miss a service that required only 10 s or less; but if the service is expected to require 10 min, then one would expect GPS readings to show a stop comparably long. It is unlikely, and therefore of high cost, to assume the GPS entirely missed such a stop.
Finally, we modeled the cost of ignoring GPS stop i as the product of a location cost and a duration cost. The location cost varies inversely with the distance to service j, and the duration cost increases directly with the duration of the stop. Ignoring a long stop close to or inside a yard providing service j is therefore expensive, because such a stop is likely to be related to that service; conversely, ignoring a short stop far from service j is cheap, because such a stop is unlikely to be for that service.
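The text gives only the functional forms of the skip and ignore costs, not their coefficients. The sketch below illustrates those forms with placeholder constants; `base`, `per_second`, and `min_distance_m` are hypothetical parameters chosen purely for illustration and are not the values used in the study.

```python
def skip_cost(expected_duration_s, base=1.0, per_second=0.01):
    """Linear in the expected duration of the service (coefficients are placeholders)."""
    return base + per_second * expected_duration_s


def ignore_cost(distance_m, stop_duration_s, min_distance_m=1.0, per_second=0.01):
    """Product of a location cost and a duration cost.

    The location cost is inverse in distance. Distances at or inside the
    geofence are clamped to min_distance_m (an assumption), so that ignoring
    such a stop is as expensive as the model allows.
    """
    location_cost = 1.0 / max(distance_m, min_distance_m)
    duration_cost = per_second * stop_duration_s
    return location_cost * duration_cost
```

With these forms, skipping a 10 min service costs more than skipping a 10 s service, and ignoring a long stop next to a yard costs more than ignoring a brief stop far from it.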
We can now write the least cost (most plausible) association of GPS stops to services for a particular path (sequence of services) as a dynamic programming recursion. Let \(C(i,j)\) be the cost of the most plausible association of stops 1 through i to services 1 through j. Then
$$C(i,j) = \min \begin{cases} \text{match}(i,j) + C(i-1,\;j-1) \\ \text{match}(i,j) + C(i-1,\;j) \\ \text{skip}(j) + C(i,\;j-1) \\ \text{ignore}(i) + C(i-1,\;j) \end{cases}$$
and the minimum cost association of stops to services is that which corresponds to \(C\left( {m,n} \right)\).
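The recursion can be evaluated bottom-up in a table of size (m+1) by (n+1). The sketch below is one way to do so. The cost functions are passed in as callables with 1-based indices, matching the recursion. The boundary conditions are an assumption, since the text does not state them: with no services left every stop is ignored, and with no stops left every service is skipped. The name `best_association_cost` is hypothetical.

```python
def best_association_cost(m, n, match, skip, ignore):
    """Return C(m, n) for m GPS stops and n services.

    match(i, j), skip(j), and ignore(i) take 1-based indices.
    """
    C = [[0.0] * (n + 1) for _ in range(m + 1)]

    # Assumed boundary conditions (not given in the text).
    for i in range(1, m + 1):
        C[i][0] = C[i - 1][0] + ignore(i)
    for j in range(1, n + 1):
        C[0][j] = C[0][j - 1] + skip(j)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            C[i][j] = min(
                match(i, j) + C[i - 1][j - 1],  # stop i is the arrival at service j
                match(i, j) + C[i - 1][j],      # stop i is an additional stop at service j
                skip(j) + C[i][j - 1],          # service j left no recorded stop
                ignore(i) + C[i - 1][j],        # stop i is spurious
            )
    return C[m][n]
```

Scoring each candidate path with this function and taking the path with the smallest result gives the best interpretation of a trip. The table has (m+1)(n+1) entries and each is filled in constant time, so evaluating one path takes time proportional to mn.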
We use this recursion to evaluate the plausibility of each of the standard paths: each path receives a score giving its total cost, and the path with minimum total cost is the best interpretation of the trip as recorded by GPS. For example, among the sequences of expected services, the stops of Fig. 1 best match the sequence 6-8-4-1-0-9-2-6, which corresponds to delivery of an empty container, followed by pick-up of an import container with customs clearance.
Step 3: remove implausible matches
Our GPS dataset included some trips by trucks dispatched for purposes other than swapping containers. Such trips did not follow one of the expected paths; indeed, they may not have visited the terminal at all. Nevertheless, some path will have been identified as the best match, even though it is a very poor match. Such trips should be identified and purged from the study of service times. Unless the trucking company distinguishes between types of trips, purging must be done programmatically. The costs of the best match provide clues to plausibility, but it is not always obvious how to interpret them. For example, a small total cost may be due to a good match or to a bad match to a path with few services.
The problem of separating the implausible best matches from the plausible ones may be viewed as a problem of statistical binary classification, one of the classic problems of machine learning. We chose to recognize implausible matches by means of a classification tree, for which we used the ID3 algorithm (Quinlan 1986). To build the classification tree, we plotted the GPS trails of hundreds of trips and compared the maps with the computed best matches. Each best match was labeled as plausible or implausible (the best match was judged implausible for about 20% of the trips). The tree was then constructed on the following statistics, the first seven of which are the predictor variables generated by the matching, and the last of which was the judgement of an expert human.
- Number of GPS stops within the terminal.
- Number of services required by the path.
- Number of services unmatched.
- Fraction of services unmatched.
- Total cost of matching GPS stops with service.
- Total cost of ignoring some stops as spurious.
- Total cost of seeming to skip a service required by this path.
- Was the best match plausible or implausible?
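ID3 grows a tree by repeatedly splitting on the attribute that yields the largest information gain, that is, the largest reduction in the entropy of the class labels. The sketch below computes that gain for a single numeric predictor split at a threshold. Thresholding a numeric attribute is an assumption made here for illustration, since ID3 as originally described splits on categorical attributes; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on values <= threshold vs. values > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing gains nothing
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder
```

A tree builder would evaluate this gain for each predictor and each candidate threshold, split on the best pair, and recurse on the two resulting subsets.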
The resultant classification tree revealed a simple test that correctly identified more than half of the implausible best matches, with very few false positives. The test is: if the total cost of ignoring GPS stops is sufficiently large and the total cost of matches is sufficiently small, then the best match is implausible. This makes sense because a spurious trip may have entered and left through the main gate of the terminal, which matched some GPS stops, while its other GPS stops did not visit any service area and so were ignored in the best match. In other words, this test filtered out many of the trips made for reasons other than swapping containers.
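Expressed as code, the test is a conjunction of two threshold comparisons. The text does not report the thresholds learned by the tree, so the sketch below takes them as required arguments; `ignore_threshold` and `match_threshold` are hypothetical names, and the function is an illustration rather than the rule as implemented in the study.

```python
def is_implausible(total_ignore_cost, total_match_cost, ignore_threshold, match_threshold):
    """Flag a best match as implausible when many stops were ignored
    yet the few matched stops fit well."""
    return total_ignore_cost >= ignore_threshold and total_match_cost <= match_threshold


def keep_plausible(trips, ignore_threshold, match_threshold):
    """trips: iterable of (trip_id, total_ignore_cost, total_match_cost)."""
    return [
        trip_id
        for trip_id, ignore_total, match_total in trips
        if not is_implausible(ignore_total, match_total, ignore_threshold, match_threshold)
    ]
```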
For the final study, we removed all trips for which the best match was labeled implausible. The remaining trips were those for which we had confidence that the best-matched paths correctly described the sequence of services for each trip. From these, we derived the distribution of times spent in each service, as well as times spent queueing and times spent driving.