To a larger extent than we like to think, our judgements and decisions are affected by irrelevant factors. Fortunately, there are patterns to our irrationality. We are, in a sense, predictably irrational [1]. These patterns of irrationality are what we call judgement and decision biases. This chapter describes some of the biases relevant to understanding when and why we make systematic time prediction errors. Better knowledge about the biases and fallacies may help us become better at designing time prediction processes and avoiding situations and information that mislead us.

5.1 The Team Scaling Fallacy

Let us say that you enjoy playing with Lego but not building it, so you decide to hire a team to build a Lego model for you. The team will bill you for the total amount of work, that is, the sum of the time spent on the task of all the workers on the team. If you want to minimize your cost, would you hire a two- or four-person team to complete your Lego construction work?

Usually you will benefit from hiring the smaller team, because the four-person team would spend more time coordinating the work and thus cost you more. This expected decrease in productivity with more people was reflected in the time predictions of the participants of a study of Lego-building teams [2]. Those in the four-person teams predicted, on average, that they would spend a total of 30 minutes on the task, whereas those in the two-person teams predicted that they would spend a total of 23 minutes. The two- and four-person teams both tended to predict too low a time usage, but those in the four-person teams gave the most overoptimistic time predictions. The average actual time usage of the two-person teams was 36 minutes, 55% higher than they had predicted, while the average time usage of the four-person teams was 53 minutes, 75% higher than predicted. Although the participants took coordination costs into account, as reflected in the higher time predictions of the four-person teams, they did not do so sufficiently. This finding, that people tend to neglect the true increase in coordination costs with increasing team size, has been named the team scaling fallacy.

The team scaling fallacy in the Lego-building study was not limited to the people actually doing the building. When external judges, who were students from another university, were asked to predict the total time usage of the Lego-building teams, the omission of coordination costs was even more severe. These judges tended to predict that the four-person teams would, altogether, have a lower time usage than the two-person teams, resulting in time predictions that would produce, on average, 140% time overrun for the four-person teams but only 45% overrun for the two-person teams. When asked to explain their time predictions, the external judges appeared to focus more on the benefits of cooperation, such as synergies, than on the costs of coordinating more people.
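
The overrun percentages reported above follow directly from the predicted and actual minutes. As a minimal sketch of this arithmetic (the figures are those from the Lego study; the helper function is our own, and the unrounded results differ by a point or two from the percentages quoted in the text):

```python
def overrun_pct(predicted, actual):
    """Percentage time overrun relative to the prediction."""
    return 100 * (actual - predicted) / predicted

# Average predicted and actual time usage (minutes) reported for the Lego study
print(round(overrun_pct(23, 36)))  # two-person teams: ~57% (reported as about 55%)
print(round(overrun_pct(30, 53)))  # four-person teams: ~77% (reported as about 75%)
```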

The team scaling fallacy also seems to arise in the time usage predictions of larger projects. Studies have found that IT projects with more people have a higher likelihood of cost overrun [3, 4]. Documenting the size of the team scaling fallacy using real-life data is somewhat problematic, since we do not know whether more workers lead to greater coordination costs and cost overrun or whether more workers are allocated to projects that are problematic in the first place. The Lego study, on the other hand, does not suffer from such problems in interpretation and reliably demonstrates the team scaling fallacy in a controlled setting.

Another example of neglecting coordination costs is the common belief that merging organizations will reduce costs and improve productivity. Much of the available research contradicts this belief. Consider a small research institute with about 35 researchers that is merged with a larger research institute of about 150 researchers. According to a study on coordination costs in research units, the number of administrative staff attributable to the small institute will grow from the seven originally needed for its 35 researchers to about 12, which is its roughly 20% share of the total administrative staff needed for a merged organization of 35 + 150 = 185 researchers. This prediction is based on the following evidence-based relation between administrative and academic staff [5]:

$$ Administrative\,staff = 0.07 \times Academic\,staff^{1.3} $$

The important property of the formula is its nonlinearity. A doubling in the number of academic staff does not require merely a doubling of the administrative staff but, rather, an increase by a factor of about 2.5 ($=2^{1.3}$). When the number of academic staff is 10 times higher, the administrative staff needs to be as much as 20 ($=10^{1.3}$) times larger.
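
A short numerical sketch of this relation and of the merger example above (the coefficient and exponent come from the formula; the code itself is only illustrative):

```python
def admin_staff(academic_staff):
    """Administrative staff predicted from academic staff, as in [5]."""
    return 0.07 * academic_staff ** 1.3

small, large = 35, 150
before = admin_staff(small)              # about 7 administrators for 35 researchers
merged = admin_staff(small + large)      # about 62 administrators for 185 researchers
share = small / (small + large)          # the small institute's ~19% share
print(round(before), round(share * merged))  # 7 -> 12

# The nonlinearity itself: a doubling or a tenfold increase in academic staff
print(round(2 ** 1.3, 2), round(10 ** 1.3, 1))  # 2.46 and 20.0
```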

This formula was developed for academic organizations and there may be organizational growth and mergers that lead to much smaller increases or even a decrease in the need for administration. It is nevertheless a useful reminder of how coordination and administration tend to increase nonlinearly with team and organization size.Footnote 1 Good project planners and organizers are well aware of this effect, and are able to take the disproportionally higher costs of administration in larger projects into account, but it is not uncommon to neglect or underestimate this increase. As an illustration of how coordination costs can increase more than we would intuitively expect, recall that there was as much as a 50% increase in total time usage when going from a two- to a four-person Lego construction team.

Awareness of the team scaling fallacy is also important when predicting time based on past time usage in other projects or tasks. For example, it is not a good idea to use the unadjusted productivity of one project to predict the time usage for another project when their team sizes are very different.

  • Take home message: A workforce that is twice as large tends to deliver less than twice as much output per time unit due to an increase in the need for coordination. When predicting time usage for projects with many people, one usually needs to include a larger proportion of work on project management and administration and assume lower productivity than in smaller projects.

5.2 Anchoring

The anchoring effect may be said to be the king of human biases. Many biases arise only in some situations and, even when they do, their effects tend to be small. Studies on anchoring, on the other hand, hardly ever fail to show large effects. So, what is anchoring?

The most famous study of the anchoring effect involved a rigged wheel of fortune and asked the question ‘What percentage of the members of the United Nations (UN) are African countries?’ First, the research participants spun the wheel, which stopped at 10 or 65, depending on how the wheel was rigged, and were asked whether they thought the percentage of African countries in the UN was more than or less than the number on the wheel. Following that question, the participants were asked to predict the proportion of African countries in the UN. The difference in answers between two groups of participants, one with the wheel stopping at 10 and another at 65, was striking: those in the first group gave a median prediction of 25% African countries in the UN, while those in the second group gave a median prediction of 45% [6]. It is hard to imagine that the participants would think that a number on a wheel of fortune, that they believed gave a random number between zero and 100, revealed any information about the actual proportion of African countries in the UN. They were, nevertheless, strongly affected by the number presented to them.

Many anchoring studies in time prediction contexts follow the same procedure [7]. Study participants are first introduced to a task description and asked whether they think the task will take more or less time than a given time usage, which plays the role of an anchoring number. Typically, one group of participants is presented with a high time usage anchor and another with a low time usage anchor. Subsequently, all the participants are asked to predict the time required to complete the task. This procedure alwaysFootnote 2 produces time predictions that are biased towards the anchoring number. Even anchoring numbers that are completely unrelated to the time prediction task, such as digits from Social Security numbers or phone numbers, may strongly affect people’s predictions. More relevant numbers, such as the past time usage of a task, are usually—but not always [8]—more potent anchors than completely irrelevant ones [9].

The relevance of the anchoring bias outside artificial experimental settings is well documented. We found, for example, that software professionals’ time predictions were strongly affected by knowledge about what a customer had communicated as an expectation of time usage, despite knowing that the customer had no competence in predicting the time usage [10]. When the software professionals were asked whether they thought they had been affected by the customer’s expectations, that is, the anchoring information, they either denied it or responded that they were affected only a little. This feeling of not being much affected when, in reality, one is being affected a great deal, is part of what makes the anchoring bias so potent and hard to avoid.

What if the customer’s expectation represents a totally implausible time usage anchor? In an experiment with software professionals [11], we informed one group of participants that the customer expected a task to take about 800 hours (a very high anchor), another group that the expected time usage was 40 hours (a rather low anchor), and a third group that the customer expected the task to take only four hours (an implausibly low anchor). All participants were instructed to dismiss the anchoring information when predicting. Those in a control group, who received no information about the customer’s expectations, gave a median time prediction of 160 hours. Those in the high anchor group (800 hours) gave the highest time predictions, with a median of 300 hours. The rather low anchor group (40 hours) gave a median time prediction of 100 hours. The most striking finding was, however, that those with the implausibly low anchor (four hours) gave even lower time predictions, with a median of 60 hours. This group was even more affected than the group given the somewhat more realistic low anchor. Anchoring studies in other contexts show similar results. Even extreme anchors or suggestions, for instance, that the length of a whale is 900 metres (an unreasonably high anchor) or 0.2 m (an unreasonably low anchor), are at least as effective in influencing people’s predictions as more realistic anchors are [12]. Thus, the effect of anchors does not always depend on their realism or on a belief that they reveal relevant information.

Anchoring effects are fairly robust to all kinds of warnings, and there is so far no effective strategy for removing them. The following are instructions from two different studies on anchoring:

  • The client does not want you to be affected by his cost expectation in your estimation work, and wants you to estimate the effort that you most likely will need to develop a quality system that satisfies the needs described in the requirement specification [11].

  • I admit I have no experience with software projects, but I guess this will take about two months to finish. I may be wrong, of course; we’ll wait for your calculations for a better estimate [13].

Although the above warnings cast serious doubt on the relevance of the initial time predictions or expectations of the customers, that is, the time prediction anchors, they did not even come close to removing the influence of anchors.

One does not need numbers to produce anchor-like effects. In one study, the exact same software development task was described as either developing new functionality, a description usually applied for larger pieces of work, or as a minor extension, a description usually applied for smaller, simpler tasks [11, 14]. Those who received the task described as a minor extension gave much lower time predictions than those predicting the time to develop new functionality.

Evidence on the importance of the anchoring effect includes findings from randomized controlled field experiments. In one such experiment, actual software development companies were paid for giving second-opinion time predictions based on project information [15]. Half of the companies received the project information with different variants of anchoring information included. The anchoring information seemed to have a somewhat weaker effect than is typically reported in laboratory-based studies. The strongest effect was found for a low anchor, in the form of a short expected completion time (‘the work should be completed in three weeks from the startup date’). In reality, a short development period would lead to more—not less—time usage, because a larger team of people would be required to complete the project on time and more people means higher coordination costs. The companies, on the other hand, gave lower time predictions when the development period was short.

A real-life case of the sometimes devastating effect of anchoring involved a Norwegian public agency that, a few years ago, invited software companies to bid for a software project. As part of the announcement, the agency stated that its initial budget was €25 million. The initial budget was based on a so-called informal dialog with the market. As expected from what we know about the anchoring effect, the time predictions received from the bidders typically represented costs close to €25 million. The actual cost, however, turned out to be around €80 million and the project ran into huge problems, partly due to the vast underestimation of cost and time, and was eventually cancelled.Footnote 3

The anchoring effect is sometimes used to our disadvantage, such as when a minimum payment requirement is set on a credit card bill. A study found that, when this minimum amount was removed from the bill, repayments increased by 70% [17]. A low minimum payment on a credit card bill, representing a low anchor, makes you pay off less debt, which, in the long run, produces higher costs for you and higher profits for those who presented the low anchor value.

There is no single explanation why anchors affect people’s time predictions. One explanation is that an anchor triggers associations [12], for instance, a low anchor makes you think about tasks and solutions that are easy and quick to carry out. Another explanation is that people start out at the anchor value and adjust until they arrive at what they think is a reasonable time prediction. Since the range of reasonable time predictions can be large, the first value that seems reasonable after adjusting from the anchor will be too close to the anchor. In other words, people adjust insufficiently [18]. A third explanation is based on conversational norms. If you ask whether my project will require more or less than 30 work hours, I will assume that you believe 30 hours is a plausible prediction or you would not ask such a question. However, as explained above, even anchors based on random numbers seem to have an impact on judgement. A fourth explanation is that the anchor distorts the perception of the response scale [19]; that is, when larger quantities are anchors, such as 300 work hours, two hours does not appear to be much work but, when exposed to shorter durations, such as 15 minutes, two hours seems like a large amount of time. Which explanation is better seems to depend on the context. It is also reasonable to assume that anchoring can be caused by more than one phenomenon [20].

  • Take home message 1: Anchoring effects in time prediction contexts are typically strong. The only safe method for avoiding anchoring effects is to avoid being exposed to information that can act as a time prediction anchor.

  • Take home message 2: Anchors come in many shapes and disguises, such as budgets, time usage expectations, words associated with complex or simple tasks, early deadlines, and even completely irrelevant numbers brought to your attention before predicting time usage.

5.3 Sequence Effects

The sequence effect, like several other biases presented later, may be a close cousin of the anchoring effect. When sequence effects occur, the anchor is disguised as a preceding time prediction. If you first predict that it will take 10 minutes to do the dishes, your time prediction for cleaning the entire house may be two hours. If, on the other hand, you first predict that it will take two days to paint the house, your house cleaning time prediction might be three hours. Although we have not specifically tested the above example, studies suggest that such effects on your house cleaning predictions are likely [21, 22].

We evaluated the sequence effect in the context of software development, with software professionals divided into two groups. One group first predicted the time usage of a large task and then a medium large task. The second group first predicted the time usage of a small and then the same medium large task as the first group. The first group predicted a median time usage of 195 hours for the middle-sized task, whereas the second group gave a median estimate of 95 hours for the same task. In other words, their predictions of the medium task were biased towards their initial prediction of a different task [23].

Sequence effects are quite general and appear in most, perhaps all, domains. For instance, when research participants predicted the prices of 100 chairs from an Ikea catalogue, the predicted prices depended not only on a chair’s actual price but also on the predicted price given for the preceding chair [24].

  • Take home message: Your previous time prediction will typically influence your next time prediction. Predictions are biased towards previous predictions, meaning that predicting the time of a medium task after a small task tends to make the time prediction too low and predicting the time of a medium task after a large task tends to make the time prediction too high.

5.4 Format Effects

Time predictions typically answer questions such as ‘How long will it take?’, ‘How many work hours will this require?’, and ‘How much time do you need?’ The responses to these questions involve judging how much time one will need for a given amount of work. However, we could turn the question around and ask how much work one can do within a given amount of time. Examples of this alternative request format are ‘How much of the work are you able to complete within five work days?’, ‘How many units can you complete before lunch?’, and ‘Do I have time to eat breakfast before the meeting starts at 9:00?’ That is, instead of giving an amount of work and requesting a prediction for the time usage, one can instead give an amount of time and request a prediction for the amount of work to be completed within the given time frame. The basic finding from studies on such inverted time prediction formats is that the more work you have lined up and the less time you have at your disposal, the more overoptimistic your time predictions will be.

One of the first studies of the inverted time prediction format varied the number of errands one could complete (six vs. 12 potential errands) and how much time one had at one’s disposal (two hours vs. four hours) [25]. Those who had 12 potential errands believed they could complete more errands within the given time than those who had six potential errands. Furthermore, those who were to predict work to be done within two hours believed they would complete more errands per hour than those with four hours available. Consequently, the participants with the 12 errands and only two hours of time available were the most overoptimistic. They predicted that they would be able to run errands within two hours that, in reality, would take about five hours to complete (150% overrun).

We found a similar effect among students predicting how many pages they could read or how far they could walk within a given time frame. The students gave more optimistic predictions of how far they could walk and how many pages they could read when given a short time frame. When given five minutes to read from a book, the participants predicted that they would read four pages within this time frame (=0.8 pages per minute) but, if given 30 minutes, they predicted that they would read only 10 pages (=0.3 pages per minute) [26]. In other words, reducing the time frame from 30 minutes to five minutes almost tripled the predicted productivity, but hardly the real productivity.

The same format effect arose for IT professionals predicting the time usage to complete software development work. A group of IT professionals predicted how much of a project they would be able to complete in either 20 or 100 work hours. Those given 20 work hours believed they could complete tasks corresponding to about 20% of the project, while those given 100 work hours believed they could complete 50%. This means that the participants with the 20-hour time frame thought they would be twice as productive as those with the 100-hour time frame (1% vs. 0.5% of the total project work per hour).
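
The productivity figures in parentheses above are simple ratios of predicted output to available time; a minimal sketch of that arithmetic (the page and percentage figures are those reported above):

```python
def implied_productivity(predicted_output, time_available):
    """Predicted output per unit of time implied by an inverted-format prediction."""
    return predicted_output / time_available

# Reading study: predicted pages within the given number of minutes
print(implied_productivity(4, 5))     # 0.8 pages per minute with a 5-minute frame
print(implied_productivity(10, 30))   # ~0.33 pages per minute with a 30-minute frame

# IT professionals: predicted share of the project (%) within the given work hours
print(implied_productivity(20, 20))   # 1.0% of the project per hour with a 20-hour frame
print(implied_productivity(50, 100))  # 0.5% of the project per hour with a 100-hour frame
```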

The format effect may be one of the more important effects to worry about when predicting time or requesting time predictions. Do not ask how much your colleague can complete in 15 minutes or other short periods; instead, it is usually better to ask how much time is needed for a given amount of work.

There are contexts in which the inverted time prediction format seems to be useful. In so-called agile software development, the team considers how many requirements (called user stories) they have been able to complete in the previous weeks and uses this information to predict the amount of work to be completed next week. This approach seems to lead to fairly accurate predictions of next week’s work output. Consequently, it could be that the inverted format is mainly problematic when we lack or ignore historical data on productivity.

  • Take home message: When the time frame is short and a large amount of work must be done, the inverted request format, ‘How much do you think you can do in X hours?’, tends to lead to more overoptimistic time predictions.

5.5 The Magnitude Effect

The magnitude effect is the observation that the time usage of larger tasks tends to be underestimated by a greater amount, in both percentage and absolute underestimation, than the time usage of smaller tasks, which may even tend to be overestimated. The effect is easily observed when, for example, comparing the time and cost overrun of multimillion-euro projects with those of smaller projects [4].

Although the larger time and cost overruns for large projects are frequently reported in the media and the association between task size and overrun is extensively documented in research, there are good reasons to believe that the effect is exaggerated and sometimes does not even exist. One reason for an artificially strong association between project size and time overrun is that actual time usage is used both as a measure of task size and as part of the time prediction accuracy measure (coupling of variables). Why this would create an artificial or exaggerated association between project/task size and overrun is a bit difficult to explain, but let us try. Consider the following two situations.

  • Situation 1: The task size is measured by the actual time usage

Assume that several workers with about the same experience are asked to execute the same task, independently of each other. A reasonable time usage prediction would be the same number of hours for all of them. Let us say that we predict the task will take 100 hours for each worker. Even though their experience levels are very similar and a reasonable predicted time usage for each of them is the same, we cannot expect that their actual time usages will be the same. Some will have bad luck, perhaps get distracted to a larger extent than the others, and spend more than 100 hours, while others may be more fortunate and spend less than 100 hours. Since we predicted 100 hours for all of the workers, we underestimated the time for the workers with bad luck and overestimated the time for the lucky workers. If we use actual time usage as our measure of task size, we see that the ‘large tasks’, defined here as those with an actual time usage greater than 100 hours, were underestimated and the ‘small’ tasks, defined here as those with an actual time usage under 100 hours, were overestimated. In other words, the use of actual time usage as our task size measure has created an artificial magnitude effect where increased task size, measured as an increased actual effort, is associated with an increased time overrun. On the other hand, we do know that there is no real connection between time overrun and the true task size, since the task is exactly the same for all workers. The connection between task size and overoptimistic predictions is just a result of random variation in the actual time usage, the degree of luck and bad luck, and the fact that we used actual time usage as our measure of task size.

  • Situation 2: The task size is measured by the predicted time usage

People’s time predictions have random components. The randomness of people’s judgements may be caused by predictions made earlier that day, by what comes to mind at the moment they make the prediction, individual differences, and so on. This randomness in time prediction, similarly to the randomness in actual time usage, can create an artificial association between task size and time overrun. Assume that several people predict the time they need to complete a task. The task is the same for all of them and requires 100 hours, independent of the person completing it; for example, the task may be to watch eight seasons of the Andy Griffith Show. In this case, people who predict more than 100 hours will overestimate the time usage and those who predict less than 100 hours will underestimate the time usage. If we use the predicted time as our measure of task size instead of the actual time as in the previous example, we have a situation in which the ‘larger’ tasks—which are not truly larger but just have higher time predictions—are overestimated and ‘smaller’ tasks—those with lower time predictions—are underestimated. As in Situation 1, there is no actual relation between the true task size and the degree of over- or underestimation. The observed association is simply a result of random variation in time predictions and the fact that we used predicted time usage as our measure of task size.

The above two situations illustrate that we should expect greater time overrun for larger tasks when the task size is measured as the actual time usage (or cost) and greater time underrun for larger tasks when the task size is measured as the predicted time usage (or budgeted cost). This was also the case in a comparison of the results from 13 different studies on the magnitude effect [27]; all seven studies that had used actual time or actual cost usage as the measure of task size found greater underestimation of larger tasks. This finding is in accordance with the common claim that overrun increases with increasing task size. In contrast, the studies that used predicted time usage or budgeted cost as their measure of task size found little or no underestimation of larger tasks.
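
A minimal simulation sketch may make the two situations more concrete (all numbers are invented for illustration): every worker faces the same task that truly takes about 100 hours, predictions are unbiased on average, and both predictions and actual time usage contain random noise. Grouping by actual time usage then makes the ‘large’ tasks look underestimated, while grouping by predicted time usage makes them look overestimated.

```python
import random

random.seed(1)
TRUE_SIZE = 100  # the task truly takes about 100 hours for every worker

# Unbiased but noisy predictions, and noisy actual time usage (luck, distractions)
workers = [(random.gauss(TRUE_SIZE, 20), random.gauss(TRUE_SIZE, 20))
           for _ in range(10_000)]  # (predicted, actual) pairs

def mean_overrun(pairs):
    """Average percentage overrun, (actual - predicted) / predicted."""
    return 100 * sum((a - p) / p for p, a in pairs) / len(pairs)

# Situation 1: 'task size' measured by actual time usage
print(mean_overrun([w for w in workers if w[1] > TRUE_SIZE]))   # 'large' tasks: overrun
print(mean_overrun([w for w in workers if w[1] <= TRUE_SIZE]))  # 'small' tasks: underrun

# Situation 2: 'task size' measured by predicted time usage
print(mean_overrun([w for w in workers if w[0] > TRUE_SIZE]))   # 'large' tasks: underrun
print(mean_overrun([w for w in workers if w[0] <= TRUE_SIZE]))  # 'small' tasks: overrun
# The apparent magnitude effect flips, although the true task size never varies.
```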

So what is the true story about the relation between task size and time overruns? One way of reducing the methodological challenges of studying this relation is through controlled experiments. In controlled experiments, the task size may be set by the experimenter and there is no need to use predicted or actual time as a size measure.Footnote 4 One controlled experiment on this topic used the number of sheets of paper in a counting task as the measure of task size [28]. Participants received stacks of paper and predicted how long it would take to count the sheets of paper. An analysis of the time predictions showed that people were more optimistic for larger stacks of paper (larger tasks) than for smaller stacks (smaller tasks).Footnote 5 Other controlled experiments with the task size set by the experimenter have shown similar results: larger tasks were more likely to be underestimated than smaller tasks were [26]. In fact, an entire literature shows that people typically underestimate large quantities of any kind (duration, size, luminance, etc.) more than smaller quantities [30].

Consequently, the true magnitude effect, supported by the findings of controlled experiments, is that we should expect greater overestimation, or at least less underestimation, with smaller tasks and greater underestimation, or at least less overestimation, with larger tasks. A natural question is then: what constitutes small and large tasks? Not surprisingly, what is perceived as small or large, and consequently the magnitude effect, depends on the context.

An experiment on time perception may serve as a good example of how the context defines whether a task is small or large [31]. In this experiment, people watched a circle on a computer screen for varying amounts of time (between 494 milliseconds and 847 milliseconds) and were asked to reproduce this interval by pressing the spacebar on a computer keyboard. The data showed that the longer intervals were underestimated, whereas the shorter intervals were overestimated. The intervals in the middle of the distribution were rather accurately estimated. The next week, the participants repeated the procedure, but now with a change in the range of the intervals (between 847 and 1200 milliseconds). In the first session, the 847-millisecond interval was the longest and most underestimated, but in the second session it was the shortest and was consequently overestimated. This rather elaborate experiment (each participant was required to produce about 4500 judgements) demonstrates that larger stimuli are underestimated by greater amounts than smaller stimuli are and that the context establishes what is considered large or small. By the way, this result was also thoroughly documented in a study published in 1910, more than 100 years ago [32].

The experiment described above shows how judgements are biased towards the middle of the distribution of a set of durations. Usually, in the context of time predictions, we do not know this distribution and we do not know what kind of information people take into account in their mental representations of typical or middle time usage. It is, for example, possible that the time it usually takes to drive to work influences time usage predictions in other situations, such as predictions of time spent walking to the nearest grocery store from home. The research on the influence of prior time usage experience, distributions, and time usage categories on time predictions is very limited.

Time prediction biases, by definition, describe systematic deviation from reality. Biases should consequently be avoided. When it comes to the magnitude bias, however, it is not obvious that we can or should try to avoid it, especially when the prediction uncertainty is high. Adjusting judgements towards the centre of the distribution of similar tasks will inevitably produce a magnitude bias, where larger tasks are underestimated and smaller tasks are overestimated. In the long run, however, this tendency or strategy actually provides more accurate time predictions. Time predictions are inherently uncertain and the best strategy in the face of this uncertainty is often to be conservative and rely on the middle time usage of similar tasks. The more uncertain you are, the more you should adjust your time predictions towards the middle time usage of previously completed tasks. When high average time prediction accuracy is the goal, there may be no need to correct for the magnitude bias.
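
To see why adjusting towards the middle can pay off when uncertainty is high, consider a minimal simulation sketch (the distributions and the 50/50 blend are our own illustrative assumptions, not taken from the studies cited): pulling a noisy intuitive prediction halfway towards the average time usage of similar past tasks produces a magnitude bias, yet lowers the average prediction error.

```python
import random

random.seed(2)
HISTORICAL_MEAN = 100  # average time usage of similar, previously completed tasks

raw_errors, adjusted_errors = [], []
for _ in range(10_000):
    true_time = random.gauss(HISTORICAL_MEAN, 30)       # tasks vary in true size
    intuition = random.gauss(true_time, 40)             # noisy, unbiased gut feeling
    adjusted = 0.5 * intuition + 0.5 * HISTORICAL_MEAN  # pull towards the middle
    raw_errors.append(abs(intuition - true_time))
    adjusted_errors.append(abs(adjusted - true_time))

print(sum(raw_errors) / len(raw_errors))            # larger average error
print(sum(adjusted_errors) / len(adjusted_errors))  # smaller average error, but biased:
# truly large tasks are now underestimated and truly small tasks overestimated
```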

What about large projects with major time and cost overruns? Are those overruns the products of magnitude bias? The experimental research on magnitude effects concerns small or extremely small tasks and we do not know how much or whether a magnitude effect plays a role in the time overruns of larger projects. The magnitude effect does, however, seem to be at work when predicting the time usage of software development tasks that are parts of larger projects. Dividing software development tasks into smaller subtasks, for example, has been observed to increase the prediction of the total time usage [33].

  • Take home message 1: Larger projects have been frequently reported to suffer from greater underestimation than smaller projects (magnitude bias) but, with observational data (as opposed to controlled experiments), this association between task size and prediction bias may be due to statistical artefacts.

  • Take home message 2: Controlled experiments, avoiding the statistical problems of observational studies, suggest that a magnitude bias actually exists, at least for predictions of relatively small tasks.

  • Take home message 3: Predicting that the time usage is close to the average of similar tasks will result in a magnitude bias but may also increase the accuracy of the predictions in situations with high uncertainty.

5.6 Length of Task Description

A group of software developers was asked to predict the time they would need to develop a software application. The software application was described as follows:

The application should take one picture every time ‘ENTER’ (the Return key) is pressed. New pictures are taken until the person is satisfied and selects one of them. During the picture taking and selection process, the last 20 pictures should be displayed on the screen. The selected picture should be stored on the hard disc as a .jpg file with proper naming. The application should run on a Microsoft Windows XP platform and work with an Apple iSight web cam that features auto focus. This camera comes with a Java interface, and is connected to a Dell Latitude D800 laptop. The laptop is connected to the local area network available at the premises (10 Mbit/s).

Another group of software developers received the following, longer task description:

The application should take one picture every time ‘ENTER’ (the Return key) is pressed. New pictures are taken until the person is satisfied and selects one of them. During the picture taking and selection process, the last 20 pictures should be displayed on the screen. The selected picture should be stored on the hard disc as a .jpg file with proper naming. The application should run on a Microsoft Windows XP platform and work with an Apple iSight web cam that features auto focus. This camera comes with a Java interface, and is connected to a Dell Latitude D800 laptop. The laptop is connected to the local area network available at the premises (10 Mbit/s).

The e-dating company sugar-date.com specializes in matching e-daters (people looking for a friend/partner/etc.) based on an extensive personal profile with 70 dimensions. The profile is based on questions that are carefully formulated and selected to establish and enable the matching of the preferences of both young and old. The matching process is performed by a sophisticated algorithm that has been developed by leading researchers in psychology. The matching process results in, for each of the relations to other members of a database of people, a score between 0 and 100. This unique system has received worldwide attention. In fact, many of the features in their matching processes have led other major e-dating companies to change how they do their matching of e-daters. The e-dating system on sugar-date.com is also used for e-dater parties—these are large dating party events, held at up-class restaurants and clubs. At the premises, PCs, digital cameras and printers provide each e-dater with a card showing the photo of the 18 other e-daters present who are their best e-dating matches (highest scores). As members arrive at the party, they are guided to one of many locations inside the premises where they can have their photo taken. The photo is attached to their profile and printed on the cards of those who have them as one of their 18 best matches. Many of the members are concerned that they look good on the photo (naturally), so several shots are often necessary. At present, the photographing process is quite slow, due to the many manual steps involved in taking, picking and storing the photos. The managers of sugar-date.com are as always eager to improve their business processes and are not satisfied with the current photo capturing.

If you read the two task specifications carefully, you will, probably even without any software development competence, see that they have the first part in common and that the text added for the second group does not add any information useful for developing the software. Rationally speaking, the task is the same and the time predictions should be the same for the two groups. The experiment, however, found that the longer text led to substantially higher time predictions. The median time prediction was 66 work hours for the short version and 90 work hours for the longer version [14]. The software developers seemed to have used the length of the description and not just the actual work requirement as an indicator of the time required for the task. Similar effects were found among students, who predicted that they would need much more time (40% more, on average) to read a text printed on 40 double-spaced, single-sided pages than the same text printed on seven single-spaced, double-sided pages [34].

Based on the above two studies, one may gain the impression that it is easy to manipulate time predictions by increasing or decreasing the length of the task description. However, this was not the result from a field experiment with software companies [15]. In this experiment, one group of software companies received a specification on a few pages and another group received the same specification on many more pages. The median time predictions of the two groups were about the same. The effects of increased task description lengths on time predictions are remarkable when they occur, but they may not be very large for important time predictions in real-world contexts made by people experienced in the task.

  • Take home message: Longer task descriptions tend to increase the time predictions, but the effect may be weak for important real-world time predictions by people with relevant experience.

5.7 The Time Unit Effect

Do you feel that 365 days is longer than one year? Most people seem to feel that way. The likelihood of starting a diet is, for example, higher when the diet program is framed as a one-year plan rather than a 365-day plan [35]. If you find this effect amusing, an even more remarkable and frightening result was reported in a study on judges [36]. Active trial judges were given hypothetical cases and asked to decide what would be the appropriate length of the prison sentences for the offenders. One group of judges was asked to give sentences in months and the other group was asked to give sentences in years. The average length of the sentences, when given in years, was 9.7 years, whereas the average length of the sentences for the same crimes, when given in months, corresponded to 5.5 years (66 months). So, if you happen to commit a crime, you should really hope that the judge gives your sentence in months, or, perhaps even better, in weeks or days, rather than in years.

If we feel that 365 days or 12 months is longer than one year, we should also think that it is possible to complete more work in the same time frame when the prediction uses a time unit of fine granularity, that is, a unit that leads to high numbers. For instance, we should feel that we are able to complete more work in 40 work hours than in five work days of eight work hours each. Being affected not just by the actual magnitude of the work, time, or other quantities but also by the nominal values used to describe the magnitude is called the numerosity effect [37].

The numerosity effect is not the only reason we should expect higher time predictions when using, for instance, work weeks rather than work hours. The unit itself may indicate what the person requesting the prediction thinks about the time needed. A person would hardly ask how many months a project will take unless the work is considered substantial. Consequently, the granularity of time units may work as a sort of time prediction anchor. Asking for time usage predictions in person-months makes people think of the task as large, whereas asking for predictions in work hours makes people think the task is smaller. The influence of the unit itself is called the unitosity effect [38].

We would expect that both the numerosity and unitosity effects lead to lower time predictions with finer granularity time units, as in predicting time usage in work hours instead of work days or person-months. Is this really the case? Can we affect people’s time predictions simply by requesting them in a different time unit?

To test this, we invited 74 software professionals, all experienced in predicting time usage, to participate in an experiment [39]. Half of them predicted the software development time usage in work hours and the other half in workdays. The latter group also indicated how many work hours they usually included in one workday to enable a conversion from workdays to work hours. Two tasks were predicted: For the first, smaller task, those predicting time usage in workdays predicted almost twice the number of work hours as those predicting in work hours (88 vs. 45 work hours). For the second task, the relative difference was smaller (335 work hours when predicted in workdays vs. 224 work hours when predicted in work hours) but still substantial and in the expected direction.

The effect of the time unit seems to be less important when predicting the work to be completed in a given amount of time, as opposed to predicting the time to complete a task. In an unpublished study, we asked students how many pages of their psychology book they could read in either half an hour or 30 minutes. The mean values of the predictions were practically identical (about nine pages). We have also conducted an unpublished study on software development tasks that showed no effect of the units when people were asked how much they thought they would accomplish within a given time frame.

  • Take home message 1: The selection and use of units in time predictions matters. Coarser-granularity units tend to lead to higher time predictions. In a context where overoptimistic time predictions are typical, it is important to avoid predicting time in finer granularity time units, such as work hours for tasks that require several person-months.

  • Take home message 2: The choice of time units when predicting the amount of work to be completed in a given time seems to have little or no effect, such as in predicting the amount of work that can be completed in two hours versus 120 minutes.