Careful Ranking of Multiple Solvers with Timeouts and Ties
In several fields, Satisfiability among them, regular competitions compare multiple solvers in a common setting. Because some benchmarks of interest are too difficult for solvers to complete within the available time, time-outs occur and must be taken into account.
Through a curious evolution, time-outs became the only factor considered in evaluation. Previous work at SAT 2010 observed that this evaluation method is unreliable and offers no way to attach statistical significance to its conclusions. However, the alternative proposed there was quite complicated and is unlikely to see general use.
This paper describes a simpler system, called careful ranking, that permits a measure of statistical significance while still meeting many of the practical requirements of an evaluation system. It retains one of the main ideas of the previous work: outcomes must be free of assumptions about timing distributions, so non-parametric methods are necessary. Unlike the previous work, it incorporates ties.
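As a rough illustration of such a non-parametric pairwise comparison, the Python sketch below compares two solvers benchmark by benchmark, treating double time-outs and small timing differences as ties, and attaches significance with a two-sided sign test on the decided benchmarks. The time limit, the tie margin, and the tie rule itself are illustrative assumptions, not the paper's exact definitions.

```python
from math import comb

TIMEOUT = 1200.0   # assumed per-benchmark time limit, in seconds
TIE_MARGIN = 1.1   # assumed ratio below which two times count as a tie

def compare(times_a, times_b):
    """Pairwise-compare two solvers benchmark by benchmark.

    times_a, times_b: per-benchmark run times; a time of None or one
    at/above TIMEOUT is treated as a time-out.
    Returns (wins_a, wins_b, ties).
    """
    wins_a = wins_b = ties = 0
    for ta, tb in zip(times_a, times_b):
        ta_out = ta is None or ta >= TIMEOUT
        tb_out = tb is None or tb >= TIMEOUT
        if ta_out and tb_out:
            ties += 1                       # both timed out: no information
        elif ta_out:
            wins_b += 1
        elif tb_out:
            wins_a += 1
        elif max(ta, tb) <= TIE_MARGIN * min(ta, tb):
            ties += 1                       # difference too small to trust
        elif ta < tb:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b, ties

def sign_test_p(wins_a, wins_b):
    """Two-sided sign test on the non-tied benchmarks: under the null
    hypothesis that the solvers are equally good, each decided
    benchmark is a fair coin flip; ties carry no information."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because only the sign of each per-benchmark difference is used, the test makes no assumption about the distribution of run times, which is the non-parametric requirement the abstract describes.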
The careful ranking system has several important non-mathematical properties that are desirable in an evaluation system: (1) the relative ranking of two solvers cannot be influenced by a third solver; (2) after the competition results are published, a researcher can run a new solver on the same benchmarks and determine where it would have ranked; (3) small timing differences can be ignored; (4) the computations are easy to understand and reproduce. Voting systems proposed in the literature lack some or all of these properties.
A property of careful ranking is that the pairwise ranking may contain cycles. Whether this is a bug or a feature is a matter of opinion; whether it occurs among the leaders in practice is a matter of experience.
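To make the cycle remark concrete, the sketch below searches a hypothetical pairwise "beats" relation (solver A beats solver B if A wins their pairwise comparison) for a cycle via depth-first search; if one exists, no total order of the solvers is consistent with the pairwise results.

```python
def find_cycle(beats):
    """Look for a cycle in the pairwise 'beats' relation.

    beats: dict mapping each solver name to the set of solvers it
    outranks pairwise.  Returns a list of solvers forming a cycle,
    or None if the relation is acyclic (a strict order exists).
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {s: WHITE for s in beats}
    stack = []

    def dfs(s):
        color[s] = GRAY
        stack.append(s)
        for t in beats.get(s, ()):
            if color[t] == GRAY:            # back edge: cycle found
                return stack[stack.index(t):]
            if color[t] == WHITE:
                cyc = dfs(t)
                if cyc:
                    return cyc
        stack.pop()
        color[s] = BLACK
        return None

    for s in beats:
        if color[s] == WHITE:
            cyc = dfs(s)
            if cyc:
                return cyc
    return None

# Example: A beats B, B beats C, but C beats A -- a 3-cycle, so no
# total order of the three solvers matches the pairwise results.
print(find_cycle({"A": {"B"}, "B": {"C"}, "C": {"A"}}))  # ['A', 'B', 'C']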
The system is implemented and has been applied to the SAT 2009 Competition. No cycles occurred among the leaders, but there was a cycle among some low-ranking solvers. To measure robustness, both the new and the current systems were run with a range of simulated time-outs, to see how often the top rankings changed: times above the simulated time-out are reclassified as time-outs, and the rankings are recomputed from this modified data. Careful ranking exhibited many fewer changes.
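The reclassification step is straightforward to express. In the sketch below, `rank_solvers` and `raw_times` are hypothetical placeholders for the ranking procedure under test and the recorded competition data; only the reclassification itself comes from the description above.

```python
def reclassify(times, simulated_timeout):
    """Reinterpret recorded times under a smaller simulated time limit:
    any time at or above the simulated limit becomes a time-out (None).
    The rankings are then recomputed from the reclassified data."""
    return [None if t is None or t >= simulated_timeout else t
            for t in times]

# A robustness sweep over decreasing simulated limits might look like:
# for limit in (1200, 1000, 800, 600):
#     data = {s: reclassify(ts, limit) for s, ts in raw_times.items()}
#     print(limit, rank_solvers(data))
```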