Constructing computer virus phylogenies

  • Leslie Ann Goldberg
  • Paul W. Goldberg
  • Cynthia A. Phillips
  • Gregory B. Sorkin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1075)


There has been much recent algorithmic work on the problem of reconstructing the evolutionary history of biological species. Computer virus specialists are interested in finding the evolutionary history of computer viruses — a virus is often written using code fragments from one or more other viruses, which are its immediate ancestors. A phylogeny for a collection of computer viruses is a directed acyclic graph whose nodes are the viruses and whose edges map ancestors to descendants and satisfy the property that each code fragment is “invented” only once. To provide a simple explanation for the data, we consider the problem of constructing such a phylogeny with a minimum number of edges. This optimization problem is NP-hard, and we present positive and negative results for associated approximation problems. When tree solutions exist, they can be constructed and randomly sampled in polynomial time.
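
The phylogeny condition in the abstract can be illustrated with a minimal sketch. It rests on an assumed formalization that is not taken from the paper: each virus is modeled as a set of code-fragment identifiers, and a fragment counts as "invented" at a virus when the virus contains it but none of its immediate ancestors do. The function name `is_phylogeny` and the input layout are hypothetical, chosen only for illustration. The check verifies that the graph is acyclic and that every fragment has exactly one inventor.

```python
from graphlib import TopologicalSorter, CycleError


def is_phylogeny(fragments, edges):
    """Check an assumed formalization of the phylogeny property.

    fragments: dict mapping virus name -> set of fragment ids (hypothetical layout)
    edges: iterable of (ancestor, descendant) pairs over the same virus names
    """
    # Record the immediate ancestors of every virus.
    parents = {virus: set() for virus in fragments}
    for ancestor, descendant in edges:
        parents[descendant].add(ancestor)

    # The graph must be a DAG; static_order() raises CycleError otherwise.
    try:
        tuple(TopologicalSorter(parents).static_order())
    except CycleError:
        return False

    # Each fragment must be "invented" exactly once: exactly one virus
    # contains it without inheriting it from an immediate ancestor.
    all_fragments = set().union(*fragments.values())
    for fragment in all_fragments:
        inventors = [
            virus
            for virus, contents in fragments.items()
            if fragment in contents
            and not any(fragment in fragments[p] for p in parents[virus])
        ]
        if len(inventors) != 1:
            return False
    return True


# Illustrative data only: three viruses sharing two fragments.
example_fragments = {"A": {"x"}, "B": {"x", "y"}, "C": {"y"}}
example_edges = [("A", "B"), ("B", "C")]
print(is_phylogeny(example_fragments, example_edges))
```

Because the graph is acyclic, a fragment with a single inventor is automatically inherited along edges by every other virus that contains it, so counting inventors is enough under this model. The sketch only validates a given edge set; finding a valid edge set with the minimum number of edges is the optimization problem that the abstract states is NP-hard.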







Copyright information

© Springer-Verlag Berlin Heidelberg 1996

Authors and Affiliations

  • Leslie Ann Goldberg (1)
  • Paul W. Goldberg (2)
  • Cynthia A. Phillips (3)
  • Gregory B. Sorkin (4)
  1. University of Warwick, Coventry, UK
  2. Aston University, Aston Triangle, UK
  3. Sandia National Labs, Albuquerque
  4. IBM T.J. Watson Research Center, Yorktown Heights
