select-star-species

Usage:

select-star-species [--dist-matrix] <subset size> <tree file>

Description:

Given a phylogentic tree, this program attempts to find the subset of species whose subtree is "closest" to a star topology. By "closest" we mean that the pairwise distances between distinct species are nearly equal, but are also as near as possible to the average pairwise distance for the complete tree.

Algorithm:

Inputs:
m: size of desired subset
T: phylogentic tree
S ← List of species in T
n ← size(S)
if (n choose m) > 1e5 then exit
M ← pairwise distance matrix for T
μM ← mean pairwise distance for M
C ← List of all subsets of S of size m
best_score ← max_float
best_subset ← ø
for each s in C
score ← score(M, μM, s)
if score < best_score then
best_score ← score
best_subset ← s
print best_subset

score(M, μM, s) ← Var(sM) / μM
Var(sM) ← ∑ a, b ∈ s, a ≠ b (M[a, b] - μs)2
μs ← ∑a, b ∈ s, a ≠ bM[a,b] / ((len(s)2 - len(s)) / 2)

Input:

A file containing a phylogentic tree in Newick format.

Output:

The list of species in the subset, one species per line, is sent to the standard output.

Options:

--dist-matrix - This option causes the matrix of interspecies distances for the full tree to be written to the standard output.