Hi - here's some PFAM analysis of the RefSeq vs. the non-RefSeq portion of KG3. I'm doing the analysis at the gene rather than transcript level, using the knownCanonical table to choose a representative transcript for each gene. RefSeq coding genes: 18464 (90.36%) non-RefSeq coding genes: 1969 (9.64%) RefSeq with pfam: 14569 (78.90%) non-RefSeq with pfam: 855 (43.42%) RefSeq total unique domains in proteins: 23766 Top 10 RefSeq domains: 2.82% (670) 7tm_1 "7 transmembrane receptor (rhodopsin family)" 2.50% (594) zf-C2H2 "Zinc finger, C2H2 type" 1.86% (443) Pkinase "Protein kinase domain" 1.66% (394) Pkinase_Tyr "Protein tyrosine kinase" 1.40% (332) ig "Immunoglobulin domain" 1.17% (279) KRAB "KRAB box" 1.02% (242) PH "PH domain" 0.96% (228) zf-C3HC4 "Zinc finger, C3HC4 type (RING finger)" 0.92% (219) WD40 "WD domain, G-beta repeat" 0.87% (206) V-set "Immunoglobulin V-set domain" non-RefSeq total unique domains in proteins: 1294 Top 10 nonRefSeq domains. 7.81% (101) V-set "Immunoglobulin V-set domain" 5.10% (66) ig "Immunoglobulin domain" 4.87% (63) zf-C2H2 "Zinc finger, C2H2 type" 2.94% (38) KRAB "KRAB box" 1.70% (22) Pkinase "Protein kinase domain" 1.70% (22) Ank "Ankyrin repeat" 1.39% (18) PH "PH domain" 1.31% (17) Pkinase_Tyr "Protein tyrosine kinase" 1.31% (17) 7tm_1 "7 transmembrane receptor (rhodopsin family)" 1.16% (15) NPIP "Nuclear pore complex interacting protein (NPIP)" It's probably not a horrible assumption to say that 80% of the *real* genes have a PFAM domain. A fairly simplistic view would be then that 855/0.8 or 1100 of the non-refSeqs are real. I am a little worried about all the immunoglobulin though. About 25 of these are actually T-cell Receptor variable fragments. (These are similar to antibody fragments, it's another immune-related thing that undergoes splicing at the DNA rather than RNA level). There also seem to be some antibody fragments that slipped through the filters. So maybe it's closer to 1000. The same assumption applied to RefSeqwould give 18200 genes there, or 19200 total. Michelle Clamp just did an analysis of RefSeq, Ensembl, and VEGA coming up with 20500.