Ambiguous gene identifiers are known to cause problems in the output of gene-based enrichment analyses such as Gene Set Enrichment Analysis (GSEA). Duplicate genes with different identifiers can change the ranking of enriched gene sets and alter the enrichment scores of certain gene sets, which in turn changes the statistical significance of the results. We propose a study using synthetic gene expression data and gene sets. By varying the percentage of genes that are disambiguated between gene sets, and by merging overlapping gene sets based on content similarity, we can quantify the effects of duplicate genes on GSEA results.
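As a rough illustration of the two manipulations described above, the following Python sketch injects duplicate identifiers into a gene set at a chosen rate and merges gene sets by content similarity. It is not the study's implementation and does not run GSEA. The function names, the `_alias` suffix used to mark a duplicate identifier, the use of Jaccard similarity, and the greedy merge strategy are all assumptions made for this example.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two sets of gene identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_duplicate_aliases(gene_set, fraction, rng):
    """Return a copy of gene_set in which a fraction of genes also
    appears under a second identifier (hypothetical '_alias' suffix)."""
    genes = sorted(gene_set)  # sort so sampling is reproducible
    k = round(fraction * len(genes))
    aliased = rng.sample(genes, k)
    return set(genes) | {g + "_alias" for g in aliased}


def merge_similar(gene_sets, threshold):
    """Greedily merge gene sets whose Jaccard similarity to an
    already-kept set is at least `threshold` (assumed strategy)."""
    merged = []
    for name, genes in gene_sets.items():
        for entry in merged:
            if jaccard(entry["genes"], genes) >= threshold:
                entry["names"].append(name)
                entry["genes"] |= genes
                break
        else:
            # No sufficiently similar set found: keep as a new entry.
            merged.append({"names": [name], "genes": set(genes)})
    return merged
```

In a study of this kind, `fraction` would be swept over a range of values and the enrichment analysis rerun at each setting, so that changes in gene set ranking and enrichment score can be attributed to the duplicate rate.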
Learning Objective 1: Use synthetic data to analyze the performance of statistical gene enrichment analysis.
Lucy Wang (Presenter)
University of Washington
John Gennari, University of Washington