Duplicated rows in exp_seq files

genenastics · October 15, 2018, 9:02am

Hi,

I am analyzing RNA-Seq expression data (exp_seq files) from all available projects, and I found duplicated rows while running some quality-control checks. A reproducible example is shown below:

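library(data.table)
# mode='wb' keeps the gzip file intact when downloading on Windows
download.file('https://dcc.icgc.org/api/v1/download?fn=/release_27/Projects/PRAD-US/exp_seq.PRAD-US.tsv.gz','exp_seq.PRAD-US.tsv.gz', mode='wb')
# Decompress on the fly and read the TSV (requires gzip on the PATH)
mydt <- fread(cmd='gzip -dc exp_seq.PRAD-US.tsv.gz')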

# Example for a random gene
set.seed(1)
mygene <- sample(unique(mydt$gene_id),1)
mydt[gene_id == mygene,.N, by=.(icgc_sample_id)][N > 1]
   # icgc_sample_id N
# 1:       SA416386 2
# 2:       SA414946 2
# 3:       SA522801 2
# 4:       SA522467 2
# Check one sample
mydt[gene_id == mygene & icgc_sample_id == 'SA416386']
   # icgc_donor_id project_code icgc_specimen_id icgc_sample_id
# 1:       DO36289      PRAD-US          SP80052       SA416386
# 2:       DO36289      PRAD-US          SP80052       SA416386
            # submitted_sample_id analysis_id gene_model gene_id
# 1: TCGA-HC-8265-01B-04R-2302-07        1893        GAF  CCDC39
# 2: TCGA-HC-8265-01B-04R-2302-07        1893        GAF  CCDC39
   # normalized_read_count raw_read_count fold_change assembly_version
# 1:          9.149805e-07             68          NA           GRCh37
# 2:          1.881015e-06             97          NA           GRCh37
         # platform total_read_count
# 1: Illumina HiSeq               NA
# 2: Illumina HiSeq               NA
                                                                                  # experimental_protocol
# 1: RNASeqV2_RSEM_genes https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor
# 2: RNASeqV2_RSEM_genes https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor
   # alignment_algorithm normalization_algorithm other_analysis_algorithm
# 1:                  NA                      NA                       NA
# 2:                  NA                      NA                       NA
   # sequencing_strategy raw_data_repository                   raw_data_accession
# 1:             RNA-Seq               CGHub da8ee6ab-5fde-462b-96ad-b45b9441e495
# 2:             RNA-Seq               CGHub da8ee6ab-5fde-462b-96ad-b45b9441e495
   # reference_sample_type
# 1:                    NA
# 2:                    NA

As you can see, the only differences between those rows are in normalized_read_count and raw_read_count; every other column is identical. I am excluding those samples from the analysis for now, but I wanted to share this with the community in case someone knows why the rows are duplicated.
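
For anyone who wants to run the same check across all genes instead of one random gene, here is a minimal sketch. It runs on a small toy table so it is self-contained: the first two rows copy the duplicated pair shown above, while the third row (sample SA000001 and its counts) is made up for illustration. Treating (icgc_sample_id, analysis_id, gene_id) as the key that should identify a single measurement is my assumption, not something taken from the ICGC documentation.

```r
library(data.table)

# Toy stand-in for an exp_seq table. Rows 1-2 copy the duplicated pair from
# the post; row 3 (SA000001) is a hypothetical, non-duplicated sample.
toy <- data.table(
  icgc_sample_id        = c("SA416386", "SA416386", "SA000001"),
  analysis_id           = c(1893L, 1893L, 1893L),
  gene_id               = c("CCDC39", "CCDC39", "CCDC39"),
  normalized_read_count = c(9.149805e-07, 1.881015e-06, 5e-07),
  raw_read_count        = c(68L, 97L, 40L)
)

# Columns assumed to identify one measurement (an assumption, see above)
key_cols <- c("icgc_sample_id", "analysis_id", "gene_id")

# Count rows per key over the whole table and keep keys seen more than once
dup_keys <- toy[, .N, by = key_cols][N > 1]

# Number of duplicated genes per affected sample
dup_samples <- dup_keys[, .(n_dup_genes = .N), by = icgc_sample_id]

# Full rows behind the duplicated keys
dup_rows <- toy[dup_keys, on = key_cols]

# Columns whose values differ within the duplicated rows
# (meaningful here because the toy table holds a single duplicated pair)
differing <- names(which(sapply(dup_rows, function(x) uniqueN(x) > 1)))

print(dup_samples)
print(differing)
```

On the real data, replacing toy with mydt should list every affected sample and gene, which might help tell whether the duplication is limited to a few samples or is systematic.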