Hi,
I am analyzing RNA-Seq expression data (exp_seq files) from all available projects, and I found duplicated rows while running some quality-control checks. Here is a reproducible example:
library(data.table)
download.file('https://dcc.icgc.org/api/v1/download?fn=/release_27/Projects/PRAD-US/exp_seq.PRAD-US.tsv.gz','exp_seq.PRAD-US.tsv.gz')
mydt <- fread(cmd='gzip -dc exp_seq.PRAD-US.tsv.gz')
# Example for a random gene
set.seed(1)
mygene <- sample(unique(mydt$gene_id),1)
mydt[gene_id == mygene,.N, by=.(icgc_sample_id)][N > 1]
# icgc_sample_id N
# 1: SA416386 2
# 2: SA414946 2
# 3: SA522801 2
# 4: SA522467 2
# Check one sample
mydt[gene_id == mygene & icgc_sample_id == 'SA416386']
# icgc_donor_id project_code icgc_specimen_id icgc_sample_id
# 1: DO36289 PRAD-US SP80052 SA416386
# 2: DO36289 PRAD-US SP80052 SA416386
# submitted_sample_id analysis_id gene_model gene_id
# 1: TCGA-HC-8265-01B-04R-2302-07 1893 GAF CCDC39
# 2: TCGA-HC-8265-01B-04R-2302-07 1893 GAF CCDC39
# normalized_read_count raw_read_count fold_change assembly_version
# 1: 9.149805e-07 68 NA GRCh37
# 2: 1.881015e-06 97 NA GRCh37
# platform total_read_count
# 1: Illumina HiSeq NA
# 2: Illumina HiSeq NA
# experimental_protocol
# 1: RNASeqV2_RSEM_genes https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor
# 2: RNASeqV2_RSEM_genes https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor
# alignment_algorithm normalization_algorithm other_analysis_algorithm
# 1: NA NA NA
# 2: NA NA NA
# sequencing_strategy raw_data_repository raw_data_accession
# 1: RNA-Seq CGHub da8ee6ab-5fde-462b-96ad-b45b9441e495
# 2: RNA-Seq CGHub da8ee6ab-5fde-462b-96ad-b45b9441e495
# reference_sample_type
# 1: NA
# 2: NA
As you can see, the only differences between those rows are the normalized_read_count and raw_read_count values. For now I am excluding those samples from the analysis, but I wanted to share this with the community in case someone knows why they are duplicated.
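To see how widespread the duplication is, the same count can be run over every sample and gene pair instead of a single random gene. Below is a minimal sketch on a toy table. The values and the helper name find_dup_pairs are made up for illustration; only the column names come from the exp_seq file above.

```r
library(data.table)

# Toy table mimicking the relevant exp_seq columns (values are invented)
toy <- data.table(
  icgc_sample_id = c("S1", "S1", "S1", "S2", "S2"),
  gene_id        = c("A",  "A",  "B",  "A",  "B"),
  raw_read_count = c(68,   97,   10,   5,    7)
)

# Hypothetical helper: count rows per (sample, gene) pair and
# keep only the pairs that occur more than once
find_dup_pairs <- function(dt) {
  dt[, .N, by = .(icgc_sample_id, gene_id)][N > 1]
}

dup_pairs <- find_dup_pairs(toy)

# Per sample: how many genes have more than one row
dup_samples <- dup_pairs[, .(n_dup_genes = .N), by = icgc_sample_id]
print(dup_samples)
```

On the real data this would be called as find_dup_pairs(mydt). Comparing n_dup_genes with the total number of genes per sample should show whether the affected samples are duplicated for every gene or only for some of them.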