Download specific files of ICGC data using "icgc-get"

Hi,

I am trying to download ICGC data from the PCAWG repository using ‘icgc-get’. As far as I know, ‘icgc-get’ downloads data files either by file ID or from a manifest file. However, a download from the PCAWG repository fetches all associated files. The manifest XML for the PCAWG repository looks like the excerpt below, so ‘icgc-get’ will download every file associated with the ID f404cebb-04bd-42a4-9f20-5b52514ec309, right? A directory named “f404cebb-04bd-42a4-9f20-5b52514ec309” is created and 14 to 20 associated files are downloaded together, which takes up storage space and download time. However, the only files I need are the “.somatic.sv.vcf.gz” files. Is there a way to download only those files with ‘icgc-get’? Someone told me that if I modify the ‘icgc-get’ source code, I can make it download only the “.somatic.sv.vcf.gz” files. Is that true?

<Result id="1">
  <analysis_id>f404cebb-04bd-42a4-9f20-5b52514ec309</analysis_id>
  <analysis_data_uri>https://gtrepo-osdc-tcga.annailabs.com/cghub/data/analysis/download/f404cebb-04bd-42a4-9f20-5b52514ec309</analysis_data_uri>
  <files>
    <file>
      <filename>1021b60d-f7b2-43b0-b2cc-f282d619d533.broad-dRanger_snowman.20150918.somatic.sv.vcf.gz</filename>
      <filesize>248027</filesize>
      <checksum type="md5">c0785110b39fc060466382c8985f9bd5</checksum>
    </file>
    <file>
      <filename>1021b60d-f7b2-43b0-b2cc-f282d619d533.broad-dRanger.20150918.somatic.sv.vcf.gz</filename>
      <filesize>146351</filesize>
      <checksum type="md5">62cd6751000ef9266c7be306a407264a</checksum>
    </file>
    <file>
      <filename>1021b60d-f7b2-43b0-b2cc-f282d619d533.broad-snowman.20150918.somatic.sv.vcf.gz</filename>
      <filesize>240780</filesize>
      <checksum type="md5">565a58c5118ee74a7428ffc820d98c06</checksum>
    </file>
  </files>
</Result>
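
For anyone wanting to check in advance which entries of such a manifest are actually needed, here is a minimal Python sketch that lists the files ending in “.somatic.sv.vcf.gz” together with their sizes. It assumes the manifest is well-formed XML with the `file`, `filename` and `filesize` elements shown above; the function name `wanted_files` and the last `file` entry in the sample are made up for illustration. It only inspects the manifest and does not change what gets downloaded.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the manifest entry quoted above. The last <file> entry is
# HYPOTHETICAL and only there to show a file that does not match the suffix.
MANIFEST = """<Result id="1">
  <analysis_id>f404cebb-04bd-42a4-9f20-5b52514ec309</analysis_id>
  <files>
    <file>
      <filename>1021b60d-f7b2-43b0-b2cc-f282d619d533.broad-dRanger_snowman.20150918.somatic.sv.vcf.gz</filename>
      <filesize>248027</filesize>
    </file>
    <file>
      <filename>1021b60d-f7b2-43b0-b2cc-f282d619d533.broad-dRanger.20150918.somatic.sv.vcf.gz</filename>
      <filesize>146351</filesize>
    </file>
    <file>
      <filename>hypothetical-example.some-other-output.txt</filename>
      <filesize>1000</filesize>
    </file>
  </files>
</Result>"""


def wanted_files(manifest_xml, suffix=".somatic.sv.vcf.gz"):
    """Return (filename, size_in_bytes) for every manifest file ending in suffix."""
    root = ET.fromstring(manifest_xml)
    matches = []
    # iter() walks the whole tree, so this works whether or not the
    # <Result> element is wrapped in some outer element.
    for file_el in root.iter("file"):
        name = file_el.findtext("filename", default="").strip()
        size = int(file_el.findtext("filesize", default="0"))
        if name.endswith(suffix):
            matches.append((name, size))
    return matches


if __name__ == "__main__":
    for name, size in wanted_files(MANIFEST):
        print(size, name)
```

Comparing the total size of the matching files with the size of the whole analysis object gives a rough idea of how much of the download is overhead.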

Hi,

We have two different types of backend data stores. One is a cloud storage system such as AWS S3 or OpenStack Ceph; the other is called GNOS (a commercial platform). If the data files you are trying to download are on a GNOS server, they will be downloaded in groups, as you have experienced. Here is one such group of files: https://gtrepo-osdc-tcga.annailabs.com/cghub/metadata/analysisFull/f404cebb-04bd-42a4-9f20-5b52514ec309. In GNOS this is called an ‘Analysis Object’.

The tool ‘icgc-get’ is a wrapper: it relies on an underlying tool to do the actual download work. When the file to be downloaded is on a GNOS server, the tool that actually gets called is ‘gtdownload’. ‘gtdownload’ always downloads all the files in an analysis object together, and ‘icgc-get’ is not able to change this behaviour.

One way to avoid this is to download the file from cloud storage systems when possible.
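
When a file is only available on a GNOS server, the whole analysis object still has to be downloaded, but the unneeded files can be removed afterwards to reclaim disk space. Below is a minimal Python sketch of such a cleanup step. It assumes the download ends up in a directory named after the analysis ID, as described in the question; `prune_analysis_dir` is a made-up helper and not part of ‘icgc-get’ or ‘gtdownload’. It does not reduce the download time.

```python
from pathlib import Path


def prune_analysis_dir(analysis_dir, keep_suffix=".somatic.sv.vcf.gz", dry_run=True):
    """Delete every file under analysis_dir whose name does not end in keep_suffix.

    Returns the list of paths that were (or, with dry_run=True, would be) removed.
    keep_suffix may also be a tuple of suffixes, since str.endswith accepts one.
    """
    removed = []
    for path in sorted(Path(analysis_dir).rglob("*")):
        if path.is_file() and not path.name.endswith(keep_suffix):
            removed.append(path)
            if not dry_run:
                path.unlink()
    return removed


if __name__ == "__main__":
    # Dry run first: only report what would be deleted.
    for path in prune_analysis_dir("f404cebb-04bd-42a4-9f20-5b52514ec309"):
        print("would remove", path)
```

The default is a dry run so that nothing is deleted by accident. Note that any file that does not end in the suffix is removed, including index files, so pass a tuple of suffixes as `keep_suffix` if more than one kind of file should be kept.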

Hope this answers your question. Please feel free to get in touch with us if you have any further questions.

Junjun

Hi Junjun,

Thank you so much for your kind and detailed explanation. I understand much better now. I didn’t know that ‘icgc-get’ relies on an underlying tool, ‘gtdownload’, to download files from a GNOS server. Could you answer one more question, please?

My PI said that I should modify the ‘icgc-get’ source code, which is Python, so that it picks out only the “.somatic.sv.vcf.gz” files on the GNOS server, and that I could then probably download just those specific files. Have you heard about this? Would it be possible? I need to confirm this. After reading your explanation, though, I think it would still be impossible, since ‘icgc-get’ relies on ‘gtdownload’.

Thank you,
Sanghoon

Hi Sanghoon,

If the file is downloaded from a GNOS server, there is no possibility to change the behaviour.

As mentioned earlier, some files are stored in multiple locations, such as this one: https://icgc.org/ZBs. It exists in PCAWG - Chicago (ICGC), AWS - Virginia and Collaboratory - Toronto.

If you download it from ‘PCAWG - Chicago (ICGC)’ which is a GNOS server, you will get many other files. If you download from ‘AWS’ or ‘Collaboratory’, you will be able to get just one file.
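
The choice described above can be sketched in Python as a simple preference order: use a cloud repository when the file exists there, and fall back to the GNOS server only when it does not. The repository names are the display names used in this thread, not necessarily the identifiers that ‘icgc-get’ expects in its configuration, and `pick_repository` is a made-up helper for illustration.

```python
# Display names taken from this thread; they are NOT guaranteed to match
# the repository codes used in the icgc-get configuration.
CLOUD_REPOS = ("Collaboratory - Toronto", "AWS - Virginia")


def pick_repository(available, preferred=CLOUD_REPOS):
    """Return the first preferred (cloud) repository that holds the file.

    Falls back to the first available repository (for example a GNOS server,
    where the whole analysis object is downloaded), or None if the file is
    not available anywhere.
    """
    for repo in preferred:
        if repo in available:
            return repo
    return available[0] if available else None


if __name__ == "__main__":
    locations = ["PCAWG - Chicago (ICGC)", "AWS - Virginia", "Collaboratory - Toronto"]
    print(pick_repository(locations))
```

For a file that exists only in ‘PCAWG - Chicago (ICGC)’, the function returns that GNOS repository, which is the case where the grouped download cannot be avoided.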

Does this answer your question?

Best regards,
Junjun

It sounds like it is impossible for ‘icgc-get’ to download specific single files from a GNOS server, even if I modify the source code of ‘icgc-get’, because there is no way to change the behaviour of the GNOS server.

Yes, I have already downloaded specific single files from Collaboratory. I then had to download some other data files that are not available in Collaboratory, so I tried to download them from PCAWG - Chicago and found that all the related files were downloaded from the GNOS server. That is why I asked the question.

Thank you so much for your kindness.
Sanghoon