How to download PCAWG files from TCGA now?


#1

Hi everybody, I want to download the PCAWG data files. I noticed that to download those files from PCAWG Chicago (TCGA), I need a token for GNOS. I applied for dbGaP authorization and successfully downloaded some PCAWG files from GNOS 2 months ago. But when I tried to do it the same way now, I found that the web page where I got the token is gone:
https://pancancer-token.annailabs.com/
Could anyone tell me how to download PCAWG files (TCGA part) now?
Thanks!


#2

Hi,

Did you try to re-download the GNOS token (key file)? I think you can download the data files with a new key file.

Thanks,
Sanghoon


#3

Hi,

For the TCGA portion of the PCAWG data, at this time you can still download from the GNOS server hosted at the University of Chicago. Please download the access key from here: https://bionimbus-pdc.opensciencedatacloud.org/pcawg/

As the GNOS server will be decommissioned soon, the TCGA portion of the PCAWG data has been transferred to another long-term archive system, the Protected Data Cloud (PDC), also hosted at the University of Chicago. Details on how to download PCAWG data from PDC will be documented here: http://docs.icgc.org/pcawg/

Hope this helps and feel free to reach out if you have any further questions.

Best
Junjun


#4

Thank you very much! I got the GNOS token from the website you provided.
I have another question: when I use icgc-get on my personal workstation to download files, it throws an error like this:
[shiyang@dell-ser01 ~]$ sudo ./icgc-get download FI658589


Starting download(s) for files: FI658589 from: collaboratory


| nohup: redirecting stderr to stdout
| Starting…
|
| Running…
|
| Validating repository connection…
|
| ERROR: Command error: java.io.IOException: Access refused by repository. Ensure client is running as part of repository cloud.
|
| Please check the log for detailed error messages
collaboratory client exited with a nonzero error code 1.
Error: Please check client output for error messages

But sudo ./icgc-get check runs successfully. Do you know how to solve this problem?
Thanks!


#5

Thank you for your kind help! I have solved my problem.


#7

The file (FI658589) you are requesting in your example is not in GNOS. It is associated with a non-TCGA project: https://dcc.icgc.org/repositories/files/FI658589.

The file exists in Collaboratory and AWS. To download from Collaboratory or AWS, the download client (icgc-get in your case) must be running on the same platform. This means that if you download from Collaboratory, the client must run on a Collaboratory VM; if you download from AWS, the client must run on an AWS VM. Running the client on your local machine will not work (as the error message shows).

The above rule does not apply to downloading from a GNOS server, for which the client can run anywhere.

Hope this helps, let me know if you have any further questions.

Best
Junjun


#8

Thanks for your kind help! It saved me a lot of time; otherwise I would have kept trying to run icgc-get on my personal workstation.
I can now download a single file from Collaboratory with sudo ./icgc-get download fileid, but when I try to download thousands of files by manifest ID (sudo ./icgc-get download -m manifestid), it stalls at “Validating repository connection…” for a long time until my connection to the VM breaks, and nothing is downloaded. Although I can download the files one by one with a script, I am still curious about how to download thousands of files at once by manifest ID. I would appreciate it if you could help me clear this up.
Thanks!


#9

Personally, I would stick with downloading one file at a time. You have full control over how you want to script it. You can split the 1000 file IDs into 4 sub-lists and run a script to download each list; you can run all 4 at the same time, depending on bandwidth. If something goes wrong in the middle of downloading a list of files, you can easily tell where it broke and where you should resume downloading.
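
As a rough illustration of that approach, here is a minimal Python sketch. It assumes icgc-get is invoked as ./icgc-get download <file id>, as in the command shown earlier in this thread, and that a nonzero exit code means the download failed. The function names (split_ids, download_list, download_all) and the default of 4 sub-lists are made up for this example and are not part of icgc-get.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_ids(file_ids, n_lists=4):
    """Deal the file IDs round-robin into n_lists sub-lists."""
    return [file_ids[i::n_lists] for i in range(n_lists)]


def download_list(file_ids, base_cmd=("./icgc-get", "download")):
    """Download the IDs one at a time; return the IDs that failed.

    base_cmd is an assumption: adjust it to however you invoke the
    client on your VM (for example, with sudo in front).
    """
    failed = []
    for file_id in file_ids:
        try:
            result = subprocess.run([*base_cmd, file_id])
            ok = result.returncode == 0
        except OSError:
            # The client could not be started at all.
            ok = False
        if not ok:
            failed.append(file_id)
    return failed


def download_all(file_ids, n_lists=4, base_cmd=("./icgc-get", "download")):
    """Run one worker per sub-list; return every ID that needs a retry."""
    sublists = split_ids(file_ids, n_lists)
    with ThreadPoolExecutor(max_workers=n_lists) as pool:
        results = pool.map(lambda ids: download_list(ids, base_cmd), sublists)
    return [file_id for failed in results for file_id in failed]
```

To use it, read your file IDs into a list (for example, one ID per line from a text file), call download_all(ids), and write the returned IDs to a file so that a later run can retry only those.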

Downloading with a manifest ID should be used only for a small number of files.

Best,
Junjun