Handoff between dcc-release and dcc-download


#1

Hi.

We’ve completed the dcc-release pipeline, at least as far as export.

At the same time, we have deployed an instance of dcc-download-server and have it communicating with HDFS. We’ve manually populated a directory with the contents of dcc-download-server/src/test/resources/fixtures/input, configured proxies, and can now download data from HDFS successfully.

As far as I can tell, the dcc-release process is incomplete, in that it does not create the directory structure within HDFS that dcc-download expects.

There is code here in dcc-etl that looks like it might create the expected directory.

Is there any guidance you can share on how to prepare data for handoff between dcc-release and dcc-download?

Thanks very much for reading.

-Brian Walsh


#2

Note: discuss will only let me post two links per message; a copy of this post with all links can be found here:


#3

Hi @bwalsh,

The export job of dcc-release should create the required directory structure. The code responsible for it is here:

Note that it only gets triggered if you’ve set the `clean` property to `true` in `application.yml`:

```yaml
export:
  clean: true
```

#4

I do think the intention of this conditional is not at all obvious: https://github.com/icgc-dcc/dcc-release/blame/develop/dcc-release-job/dcc-release-job-export/src/main/java/org/icgc/dcc/release/job/export/core/ExportJob.java#L92

That is something I can definitely improve.


#5

Thanks for the quick response.

We’ve re-run release and see the following:

```
$ hdfs dfs -ls /dcc-release/work-projects-22-2/export
Found 2 items
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:19 /dcc-release/work-projects-22-2/export/data
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/headers

$ hdfs dfs -ls /dcc-release/work-projects-22-2/export/data
Found 22 items
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/ALL-US
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/AML-US
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/BAML-US
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/BRCA-EU
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/BRCA-FR
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/BRCA-KR
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/BRCA-UK
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/BRCA-US
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/CLLE-ES
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/CMDI-UK
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/DLBC-US
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/LAML-CN
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/LAML-KR
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/LAML-US
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/LIAD-FR
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/LICA-CN
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/LICA-FR
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/LIHC-US
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/LIHM-FR
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/LINC-JP
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/LIRI-JP
drwxr-xr-x - ubuntu hadoop 0 2017-02-18 21:39 /dcc-release/work-projects-22-2/export/data/MALY-DE
```

However, we don’t see the expected directory structure:
```
$ ls -l dcc-download-server/src/test/resources/fixtures/input
total 8
-rw-r--r--  1 walsbr  OHSUM01\Domain Users   18 Jan 30 16:12 README.txt
drwxr-xr-x  3 walsbr  OHSUM01\Domain Users  102 Jan 30 16:12 legacy_releases
drwxr-xr-x  5 walsbr  OHSUM01\Domain Users  170 Jan 30 16:12 release_20
drwxr-xr-x  7 walsbr  OHSUM01\Domain Users  238 Jan 30 16:12 release_21
```

Is there anything obvious that we have overlooked?


#6

@andricDu
Hi. Sorry to interrupt. Have you had a chance to look into this?
We are a bit stuck: we don’t see the expected behavior (or the relevant code) in dcc-release, and it looks like dcc-etl-export might be deprecated?
Thanks again,

-Brian


#7

Hey @bwalsh

I have not had the time yet. I will investigate this soon and get back to you shortly.


#8

@andricDu Thank you. Much appreciated.


#9

Hey @bwalsh, I re-ran the export job and was going through our process, and you are correct that the final steps for making the data available for download are not done by dcc-release.

We actually have a manual process to do this that has some pretty good documentation courtesy of Vitalii:


### Update download server data

For a release directory layout refer to the download server documentation: https://github.com/icgc-dcc/dcc-download/tree/develop/dcc-download-server#directories-layout

First, update Elasticsearch export archives:

```
hdfs dfs -mv /<path>/export/es_export /<path>/export/es_export.bak
hdfs dfs -mv /<path>/release/ICGC24/0/es_export /<path>/export/es_export
hdfs dfs -rm -r -skipTrash /<path>/export/es_export.bak
```

Second, move the release clinical download files and prepare the standard directory layout:

```
hdfs dfs -mv /<path>/release/ICGC24/0/export /<path>/download/release_24
hdfs dfs -mkdir /<path>/download/release_24/{projects_files,summary_files}
hdfs dfs -mv /<path>/release/ICGC24/0/simple_somatic_mutation.aggregated.vcf.gz /<path>/download/release_24/summary_files
```

Next, update/add README.txt files:

```
/<path>/download/README.txt
/<path>/download/release_24/README.txt
/<path>/download/release_24/projects_files/README.txt
```

Lastly, trigger the download server to load the latest release files in memory:

```
curl -k -XPUT https://<download_server>:8443/srv-info/release/release_24 -u <admin_user_name>:<admin_user_password>
```
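If it helps, the steps above can be consolidated into one small script. This is only a sketch: `BASE`, `NUM`, `SERVER`, `ADMIN_USER`, and `ADMIN_PASS` are assumed variables standing in for the `<path>` and credential placeholders, and the default values are made up. By default the script only prints each command (dry run); set `EXECUTE=1` to actually run them.

```shell
#!/usr/bin/env bash
# Sketch of the manual handoff steps; variable defaults are assumptions.
set -euo pipefail

BASE="${BASE:-/icgc/dcc}"   # stands in for the <path> placeholder above
NUM="${NUM:-24}"            # release number
RELEASE="ICGC${NUM}"
# Dry run by default: prefix every command with echo unless EXECUTE=1.
if [ "${EXECUTE:-0}" = "1" ]; then RUN=""; else RUN="echo"; fi

# 1. Swap in the new Elasticsearch export archives.
$RUN hdfs dfs -mv "$BASE/export/es_export" "$BASE/export/es_export.bak"
$RUN hdfs dfs -mv "$BASE/release/$RELEASE/0/es_export" "$BASE/export/es_export"
$RUN hdfs dfs -rm -r -skipTrash "$BASE/export/es_export.bak"

# 2. Move the clinical download files and lay out the release directory.
$RUN hdfs dfs -mv "$BASE/release/$RELEASE/0/export" "$BASE/download/release_$NUM"
$RUN hdfs dfs -mkdir "$BASE/download/release_$NUM/projects_files" \
                     "$BASE/download/release_$NUM/summary_files"
$RUN hdfs dfs -mv "$BASE/release/$RELEASE/0/simple_somatic_mutation.aggregated.vcf.gz" \
                  "$BASE/download/release_$NUM/summary_files"

# 3. Update/add the README.txt files by hand (not scripted here).

# 4. Tell the download server to load the new release into memory.
$RUN curl -k -XPUT "https://${SERVER:-localhost}:8443/srv-info/release/release_$NUM" \
     -u "${ADMIN_USER:-admin}:${ADMIN_PASS:-secret}"
```

Leaving `$RUN` unquoted is deliberate: when `EXECUTE=1` it expands to nothing and the real command runs; otherwise each command is echoed for review.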

I hope this helps, and let me know if there are things that are still unclear.


#10

@andricDu @vitalii

Thanks very much.

Our release directory after export contains files in this form:

```
ubuntu@dcc-etl-2:~/release2download$ hdfs dfs -ls /dcc-release/work-projects-22-2/export/data/ALL-US/DOfff8e7b946df4c9b2575b3936fadf24f/donor
Found 1 items
-rw-r--r--   3 ubuntu hadoop        109 2017-02-18 21:20 /dcc-release/work-projects-22-2/export/data/ALL-US/DOfff8e7b946df4c9b2575b3936fadf24f/donor/part-00001.gz
```

The test fixture provided with dcc-download has this form:

```
ubuntu@dcc-etl-2:~/release2download$ hdfs dfs -ls /bwalsh-release/release_20/Projects/TST1-CA/*
-rw-r--r--   3 ubuntu hadoop         43 2017-02-10 22:47 /bwalsh-release/release_20/Projects/TST1-CA/donor.TST1-CA.tsv.gz
```

Is there a step missing to create the tsv.gz files?


#11

The part files should be correct.
You can verify their contents with `zless part-00001.gz`.
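For completeness, a quick way to confirm that part files are ordinary gzipped TSVs. The sample file and its two rows below are made up for illustration; the commented-out command shows the equivalent check against the real HDFS path from the listing above.

```shell
# Build a tiny fake part file locally (contents are made up) and read it
# back the same way you would read a real part file.
printf 'icgc_donor_id\tproject_code\nDO1\tALL-US\n' | gzip > part-00001.gz
gzip -dc part-00001.gz | head

# Against HDFS, stream the real file instead (path from the listing above):
# hdfs dfs -cat /dcc-release/work-projects-22-2/export/data/ALL-US/DOfff8e7b946df4c9b2575b3936fadf24f/donor/part-00001.gz | gzip -dc | head
```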


#12

The test fixtures for the download server also contain part files:


#13

Hey Dusan,

We’ve got the correct directory structure in place, and the Summary and Projects directories show up on the Release page, but what’s going on with the caching? We deleted an extra directory from the base DCC directory shown in the portal, yet it continues to appear even after restarting download-server.

Best,
G