Release pipeline without germline information

mayfielg · January 11, 2017, 12:26am

Hey OICR folks,

If you recall, back in September you gave us here at OHSU access to a version of your dataset stripped of germline data, essentially making it equivalent to what is generally available at dcc.icgc.org without DACO access. Thanks again for that.

However, we’ve hit a problem: this dataset can’t be run through the MASK stage of the dcc-release pipeline because that information is missing, and if we run the dataset through the pipeline without the MASK stage, no genetic information is populated into Elasticsearch at all.

Is there a way, which we may be missing, to tell the pipeline that the data is already masked? Or a way to skip the MASK task without the pipeline dropping all genetic information?

btiernay · January 11, 2017, 2:34am

Hi @mayfielg. We’ll have a closer look at the code tomorrow to see if we can make this work somehow!

mayfielg · January 11, 2017, 6:25pm

Okay, great! Any help when you’ve got a minute would be appreciated. Thanks.

andricDu · January 11, 2017, 7:19pm

Hey @mayfielg,

The first thing I should mention is that, by design, it is not possible to skip a job while running the ETL pipeline. Specifically, you cannot do something like STAGE -> ID while skipping MASK. Each job takes as its input the directories and files written to disk by the previous job. If the MASK job is not run, the ID job will not be able to see any mutations.
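To illustrate the dependency described above, here is a minimal toy sketch in Python. It is not the dcc-release code: the job order, the directory layout, and the names `run_job` and `run_pipeline` are all assumptions made for illustration. It only shows why a job whose predecessor was skipped finds nothing to read.

```python
import tempfile
from pathlib import Path

# Hypothetical job order, for illustration only.
JOBS = ["STAGE", "MASK", "ID"]


def run_job(name, workdir, previous):
    """Read the previous job's output directory and write this job's own."""
    out_dir = workdir / name.lower()
    out_dir.mkdir()
    if previous is None:
        # Stand-in for the staged submission data.
        records = ["mutation-1", "mutation-2"]
    else:
        in_dir = workdir / previous.lower()
        # A skipped predecessor leaves no directory, so there is nothing to read.
        if in_dir.exists():
            records = (in_dir / "part-00000").read_text().split()
        else:
            records = []
    (out_dir / "part-00000").write_text("\n".join(records))
    return records


def run_pipeline(jobs_to_run):
    """Run the selected jobs in order and return what the last job saw."""
    seen = []
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        for i, name in enumerate(JOBS):
            if name not in jobs_to_run:
                continue
            previous = JOBS[i - 1] if i > 0 else None
            seen = run_job(name, workdir, previous)
    return seen
```

Running all three jobs lets ID see both toy mutations, while running only STAGE and ID leaves ID with an empty input, which mirrors the "no genetic information" symptom reported earlier in the thread.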

As for running the pipeline on the dataset that you got from us, what issue are you running into when running the MASK job? Any logs or stacktraces would be helpful. We are happy to investigate.

mayfielg · January 11, 2017, 7:49pm

The specific issue is that the control_genotype field of the ssm_p file type does not accept a ‘null’ or ‘unavailable’ value. The dataset we have has -888 in place of the expected -/- format entries, which results in a Java null pointer exception.

I know this is the problem with the columns that were stripped because for testing purposes I stuck in random genotypes in that column to see if it would pass the MASK stage, and it did with no errors, other than the fact that the results are basically bad data. (Don’t worry, it was only for testing, and was never released anywhere.)

This is the null pointer exception:

2017-01-11 19:33:15.105  INFO 10843 --- [           main] o.icgc.dcc.release.client.core.Workflow  : Executing job 'MASK'...
2017-01-11 19:33:15.105  INFO 10843 --- [           main] o.icgc.dcc.release.client.core.Workflow  : ----------------------------------------------------------------------------------------------------
2017-01-11 19:33:15.106  INFO 10843 --- [           main] o.i.dcc.release.core.task.TaskExecutor   : Starting 1 task(s)...
2017-01-11 19:33:15.106  INFO 10843 --- [           main] o.i.dcc.release.core.task.TaskExecutor   : Submitting 'delete-file-type-task:' task...
2017-01-11 19:33:15.107  INFO 10843 --- [           main] o.i.d.r.core.task.DefaultTaskContext     : bwalsh says the path is ... /mnt/etl/dcc-release-mayfielg/dcc-workflow/ssm_p_masked exists ... false
2017-01-11 19:33:15.107  INFO 10843 --- [           main] o.i.d.r.core.task.DefaultTaskContext     : bwalsh says the path is ... /mnt/etl/dcc-release-mayfielg/dcc-workflow/sgv_p_masked exists ... false
2017-01-11 19:33:15.107  INFO 10843 --- [           main] o.i.dcc.release.core.task.TaskExecutor   : Finished processing task 'delete-file-type-task: - 803.6 μs'
2017-01-11 19:33:15.107  INFO 10843 --- [           main] o.i.dcc.release.core.task.TaskExecutor   : Finished executing 1 tasks in 1.142 ms!
2017-01-11 19:33:15.107  INFO 10843 --- [           main] o.i.dcc.release.core.task.TaskExecutor   : Finished 1 task(s) in 1.465 ms
2017-01-11 19:33:15.109  INFO 10843 --- [           main] o.i.dcc.release.core.task.TaskExecutor   : Starting 2 task(s)...
2017-01-11 19:33:15.110  INFO 10843 --- [           main] o.i.dcc.release.core.task.TaskExecutor   : Submitting 'ssm-p-masking-task:ssm_p_masked:LAML-KR' task...
2017-01-11 19:33:15.110  INFO 10843 --- [           main] o.i.d.r.core.task.DefaultTaskContext     : bwalsh says the path is ... /mnt/etl/dcc-release-mayfielg/dcc-workflow/ssm_p/project_name=LAML-KR exists ... true
2017-01-11 19:33:15.115  INFO 10843 --- [           main] org.apache.spark.storage.MemoryStore     : Block broadcast_78 stored as values in memory (estimated size 126.7 KB, free 1099.1 KB)
2017-01-11 19:33:15.122  INFO 10843 --- [           main] org.apache.spark.storage.MemoryStore     : Block broadcast_78_piece0 stored as bytes in memory (estimated size 12.6 KB, free 1111.7 KB)
2017-01-11 19:33:15.123  INFO 10843 --- [er-event-loop-9] o.apache.spark.storage.BlockManagerInfo  : Added broadcast_78_piece0 in memory on localhost:57596 (size: 12.6 KB, free: 9.6 GB)
2017-01-11 19:33:15.123  INFO 10843 --- [           main] org.apache.spark.SparkContext            : Created broadcast 78 from hadoopRDD at JavaRDDs.java:84
2017-01-11 19:33:15.137  INFO 10843 --- [           main] o.apache.hadoop.mapred.FileInputFormat   : Total input paths to process : 3
2017-01-11 19:33:15.147  INFO 10843 --- [           main] org.apache.spark.SparkContext            : Starting job: saveAsHadoopFile at JavaRDDs.java:150
2017-01-11 19:33:15.148  INFO 10843 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler  : Got job 39 (saveAsHadoopFile at JavaRDDs.java:150) with 3 output partitions
2017-01-11 19:33:15.148  INFO 10843 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler  : Final stage: ResultStage 39 (saveAsHadoopFile at JavaRDDs.java:150)
2017-01-11 19:33:15.148  INFO 10843 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler  : Parents of final stage: List()
2017-01-11 19:33:15.149  INFO 10843 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler  : Missing parents: List()
2017-01-11 19:33:15.149  INFO 10843 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler  : Submitting ResultStage 39 (MapPartitionsRDD[318] at mapToPair at ObjectNodeRDDs.java:126), which has no missing parents
2017-01-11 19:33:15.161  INFO 10843 --- [uler-event-loop] org.apache.spark.storage.MemoryStore     : Block broadcast_79 stored as values in memory (estimated size 44.2 KB, free 1155.9 KB)
2017-01-11 19:33:15.164  INFO 10843 --- [uler-event-loop] org.apache.spark.storage.MemoryStore     : Block broadcast_79_piece0 stored as bytes in memory (estimated size 15.7 KB, free 1171.5 KB)
2017-01-11 19:33:15.164  INFO 10843 --- [er-event-loop-7] o.apache.spark.storage.BlockManagerInfo  : Added broadcast_79_piece0 in memory on localhost:57596 (size: 15.7 KB, free: 9.6 GB)
2017-01-11 19:33:15.165  INFO 10843 --- [uler-event-loop] org.apache.spark.SparkContext            : Created broadcast 79 from broadcast at DAGScheduler.scala:1006
2017-01-11 19:33:15.166  INFO 10843 --- [uler-event-loop] org.apache.spark.scheduler.DAGScheduler  : Submitting 3 missing tasks from ResultStage 39 (MapPartitionsRDD[318] at mapToPair at ObjectNodeRDDs.java:126)
2017-01-11 19:33:15.166  INFO 10843 --- [uler-event-loop] o.a.spark.scheduler.TaskSchedulerImpl    : Adding task set 39.0 with 3 tasks
2017-01-11 19:33:15.166  INFO 10843 --- [uler-event-loop] o.a.s.scheduler.FairSchedulableBuilder   : Added task set TaskSet_39 tasks to pool default
2017-01-11 19:33:15.167  INFO 10843 --- [r-event-loop-11] o.apache.spark.scheduler.TaskSetManager  : Starting task 0.0 in stage 39.0 (TID 64, localhost, partition 0,PROCESS_LOCAL, 2264 bytes)
2017-01-11 19:33:15.167  INFO 10843 --- [launch worker-0] org.apache.spark.executor.Executor       : Running task 0.0 in stage 39.0 (TID 64)
2017-01-11 19:33:15.171  INFO 10843 --- [launch worker-0] org.apache.spark.rdd.HadoopRDD           : Input split: file:/mnt/etl/dcc-release-mayfielg/dcc-workflow/ssm_p/project_name=LAML-KR/part-00001:0+851797
2017-01-11 19:33:15.182  INFO 10843 --- [launch worker-0] org.apache.hadoop.io.compress.CodecPool  : Got brand-new decompressor [.deflate]
2017-01-11 19:33:15.182  INFO 10843 --- [launch worker-0] org.apache.hadoop.io.compress.CodecPool  : Got brand-new decompressor [.deflate]
2017-01-11 19:33:15.182  INFO 10843 --- [launch worker-0] org.apache.hadoop.io.compress.CodecPool  : Got brand-new decompressor [.deflate]
2017-01-11 19:33:15.182  INFO 10843 --- [launch worker-0] org.apache.hadoop.io.compress.CodecPool  : Got brand-new decompressor [.deflate]
2017-01-11 19:33:15.195 ERROR 10843 --- [launch worker-0] org.apache.spark.executor.Executor       : Exception in task 0.0 in stage 39.0 (TID 64)

java.lang.NullPointerException: null
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192)
	at com.google.common.base.Splitter.split(Splitter.java:386)
	at org.icgc.dcc.release.job.mask.function.MarkSensitiveRow.getUniqueAlleles(MarkSensitiveRow.java:104)
	at org.icgc.dcc.release.job.mask.function.MarkSensitiveRow.matchesAllControlAlleles(MarkSensitiveRow.java:72)
	at org.icgc.dcc.release.job.mask.function.MarkSensitiveRow.call(MarkSensitiveRow.java:55)
	at org.icgc.dcc.release.job.mask.function.MarkSensitiveRow.call(MarkSensitiveRow.java:41)
	at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1015)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1197)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1205)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

2017-01-11 19:33:15.204  INFO 10843 --- [er-event-loop-2] o.apache.spark.scheduler.TaskSetManager  : Starting task 1.0 in stage 39.0 (TID 65, localhost, partition 1,PROCESS_LOCAL, 2264 bytes)
2017-01-11 19:33:15.206  INFO 10843 --- [launch worker-0] org.apache.spark.executor.Executor       : Running task 1.0 in stage 39.0 (TID 65)
2017-01-11 19:33:15.207  WARN 10843 --- [result-getter-0] o.apache.spark.scheduler.TaskSetManager  : Lost task 0.0 in stage 39.0 (TID 64, localhost): java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192)
	at com.google.common.base.Splitter.split(Splitter.java:386)
	at org.icgc.dcc.release.job.mask.function.MarkSensitiveRow.getUniqueAlleles(MarkSensitiveRow.java:104)
	at org.icgc.dcc.release.job.mask.function.MarkSensitiveRow.matchesAllControlAlleles(MarkSensitiveRow.java:72)
	at org.icgc.dcc.release.job.mask.function.MarkSensitiveRow.call(MarkSensitiveRow.java:55)
	at org.icgc.dcc.release.job.mask.function.MarkSensitiveRow.call(MarkSensitiveRow.java:41)
	at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1015)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1197)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1250)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1205)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

If it’s impossible to skip MASK, is there a way to tell it that it doesn’t need to do anything? Perhaps this is an issue with the validation step. When you don’t have data for this column, how is the missing value represented? Or is that a scenario that never happens?

As far as I understand, the goal of the MASK task is to do exactly the germline information masking that, in this case, has already been done. Is that correct, or am I missing something?
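The stack trace above shows Splitter.split rejecting a null argument inside MarkSensitiveRow.getUniqueAlleles, which suggests the control genotype reaches that method as null. As a rough sketch of what a missing-value guard could look like, here is a Python analogue. It is not the actual MarkSensitiveRow logic: the function name `unique_alleles`, the "/" separator, and treating -888 as the only missing code are assumptions for illustration.

```python
# Assumption: -888 is the only missing-value code we need to handle here.
MISSING = "-888"


def unique_alleles(genotype):
    """Return the distinct alleles in a genotype such as 'A/G'.

    A missing genotype (None, empty, or the -888 code) yields an empty set
    instead of raising, which is the guard the failing code path lacks.
    """
    if genotype is None:
        return set()
    value = genotype.strip()
    if value == "" or value == MISSING:
        return set()
    return set(value.split("/"))
```

With a guard like this, a row with no control genotype simply has no control alleles to compare against. Whether such a row should then count as masked or sensitive is a policy decision this sketch does not make.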

andricDu · January 11, 2017, 8:44pm

Some good news: with our testing infrastructure I can reproduce the same stack trace you are seeing, so we have a clear idea of what is going on.

Since we don’t have the archive that was given to you on hand, can you tell me what percentage of the rows from the TSV data have -888? If it’s a small percentage, then I suggest removing those rows since, according to our dictionary, those fields cannot be null (or -888), so such rows are not valid input. http://docs.icgc.org/dictionary/viewer/#?viewMode=details&dataType=ssm_p&q=genotype
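One way to answer the percentage question is a short script along these lines. It is a sketch under assumptions: the input is a tab-separated file with a header row containing a control_genotype column, and the function name `missing_fraction` and the sample data are made up for illustration.

```python
import csv
import io


def missing_fraction(tsv_file, column="control_genotype", missing="-888"):
    """Return the fraction of data rows whose `column` equals `missing`."""
    # QUOTE_NONE treats quote characters as ordinary data, as in a plain TSV.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    missing_count = 0
    for row in reader:
        total += 1
        if row[column] == missing:
            missing_count += 1
    return missing_count / total if total else 0.0


# Hypothetical four-row sample; a real ssm_p file has many more columns.
SAMPLE = "id\tcontrol_genotype\n1\t-888\n2\tA/G\n3\t-888\n4\t-888\n"
fraction = missing_fraction(io.StringIO(SAMPLE))
```

For a real file, pass an open file handle instead of the in-memory sample, and multiply the result by 100 to get a percentage.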

Are we right in our understanding that the data we provided was to be used for internal testing of the pipeline, or did you have another use for it?

mayfielg · January 11, 2017, 9:01pm

100%. That column (and one or two others, I think) is completely wiped to -888.

I know that my department head ultimately wants to have an instance of the portal running for OHSU with both OHSU data and ICGC data in it. In the long run, he would like a federated system that looks at both our data here and your data there and can query across both sites. My team has informed him that that is a conversation that needs to be had with your top people.

andricDu · January 11, 2017, 9:16pm

I see. For internal testing purposes it should be sufficient to do as you did and populate it with dummy data, though the aggregations will differ from our production data when making comparisons.
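For that kind of internal test run, the placeholder substitution could be sketched roughly as follows. The function name `fill_placeholder`, the "A/A" placeholder value, and the assumption that the file is a plain tab-separated file with a header row are all illustrative. The output is dummy data and must not be treated as real genotypes.

```python
def fill_placeholder(lines, column="control_genotype",
                     missing="-888", placeholder="A/A"):
    """Yield TSV lines with `missing` values in `column` replaced.

    The placeholder is dummy data for pipeline testing only.
    """
    lines = iter(lines)
    header = next(lines).rstrip("\n")
    idx = header.split("\t").index(column)
    yield header
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[idx] == missing:
            fields[idx] = placeholder
        yield "\t".join(fields)


# Hypothetical sample rows for illustration.
SAMPLE_ROWS = [
    "id\tcontrol_genotype\ttumour_genotype\n",
    "1\t-888\tA/G\n",
    "2\tC/C\tC/T\n",
]
filled = list(fill_placeholder(SAMPLE_ROWS))
```

Because every affected row gets the same placeholder, any aggregation that depends on the control genotype will differ from production results.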

As for having access to the raw ICGC data and constructing a federated system, you are right that there need to be discussions at the top levels as to what will happen.

mayfielg · January 12, 2017, 7:03pm

Okay. Thanks for your input Dusan! Let’s continue this conversation after those management discussions have taken place.