Release 'Summarize' task error when running pipeline with ssm files

Hi there,

I have successfully run the release/ETL pipeline at OHSU on OHSU clinical (donor, sample, specimen) files. I have now attempted to run it again on clinical and ssm files and have received the following error:

2016-09-30 16:15:38.572 ERROR 8027 --- [ main] o.i.dcc.release.core.task.TaskExecutor : Aborting task(s) executions due to exception...

java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 188.0 failed 1 times, most recent failure: Lost task 0.0 in stage 188.0 (TID 744, localhost): com.fasterxml.jackson.databind.JsonMappingException: [no message for java.lang.NullPointerException]
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:261)
at com.fasterxml.jackson.databind.ObjectWriter._configAndWriteValue(ObjectWriter.java:802)
at com.fasterxml.jackson.databind.ObjectWriter.writeValueAsBytes(ObjectWriter.java:700)
at org.icgc.dcc.release.core.util.SmileSerializer.write(SmileSerializer.java:71)
at org.icgc.dcc.release.core.util.SmileSerializer.write(SmileSerializer.java:42)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:194)
at org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:147)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:185)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at com.fasterxml.jackson.dataformat.smile.SmileGenerator._writeFieldName(SmileGenerator.java:603)
at com.fasterxml.jackson.dataformat.smile.SmileGenerator.writeFieldName(SmileGenerator.java:473)
at com.fasterxml.jackson.databind.node.ObjectNode.serialize(ObjectNode.java:258)
at com.fasterxml.jackson.databind.node.ObjectNode.serialize(ObjectNode.java:264)
at com.fasterxml.jackson.databind.node.ObjectNode.serialize(ObjectNode.java:264)
at com.fasterxml.jackson.databind.ser.std.SerializableSerializer.serialize(SerializableSerializer.java:44)
at com.fasterxml.jackson.databind.ser.std.SerializableSerializer.serialize(SerializableSerializer.java:29)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:250)
... 16 more

Previously, we had errors in the Summarize task because the 'summary' field did not exist in the Project collection in Mongo. Is this potentially related? Does the summary field need to be not only present but also populated? Here is the (redacted) related Project document:

{
  "_id" : ObjectId("—"),
  "_project_id" : "BAML-US",
  "icgc_id" : "12345",
  "primary_site" : "Blood",
  "project_name" : "AML - OHSU, US",
  "tumour_type" : "Blood cancer",
  "tumour_subtype" : "Acute lymphoblastic leukemia",
  "primary_countries" : [ "United States" ],
  "_summary" : {
    "_ssm_tested_donor_count" : 0,
    "_sgv_tested_donor_count" : 0,
    "_cnsm_tested_donor_count" : 0,
    "_cngv_tested_donor_count" : 0,
    "_stsm_tested_donor_count" : 0,
    "_stgv_tested_donor_count" : 0,
    "_meth_array_tested_donor_count" : 0,
    "_meth_seq_tested_donor_count" : 0,
    "_mirna_seq_tested_donor_count" : 0,
    "_exp_array_tested_donor_count" : 0,
    "_exp_seq_tested_donor_count" : 0,
    "_pexp_tested_donor_count" : 0,
    "_jcn_tested_donor_count" : 0,
    "_available_data_type" : ,
    "_total_donor_count" : 0,
    "_total_sample_count" : 0,
    "_total_specimen_count" : 0,
    "_total_live_donor_count" : 0,
    "_state" : "live",
    "repository" : ,
    "available_experimental_analysis_performed" : ,
    "experimental_analysis_performed_sample_count" : { }
  }
}

Any thoughts would be appreciated.

Thanks,
Georgia

It would appear that the error is occurring because the summary is not being joined correctly. There shouldn't be a requirement for the summary field to be populated prior to this task, since populating it is the task's purpose. Roughly, it is the part of the code that looks up each project's summary from projectSummaryBroadcast and attaches it to the project document.
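The NullPointerException is thrown from SmileGenerator._writeFieldName, which points to a null field name somewhere in the JSON tree being shuffled. Below is a minimal, self-contained sketch of that failure mode; it is illustrative only, not the pipeline's actual code, and the summaryFieldNames map is a hypothetical stand-in for a broadcast lookup that misses its key:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;

import java.util.HashMap;
import java.util.Map;

public class NullFieldNameRepro {

    public static void main(String[] args) throws Exception {
        ObjectMapper smileMapper = new ObjectMapper(new SmileFactory());

        // Hypothetical stand-in for a broadcast lookup; note there is no entry for "BAML-US".
        Map<String, String> summaryFieldNames = new HashMap<>();
        summaryFieldNames.put("OTHER-PROJECT", "_summary");

        ObjectNode project = smileMapper.createObjectNode();
        project.put("_project_id", "BAML-US");

        // The missed lookup returns null, and ObjectNode (Jackson 2.x, as used here
        // in 2016) accepts a null field name without complaint.
        String fieldName = summaryFieldNames.get("BAML-US");
        project.set(fieldName, smileMapper.createObjectNode());

        // Serialization only fails later, with a JsonMappingException wrapping a
        // NullPointerException from SmileGenerator._writeFieldName, matching the
        // stack trace above.
        smileMapper.writeValueAsBytes(project);
    }
}

The point is that a null field name gets into the tree silently and only blows up at write time, which is why the failure surfaces in the shuffle serializer rather than at the join itself.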

Make sure that projectSummaryBroadcast contains the information you expect; I would suggest adding some logging at that point.
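For example, a hypothetical helper along these lines (names and the map signature are illustrative and would need adapting to the actual task class) would make a missing or null entry visible in the task logs:

import com.fasterxml.jackson.databind.node.ObjectNode;

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BroadcastDebug {

    private static final Logger log = LoggerFactory.getLogger(BroadcastDebug.class);

    // Dump the broadcast contents before the join so that a missing or null
    // summary for a project shows up in the logs instead of as a downstream NPE.
    public static void logProjectSummaries(Map<String, ObjectNode> projectSummaries) {
        log.info("projectSummaryBroadcast contains {} project(s)", projectSummaries.size());
        for (Map.Entry<String, ObjectNode> entry : projectSummaries.entrySet()) {
            if (entry.getKey() == null || entry.getValue() == null) {
                log.warn("Suspicious broadcast entry: key={}, value={}", entry.getKey(), entry.getValue());
            } else {
                log.info("Project {} summary: {}", entry.getKey(), entry.getValue());
            }
        }
    }
}

Calling this with projectSummaryBroadcast.value() inside the task should show whether BAML-US is present at all, and whether its summary is non-null.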