Release 'Summarize' task error when running pipeline with ssm files

Hi there,

I have successfully run the release/ETL pipeline at OHSU on OHSU clinical (donor, sample, specimen) files. I have now attempted to run it again on clinical and ssm files and have received the following error:

2016-09-30 16:15:38.572 ERROR 8027 --- [ main] o.i.dcc.release.core.task.TaskExecutor : Aborting task(s) executions due to exception...

java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 188.0 failed 1 times, most recent failure: Lost task 0.0 in stage 188.0 (TID 744, localhost): com.fasterxml.jackson.databind.JsonMappingException: [no message for java.lang.NullPointerException]
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:261)
at com.fasterxml.jackson.databind.ObjectWriter._configAndWriteValue(ObjectWriter.java:802)
at com.fasterxml.jackson.databind.ObjectWriter.writeValueAsBytes(ObjectWriter.java:700)
at org.icgc.dcc.release.core.util.SmileSerializer.write(SmileSerializer.java:71)
at org.icgc.dcc.release.core.util.SmileSerializer.write(SmileSerializer.java:42)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:194)
at org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:147)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:185)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at com.fasterxml.jackson.dataformat.smile.SmileGenerator._writeFieldName(SmileGenerator.java:603)
at com.fasterxml.jackson.dataformat.smile.SmileGenerator.writeFieldName(SmileGenerator.java:473)
at com.fasterxml.jackson.databind.node.ObjectNode.serialize(ObjectNode.java:258)
at com.fasterxml.jackson.databind.node.ObjectNode.serialize(ObjectNode.java:264)
at com.fasterxml.jackson.databind.node.ObjectNode.serialize(ObjectNode.java:264)
at com.fasterxml.jackson.databind.ser.std.SerializableSerializer.serialize(SerializableSerializer.java:44)
at com.fasterxml.jackson.databind.ser.std.SerializableSerializer.serialize(SerializableSerializer.java:29)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:250)
... 16 more

Previously, we had errors in the Summarize task because the 'summary' field did not exist in the Project collection in Mongo. Is this potentially related? Does the summary field need to be not only present but also populated? Here is the (redacted) related Project document:

{
  "_id" : ObjectId("—"),
  "_project_id" : "BAML-US",
  "icgc_id" : "12345",
  "primary_site" : "Blood",
  "project_name" : "AML - OHSU, US",
  "tumour_type" : "Blood cancer",
  "tumour_subtype" : "Acute lymphoblastic leukemia",
  "primary_countries" : [ "United States" ],
  "_summary" : {
    "_ssm_tested_donor_count" : 0,
    "_sgv_tested_donor_count" : 0,
    "_cnsm_tested_donor_count" : 0,
    "_cngv_tested_donor_count" : 0,
    "_stsm_tested_donor_count" : 0,
    "_stgv_tested_donor_count" : 0,
    "_meth_array_tested_donor_count" : 0,
    "_meth_seq_tested_donor_count" : 0,
    "_mirna_seq_tested_donor_count" : 0,
    "_exp_array_tested_donor_count" : 0,
    "_exp_seq_tested_donor_count" : 0,
    "_pexp_tested_donor_count" : 0,
    "_jcn_tested_donor_count" : 0,
    "_available_data_type" : ,
    "_total_donor_count" : 0,
    "_total_sample_count" : 0,
    "_total_specimen_count" : 0,
    "_total_live_donor_count" : 0,
    "_state" : "live",
    "repository" : ,
    "available_experimental_analysis_performed" : ,
    "experimental_analysis_performed_sample_count" : { }
  }
}

Any thoughts would be appreciated.

Thanks,
Georgia

It would appear that the error is occurring because the summary is not being joined correctly. There shouldn't be a requirement for the summary field to be populated prior to this task, since populating it is the task's purpose. Roughly, it is the part of the code that looks up each project's summary from projectSummaryBroadcast and attaches it to the project document.
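The NullPointerException is thrown from SmileGenerator._writeFieldName, which points to a null field name somewhere in the JSON tree being shuffled. Below is a minimal, self-contained sketch of that failure mode; it is illustrative only, not the pipeline's actual code, and the summaryFieldNames map is a hypothetical stand-in for a broadcast lookup that misses its key:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;

import java.util.HashMap;
import java.util.Map;

public class NullFieldNameRepro {

    public static void main(String[] args) throws Exception {
        ObjectMapper smileMapper = new ObjectMapper(new SmileFactory());

        // Hypothetical stand-in for a broadcast lookup; note there is no entry for "BAML-US".
        Map<String, String> summaryFieldNames = new HashMap<>();
        summaryFieldNames.put("OTHER-PROJECT", "_summary");

        ObjectNode project = smileMapper.createObjectNode();
        project.put("_project_id", "BAML-US");

        // The missed lookup returns null, and ObjectNode (Jackson 2.x, as used here
        // in 2016) accepts a null field name without complaint.
        String fieldName = summaryFieldNames.get("BAML-US");
        project.set(fieldName, smileMapper.createObjectNode());

        // Serialization only fails later, with a JsonMappingException wrapping a
        // NullPointerException from SmileGenerator._writeFieldName, matching the
        // stack trace above.
        smileMapper.writeValueAsBytes(project);
    }
}

The point is that a null field name gets into the tree silently and only blows up at write time, which is why the failure surfaces in the shuffle serializer rather than at the join itself.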

Make sure that projectSummaryBroadcast contains the information you expect; I would suggest adding some logging at that point.
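For example, a hypothetical helper along these lines (names and the map signature are illustrative and would need adapting to the actual task class) would make a missing or null entry visible in the task logs:

import com.fasterxml.jackson.databind.node.ObjectNode;

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BroadcastDebug {

    private static final Logger log = LoggerFactory.getLogger(BroadcastDebug.class);

    // Dump the broadcast contents before the join so that a missing or null
    // summary for a project shows up in the logs instead of as a downstream NPE.
    public static void logProjectSummaries(Map<String, ObjectNode> projectSummaries) {
        log.info("projectSummaryBroadcast contains {} project(s)", projectSummaries.size());
        for (Map.Entry<String, ObjectNode> entry : projectSummaries.entrySet()) {
            if (entry.getKey() == null || entry.getValue() == null) {
                log.warn("Suspicious broadcast entry: key={}, value={}", entry.getKey(), entry.getValue());
            } else {
                log.info("Project {} summary: {}", entry.getKey(), entry.getValue());
            }
        }
    }
}

Calling this with projectSummaryBroadcast.value() inside the task should show whether BAML-US is present at all, and whether its summary is non-null.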