Will re-running ETL IMPORT overwrite old data in staging area?

kibri · January 26, 2017, 8:15pm

A run of the ETL pipeline had errors when accessing MongoDB and Postgres in the IMPORT step, but the processing continued until it failed in SUMMARIZE. I fixed the DB access problems. Now, is there a way to tell if the IMPORT created bad data in the staging area? If I re-run IMPORT, will it overwrite bad data, or will it skip IMPORT processing if the output files already exist? Is there a way to remove just files created by IMPORT? Or do I need to delete staging and start over?

Thanks,
Brian K.

andricDu · January 26, 2017, 8:29pm

Hi Brian,

Rerunning a job should regenerate its output, overwriting the files from the previous run of that job.

If you know the INDEX job didn’t run with the correct data, you can start the ETL pipeline from that job; it will overwrite the old output and continue with the correct data.

./release.sh -j INDEX- <release_name>

andricDu · January 26, 2017, 8:32pm

Just to be specific, the first step of a job is to clean the output of any previous run.

https://github.com/icgc-dcc/dcc-release/blob/develop/dcc-release-job/dcc-release-job-import/src/main/java/org/icgc/dcc/release/job/imports/core/ImportJob.java#L63

/**
 * Dependencies.
 */
@NonNull
private final MongoProperties properties;


@Override
public JobType getType() {
  return JobType.IMPORT;
}


@Override
public void execute(@NonNull JobContext jobContext) {
  clean(jobContext);
  imports(jobContext);
}


private void clean(JobContext jobContext) {
  delete(jobContext, PROJECT, GENE, GENE_SET, DIAGRAM, DRUG, CLINVAR, CIVIC);
}
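
To make the clean-then-import behaviour in the excerpt above concrete, here is a minimal standalone sketch of the same pattern using plain java.nio. It is not the dcc-release implementation: the class name, the output names, the file name `part-00000` and the directory layout are all hypothetical, chosen only to show why a re-run replaces the job's own outputs and leaves everything else in the working directory alone.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration only; names and layout are assumptions, not dcc-release code. */
final class CleanThenImportSketch {

  // Hypothetical stand-ins for the outputs a job owns (the real excerpt lists PROJECT, GENE, ...).
  static final List<String> OUTPUTS = Arrays.asList("project", "gene");

  /** Mirrors the excerpt's execute(): clean the job's own outputs first, then write them again. */
  static void execute(Path workDir, String payload) throws IOException {
    clean(workDir);
    imports(workDir, payload);
  }

  /** Deletes only the outputs this job owns; anything else under workDir is left untouched. */
  static void clean(Path workDir) throws IOException {
    for (String name : OUTPUTS) {
      deleteRecursively(workDir.resolve(name));
    }
  }

  /** Writes one file per owned output; the payload stands in for the imported data. */
  static void imports(Path workDir, String payload) throws IOException {
    for (String name : OUTPUTS) {
      Path dir = Files.createDirectories(workDir.resolve(name));
      Files.write(dir.resolve("part-00000"), payload.getBytes(StandardCharsets.UTF_8));
    }
  }

  /** Removes a file or directory tree, deepest entries first; does nothing if the path is absent. */
  static void deleteRecursively(Path path) throws IOException {
    if (!Files.exists(path)) {
      return;
    }
    List<Path> entries;
    try (Stream<Path> walk = Files.walk(path)) {
      entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path entry : entries) {
      Files.delete(entry);
    }
  }
}
```

In this sketch, calling `execute` a second time on the same directory removes everything under `project` and `gene`, including stray files left by a failed run, before writing fresh copies. A file owned by some other job in the same directory would survive. Whether the real staging area behaves the same way for every job depends on what each job's own `clean` step deletes.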