Hello,
Our ETL DOCUMENT step fails with an RPC timeout on a workflow of 38 projects. The failing task is:
Failed to execute task 'mutation-centric-document-task:mutation-centric'
The HDFS logs contain messages such as:
error processing WRITE_BLOCK operation
WARN Slow ReadProcessor read fields took 32320ms (threshold=30000ms);
I’ve reduced the replication factor to 1 and raised dfs.namenode.handler.count, which let me finish processing a set of 22 projects, but the 38-project workflow still fails in the DOCUMENT step. What HDFS settings are you using? Here’s our Ansible setup from roles/hdfs/defaults/main.yml:
hdfs_namenode_properties:
  # The user submitting the job needs to be added to the "hadoop" group, or
  # filesystem security must be turned off.
  # https://hadoop.apache.org/docs/r1.2.1/hdfs_permissions_guide.html#Configuration+Parameters
  - { name: "dfs.permissions.superusergroup", value: "hadoop" }
  - { name: "dfs.namenode.name.dir", value: "/media/persistent0" }
  - { name: "dfs.replication", value: "1" }
  # Set handler threads to min(20 * log2(cluster size), 200):
  # https://community.hortonworks.com/questions/63511/namenode-handler-count.html
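  # For value 70, that formula implies a cluster of roughly 11 nodes
  # (20 * log2(11) ≈ 69); presumably this should be recomputed if the
  # cluster grows.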
- { name: "dfs.namenode.handler.count", value: 70 }

hdfs_datanode_properties:
  - { name: "dfs.permissions.superusergroup", value: "hadoop" }
  - { name: "dfs.datanode.data.dir", value: "{{ hdfs_disks | map(attribute='mount_point') | join(',') }}" }
  - { name: "dfs.datanode.max.transfer.threads", value: "5000" }

jdk_home: /usr/lib/jvm/java-8-oracle/
Thanks,
Brian K.