HDFS configuration?

Hello,

Our ETL DOCUMENT step fails on a workflow of 38 projects with an RPC timeout. The failing task is

Failed to execute task 'mutation-centric-document-task:mutation-centric'

HDFS logs have messages saying
error processing WRITE_BLOCK operation
WARN Slow ReadProcessor read fields took 32320ms (threshold=30000ms);

I’ve reduced the replication factor to 1 and raised dfs.namenode.handler.count, which let a run of 22 projects finish, but the DOCUMENT step still fails at 38 projects. What HDFS settings are you using? Here’s our Ansible setup in roles/hdfs/defaults/main.yml:

hdfs_namenode_properties:
  # need to add user submitting job to the "hadoop" group, or turn filesystem
  # security off.
  # https://hadoop.apache.org/docs/r1.2.1/hdfs_permissions_guide.html#Configuration+Parameters
  - { name: "dfs.permissions.superusergroup", value: "hadoop" }
  - { name: "dfs.namenode.name.dir", value: "/media/persistent0" }
  - { name: "dfs.replication", value: "1" }
  # set handler threads to min(20*log2(cluster size), 200)
  # https://community.hortonworks.com/questions/63511/namenode-handler-count.html
  - { name: "dfs.namenode.handler.count", value: "70" }
jdk_home: /usr/lib/jvm/java-8-oracle/
hdfs_datanode_properties:
  - { name: "dfs.permissions.superusergroup", value: "hadoop" }
  - { name: "dfs.datanode.data.dir", value: "{{ hdfs_disks | map(attribute='mount_point') | join(',') }}" }
  - { name: "dfs.datanode.max.transfer.threads", value: "5000" }
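
The 30000ms threshold in that "Slow ReadProcessor" warning is the default of dfs.client.slow.io.warning.threshold.ms, so the client is only just over it. One thing worth trying before more handler tuning is raising the socket timeouts on the write pipeline. A sketch of what that could look like in the same properties list (the values here are guesses, not something we run; defaults are 60000 ms and 480000 ms respectively):

```yaml
# Hypothetical additions, not from our current setup: give slow write
# pipelines more headroom before the client or datanode gives up.
hdfs_timeout_properties:
  # client-side read/connect timeout (default 60000 ms)
  - { name: "dfs.client.socket-timeout", value: "120000" }
  # datanode write-pipeline timeout (default 480000 ms)
  - { name: "dfs.datanode.socket.write.timeout", value: "960000" }
```

Whether that fixes things or just hides a saturated disk/network is another question, of course.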

Thanks,
Brian K.

We are using a dfs.replication of 3 and a dfs.blocksize of 134217728.
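
(That blocksize is just the stock default spelled out in bytes; a quick arithmetic check, nothing HDFS-specific:)

```python
# 134217728 bytes is exactly the default HDFS block size of 128 MiB
blocksize = 134217728
print(blocksize == 128 * 1024 * 1024)      # True
print(blocksize // (1024 * 1024), "MiB")   # 128 MiB
```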

Just curious, how are you building your HDFS cluster? Standalone, Hortonworks or Cloudera?

Dusan, thanks, we have the same blocksize. HDFS is set up by our Ansible scripts from the Cloudera repository:

- name: Configure Cloudera APT key
  apt_key: url="http://archive.cloudera.com/cdh5/ubuntu/{{ ansible_distribution_release }}/amd64/cdh/archive.key"
           state=present

- name: Configure the Cloudera APT repositories
  apt_repository: repo="deb [arch=amd64] http://archive.cloudera.com/cdh5/ubuntu/{{ ansible_distribution_release }}/amd64/cdh {{ ansible_distribution_release }}-{{ hdfs_cloudera_distribution }} contrib"
                  state=present

- name: Install Hadoop DataNode and client
  apt: pkg={{ item }}
       state=present
  with_items:
    - hadoop-hdfs-datanode
    - hadoop-client
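
As an aside, on newer Ansible versions the apt module takes a package list directly, which is now preferred over with_items. The same install task could be written as (a sketch, same packages):

```yaml
# Equivalent task in current Ansible style: pass the list to `name`
# instead of looping with with_items.
- name: Install Hadoop DataNode and client
  apt:
    name:
      - hadoop-hdfs-datanode
      - hadoop-client
    state: present
```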