You're right, that's 30 threads, not processes. I didn't realize that htop shows threads by default.
I started looking into the process performance after I got some statistics using collectl when a worker died. collectl shows 23K minor page faults per second at the time the RPC timeout happened, and the resident memory (VmRSS) is at about 25GB (25370M).
Date  Time     PID  User S VmSize VmLck VmRSS  VmData VmStk VmExe VmLib  VmSwp MajF MinF Command
02/09 22:34:46 6381 root S 33679M 0     25370M 33600M 136K  4K    17596K 0     0    23K  /usr/lib/jvm/java-8-oracle/jre/bin/java
MinF: Minor Page Faults per second
VmRSS: Size of Resident Virtual Memory
VmStk: Size of Virtual Memory used for stack
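Minor page faults are faults that can be satisfied without disk I/O, for example by mapping a page that is already in memory (possibly one shared with another process) or by handing out a fresh page. It seems strange that the OS would be doing this much page mapping for a single process.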
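To cross-check collectl's MinF figure, the same counters can be read straight from /proc. This is only a minimal sketch, assuming Linux and the field layout described in proc(5), where minflt is field 10 and majflt is field 12 of /proc/<pid>/stat. The function names are made up for illustration.

```python
import time


def parse_faults(stat_line):
    """Return (minflt, majflt) from the contents of /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST ')' before tokenising.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 10 (minflt) is rest[7]
    # and field 12 (majflt) is rest[9].
    return int(rest[7]), int(rest[9])


def fault_rates(pid, interval=1.0):
    """Sample the counters twice and return (minor, major) faults per second."""
    def sample():
        with open(f"/proc/{pid}/stat") as f:
            return parse_faults(f.read())

    min0, maj0 = sample()
    time.sleep(interval)
    min1, maj1 = sample()
    return (min1 - min0) / interval, (maj1 - maj0) / interval
```

Calling fault_rates(6381) on the worker host would give a number to compare with the MinF column. If the JVM is using ordinary 4 KiB pages (an assumption, since it could be using huge pages), 23K minor faults per second works out to roughly 90 MiB of pages being mapped every second.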
I'm re-running with SPARK_WORKER_CORES=3 on 6-core machines. It has already been running longer than last time, which is a good sign.