31b59bd805
### What changes were proposed in this pull request?

When starting python processes, set `OMP_NUM_THREADS` to the number of cores allocated to an executor or driver if `OMP_NUM_THREADS` is not already set. Each python process will use the same `OMP_NUM_THREADS` setting, even if workers are not shared.

Without this, OpenMP creates a thread pool for parallel processing with a number of threads equal to the number of cores on the executor's machine, which [significantly increases memory consumption](https://github.com/numpy/numpy/issues/10455). Instead, this thread pool should be sized by the number of cores allocated to the executor, if available. If a setting for the number of cores is not available, this doesn't change any behavior. OpenMP is used by numpy and pandas.

### Why are the changes needed?

To reduce memory consumption for PySpark jobs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Validated that this reduces python worker memory consumption by more than 1GB on our cluster.

Closes #25545 from rdblue/SPARK-28843-set-omp-num-cores.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
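The precedence described above (an existing `OMP_NUM_THREADS` always wins, and nothing is set when no core count is available) can be sketched as follows. `apply_omp_default` is a hypothetical helper for illustration, not Spark's actual code:

```python
def apply_omp_default(env, num_cores):
    """Set OMP_NUM_THREADS from the allocated core count, but only if
    the user has not already set it and a core count is available.

    env: dict of environment variables for the python worker process
    num_cores: cores allocated to the executor/driver, or None if unknown
    """
    if "OMP_NUM_THREADS" not in env and num_cores is not None:
        env["OMP_NUM_THREADS"] = str(num_cores)
    return env
```

For example, a worker launched with 4 allocated cores and no user override would get `OMP_NUM_THREADS=4`, while a user-provided value of `2` would be left untouched even if 8 cores were allocated.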