spark-instrumented-optimizer/common/network-shuffle
“attilapiros” 1b3fc9a111 [SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service
### What changes were proposed in this pull request?

Improve file path name normalisation by removing Spark's approximate transformation and using the JDK's own path normalisation instead.

### Why are the changes needed?

In the external shuffle service, during block resolution, the file paths (for disk-persisted RDD blocks and for shuffle blocks) are normalized by custom Spark code that uses an OS-dependent regexp. This code duplicates its package-private JDK counterpart. Because the two are not a perfect match, one method can even produce a slightly different (but semantically equal) path than the other.

The reason for this redundant transformation is the interning of the normalized path to save some heap, which is only possible if both transformations result in the same string.

Having checked the JDK code, I believe there is a better solution that matches the JDK code exactly, because it uses that package-private method. Moreover, based on some benchmarking, the new method also seems to be more performant.
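
The idea can be illustrated with a minimal sketch. It is not the code from this patch: the class name, the helper `normalizedInternedPath`, and the sample directory and file names are hypothetical. It assumes only the documented behaviour that `java.io.File` normalizes the pathname it is constructed with, so `getPath()` returns the JDK-normalized string, which can then be interned.

```java
import java.io.File;

public class NormalizedPathSketch {

    // Hypothetical helper for illustration only, not the method from the patch.
    // Builds a possibly non-normalized path, lets java.io.File normalize it,
    // and interns the result so equal paths share a single String instance.
    static String normalizedInternedPath(String localDir, int subDirId, String filename) {
        String notNormalizedPath =
            localDir + File.separator + String.format("%02x", subDirId) + File.separator + filename;
        // The File constructor applies the JDK's own (package-private) normalization.
        return new File(notNormalizedPath).getPath().intern();
    }

    public static void main(String[] args) {
        // Two spellings of the same local directory (sample values, assumed).
        String a = normalizedInternedPath("/tmp//spark-local/", 10, "shuffle_0_0_0.data");
        String b = normalizedInternedPath("/tmp/spark-local", 10, "shuffle_0_0_0.data");
        System.out.println(a);
        // Both spellings normalize to equal strings, so interning yields one instance.
        System.out.println(a == b);
    }
}
```

Because the string comes from `File` itself, a `File` later built from it should hold exactly the same path, which is the condition the interning relies on.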

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

As we are reusing the JDK code for normalisation, no test is needed. Even the existing test can be removed.

However, in a separate branch I have created a benchmark that compares the performance of the old and the new solution. It shows the new method performs about 7-10 times better than the old one.

Closes #28967 from attilapiros/SPARK-32149.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-07-11 22:55:26 +09:00
..
src [SPARK-32149][SHUFFLE] Improve file path name normalisation at block resolution within the external shuffle service 2020-07-11 22:55:26 +09:00
pom.xml [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00