Even with the new more aggressive shuffle, minimizing the number of transfers (maps * reduces) is very important to the performance of the job. Notice that in the petabyte sort, each map is processing 15 GB instead of the default 128 MB and each reduce is handling 50 GB.
—
Looks like the bottleneck and optimizations happened this time in the shuffle phase of Hadoop M/R.