Co-authors: Chenzhao Guo, Hao Cheng (Intel), Yucai Yu, Yuanjian Li (Baidu)

Spark SQL* is the most popular component of Apache Spark* and is widely used to process large-scale structured data in the data center. However, Spark SQL still suffers from some ease-of-use and performance challenges when facing ultra-large data volumes on a large cluster. To address these challenges, the Intel big data team and the Baidu infrastructure team refined the Adaptive Execution implementation based on existing upstream work. This article describes the challenges in detail, presents the Adaptive Execution architecture and our optimization work, and then reports the performance gains on a 99-node cluster for the TPC-DS benchmark at the 100 TB data scale. Lastly, we share the benefits of Adaptive Execution's adoption in Baidu's Big SQL platform.

In Spark SQL, the shuffle partition number is configured via spark.sql.shuffle.partitions, and the default value is 200. This parameter determines the number of reduce tasks and impacts query performance significantly. For example, suppose a Spark SQL query runs on E executors with C cores per executor, and the shuffle partition number is P. Then every reduce stage (that is, every stage except the initial map stage) needs to run P tasks. Under Spark's scheduling model, the E x C task-executing units will execute the P tasks until all tasks are finished. During this process, if some tasks deal with much more data than others (i.e. data skew), those tasks become stragglers that dominate the stage's running time.
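The scheduling arithmetic above can be sketched in a few lines of plain Python. This is an illustrative model only, not Spark code: the function name `reduce_task_waves` is our own, and it simply computes how many "waves" of tasks the E x C execution slots need to drain P reduce tasks, assuming every task takes roughly the same time (which is exactly the assumption that data skew breaks).

```python
import math

def reduce_task_waves(executors: int, cores_per_executor: int,
                      shuffle_partitions: int) -> int:
    """Illustrative model: E x C task slots execute P reduce tasks;
    return the number of scheduling waves needed to finish them all."""
    slots = executors * cores_per_executor  # E x C concurrent task slots
    return math.ceil(shuffle_partitions / slots)

# With the default spark.sql.shuffle.partitions = 200 and a hypothetical
# cluster of 10 executors x 5 cores = 50 slots:
print(reduce_task_waves(10, 5, 200))  # -> 4 waves of reduce tasks
```

If P is too small, each task processes too much data (or spills); if P is too large, the last wave is mostly idle slots and per-task overhead dominates, which is why a single static value rarely suits every stage of every query.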