machine learning algorithms were evaluated. Experimental results show that GBM
and RF achieve good accuracy and computational performance across diverse Spark
workloads.
In the future, further research could be conducted on parameter selection in order
to include more suitable Spark parameters. Other relevant aspects affecting
performance prediction, such as resource competition between jobs and
fluctuation of network bandwidth, could also be explored.