Spark

2021


Dynamically Allocate Executors when Executing Jobs in Spark

·4 mins

I wrote a Spark program to process logs. The volume of logs keeps changing over time. To ensure logs are processed promptly, the number of executors was sized for the peak number of logs per minute. As a consequence, executor CPU usage is low most of the time. To reduce the wasted resources, I tried to find a way to scale the executors while the program is running.
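For reference, Spark ships a built-in dynamic allocation mechanism that resizes the executor pool at runtime; the post may take a different route, but a minimal PySpark sketch looks like this (the min/max bounds and the external shuffle service requirement are assumptions about the cluster):

```python
from pyspark.sql import SparkSession

# Let Spark grow and shrink the executor pool based on pending tasks.
# The bounds and timeouts below are placeholder values; tune them for
# the actual log volume and cluster.
spark = (
    SparkSession.builder.appName("log-processor")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .config("spark.shuffle.service.enabled", "true")  # assumed available on the cluster
    .getOrCreate()
)
```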

2020


Program Crash Caused by CPU Instruction

·3 mins

Dealing with bugs is inevitable in a coding career. The main parts of coding are implementing new features, fixing bugs, and improving performance. For me, two kinds of bugs are difficult to tackle: those that are hard to reproduce, and those that occur in code you didn't write.

Import custom package or module in PySpark

·1 min

First, zip all of the dependencies into a zip file like this. Then you can use one of the following methods to import it.

|-- kk.zip
|   |-- kk.py

Using --py-files in spark-submit #

When submitting a Spark job, add the --py-files=kk.zip parameter. kk.zip will be distributed alongside the main script file, and kk.zip will be inserted at the beginning of the PYTHONPATH environment variable.
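For concreteness, here is a minimal sketch of the submit command and a main script that uses the shipped module; the file name main.py is an assumption for illustration:

```python
# main.py — submit with:  spark-submit --py-files=kk.zip main.py
# `import kk` resolves because kk.zip (which contains kk.py) is placed on the PYTHONPATH.
from pyspark.sql import SparkSession

import kk  # module shipped inside kk.zip

spark = SparkSession.builder.appName("py-files-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(4))

# kk is also importable inside executor tasks, since the zip is shipped to each executor.
print(rdd.map(lambda x: kk.__name__).collect())

spark.stop()
```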