Using JSONField before Django 3.1

KK published on 2021-09-11 included in Programming

Since Django 3.1, Django can save Python data to the database as JSON-encoded data, and it is also possible to query on the values stored in a JSONField. The detailed usage can be found here. If you are using an older version and want to try this feature, many packages have ported it, but I recommend django-jsonfield-backport.

django-jsonfield-backport

This package saves data as JSON in the database and also supports JSON queries. If your database meets the requirements (MySQL > 5.7, PostgreSQL > 9.5, MariaDB > 10.2 or SQLite > 3.9 with the JSON1 extension), you can use JSONField just like Django’s native implementation.
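To get a feel for what a "JSON query" means at the database level, here is a minimal sketch using only Python's built-in sqlite3 module. It is not the API of Django or django-jsonfield-backport; the profile table and its contents are made up for illustration, and it assumes your SQLite build includes the JSON1 functions.

```python
import json
import sqlite3

# In-memory database; the "profile" table is a made-up example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profile (id INTEGER PRIMARY KEY, data TEXT)")

rows = [
    {"name": "alice", "settings": {"theme": "dark"}},
    {"name": "bob", "settings": {"theme": "light"}},
]
# Python data is stored as JSON-encoded text.
conn.executemany(
    "INSERT INTO profile (data) VALUES (?)",
    [(json.dumps(row),) for row in rows],
)

# Query on a value nested inside the JSON document (needs JSON1).
matches = conn.execute(
    "SELECT json_extract(data, '$.name') FROM profile "
    "WHERE json_extract(data, '$.settings.theme') = ?",
    ("dark",),
).fetchall()
print(matches)
```

A JSONField hides this encoding and the database-specific query syntax behind the ORM, which is why the database has to support JSON functions in the first place.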

Dynamically Allocate Executors when Executing Jobs in Spark

KK published on 2021-07-18 included in Misc

I wrote a Spark program to process logs. The number of logs changes constantly over time. To ensure logs are processed promptly, the number of executors is sized for the maximum number of logs per minute. As a consequence, CPU usage on the executors is low. To reduce this waste, I looked for a way to schedule executors while the program is running.

Within a single day, the maximum number of logs per minute can be a dozen times greater than the minimum.
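To make the waste concrete, here is a rough back-of-the-envelope calculation. All numbers are invented for illustration and are not the author's data: the load samples have a 12x gap between minimum and maximum, and CAPACITY_PER_EXECUTOR is an assumed throughput of one executor.

```python
import math

# Invented samples of logs per minute at different times of day.
logs_per_minute = [1_000, 2_000, 6_000, 12_000, 9_000, 3_000]
# Assumed number of logs one executor can process per minute.
CAPACITY_PER_EXECUTOR = 500


def executors_needed(load, capacity=CAPACITY_PER_EXECUTOR):
    """Smallest number of executors that can keep up with the given load."""
    return math.ceil(load / capacity)


# Static sizing: provision for the peak and keep that many executors all day.
static_executors = executors_needed(max(logs_per_minute))

# Average utilization of the statically sized set of executors.
average_utilization = sum(logs_per_minute) / (
    len(logs_per_minute) * static_executors * CAPACITY_PER_EXECUTOR
)

# What each sample would need if the executor count could follow the load.
dynamic_executors = [executors_needed(load) for load in logs_per_minute]

print(static_executors, dynamic_executors, round(average_utilization, 2))
```

Under these assumptions the statically sized executors sit idle more than half of the time, which is the gap that adjusting the executor count at runtime tries to close.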

Improve Kafka throughput

KK published on 2021-05-28 included in Misc

Kafka is a high-performance and scalable messaging system. When handling big data, the default configuration may sometimes limit the maximum performance. In this article, I’ll explain how messages are generated and saved in Kafka, and how to improve performance by changing the configuration.

Kafka Internals

How does Producer Send Messages?

In short, messages are assembled into batches (named RecordBatch) and sent to the broker.

The producer manages some internal queues, and each queue contains the RecordBatches that will be sent to one broker. When the send method is called, the producer looks into the internal queue and tries to append the message to a RecordBatch that is still smaller than batch.size (default value is 16KB); if that is not possible, it creates a new RecordBatch.
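The append-or-create behaviour described above can be sketched with a toy model. This is not Kafka's real code: ToyAccumulator and its methods are hypothetical names, a batch is just a list of byte strings, and the sketch ignores record overhead, compression, and time-based flushing.

```python
from collections import defaultdict, deque

BATCH_SIZE = 16 * 1024  # mirrors the 16KB batch.size default mentioned above


class ToyAccumulator:
    """Toy model of the producer's internal queues (hypothetical, simplified)."""

    def __init__(self, batch_size=BATCH_SIZE):
        self.batch_size = batch_size
        # One queue of batches per destination; a batch is a list of messages.
        self.queues = defaultdict(deque)

    def send(self, destination, message: bytes):
        queue = self.queues[destination]
        if queue:
            last_batch = queue[-1]
            used = sum(len(m) for m in last_batch)
            # Append to the newest batch if the message still fits.
            if used + len(message) <= self.batch_size:
                last_batch.append(message)
                return
        # No batch yet, or the newest one is full: start a new batch.
        queue.append([message])

    def batch_counts(self, destination):
        """Number of messages in each batch queued for a destination."""
        return [len(batch) for batch in self.queues[destination]]
```

For example, with a tiny batch size of 10 bytes, three 4-byte messages to the same destination end up in two batches, because the third message no longer fits into the first one. A larger batch size means fewer, bigger requests to the broker.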

Fix Error: Cask 'java' is unavailable in Homebrew

KK published on 2021-03-07 included in Misc

After updating brew to the latest version, every cask-related command, such as brew list --cask, fails with Error: Cask 'java' is unavailable: No Cask with this name exists. However, plain brew commands still work.

After doing some research, I found that Java has been moved to homebrew/core. This makes sense now: I installed java through cask, but that cask no longer exists, so cask throws this error. If I uninstall java from cask, the error should disappear.

Timezone in JVM

KK published on 2020-10-18 included in Misc

I wrote some Scala code to get the current time. However, the output differs between my development server and Docker.

import java.util.Calendar

println(Calendar.getInstance().getTime)

On my development server, it outputs Sun Oct 18 18:01:01 CST 2020, but in Docker it prints a UTC time.

I guessed it was related to the timezone setting and did some research. Here is the result.

Using cibuildwheel to Create Python Wheels

KK published on 2020-07-29 included in Programming

Have you ever tried to install MySQL-python? It contains C code, which has to be compiled while the package is being installed. You have to follow the steps in this article: Install MySQL and MySQLClient(Python) in MacOS. Things get worse if you are using Windows.

Luckily, a new distribution format, Wheel, has been published in PEP 427.

The wheel binary package format frees installers from having to know about the build system, saves time by amortizing compile time over many installations, and removes the need to install a build system in the target environment.
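Because a wheel is already built, a wheel containing compiled code targets a specific interpreter and platform, and PEP 427 encodes this in the file name: {distribution}-{version}(-{build tag})?-{python tag}-{abi tag}-{platform tag}.whl. Below is a minimal sketch of a parser for that convention. The function split_wheel_filename and the example file names are made up for illustration; real tools should use an existing library rather than this helper.

```python
def split_wheel_filename(filename):
    """Split a wheel file name into the fields defined by PEP 427."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the python tag.
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return {
        "distribution": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


# Hypothetical file names: one compiled wheel and one pure-Python wheel.
compiled = split_wheel_filename("example_pkg-1.0.0-cp39-cp39-manylinux2014_x86_64.whl")
pure = split_wheel_filename("example_pkg-1.0.0-py3-none-any.whl")
print(compiled["platform"], pure["platform"])
```

A package with C code needs one such file per Python version and platform it wants to support, while a pure-Python package can ship a single wheel with the platform tag any.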