
Python

2019


Torchtext snippets

·1 min

Load separate files #

The data.Field parameters are listed here.

When you call build_vocab, torchtext adds <unk> to the vocabulary. Set unk_token=None if you want to remove it. If sequential=True (the default), it also adds <pad> to the vocabulary. By default, <unk> and <pad> are placed at the beginning of the vocabulary list.

Circular Import in Python

·2 mins

Recently, I found a really good code example of a circular import in Python, and I’d like to record it here.

Here is the code:

# X.py
def X1():
    return "x1"

from Y import Y2

def X2():
    return "x2"

# Y.py
def Y1():
    return "y1"

from X import X1

def Y2():
    return "y2"

Guess what will happen if you run python X.py and python Y.py?
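
One way to check the answer is to run both files. The sketch below writes the two modules to a temporary directory and runs each as a script; the helper name `run_both` and the use of `subprocess` are my own additions for illustration, not part of the original example. My reading of the import system is that `python X.py` fails and `python Y.py` succeeds. A script runs under the name `__main__`, so the other module imports it a second time under its real name, and that second import works only if the name it needs is already defined above the import line.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# The two modules from the example above, unchanged.
X_SOURCE = '''def X1():
    return "x1"

from Y import Y2

def X2():
    return "x2"
'''

Y_SOURCE = '''def Y1():
    return "y1"

from X import X1

def Y2():
    return "y2"
'''


def run_both():
    """Write X.py and Y.py to a temporary directory and run each as a script."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "X.py").write_text(X_SOURCE)
        Path(tmp, "Y.py").write_text(Y_SOURCE)
        for name in ("X.py", "Y.py"):
            proc = subprocess.run(
                [sys.executable, name],
                cwd=tmp, capture_output=True, text=True,
            )
            results[name] = (proc.returncode, proc.stderr)
    return results


results = run_both()
for name, (code, err) in results.items():
    # Show only the last line of the traceback, if there is one.
    last = err.strip().splitlines()[-1] if err.strip() else "no error"
    print(name, "exit code", code, "|", last)
```

When X.py is the script, Y imports a fresh copy of X, which in turn asks for Y2 before Y has defined it. When Y.py is the script, the fresh copy of Y asks for X1, which X has already defined by that point.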

Python Dictionary Implementation

·3 mins

Overview #

  1. CPython allocates memory to store the dictionary. The initial table size is 8, and each slot stores an entry as <hash, key, value> (the slot layout changed in Python 3.6).
  2. When a new key is added, CPython uses i = hash(key) & mask, where mask = table_size - 1, to calculate which slot the entry should go in. If the slot is occupied, CPython uses a probing algorithm to find an empty slot for the new item.
  3. When the table is 2/3 full, it is resized.
  4. When getting an item from the dictionary, an entry matches only if both the hash and the key are equal.
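
The four points above can be sketched as a toy hash table. This is a minimal illustration, not CPython's implementation: the class name `ToyDict` is hypothetical, the probing step is plain linear probing (CPython uses a different probe sequence), and the table simply doubles when it is 2/3 full.

```python
class ToyDict:
    """Toy open-addressing hash table; a simplification, not CPython's code."""

    def __init__(self):
        self.size = 8                    # initial table size, always a power of two
        self.slots = [None] * self.size  # each slot is None or (hash, key, value)
        self.used = 0

    def _find_slot(self, slots, key, h):
        mask = len(slots) - 1
        i = h & mask                     # i = hash(key) & mask
        while True:
            entry = slots[i]
            if entry is None:
                return i                 # empty slot: the key is not present
            if entry[0] == h and entry[1] == key:
                return i                 # both hash and key must be equal
            i = (i + 1) & mask           # assumption: linear probing for simplicity

    def __setitem__(self, key, value):
        h = hash(key)
        i = self._find_slot(self.slots, key, h)
        if self.slots[i] is None:
            self.used += 1
        self.slots[i] = (h, key, value)
        if self.used * 3 >= self.size * 2:   # table is 2/3 full
            self._resize()

    def __getitem__(self, key):
        entry = self.slots[self._find_slot(self.slots, key, hash(key))]
        if entry is None:
            raise KeyError(key)
        return entry[2]

    def _resize(self):
        # Assumption: double the table; re-insert every entry under the new mask.
        new_slots = [None] * (self.size * 2)
        for entry in self.slots:
            if entry is not None:
                new_slots[self._find_slot(new_slots, entry[1], entry[0])] = entry
        self.slots = new_slots
        self.size *= 2


d = ToyDict()
for n in range(100):
    d[n] = n * n
d["a"] = "letter"
print(d[7], d["a"], d.size)
```

Because the table is resized before it fills up, the probe loop always reaches either the matching entry or an empty slot.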

Resizing #

When the number of elements is below 50000, the new table size is based on 4 times the number of used slots. Otherwise, it is based on 2 times the number of used slots. The table size is always \(2^{n}\).
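
As a rough worked example of that rule, the sketch below computes the table size after a resize. The function name `new_table_size` is hypothetical, and rounding up to the smallest power of two greater than the target is my assumption about how the "always a power of two" constraint is met.

```python
def new_table_size(used, min_size=8):
    """Table size after a resize, following the rule described above."""
    factor = 4 if used < 50000 else 2   # growth factor depends on the element count
    target = factor * used
    size = min_size
    while size <= target:               # assumption: smallest power of two above target
        size *= 2
    return size


print(new_table_size(6))       # target 24
print(new_table_size(60000))   # target 120000
```

With 6 used slots the target is 24, so the table grows to 32. With 60000 used slots the target is 120000, so it grows to 131072.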

2018


CSRF in Django

·2 mins

CSRF (cross-site request forgery) is a way to forge a user's request to a target website. For example, malicious website A has a button that sends a request to www.B.com/logout when clicked. When the user clicks this button, they are logged out of website B without realizing it. Logging out is not a big problem, but a malicious website can generate more dangerous requests, such as a money transfer.

Create Node Benchmark in Py2neo

·2 mins

Recently, I’ve been working on a Neo4j project. I use Py2neo to interact with the graph database. Although Py2neo is very Pythonic and easy to use, its performance is really poor. Sometimes I have to write Cypher statements by hand when I can’t bear the slow execution. Here is a small script that I use to compare the performance of 4 different ways to insert nodes.

Deploy Nikola Org Mode on Travis

·3 mins

Recently, I’ve been enjoying Spacemacs, so I decided to switch from Markdown to org files for writing blog posts. After several attempts, I managed to get Travis to convert org files to HTML. Here are the steps.

Install Org Mode plugin #

First, install the Org Mode plugin on your computer by following the official guide: Nikola orgmode plugin.

Using Chinese Characters in Matplotlib

·1 min

After searching on Google, here is the easiest solution I found. It should also work for other languages:

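import matplotlib.pyplot as plt
# Jupyter magics: render plots inline at retina resolution
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.font_manager as fm
# Path to a font file with Chinese glyphs (this path is macOS-specific)
f = "/System/Library/Fonts/PingFang.ttc"
prop = fm.FontProperties(fname=f)

# Pass the font explicitly to each text element that needs it
plt.title("你好", fontproperties=prop)
plt.show()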

Output:

2017


Enable C Extension for gensim on Windows

·1 min

These days, I’m working on some text classification tasks, and I use gensim’s doc2vec function.

When I use gensim, it shows this warning message:

C extension not loaded for Word2Vec, training will be slow.

I searched for this on the Internet and found that gensim has rewritten some parts of its code in Cython rather than NumPy to get better performance. A compiler is required to enable this feature.