/images/avatar.png

Difference between Value and Pointer variable in Defer in Go

defer is a useful function to do cleanup, as it will execute in LIFO order before the surrounding function returns. If you don’t know how it works, sometimes the execution result may confuse you.

How it Works and Why Value or Pointer Receiver Matters

I found an interesting code on Stack Overflow:

type X struct {
    S string
}

func (x X) Close() {
    fmt.Println("Value-Closing", x.S)
}

func (x *X) CloseP() {
    fmt.Println("Pointer-Closing", x.S)
}

func main() {
    x := X{"Value-X First"}
    defer x.Close()
    x = X{"Value-X Second"}
    defer x.Close()

    x2 := X{"Value-X2 First"}
    defer x2.CloseP()
    x2 = X{"Value-X2 Second"}
    defer x2.CloseP()

    xp := &X{"Pointer-X First"}
    defer xp.Close()
    xp = &X{"Pointer-X Second"}
    defer xp.Close()

    xp2 := &X{"Pointer-X2 First"}
    defer xp2.CloseP()
    xp2 = &X{"Pointer-X2 Second"}
    defer xp2.CloseP()
}

The output is:

Near-duplicate with SimHash

Before talking about SimHash, let’s review some other methods which can also identify duplication.

Longest Common Subsequence(LCS)

This is the algorithm used by diff command. It is also edit distance with insertion and deletion as the only two edit operations.

This works good for short strings. However, the algorithm’s time complexity is \(O(m*n)\), if two strings’ lengths are \(m\) and \(n\) respectively. So it’s not suitable for large corpus. Also, if two corpus consists of same paragraph but the order is not same. LCS treat them as different corpus, and that’s not we expected.

Jaeger Code Structure

Here is the main logic for jaeger agent and jaeger collector. (Based on jaeger 1.13.1)

Jaeger Agent

Collect UDP packet from 6831 port, convert it to model.Span, send to collector by gRPC

Jaeger Collector

Process gRPC or process packet from Zipkin(port 9411).

Jaeger Query

Listen gRPC and HTTP request from 16686.

The Annotated The Annotated Transformer

Thanks for the articles I list at the end of this post, I understand how transformers works. These posts are comprehensive, but there are some points that confused me.

First, this is the graph that was referenced by almost all of the post related to Transformer.

Transformer consists of these parts: Input, Encoder*N, Output Input, Decoder*N, Output. I’ll explain them step by step.

Input

The input word will map to 512 dimension vector. Then generate Positional Encoding(PE) and add it to the original embeddings.

Different types of Attention

\(s_t\) and \(h_i\) are source hidden states and target hidden state, the shape is (n,1). \(c_t\) is the final context vector, and \(\alpha_{t,s}\) is alignment score.

\[\begin{aligned} c_t&=\sum_{i=1}^n \alpha_{t,s}h_i \\ \alpha_{t,s}&= \frac{\exp(score(s_t,h_i))}{\sum_{i=1}^n \exp(score(s_t,h_i))} \end{aligned}\]

Global(Soft) VS Local(Hard)

Global Attention takes all source hidden states into account, and local attention only use part of the source hidden states.

Content-based VS Location-based

Content-based Attention uses both source hidden states and target hidden states, but location-based attention only use source hidden states.

Torchtext snippets

Load separate files

data.Field parameters is here.

When calling build_vocab, torchtext will add <unk> in vocabulary list. Set unk_token=None if you want to remove it. If sequential=True (default), it will add <pad> in vocab. <unk> and <pad> will add at the beginning of vocabulary list by default.

LabelField is similar to Field, but it will set sequential=False, unk_token=None and is_target=Ture

INPUT = data.Field(lower=True, batch_first=True)
TAG = data.LabelField()

train, val, test = data.TabularDataset.splits(path=base_dir.as_posix(), train='train_data.csv',
                                                validation='val_data.csv', test='test_data.csv',
                                                format='tsv',
                                                fields=[(None, None), ('input', INPUT), ('tag', TAG)])

Load single file

all_data = data.TabularDataset(path=base_dir / 'gossip_train_data.csv',
                               format='tsv',
                               fields=[('text', TEXT), ('category', CATEGORY)])
train, val, test = all_data.split([0.7, 0.2, 0.1])

Create iterator

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(32, 256, 256), shuffle=True,
    sort_key=lambda x: x.input)

Load pretrained vector

vectors = Vectors(name='cc.zh.300.vec', cache='./')

INPUT.build_vocab(train, vectors=vectors)
TAG.build_vocab(train, val, test)

Check vocab sizes

You can view vocab index by vocab.itos.