Posts

2020

Retrieve Large Dataset in Elasticsearch

21 June 2020·5 mins

It’s easy to get a small dataset from Elasticsearch by using size and from. However, the same approach does not work for retrieving a large dataset.
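As a rough sketch of what size and from paging looks like, the helper below builds search request bodies page by page. The function name `page_body` and the `match_all` query are illustrative assumptions. The 10,000 limit mirrors the default of the `index.max_result_window` setting, beyond which Elasticsearch rejects from/size requests.

```python
# Default value of the index.max_result_window setting.
MAX_RESULT_WINDOW = 10_000


def page_body(page: int, size: int) -> dict:
    """Build a search body for the zero-based `page` using from/size paging.

    `page_body` is a hypothetical helper, not part of any Elasticsearch client.
    """
    offset = page * size
    if offset + size > MAX_RESULT_WINDOW:
        # With default settings, Elasticsearch would reject this request.
        raise ValueError(
            f"from + size = {offset + size} exceeds {MAX_RESULT_WINDOW}"
        )
    return {"from": offset, "size": size, "query": {"match_all": {}}}


# Third page of 100 hits: skips the first 200 documents.
third_page = page_body(2, 100)
```

Small pages near the start work fine; the check shows where the approach stops working with default settings.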

Deep Paging Problem

As we know, Elasticsearch data is organised into indexes, which are logical namespaces, while the real data is stored in physical shards. Each shard is an instance of Lucene. There are two kinds of shards: primary shards and replica shards. A replica shard is a copy of a primary shard, kept in case nodes or shards fail. By distributing the documents of an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch can ensure redundancy and scalability. Before version 7.0, Elasticsearch created 5 primary shards and one replica for each primary shard by default; since 7.0 the default is a single primary shard with one replica.

Program Crash Caused by CPU Instruction

17 May 2020·3 mins

Dealing with bugs is inevitable in a coding career. The main parts of coding are implementing new features, fixing bugs and improving performance. For me, there are two kinds of bugs that are difficult to tackle: those that are hard to reproduce, and those that occur in code you did not write.

C-m, RET and Return Key in Emacs

11 April 2020·2 mins

I use Emacs to write my blog. After a recent update, I found that M-RET no longer behaves as the leader key in org mode, but as org-meta-return. Even stranger, in other modes it still behaves as the leader key, and M-RET also works in org mode in the terminal. In the GUI, pressing C-M-m can trigger the leader key.

Import custom package or module in PySpark

2 April 2020·1 min

First, zip all of the dependencies into a zip file like this. Then you can use one of the following methods to import it.

|-- kk.zip
|   |-- kk.py

Using --py-files in spark-submit

When submitting a Spark job, add the --py-files=kk.zip parameter. kk.zip will be distributed along with the main script file, and kk.zip will be inserted at the beginning of the PYTHONPATH environment variable.
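A zip archive can be imported this way because Python's import system can load modules from a zip file that sits on the module search path. Here is a minimal local sketch of that mechanism, without Spark; the contents of kk.py are made up for illustration.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a throwaway kk.zip containing a one-function kk.py
# (the function is a hypothetical example, not from the original post).
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "kk.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("kk.py", "def greet(name):\n    return 'hello ' + name\n")

# Putting the archive itself at the front of sys.path lets the
# import system look for modules inside it.
sys.path.insert(0, zip_path)
importlib.invalidate_caches()

kk = importlib.import_module("kk")
message = kk.greet("spark")
```

After this runs, `kk.__file__` points inside the archive, which is a quick way to confirm where a module was loaded from when debugging a job.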

Time boundary in InfluxDB Group by Time Statement

29 March 2020·4 mins

These days I use InfluxDB to store some time series data. I love these features it provides:

High Performance

According to its hardware guide, a single node can support more than 750k point writes per second, 100 moderate queries per second and 10M series cardinality.

C3 Linearization and Python MRO(Method Resolution Order)

14 March 2020·3 mins

Python supports multiple inheritance: a class can be derived from more than one base class. If the requested attribute or method is not found in the current class, how is the search order among the superclasses decided? In simple scenarios, we know the rule is left to right, bottom to top. But when the inheritance hierarchy becomes complicated, it’s not easy to answer by intuition.
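A small diamond hierarchy is enough to see where intuition starts to slip. In this sketch (the class names are arbitrary), a naive depth-first, left-to-right search would visit A before C, but the actual resolution order keeps every class ahead of its own base classes.

```python
class A:
    def who(self):
        return "A"


class B(A):
    def who(self):
        return "B"


class C(A):
    def who(self):
        return "C"


class D(B, C):
    # D defines no `who`, so lookup walks D's method resolution order.
    pass


# The order Python searches when resolving attributes on D.
mro_names = [cls.__name__ for cls in D.__mro__]
# → ['D', 'B', 'C', 'A', 'object']

# B comes first among the classes that define `who`.
resolved = D().who()
# → 'B'
```

Inspecting `__mro__` (or calling `D.mro()`) is the quickest way to check the order for any class instead of working it out by hand.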

2019

Difference between Value and Pointer variable in Defer in Go

19 December 2019·3 mins

defer is a useful statement for cleanup, as deferred calls execute in LIFO order before the surrounding function returns. If you don’t know how it works, the execution result may sometimes confuse you.

How it Works and Why Value or Pointer Receiver Matters

I found an interesting piece of code on Stack Overflow:

Near-duplicate with SimHash

4 December 2019·4 mins

Before talking about SimHash, let’s review some other methods which can also identify duplication.

Longest Common Subsequence (LCS)

This is the algorithm behind the diff command. It is equivalent to edit distance with insertion and deletion as the only two edit operations.
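The link between LCS and the insertion/deletion-only edit distance can be made concrete with a short sketch. This is the textbook dynamic-programming formulation, not the optimised algorithm real diff tools use; the function names are illustrative.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Edit distance when only insertions and deletions are allowed."""
    # Every character outside the LCS must be deleted from a or inserted from b.
    return len(a) + len(b) - 2 * lcs_length(a, b)


common = lcs_length("ABCBDAB", "BDCABA")        # → 4
distance = indel_distance("ABCBDAB", "BDCABA")  # → 5
```

Two identical strings have distance 0, and two strings with nothing in common have a distance equal to the sum of their lengths.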

Jaeger Code Structure

22 September 2019·1 min

Here is the main logic of the Jaeger agent and the Jaeger collector (based on Jaeger 1.13.1).

Jaeger Agent

Collects UDP packets on port 6831, converts them to model.Span, and sends them to the collector over gRPC.

The Annotated The Annotated Transformer

1 September 2019·4 mins

Thanks to the articles I list at the end of this post, I understand how transformers work. These posts are comprehensive, but there are some points that confused me.

First, this is the graph referenced by almost all of the posts related to the Transformer.