Uncategorized – Kevin Martin Jose

On training binary neural networks

September 15, 2025 Kevin

I’ve been thinking about binary neural networks lately, and specifically about how to train them without computing full-precision gradients.

Broadly, there are two camps in binary neural networks. The most prominent (and effective) way to train binary neural networks is to keep the weights in full precision and binarise them during the forward pass. I’d say BitNet and ReActNet fall under that camp. The problem is of course that you are not taking advantage of the binary weights during the backward pass. Most binary NNs are motivated by the need for fast inference, not training, so full-precision backward passes are not generally considered a problem. In fact, keeping latent full-precision weights (i.e. non-binary weights) in memory during training is a tried-and-tested technique to train binary neural networks that don’t suck.

But Shibuya et al. have been doing research on training binary neural networks without keeping a full-precision copy of the weights in memory. They had an ICLR 2024 submission that unfortunately didn’t get accepted, and then a more recent paper titled Binary Stochastic Flip Optimization for Training Binary Neural Networks. They still use real-valued gradients, but they keep the weights binary throughout the training. The core idea is very intuitive:

Imagine a neural network with binary weights (0 and 1)

The activation function is a sign() function. This is of course not differentiable, so you use a straight-through estimator (STE) as an approximation. Essentially, you pretend that the activation function was a hard-tanh and compute the gradients as if you had used this hard-tanh instead of sign()

The gradient computation proceeds normally – for a weight matrix W_l from layer l, you will have real-valued gradients. Let’s call this gradient matrix G_l

If we were using standard real-valued SGD, the weight would be updated as

$W_{l}^{t} = W_{l}^{t-1} - \eta G_{l}^{t}$

where $\eta$ is the learning rate, usually a small value like 0.001

But you want to keep the weights W_l binary. If you subtract real-valued gradients from binary W_l, the updated weights will no longer be binary.

The authors get around this problem by simply binarising -G_l. Positive entries of -G_l are replaced by 1 and negative entries are replaced by 0. From here on, G_l refers to this binarised matrix

But you can’t just add the binarised gradients to the weight matrix. That would be conceptually equivalent to having a learning rate of 1! That’s too high! And since the gradient matrix is now binary, there’s no concept of a “learning rate”.

Remember, G_l is now a binary matrix, and tells you what the updated weight matrix would look like. Conceptually, if our learning rate (whatever that means – bear with me) was 1, we would just add the gradient matrix to the weight matrix. In our case, instead of adding the gradient matrix to the weights, we would just replace the weight matrix with the gradient matrix. This is the ugly side effect of using binary weights.

Let that sink in. When your gradients are binarised, you can’t “add” the gradient to the weight. The gradient value is going to be either a 1 or a 0. Your weight too is going to be either a 1 or a 0. Your decision is to either match the gradient value or not to match it. For example, if the weight is 0 and the gradient is 1, you have no option but to set the weight to 1 – because that’s what the gradient is telling you to do in order to minimise the loss!

But if we do what the binarised gradient tells us to do, we’ll be flipping our binary weights all the time. In other words, our learning rate is too high! But what does it mean to have a “low” learning rate when the gradient matrix just contains 1s and 0s?

The authors of the Binary Stochastic Flip paper suggest that we use a binary mask and do an element-wise multiplication with the gradient matrix. This way, you can select a subset of the gradient matrix to be applied. In other words, instead of setting all the weights according to the gradient matrix, we update only a fraction (e.g. 10%) of the weights to match the gradient matrix. When the learning rate is “1” (or when it is “maximum”), the mask is a tensor with every element set to 1. To use a lower learning rate, the mask matrix is created in such a way that it will only have a few ones:

$W_{l}^{t} = \neg M \odot W_{l}^{t-1} + M \odot G_{l}^{t}$

$\odot$ represents element-wise multiplication. Multiplying the existing weight $W_{l}^{t-1}$ with $\neg M$ selects the elements from the original weight that we want to keep. Everything else is zeroed out. Adding $M \odot G_{l}^{t}$ to this sets these zeroed-out elements to the value in the gradient matrix $G_{l}^{t}$

The paper even mentions that setting the mask randomly works rather well.

In PyTorch, I think it will look something like this:

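import torch
from torch.optim import Optimizer


class HyperMaskOptimizer(Optimizer):
    def __init__(self, params, delta=1e-3):
        """
        Args:
            params: Model parameters (binary weights in {0,1})
            delta: Probability for random mask (δ_t in paper)
        """
        defaults = dict(delta=delta)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            # Re-evaluate the loss if the caller passed a closure
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            delta = group['delta']  # flip probability for this param group
            for param in group['params']:
                if param.grad is None:  # parameter was not part of the loss
                    continue
                grad = param.grad

                current_weights = param  # w_{t-1} in {0,1}
                # w*_t in {0,1}: 1 wherever the negated gradient is non-negative
                target_weights = (-grad >= 0).to(param.dtype)

                # m_t: each element is 1 with probability delta
                mask = torch.bernoulli(torch.full_like(grad, delta))
                mask_not = 1 - mask  # m̄_t (NOT operation)

                new_weights = (mask_not * current_weights) + (mask * target_weights)

                param.copy_(new_weights)

        return loss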

I wasn’t able to reproduce the numbers that the authors reported. Here’s my implementation. My 4-layer perceptron achieved only an unimpressive 33% test accuracy on MNIST after training for 200 epochs. Unfortunately the paper does not mention a reference implementation for me to cross-check with. I’ve reached out to someone whose name matches with that of the first author on LinkedIn. If I’m lucky, they’ll open source their implementation.

Meanwhile, if you can spot bugs in my snippet, please comment on the GitHub gist (or here) 😀

Thinking aloud: Can we speed up model training by using binary weights?

September 8, 2025 Kevin

When I was at Amazon’s LLM pre-training team, our pre-training jobs used to run for weeks. Being impatient as I am, it was frustrating to wait for a whole week to see if the changes we made (mostly to the data) worked. The magic number was 1 trillion tokens, and even a smallish model (e.g. 7 billion parameters) would take a few days to reach this point even with the number of GPUs we had access to.

Now imagine you want to pre-train a model and all you have access to is one GPU. Let’s say you want to train it on all of SlimPajama and you want to be done with pre-training in a day. That’s roughly 600 billion tokens. Currently, this is a pipe dream. For a model to be trained on 600B tokens in a day, your training throughput needs to be about 7 million tokens a second. The typical training speeds you see are sub 10k tokens per second per GPU:

| Model | Size | Hardware | Training speed |
| --- | --- | --- | --- |
| Gemma 3 270M | 270M | Apple M3 Pro | 4000 tokens/second |
| TinyLlama | 1.1B | 16x A100 | 1500 tokens/second per GPU |
| Llama 1 65B | 65B | 2048x A100 | 380 tokens/second per GPU |
| MPT 7B | 7B | 440x A100 | 2800 tokens/second per GPU |
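To put the gap in numbers, here is a small back-of-the-envelope sketch. The helper names `required_tokens_per_second` and `days_to_train` are made up for illustration, and the 10k tokens/second figure is just the rough per-GPU ceiling mentioned above, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def required_tokens_per_second(total_tokens, days=1.0):
    """Throughput needed to consume `total_tokens` within `days`."""
    return total_tokens / (days * SECONDS_PER_DAY)


def days_to_train(total_tokens, tokens_per_second):
    """Wall-clock days needed at a fixed throughput."""
    return total_tokens / tokens_per_second / SECONDS_PER_DAY


target = required_tokens_per_second(600e9)  # 600B tokens in one day
print(f"{target / 1e6:.2f} million tokens/second")          # 6.94
print(f"{days_to_train(600e9, 10_000):.0f} days at 10k tokens/second")  # 694
```

At 10k tokens per second, a single GPU would need close to two years for the same 600B tokens.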

And we are talking about 7 million tokens per second. Even if we assume 100% utilisation of the hardware (which is rare), we’ll end up with some pretty small theoretical limits to how large our model can be. Let’s try to work that out.

Target 7 million tokens/sec – how large can our model be?

To process 7 million tokens a second, the model will have to do quite a lot of floating point operations per second:

$FloatcomputationsPerSecond = 7 million \times (FLOPs_{forward} + FLOPs_{backward})$

But there are theoretical upper limits for the number of floating point operations the GPU can do per second.

| GPU | TFLOPs per second (fp16) |
| --- | --- |
| RTX 5090 | 209 |
| Apple M4 | 2.9 – 4.3 (fp32), roughly 8 for fp16 |
| A100 | 312 |
| H100 | 950 (without sparsity) |

Plugging this into our equation:

$(FLOPs_{forward} + FLOPs_{backward}) = \frac{FloatcomputationsPerSecond}{7 million}$

Let’s assume that the FLOPs for the backward pass are 2.5x those of the forward pass. This is the assumption made in the FlashAttention-2 paper and likely holds only for self-attention, but it sounds like a reasonable assumption. Let’s also assume that every parameter in the model results in 2 FLOPs in the forward pass – every parameter in a weight matrix will be involved in one multiplication and one addition as part of matmul. Let’s ignore other layers (e.g. softmax, layer norm) for the time being:

$(FLOPs_{forward} + 2.5 \times FLOPs_{forward}) = \frac{FloatcomputationsPerSecond}{7 million}$
$3.5 \times FLOPs_{forward} = \frac{FloatcomputationsPerSecond}{7 million}$
$3.5 \times 2 \times parameters = \frac{FloatcomputationsPerSecond}{7 million}$
$parameters = \frac{FloatcomputationsPerSecond}{7 million \times 7}$

If we substitute FloatComputationsPerSecond with 950 TFLOPS, which is the theoretical best we can get (on an H100), we’ll have to limit our model to about 19.4 million parameters if the training is to proceed at 7 million tokens per second. If we assume a more realistic utilisation of a little under 40% of the GPU (350 TFLOPS), we’ll be limited to about 7 million parameters for our model.
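The same arithmetic as a short sketch, in case you want to plug in other GPUs. `max_parameters` is a hypothetical helper; the 2 FLOPs per parameter and the 2.5x backward factor are the assumptions stated above, not measured values.

```python
def max_parameters(ops_per_second,
                   tokens_per_second=7e6,
                   forward_flops_per_param=2.0,
                   backward_to_forward=2.5):
    """Largest model (in parameters) that can train at the target throughput."""
    # forward + backward FLOPs spent per parameter per token: 2 * (1 + 2.5) = 7
    flops_per_param = forward_flops_per_param * (1 + backward_to_forward)
    return ops_per_second / (tokens_per_second * flops_per_param)


h100_peak = max_parameters(950e12)       # theoretical peak
h100_realistic = max_parameters(350e12)  # a little under 40% utilisation
print(f"{h100_peak / 1e6:.1f}M parameters")       # 19.4M
print(f"{h100_realistic / 1e6:.1f}M parameters")  # 7.1M
```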

During inference, we can quantise the model to improve inference speed by multiples while losing only a fraction of the accuracy. Why can’t we do that during training? Better yet, why can’t we go all the way and make all our weights binary? Yes, we will lose model performance, but it is not obvious if model training will be much faster.

But is computation even the bottleneck?

Even if your model fits in a single GPU, large distributed pre-training runs (i.e. data-parallel runs) are often bottlenecked by the communication overhead. Each GPU has a copy of your model and computes its gradients. At the end of the backward pass, you must average the gradients across all GPUs. This all_reduce step can absolutely be a bottleneck.

If your model is too big to fit in a single GPU, then you must use model-parallelism to shard the model across multiple GPUs. This introduces yet another network bandwidth-heavy operation (all_gather).

You also have to read data from the disk, and in case of data-parallel runs, probably from a network-mounted drive.

So no, computation is usually not the bottleneck.

But, if your model fits in a single GPU and you are not using data-parallelism, then you have zero communication overhead and training might indeed be limited by compute.

Can’t you just use a lower precision during training?

Yes you can. The newer H100 GPUs introduced fp8 tensor cores and Nvidia’s Transformer Engine reports a 1.46x speedup when training Llama3 8B with fp8. It is harder to achieve stable training with lower precision floats or integers – the first successful fp4 pre-training is very recent.

Has anyone tried using binary weights? Not really. There’s BitNet and its variants. However, they don’t report training throughput numbers. Moreover, I wouldn’t expect training throughput to improve since BitNet maintains latent weights in full precision and the weights are binarised on-the-fly during the forward pass. There’s Training Binary Neural Networks in a Binary Weight Space by Shibuya et al., and that’s the closest work I could find that trains a binary neural network without having to keep latent weights in full precision. However, the authors do not mention how they implemented the binary weights – for all I know the weights were still fp32 but artificially confined to values of ±1. I suspect this is why table 2 in the paper reports analytical memory usage rather than experimental memory usage. It’s a pity that the paper was rejected for ICLR 2024. All the reviewers gave high “soundness” and “contribution” scores and then proceeded to criticise the experiments section. One reviewer mentioned that “the experimental results in Tab.3 seem extremely bad on medium-sized datasets such as CIFAR-10 and Tiny-ImageNet” – but that’s not the point! The results were proof that the method worked, and that should have been sufficient to justify an acceptance. Also, the “extremely bad” error rate was 2.5% for the binary network vs 1.46% for the full precision baseline. The circus of academic publishing. But I digress.

Why binary weights might increase computation throughput

Let’s not even talk about memory, apart from just stating that compared to fp16, binary weights will take 16x less memory.

But why would computation be sped up? Imagine taking the dot product of two float vectors of size n. This will require n multiplications and n additions, for a total of 2n floating point operations. Now imagine your vectors were binary and the values were confined to -1 or +1. Instead of doing a dot product, you can now pack each vector into an integer (one bit per element, as long as n fits in a machine word) and do an xnor + bitcount operation to get the same result as a naive dot product. That’s just 2 operations, compared to 2n operations for a dot product, which in theory is a speedup by a factor of n. See the figures below.

*Figure 1: Naive dot product resulting in `n` multiplications and `n` additions*

The intuition is as follows – when all the elements in the vector are either -1 or 1, the result of multiplying two elements (when you do a dot product) is 1 if they are the same (i.e. both are 1 or both are -1). If the elements are not the same, the result of the multiplication is -1. The dot product is just a sum of these element-wise multiplications. So the final result of the dot product can be seen as “the number of 1 results minus the number of -1 results in element-wise multiplication”.

This is exactly what an XNOR (i.e. an exclusive NOR) does once we store -1 as the bit 0 and +1 as the bit 1 – it outputs “1” if both the operands are 1s or both are 0s. If the elements are not the same (i.e. 1 and 0 or 0 and 1), the output of XNOR is 0. GPUs (and CPUs) provide a single operation to count the number of set bits in an integer, called population count. In CUDA, it’s the __popc() function. The XNOR-Net paper popularised this. Thus we can re-write the dot product for binarised vectors a and b as follows (here $\odot$ stands for bitwise XNOR):

$a \cdot b = numberOfOnes(a \odot b) - numberOfZeroes(a \odot b)$
$a \cdot b = numberOfOnes(a \odot b) - (length(a) - numberOfOnes(a \odot b))$
$a \cdot b = 2 \times numberOfOnes(a \odot b) - length(a)$

If a and b are binary vectors of length 64, each can be represented by a single 64-bit integer. CUDA’s __popc() counts the set bits (i.e. 1s) of a 32-bit integer in a single operation, and __popcll() does the same for a 64-bit integer. Since the lengths of a and b are the same – call it n – and numberOfOnes is given by __popcll():

dot(a,b) = 2 x __popcll(~(a^b)) – n

That is, the dot product of two vectors of length 64 takes 3 operations on top of the XNOR itself – a multiplication by 2, the __popcll() and the addition with -n. See figure 2 below for an example. The naive dot product would have taken 128 operations – 64 multiplications and 64 additions.

*Figure 2: XNOR and population count (i.e bitcount) gives the same result as the dot product, but with just two operations*. *See the XNOR-Net paper.*
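The identity is easy to check in plain Python. The sketch below is only an illustration, not a CUDA kernel: `pack` and `xnor_popcount_dot` are hypothetical helper names, +1 is stored as bit 1 and -1 as bit 0, and `bin(x).count("1")` stands in for the hardware popcount.

```python
import random


def pack(vec):
    """Pack a list of +1/-1 values into one integer (bit i is 1 when vec[i] == +1)."""
    word = 0
    for i, v in enumerate(vec):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_word, b_word, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1                # Python ints are unbounded, so keep only n bits
    agree = ~(a_word ^ b_word) & mask  # XNOR: 1 wherever the two elements match
    ones = bin(agree).count("1")       # population count
    return 2 * ones - n


random.seed(0)
n = 64
a = [random.choice((-1, 1)) for _ in range(n)]
b = [random.choice((-1, 1)) for _ in range(n)]

naive = sum(x * y for x, y in zip(a, b))
fast = xnor_popcount_dot(pack(a), pack(b), n)
print(naive == fast)  # True
```

The `mask` step is only needed because Python integers have no fixed width; with a 64-bit unsigned integer and n = 64 the negation already stays within the word.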

Ok, so dot products are now theoretically 40x faster. So?

Matrix multiplication is just a series of dot products. A linear layer during the forward pass simply multiplies an activation matrix with a weight matrix – that operation is now theoretically about 40x faster. We are of course ignoring the fact that instead of floating point weights you just have 1s and -1s and it’s unlikely that a neural network can learn anything with binary weights – we’ll look into that later. Assuming that there’s a magic algorithm that will let us train performant binary neural networks, having pure binarised weights during training will likely give us an improvement in training throughput.

But how much improvement? Not that much, unless gradients are binary too

The rule of thumb in computing FLOPs is that each parameter results in 2 floating point operations. If you think of a neural network as one big matrix multiplication between an activation/data matrix of shape (1, k) and a weights matrix of shape (k, n), this makes sense – the total number of parameters is k x n and the number of operations in a vanilla matrix multiplication will be 2 * 1 * k * n. But with the XNOR and popcount() trick, we’ll incur only 3 operations per 64 parameters!

We had also assumed that the backward pass costs 2.5 times as many FLOPs as the forward pass. But using the XNOR + popcount trick we discussed above, we can dramatically reduce the number of operations in the forward pass. Notice how I did not say FLOPs. That’s because XNOR + popcount are not floating point operations. Let’s not dwell on that for now. I also explicitly mentioned the forward pass – the backward pass will need full precision gradients, as mentioned in Shibuya et al. So no XNOR + popcount in the backward pass. This also means that the backward pass will still cost 2.5 x 2 x the number of parameters in operations, just like in a non-binarised neural network. If you recall, the relationship between the number of FLOPs (well, OPs – not necessarily floats) per token in a model and the OPs per second required to hit 7 million tokens per second was:

$(FLOPs_{forward} + FLOPs_{backward}) = \frac{FloatcomputationsPerSecond}{7 million}$

We had already expressed the FLOPs for the backward pass as 2.5 times the FLOPs for forward pass. It’s not a very accurate conversion but it will serve our purposes well. With binary neural networks, we know that 64 parameters will contribute only 3 OPs in the forward pass. So we can rewrite the above equation as:

$(\frac{3}{64} \times parameters + FLOPs_{backward}) = \frac{FloatcomputationsPerSecond}{7 million}$

And the backward flops will be 2.5 times the original (i.e without the XNOR trick) forward FLOPs. But the original forward FLOPs is equal to twice the number of parameters in the model.

$(\frac{3}{64} \times parameters + 2.5 \times 2 \times parameters) = \frac{FloatcomputationsPerSecond}{7 million}$

$5.0469 \times parameters = \frac{FloatcomputationsPerSecond}{7 million}$

$parameters = \frac{FloatcomputationsPerSecond}{7 million \times 5.0469}$

Replacing FloatComputationsPerSecond with our 350 TFLOPS estimate for the H100, the number of binary parameters our model can have is around 9.9 million. This is only around 40% larger than what the neural network would have been if we had trained it in fp16. This is really not that much, considering how much precision we are giving up.

But, what if backward pass could also use XNOR + popcount? Then our binary network could have been around 300 million parameters and still be trained at 7 million tokens per second. 300 million parameters is useful territory. But no-one has figured out a way to do that yet.
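Both numbers fall out of the same formula. In this sketch `max_binary_parameters` is a hypothetical helper, and the 3 operations per 64 parameters, the 2.5x backward factor and the 350 TFLOPS budget are the assumptions made earlier in this post.

```python
def max_binary_parameters(ops_per_second,
                          tokens_per_second=7e6,
                          binary_backward=False):
    """Parameter budget when the forward pass uses XNOR + popcount."""
    binary_forward = 3 / 64  # ops per parameter with XNOR + popcount
    dense_forward = 2.0      # ops per parameter with a regular matmul
    # The backward pass costs 2.5x a forward pass of the same kind.
    backward = 2.5 * (binary_forward if binary_backward else dense_forward)
    ops_per_param = binary_forward + backward
    return ops_per_second / (tokens_per_second * ops_per_param)


budget = 350e12
fp16_params = budget / (7e6 * 7)  # dense baseline from earlier: 2 + 2.5 * 2 = 7
forward_only = max_binary_parameters(budget)
fully_binary = max_binary_parameters(budget, binary_backward=True)

print(f"{forward_only / 1e6:.1f}M")           # 9.9M
print(f"{forward_only / fp16_params:.2f}x")   # 1.39x the fp16 budget
print(f"{fully_binary / 1e6:.0f}M")           # 305M
```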

What now?

There are a few things we can do, even though the elephant in the room is the full precision gradient computations.

First, is xnor + popcount actually faster than a vanilla bf16 matrix multiplication? Even though it uses fewer operations, tensor cores can do bf16 matrix multiplications very fast. The 40x speedup we imagined (because we were thinking in terms of the number of operations) might fizzle out to a paltry 20-30% increase, and we’d be giving up a lot of accuracy. I already have some benchmarks showing that this is indeed the case. I’ll publish them here when I get the time.

There should also be a way to make the backpropagation more efficient. Or you know, we can ditch backpropagation altogether. Launay et al. have made direct feedback alignment (DFA) work for transformers. It’s definitely not going to get us the same results as backpropagation – but hey, our weights are binary, we weren’t going to be the next Mistral 7B anyway. Maybe binary neural nets + DFA is the way to build the binformer.

matmul() using PyTorch’s MPS backend is faster than Apple’s MLX

April 21, 2025 (updated April 22, 2025) Kevin

Disclaimer: I do not know why PyTorch + MPS is faster (yet)

I recently came across Apple’s MLX framework. It is a NumPy-like interface that will let you run computations on Apple’s own GPUs if you have a Mac with an M-series chip. It uses Apple’s Metal API under the hood. I wanted to see how much faster it would be to do matrix multiplications on Apple silicon using MLX compared to NumPy + CPU, and more importantly, PyTorch. PyTorch has been supporting device="mps" for a while now.

Results

MLX + GPU (M3 Pro) is faster than NumPy + CPU. No surprises there, so I’ll omit CPU from the plots below. But MLX + GPU was surprisingly slower than PyTorch 2.6 + GPU on my MacBook with the M3 Pro chip.

I am going to ignore the more interesting phenomenon for now – that the matrix init times are slower for PyTorch when the matrix dimensions start to become huge. For the subplot titled “Matrix multiplication times”, this is not what I expected – I had expected PyTorch + MPS to be on par with or slightly slower than MLX. I still do not know why MLX is slower, but these are the hypotheses I have considered:

Very likely there’s something wrong with my benchmarking approach. This is the most likely hypothesis. Here’s the full notebook that I used to benchmark MLX.
Dtypes – Both mlx and torch arrays were explicitly set to float32, so it can’t be this.
GPU utilisation – I used asitop to monitor the GPU usage during the timeit runs. Both MLX and PyTorch used 100% GPU at 1380 MHz. So at least we are sure that both frameworks were using the GPU.
Compiled vs non-compiled MLX: Doing mx.compile(mx.matmul(a,b)) did not make a material difference in the runtimes.

Can we reproduce the same trend on a single 128×128 matrix?

Yes. Here’s a very simple test:

import timeit

import mlx.core as mx

mlx_a = mx.random.uniform(0, 1, (128, 128), dtype=mx.float32)
mlx_b = mx.random.uniform(0, 1, (128, 128), dtype=mx.float32)

def mlx_single_matmul():
    # MLX is lazy; mx.eval forces the matmul to actually be computed
    return mx.eval(mx.matmul(mlx_a, mlx_b))

timeit.timeit(mlx_single_matmul, number=10000)

That results in 1.15 seconds. I double checked that mx.default_device() prints Device(gpu, 0) – so MLX is indeed using the GPU. For PyTorch:

import timeit

import torch

torch_a = torch.rand((128, 128), device="mps", dtype=torch.float32)
torch_b = torch.rand((128, 128), device="mps", dtype=torch.float32)

def torch_single_matmul():
    # Timed as-is: there is no explicit torch.mps.synchronize() after the call
    return torch.matmul(torch_a, torch_b)

timeit.timeit(torch_single_matmul, number=10000)

0.21 seconds.

I changed the matrices in the above snippet to be of shape 128×30 and 30×128. Similar results. So this is not some optimisation that PyTorch has just for square matrices.
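One thing that might be worth ruling out (this is a guess, not something confirmed for this benchmark): PyTorch queues MPS kernels asynchronously, so timing torch.matmul on its own may mostly measure how long it takes to enqueue the work, whereas mx.eval waits for the result. A fairer comparison would put a synchronisation barrier inside the timed region. The sketch below uses a hypothetical helper, `time_with_sync`, that takes the barrier as a plain callable so it runs anywhere; the dummy workload is only there to make the snippet self-contained.

```python
import timeit


def time_with_sync(fn, sync=None, number=10000):
    """Time `fn`, calling `sync` after every invocation inside the timed region."""
    def run():
        out = fn()
        if sync is not None:
            sync()  # block until queued device work has finished
        return out
    return timeit.timeit(run, number=number)


# Dummy CPU workload so the sketch runs without a GPU
elapsed = time_with_sync(lambda: sum(range(100)), number=1000)
print(f"{elapsed:.4f} seconds")
```

For the PyTorch test above, the call would look like time_with_sync(torch_single_matmul, sync=torch.mps.synchronize). The MLX function needs no extra barrier since it already goes through mx.eval.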

If you have ideas on why MLX is slower, or if you spot a problem with my script that would explain the discrepancy, let me know!

VimCharm: Approximating PyCharm on vim

November 22, 2020 (updated December 13, 2020) Kevin

Disclaimer: All I’ve done is write a few config files.

This is for you if:

Your python IDE of choice is PyCharm
You wish you had a command-line replacement for PyCharm on all the places you ssh into
You wish you had access to at least some of the niceties of PyCharm when editing a one-off script, without having to create/import it into a new project.
You are somewhat familiar with vim and can comfortably edit a single file on vim. You also know what a .vimrc is

Motivation

PyCharm has worked wonderfully well for me, and the only time where I have to use something else is when I ssh into a server to put together a quick script. That something else tends to be vim, and this is an attempt to get vim as close to PyCharm as possible – especially the shortcuts so that I can work on vim the way I work on PyCharm (well, almost). The end result is still a far cry from PyCharm, but it makes navigating a codebase over ssh significantly less painful (at least for me)

But PyCharm can work over ssh

Yeah, but I don’t use PyCharm for one-off scripts. Besides, it pleases me to know that if I can ssh into the server (from an iPad, a phone, or a washing machine), I have an (approximate) IDE I can work on.

List of working approximations

Sorta kinda uses the same colorscheme as PyCharm
Toggles a project navigation sidebar (NERDTree) using Alt + 1. Approximates PyCharm’s Ctrl+1
Comment/uncomment multiple lines using Ctrl+/, just like PyCharm
Autocomplete
Navigate to the definition of a method/variable using Ctrl + Left click or Ctrl + b, just like PyCharm
Jump to the previous location using Alt + - . Approximates PyCharm’s Ctrl + Alt + Left Arrow
Fuzzy search for files using Ctrl + o. Approximates PyCharm’s double shift press
Search the entire code base using Alt + f. Approximates PyCharm’s Ctrl + Shift + f
Edits made in the search results window are reflected on to the underlying file, just like PyCharm
Syntax and linter errors show up as you type, just like PyCharm
If you are editing files that are part of a git repository, there are indicators on the gutter to show added, modified and subtracted lines, just like PyCharm
Pressing F12 brings up a terminal. Approximates PyCharm’s Alt + F12
Code folding using the minus (i.e -) key. Approximates PyCharm’s Ctrl+- and Ctrl + +
Automatic file saves, just like PyCharm
Rename methods/variables and automatically fix imports etc. across files, just like PyCharm

Why not just use python-mode?

I simply could not figure out the shortcuts that python-mode used. I thought it would be easier and more flexible if I install and configure the plugins myself.

Prerequisites

vim 8
A Python virtualenv (or a conda environment). There’s some pip install involved, though this is optional
Patience

TL;DR

Go here for the .vimrc

Let’s start from a blank .vimrc

If you need a project-specific .vimrc, see this. If not, everything goes into your ~/.vimrc

Let us begin by being able to see line numbers and some sort of syntax highlighting everywhere. Put these on your .vimrc:

" Some basic stuff
set number
set showmatch
syntax on
imap jj <Esc>

The set showmatch is for highlighting matching parentheses – that’s useful. The last line maps double pressing the j key in insert mode to <Esc> – no more reaching for that far away Escape key using your pinky!

NERDTree for the sidebar

We’ll start our plugin-hoarding with NERDTree. With Vim 8, we can simply copy over a plugin directory to a certain place and Vim would just “pick it up” – there’s no need to use a plugin manager for achieving VimCharm. Create the necessary directory structure and clone NERDTree:

mkdir -p ~/.vim/pack/vimcharm/start/
git clone https://github.com/preservim/nerdtree.git ~/.vim/pack/vimcharm/start/nerdtree

Open some file (any file) on vim and type :NERDTreeToggle – you should see the sidebar. Executing the same command again would close the NERDTree. PyCharm by default opens/closes the sidebar using Ctrl + 1. However, the terminal (and consequently Vim) cannot differentiate between 1 and Ctrl + 1, so we’ll map this to Alt + 1 instead. Before we do that, we need to determine what characters are sent by the terminal when we press the key combo. Simply run cat without any arguments, and press Alt + 1. You should see something like ^[1 (the escape character followed by 1).

You would also see that 1 and Ctrl+1 produce the same character on cat – as far as I know, there’s no way around this.

We need to map the character sequence for Alt + 1 to :NERDTreeToggle on our .vimrc:

set <A-1>=^[1
nmap <A-1> :NERDTreeToggle<CR>

Take care not to simply copy paste the character sequence on to your .vimrc! That won’t work. You should open your .vimrc on Vim, go to insert mode, and press Ctrl+v – this would put a caret under your cursor – now press Alt + 1 and it should fill in the necessary characters there. Restart vim, and Alt+1 should now open and close NERDTree.

Making it look like PyCharm

Gruvbox looks like the default PyCharm theme, kinda, so let’s get that:

git clone https://github.com/morhetz/gruvbox.git ~/.vim/pack/vimcharm/start/gruvbox

According to the installation page, we need to add this to our .vimrc:

autocmd vimenter * ++nested colorscheme gruvbox

Commenting and Uncommenting lines using Ctrl + /

We are going to use NERDCommenter for this. Clone it into the right directory, just as before:

git clone https://github.com/preservim/nerdcommenter ~/.vim/pack/vimcharm/start/nerdcommenter

Ctrl+/ cannot be directly mapped on your .vimrc. So just like before, we insert the correct escaped character sequence into our .vimrc by going into insert mode, pressing Ctrl+v, and then pressing the desired keycombo (Ctrl + / in this case).

" The part after "=" in the below line should be inserted using Ctrl+v while in insert mode and then pressing Ctrl+/
set <F13>=^_
noremap <F13> :call NERDComment(0,"toggle")<CR>

" So that NERDCommenter can automatically decide how to comment a particular filetype
filetype plugin on

We are telling Vim to map the character sequences that Ctrl + / produces to the F13 key, which probably does not exist on your keyboard, and then we map F13 to the appropriate command to toggle comments. Restart Vim, open (or create) a python file and try Ctrl + / while in normal mode – it should comment/uncomment the line.

Autocomplete

Autocomplete on Vim does not feel as “fluid” as on PyCharm – for example, I haven’t managed to get it to work on imports – but it is still besser als nichts (better than nothing). Get jedi-vim:

git clone https://github.com/davidhalter/jedi-vim ~/.vim/pack/vimcharm/start/jedi-vim

Restart vim, and Ctrl + space should already be giving you autocomplete suggestions. IMHO, jedi-vim displays too much information (what even is that bar thing that comes up on top?) – all I wanted was a simple autocomplete prompt. Put this on your .vimrc to Make Autocomplete Great Again:

let g:jedi#show_call_signatures = 0
let g:jedi#documentation_command = ""
autocmd FileType python setlocal completeopt-=preview

EDIT: I also installed supertab along with jedi:

git clone https://github.com/ervandew/supertab ~/.vim/pack/vimcharm/start/supertab

Go-to definition using Ctrl+click

This is something that jedi-vim already does – all we need are some .vimrc entries:

set mouse=a
let g:jedi#goto_command = "<C-LeftMouse>"
map <C-b> <C-LeftMouse>

The above lines enable mouse support on Vim, set Ctrl + left click as the combination for jedi’s goto command, and then recursively map Ctrl+b (which is also what PyCharm uses by default) to Ctrl + left click as a keyboard-friendly alternative.

I also prefer the goto command to open a new tab when navigating to a different file. Here’s how to enable that, along with Shift+j and Shift+k to move between tabs:

let g:jedi#use_tabs_not_buffers = 1
nnoremap J :tabp<CR>
nnoremap K :tabn<CR>

Jump to previous location using Alt + minus

If you end up navigating multiple tabs away using Ctrl+b, you can press Ctrl+o repeatedly to jump back to your original position. Press Ctrl+i to go in the other direction. These would come in handy if you have to quickly gg to the beginning of the file to add an import – you can then press Ctrl+o to go back to the line you were editing. I believe in PyCharm the default mappings for this are Ctrl + Alt + Left arrow and Ctrl + Alt + Right arrow respectively. I remapped these to Alt + - and Alt + Shift + - (that would in fact be Alt + _ ):

set <A-->=^[-
noremap <A--> <C-O>
set <A-_>=^[_
noremap <A-_> <C-I>

Remember that copy-pasting these won’t work and you will have to enter insert mode and press Ctrl+v and then Alt + -

Fuzzy file search

Fuzzy file search is what PyCharm does when you press the Shift key twice.

Download CtrlP:

git clone https://github.com/kien/ctrlp.vim.git ~/.vim/pack/vimcharm/start/ctrlp

By default pressing Ctrl+p should bring up the search prompt. We can’t map this to double shift (as in PyCharm) since vim can’t recognize Shift key presses (unless it’s combined with a printable character). I decided to map this to Ctrl + o instead (“o” for open), though this is not any better than the default setting. On your .vimrc:

let g:ctrlp_map='<C-O>'
let g:ctrlp_cmd='CtrlPMixed'

The second line above specifies that the search should be over files, buffers, and most-recently-used (MRU) files – you may omit it if you do not want buffers to show up in your search. Ctrl+t on a search result will open it in a new tab.

Search everywhere and replace

One of the most useful features in PyCharm is the “search in project” dialog that Ctrl + Shift + f brings up. For example, if I delete/rename a hard-coded string literal, this is the dialog that I would bring up to look for all occurrences of that string literal so that I can rename/delete all of them – right from the search window.

Instead of using the built-in vimgrep or making an external call to the ubiquitous grep, we are going to use ack because it skips annoying things like version-control directories (.git and friends) and binaries by default.

Somehow get ack on your target system
git clone ack.vim to ~/.vim/pack/vimcharm/start/ack.vim

We are going to map Alt + F to a python-only smart-cased search with ack. Add these to your .vimrc:

set <A-F>=^[f
nnoremap <A-F> :Ack! -S --python

Remember to press Ctrl+v and then Alt+F to get the escape sequence right. Also, there is an extra space after the --python. Without it, the search term (eg: “foo”) that you type after pressing Alt+F would end up being “--pythonfoo”.

Restart vim and press Alt+f in a python file, enter your search term, and press enter. The results will be shown in a quick-fix window. Move your cursor to a search result and press enter to jump to that location. Press t to open that location in a new tab. Either of these would shift the focus to the editor. Press ctrl + w twice to shift focus back to the quick-fix window.

I usually use /<pattern> to search within the file, but sometimes it’s useful to do a slightly fancier search. I’ve wired Ctrl+f (the regular PyCharm find-in-this-file) to do a search within the open file:

nnoremap <C-F> :Ack!  %<Left><Left>

By default, you cannot make any changes to the contents of the quick-fix window. In PyCharm, the search results are editable and the changes are reflected in the underlying file. We can pull this off using the quickfix-reflector:

git clone https://github.com/stefandtw/quickfix-reflector.vim.git ~/.vim/pack/vimcharm/start/quickfix-reflector.vim

That’s it! Now your edits on the search results should be reflected on the underlying files.

Spot syntax and linter errors as you type

The most annoying thing about writing Python on Vim, at least for me, was that the silly syntax errors I make won’t be discovered until I actually try to run my script – PyCharm usually catches these as you type. Let’s set this up on vim using ALE:

git clone https://github.com/dense-analysis/ale.git ~/.vim/pack/vimcharm/start/ale

You should also have a linter installed. I use pylint, and a quick pip install pylint does the trick. Restart vim and open a python file, and it should already be linting it as you type. Since ALE works asynchronously, there will be a slight delay (around a second) between you making a mistake and it being flagged in Vim – but in my opinion this is much better than synchronous linting, which freezes Vim, and that is why I chose ALE instead of syntastic. However, the default ALE + Pylint combo is too whiny for my taste – I don’t want warnings about how I’m not writing a docstring for every single method; I have this in my .pylintrc:

[MESSAGES CONTROL]
disable=trailing-whitespace,missing-function-docstring,missing-module-docstring,no-else-return,missing-class-docstring,invalid-name

The above is far from how I would like the linter to be configured, but it serves as an initial config. I also do not care for highlighting the offending word in a line – all I want is a non-invasive indication in the gutter. In your .vimrc:

let g:ale_set_highlights = 0

Show lines modified after the previous commit

Put vim-gitgutter at ~/.vim/pack/vimcharm/start/vim-gutter and restart vim. If you edit a file in a repo (or set up git on your current folder with git init), the modified lines will be marked in the gutter. By default it takes 4 seconds for the appropriate mark to appear in your gutter – let us reduce it by putting this line in the .vimrc:

set updatetime=100

The end result is rather unflattering. ALE and git-gutter do not work well together – git-gutter’s modification marks are drawn over by the linter warnings, and in some cases ALE ends up marking the wrong line with a warning. This thread suggests that there’s probably a way to get them to work the way I want, but I haven’t invested much time here.

Have a terminal handy

In Vim :term will open a terminal in a split window. Mapping this to F12 is trivial, but we want to hide this terminal (instead of killing it) and bring it back again on pressing F12. I could not get the “hide terminal on F12” part working, but I did figure out how to bring up a hidden terminal if it exists (or create a new terminal if it doesn’t) on pressing F12. Before we write a script for it, let’s go through the motions manually:

Open Vim
Type :term to open a terminal in a split window
Type something on the terminal
Press Ctrl+w and then type :hide to hide our terminal window
To show our hidden terminal, type :sbuffer /bin/bash. This would open in a split window a buffer that has “/bin/bash” in its name. If you use something other than bash, you will have to change this string accordingly.

Here’s a LoadTerminal() Vim script I wrote that would bring up an existing bash buffer if it exists, or create a new one if it doesn’t:

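" Show the hidden bash terminal if one exists; otherwise open a new one.
function! LoadTerminal()
    " Terminal buffers are named after the command they run, e.g. !/bin/bash
    let bnr = bufnr('!/bin/bash')
    if bnr > 0
        " Reopen the existing terminal buffer in a split window
        execute 'sbuffer' bnr
    else
        " No terminal buffer yet, so start one
        term
    endif
endfunction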

Save it as load_terminal.vim at ~/.vim/pack/vimcharm/start and add the following lines to your .vimrc:

source ~/.vim/pack/vimcharm/start/load_terminal.vim
map <F12> :call LoadTerminal()<CR>

The annoying part here is that we can’t map key presses on the terminal window – so you’ll have to press Ctrl + w and type :hide to hide the terminal. Do let me know if you find a way to map this to a keystroke.
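One thing that may work here, assuming Vim 8.1 or later built with the +terminal feature: mappings defined with tnoremap apply while the cursor is inside the terminal window. The sketch below reuses F12 so the same key hides the terminal from within it. It has not been tried against the exact setup above.

```vim
" In terminal mode, Ctrl+w followed by : opens the command line,
" so this types :hide for us when F12 is pressed inside the terminal.
tnoremap <F12> <C-w>:hide<CR>
```

With this in the .vimrc, F12 in a normal window would call LoadTerminal() and F12 inside the terminal would hide it, which is roughly the PyCharm toggle.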

Code folding

When you deal with large files, code folding (those tiny “-” signs that you click in PyCharm to collapse an entire class/method) is a godsend. Fortunately vim supports code folding right out of the box and all we need is this in our .vimrc:

set foldmethod=indent
set foldlevelstart=99
nnoremap - za
map _ zM
map + zR

According to our mappings above, there are no folds (i.e every code block is “open”) when we open a file (this is what foldlevelstart specifies). Shift + - (i.e Shift and the minus key) will collapse all blocks, and Shift + + will open all blocks. Use the minus key (i.e -) to toggle collapsing a single fold. You might also want to check out this answer for a quick overview of what’s supported.

Auto save

PyCharm saves the file as you type, sparing you from the hassle of having to press Ctrl+S across multiple tabs. We can get Vim to do this with vim-auto-save. Clone the repo to ~/.vim/pack/vimcharm/start/vim-autosave and add these to your .vimrc to enable auto-save:

let g:auto_save = 1
let g:auto_save_events = ["InsertLeave", "TextChanged"]

A word of caution before we proceed – auto-saving can get quite annoying if enabled globally. I use project-specific vimrcs and use auto-save along with git – so if I accidentally auto-save something that I shouldn’t have, a git diff is all I need to see what went wrong.
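As a sketch of the project-specific setup, assuming it is done with Vim’s built-in exrc option (a local-vimrc plugin would be another way to get the same effect): keep auto-save off globally and switch it on from a .vimrc in the project root. The project file below is hypothetical.

```vim
" In ~/.vimrc: allow a .vimrc in the current directory to be loaded,
" and restrict what such a file is allowed to do.
set exrc
set secure
let g:auto_save = 0

" In <project>/.vimrc (hypothetical project file): enable auto-save here only.
let g:auto_save = 1
let g:auto_save_events = ["InsertLeave", "TextChanged"]
```

The project file is only picked up when Vim is started from that directory.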

Refactor across files

Jedi-vim can do simple renaming, but I wanted something more powerful. Enter ropevim. You need to pip install rope, ropemode and ropevim. I have a miniconda environment set up, but you can install the packages to your global scope if you want to. We just need one file from the ropevim repo:

wget -O ~/.vim/pack/vimcharm/start/python_ropevim.vim https://raw.githubusercontent.com/python-rope/ropevim/master/ftplugin/python_ropevim.vim

Now let’s source it in our .vimrc:

source ~/.vim/pack/vimcharm/start/python_ropevim.vim

Now add these to your .vimrc:

nnoremap <C-z> :RopeUndo<CR>
set <A-z>=^[z
map <A-z> :RopeRedo<CR>
map <F6> :RopeRename<CR>

We have mapped F6 to the rename operation, and Ctrl+z and Alt + z to undo and redo respectively. Remember not to copy paste the mapping for Alt + z, and press Ctrl+v and then the desired keycombo to enter it in your .vimrc.

Restart Vim, open a python file, and try to rename a variable using F6. You will get prompts to create a new ropevim project – press ‘y’ to create one locally, and then proceed to apply the rename. If you get an import error for ropevim when you start Vim, it’s probably because Vim uses the system python (which is probably a different version than the python in your virtualenv) and you pip-installed rope, ropemode and ropevim to a virtualenv. An alternative would be to do conda install -c conda-forge vim on your anaconda/miniconda env so that the Vim in your env will use the local python (and hence your installed pip packages) instead of the system one.

Final thoughts

If anything, this exercise has made me better appreciate the work that the JetBrains devs have put into their IDEs – all I wanted was a working subset of PyCharm’s basic features and what I got was a rather modest approximation. Do let me know (open an issue on GitHub?) if you managed to get any closer to PyCharm than this.

Yet another kindle vs paper books post

May 23, 2019 (updated June 3, 2019) Kevin, 2 Comments

TL;DR: Buy a kindle already. Reading multiple books at a time is surprisingly productive.
I now read while I eat. There's a list at the end comparing the amount of reading I got
done before and after I switched to a Kindle

I love smelling books. I also like stacking my books on a table, or on a shelf, so that I can look at them from time to time and be pleased with myself. The stacks also double as cheap room decor – books make the room more me. Then there is the added social benefit of being able to show off to anyone who cares to visit that I have read Thoreau and DDIA*.

* Only half-way through. It has been a year since I purchased the (physical) book. Sigh.

Despite all this, despite arguing with my friends that a book is more than just the sum of its parts (late realization: a book has only one “part” that matters – the text) and that smelling a book and then flipping through it is a huge part of the “experience”, I switched to a kindle.

I feel ya, fellow book smellers.
Image credits: I got this from a Facebook photo comment :shrugs:

Before you brand me as a traitor and proclaim me unworthy of all the paper books that I have ever smelled, let me assure you that I did not succumb to the dark side easily. I borrowed my dad’s kindle paperwhite and tried it out for an entire month. Then I went out and bought myself a kindle.

The anti-library argument

The number of books I have left partially read has skyrocketed after I switched to the Kindle. And this is a good thing! I have always been a one-book-at-a-time man – I used to carry around the book I was currently reading everywhere, and I would promptly pick up another book after I was done reading the current one. Fast forward to the Kindle era, I find myself reading multiple books at the same time. I had imagined that this would be counter-productive. I’m so happy that I was wrong. The ability to switch to a book on a whim has let me read more than usual since I am reading what I feel like reading now, instead of trying to finish a book that I happened to pick up a week ago out of curiosity. My (unintentional) reading pattern until a few days ago looked like this: Deep Work by Cal Newport during the day, when I find it easy to focus, The Fountainhead for reading in bed, and The Prince (40% complete) and The Rise and Fall of the Third Reich (17% complete – this one is a tome) whenever I feel like it. I can confidently say that I would never have made any progress on the last two books if I had stuck to the one book at a time policy that paper books unintentionally forced me to adopt. I would have given up and moved on to the next shiny thing 3 chapters into a history book.

I might never complete reading some of the books that I have on my kindle (looking at you, Third Reich), but that is not the point. In his book Black Swan, Nassim Nicholas Taleb introduces the concept of an anti-library:

The writer Umberto Eco belongs to that small class of scholars who are encyclopedic, insightful, and nondull. He is the owner of a large personal library (containing thirty thousand books), and separates visitors into two categories: those who react with “Wow! Signore professore dottore Eco, what a library you have. How many of these books have you read?” and the others—a very small minority—who get the point that a private library is not an ego-boosting appendage but a research tool. The library should contain as much of what you do not know as your financial means allow you to put there. You will accumulate more knowledge and more books as you grow older, and the growing number of unread books on the shelves will look at you menacingly. Indeed, the more you know, the larger the rows of unread books. Let us call this collection of unread books an antilibrary.

Replace “unread books” with “partially read books” and you can immediately see how switching to the Kindle has benefitted my anti-library.

The “I now read more” argument

I have been reading on a kindle for about 5 months now, and I do not see myself going back. If there ever was a single compelling advantage that a kindle gave me over paper books, this is it: I read more on a kindle. Much, much more.

I now read while I wait, while I eat, and while I poop. Because the device pleasantly fits into my palm, I can now read while I’m having dinner instead of watching something on YouTube. You may think that this is not a big deal – but for me, it makes all the difference. Unlike finding time to read, finding time to eat is something that I must do in the interest of self-preservation. Coupling eating with reading is a win-win.

But can’t you just read on your laptop/phone while you eat?

Even if I gloss over the possibility of food on my keyboard, a laptop on the dinner table is just outright inconvenient. Reading on the phone might work. I really do not have a solid reason (apart from the LED screen) as to why I could not bring myself to read on my phone regularly.

I am going to deliberately avoid discussing all the other nice things about using an e-reader. I do find myself taking a lot of notes while I read – something I never used to do with paper books since I couldn’t be bothered to carry around a pen. It is also useful to highlight interesting anecdotes/quotes in a book and then later see them in a compact list. But IMHO these are fringe benefits.

Some raw data

Pre-kindle. List of books I read from September 2017 to January 2019 (16 months), in no particular order. This includes both physical books and the few books that I had read on my phone:

The God of Small Things, Arundhati Roy (on my phone)
Animal Farm, George Orwell (small book)
The Ministry of Utmost Happiness, Arundhati Roy
Designing Data-Intensive Applications, Martin Kleppmann (physical book, still reading)
The God Delusion, Richard Dawkins
Walden, Henry David Thoreau
The Old Man and the Sea, Ernest Hemingway (on my phone, small book)
Catch-22, Joseph Heller
The White Tiger, Aravind Adiga
Meditations, Marcus Aurelius (on my phone, read a few pages here and there)
Blink, Malcolm Gladwell (Read around half of it)

The post-kindle list, spanning February 2019 to May 20, 2019 (4 months):

Crime and Punishment, Fyodor Dostoevsky
1Q84, Haruki Murakami
Black Swan, Nassim Nicholas Taleb
Antifragile, Nassim Nicholas Taleb
The Fountainhead, Ayn Rand
Stories of Your Life and Others, Ted Chiang
Deep Work, Cal Newport (44% complete)
The Prince, Niccolò Machiavelli (40%)
The Rise and Fall of the Third Reich, William L. Shirer (17%)
Aatujeevitham, Benyamin (36%)
Flow, Mihaly Csikszentmihalyi (17%. No intention of returning to this book)
The New Evolution Diet, Arthur De Vany (28%, No intention of returning to this book)

Though I am not a “voracious” reader by any stretch of the imagination, you can see that when compared to the pre-kindle rate of 10 books in 16 months (about 0.6 books a month), 6 books in 4 months (1.5 books a month) is indeed an improvement. Note that this is considering only the completed books – 17% of “The Rise and Fall of the Third Reich” is as long as some independent books -_-. I admit that the pre-kindle list is likely to be incomplete – I do not remember all the books that I have picked up and left halfway. Nevertheless, the lists should be able to convince you, albeit rather unscientifically, that I read more after I switched to a Kindle.

Lessons from learning to play the violin

January 27, 2019 (updated January 28, 2019) Kevin

TL;DR:

Learning to play the violin introduced me to western classical music, amateur orchestras, and deliberate practice. Even though I will never seriously pursue music, it was well worth my time.

Some background

I have been taking violin lessons as an adult beginner from December 2016 until now. I have stopped my lessons temporarily since I will be moving to a different city soon. As of January 2019, I am at Suzuki book 3. The decision to pursue violin as an adult might have been influenced by the Carnatic violin lessons I took when I was eleven years old [see sunk cost fallacy].

Also, partly owing to my limited knowledge, I am going to collectively refer to baroque, renaissance and classical music as just “classical music”.

1. Classical music is cool!

I would often come home from a practice session and look up the music we learned that day on YouTube. Though it started as an exercise to get more familiar with the music, I found myself listening to music more actively rather than just letting it play “in the background”. This simple act of being more attentive to what I listen to helped a lot in letting me appreciate classical music.

I was never much interested in the classical genre – most of the pieces I had encountered earlier were simply too long for my short attention span. The lack of an obvious, simple, repeating “chorus” in the genre was something I found hard to come to terms with. However, listening attentively led to the realization that the sophistication and the “cleverness” in the music is something that I could enjoy. My first “aha!” moment was when I stumbled upon the Canon in D. I could see how nuanced the composition was (to my untrained ear), and how each of the violinists seemed to be playing something entirely different yet similar. This was brilliant.

Then I discovered Antonio Vivaldi. I was blown away.

Then there are compositions like the “Moonlight” sonata, which I did not quite like the first time I heard it and now I cannot imagine how I could have not loved it all this while. There is clearly a method to the madness.

My favorite rendition of the Canon in D

2. It is better to progress slowly, but surely.

The vibrato is a technique that every budding violinist hopes to master one day. Six to seven months ago, my vibrato was barely audible – I had to strain my ears to recognize it. Even though I am still a long long way from a respectable vibrato, today I can do some vibrato. A shitty vibrato feels much better than no vibrato.

I did not have to practice particularly hard or long to achieve this. I learn the violin for leisure and am, accordingly, rather leisurely when it comes to practice. I am happy that though I do not play daily (not a good thing), the 40 minutes of practice I put in 3-4 times a week actually let me (slowly) progress in my lessons. This was new for me. I did not have to work hard to progress – I just had to work somewhat consistently. If I had applied this principle to other things in my life, such as contributing to an open source project or going to the gym, I would have had today a much more braggable resume and much less belly fat.

3. Short, deliberate practice is much better than long hours of unfocused practice

When learning new music, my teacher often tells me that once you learn to play the hardest part the rest becomes very easy. The developer in me resonates with this idea – there is no point in optimizing the rest of your code unless and until you address the bottlenecks. The bottleneck, in my violin lessons, is often fast sections of a composition or parts where I am required to use a new finger position. I would often try to avoid putting in the work and wouldn’t practice the difficult parts separately, partly because playing just the difficult parts is boring. It is much more pleasurable to attempt the music as a whole and enjoy playing at least a part of the composition, instead of tackling just the difficult parts and consequently sounding like a cat being tortured. Inevitably, a few days later, I would realize that I am no closer to playing the music successfully because the difficult parts are holding me back. To make progress, I have always had to prioritize learning the difficult portions.

Some parallels that I can draw to software development include learning new programming paradigms or tackling problems outside your usual domain of expertise. I have recently started reading this wonderful book on mathematics even though I have covered most of the topics as part of my CS bachelor’s degree. Writing code for the exercises at the end of each chapter is sure to get me out of my comfort zone, and using Rust to attempt those exercises will make things more interesting.

4. Use social commitments to your advantage

I performed on a stage for the very first time on October 29, 2018. Even though I played the relatively easier second violin part, the pieces that my teacher chose for the orchestra were beyond my skill level. I had four months to “get my shit together” and “man up” for the big day. Horror-struck by the idea of embarrassing myself in front of a crowd, I started pouring extra time into my practice sessions. Vivaldi’s Summer was a particular pain in the ass – it was simply too fast for me. Eventually, we stopped following the Suzuki books in my personal classes and focused only on being able to play the second violin part of Summer by October.

When the big day came, I was not even nearly ready for that performance. I played a lot of wrong notes and to make things worse, my music stand’s hinge broke and I had to try and read from a stand in the next row. I felt terrible at the end of the day. When I talked to my teacher about how disappointed I was with myself, this was his response:

It does not matter. Do you really think that I do not make mistakes? The final performance was not at all significant compared to what you learned in the months of preparation leading to it.
Raja Singh, The creative school of math and music, New Delhi

The final performance was just an excuse to get the students to punch above their weight class. I must say that it worked – I would not have put in the extra time and effort in the absence of a social commitment. The orchestra also taught me how to follow a conductor, and I could not help but chuckle when I realized that the conductor is just a glorified metronome. Something similar happened when I committed to writing something for my employer’s engineering blog. We were trying to create a brand around the culture we strove to build in the engineering team, and I did not want to do a sloppy job. While I usually invest only a couple of hours into a blog post, this particular one took an entire weekend and went through multiple iterations. The result was head and shoulders above anything I had written till date. Social commitments FTW!

Us performing Mozart’s Symphony No. 25. I’m the tall-ish guy at center-right last row who seems to be barely playing. I need to use more bow *sigh*.

5. It is okay to not like something

My opinion before my introduction to classical music:

Country/Acoustic/Pop > Rock > Hip Hop > Metal > Classical

What I thought my opinion would be after (a mere) 2 years of violin lessons:

Classical music > everything-else > Metal

Unfortunately, such PC master race > console peasantry type comparisons are useless in music. For example, I do not get why people love Chopin. I mean yeah this sounds nice, and I would very much like to claim that I listen to Chopin and thus validate my “superior” taste in music. But the truth is, I like Tarzan and Jane much more than I like Frédéric Chopin. To each his own.

Squad takes the Joel test

February 15, 2018 Kevin

As originally published in Squad Engineering

Disclaimer: Though this post sounds like fancy startup propaganda, it is not. I still stand by what I’ve written, even though I no longer work at Squad.

Considering that the Joel test dates back to the turn of the century, a time when Pentium III was state of the art and Linux was still obscure, it has aged quite gracefully. It is no longer the gold standard against which development teams are rated, but it is still surprisingly relevant (for the most part).

At Squad we strive to build an engineering culture of doing more with less, and having a super smooth kick-ass development workflow is a necessity, not a luxury. Here we go.

1. Do you use source control?

Yes, git. All our code lives on GitHub. This one’s a no-brainer.

2. Can you make a build in one step?

One step builds are nice, but no builds are even nicer. In python-land, you don’t build, you deploy. As a developer, all I have to do is commit my changes to the staging branch, and do a fab deploy and voila! I can now test my changes on our staging server to my heart’s content. The whole deploy to staging process takes less than 5 minutes. Deploying to production is just another 5 minutes away, assuming you tested your feature on the staging server thoroughly.

3. Do you make daily builds?

As I said, we don’t really have ‘builds’ and that’s a good thing. What we do have is continuous integration using CircleCI. Unit tests are run automatically every time I commit to the repo and with a Slack integration, the team gets notified whenever a build is completed.

4. Do you have a bug database?

Yes, we track all our features and bugs on Pivotal Tracker. We have git integrations and every commit to the repo is automatically recorded under the relevant story. All discussions relevant to a bug/story happen in a single place. Did I mention we seldom use emails? Yup, that’s right, I can count on my fingers the number of times I had to use email for communicating with my teammates.

5. Do you fix bugs before writing new code?

Depends. Before you go berserk thinking “Why would Squad dooo thaaat?” let me humbly point you to Jeff Atwood’s take on the matter — not all bugs are worth fixing. Since we strive to be as lean as possible, every hour we sink into a bug has to be justified. In fact, our developers ruthlessly confirm the ‘ROI’ first before diving into the code base to hunt down and fix the bug.

That being said, you’ll never see a Squad dev building a feature on top of an existing bug. If thou see-eth the bug, thou fixeth the bug while writing thy feature. Also we don’t believe in titles, so the decision on whether to fix a bug usually comes down to you and not to a mystical manager two tiers above you.

6. Do you have an up-to-date schedule?

We have ‘solver teams’ that are super committed to solving a focused problem and each solver team has a set of Objectives and Key Results (OKRs) for a quarter. This results in a very transparent schedule, agreed on by the whole team, and meeting or breaking it (if necessary) is always a collective decision and not a directive from up top.

What this means is that if Squad was trying to colonize Mars, every member of the solver team would know exactly when to work on making the landing lights look nice and when to focus on just hurtling the shuttle in the approximate direction of Mars.

From when to fix the ignition switch to saying ‘Hi’ to the alien, the solver team has it all figured out.

But despite our best efforts, sometimes deadlines are missed. We try to shed more weight (no landing lights on the rocket) and invoke ** ‘focus on focus’ to get the most critical components shipped. What happens if we’re still unable to meet the schedule? That’s when we own up that our estimates were incorrect and do it better the next time.

** Focus solely on what is worth focusing on. Thanks, Rishabh

7. Do you have a spec?

Walking on water and developing software from a specification are easy if both are frozen – Edward V. Berard

Remember what I said about stories in Pivotal Tracker? Our specs too live inside stories. This is the process we follow:

Discuss with all stakeholders and draft a requirements doc. This doc will act as the source of truth for what needs to be solved and what needs to be built. This is our spec, and it is frozen as far as this story is concerned
Design a solution. What’s the easiest, most elegant way to meet the specs? Write down the design, the tasks involved and estimate the story
A teammate reviews your design, and a meaningful discussion about the merits of your design and alternate solutions ensues
You and your reviewer agree on a design, and you happily code away

8. Do programmers have quiet working conditions?

Oh this is my favorite part.

We have two really cool ‘silent rooms’ anybody can go to when they feel that the office bullpen is getting too loud. Once inside, you are not allowed to talk (unless you and your buddy have super-human whispering skills) and your phones should be on silent. Not on vibrate, but on silent. This is the metaphorical cave where programmers disappear into and come out with bags full of features. Silent rooms are our rock and our refuge, the one place where we won’t be disturbed. The rooms are insulated so that if someone revs their bike right outside the window, we wouldn’t know about it. And yes, I’m writing this blog sitting in one of the silent rooms.

9. Do you use the best tools money can buy?

Yes, developers bring their own device to work. I actually moved my assembled desktop into the office since I was convinced that an underpowered laptop without a discrete GPU won’t be able to ‘handle me’. Whatever makes you productive.

IDEs too are up to you to decide, and my favorite contender is PyCharm. We have also employed a slew of awesome tools like Slack, Sentry, and Loggly to make sure that our developers are as productive as they can be. Look at our StackShare page for more.

However, this does not mean we have a paid subscription for [insert your favorite tool for x here]. We don’t give MacBook Pros to our developers (but we do finance interest-free EMIs if they choose to buy one). We just recently got Slack premium. We are moving from Asana to ClickUp. We have not maxed out the parallelism of our CircleCI builds. We also don’t swim in money. We only buy things that make us more productive.

10. Do you have testers?

No. The developers are responsible for their stories/features and they are expected to test them well before pushing to production. However, since another pair of eyes is always better, the person who requested the story (usually a Product Manager from the team) would do an ‘Acceptance Testing’ to make sure what you wrote conforms to the (hopefully frozen, rock solid) specs.

11. Do new candidates write code during their interview?

Yes, an insane amount of code. These are the steps that you would go through to get hired as a Product Engineer at Squad:

A call from our awesome CTO, Vikas
A design round, over the phone
A take-home assignment. (During my interview, I spent around 2 days on it and wrote a lot of code)
The team at Squad runs your code and checks that it works as intended. Bonus points if you have a readme.md that makes our life easier
The team then reviews your code. I want to stress this part. We actually read your code, and ‘it works’ does not mean you pass
A meaningful discussion over email about your chosen design and execution
You get invited for an on-site interview. First round is ‘activity extension’: extend your take-home assignment to add a couple of features. If your methods/modules were too tightly coupled and inflexible, you’d have a hard time here
Another design round. You are expected to design the solution to a problem, and write skeleton code to solve it. Given the time constraint, you get bonus points if your code actually works
A bug smash round. You are given a code base and your task is to refactor it according to your definition of ‘good code’

Yes, that’s a lot of code and I guarantee you that it will be the best technical interview you’ve ever had!

12. Do you do hallway usability testing?

Remember what I said about our high octane HQ? It is almost impossible to keep a feature under wraps until it’s released. “Oh Kevin it looks cool but I don’t think it is really useful because of x and y” happens a lot and I’m grateful for it. Continuous improvement and constant iteration are part of our DNA. That being said, there have also been instances where I waited too long to show my features, or I committed to production and then asked the stakeholder to go through the feature. Hallway testing is a guideline. Though we strive to follow the best practices and guidelines, it is not always possible. As they say, it is easier to ask forgiveness than to ask for permission. Not always, but sometimes.

So we’re finally here — time to count the points. I would claim that we scored a solid 11, but a skeptic would argue that we barely made a 10. You be the judge.

Also, there are a lot of questions that the test does not address, like how easy it is for developers to work remotely (very easy, our team is partly remote) and how often you refactor code (very often). But I’ll defer that to another post.

This post was much ‘shittier’ than it is now. Thanks to my wonderful friends at Squad for helping me fix it

Django code review for dummies

December 10, 2017 Kevin

After 2 years at an enterprise backup software firm, I finally took the plunge and joined a startup. I love the engineering culture we have here at Squad, and rigorous code reviews are very much a part of it. Since I often found myself repeating the same mistakes again (and again, and again..), I went ahead and wrote a checklist to help me.

This is not a generic list and is very much tied to me and the mistakes that I made. The list helped me, but your mileage may vary

The TL;DR

Read through the code you wrote, and stop and ponder at each line.

The actual (noobie) list

1. No create/update inside a loop

Are you making create() or update() calls in a loop? Have you considered whether they could be replaced with bulk_create() and bulk_update()? See the django-bulk-update package.
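To make the cost of the loop visible, here is a minimal sketch that uses Python's built-in sqlite3 module instead of the Django ORM (my assumption, so that it runs without a Django project). The post table and the run helper are made up for illustration; the only thing to look at is how many statements reach the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")

statements_sent = 0


def run(sql, params=()):
    """Hypothetical helper: execute one statement and count it as one round trip."""
    global statements_sent
    statements_sent += 1
    return conn.execute(sql, params)


titles = ["post %d" % i for i in range(5)]

# Anti-pattern: one INSERT per object, like calling create() inside a loop.
for title in titles:
    run("INSERT INTO post (title) VALUES (?)", (title,))
loop_statements = statements_sent

# Bulk version: a single multi-row INSERT, which is roughly the kind of
# statement bulk_create() sends for a list of unsaved objects.
statements_sent = 0
placeholders = ", ".join(["(?)"] * len(titles))
run("INSERT INTO post (title) VALUES " + placeholders, titles)
bulk_statements = statements_sent

row_count = conn.execute("SELECT COUNT(*) FROM post").fetchone()[0]
print(loop_statements, bulk_statements, row_count)
```

Both versions insert the same five rows; the loop needs five statements and the bulk version needs one.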

2. No unnecessary model attribute fetches

Are you writing post.blog.id where you could have gotten away with post.blog_id? For example:

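from django.db import models


class Blog(models.Model):
    title = models.CharField(max_length=200)


class Post(models.Model):
    # on_delete is mandatory from Django 2.0 onwards
    blog = models.ForeignKey(Blog, on_delete=models.CASCADE)
    time = models.TimeField()
    author = models.CharField(max_length=100)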

Let’s say you want to know the id of the blog to which a particular post belongs. One way is to do:

 post = Post.objects.first()  # or however you get hold of a Post object
 print(post.blog.id)

But the better way is to do:

 print(post.blog_id)

What’s the difference? In the first case, we are asking Django to fetch the id attribute from the post’s blog entry. As you probably know, Post and Blog are separate tables in the database. So when you ask for post.blog.id, Django will query the Blog table to load the blog (unless it has already been loaded) and then read its id. That’s an extra query. However, this is not really necessary because we have the information we need in the Post object itself. Since we have a foreign key relationship from Post to Blog, Django must somehow keep track of which post is related to which blog. Django does this by storing a special blog_id attribute in Post, which holds the primary key of its parent Blog. So post.blog_id gives us the information we need, without resulting in an extra query.
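If it helps to see the tables themselves, here is a rough sketch of the same idea in plain sqlite3 (not Django; the schema and the sample rows are made up for illustration). The foreign key value sits in the post row, so reading it needs no second query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blog (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE post (
        id INTEGER PRIMARY KEY,
        author TEXT,
        blog_id INTEGER REFERENCES blog(id)  -- the foreign key lives in the post row
    );
    INSERT INTO blog (id, title) VALUES (7, 'Cats');
    INSERT INTO post (id, author, blog_id) VALUES (1, 'alice', 7);
""")

# Loading the post already brings blog_id along (this is post.blog_id).
post_id, author, blog_id = conn.execute(
    "SELECT id, author, blog_id FROM post WHERE id = ?", (1,)
).fetchone()

# post.blog.id corresponds to a second query against the blog table,
# only to read a value we are already holding.
blog_row = conn.execute(
    "SELECT id, title FROM blog WHERE id = ?", (blog_id,)
).fetchone()
blog_id_via_blog = blog_row[0]

print(blog_id, blog_id_via_blog)
```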

3. Docstrings and comments are important

Read through the docstrings and comments. It might seem unimportant to read the docstrings and comments when you have a feature to ship. But a wrong comment/docstring can throw the next developer who reads your code off track, and trust me you don’t want to be that guy.

4. If possible, limit your business logic to the model classes

Be mindful of where you write your business logic. Coming from the jQuery world, I had a tendency to put my business logic wherever I pleased. Avoid writing business logic in ModelAdmin classes or ModelForm classes (yikes!) and write it where it belongs – in Model classes. This ensures:

Consistency in the codebase. If it is business logic, there’s only one place it could be.
Better tests. If it’s in a Model class, then you know you should write unit tests for it

5. Is it covered by unit tests?

Speaking of unit tests, how do you decide when to write unit tests and when not to? If it’s business logic, it needs unit tests. No buts, no maybes, just write the damn tests. If it is something in your ModelAdmin, then you can afford not to write unit tests for that as long as you don’t do any fancy if..elses there. Test your business logic, not your boilerplate

6. Think how your changes affect the existing data

In some cases, for example, when you introduce a new field, you might have to write a data migration so that existing rows in the table would have a sane value for that field. I made the rookie mistake of happily coding away on my feature with nary a thought about the existing data in the database and regretted it afterward. See here for a primer on Django migrations

7. Use that cache!

Keep an eye out for things that can be cached. Find yourself fetching ‘top 10 scorers of all time’ from the DB every time the page loads? Cache it! Though this should be fairly obvious, it’s easy to forget about the cache when you are busy writing your feature.
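As a minimal sketch of the idea, the snippet below uses functools.lru_cache from the standard library as an in-process stand-in for a real cache. In a Django project you would more likely reach for Django's cache framework with a timeout. fetch_top_scorers_from_db and the names it returns are hypothetical.

```python
from functools import lru_cache

db_hits = 0


def fetch_top_scorers_from_db():
    """Hypothetical stand-in for an expensive database query."""
    global db_hits
    db_hits += 1
    return ("asha", "ben", "chen")


@lru_cache(maxsize=1)
def top_scorers():
    # The expensive query runs on the first call only; later calls are
    # answered from memory.
    return fetch_top_scorers_from_db()


for _ in range(3):  # three "page loads"
    scorers = top_scorers()
hits_after_three_loads = db_hits

# When the underlying data changes, drop the cached value so that the
# next call goes back to the database.
top_scorers.cache_clear()
top_scorers()
hits_after_invalidation = db_hits

print(hits_after_three_loads, hits_after_invalidation)
```

The part that is easy to get wrong is invalidation: a cached leaderboard is only useful if something clears or expires it when the scores change.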

8. Offload non-critical heavy tasks to an async queue

Okay, this one is a little specific to our particular stack. Let’s say you have a feature where your user presses ‘generate cats report’ button and you wade through the entire database to figure out how much of your total traffic involved pictures of grumpy cats. It is probably not a good idea to make your user stare at a loading screen while we crunch gigabytes of cat data. Here’s what you could have done:

When the user presses the button, start an async task to calculate grumpy cat traffic volume. We use Celery to make this happen.
Once you fire the async task, immediately respond to the HTTP request with the message “Looking for grumpy cats in the system, we will let you know when we are done”. Now your user can use their time for something more productive.
Message the user in Slack/send an email/display a button on the page when your async task is done.

This will let us offload heavier tasks to spare EC2 instances so that more critical requests/queries do not get slower because of grumpy cats
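The flow above can be sketched without Celery. The snippet below uses a ThreadPoolExecutor from the standard library as a stand-in for the task queue (an assumption made to keep it self-contained); count_grumpy_cats and handle_report_request are hypothetical names, and a real setup would define a Celery task and queue it instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)  # stand-in for a Celery worker


def count_grumpy_cats(rows):
    """Hypothetical heavy task: pretend to crunch a lot of cat data."""
    time.sleep(0.1)
    return sum(1 for row in rows if row == "grumpy")


def handle_report_request(rows):
    """Hypothetical view: queue the work and respond without waiting for it."""
    future = executor.submit(count_grumpy_cats, rows)
    message = (
        "Looking for grumpy cats in the system, "
        "we will let you know when we are done"
    )
    return message, future


message, future = handle_report_request(["grumpy", "happy", "grumpy"])
print(message)  # the user gets this right away

# In real life you would notify the user (Slack, email) from the task itself;
# here we simply wait for the result to show that the work did happen.
grumpy_count = future.result()
print(grumpy_count)
executor.shutdown()
```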

9. Be aware of popular optimizations

Know your Python. Use list comprehensions over for loops, izip over zip (on Python 2; in Python 3, zip is already lazy and izip is gone), et cetera. This comes with time and practice, so don’t sweat it. Oh and don’t forget this:

THRESHOLD = 0.2

need_refuel = None
if fuel_level < THRESHOLD:
    need_refuel = True
else:
    need_refuel = False

The above mess can be refactored into:

need_refuel = fuel_level < THRESHOLD  # a comparison already evaluates to True or False

Whether this aids or impairs readability is a whole different debate. Things like these are subjective, and it is okay to have opinions.

10. Query only what you need

If you had a model like this:

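class Banana(models.Model):
    # CharField requires max_length; 100 is an arbitrary choice
    gorilla = models.CharField(max_length=100)
    tree = models.CharField(max_length=100)
    jungle = models.CharField(max_length=100)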

(God save you if you actually have a class structure like that in production, but it would serve our example well) And you want to do something like this:

banana = Banana.objects.get(id=3)

What you wanted was a banana, but what you got was a gorilla holding the banana with the tree it was sitting on along with the entire fricking jungle (thanks, Joe Armstrong for the quote). Not cool.

What you can do instead is:

banana = Banana.objects.only('id').get(id=3)

Here’s the documentation for only. No more gorillas, just the banana. However, I prefer using .values_list('id', flat=True) over only('id') because the latter might result in extra queries if we carelessly try to use any attribute other than id in our code. Using .values_list() is very explicit and conveys to the programmer that you only need this particular attribute here. It is also faster.

For the love of bananas, just query only what you need

11. Learn to use Django Debug Toolbar

Django Debug Toolbar is your friend. Lavishly using only() and defer() could bite you back if you are not careful. If you defer loading attributes that you don’t think you will need, but you end up needing them anyway, that would be an extra DB query. At least in the Django list pages, this would result in the dreaded n+1 query. Let’s say you want to tabulate bananas and gorillas:

id	Gorilla Details
1	Chump, Amazon rainforest
2	Rick, Cambodia
3	Appu, Kerala

You thought all you needed was id and gorilla, so you used Banana.objects.all().only('id', 'gorilla') to skip the tree and the jungle. But 3 months later, you decided it would be a good idea to display where the gorilla came from in your table. So you wrote a custom function in the ModelAdmin to do this:

def gorilla_details(self, obj):
    return '{0}, {1}'.format(obj.gorilla, obj.jungle)

And everything worked smoothly. But unbeknownst to you, Django is making DB queries in a loop. We had told Django to get only id and gorilla through the only() method. We now need the jungle as well. So whenever we access obj.jungle, Django queries the DB to get the jungle because we specifically told it not to fetch the jungle earlier. So we end up making 10 extra calls to the DB for 10 gorillas (or bananas, whatever). The fix is to include jungle in the only() clause, but more often than not we do not even know that we are making an n+1 query.

Enter Django Debug Toolbar.

Among many other things, DDT will tell you how many queries were fired to load your page. So if our banana-gorilla table makes 35 queries to the database to load, we know something’s wrong. Always look at what the debug toolbar has to say before you send in that pull request for review.
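To get a feel for the numbers the toolbar would report, here is a rough simulation of the banana table in plain sqlite3 (not Django; the run helper is hypothetical and the tree values are made up). It counts queries for the deferred-jungle version and for the fixed version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE banana (id INTEGER PRIMARY KEY, gorilla TEXT, tree TEXT, jungle TEXT)"
)
conn.executemany(
    "INSERT INTO banana (gorilla, tree, jungle) VALUES (?, ?, ?)",
    [
        ("Chump", "tree-1", "Amazon rainforest"),
        ("Rick", "tree-2", "Cambodia"),
        ("Appu", "tree-3", "Kerala"),
    ],
)

query_count = 0


def run(sql, params=()):
    """Hypothetical helper: run one SELECT and count it."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def details_with_deferred_jungle():
    """Mimics only('id', 'gorilla') followed by a read of obj.jungle."""
    details = []
    for banana_id, gorilla in run("SELECT id, gorilla FROM banana ORDER BY id"):
        # jungle was deferred, so every row costs one more query
        jungle = run("SELECT jungle FROM banana WHERE id = ?", (banana_id,))[0][0]
        details.append("{0}, {1}".format(gorilla, jungle))
    return details


def details_with_jungle_loaded():
    """Mimics only('id', 'gorilla', 'jungle'): everything in one query."""
    rows = run("SELECT id, gorilla, jungle FROM banana ORDER BY id")
    return ["{0}, {1}".format(gorilla, jungle) for _, gorilla, jungle in rows]


slow_details = details_with_deferred_jungle()
slow_queries = query_count

query_count = 0
fast_details = details_with_jungle_loaded()
fast_queries = query_count

print(slow_queries, fast_queries)
```

Both functions build the same table, but the first one issues one query for the list plus one per row, which is exactly the pattern that stays invisible until you look at the query count.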

Sorry for the long post. Here’s a potato.

This tiny potato will get you through it

Authentication with React-router 4.x

June 2, 2017 Kevin

This article is inspired by the excellent tutorial by Scott Luptowski on how to add authentication to your React app. I attempt to re-invent the wheel because the original article uses an older version of react-router and the instructions do not work if you are using react-router 4.x. There are a lot of breaking changes when you migrate from 3.x to 4.x, and there is an answer to all the whys here.

Disclaimer

I’m not a Reactjs ninja or rockstar or paladin or anything of that sort. Just a dude with good intentions who had to spend an entire evening trying to figure out how authentication with react-router 4.x works, when the internet had only tutorials that use 3.x. So take my advice with a pinch of salt – I might not be following the best practices.

The goal

Your glorious new app requires the user to log in before they are allowed to do certain things. For example, if you were building the next Twitter, your users shouldn’t be able to tweet unless they are logged in. The idea here is to put certain URL patterns/pages behind an authentication wall so that if a user visits that page and the user is not logged in, he/she should be redirected to a login page. If the user is already logged in, proceed to show the requested content – and the user will have no idea about the karate chops we did behind the stage. Should the user try to navigate to a page that does not exist, we should show a 404 component as well.

The how-to

The solution is simple enough. Just like Scott explained in the original article, we create a React component that contains the login logic. This component wraps all of the routes that require authenticated users. Our entry point to the app would look something like this:

ReactDOM.render(
  <Router>
	<App />
  </Router>,
  document.getElementById('app')
);

But where did all the routes go? As of react-router 4.x, you don’t get to define all your routes in one place. Yep, you read that right. So our App component will be doing its part in routing:

class App extends Component {
	constructor(props){
		super(props);
	}

	render() {
		return (
			<div>
				All the awesomeness in the world converged to a single component.
				<Switch>
					<Route exact path="/" component={Home} />
					<Route path="/" component={RootRouteWrapper} />
				</Switch>
			</div>
		)
	}
}

So what are we doing here? If the url exactly matches /, we render a Home component. For everything else under /, we render RootRouteWrapper, which will subsequently route our requests. So all the other url patterns (eg: /pizza, /pizza/yummy) would go on to render the RootRouteWrapper component. But what’s that Switch component doing there? If we had not enclosed the routes in a Switch, react-router would have rendered all routes that matched the url. So if the user visits your-awesome-app.com, all the routes for / will trigger – both Home and RootRouteWrapper! If your routes are enclosed in Switch, react-router will render only the first match – in our example the Home component.

OK. So now we can show a home page. What does the RootRouteWrapper component do again?

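class RootRouteWrapper extends Component {
	render() {
		return (
			<div id="RootRouteWrapper">
				<Switch>
					{/* Login is an assumed name for whatever component renders your login prompt */}
					<Route path="/login" component={Login} />
					<Route path="/tweet" component={EnsureLoggedInContainer} />
					<Route component={PageNotFound} />
				</Switch>
			</div>
		)
	}
}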

We define 2 routes here – /login to show the user a login prompt and /tweet to let the user post a tweet. Now /tweet should be accessible only if the user is logged in. EnsureLoggedInContainer is the magic component that will handle the login logic for us. The idea is to configure all routes that need authentication to render the EnsureLoggedInContainer. You can also see that we have defined a route that will render the PageNotFound component if the URL does not match any configured routes. On to our login logic:

import React, {Component} from 'react';
import {Route, Switch, withRouter} from 'react-router-dom';

class EnsureLoggedInContainer extends Component {
	constructor(props){
		super(props);
	}

	componentDidMount(){
		const {isLoggedIn} = this.props;

		if(!isLoggedIn){
			this.props.history.replace('/login');
		}
	}

	render() {
		const {isLoggedIn} = this.props;

		return (
			<Switch>
				<Route path="/tweet" component={Tweet} />
			</Switch>
		)

	}
}


export default withRouter(EnsureLoggedInContainer);

The assumption is that the Tweet component shows the user an input box to type a message. Notice how we have declared a Route for /tweet again inside the EnsureLoggedInContainer. When the user navigates to /tweet, RootRouteWrapper renders EnsureLoggedInContainer which in turn renders Tweet. If the user is not logged in, componentDidMount will redirect the user to the login page. Remember that you need to export the class with withRouter for the history to be available in the props. Also, you would need to maintain the state of the application separately – this article assumes that you have laid down the necessary plumbing to pass isLoggedIn as a prop to EnsureLoggedInContainer. isLoggedIn should come from your application state – and react-redux seems to be the most popular choice here. How to use react-redux to pass properties to your component is beyond the scope of this article. If you are interested, there’s a really good introduction here

In case you wanted to add another page that displays a tweet – say /tweet/1, which would show the tweet with id 1 in a TweetContainer component – you would have to write the necessary routing logic inside the Tweet component. /tweet/:id would automatically require authentication since its parent route – /tweet – renders EnsureLoggedInContainer.

Caveats

You have to make changes at 2 places to add a new route that needs authentication – in the RootRouteWrapper component and then again in EnsureLoggedInContainer. I wonder if there is a more elegant solution

Pomodoros for Programming

November 10, 2015 Kevin

If you’ve never heard of the pomodoro method, go read this. I might have just saved your life. I’ve been using this neat little technique since college, and now at work. Pomodoros for exam prep were easy – you sit in front of the book while the clock is ticking and you go for a walk when it isn’t. But using pomodoros for programming turned out to be slightly more complicated than that. Here’s what I found:

The Don’ts

Do not use pomodoros for debugging. You cannot estimate when you will figure out what is causing that bug. It can take anything from 2 hours to 2 days.
Do not use pomodoros to set up your dev environment. You can install visual studio and SQL server while wading through nonsense at /r/nonsense. Save up your pomodoros for tasks that actually require focus.
Do not try to do 14 programming pomodoros a day. If you can do 8 a day, fantastic – you’ve done a lot of work. 6 pomodoros is good. I think anything more than 8 means you will be staying late in the office. That’s okay too. The thing is, if you manage to tick away more than 6 solid pomodoros despite all the email-replying and chit-chat and gazing-into-the-infinity, then it is not a wasted day.
Do not freak out when your pomodoros are interrupted. Instead of losing your shit when people interrupt your pomodoros, avoid interruptions in the first place by setting clear expectations around your maker’s schedule and manager’s schedule

The Dos

Set aside pomodoros for designing systems. This is a good way to force yourself to think hard about a problem before jumping into execution.
Reply to emails on pomodoro breaks. If there are no emails to reply to, take a walk.
It’s okay to extend your 5 minute break by another 2 minutes. When it comes to personal productivity, it is about following the spirit of the law rather than the letter of the law.
I’ve never been able to do those 4 pomodoros in a row and take the bigger 15 minutes break. But do not let that stop you from trying things out until you figure out what works for you.
The real reward of using pomodoros is not (just) that you do more work per day, but that you can now measure how much work you do. If you can’t measure it, you can’t improve it.
kanbanflow is a pretty good tool with a built-in timer. Arguably better than pen and paper. But ticking off pomodoros on a big whiteboard is more satisfying.

Estimated 4 pomodoros and did it in 2. What a wonderful day!

Update

After further experimentation with the technique, I have decided that pomodoros for programming are not my cup of tea. There’s nothing wrong with the technique itself – a good friend and co-worker of mine has been using pomodoros (for programming) for almost a year now and he’s happy with it. It’s just that I am not very productive when there’s the threat of a forced break looming on the horizon.