Django code review for dummies

After 2 years at an enterprise backup software firm, I finally took the plunge and joined a startup. I love the engineering culture we have here at Squad, and rigorous code reviews are very much a part of it. Since I often found myself repeating the same mistakes again (and again, and again..), I went ahead and wrote a checklist to help me.

Very wise words from Jake the dog

This is not a generic list and is very much tied to me and the mistakes that I made. The list helped me, but your mileage may vary

The TL;DR

Read through the code you wrote, and stop and ponder at each line.

The actual (noobie) list

1. No create/update inside a loop

Are you making create() or update() calls in a loop? Have you considered whether they could be replaced with bulk_create() and bulk_update()? See the django-bulk-update package.

2. No unnecessary model attribute fetches

Are you writing post.id where you could have gotten away with post_id? For example:

class Blog(models.Model):
     title = models.CharField(max_length=200)


 class Post(models.Model):
     blog = models.ForeignKey(Blog)
     time = models.TimeField()
     author = models.CharField(max_length=100

Let’s say you want to know the id of the blog to which a particular post belongs to. One way is to do:

 post = # some how get a reference to a Post object
 print post.blog.id

But the better way is to do:

 print post.blog_id

What’s the difference? In the first case, we are asking Django to fetch the id attribute from the post’s blog entry. As you would probably know, Post and Blog are separate tables in the database. So when you ask for post.blog.id, Django will query the Blog table to fetch the id that we need. That’s an extra query. However, this is not really necessary because we have the information we need in the Postobject itself. Since we have a foreign key relationship from Post to Blog, django must somehow keep track of which post is related to which blog. Django does this by storing a special blog_id attribute in Post which would store the primary key of its parent Blog. So post.blog_id would give us the information we need, without resulting in an extra query.

3. Docstrings and comments are important

Read through the docstrings and comments. It might seem unimportant to read the docstrings and comments when you have a feature to ship. But a wrong comment/docstring can throw the next developer who reads your code off track, and trust me you don’t want to be that guy.

4. If possible, limit your business logic to the model classes

Be mindful of where you write your business logic. Coming from jquery world, I had a tendency to put my business logic wherever I please. Avoid writing business logic in ModelAdmin classes or ModelForm classes (yikes!) and write them where they belong – Modelclasses. This would ensure:

  • Consistency in the codebase. If it is business logic, there’s only one place it could be.
  • Better tests. If it’s in a Model class, then you know you should write unit tests for it

5. Is it covered by unit tests?

Speaking of unit tests, how do you decide when to write unit tests and when not to? If it’s business logic, it needs unit tests. No buts, no maybes, just write the damn tests. If it is something in your ModelAdmin, then you can afford not to write unit tests for that as long as you don’t do any fancy if..elses there. Test your business logic, not your boilerplate

6. Think how your changes affect the existing data

In some cases, for example, when you introduce a new field, you might have to write a data migration so that existing rows in the table would have a sane value for that field. I made the rookie mistake of happily coding away on my feature with nary a thought about the existing data in the database and regretted it afterward. See here for a primer on Django migrations

7. Use that cache!

Keep an eye out for things that can be cached. Find yourselves fetching ‘top 10 scorers of all time’ from the db everytime the page loads? Cache it! Though this should be fairly obvious, it’s easy to forget about the cache when you are busy writing your feature.

8. Offload non-critical heavy tasks to an async queue

Okay, this one is a little specific to our particular stack. Let’s say you have a feature where your user presses ‘generate cats report’ button and you wade through the entire database to figure out how much of your total traffic involved pictures of grumpy cats. It is probably not a good idea to make your user stare at a loading screen while we crunch gigabytes of cat data. Here’s what you could have done:

  • When the user presses the button, start an async task to calculate grumpy cat traffic volume. We use celery to make this happen.
  • Once you fire the async task, immediately respond to the http request with the message “Looking for grumpy cats in the system, we will let you know when we are done”. Now your user can use his/her time for something more productive
  • Message the user in slack/send an email/display a button on the page when your async task is done.

This will let us offload heavier tasks to spare EC2 instances so that more critical requests/queries do not get slower because of grumpy cats

The grumpy cat programmer

9. Be aware of popular optimizations

Know your python. Use list comprehensions over for loopsizip over zip, et cetera. This comes with time and practice, so don’t sweat it. Oh and don’t forget this:

need_refuel = None
if fuel_level < 0.2:
    need_refuel = True
else
    need_refuel = False

The above mess can be refactored into:

need_refuel = fuel_level < THRESHOLD or False

Whether this aids or impairs readability is a whole different debate. Things like these are subjective, and it is okay to have opinions.

10. Query only what you need

If you had a model like this:

class Banana(models.Model):
    gorilla = models.CharField()
    tree = models.CharField()	
    jungle = models.CharField()

(God save you if you actually have a class structure like that in production, but it would serve our example well) And you want to do something like this:

banana = Banana.objects.get(id=3)

What you wanted was a banana, but what you got was a gorilla holding the banana with the tree it was sitting on along with the entire fricking jungle (thanks, Joe Armstrong for the quote). Not cool.

What you can do instead is:

banana = Banana.objects.get(id=3).only('id')

Here’s the documentation for only. No more gorillas, just the banana. However, I prefer using .value_list('id', flat=True) over only('id') because the latter might result in extra queries if we carelessly try to use any attribute other than id in our code. Using .values() is very explicit and conveys to the programmer that you only need this particular attribute here. It is also faster.

For the love of bananas, just query only what you need

11. Learn to use Django Debug Tools

Django debug tools is your friend. Lavishly using only() and defer()could bite you back if not careful. If you defer loading attributes that you don’t think you will need, but you end up needing them anyways, that would be an extra DB query. At least in the Django list pages, this would result in the dreaded n+1 query. Let’s say you want to tabulate bananas and gorillas:

idGorilla Details
1Chump, Amazon rainforest
2Rick, Cambodia
3Appu, Kerala

You thought all you need is id and Gorilla, so you did Banana.objects.all().only('id', 'gorilla'), so that we don’t need the tree and the jungle. But 3 months later, you thought it would be a good idea to display where the gorilla came from in your table. So you fire up a custom function in the ModelAdmin to do this:

def gorilla_details(self, obj)
    return '{0} {1}'.format(obj.gorilla, obj.jungle)

And everything worked smooth. But unbeknownst to you, Django is making DB queries in a loop. We had told Django to get only id and gorilla through the only() method. We now need the jungle as well. So whenever we access obj.jungle, Django queries the DB to get the jungle because we specifically told it not to fetch the jungle earlier. So we end up making 10 calls to the db for 10 gorillas (or bananas, whatever). The fix is to include jungle in the only() clause, but more often than not we do not even know that we are making an n+1 query.

Enter django debug tools.

Among many other things, DDT will tell you how many queries were fire to load your page. So if our banana-gorilla table makes 35 queries to the database to load, we know something’s wrong. Always look at what the debug toolbar has to say before you send in that pull request for review

Sorry for the long post. Here’s a potato.

This tiny potato will get you through it

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s