Lessons learned from making a SaaS* completely serverless**


* Software as a Service

** serverless as in everything runs on AWS Lambda.

Short summary

I recently launched TweetScreenr. Going completely serverless kept the cloud costs low during development. I used the serverless framework to deploy my Python Flask API endpoints as AWS Lambda functions. However, this slowed down development and I ran into tricky issues and limitations.

The full story

I recently launched TweetScreenr, a service that creates a personalized news feed out of your Twitter timeline, and I decided to use a completely serverless stack.

Why I decided to go serverless

I decided to re-write my app as serverless in an effort to avoid issues I faced in the past with regular AWS EC2 instances. Skip this section if you do not care about my motivation behind switching to serverless. Summary – I thought it would be cheaper and would require less babysitting.

I had launched the same service (minus some nice features) under a different name a year ago. It was a regular Python Flask web app with SQLite as the database and RabbitMQ as the message broker. I wasn’t expecting much traffic, so everything – the database, the message broker and the web server – was running on an AWS EC2 t2.micro. It had 1 vCPU and 1 GB of RAM and cost around $5 a month. Needless to say, it couldn’t handle the traffic from being on the front page of HN. This was expected. But instead of requests just taking longer or the service being temporarily unavailable, the EC2 instance went into a failed state and required manual intervention to restore the service. This wasn’t expected. I was hoping that the t2.micro would become unresponsive in the face of overwhelming traffic and recover as the traffic died down. I didn’t expect it to crash and require a manual restart.

What was happening was that my t2.micro instance was running out of CPU credits and was being throttled to 5% of the CPU performance, which isn’t enough to run the kernel. Burstable instances provide a baseline CPU performance and can burst above this baseline when the workload demands it. You accumulate CPU credits while the CPU is running below the baseline level and you use up these credits when you are bursting. I didn’t know that using up all the CPU credits for the instance can prevent the kernel from running. Using a t2.small didn’t solve the issue – I eventually ran out of CPU credits and the instance failed and required manual intervention. The need to intervene manually meant that if the service went down in the middle of the night, it stayed down until I woke up the next morning.
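To make the credit arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes that one CPU credit equals one vCPU running at 100% for one minute, so a 5% baseline works out to earning 3 credits per hour. The starting balance of 72 credits is a made-up number for illustration, not a measured figure.

```python
def hours_until_throttled(balance, earned_per_hour, utilization, vcpus=1):
    """Estimate how long a burstable instance can sustain a given load.

    Assumes one credit = one vCPU at 100% for one minute.
    Returns None if the load never drains the credit balance.
    """
    spent_per_hour = utilization * 60 * vcpus
    net_drain = spent_per_hour - earned_per_hour
    if net_drain <= 0:
        return None
    return balance / net_drain


# Illustrative numbers only: 72 banked credits, 3 credits earned per hour
# (a 5% baseline), and the CPU pinned at 100% by a traffic spike.
print(hours_until_throttled(balance=72, earned_per_hour=3, utilization=1.0))
```

Under these assumptions a sustained spike drains the balance in a little over an hour, and a bigger balance only postpones the throttling.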

You can argue that I was using the wrong EC2 instance type for the job and you would be right. I chose a t2.micro because it was the cheapest. The cheapest non-burstable instance I could find was an a1.medium for $18 a month, or $11 a month if I reserved it for a year. For a side project that didn’t have a plan to charge its users (yet), I considered that expensive. I considered moving to a $5 Linode, but I was worried I’d run into variants of the same issue. Given the choices, going serverless sounded like a good idea. Each request to my service would be executed in a different Lambda function and hence wouldn’t starve for resources, even under high traffic. Moreover, I would be paying only for the compute I used. I did some calculations and figured that I could probably stay under the limits of the AWS free tier. It took me around a year to re-write the app to be completely serverless, add some new features and a paid tier, and launch again on HN. This time, the app did not go down. But the post also didn’t make it to the front page, so I do not know what will happen if it’s subjected to the same amount of traffic.

The serverless stack

I wanted to use Python Flask during development and deploy each API route as a different Lambda function. I used the confusingly named serverless framework to do exactly that. The serverless framework is essentially a wrapper around a cloud provider (AWS in my case) and automates the annoying job of creating an AWS API Gateway endpoint for each of the API routes in your app. It also has a bunch of plugins to handle things like managing a domain name, using static S3 assets, etc.

I had to use DynamoDB. If I had gone with a relational database, I’d again have to decide where to host the database (e.g. a t2.micro?). Instead of self-hosting RabbitMQ, I decided to use AWS SQS because my usage would fit in the free tier and it lets me easily configure a Lambda function to process messages in the queue. If I had self-hosted RabbitMQ I would have had to use something like Celery to process messages added to the queue, and that would have been an additional headache.
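For readers who have not wired SQS to Lambda before, a minimal sketch of such a handler follows. Lambda delivers SQS messages in batches under event["Records"], with the raw message text in each record's "body". The JSON payload shape, the user_id field and the process_message helper are assumptions made up for this example, not TweetScreenr's actual code.

```python
import json


def process_message(payload):
    """Hypothetical placeholder for the real work; returns the user it handled."""
    return payload["user_id"]


def handler(event, context):
    """Lambda entry point for an SQS trigger.

    Each record in event["Records"] is one queue message; its "body"
    holds the message text (assumed here to be JSON).
    """
    handled = []
    for record in event.get("Records", []):
        payload = json.loads(record["body"])
        handled.append(process_message(payload))
    return handled


# Local smoke test with a hand-built event; no AWS access needed.
sample_event = {"Records": [{"body": json.dumps({"user_id": "u1"})}]}
print(handler(sample_event, None))
```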

The good

Costs were low

I was able to keep costs exceptionally low during development. I wanted to have separate test, dev and prod stages. All experimental features are tested on test, and then promoted to dev once they are stable enough. If nothing explodes in dev for a while, the changes get deployed to prod. This would have required 3 EC2 instances running round the clock. Even if I were to use t2.micros, it would have been $15 a month to keep all three running all the time. It costs $0 with my AWS + serverless framework setup. Costs remained low (i.e. zero) even after I launched. I currently have 8 active users (including me) and I’m yet to exceed the AWS free tier.

Serverless framework gives you free monitoring

The serverless framework gives you error reporting for free. Instead of fiddling around with AWS cloudwatch or sentry, I can open up the serverless dashboard and see an overview of the health of the app. I’ve tried setting up something similar using cloudwatch and gave up because of the atrocious UX.

Some default graphs from the serverless dashboard. I can quickly see if my lambda functions are erroring out.

Infrastructure as code

I was forced into using infrastructure as code and that’s a good thing. The serverless framework requires you to write a serverless.yml file that describes the resources your application needs. For TweetScreenr, this included the DynamoDB table names, global secondary indexes, the SQS queue name, the domain to deploy to, etc. When you deploy using serverless deploy (this is another nice thing – I can deploy to prod with a single command), the serverless framework will create these resources for you. This made things like setting up an additional deployment stage (e.g. a test instance) or deploying to a different AWS account really easy.

The serverless framework also had excellent customer support. When something did not work (which was often; more on that later), I could ask for help using the chat in the dashboard and someone from customer support would help me resolve my issue. This happened twice. Oh, and I’m a free user. I do not want to promote the serverless framework, but their great customer support definitely deserves a mention. If I was treated this well as a free user, I imagine they treat their paid customers even better.

The ugly

Despite the fantastic savings in cost, the niceties of infrastructure as code and the convenience of single-command deployments, my development experience with serverless framework + AWS was frustrating. Most of these are shortcomings of the broader serverless paradigm and are not specific to either AWS or the serverless framework. But a lot of them were just AWS being a pain in the ass and a few of them were problems introduced by the serverless framework.

Lambda functions are slow

My Lambda functions take 2 seconds to start up (cold start). According to this post, the main culprit seems to be the botocore library. Another quirk is that AWS Lambda couples memory and CPU power: CPU scales linearly with the configured memory, from 128 MB up to about 1.7 GB, at which point AWS allocates your function an entire CPU core. The Lambda functions on TweetScreenr’s test and dev instances are configured to use 128 MB of memory and they are slooooow. In the production instance of TweetScreenr I configured the functions to use 512 MB and this made the cold starts considerably faster, even though none of the underlying Lambda functions use more than 100 MB of RAM during execution.
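Memory size also drives the bill, since Lambda charges for duration multiplied by configured memory (GB-seconds). Here is a rough sketch of that arithmetic. The price per GB-second and the durations are assumptions for illustration (check the current AWS pricing page), and the per-request fee and the free tier are ignored.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 price; check current AWS pricing


def compute_cost(memory_mb, duration_s, invocations):
    """Compute charge only (ignores the per-request fee and the free tier)."""
    gb_seconds = (memory_mb / 1024) * duration_s * invocations
    return gb_seconds * PRICE_PER_GB_SECOND


# Hypothetical durations: the same request taking 2.0 s at 128 MB
# and 0.5 s at 512 MB consumes the same number of GB-seconds.
small = compute_cost(memory_mb=128, duration_s=2.0, invocations=100_000)
large = compute_cost(memory_mb=512, duration_s=0.5, invocations=100_000)
print(f"128 MB: ${small:.4f}  512 MB: ${large:.4f}")
```

If the larger setting speeds a function up by more than the memory ratio, it comes out cheaper; if it speeds it up by less, it costs more.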

Lambda functions can’t get too large

There is also a limit to how large your Lambda function can get. I wrote my web app as a regular Python Flask app and thus used a sane number of libraries/dependencies. I quickly ran into the 50 MB limit for Lambda deployment packages. Fortunately there’s a serverless framework plugin for Lambda layers. I was able to put all my dependencies into a layer to keep the deployment size under 50 MB.

DynamoDB limitations

Among all the things that are wrong with serverless, this was the most infuriating.

DynamoDB has a StringSet (SS) attribute type that can be used to store a set of strings. Turns out that you cannot do subset checks with SS. In TweetScreenr, I wanted to check if the set of domains in a tweet is a subset of the set of domains the user has blocked. This cannot be done. I have to do the equivalent of contains(block_list, x) for each x. This is bad, since I have to retrieve all the tweets from the database (and pay for this retrieval) and apply the filter in Python. In Postgres, I could have easily done this with Postgres arrays and the @> operator (a.k.a. the bird operator).
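The client-side filter ends up looking roughly like the sketch below. The tweet dict shape and the function name are hypothetical, and keeping tweets that contain no links at all is a design assumption of this example, since an empty set is technically a subset of any block list.

```python
def visible_tweets(tweets, blocked_domains):
    """Drop tweets whose linked domains are all in the user's block list.

    `tweets` is a list of dicts with a "domains" collection (hypothetical
    shape); `blocked_domains` is the user's block list.
    """
    blocked = set(blocked_domains)
    kept = []
    for tweet in tweets:
        domains = set(tweet["domains"])
        # A tweet with no links has nothing to block, so keep it.
        if domains and domains <= blocked:
            continue
        kept.append(tweet)
    return kept


sample = [
    {"id": 1, "domains": ["spam.example"]},
    {"id": 2, "domains": ["spam.example", "news.example"]},
    {"id": 3, "domains": []},
]
print([t["id"] for t in visible_tweets(sample, ["spam.example"])])
```

The subset test itself is a one-liner in Python; the cost is that every tweet has to be fetched from DynamoDB before it can be applied.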

DynamoDB also won’t let you create an index (a GSI) on a bool attribute. I have an is_user attribute that is a boolean, and the idea was to create an index on is_user so that I can quickly get a list of all users by checking whether is_user is True. Nope. No GSIs allowed on bool. I had to make is_user a string attribute to create an index on it.
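DynamoDB key attributes, including GSI keys, have to be strings, numbers or binary, which is why the boolean had to go. A sketch of what the query against such an index could look like is below. It only builds the arguments for the low-level Query call; the table name, the index name and the choice of "true" as the stored string are hypothetical.

```python
def users_query_kwargs(table_name, index_name):
    """Build arguments for a DynamoDB Query against a GSI keyed on is_user.

    The flag is matched as the string "true" because key attributes
    cannot be booleans.
    """
    return {
        "TableName": table_name,
        "IndexName": index_name,
        "KeyConditionExpression": "is_user = :flag",
        "ExpressionAttributeValues": {":flag": {"S": "true"}},
    }


# Table and index names here are made up for the example.
kwargs = users_query_kwargs("tweetscreenr-prod", "is_user-index")
# With boto3 installed and credentials configured, this would be used as:
#   boto3.client("dynamodb").query(**kwargs)
print(kwargs)
```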

Also, pagination sucks with DynamoDB. There’s no way to get the total number of items matching a query (I mean items having certain attributes, not the overall size of the table) without reading through them. This is why pagination in TweetScreenr uses simple next and prev buttons instead of displaying the total number of pages.
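Cursor-style paging in DynamoDB works by passing the LastEvaluatedKey from one response as the ExclusiveStartKey of the next request. A minimal sketch of that loop is below. The query_fn parameter stands in for a real query call, and fake_query is an in-memory stand-in written only so the example runs without AWS.

```python
def fetch_page(query_fn, page_size, cursor=None):
    """Fetch one page; returns (items, cursor for the next page or None)."""
    kwargs = {"Limit": page_size}
    if cursor is not None:
        kwargs["ExclusiveStartKey"] = cursor
    response = query_fn(**kwargs)
    return response.get("Items", []), response.get("LastEvaluatedKey")


def fake_query(Limit, ExclusiveStartKey=None):
    """In-memory stand-in for a DynamoDB query, for demonstration only."""
    data = [{"id": i} for i in range(5)]
    start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["id"] + 1
    items = data[start:start + Limit]
    response = {"Items": items}
    if start + Limit < len(data):
        # More items remain, so hand back a cursor like DynamoDB does.
        response["LastEvaluatedKey"] = {"id": items[-1]["id"]}
    return response


first, cursor = fetch_page(fake_query, page_size=2)
second, cursor = fetch_page(fake_query, page_size=2, cursor=cursor)
third, cursor = fetch_page(fake_query, page_size=2, cursor=cursor)
print(first, second, third, cursor)
```

A next button only needs the latest cursor. A prev button needs the cursors of earlier pages to be remembered somewhere, since there is no way to jump straight to page N.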

I know what you are thinking – DynamoDB is not a good fit for my use case. But my use case is to simply pull tweets from Twitter and associate them with a user. No fancy joins required. If DynamoDB (and NoSQL in general) is not a good fit for such a contained use case, then what is the intended use case for DynamoDB?

Errors thrown by the serverless framework cli were misleading

Not everything was rosy on the development front either. Mistakes in serverless.yml were hard to debug. For example, I had this (mis-)configured yml:

send_digest:
    handler: src.usermodel.send_digest_for_user
    memorySize: 128
    events:
      - sqs:
          arn: !Ref DigestTopicStaging
          topicName: "DigestTopicStaging"

The problem here was that I was passing a reference to a topic, while the sqs event in the yml expects an SQS queue. This is the stack trace I got when I ran serverless deploy:

✖ Stack core-dev failed to deploy (12s)
Environment: linux, node 16.14.0, framework 3.7.2 (local) 3.7.2v (global), plugin 6.1.5, SDK 4.3.2
Credentials: Local, "serverless" profile
Docs:        docs.serverless.com
Support:     forum.serverless.com
Bugs:        github.com/serverless/serverless/issues

Error:
TypeError: EventSourceArn.split is not a function
    at /home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/plugins/aws/package/compile/events/sqs.js:71:37
    at /home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/plugins/aws/package/compile/events/sqs.js:72:15
    at Array.forEach (<anonymous>)
    at /home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/plugins/aws/package/compile/events/sqs.js:46:28
    at Array.forEach (<anonymous>)
    at AwsCompileSQSEvents.compileSQSEvents (/home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/plugins/aws/package/compile/events/sqs.js:36:47)
    at PluginManager.runHooks (/home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/classes/plugin-manager.js:530:15)
    at async PluginManager.invoke (/home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/classes/plugin-manager.js:564:9)
    at async PluginManager.spawn (/home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/classes/plugin-manager.js:585:5)
    at async before:deploy:deploy (/home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/plugins/deploy.js:40:11)
    at async PluginManager.runHooks (/home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/classes/plugin-manager.js:530:9)
    at async PluginManager.invoke (/home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/classes/plugin-manager.js:563:9)
    at async PluginManager.run (/home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/classes/plugin-manager.js:604:7)
    at async Serverless.run (/home/ec2-user/environment/paperdelivery/node_modules/serverless/lib/serverless.js:174:5)
    at async /home/ec2-user/environment/paperdelivery/node_modules/serverless/scripts/serverless.js:687:9

The error message was utterly unhelpful. I solved this using the good old “stare at the config until it dawns on you” technique. Not recommended.

Serverless framework doesn’t like it if you change things using the AWS console

If I decide to start over and delete the app using serverless remove, it would not work – it complains that the domain name config I’ve associated with an API endpoint must be manually deleted. Fine, I did that. While I was at it, I also manually deleted the API gateways – they were going to be removed by serverless remove anyway. Running serverless remove again now resulted in an error because it could not find the app, because I deleted the API gateways manually. I wish serverless framework would have ignored that and continued to delete the rest of the CloudFormation stack it had created. Since the serverless cli wouldn’t help me, I had to click around the AWS console a bazillion times and delete everything manually. Arghhhhhh.

Something similar happened when I manually deleted a lambda function and tried to deploy again. My expectation was that the serverless framework would see that one lambda end-point is missing and re-create just that. Instead, I got this:

UPDATE_FAILED: PollUnderscoreforUnderscoreuserLambdaFunction (AWS::Lambda::Function)
Resource handler returned message: "Lambda function core-dev-poll_for_user could not be found" (RequestToken: dcc0e4a3-5627-5d7a-2569-39e25c268ff2, HandlerErrorCode: NotFound)

It really doesn’t like you changing things directly in the AWS console.

Outdated documentation about the serverless framework

I was trying to get the serverless framework to create an SQS queue. This blog post from 2018 explicitly mentions that serverless will not create a queue for you – you have to manually create it using the AWS console and use the ARN in the serverless.yml. That information is likely outdated, since this Stack Overflow answer tells you how to get serverless to create the queue for you. There are more examples of outdated documentation on the serverless website.

Conclusion

Making the app go completely serverless was a painful experience. I don’t want to do that ever again. But serverless makes it so cheap to run your app if you don’t have heavy traffic. I should also stay away from AWS. But again, they are the cheapest. Arghh.

Maybe I should set more realistic expectations on what it costs to host a side project. If I am willing to pay for two a1.medium (or equivalent non-AWS) instances, one for the web server and one for the database, I would be a happy man. At $18 a month each, that’s $36 a month, or $432 ($264 if I reserve them) a year. That’s not trivial, but that’s affordable. However, I tend to work on multiple side projects. $100+ a year to host each of them is not practical. Let me know in the comments if you have ideas.

2 thoughts on “Lessons learned from making a SaaS* completely serverless**”

  1. It feels like lots of your complaints are about Serverless Framework rather than Serverless “paradigm” or AWS Lambda themselves. Maybe you should try out something better instead? Like AWS CDK, SST, or even Terraform

    Though cold starts might get tricky.

    > I configured the functions to use 512mb and this made the cold starts considerably faster

    And AWS bills GB*seconds, so you might actually lower your bill by allocating more expensive memory if that makes your functions run considerably faster https://docs.aws.amazon.com/lambda/latest/operatorguide/computing-power.html

    If cold starts are still an issue you might consider AWS ECS with Fargate compute for instance

    > DynamoDB limitations

    For the serverless database you might want to check out this Serverless PostgreSQL option https://neon.tech/
