Serverless Deployment of Sentence Transformer models
May 09, 2022 • Learn how to build and serverlessly deploy a simple semantic search service for emojis using sentence transformers and AWS Lambda.
I’ll be honest with you: deploying a serverless function is quite a shitty experience. There’s a huge number of steps you need to take, lots of provider specific configuration, and an extremely painful debugging process. But once you get your service to work, you end up with a really cost-effective and super-scalable solution for hosting all kinds of services, from your custom-built machine learning models to image processors to all kinds of cron jobs.
The goal of this post is to keep things as simple as possible without cutting corners. There are certainly simpler ways to deploy serverless functions (e.g. using chalice), but they come at the cost of severely limiting your flexibility. As in my previous post on Python Environment Management, I want to give you just enough information to cover most of your needs—not more, but also not less.
At the end of this article you will be able to deploy a serverless function in a way that allows you to move freely to other cloud providers (like Google Cloud Functions or Azure Functions, with slight modifications) and is not limited to Python but lets you deploy any Docker image.
Let’s get started!
Setup
First off, we’ll install a couple of packages that we need to get started.
Serverless is a framework that takes the pain out of deploying serverless functions on different cloud providers. It provides a simple CLI to create project boilerplate in different programming languages and lets you define your serverless functions’ configuration through simple YAML files. To install serverless on macOS/Linux, run
curl -o- -L https://slss.io/install | bash
Next, we need the AWS CLI. Nothing special here. On Mac the easiest way to do this is through Homebrew:
brew install awscli
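On Linux, the official AWS installer works as well (check the AWS docs for the latest instructions):
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install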
We configure the AWS CLI by running aws configure and inputting our AWS credentials and our preferred region. The output field can be left empty. You’ll find your access key ID and secret access key in your AWS console in the top right dropdown menu under Security credentials.
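The prompts look roughly like this (fill in your own values; the output format can stay empty):
AWS Access Key ID [None]: <your_access_key_id>
AWS Secret Access Key [None]: <your_secret_access_key>
Default region name [None]: eu-west-1
Default output format [None]: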
Emoji (Semantic) Search
The final application will include the following files:
❯ tree -L 1 .
.
├── Dockerfile # Dockerfile, duh
├── embeddings.npy # precomputed emoji embeddings
├── emojifinder.py # business logic
├── emoji-en-US.json # list of emojis
├── justfile # justfile (like a Makefile)
├── handler.py # REST handler
├── model # directory with model
├── requirements.txt # pip requirements
├── save_model.py # script to save model to model/
└── serverless.yml # deployment instructions
I put everything on GitHub so you can clone the repo and follow along more easily.
First things first: Finding a Dataset
On GitHub, I found a repo called muan/emojilib with a JSON file containing ~1800 emojis with associated keywords. We download the file’s latest version using wget:
wget https://raw.githubusercontent.com/muan/emojilib/main/dist/emoji-en-US.json
The file looks like this:
{
  "🤗": [
    "hugging_face",
    "face",
    "smile",
    "hug"
  ],
}
That’s pretty awesome: just what we need and already in a very clean state, perfect for our little semantic search application! A big thank you to Mu-An Chiou, the creator of https://emoji.muan.co/, for maintaining this dataset.
Semantic Search Logic
The business logic is relatively straightforward:
- We encode the texts built from the emojis’ keywords into vectors (get_vectors) and save them to disk, so we don’t have to recompute them on every invocation.
- Whenever a client sends a query, we encode that query and look for the items with the highest cosine similarity between the query embedding and our precomputed emoji embeddings (find_emoji).
from dataclasses import dataclass
from pathlib import Path
from typing import Union
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

@dataclass
class Emoji:
    symbol: str
    keywords: list[str]


def get_vectors(
    model: SentenceTransformer,
    emojis: list[Emoji],
    embeddings_path: Union[str, Path] = Path("embeddings.npy"),
) -> np.ndarray:
    if Path(embeddings_path).exists():
        # if the npy file exists, load the vectors from disk
        embeddings = np.load(embeddings_path)
    else:
        # otherwise embed the texts and save the vectors to disk
        embeddings = model.encode(sentences=[" ".join(e.keywords) for e in emojis])
        np.save(embeddings_path, embeddings)
    return embeddings


def find_emoji(
    query: str,
    emojis: list[Emoji],
    model: SentenceTransformer,
    embeddings: np.ndarray,
    n: int = 1,
) -> list[Emoji]:
    """Embed the query, calculate similarity to existing embeddings, return top n hits."""
    embedded_query: torch.Tensor = model.encode(query, convert_to_tensor=True)  # type: ignore
    sims = cos_sim(embedded_query, embeddings)
    top_n = sims.argsort(descending=True)[0][:n]
    return [emojis[i] for i in top_n]
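To sanity-check the logic outside of Lambda, you can run something along these lines locally (it assumes you’ve already saved a model to model/, which we’ll do in a moment):
import json

from sentence_transformers import SentenceTransformer

from emojifinder import Emoji, find_emoji, get_vectors

model = SentenceTransformer("model/")  # local model directory, see save_model.py below
with open("emoji-en-US.json") as fp:
    emojis = [Emoji(k, v) for k, v in json.load(fp).items()]
embeddings = get_vectors(model, emojis)  # computed once, then cached in embeddings.npy
print([e.symbol for e in find_emoji("beach holiday", emojis, model, embeddings, n=5)])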
REST Handler
Finally, we need a simple REST handler that serves our semantic search app:
import json
from dataclasses import asdict
from functools import lru_cache

from loguru import logger
from sentence_transformers import SentenceTransformer

from emojifinder import Emoji, find_emoji, get_vectors


@lru_cache
def get_model(model_name: str) -> SentenceTransformer:
    return SentenceTransformer(model_name)


@lru_cache
def get_emojis() -> list[Emoji]:
    with open("emoji-en-US.json") as fp:
        return [Emoji(k, v) for k, v in json.load(fp).items()]


def endpoint(event, context):
    logger.info(event)
    try:
        request = json.loads(event["body"])
        query = request.get("query")
        assert query, "`query` is required"
        n = int(request.get("n", 32))
        model = get_model("model/")
        emojis = get_emojis()
        embeddings = get_vectors(model, emojis)
        response = {
            "emojis": [
                asdict(e)
                for e in find_emoji(
                    query=query, emojis=emojis, model=model, embeddings=embeddings, n=n
                )
            ]
        }
        # https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-integration-settings-integration-response.html
        return {
            "statusCode": 200,
            "headers": {
                "Content-Type": "application/json",
                "Access-Control-Allow-Origin": "*",
                "Access-Control-Allow-Credentials": True,
            },
            "body": json.dumps(response),
        }
    except Exception as e:
        logger.error(repr(e))
        # https://docs.aws.amazon.com/apigateway/latest/developerguide/handle-errors-in-lambda-integration.html
        return {
            "statusCode": 500,
            "headers": {
                "Content-Type": "application/json",
                "Access-Control-Allow-Origin": "*",
                "Access-Control-Allow-Credentials": True,
            },
            # default=str because the Lambda context object isn't JSON-serializable
            "body": json.dumps(
                {"error": repr(e), "event": event, "context": context}, default=str
            ),
        }


if __name__ == "__main__":
    print(endpoint({"body": json.dumps({"query": "vacation"})}, None)["body"])
The function called endpoint contains our POST handler. There’s nothing special about the name; we can call it anything, as long as we reference it correctly in our serverless.yml later.
A couple of things we have to bear in mind:
- The client’s request will come in JSON-stringified form inside the event["body"] field.
- The handler’s response must follow a special format because it goes through a proxy integration. Specifically, we need to set the statusCode and put our app’s response (JSON-stringified) into the body field.
If you want to call your function from within a web app, you will also need to set the appropriate headers to enable CORS support. E.g. to enable requests from all domains, you set:
"Access-Control-Allow-Origin": "*",
"Access-Control-Allow-Credentials": True,
We use the @lru_cache decorator for our get_ functions to make sure these resources get reused as long as the function is warm.
Prefetching the Sentence Transformer
When our function is idle for too long, Amazon kills the container. When the next request comes in, the container has to be restarted, which means that any models we used will have to be downloaded again. To avoid this, we save our Sentence Transformer model to disk and ship it inside the Docker container. Our save_model.py:
import sys
from sentence_transformers import SentenceTransformer

def save_model(model_name: str):
    """Loads any model from the Hugging Face model hub and saves it to disk."""
    model = SentenceTransformer(model_name)
    model.save("./model")


if __name__ == "__main__":
    args = dict(enumerate(sys.argv))
    model_name = args.get(1, "all-MiniLM-L6-v2")
    save_model(model_name)
Run python save_model.py <model_name_or_path> to download your preferred encoder model. If you don’t specify a model name, the small all-MiniLM-L6-v2 model will be used, which offers a good balance between speed and quality.
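A quick way to verify that the saved model loads correctly is a minimal sketch like this:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("model/")      # load from the local directory instead of the hub
print(model.encode("sanity check").shape)  # (384,) for all-MiniLM-L6-v2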
Dockerizing our App
We don’t yet have a requirements.txt, so here goes:
https://download.pytorch.org/whl/cpu/torch-1.11.0%2Bcpu-cp39-cp39-linux_x86_64.whl
sentence-transformers==2.2.0
loguru==0.6.0
This gives us a CPU-only version of torch (Lambda functions don’t have GPUs anyway, and the CPU wheel is much smaller), the sentence-transformers package, and loguru, a super-simple logging library.
And here’s the Dockerfile, no surprises there:
FROM public.ecr.aws/lambda/python:3.9
# Copy model directory into /var/task
ADD ./model ${LAMBDA_TASK_ROOT}/model/
# Copy `requirements.txt` into /var/task
COPY ./requirements.txt ${LAMBDA_TASK_ROOT}/
# install dependencies
RUN python3 -m pip install -r requirements.txt --target ${LAMBDA_TASK_ROOT}
# Copy function code into /var/task
COPY handler.py emojifinder.py emoji-en-US.json embeddings.npy ${LAMBDA_TASK_ROOT}/
# Set the CMD to our handler
CMD [ "handler.endpoint" ]
Maybe you’re wondering why I ADD the model first, then COPY only the requirements, and only at the end COPY the rest of the files. This is to optimize Docker’s layer caching: things that (often) change together go together, and things that change more frequently come later. If I copied everything at the very top, then whenever I made a change to my business logic, Docker would have to rebuild everything from that point on.
Now we build our image:
docker build -t emojifinder . --platform=linux/amd64
Note: If you’re on an M1 Mac you HAVE TO add --platform=linux/amd64 to the build command to make sure you build for the correct platform, otherwise your container will fail to start and terminate with an exec format error.
Debugging
As cool as serverless functions are, debugging them is still incredibly painful. Here are a few tips to help you in your struggle:
- First, make sure your Python code runs as it should.
- If your Docker image builds, run the container locally (e.g. docker run -p 8080:8080 emojifinder; the Lambda base image ships with a runtime interface emulator that listens on port 8080) and invoke your function with:
curl -X POST http://localhost:8080/2015-03-31/functions/function/invocations -d '{"body": "{\"query\": \"stars\"}"}'
- If the function is throwing an exception, you might want to look inside your container. You can override the container’s entry point with docker run -it --entrypoint /bin/bash emojifinder to get a shell into your container.
- Once you have successfully deployed your image and invoked your serverless function, its logs will turn up in CloudWatch (make sure you select the right region, e.g. eu-west-1). Go there, select your Lambda function and look at the logs.
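If you prefer the terminal over the console, you can also tail the logs with the AWS CLI (v2), using the function name that serverless deploy prints (see below):
aws logs tail /aws/lambda/<your_function_name> --follow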
Create ECR Repo and Upload Docker Image
Hang in there, friend, we’re almost done!
To deploy our code we need to create a so-called ECR repository where we can push our image. ECR is a cloud-based container registry that allows us to store and manage Docker images. Creating the repo is just this one line:
aws ecr create-repository --repository-name emojifinder-repo
This command will create an ECR repository called emojifinder-repo and return a JSON response containing a repositoryUri (<account_id>.dkr.ecr.<region>.amazonaws.com/<repo_name>). We note both the account_id and the region, as we’ll need them in the next step.
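If you lose track of the URI later, you can always look it up again:
aws ecr describe-repositories --repository-names emojifinder-repo --query "repositories[0].repositoryUri" --output text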
Now we have to authenticate our docker CLI with AWS ECR, tag the image with the repository URI and push it to ECR:
aws ecr get-login-password | docker login --username AWS --password-stdin <account_id>.dkr.ecr.<region>.amazonaws.com # authenticate with AWS ECR
docker tag emojifinder <repositoryUri> # tag the image
docker push <repositoryUri> # push image to ECR
Depending on your internet connection, pushing the image will take a while, so this might be a good time to grab a cup of tea or whatever.
Here’s a little trick: If you do all of this inside a repo that’s hosted on GitHub, you can use GitHub actions to automatically build your image and push it to ECR, saving you a lot of bandwidth. (Let me know if you find this interesting enough for its own post.)
Deploying the Serverless Sentence Transformer
The service we want to deploy is configured via a serverless.yml file where we define our functions, events, and the AWS resources to deploy. In our case, we’re deploying a single function (called emojifinder) to AWS, and we want to POST to /endpoint:
service: emojifinder

provider:
  name: aws # provider
  region: eu-west-1 # aws region
  memorySize: 1024 # optional, in MB, default is 1024
  timeout: 30 # optional, in seconds, default is 6

functions:
  emojifinder:
    image: <repositoryUri>
    events:
      - http:
          path: endpoint
          method: post
And now, the moment you’ve all been waiting for 💫
serverless deploy
If successful, you will see output like the following:
❯ serverless deploy
Running "serverless" from node_modules
Deploying emojifinder to stage dev (eu-west-1)
✔ Service deployed to stack emojifinder-dev (27s)
endpoint: POST - https://XXXXXX.execute-api.eu-west-1.amazonaws.com/dev/endpoint
functions:
emojifinder: emojifinder-dev-emojifinder
This command will output an endpoint that you can POST to: curl -X POST <yourEndpointURI> -d '{"query": "stars"}'
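The response mirrors our handler’s output, i.e. a JSON object of the form {"emojis": [{"symbol": ..., "keywords": [...]}, ...]}, with as many hits as you requested via n.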
Two things you should know:
- The first request could time out because the model needs to load (at least that’s what I think happens).
- If you get a 403 error saying Missing Authentication Token, make sure you’re really POSTing to the right URI and endpoint.
If you need to fix something in your code, then you will have to run all four steps again:
- docker build
- docker tag
- docker push
- serverless deploy
To make this easier, I’ll leave you with the following justfile so you can just deploy:
region := "<your_region>"
endpoint_uri := "<your_lambda_endpoint_uri>"
repo_uri := "<your_ecr_repository_uri>"
image_name := "emojifinder"

default:
    @just --list

auth:
    account_id=$(aws sts get-caller-identity --query Account --output text) \
    && aws ecr get-login-password | docker login --username AWS --password-stdin ${account_id}.dkr.ecr.{{region}}.amazonaws.com

deploy:
    docker build -t {{image_name}} . --platform=linux/amd64
    docker tag {{image_name}} {{repo_uri}}
    docker push {{repo_uri}}
    serverless deploy

query query n="16":
    @curl -s -X POST {{endpoint_uri}} -d '{"query": "{{query}}", "n": {{n}}}'

emojis query n="16":
    @just query "{{query}}" "{{n}}" | jq -r .emojis[].symbol | tr '\n' ' '
Final remarks
Aaaaand it’s a wrap! Phew! I never said it was going to be easy! But once you’ve got it working, you will have a really powerful and cost-effective tool at your disposal. Also, you now have a super-awesome semantic search for emojis, allowing you to find emojis for any situation, like when you’re hungry and need an emoji-inspired meal plan for the month:
❯ just emojis hungry 31
🍲 🥫 🍽️ 🍟 🥣 🫘 🍫 🌯 😋 🧑🍼 🥮 🌭 🍠 🧇 🍤 🍨 🥙 🥠 🦥 🌮 🍭 🍪 🍬 🥪 🧀 🍖 🌰 🥞 🥝 🍘 🍌