Some Internet operations trust that clients are “well behaved.” As an operator of a publicly accessible web application, for example, you have to trust that the clients accessing your content identify themselves accurately, or that they only use your services in the manner you expect. However, some clients are bad actors. These bad actors are typically automated processes: some might try to scrape your content for their own profit (content scrapers), and others might misrepresent who they are to bypass restrictions (bad bots). For example, they might use a fake user agent.
Successfully blocking bad actors can help reduce security threats to your systems. In addition, you can lower your overall costs, because you no longer have to serve traffic to unintended audiences. In this blog post, I will show you how you can realize these benefits by building a process to help detect content scrapers and bad bots, and then use Amazon CloudFront with AWS WAF (a web application firewall [WAF]) to help block bad actors’ access to your content.
WAFs give you back some control. For example, with AWS WAF you can filter traffic, look for bad actors, and block their access. This is no small feat because bad actors change methods continually to mask their actions, forcing you to adapt your detection methods frequently. Because AWS WAF is fully programmable using RESTful APIs, you can integrate it into your existing DevOps workflows, and build automations around it to react dynamically to the changing methods of bad actors.
AWS WAF works by allowing you to define a set of rules, called a web access control list (web ACL). Each rule in the list contains a set of conditions and an action. Requests received by CloudFront are handed over to AWS WAF for inspection. Individual rules are checked in order. If the request matches the conditions specified in a rule, the indicated action is taken; if not, the default action of the web ACL is taken. Actions can allow the request to be serviced, block the request, or simply count the request for later analysis. Conditions offer a range of options to match traffic based on patterns, such as the source IP address, SQL injection attempts, size of the request, or strings of text. These constructs offer a wide range of capabilities to filter unwanted traffic.
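The evaluation order described above can be sketched in a few lines of Python. This is a simplified model, not AWS WAF itself: the `Rule` class, the `evaluate` function, and the request dictionary are hypothetical names chosen for illustration. The sketch assumes that a count action records the match and lets evaluation continue with the remaining rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Request = Dict[str, str]


@dataclass
class Rule:
    name: str
    matches: Callable[[Request], bool]  # stands in for the rule's conditions
    action: str                         # "ALLOW", "BLOCK", or "COUNT"


def evaluate(rules: List[Rule], default_action: str,
             request: Request, counted: List[str]) -> str:
    """Check rules in order and return the action applied to the request."""
    for rule in rules:
        if not rule.matches(request):
            continue
        if rule.action == "COUNT":
            # Assumption: counting records the match and keeps evaluating.
            counted.append(rule.name)
            continue
        return rule.action
    # No allow or block rule matched, so the web ACL default applies.
    return default_action


# Hypothetical block list holding one documentation-range address.
blocked_ips = {"203.0.113.7"}
rules = [
    Rule("block-listed-ips", lambda r: r["source_ip"] in blocked_ips, "BLOCK"),
]

counted: List[str] = []
print(evaluate(rules, "ALLOW", {"source_ip": "203.0.113.7"}, counted))    # BLOCK
print(evaluate(rules, "ALLOW", {"source_ip": "198.51.100.20"}, counted))  # ALLOW
```

The block list rule used later in this post has the same shape: one IP-based condition, a block action, and a default action of allow.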
Let’s get started with the involved AWS services and an overview of the solution itself. Because AWS WAF integrates with Amazon CloudFront, your website or web application must be fronted by a CloudFront distribution for the solution to work.
How AWS services help to make this solution work
The following AWS services work together to help block content scrapers and bad bots:
- As I already mentioned, AWS WAF helps protect your web applications from common web exploits that can affect their availability, compromise security, or consume excessive resources.
- CloudFront is a content delivery web service. It integrates with other AWS products to give you an easy way to distribute content to end users with low latency and high data-transfer speeds.
- AWS Lambda enables you to run code without provisioning or managing servers. With Lambda, you can run code for virtually any type of application or back-end service.
- Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. You can create an API that acts as a “front door” for applications to access data, business logic, or functionality from your back-end services, such as code running on Lambda or any web application.
- AWS CloudFormation gives you an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.
Solution overview
Blocking content scrapers and bad bots involves two main actions:
- Detect an inbound request from a content scraper or bad bot.
- Block any subsequent requests from that content scraper or bad bot.
For the solution in today’s post to be effective, your web application must employ both of these actions. The following architecture diagram shows how you can implement this solution by using AWS services.
These are the key elements of the diagram:
- A bad bot requests a specifically disallowed URL on your web application. This URL is implemented outside your web application in the blocking solution.
- The URL invocation triggers a Lambda function that captures the IP address of the requestor (source address).
- The function adds the source address to an AWS WAF block list.
- The function also issues a notification to an Amazon SNS topic, informing recipients that a bad bot was blocked.
CloudFront will block additional requests from the source address of the bad bot by checking the AWS WAF block list.
In the remainder of this post, I describe in detail how this solution works.
Detecting content scrapers and bad bots
To detect an inbound request from a content scraper or bad bot, set up a honeypot. This is usually a piece of content that good actors know they are not supposed to access, and therefore don’t. First, embed a link in your content pointing to the honeypot. You should hide this link from your regular human users, as shown in the following code.
<a href="/v1/honeypot/" style="display: none" aria-hidden="true">honeypot link</a>
Note: In production, do not call the link honeypot. Use a name that is similar to the content in your application. For example, if you are operating an online store with a product catalog, you might use a fake product name or something similar.
Next, instruct good content scrapers and bots to ignore this embedded link. Use the robots exclusion standard and protocol (a robots.txt file in the root of your website) to specify which portions of your site are off limits and to which content scrapers and bots. Conforming content scrapers and bots, such as Google’s web-crawling bot Googlebot, will actively look for this file first, download it, and refrain from indexing any content you disallow in the file. However, because this protocol relies on trust, content scrapers and bots can ignore your robots.txt file, which is often the case with malware bots that scan for security vulnerabilities and scrape email addresses.
The following is a robots.txt example file, which disallows access to the honeypot URL described previously.
User-agent: *
Disallow: /v1/honeypot/
Between the embedded link and the robots.txt file, it is likely that any requests made to the honeypot URL do not come from a legitimate user. This is what forms the basis of the detection process.
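To see how a conforming crawler would interpret these rules, here is a small check using Python's standard `urllib.robotparser` module. The `is_allowed` helper and the `www.example.com` host are illustrative assumptions; the rules themselves are the two robots.txt directives from this post.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules from this post, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /v1/honeypot/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# www.example.com is a placeholder host used only for this illustration.
print(is_allowed("Googlebot", "https://www.example.com/v1/honeypot/"))  # False
print(is_allowed("Googlebot", "https://www.example.com/products/"))     # True
```

A well-behaved crawler performs an equivalent check before each request, so a request that still reaches the honeypot URL most likely comes from a client that ignores the file.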
Blocking content scrapers and bad bots
Next, set up a script that is triggered when the honeypot URL is requested. As mentioned previously, AWS WAF uses a set of rules and conditions to match traffic and trigger actions. In this case, you will use an AWS WAF IPSet filter condition to create a block list, which is a list of disallowed source IP addresses. The script captures the IP address of the requestor and adds it to the block list. Then, when CloudFront passes an inbound request over to AWS WAF for inspection, the rule is triggered if the source IP address appears in the block list. In turn, AWS WAF instructs CloudFront to block the request. In other words, once a source IP address has requested the honeypot URL, any subsequent requests for your content from that address will be blocked.
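The following is a minimal sketch of what such a script could look like in Python. It is not the function shipped with this solution. It assumes that the requester's address arrives in a hypothetical `event["source_ip"]` field, and that `waf` and `sns` are boto3-style clients (for example, `boto3.client("waf")` and `boto3.client("sns")`). The `RecordingClient` class is an in-memory stand-in so that the sketch runs without AWS credentials.

```python
import ipaddress
from typing import Any, Dict


def block_source_ip(event: Dict[str, Any], waf: Any, sns: Any,
                    ip_set_id: str, topic_arn: str) -> str:
    """Add the requester's address to the IPSet and announce the block."""
    # Assumption: the API Gateway integration exposes the address here.
    address = ipaddress.ip_address(event["source_ip"])
    ip_type, prefix = ("IPV4", 32) if address.version == 4 else ("IPV6", 128)
    cidr = f"{address}/{prefix}"

    # AWS WAF requires a fresh change token for every update request.
    token = waf.get_change_token()["ChangeToken"]
    waf.update_ip_set(
        IPSetId=ip_set_id,
        ChangeToken=token,
        Updates=[{
            "Action": "INSERT",
            "IPSetDescriptor": {"Type": ip_type, "Value": cidr},
        }],
    )
    sns.publish(
        TopicArn=topic_arn,
        Subject="Bad bot blocked",
        Message=f"Added {cidr} to the AWS WAF block list.",
    )
    return cidr


class RecordingClient:
    """In-memory stand-in for a boto3 client; records calls instead of calling AWS."""

    def __init__(self) -> None:
        self.calls = []

    def get_change_token(self) -> Dict[str, str]:
        return {"ChangeToken": "example-token"}

    def update_ip_set(self, **kwargs: Any) -> None:
        self.calls.append(("update_ip_set", kwargs))

    def publish(self, **kwargs: Any) -> None:
        self.calls.append(("publish", kwargs))


# Dry run with placeholder identifiers; no AWS call is made.
waf, sns = RecordingClient(), RecordingClient()
blocked = block_source_ip(
    {"source_ip": "203.0.113.7"}, waf, sns,
    "example-ip-set-id", "arn:aws:sns:us-east-1:012345678901:example-topic",
)
print(blocked)
```

Passing the clients in as arguments keeps the blocking logic testable on its own; in a deployed function they would be created once, outside the handler.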
Note: IPSet filter lists can store up to 1,000 IP addresses or ranges expressed in Classless Inter-Domain Routing (CIDR) format. If you expect the block list to exceed this number, consider using multiple IPSet filter lists and rules. For more details on service limits, see the AWS WAF Limits documentation.
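If you do need more than one list, the split itself is straightforward. The sketch below assumes the 1,000-entry limit quoted in the preceding note and uses a hypothetical `split_block_list` helper; creating the additional IPSets and rules in AWS WAF is not shown.

```python
from typing import List

MAX_ENTRIES_PER_IP_SET = 1000  # limit quoted in the preceding note


def split_block_list(addresses: List[str],
                     limit: int = MAX_ENTRIES_PER_IP_SET) -> List[List[str]]:
    """Split a long block list into groups that each fit in one IPSet."""
    return [addresses[i:i + limit] for i in range(0, len(addresses), limit)]


# 2,500 made-up addresses, purely to exercise the split.
addresses = [f"10.0.{i // 256}.{i % 256}/32" for i in range(2500)]
groups = split_block_list(addresses)
print([len(group) for group in groups])  # [1000, 1000, 500]
```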
In the remainder of this post, I show you how to implement the honeypot trap using Lambda and Amazon API Gateway. The trap is a minimal microservice that enables you to implement it without having to manage compute capacity and scaling.
Solution implementation and deployment
All resources for this solution are also available for download from our GitHub repository to enable you to inspect the code and change it as needed.
Step 1: Create a RESTful API
To start, you’ll need to create a RESTful API using API Gateway. Using the AWS CLI tools, run the following command and make note of the API ID returned by the call. (For details about how to install and configure the AWS CLI tools, see Getting Set Up with the AWS Command Line Interface.)
$ aws apigateway create-rest-api --name myBotBlockingApi
The output will look like the following (the "id" field contains the API ID):
{
    "name": "myBotBlockingApi",
    "id": "xxxxxxxxxx",
    "createdDate": 1454978163
}
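If you script this step instead of copying the value by hand, the API ID can be read from the JSON output. A minimal sketch, assuming the output has been captured as a string; `extract_api_id` is a hypothetical helper, and the sample string only mirrors the shape of the output with placeholder values.

```python
import json


def extract_api_id(cli_output: str) -> str:
    """Return the API ID from the JSON printed by create-rest-api."""
    return json.loads(cli_output)["id"]


# Placeholder output with the same fields as the example in this step.
sample_output = '{"name": "exampleApi", "id": "xxxxxxxxxx", "createdDate": 1454978163}'
api_id = extract_api_id(sample_output)
print(api_id)  # xxxxxxxxxx
```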
Note: We recommend that you deploy all resources in the same region. Because this solution uses API Gateway and Lambda, see the AWS Global Infrastructure Region Table to check which AWS regions support these services.
Step 2: Deploy the CloudFormation stack
Download this CloudFormation template and run it in your AWS account in the desired region. For detailed steps about how to create a CloudFormation stack based on a template, see this walkthrough.
You must provide two parameters:
- The Base Resource Name you want to use for the created resources.
- The RESTful API ID of the API created in Step 1 earlier in this post.
The CloudFormation – Create Stack page looks like what is shown in the following screenshot.
CloudFormation will create a web ACL, rule, and empty IPSet filter condition. Additionally, it will create an Amazon Simple Notification Service (SNS) topic to which you can subscribe so that you can receive notifications when new IP addresses are added to the list. CloudFormation will also create a Lambda function and an IAM execution role for the Lambda function, authorizing the function to change the IPSet. The service will also add a permission allowing the RESTful API to invoke the function.
Step 3: Set up API Gateway
We also provide a convenient Swagger template that you can use to set up API Gateway, after the relevant resources have been created using CloudFormation. Swagger is a specification and complete framework implementation for representing RESTful web services, allowing for deployment of easily reproducible APIs. Use the Swagger importer tool to set up API Gateway, but make sure you change the downloaded Swagger template in JSON format, by updating all occurrences of the placeholders shown in the following table.
Placeholder | Description | Example |
---|---|---|
[[region]] | The desired region | us-east-1 |
[[account-id]] | The account ID where the resources are created | 012345678901 |
[[honeypot-uri]] | The name of the honeypot URI endpoint | honeypot |
[[lambda-function-name]] | The name of the Lambda function created by CloudFormation (check the Outputs section of the stack) | wafBadBotBlocker-rLambdaFunction-XXXXXXXXXXXXX |
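Replacing the placeholders by hand is error prone, so you may prefer to script it. The following sketch substitutes every `[[name]]` placeholder and fails loudly if one has no value. The `fill_placeholders` helper and the sample fragment are assumptions for illustration; the fragment is not an excerpt from the actual Swagger template.

```python
import re
from typing import Dict

PLACEHOLDER = re.compile(r"\[\[([a-z-]+)\]\]")


def fill_placeholders(template_text: str, values: Dict[str, str]) -> str:
    """Replace every [[name]] placeholder, failing on any unknown name."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for [[{name}]]")
        return values[name]

    return PLACEHOLDER.sub(substitute, template_text)


# Illustrative fragment only; the real template has more content.
sample = (
    '{"paths": {"/[[honeypot-uri]]": {"uri": '
    '"arn:aws:apigateway:[[region]]:lambda:path/2015-03-31/functions/'
    'arn:aws:lambda:[[region]]:[[account-id]]:function:'
    '[[lambda-function-name]]/invocations"}}}'
)

# Example values taken from the table.
values = {
    "region": "us-east-1",
    "account-id": "012345678901",
    "honeypot-uri": "honeypot",
    "lambda-function-name": "wafBadBotBlocker-rLambdaFunction-XXXXXXXXXXXXX",
}

filled = fill_placeholders(sample, values)
print(filled)
```

To process the downloaded file, read it into a string, pass it through `fill_placeholders`, and write the result to the path you hand to the import tool.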
Clone the Swagger import tool from GitHub, as shown in the following command, and then follow the tool’s readme file to build the import tool using Apache Maven.
$ git clone https://github.com/awslabs/aws-apigateway-importer.git aws-apigateway-importer && cd aws-apigateway-importer
Import the customized template (make sure you use the same region as for the CloudFormation resources), and replace [api-id] with the ID from Step 1 earlier in this post, and replace [basepath] with your desired URL segment (such as v1).
$ ./aws-api-import.sh --update [api-id] --deploy [basepath] /path/to/swagger/template.json
In API Gateway terminology, our [basepath] URL segment is called a stage, and defines the path through which an API is accessible.
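Because the stage name becomes the first segment of the URL, the API ID, region, stage, and honeypot name together determine both the API Gateway host name and the request path. The sketch below assumes the standard `execute-api` endpoint format; `honeypot_endpoint` is a hypothetical helper and the argument values are the examples used in this post.

```python
from typing import Tuple


def honeypot_endpoint(api_id: str, region: str,
                      stage: str, honeypot_uri: str) -> Tuple[str, str]:
    """Return the API Gateway host name and the honeypot request path."""
    host = f"{api_id}.execute-api.{region}.amazonaws.com"
    path = f"/{stage}/{honeypot_uri}/"
    return host, path


host, path = honeypot_endpoint("xxxxxxxxxx", "us-east-1", "v1", "honeypot")
print(host)  # xxxxxxxxxx.execute-api.us-east-1.amazonaws.com
print(path)  # /v1/honeypot/
```

With these example values, the path matches both the hidden link and the robots.txt entry shown earlier in this post.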
Step 4: Finish the configuration
Finish the configuration by connecting API Gateway to the CloudFront distribution:
- Create an API key, which will be used to ensure that only requests originating from CloudFront will be authorized by API Gateway.
- Associate the newly created API key with the deployed API stage. The following image shows an example console page with the API key selected and the recommended API Stage Association values.
- Find the API Gateway endpoint created by the Swagger import script. You will need this endpoint for the custom origin. Find the endpoint on the API Gateway console by clicking the name of the deployed stage, as highlighted in the following image.
- Create a new custom origin in your CloudFront distribution, using the API Gateway endpoint. The details screen in the AWS Management Console for your existing CloudFront distribution will look similar to the following image, which already contains a few distinct origins. Click Create Origin.
- As shown in the following screenshot, use the API Gateway endpoint as the Origin Domain Name. Make sure the Origin Protocol Policy is set to HTTPS Only and add the API key in the Origin Custom Headers box. Then click Create.
- Add a cache behavior that matches your base path (API Gateway stage) and honeypot URL segment. This will point traffic to the newly created custom origin. The following screenshot shows an example console screen that lists CloudFront distribution behaviors. Click Create Behavior.
- Use the value of your base path and honeypot URL to set the Path Pattern field. The honeypot URL must match the value in the robots.txt file you deploy and the API Gateway method specified. Select the Custom Origin you just created and configure additional settings, as illustrated in the following screenshot:
- Though whitelist headers are not strictly required, creating them to match the following screenshot would provide additional identification for your blocked IP notifications.
- I recommend that you customize the Object Caching policy to not cache responses from the honeypot. Set the values of Minimum TTL, Maximum TTL, and Default TTL to 0 (zero), as shown in the following screenshot.
- Register the AWS WAF web ACL with your CloudFront distribution. The General tab of your distribution (see the following screenshot) contains settings affecting the configuration of your content delivery network. Click Edit.
- Find the AWS WAF Web ACL drop-down list (see the following screenshot) and choose the correct web ACL from the list. The name of the web ACL will start with the name you assigned as the Base Resource Name when you launched the CloudFormation template earlier.
- To receive notifications when an IP address gets blocked, subscribe to the SNS topic created by CloudFormation. You can receive emails or even text messages, and you can use that opportunity to validate the blocking action and remove the IP address from the block list, if it was blocked in error. For more information about how to subscribe to SNS topics, see Subscribe to a Topic.
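If you manage your distribution through the API rather than the console, the origin and cache behavior settings from the preceding steps correspond roughly to the fragments below. This is a sketch under several assumptions: the field names are modeled on the CloudFront distribution configuration structure and should be checked against the API reference, the fragments are not a complete configuration, and the helper names and identifiers are hypothetical. The API key is sent in the `x-api-key` header, which is the header API Gateway reads API keys from.

```python
from typing import Any, Dict


def honeypot_origin(origin_id: str, api_domain: str, api_key: str) -> Dict[str, Any]:
    """Custom origin fragment: HTTPS only, with the API key as a custom header."""
    return {
        "Id": origin_id,
        "DomainName": api_domain,
        "CustomHeaders": {
            "Quantity": 1,
            "Items": [{"HeaderName": "x-api-key", "HeaderValue": api_key}],
        },
        "CustomOriginConfig": {
            "HTTPPort": 80,
            "HTTPSPort": 443,
            "OriginProtocolPolicy": "https-only",
        },
    }


def honeypot_behavior(origin_id: str, path_pattern: str) -> Dict[str, Any]:
    """Cache behavior fragment: route the honeypot path and never cache it."""
    return {
        "PathPattern": path_pattern,
        "TargetOriginId": origin_id,
        "MinTTL": 0,
        "DefaultTTL": 0,
        "MaxTTL": 0,
    }


# Placeholder identifiers and key; substitute your own values.
origin = honeypot_origin(
    "honeypot-api",
    "xxxxxxxxxx.execute-api.us-east-1.amazonaws.com",
    "example-api-key",
)
behavior = honeypot_behavior("honeypot-api", "/v1/honeypot/")
print(origin["CustomOriginConfig"]["OriginProtocolPolicy"])  # https-only
print(behavior["PathPattern"])                               # /v1/honeypot/
```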
Summary
The solution explained in this blog post helps detect content scrapers and bad bots. In most production deployments, though, this is just a component of a more comprehensive web traffic filtering strategy. AWS WAF is a highly customizable service that you can control programmatically to react faster to changing threats.
If you have comments about this blog post, please submit them in the “Comments” section below. If you have questions about or issues deploying this solution, start a new thread on the AWS WAF forum.
- Vlad