Deploy Meraki HA vMX100 in Amazon AWS

IFM supplies network engineering services for $NZ200+GST per hour. If you require assistance with designing or engineering a Cisco network - hire us!

Introduction

This guide will walk you through the process of creating an active/active HA deployment of a redundant pair of Cisco Meraki vMX100s (also known as the vMX) in the Amazon AWS environment. This requires you to have two vMX licences. This is not a warm spare configuration.

To complete this guide you will need intermediate or above skills with Amazon AWS.

Methodology

The method of operation is that you place a vMX in each of two different availability zones in the same region. Each vMX has two Amazon AWS route tables associated with it, one for when it is up and another for when it is down. On the Meraki side each vMX statically advertises all of the subnets available in Amazon AWS into AutoVPN.

A CloudWatch alarm is created that triggers if either of the two built-in Amazon AWS status checks fails. Examples of failures that will cause an alarm are:

Loss of network connectivity
Loss of system power
Software issues on the physical host
Hardware issues on the physical host that impact network reachability

The instance status check also sends an ARP packet to the vMX which it must respond to. Failure to respond usually means the software has crashed or stopped running.

An Amazon AWS Lambda script then checks every minute that the "Instance State" of each vMX is "running" and that its "Alarm Status" is "OK". If both checks pass then the "UP" Amazon AWS route table is associated with the related Amazon AWS subnets, otherwise the "DOWN" Amazon AWS route table is associated with them.
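The decision described above can be sketched as a small pure function. This is an illustration only: the function name and its arguments are hypothetical, and the state strings are assumed to match what the EC2 and CloudWatch APIs report ("running" and "OK").

```python
def select_route_table(instance_state, alarm_state, up_table, down_table):
    """Return the route table that a vMX's subnets should use.

    The vMX is treated as healthy only when the instance is running
    AND its status check alarm is in the OK state.
    """
    healthy = instance_state == "running" and alarm_state == "OK"
    return up_table if healthy else down_table


# "UP" and "DOWN" stand in for the two route tables of one vMX
print(select_route_table("running", "OK", "UP", "DOWN"))
print(select_route_table("running", "ALARM", "UP", "DOWN"))
print(select_route_table("stopped", "OK", "UP", "DOWN"))
```

Note that with this logic any alarm state other than "OK", including "INSUFFICIENT_DATA", results in the "DOWN" route table being chosen.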

Let's take an example. Say you have four subnets in Amazon AWS: 10.32.0.0/24, 10.32.1.0/24, 10.32.2.0/24 and 10.32.3.0/24. In your Meraki network all of your spokes are in the supernet 192.168.64.0/19. You have two vMXs called vMX1 and vMX2.

In normal operation (using the "UP" Amazon AWS route table) 10.32.0.0/24 and 10.32.1.0/24 have a route for 192.168.64.0/19 via vMX1, and 10.32.2.0/24 and 10.32.3.0/24 have a route for 192.168.64.0/19 via vMX2. Let's say vMX1 now fails. The "DOWN" route table for vMX1 will now be applied to the Amazon AWS subnets 10.32.0.0/24 and 10.32.1.0/24, and the route for 192.168.64.0/19 in the DOWN route table points to vMX2. Now all subnets will route their traffic via vMX2.
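The example can be modelled in a few lines of Python to see which vMX each subnet uses as the next hop for the spoke supernet. This is a simulation of the routing outcome only, with hypothetical names (PRIMARY, PEER, next_hops); it does not talk to Amazon AWS.

```python
import ipaddress

SPOKE_SUPERNET = ipaddress.ip_network("192.168.64.0/19")

# Which vMX each AWS subnet normally uses (its "UP" route table)
PRIMARY = {
    "10.32.0.0/24": "vMX1",
    "10.32.1.0/24": "vMX1",
    "10.32.2.0/24": "vMX2",
    "10.32.3.0/24": "vMX2",
}
PEER = {"vMX1": "vMX2", "vMX2": "vMX1"}


def next_hops(failed=()):
    """Return the next hop towards the spoke supernet for each AWS subnet."""
    hops = {}
    for subnet, primary in PRIMARY.items():
        # The "DOWN" route table of a vMX points at the other vMX
        hops[subnet] = PEER[primary] if primary in failed else primary
    return hops


print(next_hops())                  # normal operation
print(next_hops(failed={"vMX1"}))   # vMX1 has failed
# First and last address covered by the /19 supernet
print(SPOKE_SUPERNET[0], SPOKE_SUPERNET[-1])
```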

On the Meraki side, when the Dashboard registers that vMX1 has gone down, the routes via vMX1 will be marked as down, leaving only the routes advertised via vMX2 active.

Typically you would have half of your Amazon AWS subnets set to use one vMX, and the other half using the second vMX, so the load is split.

Implementation

Deploy vMX

Amazon AWS has something called "availability zones" within a region. You can think of each availability zone as a separate DC. In my closest region the availability zones have names like ap-southeast-2a, ap-southeast-2b, etc. I am going to refer to these availability zones as 2A, 2B, etc. Your availability zones might be called 1A, 1B, etc. You'll need to substitute in your names in the following examples.

Start by selecting the two availability zones you are going to deploy the vMXs into. Try to put the vMXs into the availability zones where the bulk of your servers are located. For this guide I have selected availability zones 2A and 2B.

It is also really important that everything goes into the same VPC. Let's collect some information to make it easier later on. Go to Services/VPC/Your VPCs. Note down the VPC ID of the VPC you are using for your environment. The vMXs need to use this same VPC.

The vMXs should be placed into separate networks in the Cisco Meraki Dashboard.

Now follow the "vMX100 Setup Guide for Amazon AWS", HOWEVER do not do the section under "Additional VPC Configuration" which discusses configuring route tables. Put a vMX into each of the two availability zones you have already chosen. Make sure they are using the same VPC as above. I like to name my vMXs after the zones they are in, so VMX-2A and VMX-2B. You can choose whatever names you like, but substitute your names for the examples I am using in this guide. Please give your two new vMXs names in the Amazon AWS EC2 console. It makes things much easier later on if you can see the name of each vMX rather than referring to it by various IDs.
https://documentation.meraki.com/MX/MX_Installation_Guides/vMX_Setup_Guide_for_Amazon_Web_Services_(AWS)

Configure EC2 Alarms

Now we need to set up an alarm to trigger if either of the two built-in Amazon AWS status checks fails. The status checks themselves are done every 60 seconds, however alarms are only generated every 5 minutes by default. We are going to reduce that down to 1 minute. In the Amazon EC2 console select the VMX-2A instance and go to the "Monitoring" tab down the bottom. Click on the link labelled "Enable Detailed Monitoring".

Now click on the "Status Check" tab. Click the "Create Status Check" button. By default "Send a notification" is ticked. Untick this. We are not going to do notifications, but if you are keen you can come back to this later and configure email or TXT notifications. Halfway down it says "For at least". Change the number from 2 to 1. For "consecutive period(s) of" it should say "1 Minute". If it still says "5 minutes" the detailed monitoring has not kicked in yet. Reload your browser and repeat this process until it does say "1 Minute". If it is still not kicking in, go and have a coffee and come back in 5 minutes. For the "Name of Alarm" I would call it VMX-2A-STATUS-CHECK. This will make Amazon CloudWatch trigger an alarm after 1 minute if either status check fails.

Repeat the above process for VMX-2B.
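If you prefer to script this step, a similar alarm can probably be created with the boto3 CloudWatch call put_metric_alarm. The sketch below only builds the parameters. It assumes the alarm watches the AWS/EC2 "StatusCheckFailed" metric (which is non-zero when either status check fails) and may not match exactly what the console creates; the helper name is hypothetical and 'i-xxxxxx' is a placeholder.

```python
def status_check_alarm_params(instance_id, alarm_name):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/EC2",
        # Non-zero when the system or the instance status check has failed
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # "consecutive period(s) of" 1 Minute
        "EvaluationPeriods": 1,  # "For at least" 1
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


# 'i-xxxxxx' is a placeholder for the instance ID of VMX-2A
params = status_check_alarm_params("i-xxxxxx", "VMX-2A-STATUS-CHECK")
print(params)

# With AWS credentials configured, the alarm could then be created with:
#   import boto3
#   cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")
#   cloudwatch.put_metric_alarm(**params)
```

The alarm name matters because the Lambda script later in this guide looks the alarm up by that name.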

All going well (within 1 minute) you should now see the column in the Amazon EC2 console labelled "Alarm Status" say "OK". If you select either vMX and go to the monitoring tab down the bottom it should say "CloudWatch Alarms: 1 of 1 in OK".

I personally feel the status checks are sufficient for detecting the large majority of likely failure cases. However if you want to learn more about CloudWatch alarms you can add other failure criteria as well, such as "Network Packets Out" being "<=" 0 (if the vMX is not generating any network packets something is bad), or a whole heap of other options. But be careful about making the failure cases too complex, as it is easy to mark something as failed when it is actually working. You also need to consider complex cases such as what the vMX might do during a firmware upgrade.

You can watch a YouTube video of the alarms being configured with the link below.
https://youtu.be/tOUinDtTG4c

Configure VPC Route Tables

We are now going to make UP and DOWN route tables. For this example I am going to assume we are working with subnets that have Internet access. If you have a more complex Amazon AWS network configuration then you'll need to adjust your Amazon AWS route tables to suit your environment.

Navigate to Services/VPC/Route Tables. Create a route table called "VMX-2A-UP". Select the VPC ID you noted down before. Select this route table, go to the "routes" tab and select "Edit". Add routes for your AutoVPN destinations. For "Target" select "Instance" and then select "VMX-2A". You also need to add a default route "0.0.0.0/0". The "Target" will be "Internet Gateway". You should only have one "Internet Gateway" to choose.

Repeat the process but create a route table called "VMX-2A-DOWN". This will be used if VMX-2A is down. This time when you add the AutoVPN routes make the "Target" VMX-2B. We want all AutoVPN traffic to go via VMX-2B if VMX-2A is down.

Create a route table called "VMX-2B-UP". Select the VPC ID you noted down before. Select this route table, go to the "routes" tab and select "Edit". Add routes for your AutoVPN destinations. For "Target" select "Instance" and then select "VMX-2B". You also need to add a default route "0.0.0.0/0". The "Target" will be "Internet Gateway". You should only have one "Internet Gateway" to choose.

Repeat the process but create a route table called "VMX-2B-DOWN". This will be used if VMX-2B is down. This time when you add the AutoVPN routes make the "Target" VMX-2A. We want all AutoVPN traffic to go via VMX-2A if VMX-2B is down.
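The four route tables differ only in which vMX the AutoVPN routes point at. The sketch below lays that out as data. The helper name and all IDs are hypothetical placeholders; the keys in each route ("DestinationCidrBlock", "InstanceId", "GatewayId") are chosen to match the parameter names of the boto3 EC2 create_route call in case you want to script the step.

```python
def build_route_tables(autovpn_cidrs, internet_gateway_id, vmx_a, vmx_b):
    """Return {route table name: [route, ...]} for the UP and DOWN tables.

    vmx_a and vmx_b are (name, instance_id) tuples.
    """
    def routes(instance_id):
        # AutoVPN destinations go via the given vMX instance
        vpn = [{"DestinationCidrBlock": cidr, "InstanceId": instance_id}
               for cidr in autovpn_cidrs]
        # Everything else goes to the Internet Gateway
        default = {"DestinationCidrBlock": "0.0.0.0/0",
                   "GatewayId": internet_gateway_id}
        return vpn + [default]

    (name_a, id_a), (name_b, id_b) = vmx_a, vmx_b
    return {
        f"{name_a}-UP": routes(id_a),
        f"{name_a}-DOWN": routes(id_b),  # fail over to the other vMX
        f"{name_b}-UP": routes(id_b),
        f"{name_b}-DOWN": routes(id_a),  # fail over to the other vMX
    }


# Placeholder IDs; 192.168.64.0/19 is the spoke supernet from the earlier example
tables = build_route_tables(["192.168.64.0/19"], "igw-xxxxxx",
                            ("VMX-2A", "i-aaaaaa"), ("VMX-2B", "i-bbbbbb"))
for name, table_routes in tables.items():
    print(name, table_routes)
```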

You can now perform some manual checks of your Amazon AWS route tables. Select "Services/VPC/Subnets". You want half the subnets to use VMX-2A and the other half to use VMX-2B. You also want the subnet that VMX-2A is located in to use VMX-2A and vice versa for VMX-2B.

Make a note of the "Subnet ID" and which vMX you are associating each with. We need this later for the Amazon AWS Lambda script.

Considering these things, select one subnet at a time, and then go "Actions/Edit Route Table Association". For the "Route Table ID" select VMX-2A-UP or VMX-2B-UP as appropriate. At this point in time you should have full connectivity via AutoVPN to Amazon subnets and vice versa (assuming firewall rules allow the traffic).

As a test you should be able to change a subnet's route table association to the "DOWN" route table and have everything keep working. Change it back to the "UP" route table when done. This screen also acts as a visual reference as to what state the system believes it is in with regard to vMX failures and routing.
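The same view can be produced from the API. The sketch below takes a response in the shape returned by the boto3 EC2 call describe_route_tables and reports which route table each subnet is explicitly associated with. The function name is hypothetical and the sample response is hand-written with placeholder IDs.

```python
def subnet_associations(response):
    """Map subnet ID -> route table ID from a describe_route_tables response."""
    mapping = {}
    for table in response.get("RouteTables", []):
        for assoc in table.get("Associations", []):
            # The association of the main route table has no SubnetId
            subnet_id = assoc.get("SubnetId")
            if subnet_id:
                mapping[subnet_id] = table["RouteTableId"]
    return mapping


# Hand-written sample in the shape boto3 returns (all IDs are placeholders)
sample = {
    "RouteTables": [
        {"RouteTableId": "rtb-up",
         "Associations": [
             {"RouteTableAssociationId": "rtbassoc-1", "SubnetId": "subnet-aaaa"},
             {"RouteTableAssociationId": "rtbassoc-2", "SubnetId": "subnet-bbbb"},
         ]},
        {"RouteTableId": "rtb-main",
         "Associations": [
             {"RouteTableAssociationId": "rtbassoc-3", "Main": True},
         ]},
    ]
}
print(subnet_associations(sample))

# Against a real VPC the response could be fetched with:
#   import boto3
#   client = boto3.client("ec2", region_name="ap-southeast-2")
#   response = client.describe_route_tables(
#       Filters=[{"Name": "vpc-id", "Values": ["vpc-xxxxxx"]}])
```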

You can watch a YouTube video of the route tables being configured with the link below.
https://youtu.be/-dDbUDCpixo

Configure Amazon IAM Security Role

We need to create an IAM role that grants the permissions the Lambda script needs. Go to "Services/IAM/Policies". Click "Create Policy". Click on the JSON tab and replace everything in it with the below.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:ReplaceRouteTableAssociation",
                "ec2:DescribeInstances",
                "ec2:DescribeVpcs",
                "ec2:DescribeSubnets",
                "ec2:DescribeRouteTables"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "cloudwatch:DescribeAlarms",
            "Resource": "arn:aws:cloudwatch:*:*:alarm:*"
        }
    ]
}

Click "Review Policy" at the bottom. Call the policy "vMX-HA". Click "Create Policy".

Click on "Roles" on the left hand side. Click on "Create Role". Select "Lambda". Click "Next: Permissions". In the search box type in "vMX" and select the policy that we just created. Click "Next: Tags". Click "Next: Review". For the "Role Name" type in "vMX-HA". Click "Create Role".
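Selecting "Lambda" in the console sets the role's trust policy so that the Lambda service is allowed to assume the role. If you script the role creation instead, you have to supply that trust policy yourself. A minimal sketch, with a hypothetical variable name:

```python
import json

# Trust policy that lets the AWS Lambda service assume the role
LAMBDA_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

trust_policy_json = json.dumps(LAMBDA_TRUST_POLICY, indent=4)
print(trust_policy_json)

# With AWS credentials configured, the role could then be created with:
#   import boto3
#   iam = boto3.client("iam")
#   iam.create_role(RoleName="vMX-HA",
#                   AssumeRolePolicyDocument=trust_policy_json)
# The "vMX-HA" policy would still need to be attached to the role afterwards.
```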

You can watch a YouTube video of the IAM security role being configured with the link below.
https://youtu.be/sItvvKzRlFQ

Configure Amazon AWS Lambda Script

Go Services/Lambda. Click "Create Function". Select the default option of "Author from Scratch". For the name enter "vMX-HA". For the runtime choose "Python 3.7". Under permissions expand "Choose or create an execution role". Choose "Use an existing role". Choose "vMX-HA" for the existing role. Click "Create Function".

Under "Designer" select "CloudWatch Events". Scroll down to the rule settings. Select "Create a new rule". Enter "Every-Minute" for the rule name. Under "Rule Type" leave it set to "Schedule expression". Enter the following expression:

rate(1 minute)

Click "Add".
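One detail of rate expressions that is easy to trip over is that the unit must be singular when the value is 1 and plural otherwise, so "rate(1 minute)" is valid but "rate(1 minutes)" is not. The hypothetical helper below builds a valid expression if you ever want a different interval.

```python
def rate_expression(value, unit="minute"):
    """Build a rate() schedule expression such as 'rate(1 minute)'."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour' or 'day'")
    if not isinstance(value, int) or value < 1:
        raise ValueError("value must be a positive integer")
    # Singular unit for a value of 1, plural for anything larger
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


print(rate_expression(1))
print(rate_expression(5))
```

If you script the rule with boto3, the resulting string is what would be passed as the ScheduleExpression argument of the put_rule call.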

Click "vMX-HA" at the top to get to the code editor. Copy and paste the below script into the code window. Change the placeholder entries (the region, and every i-xxxxxx, subnet-xxxxxx and rtb-xxxxxx value) to match your configuration.


import boto3
import json

# Change this to be the same as the region you are using
region='ap-southeast-2'

ec2 = boto3.resource('ec2',region_name=region)
cloudwatch = boto3.resource('cloudwatch',region_name=region)
client = boto3.client('ec2',region_name=region)

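def change_subnet_routetable(subnetID,rtID):
  # Find the route table this subnet is currently associated with
  response = client.describe_route_tables(
      Filters=[{'Name': 'association.subnet-id','Values': [subnetID]}]
  )
  # A route table can be associated with several subnets, so pick the
  # association belonging to this subnet rather than the first one listed
  for rt in response['RouteTables']:
    for assoc in rt['Associations']:
      if assoc.get('SubnetId')==subnetID:
        # Only call the API when the subnet is not already using rtID
        if rt['RouteTableId']!=rtID:
          ec2.RouteTableAssociation(assoc['RouteTableAssociationId']).replace_subnet(RouteTableId=rtID)
        return
  print("No explicit route table association found for "+subnetID)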

def lambda_handler(event, context):
  # Change this to be the instance ID of VMX-2A
  vmx2a = ec2.Instance('i-xxxxxx')
  vmx2aalarm = cloudwatch.Alarm('VMX-2A-STATUS-CHECK')
  # Change this to be the instance ID of VMX-2B
  vmx2b = ec2.Instance('i-xxxxxx')
  vmx2balarm = cloudwatch.Alarm('VMX-2B-STATUS-CHECK')
  if vmx2a.state['Name']=="running" and vmx2aalarm.state_value=='OK':
    print("VMX2A UP")
    # Add one of these lines for every subnet you have.  Change "rtb" to be the ID for the route table VMX-2A-UP
    change_subnet_routetable('subnet-xxxxxx','rtb-xxxxxx')
  else:
    print("VMX2A DOWN")
    # Add one of these lines for every subnet you have.  Change "rtb" to be the ID for the route table VMX-2A-DOWN
    change_subnet_routetable('subnet-xxxxxx','rtb-xxxxxx')
  if vmx2b.state['Name']=="running" and vmx2balarm.state_value=='OK':
    print("VMX2B UP")
    # Add one of these lines for every subnet you have.  Change "rtb" to be the ID for the route table VMX-2B-UP
    change_subnet_routetable('subnet-xxxxxx','rtb-xxxxxx')
  else:
    print("VMX2B DOWN")
    # Add one of these lines for every subnet you have.  Change "rtb" to be the ID for the route table VMX-2B-DOWN
    change_subnet_routetable('subnet-xxxxxx','rtb-xxxxxx')

Click on "Save". Click on "Actions" at the top and then "Publish new version". I tend to name the versions after the date and time, such as "2019-04-14-16:22".

If you make any changes to the script make sure you click "Save" and publish the new version again.
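As the number of subnets grows, the repeated change_subnet_routetable lines can become hard to maintain. One possible variation, which is not part of the script above, keeps the subnet and route table IDs in a dictionary and works out the required associations from it. All names and IDs below are hypothetical placeholders.

```python
# Hypothetical configuration: every ID below is a placeholder
VMX_CONFIG = {
    "VMX-2A": {"up": "rtb-2a-up", "down": "rtb-2a-down",
               "subnets": ["subnet-aaaa", "subnet-bbbb"]},
    "VMX-2B": {"up": "rtb-2b-up", "down": "rtb-2b-down",
               "subnets": ["subnet-cccc", "subnet-dddd"]},
}


def plan_associations(config, healthy):
    """Return [(subnet_id, route_table_id), ...] for the current health.

    healthy maps each vMX name to True (up) or False (down).
    """
    plan = []
    for name, entry in config.items():
        # Healthy vMX: its subnets use the UP table, otherwise the DOWN table
        table = entry["up"] if healthy[name] else entry["down"]
        plan.extend((subnet, table) for subnet in entry["subnets"])
    return plan


# Example: VMX-2A has failed, VMX-2B is healthy
for subnet, table in plan_associations(VMX_CONFIG, {"VMX-2A": False, "VMX-2B": True}):
    print(subnet, "->", table)
```

Inside lambda_handler each pair returned could then be passed to change_subnet_routetable, with the health of each vMX worked out from the instance state and alarm state as in the script above.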

You can watch a YouTube video of configuring the Lambda script with the link below.
https://youtu.be/mpoXfCqTa6c

You are now all done!

You can watch a YouTube video of a sample failover occurring with the link below.
https://youtu.be/mnKCazMh3pw