The Difficulty of Managing AWS Security Groups with Terraform

Jeremy | Security & Compliance, Announcements

Cloud Posse recently overhauled its Terraform module for managing security groups and rules. We rely on this module to provide a consistent interface for managing AWS security groups and associated security group rules across our Open Source Terraform modules.

This new module is simple to use, but under the hood it is quite complex because it has to handle numerous interrelationships, restrictions, and a few bugs. It offers a choice between two behaviors: zero service interruption when updating a security group that is not referenced by other security groups (by replacing the security group with a new one), or brief service interruptions for security groups that must be preserved. Another enhancement is that you can now provide the ID of an existing security group to modify; by default, the module creates a new security group and applies the given rules to it.

Avoiding Service Interruptions

It is desirable to avoid having service interruptions when updating a security group. This is not always possible due to the way Terraform organizes its activities and the fact that AWS will reject an attempt to create a duplicate of an existing security group rule. There is also the issue that while most AWS resources can be associated with and disassociated from security groups at any time, there remain some that may not have their security group association changed, and an attempt to change their security group will cause Terraform to delete and recreate the resource.

The 2 Ways Security Group Changes Cause Service Interruptions

Changes to a security group can cause service interruptions in 2 ways:

  1. Changing rules may be implemented as deleting existing rules and creating new ones. During the period between deleting the old rules and creating the new rules, the security group will block traffic intended to be allowed by the new rules.
  2. Changing rules may be implemented as creating a new security group with the new rules and replacing the existing security group with the new one (then deleting the old one). This usually works with no service interruption when all resources referencing the security group are part of the same Terraform plan. However, if, for example, the security group ID is referenced in a security group rule in a security group that is not part of the same Terraform plan, then AWS will not allow the existing (referenced) security group to be deleted, and even if it did, Terraform would not know to update the rule to reference the new security group.

The key question you need to answer to decide which configuration to use is “Will anything break if the security group ID changes?” If not, then use the defaults, create_before_destroy = true and preserve_security_group_id = false, and do not worry about providing “keys” for security group rules. This is the default because it is the easiest and safest solution when the way the security group is being used allows it.
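As a rough sketch, the default configuration might look like the following. The registry source, the name and vpc_id inputs, and the CIDR are assumptions made for illustration; check the module's documentation for the exact inputs and pin a version in real use.

```hcl
# Hypothetical variable supplying the VPC in which the security group lives.
variable "vpc_id" { type = string }

module "app_sg" {
  # Assumption: registry source of the Cloud Posse module; pin a version in real use.
  source = "cloudposse/security-group/aws"

  name   = "app" # hypothetical name
  vpc_id = var.vpc_id

  # The defaults described above, spelled out only for clarity.
  create_before_destroy      = true
  preserve_security_group_id = false

  # No keys are needed in this configuration.
  rules = [
    {
      type        = "ingress"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["10.0.0.0/8"] # illustrative CIDR
      description = "HTTPS from the internal network"
    }
  ]
}
```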

If things will break when the security group ID changes, then set preserve_security_group_id to true. Also read and follow the guidance below about keys and limiting Terraform security group rules to a single AWS security group rule if you want to mitigate against service interruptions caused by rule changes. Note that even in this case, you probably want to keep create_before_destroy = true because otherwise, if some change requires the security group to be replaced, Terraform will likely succeed in deleting all the security group rules but fail to delete the security group itself, leaving the associated resources completely inaccessible. At least with create_before_destroy = true, the new security group will be created and used where Terraform can make the changes, even though the old security group will still fail to be deleted.

The 3 Ways to Mitigate Against Service Interruptions

Security Group create_before_destroy = true

The most important option is create_before_destroy which, when set to true (the default), ensures that a new replacement security group is created before an existing one is destroyed. This is particularly important because a security group cannot be destroyed while it is associated with a resource (e.g. a load balancer), but “destroy before create” behavior causes Terraform to try to destroy the security group before disassociating it from associated resources, so plans fail to apply with the error:

Error deleting security group: DependencyViolation: resource sg-XXX has a dependent object

With “create before destroy” set, and any resources dependent on the security group as part of the same Terraform plan, replacement happens successfully:

  1. New security group is created
  2. Resource is associated with the new security group and disassociated from the old one
  3. Old security group is deleted successfully because there is no longer anything associated with it

(If a resource is dependent on the security group and is also outside the scope of the Terraform plan, the old security group will fail to be deleted and you will have to address the dependency manually.)
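The replacement sequence above can be sketched with plain AWS provider resources. This is only an illustration of the general Terraform pattern, not the module's internals; the resource names and variables are hypothetical.

```hcl
# Hypothetical inputs for the sketch.
variable "vpc_id" { type = string }
variable "subnet_ids" { type = list(string) }

resource "aws_security_group" "lb" {
  # name_prefix lets the old and new security groups coexist briefly
  # without colliding on the name.
  name_prefix = "lb-"
  vpc_id      = var.vpc_id

  lifecycle {
    # Create the replacement first, so the load balancer can be moved
    # to it before the old security group is deleted.
    create_before_destroy = true
  }
}

resource "aws_lb" "app" {
  name               = "app"
  load_balancer_type = "application"
  subnets            = var.subnet_ids

  # Because this reference is in the same plan, Terraform updates the
  # association before it tries to delete the old security group.
  security_groups = [aws_security_group.lb.id]
}
```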

Note that the module's default configuration of create_before_destroy = true and preserve_security_group_id = false will force the “create before destroy” behavior on the target security group, even if the module did not create it and instead you provided a target_security_group_id.
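Supplying an existing security group might look like the sketch below. It assumes that target_security_group_id takes a list holding at most one ID and that vpc_id is still required; both are assumptions to verify against the module's documentation, and the ID shown is a placeholder.

```hcl
# Hypothetical variable supplying the VPC of the existing security group.
variable "vpc_id" { type = string }

module "existing_sg_rules" {
  # Assumption: registry source of the Cloud Posse module; pin a version in real use.
  source = "cloudposse/security-group/aws"

  vpc_id = var.vpc_id

  # Assumption: a list with at most one element. Placeholder ID.
  target_security_group_id = ["sg-0123456789abcdef0"]

  rules = [
    {
      type        = "ingress"
      from_port   = 22
      to_port     = 22
      protocol    = "tcp"
      cidr_blocks = ["10.0.0.0/8"] # illustrative CIDR
      description = "SSH from the internal network"
    }
  ]
}
```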

Unfortunately, creating a new security group is not enough to prevent a service interruption. Keep reading for more on that.

Setting Rule Changes to Force Replacement of the Security Group

A security group by itself is just a container for rules. It only functions as desired when all the rules are in place. If using the Terraform default “destroy before create” behavior for rules, even when using create_before_destroy for the security group itself, an outage occurs when updating the rules or security group because the order of operations is:

  1. Delete existing security group rules (triggering a service interruption)
  2. Create the new security group
  3. Associate the new security group with resources and disassociate the old one (which can take a substantial amount of time for a resource like a NAT Gateway)
  4. Create the new security group rules (restoring service)
  5. Delete the old security group

To resolve this issue, the module's default configuration of create_before_destroy = true and preserve_security_group_id = false causes any change in the security group rules to trigger the creation of a new security group. With that, a rule change causes operations to occur in this order:

  1. Create the new security group
  2. Create the new security group rules
  3. Associate the new security group with resources and disassociate the old one
  4. Delete the old security group rules
  5. Delete the old security group
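One way to get this ordering with plain resources is to tie the security group's name to a fingerprint of its rules, so that any rule change forces a new security group. The sketch below illustrates that general technique under stated assumptions; it is not presented as the module's actual implementation, and all names in it are hypothetical.

```hcl
terraform {
  required_providers {
    aws    = { source = "hashicorp/aws" }
    random = { source = "hashicorp/random" }
  }
}

variable "vpc_id" { type = string }

locals {
  ingress_cidrs = ["10.0.0.0/8"] # illustrative rule data
}

# Regenerated whenever the encoded rules change.
resource "random_id" "rules_version" {
  byte_length = 4
  keepers = {
    rules = jsonencode(local.ingress_cidrs)
  }
}

resource "aws_security_group" "app" {
  # A changed name forces replacement of the security group.
  name   = "app-${random_id.rules_version.hex}"
  vpc_id = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_security_group_rule" "https" {
  for_each = toset(local.ingress_cidrs)

  security_group_id = aws_security_group.app.id
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = [each.value]

  lifecycle {
    # New rules are created on the new group before old rules are removed.
    create_before_destroy = true
  }
}
```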

Preserving the Security Group

There can be a downside to creating a new security group with every rule change. If you want to prevent the security group ID from changing unless absolutely necessary, perhaps because the associated resource does not allow the security group to be changed or because the ID is referenced somewhere (like in another security group's rules) outside of this Terraform plan, then you need to set preserve_security_group_id to true.

The main drawback of this configuration is that there will normally be a service outage during an update because existing rules will be deleted before replacement rules are created. Using keys to identify rules can help limit the impact, but even with keys, simply adding a CIDR to the list of allowed CIDRs will cause that entire rule to be deleted and recreated, causing a temporary access denial for all of the CIDRs in the rule. (For more on this and how to mitigate against it, see The Importance of Keys below.)

Also, note that setting preserve_security_group_id to true does not prevent Terraform from replacing the security group when modifying it is not an option, such as when its name or description changes. However, if you can control the configuration adequately, you can maintain the security group ID and eliminate the impact on other security groups by setting preserve_security_group_id to true. We still recommend leaving create_before_destroy set to true for the times when the security group must be replaced to avoid the DependencyViolation described above.
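A configuration that preserves the security group ID might look like this sketch. The source, name, vpc_id, key values, and CIDRs are illustrative assumptions; each rule carries a key and a single CIDR, in line with the guidance on keys.

```hcl
# Hypothetical variable supplying the VPC.
variable "vpc_id" { type = string }

module "shared_sg" {
  # Assumption: registry source of the Cloud Posse module; pin a version in real use.
  source = "cloudposse/security-group/aws"

  name   = "shared" # hypothetical name
  vpc_id = var.vpc_id

  # Keep the ID stable unless replacement is unavoidable.
  preserve_security_group_id = true
  # Still recommended for the cases where replacement is forced.
  create_before_destroy = true

  rules = [
    {
      key         = "https-office" # stable, known at plan time
      type        = "ingress"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["203.0.113.0/24"]
      description = "HTTPS from the office"
    },
    {
      key         = "https-vpn"
      type        = "ingress"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = ["198.51.100.0/24"]
      description = "HTTPS from the VPN"
    }
  ]
}
```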

Defining Security Group Rules

We provide several different ways to define rules for the security group for a few reasons:

  • Terraform type constraints make it difficult to create collections of objects with optional members
  • Terraform resource addressing can cause resources that did not actually change to be nevertheless replaced (deleted and recreated), which, in the case of security group rules, then causes a brief service interruption
  • Terraform resource addresses must be known at plan time, making it challenging to create rules that depend on resources being created during apply and at the same time are not replaced needlessly when something else changes
  • When Terraform rules can be successfully created before being destroyed, there is no service interruption for the resources associated with that security group (unless the security group ID is used in other security group rules outside of the scope of the Terraform plan)

The Importance of Keys

If you are relying on the “create before destroy” behavior for the security group and security group rules, you can skip this section and much of the discussion about keys in the later sections because keys do not matter in this configuration. However, if you are using the “destroy before create” behavior, a full understanding of keys applied to security group rules will help you minimize service interruptions due to changing rules.

When creating a collection of resources, Terraform requires each resource to be identified by a key so that each resource has a unique “address” and Terraform uses these keys to track changes to resources. Every security group rule input to this module accepts optional identifying keys (arbitrary strings) for each rule. If you do not supply keys, then the rules are treated as a list, and the index of the rule in the list will be used as its key. Note that not supplying keys, therefore, has the unwelcome behavior that removing a rule from the list will cause all the rules later in the list to be destroyed and recreated. For example, changing [A, B, C, D] to [A, C, D] causes rules 1(B), 2(C), and 3(D) to be deleted and new rules 1(C) and 2(D) to be created.

We allow you to specify keys (arbitrary strings) for each rule to mitigate this problem. (Exactly how you specify the key is explained in the next sections.) Going back to our example, if the initial set of rules were specified with keys, e.g. [{A: A}, {B: B}, {C: C}, {D: D}], then removing B from the list would only cause B to be deleted, leaving C and D intact.

Note, however, two cautions. First, the keys must be known at terraform plan time and therefore cannot depend on resources that will be created during apply. Second, in order to be helpful, the keys must remain consistently attached to the same rules. For example, if you did the following:

rule_map = { for i, v in rule_list : i => v }

Then you would merely have recreated the initial problem of using a plain list, because each key is still just the rule's position in the list. If you cannot attach meaningful keys to the rules, there is no advantage to specifying keys at all.
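To make the contrast concrete, here is a small sketch that builds both kinds of map from the same list. The name attribute and the rule values are hypothetical; the point is only that a key taken from a value that travels with the rule stays attached to it.

```hcl
locals {
  # Hypothetical rule data; "name" is an illustrative attribute.
  rule_list = [
    { name = "https", port = 443, cidr = "10.0.0.0/8" },
    { name = "ssh", port = 22, cidr = "10.1.0.0/16" },
  ]

  # Index keys ("0", "1", ...): removing the first element shifts every
  # later key, so this is no better than a plain list.
  rules_by_index = { for i, v in local.rule_list : i => v }

  # Stable keys: each key stays with its rule no matter what is
  # added or removed elsewhere in the list.
  rules_by_name = { for v in local.rule_list : v.name => v }
}
```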

Avoid One Terraform Rule = Many AWS Rules

A single security group rule input can actually specify multiple security group rules. For example, ipv6_cidr_blocks takes a list of CIDRs. However, AWS security group rules do not allow for a list of CIDRs, so the AWS Terraform provider converts that list of CIDRs into a list of AWS security group rules, one for each CIDR. (This is the underlying cause of several AWS Terraform provider bugs, such as #25173.) As of this writing, any change to any element of such a rule will cause all the AWS rules specified by the Terraform rule to be deleted and recreated, causing the same kind of service interruption we sought to avoid by providing keys for the rules, or, when create_before_destroy = true, causing a complete failure as Terraform tries to create duplicate rules which AWS rejects. To guard against this issue, when not using the default behavior, you should avoid the convenience of specifying multiple AWS rules in a single Terraform rule and instead create a separate Terraform rule for each source or destination specification.
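As a sketch of this guidance, the following builds one Terraform rule per CIDR, each with its own key, instead of one rule holding a list of CIDRs. The map of CIDRs and the key format are assumptions for illustration; the resulting list would be passed to the module's rules input.

```hcl
locals {
  # Hypothetical sources, keyed by names that are known at plan time.
  allowed_cidrs = {
    office = "203.0.113.0/24"
    vpn    = "198.51.100.0/24"
  }

  # One rule object per CIDR, so adding or removing a CIDR touches
  # only the rule for that CIDR.
  https_rules = [
    for name, cidr in local.allowed_cidrs : {
      key         = "https-${name}"
      type        = "ingress"
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = [cidr] # exactly one CIDR per rule
      description = "HTTPS from ${name}"
    }
  ]
}
```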

rules and rules_map inputs

This module provides 3 ways to set security group rules: the rules and rules_map inputs described here, and the rule_matrix input. You can use any or all of them at the same time.

The easy way to specify rules is via the rules input. It takes a list of rules. (We will define a rule a bit later.) The problem is that a Terraform list must be composed of elements of the exact same type, and rules can be any of several different Terraform types. So to get around this restriction, the second way to specify rules is via the rules_map input, which is more complex.

Why is the input so complex?

The rules_map input takes an object.

  • The attribute names (keys) of the object can be anything you want, but they need to be known during terraform plan, which means they cannot depend on any resources created or changed by Terraform.
  • The values of the attributes are lists of rule objects, each representing one Security Group Rule. As explained above, a Terraform list must be composed of elements of exactly the same type, so each object in a list must be exactly the same type. To use multiple types, you must put them in separate lists and put the lists in a map with distinct keys.
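A rules_map value might look like the sketch below, with rules of different shapes kept in separate lists under distinct keys. The map keys (cidr_rules, self_rules), the module inputs name and vpc_id, and the rule values are illustrative assumptions.

```hcl
# Hypothetical variable supplying the VPC.
variable "vpc_id" { type = string }

module "sg" {
  # Assumption: registry source of the Cloud Posse module; pin a version in real use.
  source = "cloudposse/security-group/aws"

  name   = "app" # hypothetical name
  vpc_id = var.vpc_id

  rules_map = {
    # Every object in this list has the same set of attributes.
    cidr_rules = [
      {
        key         = "https-internal"
        type        = "ingress"
        from_port   = 443
        to_port     = 443
        protocol    = "tcp"
        cidr_blocks = ["10.0.0.0/8"]
        description = "HTTPS from the internal network"
      }
    ]
    # A differently shaped rule goes in its own list.
    self_rules = [
      {
        key         = "intra-group"
        type        = "ingress"
        from_port   = 0
        to_port     = 65535
        protocol    = "tcp"
        self        = true
        description = "All TCP between members of this security group"
      }
    ]
  }
}
```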

Definition of a Rule

For our module, a rule is defined as an object. The attributes and values of the rule objects are fully compatible with the Terraform aws_security_group_rule resource (they have the same keys and accept the same values), except:

  • The security_group_id will be ignored, if present
  • You can include an optional key attribute. Its value must be unique among all security group rules in the security group, and it must be known in the Terraform “plan” phase, meaning it cannot depend on anything being generated or created by Terraform.

If provided, the key attribute value will be used to identify the Security Group Rule to Terraform to prevent Terraform from modifying it unnecessarily. If the key is not provided, Terraform will assign an identifier based on the rule's position in its list, which can cause a ripple effect of rules being deleted and recreated if a rule gets deleted from the start of a list, causing all the other rules to shift position. See “Unexpected changes…” below for more details.

Important Notes

Unexpected changes during plan and apply

When configuring this module for “create before destroy” behavior, any change to a security group rule will cause an entirely new security group to be created with all new rules. This can make a small change look like a big one, but is intentional and should not cause concern.

As explained above under The Importance of Keys, when using “destroy before create” behavior, security group rules without keys are identified by their indices in the input lists. If a rule is deleted and the other rules move closer to the start of the list, those rules will be deleted and recreated. This can make a small change look like a big one when viewing the output of terraform plan, and will likely cause a brief (seconds) service interruption.

You can avoid this for the most part by providing the optional keys, and limiting each rule to a single source or destination. Rules with keys will not be changed if their keys do not change and the rules themselves do not change, except in the case of rule_matrix, where the rules are still dependent on the order of the security groups in source_security_group_ids. You can avoid this by using rules instead of rule_matrix when you have more than one security group in the list. You cannot avoid this by sorting the source_security_group_ids, because that leads to the “Invalid for_each argument” error because of terraform#31035.

Invalid for_each argument

You can supply many rules as inputs to this module, and they (usually) get transformed into aws_security_group_rule resources. However, Terraform works in 2 steps: a plan step where it calculates the changes to be made, and an apply step where it makes the changes. This is so you can review and approve the plan before changing anything. One big limitation of this approach is that it requires that Terraform be able to count the number of resources to create without the benefit of any data generated during the apply phase. So if you try to generate a rule based on something you are creating at the same time, you can get an error like

Error: Invalid for_each argument
The "for_each" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how many instances will be created.

This module uses lists to minimize the chance of that happening, as all it needs to know is the length of the list, not the values in it, but this error still can happen for subtle reasons. Most commonly, using a function like compact on a list will cause the length to become unknown (since the values have to be checked and nulls removed). In the case of source_security_group_ids, just sorting the list using sort will cause this error. (See terraform#31035.) If you run into this error, check for functions like compact somewhere in the chain that produces the list and remove them if you find them.
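The sketch below shows the kind of construction that tends to trigger this error, next to a variant that avoids it. The resource and variable names are hypothetical, and the module inputs name and vpc_id are assumptions.

```hcl
variable "vpc_id" { type = string }

# Created in the same plan, so its ID is unknown until apply.
resource "aws_security_group" "client" {
  name_prefix = "client-"
  vpc_id      = var.vpc_id
}

module "server_sg" {
  # Assumption: registry source of the Cloud Posse module; pin a version in real use.
  source = "cloudposse/security-group/aws"

  name   = "server" # hypothetical name
  vpc_id = var.vpc_id

  # Risky pattern (do not use): wrapping not-yet-known IDs in compact()
  # makes the length of the list unknown at plan time, e.g.
  #   compact([aws_security_group.client.id, var.optional_sg_id])

  # Safer: the number of rules and their keys are fixed in the
  # configuration; only the referenced ID is resolved during apply.
  rules = [
    {
      key                      = "https-from-client"
      type                     = "ingress"
      from_port                = 443
      to_port                  = 443
      protocol                 = "tcp"
      source_security_group_id = aws_security_group.client.id
      description              = "HTTPS from the client security group"
    }
  ]
}
```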

WARNINGS and Caveats

Setting inline_rules_enabled is not recommended and NOT SUPPORTED: Any issues arising from setting inline_rules_enabled = true (including issues about setting it to false after setting it to true) will not be addressed because they flow from fundamental problems with the underlying aws_security_group resource. The setting is provided for people who know and accept the limitations and trade-offs and want to use it anyway. The main advantage is that when using inline rules, Terraform will perform “drift detection” and attempt to remove any rules it finds in place but not specified inline. See this post for a discussion of the difference between inline and resource rules and some of the reasons inline rules are not satisfactory.

KNOWN ISSUE (#20046): If you set inline_rules_enabled = true, you cannot later set it to false. If you try, Terraform will complain and fail. You will either have to delete and recreate the security group or manually delete all the security group rules via the AWS console or CLI before applying inline_rules_enabled = false.

Objects not of the same type: Any time you provide a list of objects, Terraform requires that all objects in the list be exactly the same type. This means that all objects in the list have exactly the same set of attributes and that each attribute has the same type of value in every object. So while some attributes are optional for this module, if you include an attribute in any of the objects in a list, you have to include that same attribute in all of them. In rules where the key would otherwise be omitted, include the key with a value of null, unless the value is a list type, in which case set the value to [] (an empty list), due to #28137.
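As an illustration of this workaround, the two rules below share exactly the same set of attributes: the attribute one rule does not need is set to null, or to [] when it is list-typed. The module inputs name and vpc_id and the rule values are assumptions made for the sketch.

```hcl
# Hypothetical variable supplying the VPC.
variable "vpc_id" { type = string }

module "sg" {
  # Assumption: registry source of the Cloud Posse module; pin a version in real use.
  source = "cloudposse/security-group/aws"

  name   = "app" # hypothetical name
  vpc_id = var.vpc_id

  rules = [
    {
      key         = "ssh-cidr"
      type        = "ingress"
      from_port   = 22
      to_port     = 22
      protocol    = "tcp"
      cidr_blocks = ["10.0.0.0/8"]
      self        = null # not needed here, but present so the types match
      description = "SSH from the internal network"
    },
    {
      key         = "http-self"
      type        = "ingress"
      from_port   = 80
      to_port     = 80
      protocol    = "tcp"
      cidr_blocks = [] # list-typed attribute: use [] rather than null
      self        = true
      description = "HTTP from inside the security group"
    }
  ]
}
```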