First Impressions With dep

Dependency management in Go has never been fun. When I started this post, I was making updates to the Terraform ACME provider and found myself in dependency hell with govendor due to this issue, which, it would seem, has still not been fixed.

You would think that fixing this would be as easy as just ripping out the vendor directory and installing every dependency again, version-locking the things you are concerned about (Terraform core in this case), but nope. Ultimately, I think one of the non-TF dependencies overwrote some UUID generation stuff that we depend on in Terraform, or perhaps one of the TF providers that we use for testing overwrote go-uuid, but in any case, TF core is now blocking the build and I can’t proceed.

So, what’s the only answer – burn it all down of course 🔥🔥🔥🔥🔥🔥

Since this is my personal project, I can pretty much use whatever dependency management tool I want, so I’m trying (and documenting for your pleasure) a run of using the new dep tool that will ultimately become the one dependency management system to rule them all in Go, finally.

With that said, let’s give it a spin. Since I like to live life on the edge, I’m using the bleeding edge copy fetched by go get, so if things are a bit off from the last binary release (I don’t know if they are or not, by the way), then you might want to check out master. There hasn’t been a new release since October.

(The specific install step is go get -u github.com/golang/dep/cmd/dep).

Initializing

With all of my import paths in place and raring to go, I deleted my vendor directory. You don’t have to do this, by the way – dep will back up your old vendor directory and also import the dependencies based on your existing tool’s metadata file; you can see the list of supported tools here. I just didn’t really trust my vendor.json, given the amount of trouble I was having trying to get it to converge on a correct dependency set.

Here is the output of dep init on the ACME provider:

Locking in master (06020f8) for transitive dep github.com/mitchellh/mapstructure
Using master as constraint for direct dep github.com/xenolf/lego
Locking in master (b929aa5) for direct dep github.com/xenolf/lego
Locking in master (23c074d) for transitive dep github.com/hashicorp/hcl
Locking in v1.32.0 (32e4c1e) for transitive dep gopkg.in/ini.v1
Locking in v1.1.0 (879c588) for transitive dep github.com/satori/go.uuid
Locking in v0.1.0 (4aabc24) for transitive dep github.com/bgentry/speakeasy
Locking in v1.32.0 (32e4c1e) for transitive dep github.com/go-ini/ini
Locking in v9.7.0 (8c58b47) for transitive dep github.com/Azure/go-autorest
Locking in master (7554cd9) for transitive dep github.com/hashicorp/errwrap
Locking in master (d23ffcb) for transitive dep github.com/mitchellh/copystructure
Locking in master (44bad6d) for transitive dep github.com/hashicorp/hcl2
Locking in v1.2.1 (5f10fee) for transitive dep github.com/agext/levenshtein
Locking in master (30785a2) for transitive dep golang.org/x/oauth2
Locking in v1.9.0 (f3955b8) for transitive dep google.golang.org/grpc
Locking in master (3d48364) for transitive dep github.com/jen20/awspolicyequivalence
Locking in master (709e403) for transitive dep github.com/zclconf/go-cty
Locking in master (42fe2e1) for transitive dep golang.org/x/net
Locking in master (dd9ec17) for transitive dep golang.org/x/sys
Locking in v1.1.0 (346938d) for transitive dep github.com/davecgh/go-spew
Locking in master (1e59b77) for transitive dep github.com/golang/protobuf
Locking in v0.9.1 (87b6919) for transitive dep github.com/hashicorp/vault
Locking in master (683f491) for transitive dep github.com/hashicorp/yamux
Locking in v12.1.1-beta (57da600) for transitive dep github.com/Azure/azure-sdk-for-go
Using ^1.6.0 as constraint for direct dep github.com/terraform-providers/terraform-provider-aws
Locking in v1.6.0 (ddd2205) for direct dep github.com/terraform-providers/terraform-provider-aws
Locking in master (64130c7) for transitive dep github.com/hashicorp/go-uuid
Locking in master (1fca145) for transitive dep github.com/armon/go-radix
Locking in v1.1 (dc2bc5a) for transitive dep github.com/posener/complete
Locking in master (fa9f258) for transitive dep github.com/hashicorp/hil
Locking in master (d5fe4b5) for transitive dep github.com/hashicorp/go-cleanhttp
Locking in master (f33a2c6) for transitive dep github.com/decker502/dnspod-go
Locking in v1.0.1 (787fb05) for transitive dep github.com/miekg/dns
Locking in master (59fac50) for transitive dep github.com/juju/ratelimit
Locking in v0.17.0 (050b16d) for transitive dep cloud.google.com/go
Locking in master (53e6ce1) for transitive dep github.com/google/go-querystring
Locking in v0.0.3 (0360b2a) for transitive dep github.com/mattn/go-isatty
Locking in master (2d22b6e) for transitive dep github.com/keybase/go-crypto
Locking in master (2bd8b58) for transitive dep github.com/apparentlymart/go-cidr
Locking in master (2bca23e) for transitive dep github.com/mitchellh/hashstructure
Locking in master (b1c0f9b) for transitive dep google.golang.org/api
Locking in master (37e8452) for transitive dep github.com/timewasted/linode
Locking in 1.15.0 (fa1c036) for transitive dep github.com/JamesClonk/vultr
Locking in (0b12d6b5) for transitive dep github.com/jmespath/go-jmespath
Locking in master (4fe82ae) for transitive dep github.com/hashicorp/go-version
Locking in v1.0.4 (d682213) for transitive dep github.com/sirupsen/logrus
Locking in master (994f50a) for transitive dep github.com/hashicorp/go-getter
Using ^0.11.1 as constraint for direct dep github.com/hashicorp/terraform
Locking in v0.11.1 (a42fdb0) for direct dep github.com/hashicorp/terraform
Locking in master (0dc08b1) for transitive dep github.com/hashicorp/logutils
Locking in master (02f7e94) for transitive dep github.com/ovh/go-ovh
Locking in v0.1.0 (5e23d5d) for transitive dep github.com/exoscale/egoscale
Locking in v0.15.0 (9641549) for transitive dep github.com/dnsimple/dnsimple-go
Locking in v3.1.0 (dbeaa93) for transitive dep github.com/dgrijalva/jwt-go
Locking in master (b836f5c) for transitive dep github.com/apparentlymart/go-textseg
Locking in master (ca137eb) for transitive dep github.com/hashicorp/go-hclog
Locking in master (e2fbc68) for transitive dep github.com/hashicorp/go-plugin
Locking in master (a61a995) for transitive dep github.com/mitchellh/go-testing-interface
Locking in v1.0.0 (150dc57) for transitive dep google.golang.org/appengine
Locking in master (63d60e9) for transitive dep github.com/mitchellh/reflectwalk
Locking in v3.5.1 (2ee8785) for transitive dep github.com/blang/semver
Locking in v2 (0e4404d) for transitive dep gopkg.in/yaml.v2
Locking in v1.1.0 (aa2e30f) for transitive dep gopkg.in/square/go-jose.v1
Locking in v1.0.3 (1563e62) for transitive dep github.com/edeckers/auroradnsclient
Locking in v1.2.0 (b91bfb9) for transitive dep github.com/stretchr/testify
Locking in master (a8101f2) for transitive dep google.golang.org/genproto
Using ^1.0.1 as constraint for direct dep github.com/terraform-providers/terraform-provider-tls
Locking in v1.0.1 (1be2f9e) for direct dep github.com/terraform-providers/terraform-provider-tls
Locking in master (b7773ae) for transitive dep github.com/hashicorp/go-multierror
Locking in v1.0.0 (15a30b4) for transitive dep github.com/beevik/etree
Locking in master (b8bc1bf) for transitive dep github.com/mitchellh/go-homedir
Locking in v1.12.56 (ce8d7a1) for transitive dep github.com/aws/aws-sdk-go
Locking in master (33edc47) for transitive dep github.com/mitchellh/cli
Locking in v2 (3c1c487) for transitive dep gopkg.in/ns1/ns1-go.v2
Locking in v0.5.4 (0c6b41e) for transitive dep github.com/ulikunitz/xz
Locking in master (ad45545) for transitive dep github.com/mitchellh/go-wordwrap
Locking in master (e19ae14) for transitive dep golang.org/x/text
Locking in master (553a641) for transitive dep github.com/golang/snappy
Locking in master (9fd32a8) for transitive dep github.com/bgentry/go-netrc
Using master as constraint for direct dep golang.org/x/crypto
Locking in master (0fcca48) for direct dep golang.org/x/crypto
Locking in v1.0.0 (792786c) for transitive dep github.com/pmezard/go-difflib

My initial impression here is that the UX is much better than govendor’s. It actually tells you what versions it’s pulling in? What sorcery is this? 😂

Looking at the output, it seems to have done a pretty good (initial) job at giving me what I want. The most important thing is that it’s vendoring lego at master. That project hasn’t had a release for a while, and there are some post-release fixes I specifically need.

Going even further, it vendored Terraform at its most recent release, 0.11.1. This is amazing, because it’s really the only dependency that I was concerned about locking to a specific version!

Now, editing one of the files within the provider does not give me the UUID-related build issues I was having, or any of the other missing or outdated dependency issues, for that matter. Pack it up, we’re done here.

But wait, are we done? Let’s check out a few more things.

File Count

dep will – currently, anyway – bloat your vendor directory pretty hard. The ACME project’s directory inflated to roughly five times its size (from about 2,800 files to over 16,000).

This is because the cleanup done by dep prune is not yet performed automatically by the rest of the process, although there is ongoing work in dep to change that. dep prune removes unused packages, and running it got the file count down to a much more manageable ~4,700 files, which, I would imagine, is much easier to get past someone during code review (thankfully I don’t have to do that here 😉).

This still did not remove everything, though. The team appears to still be working on making pruning thorough enough that only what is truly needed remains – this includes removing _test.go files from required packages. This is being tracked in #944, or at the very least questions about the pruning functionality are being answered there. Note that, as the subject of that issue indicates, prune will soon be consolidated into ensure, with that command assuming prune’s functionality along with its own.

Metadata and Configuration Files

Two files control dep’s behavior.

The first one, Gopkg.toml, is a configuration file that lists only your direct dependencies – the packages your project directly references. This is where you will want to make changes if you need to bump the version of a particular dependency before running dep ensure. Personally, I prefer this approach over something like govendor, where you need to manage this on the command line.

The documentation seems to indicate that you can make changes here, run dep ensure, and dep will automatically transition the entire vendor tree to properly satisfy the new dependency chain, if that’s possible.
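
As an illustration, here is roughly what the Terraform constraint generated above ends up looking like in Gopkg.toml – a minimal sketch, so the exact formatting dep writes may differ slightly:

[[constraint]]
  name    = "github.com/hashicorp/terraform"
  version = "0.11.1"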

The other file is Gopkg.lock. This is similar to any other lock file you would see in a dependency management system (such as Bundler), and replaces files like vendor.json (if you were using govendor, for example). You would normally not touch this file; it serves to catalog all of the dependencies the tool needs, including your transitive dependencies – the packages your direct dependencies require.

Viewing Dependency Status

dep status gives you information about the vendored dependencies.

Running it with no options gives you stats on every project that has been vendored, including versions and used package count within each project.

Also, running dep status -dot lets you output a GraphViz graph, much like terraform graph does. If you are familiar with terraform graph, though, you might know how ridiculous those graphs get at scale – and vendored dependencies are going to be worse – so don’t expect the output to be entirely simple.
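
For example, assuming GraphViz is installed, something like the following should render the graph to an image (a hedged sketch – adjust the output format and file name to taste):

dep status -dot | dot -Tpng -o status.png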

Local Work

Another thing to note is that dep does not currently appear to work off of local paths at all. This means that any local work you do will be superseded when you run dep ensure – which I personally think is great, because it removes the ambiguity that other tools currently have surrounding this behavior.

My hope is that they keep this default behavior if/when they do introduce support for working with local packages.

Version Constraints for Transitive Dependencies

The last thing I want to note is how to deal with transitive dependencies. These are dependencies that your direct dependencies (the packages your project directly references) depend on.

This cost me about half a day of debugging and troubleshooting: the correct vendored versions of copystructure and reflectwalk were not imported, and there are currently some “correct” behaviors in Terraform that depend on bugs present in those two packages. The fix was to add some extra override blocks in Gopkg.toml locking these two packages at the correct commits. After adding those, it was as easy as running dep ensure to get everything corrected.
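
As a rough sketch of what that looks like, the overrides were along these lines – the revisions shown here are placeholders, not the actual commits I used:

[[override]]
  name     = "github.com/mitchellh/copystructure"
  revision = "<known-good commit SHA>"

[[override]]
  name     = "github.com/mitchellh/reflectwalk"
  revision = "<known-good commit SHA>"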

Conclusion

dep is a great tool. It still has a bit of work to go, but the UX is very pleasant, and everyone involved should give themselves a pat on the back for finally getting the Go community to a place where we might be able to put dependency management behind us. It’s always going to be somewhat of a struggle, but at the very least we can all struggle under one banner, versus the dozen or so different ones that currently make up the vendor tool ecosystem.

If you are curious as to how it looks in action, you can check out the code for the Terraform ACME provider, which is now being managed with dep!

(If you read all the way through this and notice that the TF being used is now constrained to master, even though I mentioned that I wanted it locked at 0.11.1, it’s because not only did I do that during my troubleshooting process, but there have also been some changes to the “provider SDK” in helper/schema that I figured would be good to have on hand.)


Inside the Terraform Environments Feature

Terraform v0.9.0 shipped with a new core feature called environments. This formalizes a pattern that has been circulating around in the Terraform community for probably a number of years now, and is crucial to using TF effectively at scale.

In this post I’m going to dive into this feature and discuss a bit of what to expect, especially if you currently implement a custom environment pattern of your own. I’m also going to briefly discuss a pattern for managing your naming based on a module that computes names from your environment, and finally touch briefly on the future plans for the feature.

Effective State Management

First off, it’s probably prudent to discuss the background of why first-class support for this feature was added.

A modern organization will have at least a production environment and a staging environment. More complex organizations may have multiple production environments in different datacenters spread across the world, staging environments to match, and perhaps even sandbox environments for their individual teams.

Terraform is great because it allows you to codify your infrastructure in a repeatable way. To take full advantage of this repeatability, it’s best to spread your Terraform configuration across multiple projects (especially when those projects don’t share infrastructure), and then deploy those projects in a unit of granularity that makes sense (such as on a per-datacenter or per-region basis), managing each state separately.

This is what I will be referring to in this article as the environment pattern.

State Namespacing

The environment pattern is the simple partitioning of your Terraform state namespace into different categories, hopefully starting with the most unique entity (your project), down to the most common (your datacenters or regions).

This allows an effective logical separation of state, maintaining sanity, and reducing single points of failure.

As an example, say you have 2 projects. One is for a purely frontend web application whose content is hosted on S3 and served via CloudFront in AWS. The bucket is homed in us-west-2, which is notable even though both S3 and CloudFront are “global” services (the latter is, the former not necessarily, especially if you remember the recent S3 outage in us-east-1). The second is a backend API service for the web application, hosted on EC2 in both us-east-1 and us-west-2. These applications reside in different projects and don’t share code, and possibly not even developers. Both projects are deployed to staging for QA, and then to production once they pass QA and are ready for deployment.

Given this simple example, using the environment pattern, we would have two projects and 6 different Terraform state namespaces. Most, if not all, of our state pathing information is naturally delimited by slashes. They are below:

FE/app-frontend/production/us-west-2/terraform.tfstate
FE/app-frontend/staging/us-west-2/terraform.tfstate
BE/api-backend/production/us-east-1/terraform.tfstate
BE/api-backend/production/us-west-2/terraform.tfstate
BE/api-backend/staging/us-east-1/terraform.tfstate
BE/api-backend/staging/us-west-2/terraform.tfstate

This separation of configuration and, more importantly, Terraform state allows for a couple of things. First, the infrastructure for the frontend site and the backend API do not interfere with each other; they can be set up and maintained independently, without risk to the other when doing so. Second, there is flexibility in where the infrastructure can be deployed. Because we are not explicitly stating in the Terraform code where the infrastructure lives, we can deploy first to a local region that makes sense, then to a satellite region for local availability or further site resilience.

This does not necessarily need to apply only to “physical” infrastructure, either. Other kinds of configuration can be controlled this way. Consider the scenario of organizing AWS IAM access for a complex application infrastructure. Leveraging modules and a namespacing scheme, you can use Terraform to manage IAM in a pluggable fashion, where projects get only the access they need for the environments they need it in, allowing you to add and remove this access at will simply by adding or removing configuration.

Control of this pattern in your Terraform configuration used to be entirely up to you, and as such could be as flexible as the individual project needed. Ultimately, it yields separate states organized in a hierarchical fashion as discussed above, either checked into source control or stored remotely in a central state store. The scheme accommodates almost anything you want to do: if pathed the way shown above, referencing the state is semantic, and there are very few chances for collisions.

It can still be up to you, but now you have first-class support in that decision.

The Old Way

A very basic illustration of how environment namespacing ultimately translated to Terraform state configuration is shown below, using remote configuration as the example.

Remote Configuration

Pre-v0.9.0, this was entirely controlled via the terraform remote config command. This basically did the same thing that dropping in the backend configuration into your TF directory and running terraform init does now. The command went something like this:


terraform remote config \
-backend=s3 \
-backend-config="bucket=tfstate.foobar.local" \
-backend-config="key=FE/app-frontend/${ENV}/${REGION}/terraform.tfstate"

You could then pass in this data via variables to your Terraform run.


TF_VAR_env=${ENV} TF_VAR_region=${REGION} \
terraform apply

The lack of first-class support meant that the structure of the environment and the variables that you pass into your project were completely up to you.

The New Way

Terraform v0.9.x changes this in a few ways.

First off, since environment is a single key now, we have to manage this key-space a different way if we want to incorporate region, or any other namespace, into the mix.

Backend Configuration

To start, we need to drop the backend config into the Terraform directory:


terraform {
  backend "s3" {
    bucket = "tfstate.foobar.local"
    key    = "FE/web-frontend"
    region = "your-buckets-region"
  }
}

After this is done, terraform init needs to be run to ensure that the .terraform directory contains all of the information that is needed to connect to the remote backend. This also needs to be incorporated into your scripts, as the .terraform directory should not be checked into source control – hence it should be the first command that is run, before checking for, and switching to, environments.

Switching to the Environment

After backend configuration is done, we need to either switch to the new environment, or create it if it doesn’t exist. See the gist below:


if ! env_list="$(terraform env list)"; then
  exit 1
fi

if [[ "${env_list}" == *"${ENV}_${REGION}"* ]]; then
  terraform env select "${ENV}_${REGION}"
else
  terraform env new "${ENV}_${REGION}"
fi

Note how we are getting around the single-keyspace limitation we now have, by simply combining our ENV and REGION with an underscore. Again, this can be customized to your needs, but you may want to keep in mind some other considerations to future-proof yourself (more on this later).

Referencing the Environment in Config

Probably one of the best things about the first-class support for environments is the addition of the terraform.env built-in variable. This can be referenced in config much like the variables that we had to pass in manually before, creating a clear, best-practice pattern and helping to clean up build scripts. Aside from any other variables you choose to pass in, the only thing now necessary is to run terraform apply.
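
As a quick hypothetical example, a resource name could be derived from the current environment like so (the resource and naming scheme here are purely illustrative):

resource "aws_sns_topic" "alerts" {
  # Renders as "alerts-staging_us-west-2" when the staging_us-west-2
  # environment is selected.
  name = "alerts-${terraform.env}"
}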

Under the Hood

So what does the state namespace look like under the hood when using environments?

Using local state (no backend), Terraform creates a terraform.tfstate.d directory for your environments, dropping a terraform.tfstate file in a named directory for your environment within that directory. This is created in the directory that you run Terraform from.

Using remote state, things are a little different. Using S3 as the example, Terraform creates an env:/${ENVIRONMENT_NAME} directory (with the colon), prefixing this onto the key that you specify in the backend config. This might take a bit of getting used to, and if you rely on the terraform_remote_state data source – especially with a format similar to the namespacing discussed at the beginning of this article – you may actually need to migrate your remote state to the new structure before you can take advantage of this support.
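
To illustrate with the backend configuration shown earlier, the state object for a staging_us-west-2 environment would end up under a key along the lines of:

env:/staging_us-west-2/FE/web-frontend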

Managing Naming With Environment Data

Managing naming across several repositories with different environments can be a challenge. The environment pattern, both new and old, actually gives you a set of input data that can be used to help you manage these conventions in a conditional fashion. You can go a step further and encapsulate all of these names in a module. The below example shows you how you can do it when delimiting region on an underscore. You can obviously simplify this if you are not including region in your environment.

A very simple module could look like:


variable "endpoint_name" {
type = "string"
}
variable "domains" {
type = "map"
default = {
"production" = "foobar.local"
"staging" = "dev.foobar.local"
}
}
output "endpoint_name" {
value = "${var.endpoint_name}.${split("_", terraform.env)[0]}.${var.domains[element(split("_", terraform.env), 0)]}"
}

And could be included like so:


module "names" {
source = "./names"
endpoint_name = "frontend"
}
output "endpoint_name" {
value = "${module.names.endpoint_name}"
}

This would then render something like frontend.us-west-2.dev.foobar.local if your environment was staging_us-west-2. You can encapsulate any variety of names behind this module too, giving you a nicely packaged way to compute any number of standardized endpoint or resource names for your organization across multiple projects and environments.

This module looks pretty much the same in the old pattern, just with the inclusion of an environment and region parameter. So ultimately, the new pattern does save some coding for the module consumer, as now they don’t need to worry about supplying environment and region.

Future-Proofing

HashiCorp has expressed some intentions regarding plans for their environment support, namely having to do with source control tracking. As such, it might be prudent to start working your environment naming conventions into your Git workflows, so that you can take advantage of this support when it comes out. This also means that your environments should probably be structured so that they are compatible with the naming Git or your VCS of choice affords you, to at the very least ensure things look sane and readable.

Conclusion

The environments feature is a welcome addition to Terraform. It provides first-class validation to a pattern that has been in use in the community for a while, possibly allowing internal code that has been written to manage this practice to be cleaned up in the process. There might be a bit of a learning curve in adoption, but the task is not insurmountable, and the presence of the feature alone, along with the formalization of the pattern, will help encourage its use as a general best practice.

AWS Advent: Where are They Now?

Last November, I wrote an article for the 2016 AWS Advent, which you can find here and here.

As luck would have it, a mere week after it went live in the advent, HashiCorp went and released version 0.8.0 of Terraform, rendering a few important parts of the article obsolete by way of first-class support for things like ternary logic. And since then, Terraform 0.9.0 has been released (around mid-March, to be specific), bringing even more goodies, such as the long-awaited ability to use a computed count value in data sources, making data sources much more useful.

Let’s touch on these and some of the other more useful core updates:

The Ternary Operator

The ternary operator adds first-class support for basic conditional logic to Terraform. Interpolation operations can now render output based on a conditional if-then-else expression in EXPR ? THEN : ELSE form.

This can still be used to do things like toggle resources in the graph. Let’s take an example from the advent article where we toggle an AWS ALB target group in the autoscaling module. This used to be written as:

// autoscaling_alb_target_group creates the ALB target group.
resource "aws_alb_target_group" "autoscaling_alb_target_group" {
  count    = "${lookup(map("true", "1"), var.enable_alb, "0")}"
  ...
}

This can now be written as:

// autoscaling_alb_target_group creates the ALB target group.
resource "aws_alb_target_group" "autoscaling_alb_target_group" {
  count    = "${var.enable_alb == "true" ? "1" : "0" }"
  ...
}

Much simpler. 🙂

Count In Data Sources

Terraform 0.9.0 brought partial support for the ability for a computed count to be available within data sources. As long as all computed counts can be evaluated during the planning phase, Terraform will not kick back an error to you.
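
For example, a count derived from a variable – or anything else Terraform can resolve at plan time – works fine. A hedged sketch, with illustrative names:

variable "zone_names" {
  type    = "list"
  default = ["foobar.local.", "dev.foobar.local."]
}

// The count here is computed, but only from a variable, so it can be
// resolved during the plan and is therefore allowed as of 0.9.0.
data "aws_route53_zone" "selected" {
  count = "${length(var.zone_names)}"
  name  = "${element(var.zone_names, count.index)}"
}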

This means that a computed count is still not permitted when the computed variable comes from the result of a resource – basically if there’s absolutely no way for Terraform to know what your count is going to be before it starts its run, you’re out of luck.

Support for this will more than likely come at some point, though, through an in-development feature currently referred to as a partial apply. You can follow the effort here.

Remote State Configuration Files

terraform remote config was dropped as a command in 0.9.0 as well. You now use backends in a terraform configuration stanza to define remote state configuration, and then terraform init to write the .terraform/terraform.tfstate stub state file with the correct remote state configuration in it.

It should also be noted that this stub is now the only thing that gets saved to the .terraform/terraform.tfstate file. State is not cached anymore, preventing the leaking of sensitive data via artifacts from a Terraform run, along with edge cases that can happen with existing state (as an example, enabling KMS on an existing S3-stored Terraform state was problematic because the initial pull of the non-KMS-encrypted state blew away the new remote state configuration).

State Environments

Related to this change is the concept of state environments. These are namespaces that can be used to store state for different sets of resources that share a common configuration, such as development and production. This is normally something you would have needed to accomplish yourself via a specific namespace in your remote state, but it can now be handled via the terraform env command.

One Gotcha – No Interpolations

Unfortunately, the current remote state configuration scheme does not support interpolations. This may be a blocker for you if you need more advanced parameterization of your environments, such as purpose and region (if you need to differentiate between development in ca-central-1 and development in us-west-2, for example).

One way around this is to move that parameterization entirely into your environment name. You can then use the terraform env commands to switch to this state when needed, such as in scripts in your CI pipeline.

Emulating Pre-0.9.0 Behaviour

If you’re not quite ready to switch your state management yet, but want 0.9.x for the other amazing features, you can kind of emulate the previous terraform remote config behaviour by scripting the process of dropping in an automatically-generated file and running terraform init. Check out this gist which will give you the building blocks you need to update your scripts accordingly.

Just add Code: Fun with Terraform Modules and AWS

Note: This article originally appeared in the 2016 AWS Advent. For the original article, click here.

Note that this article was written for Terraform v0.7.x – there have been several developments since this release that make a number of the items covered here obsolete, and they will be covered in the next article. 🙂

This article is going to show you how you can use Terraform, with a little help from Packer and Chef, to deploy a fully-functional sample web application, complete with auto-scaling and load balancing, in under 50 lines of Terraform code.

You will need the sample project to follow along, so make sure you load that up before continuing with reading this article.

The Humble Configuration

Check out the code in the terraform/main.tf file.

It might be hard to believe that this mere smattering of Terraform is setting up:

  • An AWS VPC
  • 2 subnets, each in different availability zones, fully routed
  • An AWS Application Load Balancer
  • A listener for the ALB
  • An AWS Auto Scaling group
  • An ALB target group attached to the ALB
  • Configured security groups for both the ALB and backend instances

So what’s the secret?

Terraform Modules

This example is using a powerful feature of Terraform – the modules feature, providing a semantic and repeatable way to manage AWS infrastructure. The modules hide most of the complexity of setting up a full VPC behind a relatively small set of code, and an even smaller set of changes going forward (generally, to update this application, all that is needed is to update the AMI).

Note that this example is composed entirely of modules – no root module resources exist. That’s not to say that they can’t exist – and in fact one of the secondary examples demonstrates how you can use the outputs of one of the modules to add extra resources on an as-needed basis.

The example is composed of three visible modules, and one module that operates under the hood as a dependency:

  • terraform_aws_vpc, which sets up the VPC and subnets
  • terraform_aws_alb, which sets up the ALB and listener
  • terraform_aws_asg, which configures the Auto Scaling group, and ALB target group for the launched instances
  • terraform_aws_security_group, which is used by the ALB and Auto Scaling modules to set up security groups to restrict traffic flow.

These modules will be explained in detail later in the article.

How Terraform Modules Work

Terraform modules work very similarly to basic Terraform configuration. Each Terraform module is a standalone configuration in its own right and, depending on its prerequisites, can run completely on its own. In fact, a top-level Terraform configuration without any modules being used is still a module – the root module. You sometimes see this mentioned in various parts of the Terraform workflow, such as in error messages and the state file.

Module Sources and Versioning

Terraform supports a wide variety of remote sources for modules, such as simple, generic locations like HTTP, or Git, or well-known locations like GitHub, Bitbucket, or Amazon S3.

You don’t even need to put a module in a remote location. In fact, a good habit to get into is this: if you need to re-use Terraform code in a local project, put that code in a module – that way you can re-use it several times to create the same kinds of resources in the same or, even better, different environments.

Declaring a module is simple. Let’s look at the VPC module from the example:

module "vpc" {                                                       
  source                  = "github.com/paybyphone/terraform_aws_vpc?ref=v0.1.0"
  vpc_network_address     = "${var.vpc_network_address}"             
  public_subnet_addresses = ["${var.public_subnet_addresses}"]       
  project_path            = "${var.project_path}"                    
} 

The location of the module is specified with the source parameter. The style of the parameter will dictate what kind of behaviour TF will undertake to get the module.

The rest of the options here are module parameters, which translate to variables within the module. Note that any variable that does not have a default value in the module is a required parameter, and Terraform will not start if these are not supplied.

The last item that should be mentioned is regarding versioning. Most module sources that work off of source control have a versioning parameter you can supply to get a revision or tag – with Git and GitHub sources, this is ref, which can translate to most Git references, be it a branch, or tag.

Versioning is a great way to keep things under control. You might find yourself iterating very fast on certain modules as you learn more about Terraform or your internal infrastructure design patterns change – versioning your modules ensures that you don’t need to constantly refactor otherwise stable stacks.

Module Tips and Tricks

Terraform and HCL are works in progress, and there may be some things that seem like they should make sense but don’t necessarily work 100% – yet. There are some things you might want to keep in mind when designing your modules that may reduce the complexity that ultimately gets presented to the user:

Use Data Sources

Terraform 0.7+’s data sources feature can go a long way in reducing the amount of data that needs to go into your module.

In this project, data sources are used for things such as obtaining VPC IDs from subnets (aws_subnet) and getting the security groups assigned to an ALB (using the aws_alb_listener and aws_alb data sources chained together). This allows us to create ALBs based off of subnet IDs alone, and to attach auto-scaling groups to ALBs knowing only the listener ARN that we need to attach to.
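
As a hedged sketch of the first case, a module that only takes subnet IDs might derive the VPC ID it needs along these lines (the variable and data source names are illustrative, not copied from the module):

variable "listener_subnet_ids" {
  type = "list"
}

// Look up one of the subnets passed in, so that the module can derive the
// VPC ID on its own (via data.aws_subnet.selected.vpc_id) instead of
// requiring the caller to supply it.
data "aws_subnet" "selected" {
  id = "${element(var.listener_subnet_ids, 0)}"
}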

Exploit Zero Values and Defaults

Terraform follows the rules of the language it was created in regarding zero values. Hence, most of the time, supplying an empty parameter is the same as supplying none at all.

This can be advantageous when designing a module to support different kinds of scenarios. For example, the alb module supports TLS via supplying a certificate ARN. Here is the variable declaration:

// The ARN of the server certificate you want to use with the listener.
// Required for HTTPS listeners.
variable "listener_certificate_arn" {
  type    = "string"
  default = ""
}

And here it is referenced in the listener block:

// alb_listener creates the listener that is then attached to the ALB supplied
// by the alb resource.
resource "aws_alb_listener" "alb_listener" {
  ...
  certificate_arn   = "${var.listener_certificate_arn}"
  ...
}

Now, when this module parameter is not supplied, its default value becomes an empty string, which is passed in to aws_alb_listener.alb_listener. This is, most times, exactly the same as if the parameter is not passed in at all. This allows you to not have to worry about this parameter when you just want to use HTTP on this endpoint (the default for the ALB module as a whole).

Pseudo-Conditional Logic

Terraform does not support conditional logic yet, but through creative use of count and interpolation, one can create semi-conditional logic in your resources.

Consider the fact that the terraform_aws_asg module supports the ability to attach the ASG to an ALB, but does not explicitly require it. How can you get away with that, though?

To get the answer, check one of the ALB resources in the module:

// autoscaling_alb_target_group creates the ALB target group.
resource "aws_alb_target_group" "autoscaling_alb_target_group" {
  count    = "${lookup(map("true", "1"), var.enable_alb, "0")}"
  ...
}

Here, we make use of the map interpolation function, nested in a lookup function to provide essentially an if/then/else control structure. This is used to control a resource’s instance count, adding an instance if var.enable_alb is true, and completely removing the resource from the graph otherwise.

This conditional logic does not necessarily need to be limited to count either. Let’s go back to the aws_alb_listener.alb_listener resource in the ALB module, looking at a different parameter:

// alb_listener creates the listener that is then attached to the ALB supplied
// by the alb resource.
resource "aws_alb_listener" "alb_listener" {
  ...
  ssl_policy        = "${lookup(map("HTTP", ""), var.listener_protocol, "ELBSecurityPolicy-2015-05")}"
  ...
}

Here, we are using this trick to supply the correct SSL policy to the listener if the listener protocol is not HTTP. If it is, we supply the zero value, which as mentioned before, makes it as if the value was never supplied.

Module Limitations

Terraform does have some not-necessarily-obvious limitations that you will want to keep in mind when designing both modules and Terraform code in general. Here are a couple:

Count Cannot be Computed

This is a big one that can really get you when you are writing modules. Consider the following scenario, which totally did not happen to me even though I knew of such things beforehand 😉

  • An ALB listener is created with aws_alb_listener
  • The arn of this resource is passed as an output
  • That output is used as both the ARN to attach an auto-scaling group to, and the pseudo-conditional in the ALB-related resources’ count parameter

What happens? You get this lovely message:

value of 'count' cannot be computed

Actually, it used to be worse (a strconv error was displayed instead), but luckily that changed recently.

Unfortunately, there is no nice way to work around this right now. Extra parameters need to be supplied, or you need to structure your modules in a way that avoids computed values being passed into count directives in your workflow. (This is pretty much exactly why the terraform_aws_asg module has an enable_alb parameter.)

Complex Structures and Zero Values

Complex structures are not necessarily good candidates for zero values, even though it may seem like a good idea. But by defining a complex structure in a resource, you are by nature supplying it a non-zero value, even if most of the fields you supply are empty.

Most resources don’t handle this scenario gracefully, so it’s best to avoid using complex structures when you are designing a module for re-use and expect that the functionality defined by such a structure won’t be used often.

The Application in Brief

As our focus in this article is on Terraform modules, and not on other parts of the pattern such as using Packer or Chef to build an AMI, we will only touch briefly on the non-Terraform parts of this project, so that we can focus on the Terraform code and the AWS resources it is setting up.

The Gem

The Ruby gem in this project is a small “hello world” application running with Sinatra. This is self-contained within this project and mainly exists to give us an artifact to put on our base AMI to send to the auto-scaling group.

The server prints out the system’s hostname when fetched. This will allow us to see each node in action as we boot things up.

Packer

The built gem is loaded onto an AMI using Packer, for which the code is contained within packer/ami.json. We use chef-solo as a provisioner, which works off a self-contained cookbook named packer_payload in the cookbooks directory. This gives us a somewhat higher-level workflow than we would have with shell scripts alone, including the ability to better integration-test things and to possibly support multiple build targets.

Note that the Packer configuration takes advantage of a new Packer 0.12.0 feature that allows us to fetch an AMI to use as the base right from Packer. This is the source_ami_filter directive. Before Packer 0.12.0, you would have needed to resort to a helper, such as ubuntu_ami.sh, to get the AMI for you.
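
As a hedged illustration of roughly what that directive looks like in a builder – the filter values here follow the canonical Ubuntu example from the Packer docs rather than this project’s actual configuration:

"builders": [
  {
    "type": "amazon-ebs",
    "region": "us-west-2",
    "source_ami_filter": {
      "filters": {
        "virtualization-type": "hvm",
        "name": "ubuntu/images/*ubuntu-xenial-16.04-amd64-server-*",
        "root-device-type": "ebs"
      },
      "owners": ["099720109477"],
      "most_recent": true
    },
    "instance_type": "t2.micro",
    "ssh_username": "ubuntu",
    "ami_name": "packer-example {{timestamp}}"
  }
]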

The Rakefile

The Rakefile is the build runner. It has tasks for Packer (ami), Terraform (infrastructure), and Test Kitchen (kitchen). It also has prerequisite tasks to stage cookbooks (berks_cookbooks), and Terraform modules (tf_modules). It’s necessary to pre-fetch modules when they are being used in Terraform – normally this is handled by terraform get, but the tf_modules task does this for you.

It also handles some parameterization of Terraform commands, which allows us to specify when we want to perform something else other than an apply in Terraform, or use a different configuration.

All of this is in addition to standard Bundler gem tasks like build, etc. Note that the install and release tasks have been explicitly disabled so that you don’t install or release the gem by mistake.

The Terraform Modules

Now that we have that out of the way, we can talk about the fun stuff!

As mentioned at the start of the article, this project has 4 different Terraform modules. Also as mentioned, one of them (the Security Group module) is hidden from the end user, as it is consumed by two of the parent modules to create security groups to work with. This exploits the fact that Terraform can, of course, nest modules within each other, allowing for any level of re-usability when designing a module layout.

The AWS VPC Module

The first module, terraform_aws_vpc, creates not only a VPC, but also public subnets as well, complete with route tables and internet gateway attachments.

We’ve already hidden a decent amount of complexity just by doing this, but as an added bonus, redundancy is baked right into the module: any network addresses passed in to the module as subnets are distributed across all availability zones available in the particular region, via the aws_availability_zones data source. This process does not require previous knowledge of the zones available to the account.

The module passes out pertinent information, such as the VPC ID, the ID of the default network ACL, the created subnet IDs, the availability zones for those subnets as a map, and the ID of the route table created.

The ALB Module

The second module, terraform_aws_alb allows for the creation of AWS Application Load Balancers. If all you need is the defaults, use of this module is extremely simple, creating an ALB that will answer requests on port 80. A default target group is also created that can be used if you don’t have anything else mapped, but we want to use this with our auto-scaling group.

The Auto Scaling Module

The third module, terraform_aws_asg, is arguably the most complex of the three that we see in the sample configuration, but even at that, its required options are very slim.

The beauty of this module is that, thanks to all the aforementioned logic, you can attach more than one ASG to the same ALB with different path patterns (mentioned below), or not attach it to an ALB at all! This allows this same module to be used for a number of scenarios. This is on top of the plethora of options available to you to tune, such as CPU thresholds, health check details, and session stickiness.

Another thing to note is how the AMI for the launch configuration is being fetched from within this module. We work off the tag that we used within Packer, which is supplied as a module variable. This is then searched for within the module via an aws_ami data source. This means that no code or variables need to change when the AMI is updated – the next Terraform run will pick up the most recent AMI with the tag.
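
A hedged sketch of what that lookup might look like inside the module – the tag key here is an assumption, not copied from the actual module:

// Find the most recent AMI carrying the tag value that Packer applied, so
// that new builds are picked up on the next Terraform run.
data "aws_ami" "launch_ami" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "tag:application"
    values = ["${var.image_tag_value}"]
  }
}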

Lastly, this module supports the rolling update mechanism laid out by Paul Hinze in this post oh so long ago now. When a new AMI is detected and the auto-scaling group needs to be updated, Terraform will bring up the new ASG, attach it, wait for it to have minimum capacity, and then bring down the old one.

The Security Group Module

The last module to be mentioned, terraform_aws_security_group, is not shown anywhere in our example, but is actually used by the ALB and ASG modules to create Security Groups.

It does not only create security groups, though – it also allows for the creation of 2 kinds of ICMP allow rules: one for all ICMP, if you so choose, but more importantly, allow rules for ICMP type 3 (destination unreachable) are always created, as this is what path MTU discovery relies on. Without this, we might end up with unnecessarily degraded performance.
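
A hedged sketch of what such a rule might look like – the security group reference is illustrative:

// Always allow inbound ICMP type 3 (destination unreachable), all codes,
// so that path MTU discovery keeps working.
resource "aws_security_group_rule" "allow_icmp_type_3" {
  type              = "ingress"
  protocol          = "icmp"
  from_port         = 3
  to_port           = -1
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = "${aws_security_group.security_group.id}"
}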

Give it a Shot

After all this talk about the internals of the project and the Terraform code, you might be eager to bring this up and see it working. Let’s do that now.

Assuming you have the project cloned and AWS credentials set appropriately, do the following:

  • Run bundle install --binstubs --path vendor/bundle to load the project’s Ruby dependencies.
  • Run bundle exec rake ami. This builds the AMI.
  • Run bundle exec rake infrastructure. This will deploy the project.

After this is done, Terraform should return an alb_hostname value to you. You can now load this up in your browser. Load it once, then wait about 1 second, then load it again! Or even better, just run the following in a prompt:

while true; do curl http://ALBHOST/; sleep 1; done

And watch the hostname change between the two hosts.

Tearing it Down

Once you are done, you can destroy the project simply by passing a TF_CMD environment variable in to rake with the destroy command:

TF_CMD=destroy bundle exec rake infrastructure

And that’s it! Note that this does not delete the AMI artifact – you will need to do that yourself.

More Fun

Finally, a few items for the road. These are things that are otherwise important to note or should prove to be helpful in realizing how powerful Terraform modules can be.

Tags

You may have noticed the modules have a project_path parameter that is filled out in the example with the path to the project in GitHub. This is something that I think is important for proper AWS resource management.

Several of our resources have machine-generated names or IDs which make them hard to track on their own. Having an easy-to-reference tag alleviates that. Having the tag reference the project that consumes the resource is even better – I don’t think it gets much clearer than that.

SSL/TLS for the ALB

Try this: create a certificate using Certificate Manager, and change the alb module to the following:

module "alb" {
  source                   = "github.com/paybyphone/terraform_aws_alb?ref=v0.1.0"
  listener_subnet_ids      = ["${module.vpc.public_subnet_ids}"]
  listener_port            = "443"
  listener_protocol        = "HTTPS"
  listener_certificate_arn = "arn:aws:acm:region:account-id:certificate/certificate-id"
  project_path             = "${var.project_path}"
}

Better yet, see the example here. This can be run with the following command:

TF_DIR=terraform/with_ssl bundle exec rake infrastructure

And destroyed with:

TF_CMD=destroy TF_DIR=terraform/with_ssl bundle exec rake infrastructure

You now have SSL for your ALB! Of course, you will need to point DNS to the ALB (either via external DNS, CNAME records, or Route 53 alias records – the example includes this), but it’s that easy to change the ALB into an SSL load balancer.

Adding a Second ASG

You can also use the ASG module to create two auto-scaling groups.

module "autoscaling_group_foo" {
  source            = "github.com/paybyphone/terraform_aws_asg?ref=v0.1.1"
  subnet_ids        = ["${module.vpc.public_subnet_ids}"]
  image_tag_value   = "vancluever_hello"
  enable_alb        = "true"
  alb_listener_arn  = "${module.alb.alb_listener_arn}"
  alb_rule_number   = "100"
  alb_path_patterns = ["/foo/*"]
  alb_service_port  = "4567"
  project_path      = "${var.project_path}"
}

module "autoscaling_group_bar" {
  source            = "github.com/paybyphone/terraform_aws_asg?ref=v0.1.1"
  subnet_ids        = ["${module.vpc.public_subnet_ids}"]
  image_tag_value   = "vancluever_hello"
  enable_alb        = "true"
  alb_listener_arn  = "${module.alb.alb_listener_arn}"
  alb_rule_number   = "101"
  alb_path_patterns = ["/bar/*"]
  alb_service_port  = "4567"
  project_path      = "${var.project_path}"
}

There is an example for the above here. Again, run it with:

TF_DIR=terraform/multi_asg bundle exec rake infrastructure

And destroy it with:

TF_CMD=destroy TF_DIR=terraform/multi_asg bundle exec rake infrastructure

You now have two auto-scaling groups, one handling requests for /foo/*, and one handling requests for /bar/*. Give it a go by reloading each URL and see the unique instances you get for each.

Acknowledgments

I would like to take a moment to thank PayByPhone for allowing me to use their existing Terraform modules as the basis for the publicly available ones at https://github.com/paybyphone. Writing this article would have been a lot more painful without them!

Also thanks to my editors, Anthony Elizondo and Andrew Langhorn, for their feedback and help with this article, and the AWS Advent Team for the chance to stand on their soapbox for my 15 minutes! 🙂

AWS World Detour – Packer and Terraform

NOTE: Click here to get to the sample project quickly.

So I know it’s been a while – quite a while – since I’ve posted anything. I’ve been far from inactive though, and I will touch up on that in a later post.

I will be getting back on track here with my AWS World Tour series soon, but first, I want to take a bit of a detour and discuss some things that have been highly relevant to what I have been working on as of the last little while – infrastructure provisioning with Terraform.

My Journey the Last 6 Months

I have had a very eventful last 6 months since I started to use AWS “in anger”. I’m not the biggest fan of that term, but I can’t really think of another way to put it right now. There have indeed been some frustrating moments, but I can’t really say I was angry.

In any case, my journey to try and find a good AWS provisioning platform took me to a few places, namely:

  • Trying to use Chef – Which prompted me to write this pull request. However, I quickly realized that having to write AWS resources for everything that we needed to do at PayByPhone might not be the best use of my time.
  • I also was going to write my own tooling around managing CloudFormation, but stopped short when, one weekend, I asked myself if I was reinventing the wheel, and I kind of was.

Enter Terraform.

Terraform

Terraform is a configuration management tool created by HashiCorp with an emphasis on virtual infrastructure, versus say something like Chef or Puppet, that has an emphasis on OS-level configuration management. In fact, the two can work in tandem – Terraform has a Chef provisioner.

What makes Terraform special is its ability to support many kinds of infrastructure platforms – and in fact, this is its stated goal. Things written in Terraform should ultimately be portable from one platform to another, say if one wanted to move from AWS to Azure, or to some kind of OpenStack-hosted solution somewhere. Admittedly, we had looked at Terraform earlier last year and its AWS support had not been fully fleshed out yet, but that seems to have changed drastically, and by the end of the year its feature set was on par with, and possibly even better than, CloudFormation itself.

So what if one just wanted to stick with AWS? What’s the point in using it over, say, simply CloudFormation? The answer to that is really dependent on the use case, and like many things in life, boils down to the little things. Some examples:

  • Support for ZIP file uploads to AWS Lambda (versus being restricted to S3 on CloudFormation).
  • Support for the AWS NAT Gateway. As of this writing, about a month since the NAT Gateway’s release, it still does not appear to be supported by CloudFormation, according to my very light research (judging by the traffic on this thread).
  • A non-JSON DSL that has support for several kinds of programmatic interpolation operations, including some basic forms of loop operations allowing one to create a certain number of resources with a single chunk of code. The DSL also has support for modules, allowing re-use and distribution of common templates.

This is all made possible by the fact that by using Terraform, one is not beholden to the CloudFormation way of doing things. Rather, Terraform uses the AWS API (through the Go SDK), which in some instances is much more versatile than CloudFormation is, or could be, for that matter (as a hosted configuration management platform, they have to restrict some of their data sources to reliable services – this is more than likely why S3 is the only way to get a ZIP file read in for Lambda).

Terraform is a fast-moving target. I have submitted several pull requests myself, for bug fixes and adding functionality alike, and don’t see an end in sight for that, at least not for a few months. For example, one of my more recent feature PRs allows one to get details on an AMI for use in a template later. In order to do this in CloudFormation, one would have to undertake the tedious process of writing a custom resource, which ultimately leads to more out-of-band resources and unnecessary technical debt, in my opinion.

Packer

I’ve been using Packer for quite some time now to build base images, namely for our VMWare infrastructure at PayByPhone.

Packer is basically an image builder. Think of it this way – gem might build a Ruby gem to deploy on other systems, and one might build a static binary with go build. Packer is like this, but for system images. In fact, one might not have a custom application, and simply may need to build an image with some typical software on it and configured a certain way – in this instance, Packer and its provisioning code could live in the same repository and serve as a complete “application”.

Incidentally, the process for building AMIs is actually much simpler than using VMWare. There is a lot less code, and it’s basically AMI-in, AMI-out.

Packer Provisioners

Just a note on provisioners – Packer supports several of these. I use the Chef provisioner; there are also Ansible and shell provisioners. I would recommend using one of the configuration management options – it allows for better re-use of code, and also allows for testing before moving on to the Packer build. Namely, with Chef, I am able to use Berkshelf to manage dependencies, and Test Kitchen to sort out most, if not all, errors that would happen with a cookbook before moving on to creating the Packer JSON file.

This also makes it more portable for use with something like Vagrant – generally all that’s needed is to copy the provisioner configuration from Vagrant to Packer, or vice versa, with possibly some minor modifications.

Putting it Together

With the above tools and a competent build runner, I can actually write an entire pipeline that takes an application and deploys it to AWS with relative ease. Better than that, this all can live together! This enables someone to go to a single repository to:

  • Get hands on the code, make changes, and run unit tests on it
  • Deploy using the provisioner code to an EC2 instance, or local to their machine with Vagrant, to run integration tests and experiment with the code
  • Fully test the infrastructure by building the AMI and deploying the infrastructure in a uniform fashion to multiple environments (ie: sandbox, staging, or production).

This allows for a pipeline that should ensure a near perfect deploy once the application is ready for production.

The Pattern in Action

As an example, I have taken the code from my previous articles (see here, here, here, and the code here) and applied the same idea, but with some changes. Again, I am deploying a VPC with an ELB and 3 instances, but this time I have skipped some of the details irrelevant to a VPC of this kind, especially since I will never be logging into it – namely the private network and NAT part of things.

The code can be found here. Let’s go over it together:

The application

The application is a simple Ruby "hello world" style application running with Sinatra. The application is bundled up as a Ruby gem – this is actually a pretty easy way to create this kind of application, as it produces a single artifact that can be deployed wherever it is needed, especially on a "bare metal", single-purpose system. This application could even be deployed using the system's base Ruby, if it is current enough (I don't do that though, as I will show in later sections).

The layout of the application part of the sample repo is such:

exe/
  vancluever_hello       <-- Executable binary part of gem package
lib/
  vancluever_hello.rb    <-- Application "entry point" and Sinatra code
  vancluever_hello/
    version.rb           <-- Gem version file
pkg/                     <-- Output directory (gem gets built to here)
Gemfile                  <-- Bundler dependency file (chained to gemspec)
Rakefile                 <-- Rake build runner configuration file
vancluever_hello.gemspec <-- RubyGems package spec file

The bulk of the test code is in the lib/vancluever_hello.rb file, a simple file whose contents are shown below:

require 'sinatra/base'
require 'socket'

module VanclueverHello
  # Run the test server.
  class Server < Sinatra::Base
    def self.run_server
      set :bind, '0.0.0.0'

      get '/' do
        "Hello from #{Socket.gethostname}!!!"
      end

      run!
    end
  end
end

This is the Sinatra self-hosted version of what we were doing with the index.html files, Apache, and user data in the previous version of this stack. Rather than use a static file this time, we are using Sinatra and Ruby to demonstrate how this small app can be bundled onto an image and deployed from there, without any post-creation package installation and content writing.

The Rakefile contains the bundler/gem_tasks helper that allows us to easily build this gem from the details in the vancluever_hello.gemspec file. By running rake build, the .gem is dropped into the pkg/ directory, ready for the next step.
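
For illustration, the top of a Rakefile like this could look something like the sketch below – a minimal sketch, assuming the standard Bundler gem tasks and clearing the release task so that nothing gets pushed by accident (the exact task names in the real Rakefile may differ):

require 'bundler/gem_tasks'

# Pull in the standard gem tasks (build, install, release), then clear the
# ones that could accidentally publish the gem to rubygems.org. Which tasks
# exist depends on the Bundler version, hence the task_defined? guard.
%w(release release:rubygem_push).each do |name|
  Rake::Task[name].clear if Rake::Task.task_defined?(name)
end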

The packer_payload Chef cookbook

The next piece of the puzzle is the packer_payload Chef cookbook, self-contained in the cookbooks directory. This cookbook is not like other Chef cookbooks one might see – the metadata is stripped down (no version info, description, or even version locking). This is because this cookbook is not intended to be used anywhere else other than the Packer build that I will be discussing shortly.

Why Chef then? Why not shell scripts if this is all the cookbook is going to be used for? A couple of quick reasons that come to mind:

  • I’m not the biggest fan of shell scripts – I will use them when necessary, but I’m more a fan these days of writing things in a way that they can be easily re-used, and in a way that makes it easy to pull in things that make my job easier. Using Chef allows me to do that. For example, rather than having to write code to manage a non-distro Ruby, I use poise-ruby and poise-ruby-build to manage the Ruby version and gem package. Taking it further, rather than having to write scripts to template out upstart or systemd, I use poise-service, which supports both.
  • Even if this cookbook is not suitable for Supermarket, or to sit on a Chef server, its re-usability is not completely diminished. Test Kitchen can still be used with it, and in fact there is a .kitchen.cloud.yml file in the directory. Kitchen was used to test this cookbook before putting it into Packer, ensuring that most, if not all, of the code worked before starting the process to build the AMI. This cookbook can also be used in Vagrant with minimal effort, should the need arise.

Getting the data to Chef

One thing that deserves mention is how I actually get the data to the Chef cookbook. The artifact does need to be delivered in some fashion to the cookbook itself. This is okay, mainly because I have Packer and Test Kitchen to help out with that.

In attributes/default.rb I control the location of the artifact:

default['packer_payload']['source_path'] = '/tmp/gem_pkg'

This is where the artifact is copied to with Packer (more on that soon). However, with Kitchen, things are a little different, because of how Test Kitchen handles its data directories:

provisioner:
  name: chef_zero
  data_path: ../../pkg/

data_path controls the directory that contains any non-cookbook data that I want to send to the server. After this is done, I need to change the source_path node attribute:

suites:
  - name: default
    run_list:
      - recipe[packer_payload::default]
    attributes:
      packer_payload:
        source_path: /tmp/kitchen/data
        app_version: <%= ENV['KITCHEN_APP_VERSION'] %>

Also note the ERB in app_version – this is an environment variable passed in from Rake, which gets the data from the VanclueverHello::VERSION module in the gem code. More on this below.

Testing

As mentioned, the cookbook’s .kitchen.cloud.yml is fully functional, and the cookbook can be tested using the following command:

KITCHEN_APP_VERSION=0.1.0 \
 AWS_KITCHEN_USER=ubuntu \
 AWS_KITCHEN_AMI_ID=ami-123456abcd \
 KITCHEN_YAML=.kitchen.cloud.yml \
 ../../bin/kitchen verify

Note the environment variables. Depending on the target being tested for, kitchen-ec2 may have trouble finding an AMI for the target. The one I seem to have the most luck with right now is Ubuntu Trusty (14.04) – but I wanted to try this against some of the more recent versions like Wily. This made it necessary to supply the login user and the AMI ID, which I do through the environment so that they can be parameterized. I also pass the app version, which helps show how I can control the version of the gem that gets installed. Finally, it is a popular pattern to name EC2 or other cloud Kitchen config files .kitchen.cloud.yml and call them by passing the KITCHEN_YAML environment variable.

All of this is also in the Rakefile, under the kitchen task. By doing this, I don't have to worry about running this from the command line all the time. Further, it takes the work out of determining the AMI to use (more on that later).
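
To give a rough idea of what that looks like, here is a minimal sketch of such a kitchen task. This is an illustration only, not the actual Rakefile – the ubuntu_ami_id helper is the one described later in this post, and the hard-coded ubuntu login user is an assumption:

require_relative 'lib/vancluever_hello/version'

desc 'Run Test Kitchen against the packer_payload cookbook'
task :kitchen do
  # Pass the gem version and a suitable source AMI in through the
  # environment, then shell out to Kitchen using the cloud config.
  ENV['KITCHEN_APP_VERSION'] ||= VanclueverHello::VERSION
  ENV['AWS_KITCHEN_USER']    ||= 'ubuntu'
  ENV['AWS_KITCHEN_AMI_ID']  ||= ubuntu_ami_id # helper described later
  ENV['KITCHEN_YAML']          = '.kitchen.cloud.yml'
  Dir.chdir('cookbooks/packer_payload') { sh '../../bin/kitchen verify' }
end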

The Packer template

The Packer template lives in the packer/ami.json file, sitting at a nice 43 lines, fully beautified. It has variables that are used to tell Packer what AMI to get, what region to set it up in, and also some things to add to the description, such as the distribution and application version.

The Packer template is probably the simplest part of this setup. All that is needed to kick off the AMI creation is a packer build packer/ami.json. Of course, the parameters need to be passed in via environment variables, but the Rakefile handles that.

Provisioners – artifact delivery and Chef

One thing that I will note about the Packer template is how it does its configuration work on the AMI – this is done via what are known in Packer as provisioners:

  "provisioners": [
    {
      "type": "shell",
      "inline": ["mkdir /tmp/gem_pkg"]
    },
    {
      "type": "file",
      "source": "pkg/vancluever_hello-{{user `app_version`}}.gem",
      "destination": "/tmp/gem_pkg/vancluever_hello-{{user `app_version`}}.gem"
    },
    {
      "type": "chef-solo",
      "cookbook_paths": ["berks-cookbooks"],
      "run_list": ["packer_payload"],
      "json": {
        "packer_payload": {
          "app_version": "{{user `app_version`}}"
        }
      }
    }
  ]

Note the first two – the shell and file provisioners, which deliver the artifact. Creation of the directory is necessary here (and Packer won’t fail if the directory does not exist – something that created about an extra 2 hours of troubleshooting work for me as I was making this example). The next one, the chef-solo provisioner, runs the packer_payload cookbook to configure things.

Note the cookbooks directory. It's not cookbooks, but instead berks-cookbooks. This is because I'm staging the full, dependency-evaluated cookbook collection in the berks-cookbooks directory, via Berkshelf. This is handled by the Rakefile ahead of the execution of Packer. I haven't dived into the Rakefile yet, and I won't just yet – first, I want to introduce the star of the show.

Tags

One last thing before I move on. Tagging the AMI is important! It allows the AMI to be found with a search afterwards, and it keeps me from having to parse the Packer logs for the AMI ID – even though Packer makes that easier with a machine-readable output flag, I still find tagging to be less work (no capturing output or having to save a log file). It also gets one into the habit of tagging resources, which should be done anyway.

Note that the tag – the "artifact" being searched on – doesn't have to be the application ID. In addition to this, one could also tag the build ID, which can provide even further granularity when searching for AMIs.

Terraform

Finally I get to the headliner. Terraform.

The Terraform configuration sits in the terraform/main.tf file. This is only a single file, but it can be broken up within this directory into as many files as it makes sense to have. For example, a lot of my projects now have a variables.tf, outputs.tf, vpc.tf, instance.tf, and more, depending on how things need to look. This allows for code chunks that are easy to read. As long as they are all in the same directory, Terraform will treat them all as the same plan.

Looking at the main.tf file, one may notice several analogs to the CloudFormation template from my last article. The key differences, aside from the infrastructure that was removed as unnecessary for this article, are that it's not JSON, and that there is only one aws_instance resource, with a count of 3. count is a special Terraform DSL attribute that tells Terraform to make more than one of the specified resource. This is referenced in the aws_elb block too, with a splat operator:

resource "aws_elb" "elb" {
  ...

  instances = ["${aws_instance.web_servers.*.id}"]
}

Basically allowing me to reference all the instance IDs at once.

And of course, the Terraform file is parameterized. There are 3 parameters – region, ami_id, and vpc_subnet. region and ami_id are both required as they don’t have a default, but vpc_subnet does not need to be supplied if the default 10.0.0.0/24 network is okay.

Manually, if the variables were supplied as TF_VAR_ environment variables (ie: TF_VAR_region or TF_VAR_ami_id), one could just run terraform apply and watch this thing go. In reality though, I want this going through Rake, and that’s what I do.
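
As a sketch of what that Rake task could look like (an illustration under a few assumptions – the app_ami_id helper is the one described below, the default region is made up, and the real Rakefile may wire this differently), here is the general shape, including where the TF_CMD switch used later to destroy the infrastructure could plug in:

desc 'Deploy (or destroy, with TF_CMD=destroy) the infrastructure'
task :infrastructure do
  # Feed the Packer-built AMI ID and the region to Terraform through
  # TF_VAR_ environment variables, then run apply or destroy.
  tf_cmd = ENV['TF_CMD'] || 'apply'
  ENV['TF_VAR_region'] ||= 'us-west-2'  # assumed default region
  ENV['TF_VAR_ami_id'] ||= app_ami_id   # helper described below
  Dir.chdir('terraform') { sh "terraform #{tf_cmd}" }
end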

The Rakefile

Maybe not the star of the show, but definitely the thing keeping the lights on, is the Rakefile.

In addition to having a full DSL of its own to run builds with, Rake can be extended with standard Ruby. This can come in the form of simple methods within the Rakefile, or full-on helper libraries that can provide a suite of common tasks for a project. As mentioned, I used the bundler/gem_tasks helper to provide the basic RubyGems building tasks (and in the Rakefile I have even disabled a few of them to ensure the gem doesn't get accidentally pushed).

Incidentally, the Rake tasks make up only a small portion of this Rakefile. There are 4 user-defined tasks: berks-cookbooks, ami, infrastructure, and kitchen.

ami calls two prerequisites, build and berks-cookbooks. The former is a gem task, which builds the gem and puts it into the pkg/ directory. The second runs Berkshelf on the cookbooks/packer_payload/Berksfile file, vendoring the cookbooks into the berks-cookbooks directory so that Packer has all cookbooks available to it during its chef-solo run. After these are done, Packer can run, after getting some variables, of course (a rough sketch of these two tasks follows). The same goes for the infrastructure task, which has no prerequisites, but sends some variables to the terraform command. Finally, the kitchen task allows me to easily run tests on the packer_payload cookbook.
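
Here is a minimal sketch of what the berks-cookbooks and ami tasks could look like. Again, this is an illustration rather than the repository's actual Rakefile – in particular, the environment variable names handed to Packer and the way the template reads them are assumptions:

require_relative 'lib/vancluever_hello/version'

desc 'Vendor packer_payload and its dependencies into berks-cookbooks/'
task 'berks-cookbooks' do
  # Resolve the Berksfile so everything the chef-solo provisioner needs
  # is staged under berks-cookbooks/.
  sh 'berks vendor berks-cookbooks ' \
     '--berksfile=cookbooks/packer_payload/Berksfile'
end

desc 'Build the AMI with Packer'
task ami: [:build, 'berks-cookbooks'] do # :build comes from bundler/gem_tasks
  # Hand the gem version and a source AMI to Packer via the environment
  # (the variable names here are assumptions for illustration).
  ENV['PACKER_APP_VERSION']   ||= VanclueverHello::VERSION
  ENV['PACKER_SOURCE_AMI_ID'] ||= ubuntu_ami_id # helper described below
  sh 'packer build packer/ami.json'
end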

The Rakefile helper methods

This is where things really come together. There are a couple of kinds of helpers that I have here. The first kind are very simple and handle a few variables from the environment. Honestly, if Rake had something better to handle this, I would use that instead, but from what I've seen, it doesn't (a bit more RTFM may be necessary). Also, I would rather use the environment than the parameter system Rake provides by default – it's more in line with what the rest of our toolchain uses.

The second kind is where Rake really shines though. These are the functions ubuntu_ami_id, app_ami_id, and rfc3339_to_unix:

  • ubuntu_ami_id uses the ubuntu_ami gem to find the latest Ubuntu AMI for the distribution (default trusty) and root store type (ebs-ssd). This is fed to Packer.
  • After Packer is done and has tagged the AMI with our application tag, app_ami_id can go and get the latest built image for our system. Sorting is helped by rfc3339_to_unix, which converts the AMI creation timestamp into an epoch value (a sketch of these helpers is below). This AMI ID is then fed to Terraform.
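
Here is a rough sketch of what two of these helpers could look like, using the aws-sdk gem and Ruby's Time library. The actual implementations in the repository may differ, and the tag key used in the filter is an assumption for illustration (the ubuntu_ami gem lookup inside ubuntu_ami_id is left out here):

require 'aws-sdk'
require 'time'

# Convert an RFC 3339 timestamp (as returned in an AMI's CreationDate)
# into a Unix epoch integer so that images can be sorted chronologically.
def rfc3339_to_unix(timestamp)
  Time.parse(timestamp).to_i
end

# Find the most recently created AMI in our account that carries the
# application tag (tag key assumed for illustration).
def app_ami_id(region = ENV['TF_VAR_region'] || 'us-west-2')
  ec2 = Aws::EC2::Client.new(region: region)
  images = ec2.describe_images(
    owners: ['self'],
    filters: [{ name: 'tag-key', values: ['vancluever_hello_version'] }]
  ).images
  latest = images.max_by { |image| rfc3339_to_unix(image.creation_date) }
  latest && latest.image_id
end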

The tasks to deploy

Finally, after all of this write up, what is needed to make this run? Three very easy commands:

  • After mirroring the repo, bundle install --binstubs --path vendor/bundle ensures all the dependencies are gathered up within the working directory tree.
  • Then, bundle exec rake ami will build the AMI.
  • Finally, bundle exec rake infrastructure will deploy the infrastructure with Terraform.

And presto! A 3-node ELB cluster in AWS. The Terraform output will have the DNS name of the ELB – after a few minutes, when everything is available, connect over HTTP to port 4567 to see what has been created – refresh to see the page cycle through the hostnames of the three instances.
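
If clicking refresh gets old, something like the following throwaway script (with the ELB DNS name from the Terraform output substituted in for the placeholder) shows the responses rotating across the instances:

require 'net/http'

# Hit the ELB a few times and print the responses – each request should
# come back from a different instance's hostname once all nodes are healthy.
elb_host = ARGV[0] || 'REPLACE-WITH-ELB-DNS-NAME'
5.times do
  puts Net::HTTP.get(URI("http://#{elb_host}:4567/"))
  sleep 1
end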

Destroying the infrastructure

After completing the exercise, I want to shut down these resources to ensure that they are not going to rack up a nice big bill for me. This is easily done with the setup I have:

TF_CMD=destroy bundle exec rake infrastructure

This will destroy all created resources. Afterwards, I delete the AMI and the snapshot that Packer made using the AWS CLI or console. And it’s like it never existed!

Final Word – the terraform.tfstate File

One important thing about this file: terraform.tfstate contains the working state of the infrastructure and must be treated with respect. This is the part that's hidden when CloudFormation is used, as AWS handles it for you.

The terraform.tfstate file can be managed in one of two ways:

  • By checking it into source control (if using the example repo as a starting point, note that it is, by default, in the .gitignore file).
  • By using remote state to store the config. There are several options, such as S3, Consul, or Atlas, and a few others not mentioned here. Remote state also has the advantage of easy retrieval for use with other projects (other Terraform files, for example, can access its outputs).

Final Final Word – Modules

One thing that didn't get mentioned here at all is Terraform's ability to use modules.

I suggest checking this out if you plan to really dive into Terraform. So much of your repeatable infrastructure code can be put in a module. In fact, the template in this example could serve as a module if it lived in its own repository – then, all a template that referenced it would need is the 3 variables defined at the top, which would ultimately reduce the main.tf file in the referencing project to approximately a half-dozen lines of code.


Well, that’s all for this article! As usual, I hope you found the material informative! I hope to get back on track with my initial intention of evaluating AWS services soon, but don’t hold it against me if things are slow. I have been far from inactive though – I will mention some of the things I have been up to in my next post. Until then, take care!

AWS Basics Using CloudFormation (Part 3) – ELB and EC2

This is the third part of a 3-part article covering the basics of AWS through using CloudFormation. For the first part of this article, click here, and for the second, click here.

This is the third and final part in my AWS basics article. So far, I’ve covered CloudFormation and Amazon VPC. This time, I will cover Elastic Load Balancing (ELB), and Amazon EC2, the actual operational pieces that end up getting deployed and serve the web content. The final product is a basic virtual datacenter that load balances across two web servers, deployable with a single command through CloudFormation.

And once more, if you would like to follow along at home, remember to check out the template on the GitHub project.

Elastic Load Balancing (ELBs)

Elastic Load Balancing is AWS’s layer 7 load balancing component of EC2, facilitating the basic application redundancy features that most modern applications need today.

ELB has a feature set that is pretty much what could be expected from a traditional layer 7 load balancer, such as SSL offloading, health checks, sticky sessions, and what not. However, the real fun in using ELB is in what it does to make the job of infrastructure management easier.

As a completely integrated platform service, ELBs are automatically redundant, and can span multiple availability zones without much extra configuration. Metrics and logging are also built in, and can be sent to S3 or CloudWatch.

Other than that, there is not much to really hype up about ELB. Not to say that is a bad thing! So on with the CloudFormation entries.

ELBs in CloudFormation

After the gauntlet I ran with explaining the VPC entries in the sample CloudFormation stack, the ELB entry will be a breeze. Below is the ELB section.

"VCTSLabELB1": {
  "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
  "Properties": {
     "HealthCheck": {
       "HealthyThreshold": "2",
       "Interval": "5",
       "Target": "HTTP:80/",
       "Timeout": "3",
       "UnhealthyThreshold": "2"
     },
     "Listeners": [{
         "InstancePort": "80",
         "InstanceProtocol": "HTTP",
         "LoadBalancerPort": "80",
         "Protocol": "HTTP"
     }],
     "Scheme": "internet-facing",
     "Subnets": [ { "Ref": "VCTSLabSubnet1" } ],
     "SecurityGroups": [ { "Ref": "VCTSElbSecurityGroup" } ],
     "Instances": [
       { "Ref": "VCTSLabSrv1" },
       { "Ref": "VCTSLabSrv2" }
     ],
     "Tags": [ { "Key": "resclass", "Value": "vcts-lab-elb" } ]
  }
}

The resource is of the AWS::ElasticLoadBalancing::LoadBalancer type. It is an internet-facing load balancer (as defined by Scheme), as opposed to an internal load balancer that would only be visible within the VPC. It's also associated with the VCTSLabSubnet1 subnet so that it has public access; this does not affect which instances it can connect to. The instances are defined in the Instances property, which contains references to the two named instances in the EC2 section of the template.

Health checking

The HealthCheck property marks an individual service as healthy (defined by HealthyThreshold) after 2 checks, which brings it back into the cluster; subsequently the health check will also mark a service as unhealthy after 2 failures (defined by UnhealthyThreshold). Note that although this is okay for the purpose that I am using it for, intermittent service failures may cause an undesirable flapping when thresholds are set this low. In that event, set HealthyThreshold to a value that ensures there have been enough successful checks to reasonably determine that the service is available.

Timeout controls how long to wait before marking an individual service as down if a response has not been received. Interval is the time to wait between checks. Both of these values are in seconds. In the example above, the health check waits 3 seconds before marking a service as failed, and the health check itself runs every 5 seconds – so a dead instance would drop out of rotation roughly 10 seconds (two failed checks at a 5-second interval) after it stops responding.

The health check Target takes the syntax of SERVICE:PORT/urlpath. SERVICE can be one of TCP, SSL, HTTP, and HTTPS. /urlpath is only available for the last two (the first two being simple connect open checks and lacking any protocol awareness other than SSL). Also, the response to /urlpath needs to be a 200 OK response – anything else (even a 300 Redirect class code) is considered a failure. In the example above, a check against / over HTTP will be done on any EC2 instances to be sure that the service is up.

Listeners

The listener describes how clients connect to the load balancer and how those connections are routed to instances.

Here, connections come in to port 80 (defined by LoadBalancerPort) and are handled as HTTP connections (defined by Protocol). There are implications from this; namely the X-Forwarded-For HTTP header will be passed, and the connection is statefully passed across as a proxy. Use of HTTP on the front end also means that HTTP or HTTPS needs to be used on the back end. This is indeed the case; the listener is configured to send traffic to instances via HTTP on port 80 (defined by InstanceProtocol and InstancePort).

There are topics that are not covered in this article; namely having to do with SSL offloading (ie: using HTTPS as the front end or instance protocols), persistence, and back-end authentication. It would be wise to check out the Listeners for Your Load Balancer section of the ELB manual to get an idea of all available configurations for listeners.

EC2

There was a time, albeit a long time ago, that AWS was simply EC2 and not much else. Although, it should be noted that SQS was the first AWS service; Jeff Barr’s article on his first 12 years at Amazon is a good read on the launch dates of SQS, EC2, and S3.

Even in the face of today’s AWS massive platform service portfolio, I personally think it’s safe to say that EC2 still has a very major place at Amazon. It serves as the building block for services like ECS (AWS’s Docker service); the EC2 instances that make an ECS pool are, as of this writing, still visible to the end user and require some degree of management. Custom workloads may not fit the bill for use on zero-administration platforms like Lambda. Managed service providers that run their customers off AWS will have a need for the service for quite a long time to come.

EC2 is Amazon’s most basic building block, and the product that gave “Cloud Computing” its name (its acronym itself standing for Elastic Compute Cloud). It is a Xen-based virtualization platform, with features that in today’s world we now take for granted, such as host redundancy and per-use billing, to just name a couple. It set the standard for how a cloud platform handles instances – virtual machines are first rolled into base units called images (which under AWS is called an AMI, standing for Amazon Machine Image), from which instances are created with their own storage laid on top of it.

This small overview does not do the service justice, and there is no way that I would be able to cover all of EC2’s features in this document without losing sight of the goal of setting up a basic VPC with CloudFormation. I would recommend the EC2 documentation for coverage on these topics, in addition to watching this space, where I will more than likely cover these topics as need be.

EC2 in CloudFormation

And now, finally, I come to the last section in this part of the series – the EC2 section of the sample CloudFormation template.

Below is the definition of one of the two EC2 instances that are set up in the template, not counting the NAT instance.

"VCTSLabSrv1": {
  "Type": "AWS::EC2::Instance",
  "Properties": {
    "ImageId": { "Fn::FindInMap": [ "RegionMap", { "Ref": "AWS::Region" }, "AMI" ] },
    "InstanceType": "t2.micro",
    "KeyName": { "Ref": "KeyPair" },
    "SubnetId": { "Ref": "VCTSLabSubnet2" },
    "SecurityGroupIds": [ { "Ref": "VCTSPrivateSecurityGroup" } ],
    "Tags": [ { "Key": "resclass", "Value": "vcts-lab-srv" } ],
    "UserData": { "Fn::Base64": { "Fn::Join": [ "", [
      "#!/bin/bash -xen",
      "/usr/bin/yum -y updaten",
      "/usr/bin/yum -y install httpdn",
      "/sbin/chkconfig httpd onn",
      "echo '<html><head></head><body>vcts-lab-srv1</body></html>' > /var/www/html/index.htmln",
      "echo "/opt/aws/bin/cfn-signal -e $? ",
      "  --stack ", { "Ref": "AWS::StackName" }, " ",
      "  --resource VCTSLabSrv1 ",
      "  --region ", { "Ref": "AWS::Region" }, " ",
      "  && sed -i 's#^/opt/aws/bin/cfn-signal .*\$##g' ",
      "  /etc/rc.local" >> /etc/rc.localn",
      "/sbin/rebootn"
    ]]}}
  },
  "CreationPolicy" : { "ResourceSignal" : { "Count" : 1, "Timeout" : "PT10M" } },
  "DependsOn": "VCTSLabNatGw"
}

EC2 instances are defined with the AWS::EC2::Instance resource type. The instance type is t2.micro, the smallest of the newer generation T2 instance types. Also, remember from the Mappings part of the CloudFormation section that the actual AMI to use is selected from the RegionMap map, based off the region that this instance is launched in.

The KeyName is chosen from the supplied key name when the CloudFormation template was launched (it was either supplied on the command line or through the CloudFormation web interface).

The subnet (specified by SubnetId) is VCTSLabSubnet2, the private subnet, along with its SecurityGroupIds, which in this case is the VCTSPrivateSecurityGroup private subnet security group (which is simply an allow-all, as this group will have no internet access and will be interfacing with the NAT instance and the ELB).

Using userdata for post-creation work

The section after all the other aforementioned properties is where some of the real magic happens. The UserData property is used to create a post-installation shell script that updates the system (/usr/bin/yum update), installs apache (/usr/bin/yum -y install httpd), enables the service (/sbin/chkconfig httpd on), creates an index.html page with the server ID, and then finally injects a self-destructing cfn-signal command that gets run when the server reboots. This is a very simple way to get a fully deployed server in our example.

Note that there is a more complex configuration management system built right into CloudFormation, for when using something like Chef, Puppet, Ansible, or Salt is not possible – check out AWS::CloudFormation::Init. Incidentally, this requires the cfn-init command to be launched, which is not necessarily installed on all Linux AMIs (it is usually available through packages, though, and is already on the system with Amazon Linux). Incidentally as well, cfn-init is generally launched through user data.

Finally, also note that user data needs to be base64 encoded – this is done by the Fn::Base64 section in the example.

Creation policies and dependencies

The last little bit that needs to be mentioned in regards to the EC2 instances is the creation policies and dependencies attached to them. These are not unique to EC2 instances (and hence, they are not properties of that specific resource type, as can be seen from their scope).

Consider the following scenario: the NAT instance has generally the same UserData as the EC2 instances – it updates and reboots as well. During the period that the NAT instance is rebooting, internet access will be unavailable to the 2 web instances in the private subnet. If all 3 instances were set to install at the same time without the web instances waiting for the NAT instance, it is plausible that there would be a time where the web instances would be attempting updates while the NAT instance was rebooting. This would, of course, break updates, and possibly the creation of the CloudFormation stack.

This is what creation policies and dependencies are for. Generally, when using user data, one does not want to count a resource as created until everything is done. In this case, that means the instance has had all of its software updated, any other software installed that it needs (ie: in the event of the web instances), and has been fully rebooted.

The CreationPolicy defined above waits until that is all done. Ultimately, as defined there, it waits for one cfn-signal command to be run for the resource (defined by ResourceSignal), with a 10 minute Timeout (if the format looks weird, it's because it is an ISO 8601 duration). This gives the node enough time to fully update and restart.

And finally, the DependsOn attribute ties the web instances to the NAT instance. This will ensure CloudFormation waits until the NAT instance (referring to it by its named resource, VCTSLabNatGw) has completed creation and received its own cfn-signal before even attempting to create them, giving us an error-free template!

Conclusion

This concludes the intro article. I hope that you found the material informative!

Watch this space for much more in the way of coverage of AWS services as I continue my “world tour”. Not going to say 100% about what is next, but more than likely Route 53 will be on the radar shortly, as possibly will be an introduction to Identity and Access Management and Security Token Service, as both of the latter services are pretty important when organizing security on an AWS account these days, and there is a lot to digest.

See you then!

AWS Basics Using CloudFormation (Part 2) – Amazon VPC

This is the second part of a 3-part article covering the basics of AWS through using CloudFormation. For the first part of this article, click here.

Last week I covered the basics of CloudFormation – giving an idea of how a template is generally structured and all of the specific elements. This time, I am going into detail on how to set up Amazon VPC. Again, Amazon VPC is the first part of almost any AWS deployment – think of it as the datacenter and network layer. Not much else can happen unless these two items are present, can they? There are a few exceptions to this rule, such as Route 53, Amazon’s DNS hosting service, but even Route 53 can be provisioned in a VPC to provide private hosted zones that are not externally available.

And again, if you would like to follow along at home, remember to check out the template on the GitHub project.

Regions and Availability Zones

Two key concepts that should be explained while discussing VPCs are the concepts of regions and availability zones.

As a modern computing infrastructure that serves a wide variety of clients, AWS datacenters are spread throughout the world. These are known as regions. Examples of the regions would be the AWS datacenters in Virginia (us-east-1), North California (us-west-1), and Oregon (us-west-2).

Within each of these regions are availability zones, which according to the EC2 FAQ, are so separated from each other that even physical failures such as power outages or even fire at one availability zone would not affect others.

VPCs can span availability zones but not regions. In order to interconnect regions, a peer needs to be set up, or the VPCs need to be connected via other means, such as using a VPN.

Private and Public Networks

It takes a little getting used to how private and public networks work in a VPC if one is used to engineering their own networks.

First off, the difference between a private and a public subnet is very subtle – public subnets have an internet gateway attached to them, and instances require a public IP address attached to them to be able to get internet access. Instances that do not have public IPs cannot access the internet on their own, regardless if they are in a public subnet.

To solve the problem, one can use a NAT instance. The concept of this is explained below. What I do not cover are some of the other considerations that come with this "manual" process, such as redundancy and security of the network. Hopefully, Amazon will consider automating this piece of the infrastructure soon, as it is the single manual setup element in the VPC platform and hence probably the one most prone to failure.

Advanced VPC topics

I do not cover more in-depth VPC topics here, however due to the scale of some of Amazon’s customers, it is only natural that they would have a wide range of options for connecting enterprise networks to a VPC.

Consult the Network Administrator Guide for help on these integration topics (such as using VPNs, or routing protocols like BGP).

VPCs in CloudFormation

As there are a lot of VPC components in the CloudFormation template, I have broken some of the items up into respective sub-sections.

The root VPC resource is defined as an AWS::EC2::VPC resource:

"VCTSLabVPC1": {
  "Type": "AWS::EC2::VPC",
  "Properties": {
    "CidrBlock": "10.0.0.0/16",
    "EnableDnsHostnames": true,
    "Tags": [ { "Key": "resclass", "Value": "vcts-lab-vpc" } ]
  }
}

A couple of other things to note here: CidrBlock cannot be any bigger than a /16, so if you get errors mentioning something about the network address being invalid, try reducing the network size. EnableDnsHostnames allows DNS hosts to be assigned to instances as they start up in this VPC – having this off may be useful if DNS will be managed outside of AWS, but otherwise it’s generally a good idea to enable this.

Subnets, gateways, and routes

There are a couple of examples on subnets in the CloudFormation template, since it makes use of a NAT instance as well. This section will discuss the default (public) subnet to start with, and I will expand into the private subnet during the NAT instance section.

"VCTSLabSubnet1": {
"Type": "AWS::EC2::Subnet",
  "Properties": {
    "CidrBlock": "10.0.0.0/24",
    "MapPublicIpOnLaunch": true,
    "Tags": [
      { "Key": "resclass", "Value": "vcts-lab-subnet" },
      { "Key": "subnet-type", "Value": "public" }
    ],
    "VpcId": { "Ref": "VCTSLabVPC1" }
  }
},
"VCTSLabGateway": {
  "Type": "AWS::EC2::InternetGateway",
  "Properties": {
    "Tags": [ { "Key": "resclass", "Value": "vcts-lab-gateway" } ]
  }
},
"VCTSLabGatewayAttachment": {
  "Type": "AWS::EC2::VPCGatewayAttachment",
  "Properties": {
    "InternetGatewayId": { "Ref": "VCTSLabGateway" },
    "VpcId": { "Ref": "VCTSLabVPC1" }
  }
},
"VCTSLabPublicRouteTable": {
  "Type": "AWS::EC2::RouteTable",
  "Properties": {
    "VpcId": { "Ref": "VCTSLabVPC1" },
    "Tags": [
      { "Key": "resclass", "Value": "vcts-lab-routetable" },
      { "Key": "routetable-type", "Value": "public" }
    ]
  }
},
"VCTSLabPublicDefaultRoute": {
  "Type": "AWS::EC2::Route",
  "Properties": {
    "DestinationCidrBlock": "0.0.0.0/0",
    "GatewayId": { "Ref": "VCTSLabGateway" },
    "RouteTableId": { "Ref": "VCTSLabPublicRouteTable" }
  }
},
"VCTSLabPublicSubnet1Assoc": {
  "Type": "AWS::EC2::SubnetRouteTableAssociation",
  "Properties": {
    "SubnetId": { "Ref": "VCTSLabSubnet1" },
    "RouteTableId": { "Ref": "VCTSLabPublicRouteTable" }
  }
}

There are a lot of things going on here. First, the subnet is defined with the AWS::EC2::Subnet resource type. The MapPublicIpOnLaunch property makes this the de facto public subnet, as anything launched in this subnet will get a public IP address. However, this is only part of the story, as without a gateway, public IP address assignments will not be possible or functional.

To that end, various routing resources are created: a gateway (type AWS::EC2::InternetGateway), a route table (AWS::EC2::RouteTable), and a default route (resource type AWS::EC2::Route). This effectively makes a route table with a default route; however, it also needs to be attached to the gateway that is created and to the subnet. This is done by using the AWS::EC2::VPCGatewayAttachment and AWS::EC2::SubnetRouteTableAssociation resources.

At this point, the subnet is now ready for use; however, without security policies it may not be of much use, or may be extremely insecure.

Security groups

Shown below is the CloudFormation resource for the public subnet’s security group. This is defined as type AWS::EC2::SecurityGroup, and is named for the fact that it will mainly run on the NAT instance only.

"VCTSNatSecurityGroup": {
  "Type": "AWS::EC2::SecurityGroup",
  "Properties": {
    "Tags": [ { "Key": "resclass", "Value": "vcts-lab-sg" } ],
    "GroupDescription": "NAT (External) VCTS security group",
    "VpcId": { "Ref": "VCTSLabVPC1" },
    "SecurityGroupIngress": [
      { "IpProtocol": "tcp", "CidrIp": "10.0.1.0/24", "FromPort": "80", "ToPort": "80" },
      { "IpProtocol": "tcp", "CidrIp": "10.0.1.0/24", "FromPort": "443", "ToPort": "443" },
      { "IpProtocol": "tcp", "CidrIp": { "Ref": "SSHAllowIPAddress" }, "FromPort": "22", "ToPort": "22" }
    ],
    "SecurityGroupEgress": [
      { "IpProtocol": "tcp", "CidrIp": "0.0.0.0/0", "FromPort": "22", "ToPort": "22" },
      { "IpProtocol": "tcp", "CidrIp": "0.0.0.0/0", "FromPort": "80", "ToPort": "80" },
      { "IpProtocol": "tcp", "CidrIp": "0.0.0.0/0", "FromPort": "443", "ToPort": "443" }
    ]
  }
}

Access rules are defined through the SecurityGroupIngress (inbound) and SecurityGroupEgress (outbound) properties. Here, only SSH traffic is being allowed inbound from outside the VPC, from the IP address supplied via the SSHAllowIPAddress parameter. SSH is also being allowed general outbound – this allows the ability to "bounce" off the NAT instance into the private network. HTTP and HTTPS are being allowed both inbound and outbound generally – this might be confusing at first, seeing as there is some redundancy with this and the (not shown) load balancer access list, but consider the fact that the NAT instance has to handle traffic going in both directions – HTTP needs to come in to the NAT instance and then back out to the internet, so the access list has to allow for both directions.

Two security groups are not shown here – the load balancer group (named VCTSElbSecurityGroup) and the private group (named VCTSPrivateSecurityGroup). The former simply allows HTTP traffic generally, and the latter is a general allow for any traffic flowing into private instances.

One last thing to note – keep in mind that there are two kinds of network security concepts in a VPC: security groups, as shown here, and network ACLs, which I do not discuss. The former are applied at the instance level, and the latter are applied at the subnet level. At the very least, there should be some security applied at the instance level; adding network ACLs can provide some fallback beyond that. As an example, the private subnet is very loose, and could benefit from an ACL being applied to it, just in case a public address got assigned to it (even though, as the CloudFormation template is currently set up, that is impossible). The VPC Security Comparison document gives a great breakdown of the differences.

NAT instances

Below are the CloudFormation template snippets for the NAT instance. There is some information overlap here, as the subnets and route table items have mainly been explained already, and the network security group has already been shown above, so it is not being shown again. Given that, I will just show the relevant subnet and route entries, and then briefly explain the NAT instance itself, which is an EC2 instance – and of course I will be describing EC2 in detail later on.

Here are the private subnet and routes. Note how MapPublicIpOnLaunch is off. Also, there is a hook into the availability zone that the public subnet is in, as load balancing can break if the private and public subnets are inconsistently created in different availability zones. Also, the private subnet uses the NAT instance as the default route.

"VCTSLabSubnet2": {
"Type": "AWS::EC2::Subnet",
  "Properties": {
    "CidrBlock": "10.0.1.0/24",
    "MapPublicIpOnLaunch": false,
    "Tags": [
      { "Key": "resclass", "Value": "vcts-lab-subnet" },
      { "Key": "subnet-type", "Value": "private" }
    ],
    "VpcId": { "Ref": "VCTSLabVPC1" },
    "AvailabilityZone": { "Fn::GetAtt" : [ "VCTSLabSubnet1", "AvailabilityZone" ] }
  }
},
"VCTSLabPrivateRouteTable": {
  "Type": "AWS::EC2::RouteTable",
  "Properties": {
    "VpcId": { "Ref": "VCTSLabVPC1" },
    "Tags": [
      { "Key": "resclass", "Value": "vcts-lab-routetable" },
      { "Key": "routetable-type", "Value": "private" }
    ]
  }
},
"VCTSLabPrivateDefaultRoute": {
  "Type": "AWS::EC2::Route",
  "Properties": {
    "DestinationCidrBlock": "0.0.0.0/0",
    "InstanceId": { "Ref": "VCTSLabNatGw" },
    "RouteTableId": { "Ref": "VCTSLabPrivateRouteTable" }
  }
},
"VCTSLabPrivateSubnet2Assoc": {
  "Type": "AWS::EC2::SubnetRouteTableAssociation",
  "Properties": {
    "SubnetId": { "Ref": "VCTSLabSubnet2" },
    "RouteTableId": { "Ref": "VCTSLabPrivateRouteTable" }
  }
}

And here is the NAT instance itself. Again, I am not describing it here in detail, as it is an EC2 instance and will be explained in its own section. However, do note that the AMI does map to a list of Amazon Linux AMIs specifically configured for NAT use – there are scripts on these instances that set up the NAT table and IP forwarding. Also, the NAT instance does not have 2 interfaces, one in each of the subnets, like a traditional router – traffic flows in through the VPC's own routers in a way that only an IP in the public subnet is required.

"VCTSLabNatGw": {
  "Type": "AWS::EC2::Instance",
  "Properties": {
    "ImageId": { "Fn::FindInMap": [ "NatRegionMap", { "Ref": "AWS::Region" }, "AMI" ] },
    "InstanceType": "t2.micro",
    "KeyName": { "Ref": "KeyPair" },
    "SubnetId": { "Ref": "VCTSLabSubnet1" },
    "SourceDestCheck": false,
    "SecurityGroupIds": [ { "Ref": "VCTSNatSecurityGroup" } ],
    "Tags": [ { "Key": "resclass", "Value": "vcts-lab-natgw" } ],
    "UserData": { "Fn::Base64": { "Fn::Join": [ "", [
      "#!/bin/bash -xen",
      "/usr/bin/yum -y updaten",
      "echo "/opt/aws/bin/cfn-signal -e $? ",
      "  --stack ", { "Ref": "AWS::StackName" }, " ",
      "  --resource VCTSLabNatGw ",
      "  --region ", { "Ref": "AWS::Region" }, " ",
      "  && sed -i 's#^/opt/aws/bin/cfn-signal .*\$##g' ",
      "  /etc/rc.local" >> /etc/rc.localn",
      "/sbin/rebootn"
    ]]}}
  },
  "CreationPolicy" : { "ResourceSignal" : { "Count" : 1, "Timeout" : "PT10M" } }
}

Next Article – ELB and EC2

Stay tuned for the conclusion of this 3-part article, where I discuss setting up ELB and EC2 and their respective items in the CloudFormation template!

AWS Basics Using CloudFormation (Part 1) – Introduction to CloudFormation

This article is the first in many – as mentioned in the last article, I will be writing more articles over the course of the next several months on AWS, touching on as much of the service as I can get my hands on.

For my first article, I am starting with the basics – CloudFormation, Amazon VPC (Virtual Private Cloud), Elastic Load Balancing, and finally, EC2. The services that are covered here serve as some of the basic building blocks of an Amazon infrastructure, and some of the oldest components of AWS. This will serve as an entry point not only into further articles, but for myself, and you the reader, into learning more about AWS, and being more comfortable with the tools that manage it.

However, this article got so large that I have had to separate it into 3 parts! So, for the first article, I will be mainly covering CloudFormation, the second one will cover VPC, and the third one will cover ELB and EC2.

Viewing the Technical Demo

All of the items covered in this article have been assembled into a CloudFormation template that can be downloaded from the github page:

https://github.com/vancluever/aws-basics-using-cloudformation

There is a README there that provides instructions on how to download and use the template.

Introduction

I selected the first features of AWS to cover in a way that could give someone who is already familiar with the basic concepts of modern cloud computing and devops (which include virtual infrastructure, automation, and configuration management) an idea of what that means when dealing with AWS and its products. Ultimately, this meant creating an example that would produce a full running basic "application" that could be created and destroyed with a single command.

CloudFormation is Amazon’s primary orchestration product, and covers a wide range of services that make up the core of AWS’s infrastructure. It is used in this article to manage every service I touch – besides IAM and access keys, which are not covered here, nothing in this example has been set up through the AWS console. Given that the aforementioned two items have been set up, all that is necessary to create the example is a simple aws cloudformation CLI command.

Amazon VPC is the modern (and current) virtual datacenter platform that makes up the base of AWS. From a VPC, networks, gateways, access lists, and peer connections (such as VPN endpoints and more) are made to cover both the needs of a public-facing application and the private enterprise. It is pretty much impossible to have a conversation about AWS these days without using VPC.

Amazon EC2 is one of Amazon's oldest and most important products. It is the solution that gave "the cloud" its name, and while Amazon has created a large number of platform services that have removed the need for EC2 in quite a few applications (indeed, one can run an entire application these days in AWS without a single EC2 instance), it is still highly relevant, and will continue to be as long as there is a need to run a server and not a service. Products such as VPC NAT instances (covered in part 2) and Amazon EC2 Container Service (not covered here) also use EC2 directly and visibly, so its importance in the service is still directly apparent to the user.

I put these three products together in this article – with CloudFormation, a VPC is created. This VPC has two subnets, a public subnet and a private subnet, along with a NAT instance, so that one can see some of the gotchas that can be encountered when setting up such infrastructure (and hopefully avoid some of the frustration that I experienced, mentioned in the appropriate section). An ELB is also created for two EC2 instances that will, upon creation, do some basic configuration to make themselves available over HTTP and serve up a simple static page that allows one to see both the ELB and EC2 instances in action.

CloudFormation

CloudFormation is Amazon’s #1 infrastructure management service. With features that cover both deployment and configuration management, the service supports over two dozen AWS products, and can be extended to support external resources (and AWS processes not directly supported by CloudFormation) via custom resources.

One does not necessarily need to start off with CloudFormation completely from scratch. There are templates available at the AWS CloudFormation Templates page that have both examples of full stacks and individual snippets of various AWS services, which can be a great time saver in building custom templates.

The following few sections cover CloudFormation elements in further detail. It is a good idea to consult the general CloudFormation User Guide, which provides a supplement to the information below, and is also a good reference while designing templates, be it starting from scratch or using existing templates.

CloudFormation syntax synopsis

Most CloudFormation items (aside from the root items like template version and description) can be summarized as being a name/type pairing. Basically, given any certain type, be it parameters, resources, mappings, or anything else, items in CloudFormation are generally assigned a unique name, and then a type. Consider the following example parameter:

"Parameters": {
  "KeyPair": {
    "Type": "AWS::EC2::KeyPair::KeyName",
    "Description": "SSH key that will be used for EC2 instances (set up in web console)",
    "ConstraintDescription": "needs to be an existing EC2 keypair (set up in web console)"
  }
}

This parameter is an AWS::EC2::KeyPair::KeyName parameter named KeyPair. The latter name can (and will) be referenced in resources, like the EC2 instances (see the below section on EC2).

Look in the below sections for CloudFormation’s Ref function, which will be used several times; this function serves as the basis for referencing several kinds of CloudFormation elements, not just parameters.

Parameters and Outputs

Parameters are how data gets in to a CloudFormation template. This can be used to do things like get IDs of SSH keys to assign to instances (as shown above), or IP addresses to assign to security group ACLs. These are the two items parameters are used for in the example.

Outputs are how data gets out of CloudFormation. Data that is a useful candidate for being published through outputs includes instance IP addresses, ELB host names, VPC IDs, and anything else that may be useful to a process outside of CloudFormation. This data can be read through the UI, or through the JSON data produced by the aws cloudformation describe-stacks CLI command (and the API as well).
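
For example, the same outputs can be pulled programmatically – here is a small sketch using the Ruby aws-sdk gem (the stack name and region are placeholders; adapt as needed):

require 'aws-sdk'

# Print every output of a CloudFormation stack as "Key: Value".
cfn = Aws::CloudFormation::Client.new(region: 'us-east-1')
stack = cfn.describe_stacks(stack_name: 'vcts-lab').stacks.first
stack.outputs.each do |output|
  puts "#{output.output_key}: #{output.output_value}"
end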

Parameter syntax

Let's look at the other example in the CloudFormation template, the SSHAllowIPAddress parameter. This example uses more generic data types and gives a bigger picture as to what is possible with parameters. Note that there are several data types that can be used, which include both typical generic data types, such as Strings and Integers, and AWS-specific types such as the AWS::EC2::KeyPair::KeyName parameter used above.

"SSHAllowIPAddress": {
  "Type": "String",
  "AllowedPattern": "\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\/32",
  "Description": "IP address to allow SSH from (only /32s allowed)",
  "ConstraintDescription": "needs to be in A.B.C.D/32 form"
}

This parameter is of type String, which also means that the AllowedPattern constraint can be used on it, which is used here to create a dotted quad regular expression, with the /32 netmask being explicitly enforced. JSON/Javascript syntax applies here, which explains the somewhat excessive nature of the backslashes.

Parameters are referenced using the Ref function. The snippet below gives an example of the SSHAllowIPAddress‘s reference:

"VCTSNatSecurityGroup": {
  "Type": "AWS::EC2::SecurityGroup",
  "Properties": {
    ...
    "SecurityGroupIngress": [
      ...
      { "IpProtocol": "tcp", "CidrIp": { "Ref": "SSHAllowIPAddress" }, "FromPort": "22", "ToPort": "22" }
    ],
    ...
  }
}

Ref is a very simple function, usually just used to refer back to a CloudFormation element. It is not restricted to parameters – it is used with parameters, mappings, and resources alike. Further examples will be given below, so there should be a good idea of how to use it by the end of this article.

Output syntax

Below is the NatIPAddr output, pulled from the example.

"Outputs": {
  "NatIPAddr": {
    "Description": "IP address of the NAT instance (shell to this address)",
    "Value": { "Fn::GetAtt": [ "VCTSLabNatGw", "PublicIp" ] }
  },
  ...
}

The nature of outputs is pretty simple. The data can be pulled any way that allows one to get the needed data. Most commonly, this will be from the Fn::GetAtt function, which can be used to get various attributes from resources, or possibly Ref, which in the case of resources usually references a specific primary attribute.

Mappings

Mappings allow a CloudFormation template some flexibility. The best example of this is allowing the CloudFormation template to be used in multiple regions, by mapping AMIs (instance images) to their respective regions.

Mapping syntax

This is the one in the example template, and it maps to Amazon Linux AMIs. These are chosen because they support cfn-init out of the box, which was going to be used in the CloudFormation template to run some commands via the AWS::CloudFormation::Init resource type in the EC2 section, but I opted to use user data instead (I cover this in further detail in part 3).

"Mappings": {
  "RegionMap": {
    "us-east-1": { "AMI": "ami-1ecae776" },
    "us-west-1": { "AMI": "ami-e7527ed7" },
    "us-west-2": { "AMI": "ami-d114f295" }
  },
  "NatRegionMap": {
    "us-east-1": { "AMI": "ami-303b1458" },
    "us-west-1": { "AMI": "ami-7da94839" },
    "us-west-2": { "AMI": "ami-69ae8259" }
  }
}

The above RegionMap is then referenced in EC2 instances like so:

"VCTSLabNatGw": {
  "Type": "AWS::EC2::Instance",
  "Properties": {
    "ImageId": { "Fn::FindInMap": [ "NatRegionMap", { "Ref": "AWS::Region" }, "AMI" ] },
    "InstanceType": "t2.micro",
    ...
  }
}

This is one of many ways to use mappings, and more complex structures are possible. Check the documentation for further examples (such as how to expand the above map to make use of processor architecture in addition to the region).

Resources

Resources do the real work of CloudFormation. They create the specific elements of the stack and interface with the parameters, mappings, and outputs to do the work necessary to bring up the stack.

Since resources vary so greatly in what they need in real world examples, I explain each service that the template makes use of in their respective sections (ie: the VPC, ELB, and EC2 sections). However, some common elements are explained here in brief, as to give a primer on how they can be used to further control orchestration of the stack. Again, further detail on how to use these are shown as examples with the various AWS services explained below.

Creation Policies, Dependencies, and Metadata

A CreationPolicy can be used as a constraint to determine when a resource is counted as created. For example, this can be used with cfn-signal on an EC2 instance to ensure that the resource is not marked as CREATE_COMPLETE until all reasonable post-installation work has been done on an instance (for example, after all updates have been applied or certain software has been installed).

A dependency (defined with DependsOn) is a simple association to another resource that ties its creation with said parent. For example, the web server instances in the example do not start creation until the NAT instance is complete, as they are created in a private network and will not install properly unless they have internet access available to them.

Metadata can be used for a number of things. The example commonly explained is the use of the AWS::CloudFormation::Init metadata type to provide data to cfn-init, which is a simple configuration management tool. This is not covered in the example, as the work that is being done is simple enough to be done through UserData.

All 3 of these concepts are touched on in further detail in part 3, when EC2 and the setup of an instance in CloudFormation are discussed.

Next Article – Amazon VPC

That about covers it for the CloudFormation part of this article. Stay tuned for the next part, in which I cover Amazon VPC basics, in addition to how it is set up in CloudFormation!

AWS NYC Summit 2015 Recap

When one gets the chance to go to New York, one takes it, as far as I’m concerned. So when PayByPhone offered to send me to the AWS NYC Summit, I totally jumped at the chance. In addition to getting to stand on top of two of the world’s tallest buildings, take a bike across the Brooklyn Bridge, and get some decent Times Square shots, I got to learn quite a bit about ops using AWS. Win-win!

The AWS NYC summit was short, but definitely one of the better conferences that I have been to. I did the Taking AWS Operations to the Next Level “boot camp” – a day-long training course – during the pre-day, and attended several of the breakouts the next day. All of them had great value and I took quite a few notes. I am going to try and abridge all of my takeaways on the products, “old” and new, that caught my eye below.

CloudFormation

CloudFormation was a product that was covered in my pre-day course and also during one of the breakouts that I attended. It's probably the most robust ops product that is offered on AWS today, supporting, from my impressions, the most products of any of the automation platform services on offer.

The idea with CloudFormation, of course, is that infrastructure is codified in a JSON-based template, which is then used to create “stacks” – entities that group infrastructure and platform services in ways that can be duplicated, destroyed, or even updated with a sort of intelligent convergence, adding, removing, or changing resources depending on what has been defined in the template. Naturally, this can be integrated with any sort of source control so that changes are tracked, and with a CI and deployment pipeline to further automate things.

One of the neat features that was mentioned in the CloudFormation breakout was the ability for CloudFormation to use Lambda-backed resources to interface with AWS features that CloudFormation does not have native support for, or even non-AWS products. All of this makes CloudFormation definitely seem like the go-to product if one is planning on using native tools to deploy AWS infrastructure.
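As a rough sketch, a Lambda-backed custom resource is declared with a Custom:: type and a ServiceToken pointing at the function – the resource name, function name, and property here are all hypothetical:

"DomainLookup": {
  "Type": "Custom::DomainLookup",
  "Properties": {
    "ServiceToken": { "Fn::GetAtt": [ "DomainLookupFunction", "Arn" ] },
    "DomainName": "example.com"
  }
}

Any extra properties are passed along to the function, and whatever data it returns can be referenced elsewhere in the template with Fn::GetAtt.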

OpsWorks

OpsWorks is Amazon’s Chef product, although it’s quite a bit more than that. It seems mainly like a hybrid of a deployment system like CloudFormation, with Chef being used to manage configuration through several points of the lifecycle. It uses chef-solo or chef-zero depending on the OS it is being employed for (Linux is chef-solo and Chef 11, and Windows is Chef 12 and chef-zero), and since it is all run locally, there is no Chef server.

In OpsWorks, an application stack is deployed using components called Layers. Layers exist for load balancing, application servers, and databases, in addition to others such as caching, and even custom ones whose functionality is created through Chef cookbooks. With support for even some basic monitoring, one can probably run an entire application in OpsWorks without even touching another AWS ops tool.

AWS API Gateway

A few new products were announced at the summit – but API Gateway was the one killer app that caught my eye. Ultimately this means that developers do not really need to mess around with frameworks any more to get an API, or even get a web application off the ground – just hook up the endpoints with API Gateway, integrate them with Lambda, and it’s done. With the way that AWS’s platform portfolio is looking these days, I’m surprised that this one was so late to the party!

CodeDeploy, CodePipeline, and CodeCommit

These were presented to me in a breakout that gave a bit of a timeline on how Amazon internally developed their own deployment pipelines. Ultimately they segued into these three tools.

CodeDeploy is designed to deploy an application not only to AWS, but also to on-premises resources, and even other cloud providers. The configuration is YAML-based and pretty easy to read. As part of its deployment feature set, it does offer some orchestration facilities, so there is some overlap with some of AWS’s other tools. It also integrates with several kinds of source control platforms (ie: GitHub), other CI tools (ie: Jenkins), and configuration management systems. The killer feature for me is its support for rolling updates, automating the deployment of new infrastructure while draining out the old.

CodePipeline is a release modeling and workflow engine, and can be used to model a release process, working with CodeDeploy, to automate the deployment of an application from source, to testing, and then to production. Tests can be automated using services like RunScope or Ghost Inspector, to name just a couple. It definitely seems like these two – CodePipeline and CodeDeploy – are naturally coupled to give a very smooth software deployment pipeline – on AWS especially.

CodeCommit is AWS’s foray into source control, a la GitHub, Bitbucket, etc. Aside from all the general things that one would hopefully expect from a hosted source control service (ie: at-rest encryption and high availability), expect a few extra AWS-ish perks, like the ability to use an existing IAM scheme to assign ACLs to repositories. Unlimited storage per repository was mentioned, or at least implied, but there does appear to be some metering – see here for pricing.

EC2 Container Service (ECS)

The last breakout I checked out was one on EC2 Container Service (ECS). This is AWS’s Docker integration. The breakout itself spent a bit of time on an intro to containers themselves, which is out of the scope of this article but is a topic I may touch on at a later time (Docker is on “the list” of things I want to evaluate and write on). Docker is a great concept. It rolls configuration management and containerization into one tool for deploying applications and gives a huge return on infrastructure in the form of speed and density. The one unanswered question has been clustering, but there have been several 3rd-party solutions for that for a while, and Docker itself has a solution that is still fairly new.

ECS does not appear to be Swarm, but rather its own mechanism. In addition to clustering, ECS works with other AWS services, such as Elastic Load Balancing (ELB), Elastic Block Storage (EBS), and back office things like IAM and CloudTrail. Templates are rolled into entities called tasks, where one can also specify resource requirements, volumes to use, and more. Then, one can use this task to create a run task, which will run for a finite amount of time and then terminate, or a service, which will ensure that the task stays up and running indefinitely. Also, the ability exists to specify an instance count for a task, which is then spread out across the instance pool.
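To give a rough idea of the shape of a task, here is a minimal, hypothetical task definition – the family, container name, image, and resource values are all placeholders:

{
  "family": "web-app",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "nginx:latest",
      "cpu": 256,
      "memory": 512,
      "essential": true,
      "portMappings": [
        { "containerPort": 80, "hostPort": 80 }
      ]
    }
  ]
}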

This is probably a good time to mention that ECS still does not provide abstraction across the container infrastructure. There is a bit of automation to help with this, but EC2 instances still need to be manually spun up and added to the ECS pool, from which ECS then derives the amount of available resources and how it distributes the containers. One would assume that there are plans to eliminate the visibility of the EC2 layer from the user – it seems like Amazon is aware of requests to do this, as was mentioned when I asked the question.

Ultimately, it looks like there is still some work to do to make ECS completely integrated. Auto-scaling is another feature that is still a manual process. For now, there are documented examples on how to do things like glue auto-scaling together using Lambda, for example.

Wrap-Up

You can see all – or at least most – of the presentations and breakouts that happened at the summit (and plenty of other conferences) – on the AWS YouTube page.

And as for me, expect to see, over the next several weeks, a product-by-product review of as much of AWS as I can eat. I will be starting with some of the products that were discussed above, with a mix of some of the old standbys in as well to ensure I cover things from the ground up. Watch this space!

Breaking Into the CentOS Cloud Image, and Modern Linux Password Recovery

In this day and age of cloud computing, installing an image from scratch is something that is probably not needed very often, if at all, and is probably only needed when installing on physical hardware. Major Linux vendors such as Ubuntu and CentOS offer cloud versions of their images. Using these images with a compatible system ensures that one can get started on a fresh Linux install very quickly, be it with a public cloud like EC2, an OpenStack infrastructure, or even just a basic KVM host.

However, if it’s desired to use one of these images in a non-cloud setup, such as the latter scenario, there are some things that need to be done first. I will be using the CentOS image as an example.

Step 1 – Resetting the Root Password

After the image has been downloaded and added into KVM, the root password needs to be reset.

This is actually a refresh of an age-old trick for getting into Linux systems. Before, it was as easy as adding init=/bin/bash to the end of the kernel parameters in GRUB, but times have changed a bit. This method actually still works, but just needs a couple of additions to get it to go. Read on below.

A note – SELinux

SELinux is enabled on the CentOS cloud images. The steps below include disabling it when the root password is reset. Make sure this is done, or you will have a bad time. Note that method #1 also includes enforcing=0 as a boot parameter, so if this step is missed, you have an opportunity to do so in the current boot session before the system is rebooted.

Method #1 – “graceful” thru rd.break

This is the Red Hat supported method as per the manual.

rd.break stops the system just before control is handed from the initial ramdisk to the actual system. There are some situations where this can cause issues, but these are rare, and a cloud image is far from one of them.

Reboot the system, and abort the GRUB menu timeout by mashing arrow keys as soon as the boot prompt comes up. Then, select the default option (usually the top, the latest non-rescue kernel option) and press “e” to edit the parameters.

Make sure the only parameters on the linux or linux16 line are:

  • The path to the kernel (should be the first option, probably referencing the /boot directory)
  • The path or ID for the root filesystem (the root= option)
  • The ro option

Then, supply the rd.break enforcing=0 options at the end. Press Ctrl-X to boot.
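The edited kernel line should end up looking something like this (the kernel version and root device are placeholders and will vary by install):

linux16 /boot/vmlinuz-[kernel version] root=[root device or UUID] ro rd.break enforcing=0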

This will boot the system into an emergency shell that does not require a password, right before the point where the initial ramdisk would normally hand off to the real system.

When the system is in this rescue state, the root filesystem is mounted read-only on /sysroot. As such, a few extra steps are required to get the filesystem mounted read-write so that the password can be reset properly. Run:

mount -o remount,rw /sysroot
chroot /sysroot
passwd root
[change password here]
vi /etc/sysconfig/selinux
[set SELINUX=disabled]
exit
mount -o remount,ro /sysroot
exit

This loads /sysroot into a chroot shell. The password will be prompted for on the passwd root line. Also, make sure to edit /etc/sysconfig/selinux and set SELINUX=disabled. After both of these are done, exit the chroot, re-mount the filesystem read-only again to flush any writes, and exit the emergency shell. The system will now either reboot or just resume booting.

Method #2 – old-school init=/bin/bash

init=/bin/bash still works, funny enough, but there are some options that need to be removed on the CentOS system, as mentioned in method #1.

Reboot the system, and abort the GRUB menu timeout by mashing arrow keys as soon as the boot prompt comes up. Then, select the default option (usually the top, the latest non-rescue kernel option) and press “e” to edit the parameters.

Make sure the only parameters on the linux or linux16 line are:

  • The path to the kernel (should be the first option, probably referencing the /boot directory)
  • The path or ID for the root filesystem (the root= option)
  • The ro option

Then, supply the init=/bin/bash option at the end. Press Ctrl-X to boot.

After the initial boot the system is tossed into a root shell. Unlike method #1, this shell is already in the system, and / is the root, without a chroot being necessary. Simply run the following:

mount -o remount,rw /
passwd root
[change password here]
vi /etc/sysconfig/selinux
[set SELINUX=disabled]
mount -o remount,ro /

The password will be prompted for on the passwd root command. Also, make sure to edit /etc/sysconfig/selinux and set SELINUX=disabled. After both of these are done, the filesystem should be remounted read-only to ensure that all writes are flushed. From here, simply reboot through a hard reset or ctrl-alt-del.

Last Few Steps

Now that the system can be rebooted and logged into, there are a few final steps:

Remove cloud-init

This is probably spamming the console right about now. Go ahead and stop and remove it.

systemctl stop cloud-init.service
yum -y remove cloud-init

Enable password authentication

Edit /etc/ssh/sshd_config and change PasswordAuthentication to yes. Make sure the change is made to the line that is not commented out. Then restart SSH:

systemctl restart sshd.service

The cloud image should now be ready for general use.

Honorable Mention – Packer

All of this is not to say that tools like Packer don’t have great merit in image creation; in fact, if you wanted to build a generic image for KVM, rather than just grabbing one as described above, there is a qemu builder that can do just that. Doing this will also ensure that the image lacks the cloud-init tools and whatnot that you may not need in your application.
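As a very rough sketch of what that looks like (the ISO URL, checksum, and sizes are placeholders, option names may differ between Packer versions, and a real template would also need a kickstart or preseed plus provisioners), a qemu builder definition is along these lines:

{
  "builders": [
    {
      "type": "qemu",
      "iso_url": "http://example.com/CentOS-7-x86_64-Minimal.iso",
      "iso_checksum": "[checksum here]",
      "iso_checksum_type": "sha256",
      "disk_size": 10240,
      "format": "qcow2",
      "ssh_username": "root",
      "ssh_password": "[password here]",
      "shutdown_command": "shutdown -P now"
    }
  ]
}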