Two companies that I worked for recently used ECS (Elastic Container Service) as their container orchestration tool.
If you have ever used it, you know that it has somewhat limited observability out of the box.
You have two options to spin up containers on ECS:
- Fargate, which is a serverless container engine
- EC2 instances managed by you and your team
With Fargate you don’t really need insight into the infrastructure running your containers; it’s serverless.
A more robust and less expensive solution is to host your own fleet of EC2 instances that join the ECS cluster. With that approach you need to manage them and know what’s going on there.
In this blog post I will outline a possible
prometheus integration with ECS.
My main goal was to improve observability by introducing node monitoring with node-exporter and
cadvisor, and by ingesting application metrics exposed by ephemeral containers.
As much as I love AWS, I’m not really a fan of CloudWatch. Using it as a monitoring system just for the sake of being Cloud Native doesn’t make much sense to me, as it has issues and limitations. I guess some people think that it’s good because it’s made by AWS and works right off the bat.
The observability it gives is not the best, and the more you use it, the more problems you encounter. Let me just point out a couple of major issues for me:
- CloudWatch Alarms can’t monitor ephemeral things like EBS volumes. Imagine an ASG which spins up new instances that use EBS. You want to keep your alarms in terraform? No can do. The easiest approach would be CloudWatch Events that create new alarms automatically. One can add an alarm during bootstrapping, but what about removing the alarm when the instance dies? Lifecycle policy? What would be the source of truth if the alarms somehow get out of sync with what’s in AWS? One tough cookie.
- Derivative of 1: a CloudWatch alarm must monitor exactly one and only one metric. Not two, not three, not a * wildcard, not a regex.
- Dashboards are not easy to create or edit, and they don’t provide a way to customize them like grafana does with variables, annotations and other great features.
- Metrics Math looks pale in
- Data is kept for a limited amount of time and you cannot change that.
- Full observability with CloudWatch is expensive! It seems to be free, but try to understand the pricing and you’ll see that the cost quickly adds up.
- Container Insights were added recently and I haven’t used them yet. If you have, you can share in the comments how it compares.
I haven’t yet written about using prometheus as a monitoring system and I definitely should. I used it extensively during my work at Voxnes/Spreaker. It’s a great tool! Powerful, robust, scalable and really resilient.
The requirements for collecting metrics with prometheus in this PoC were:
- no changes to any existing application
- infrastructure changes introduced by this PoC must be easy to revert
- service discovery for ECS tasks that supports awsvpc network mode
If you don’t know what awsvpc is, then please refer to the docs.
The first step for me was to enable node-exporter and cadvisor to collect metrics from the host.
node-exporter gives a lot of insight into what’s happening on the host and cadvisor gives insight into the container layer. This duo
greatly improves observability of infrastructure, especially in conjunction with
The second step would be adding sidecar containers that export metrics for applications that already expose their metrics somehow (API, files, etc). Such a step doesn’t require any application changes, so it would be harmless and also easy to revert.
export metrics and ingesting them with
prometheus. This is the step that requires in-depth knowledge of the applications
and of monitoring practices. If you have trouble pinpointing the best metrics to observe, then start from the golden signals
and read Site Reliability Engineering in the meantime.
I will write a blog post about how I use
prometheus in my private K8S cluster: where it fits, best practices,
some caveats and additional tools in the ecosystem.
ECS SD to work in all types of environments.
Some SD mechanisms have rate limits that make them challenging to use. As an example we have unfortunately had to reject Amazon ECS service discovery due to the rate limits being so low that it would not be usable for anything beyond small setups.
Right, so let’s get to the point.
terraform was used to easily set up the PoC and tear it down when the tests were finished.
To keep prometheus separate from the existing infrastructure, it has its own ECS cluster. I configured the module in a way that provides resiliency and sets up prometheus in multiple AZs.
As it was a PoC it’s not highly adjustable; some things like VPC subnets are actually hardcoded but hey, it’s not my fault :)
There are multiple things that this module abstracts away. Let me iterate over them:
- IAM permissions for EBS volumes so prometheus has persistent data
- IAM permissions for S3 bucket in which config is stored
- IAM permissions for ECS and EC2 service discovery
- IAM permissions to register containers in CloudMap (this can be improved as now it’s
- EBS volumes
- ASG + LC with userdata file
- ECS cluster + ECS service
- Configuration files on S3
Prometheus ECS task
Let me explain what’s deployed as our prometheus ECS task.
As you can see we deploy:
If you’re familiar with prometheus then there is nothing to explain about the first three. The last one is the key component of this setup. It’s an application (just a few hundred lines of python code) that gets information about the containers running in ECS and provides it in a format that prometheus is able to read.
Additionally we need to configure
prometheus so it knows about this
sd configuration with:
If you want to know what the json looks like, you can either try this setup or refer to the docs.
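To give a rough idea of the shape, here is a minimal sketch. It assumes prometheus file-based service discovery (file_sd) is what reads the json, which is my assumption here. The function names, the label names and the sample values are made up for illustration and are not taken from the actual discovery app.

```python
import json
import os
import tempfile


def build_target_groups(containers):
    """Turn discovered containers into file_sd target groups.

    Each container is a dict with hypothetical keys: ip, port, name, cluster.
    """
    groups = []
    for c in containers:
        groups.append({
            # file_sd expects "host:port" strings in "targets"
            "targets": [f"{c['ip']}:{c['port']}"],
            # labels are attached to every metric scraped from the targets
            "labels": {"container_name": c["name"], "ecs_cluster": c["cluster"]},
        })
    return groups


def write_targets(path, groups):
    """Write the file atomically so a half-written file is never read."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(groups, fh, indent=2)
    # mkstemp creates the file as 0600; relax it so another user can read it
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)
```

Writing to a temporary file and renaming it is a design choice: the discovery app would rewrite this file every time containers come and go, so the reader should only ever see a complete file.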
There are a couple of ecs-discovery apps but they didn’t support awsvpc network mode, which was one of the main requirements.
Ingesting metrics from node-exporter and cadvisor
On the ECS cluster that we want to monitor we have to deploy node-exporter and cadvisor.
Because scheduling_strategy is set to DAEMON, every node in the cluster will be monitored.
Enable scraping container metrics
Additionally we need to configure task variables for cadvisor and any other application so that service discovery will know that we want to scrape it.
PROMETHEUS_PORT must be the same as the port on which the containers expose their metrics.
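The opt-in logic could look roughly like the sketch below. This is an assumption about how a discovery app might interpret the variable, not the code of the app used in this PoC. `scrape_target` and the sample values are hypothetical; the `environment` list of name/value pairs mirrors how container definitions appear in an ECS task definition.

```python
def scrape_target(container_def, task_ip):
    """Return "ip:port" if the container opts in via PROMETHEUS_PORT, else None.

    container_def: one entry of a task definition's containerDefinitions.
    task_ip: private IP of the task (with awsvpc every task has its own).
    """
    env = {e["name"]: e["value"] for e in container_def.get("environment", [])}
    port = env.get("PROMETHEUS_PORT")
    if port is None:
        # Variable not set: the container did not ask to be scraped
        return None
    if not port.isdigit() or not 0 < int(port) < 65536:
        raise ValueError(f"invalid PROMETHEUS_PORT: {port!r}")
    return f"{task_ip}:{int(port)}"
```

For example, a cadvisor container definition carrying `{"name": "PROMETHEUS_PORT", "value": "8080"}` in a task with IP `10.0.1.15` would yield `10.0.1.15:8080`, while a container without the variable is skipped.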
Goal was to have
ECS and start collecting infrastructure and app metrics, and I’m happy that my effort
resulted in a fully working PoC!
During this work I had a feeling that I was reinventing the wheel. Deploying prometheus and its exporters on kubernetes is so much easier. I guess sometimes we have to deal with what we have even though better technology exists.
There are some things that could/should be improved like:
- config shouldn’t be in S3, but because it’s in terraform I felt it was better to have it in one place.
- some IAM permissions could be more strict
- the bigger the infrastructure on ECS is, the more problems this setup would have, but it’s a PoC (wontfix)
The modules used in this blog post are in a publicly available repo.
Feel free to use them, but keep in mind that I won’t be developing them any more.