Public “Office Hours” (2020-01-29)

Erik OstermanOffice Hours

Here's the recording from our DevOps “Office Hours” session on 2020-01-29.

We hold public “Office Hours” every Wednesday at 11:30am PST to answer questions on all things DevOps/Terraform/Kubernetes/CICD related.

These “lunch & learn” style sessions are totally free and really just an opportunity to talk shop, ask questions and get answers.

Register here:

Basically, these sessions are an opportunity to get a free weekly consultation with Cloud Posse where you can literally “ask me anything” (AMA). Since we're all engineers, this also helps us better understand the challenges our users have so we can better focus on solving the real problems you have and address the problems/gaps in our tools.

Jenkins Pros & Cons (2020)

Erik OstermanRelease Engineering & CI/CDLeave a Comment

I spent some time this weekend getting caught up on the state of Jenkins in 2020. This post will focus on the pros and cons of Jenkins (not Jenkins X – which is a complete rewrite). My objective was to set up Jenkins following “Infrastructure as Code” best practices on Kubernetes using Helm. As part of this, I wanted to see a modern & clean UI throughout and create a positive developer experience. Below is more or less a brain dump of this experiment.


  • Jenkins has a lot of redundant plugins. Knowing which one to use takes some experimentation and failed attempts. The most common example cited is “docker.” Personally, I don't mind the hunt – that's part of the fun.
  • Jenkins has many plugins that seem no longer maintained. It's important to make sure whatever plugins you chose are still receiving regular updates (as in something pushed within the last ~12 months).
  • Not all plugins are compatible with Declarative Pipelines. IMO using Declarative Pipelines is the current gold standard for Jenkins. Raw imperative groovy pipelines are notoriously complicated and unmanageable.
  • No less than a few dozen plugins are required to “modernize” Jenkins. The more plugins, the greater the chance of problems during upgrades. This can be somewhat mitigated by using command-line-driven tools run inside containers instead of installing some of the more exotic plugins (credit: Steve Boardwell).
  • There's no (maintained) YAML interface for Jenkins Pipelines (e.g. Jenkinsfile.yaml). Most modern CI/CD platforms today have adopted YAML for pipeline configuration. In fact, Jenkins X has also moved to YAML. The closest thing I could find was an alpha-grade prototype with no commits in 18 months.
  • The Kubernetes Plugin works well but complicates Docker Builds. Running the Jenkins Slaves on Kubernetes and then building containers requires some trickery. There are a few options, but the easiest one is to modify the PodTemplate to bind-mount /var/run/docker.sock. This is not a best-practice, however, because it exposes the host-OS to bad actors. Basically, if you have access to the docker socket, you can do anything you want on the host OS. The alternatives like running “PodMan”, “Buildah”, “Kaniko”, “Makisu”, or “Docker BuildKit” on Jenkins have virtually zero documentation, so I didn't try it.
  • The PodTemplate approach is problematic in a modern CI/CD environment. Basically, with a PodTemplate you have to define before the Jenkins slave starts the types of containers you're going to need as part of your pipeline. For example, you define one PodTemplate with docker, golang and terraform. When the Jenkins slave starts up, a Kubernetes Pod will be launched with 3 contains (docker, golang and terraform). One nice thing is that all those containers will be able to share a filesystem and talk over localhost since they are in the same Pod. Also, since it's a Pod, Kubernetes will be able to properly schedule where that Pod should start, and if you have autoscaling configured, new nodes will be spun up on-demand. The problem with this, however, is subtle. What if you want a 4th container to run that is a product of the “docker” container and share the same filesystem? there's no really easy way to do that. These days, we will frequently build a docker container in one step, then run that container and execute some tests in the next step. I'm sure this can be achieved, but nowhere near as easily as with codefresh.
  • It's still not practical today to run “multi-master” Jenkins for High Availability without using Jenkins Enterprise. That said, I think it's moot when operating Jenkins on Kubernetes with Helm. Kubernetes is constantly monitoring the Jenkins process and will restart it if unhealthy (and I tested this inadvertently!). Also, when using Helm if the rollout fails health checks, the previous generation will stay online allowing the bugs to be fixed.
  • Docker layer caching is non-trivial if running with Ephemeral Jenkins Slaves under kubernetes. If you have a large pool of nodes, chances are that every build will hit a new node, thus not taking advantage of layer caching. Alternatively, if using the “Docker in Docker” (dnd) build-strategy every build will necessarily pull down all the layers. This will both add considerably to transit costs and build times as docker images are easily 1G these days.
  • There's lots of stale/out-of-date documentation for Jenkins. I frequently stumbled on how to implement something that seemed pretty basic. Anyways, this is true of any mature ecosystem that has a tremendous amount of innovation, lots of open sources, and been around for 20 years.
  • The “yaml” escape hatch for defining pod specs is sinfully ugly. In fact, I think it's a horrible precedent that will turn people off from Jenkins. It's part of what gives it a bad wrap. The rest of the Jenkinsfile DSL is rather clean and readable, but embedding raw YAML into my declarative pipelines is not a practice I would encourage for any team. To be fair,  some of the ugliness could be eliminated by using readFile or readTrusted steps (credit: Steve Boardwell), but again it’s not that simple.


I would like to end this on a positive note. All in all, I was very pleasantly surprised by how far Jenkins has come in the past few years since we last evaluated it.

  • Helm chart makes it trivial to deploy Jenkins in a “GitOps” friendly way
  • Blue Ocean + Material Theme for Jenkins makes it look like any other modern CI/CD system
  • Rich ecosystem of Plugins enables the ultimate level customization, much more than any SaaS
  • Overall simple architecture to deploy (when compared to “modern” CI/CD systems). No need to run tons of backing services.
  • Easily extract metrics from your build system into a platform like Prometheus. Centralize your monitoring of things running inside of CI/CD infrastructure. This is very difficult (or not even possible) to do with many SaaS offerings.
  • Achieve greater economies of scale by leveraging your existing infrastructure to build your projects.  If you run Jenkins on Kubernetes, you immediately get all the other benefits. Spin up node pools powered by “ Ocean” and get cheap compute capacity with preemptible “spot” instances. If you're running Prometheus with Grafana, you can leverage that to monitor all your build infrastructure.
  • Integrate with Single Signon without paying the SSO tax levied by enterprise software.
  • Arguably the Jenkinsfile Declarative Pipelines DSL is very readable, in fact, it looks a lot like HCL (HashiCorp Configuration Language). To some, this will be a “Con” – especially if YAML is a requirement.
  • Jenkins “Configuration as Code” plugin supports nearly everything you'd need to run Jenkins itself in a “GitOps” compliant manner. And if it doesn’t there is always the configuration-as-code-groovy plugin which allows you to run arbitrary Groovy scripts for the bits you need (credit: Steve Boardwell).
  • Jenkins can be easily deployed for multiple teams. This is an easy way to mitigate one of the common complaints that Jenkins is unstable because lots of different teams have their hands in the cookie jar.
  • Jenkins can be used much like the modern CI/CD forms that use container steps rather than complicated groovy scripts. This is to say, yes, teams can do bad things with Jenkins but with the right “best practices” your pipelines should be about as manageable as any developed with CircleCI or Codefresh. Stick to using container steps to reduce the complexity in the pipelines themselves.
  • Jenkins Shared Libraries are also pretty awesome (and also one of the most polarizing features). What I like about the libraries is the ability for teams to define “Pipelines as Interfaces”. That is, applications or services in your organization should almost always be deployed in the same way. Using versioned libraries of pipelines helps to achieve this without necessarily introducing instability.
  • Just like with GitHub Actions, with Jenkins, it's possible to “auto-discover” new repositories and pipelines. This is sweet because it eliminates all the ClickOps associated with most other CI/CD systems including CircleCI, TravisCI, and Codefresh. I really like it when I can just create a new repository, stick in a Jenkinsfile, and it “just works”.
  • Jenkins supports what seems like an unlimited number of credential backends. This is a big drawback with most SaaS-based CI/CD platforms. With the Jenkins credential backends, it's possible to “plug and play” things like “AWS SSM Parameter Store”, “AWS Secrets Manager” or HashiCorp Vault. I like this more than trusting some smaller third-party to securely handle my AWS credentials!
  • Jenkins PodTemplates supports annotations, which means we can create specially crafted templates that will automatically assume AWS roles. This is rad because we don't even need to hardcode any AWS credentials as part of our CI/CD pipelines. For GitOps, this is a holy grail.
  • Jenkins is 100% Free and Open Source. You can upgrade and get commercial support from Cloud Bees which also includes a “tried and tested” version of Jenkins (albeit more limited in the selection of plugins).

To conclude, Jenkins is still one of the most powerful Swiss Army knifes to get the job done. I feel like with Jenkins anything is possible, albeit sometimes with more effort and 3 dozen plugins. As systems integrators, we're constantly faced with yet-unknown requirements that pop-up at the last minute. Adopting tools that provide “escape hatches” provide a kind of “peace of mind” knowing we can solve any problem.

Parts of it feel dated like the GUI Configurations, but that is mitigated by Configuration as Code and GitOps. I wish some things like building and running Docker containers inside of pipelines on Kubernetes was easier. Let's face it. Jenkins is not the cool kid on the block anymore and there are many great tools out there. But the truth is few will stand the test of time the way Jenkins has in the Open Source and Enterprise space.

Must-Have Plugins

  • kubernetes (we tested 1.21.2) is what enables Jenkins Slaves to be spun upon demand. It comes preconfigured when using the official Jenkins helm chart.
  • workflow-job (we tested 2.36)
  • credentials-binding (we tested 1.20)
  • git (we tested 4.0.0)
  • workflow-multibranch (we tested 2.21) is essential for keeping Jenkinsfiles in your repos. The multi-branch pipeline detects branches, tags, etc, from within a configured repository, so Jenkins works more like Circle, Codefresh or Travis.
  • github-branch-source (we tested 2.5.8) – once configured will scan your GitHub organizations for new repositories and automatically pick up new pipelines. I really wish more CI/CD platforms had this level of autoconfiguration.
  • workflow-aggregator (we tested 2.6)
  • configuration-as-code (we tested 1.35) allows nearly the entire Jenkins configuration to be defined as code
  • greenballs (we tested 1.15) because I've always green is the color of success =P
  • blueocean (we tested 1.21.0) gives Jenkins a modern look. It clearly depicts stages, progress, and most other systems we've seen like CircleCI, Travis or Codefresh.
  • pipeline-input-step(we tested 2.11)
  • simple-theme-plugin (we tested 0.5.1) allows all CSS to be extended. Combined with the “material” theme for Jenkins you get a complete facelift.
  • ansicolor (we tested 0.6.2) – because many tools these days have ANSI output like Terraform or NPM. It's easy to disable the color output, but as a developer, I like the colors as it helps me quickly parse the screen output.
  • slack (we tested 2.35)
  • saml (we tested 1.1.5)
  • timestamper (we tested 1.10) – because with long-running steps, it's helpful to know how much time elapsed between lines of output

Pro Tips

  • Add the following nginx-ingress annotation to make Blue Ocean the default” /blue/organizations/jenkins/pipelines
  • Use this Material Theme for Jenkins with the simple-theme-plugin to get a beautiful-looking Jenkins
  • Hide the [Pipeline] output with some simple CSS and the simple-theme-plugin:

    .pipeline-new-node {
    display: none;


When researching this post and my “Proof of Concept”, I referenced some links and articles.

Public “Office Hours” (2020-01-23)

Erik OstermanOffice Hours

Here's the recording from our DevOps “Office Hours” session on 2020-01-23.

We hold public “Office Hours” every Wednesday at 11:30am PST to answer questions on all things DevOps/Terraform/Kubernetes/CICD related.

These “lunch & learn” style sessions are totally free and really just an opportunity to talk shop, ask questions and get answers.

Register here:

Basically, these sessions are an opportunity to get a free weekly consultation with Cloud Posse where you can literally “ask me anything” (AMA). Since we're all engineers, this also helps us better understand the challenges our users have so we can better focus on solving the real problems you have and address the problems/gaps in our tools.

Machine Generated Transcript

Let's get the show started.

Welcome to Office hours.

It's January 22nd 2020.

My name is Eric Osterman and I'll be leading the conversation.

I'm the CEO and founder of cloud posse.

We are a DevOps accelerator.

We help startups own their infrastructure in record time.

By building it for you and then showing you the ropes.

For those of you new to the call the format of this call is very informal.

Michael my goal is to get your questions answered.

Feel free to unmute yourself anytime if you want to join.

Jump in and participate.

Excuse my dog.

He's having a fun time downstairs here.

We all these calls every week.

We'll automatically post a video recording of this session to the office hours channel as well as follow up with an email.

So you can share it with your team.

If you want to share something in private.

Just ask and we'll temporarily suspend the recording.

With that said, let's kick it off.

Here's the agenda for today.

Some talking points that we can cover just to get the conversation started.

One of the announcements for all of us using excess was pretty good is that the master nodes have now come down and cost.

It's ugly.

It's a 50% reduction across the board.

Again, this is only on the cluster itself.

Like the master nodes.

It has no bearing on your work or nodes themselves.

So there's a link to that.

The other good news is Terraform docs has finally come out of release and that they have an official 8th photo release.

It's actually up to eight that won already this morning that supports each seal.

2 If your team isn't automatically using Terraform docs I highly recommend it.

It's a great way to generate markdown documentation for all your Terraform code automatically.

Some people use pre commit hooks.

We have our own read generator that we use it for other people use it to validate your inputs have been set descriptions and stuff like that.

Also, we launched our references section.

This is if you go to cloud policy slash references.

This is links to articles and blog posts around the web that have been written about how to use some of our stuff since I think it's interesting to see how others have come to use our modules.

This is a great resource for that.

And the first like a technical topic of the day that I'd like to talk about.

If there are no other questions would be using open policy agent with Terraform and then I'm also curious about firsthand accounts using Jenkins X. So with that said, let's take this off.

Any questions are you sharing the right screen.

Oh my god.

I do not share my screen.

All right.

Let's do that alone.

Let's see here.

Sure I am not showing the right screen.

You see my notes.

All right.

The magic is gone.

All right share this window.

Hey you.

All right.

Thanks Ed.

There we go.

All right.

Well, if no other questions I want to share a status update on behalf of Brian Tye use the guy over at audit board or Sox up that had the challenge with those random Kubernetes notes getting rebooted.

Well, you finally got to the bottom of it and everything else he did to try and treat it.

Although it wasn't actually related to the problem.

So they rely on shoot not Datadog.

What's that monitoring tool.

Start off as an open source project tracking all the traffic in your cluster.

Andrew help me out here.

Sorry what is this doing.

Yeah I think cystic is what they're using.

So assisting for monitoring.

So they have cystic installed on all their nodes in the cluster and was finally able to find the stack trace that was getting thrown by going to the AWS web console.

And looking at the terminal output of one of the nodes that had rebooted.

We thought we'd actually done this got to actually do it.

So this is where the actual exception was final cut.

The problem was a null pointer exception insisted running on their nodes.

That was causing the nodes to crash.

So it was actually their monitoring platform itself.

Assad's platform albeit nonetheless, that was causing it to crash.

So they so they got that fixed and now they're nodes they're not crashing anymore.

They have a cluster that some people spun up period that they just kind of gave everyone access to it like room.

You know this is just kind of a playground whatever, again.

And which in theory sounds nice, but what's happening is.

People are not being good stewards of using the cluster properly and setting resource limits and requests and everything.

And so it's causing like starvation on the nodes and it's causing not just the one team to have issues, but like anyone using the cluster to have issues and everyone's going what's going on with the cluster.

My stuff's not I don't think my stuff is hurting the cluster.

And it's everyone pointing fingers at each other.

That's all it's turning into.

And it's just a giant.

So So what are your thoughts on how to resolve that.

Don't do that.

Well, yeah, but I mean, I agree that it should everyone should play fairly with each other.

But maybe there are some pretty simple thrones games like don't give anyone don't give anyone cluster admin.

And if a team wants to use the cluster as a sandbox create of create a namespace you create a service account for that namespace that has whatever access namespace they want.

Create a limit range on that namespace so that if they don't set a request or a limit that it sets default then they don't have access to the entire cluster they just have access to the namespace itself.

And we get people bitching about.

Well, I need to set you know AI need to set a cluster role or whatever.

It's like, well, if you're on cluster.

Yeah OK.

I agree with that.

That's what was basically going to be my recommendation exactly.

You said they're the only other augmentation to that.

But I think it's going to take more effort would also be to deploy something like one of these open policy agent based agent solutions that require limits to be set that require all of these.

Now for the roles that can kind of be sold perhaps if you had get UPS related to this cluster, you can add the roles but that has to go through this different process.

Right anything that is global or the cluster or the next cluster is going to be.

No one has access to it.

And if they want something in the cluster, it goes through CSG like Jenkins or whatever.

OK gotcha.

You know that.

So they register you know so they say I have this app you know I'm going to push up.

I'm going to push up new images to this Container Registry you know and then set up harness or flux or Argo or whatever on the other end, you know and it requires kind of an ops team, which is not super DevOps.

But at the same time, it's like if I give access to the cluster to everyone.

It's just going to cause tons of problems.

So like, yes, I have an ops team.

And you are the dev team.

But like I want you to come work with me to set up your Argo stuff.

And then we'll be in communication on Slack and we can join each other standups or whatever and do that for you know DevOps collaboration versus here's access to the cluster do whatever you want is a free for all.

But are the consequences.

So one of things you said just jog a memory I met with Marcin.

He's a founder of space.

Lift shared it in this tweet UPS earlier this week.

And one of that was pretty neat with what he's doing is two things that come from experience doing get UPS.

One is addressing a big problem with Terraform cloud and Terraform cloud.

The problem is down to a good way for you to bring your own tools.

You have to bake those into your git repo and if you have like 200 Git repos that all depend on a provider, you have to like bake in 200 binaries and on repos and yet no fun.

Or if you depend on other tools and use like local exact probationers and stuff like that.

A fun way to do that.

Terrible so in space left.

He took a different approach.

You can bring your own Docker container.

Kind of like what you have Andrew and what we have geodesic with your tool chamber.

But the other thing that he does.

And this is what jogged my memory with what you're saying is sometimes you need escape hatches.

Sometimes you need a way to run ad hoc commands.

But we also want those to be auditable.

An example is you mess up the Terraform states and you need to unlock like the Terraform state is locked somehow.

You need to run the force unlocked.

How do you do that.

In order for the other example is you're running an actual real live environment and you need to refactor where your projects are.

So you need to move resources between Terraform state.

So you need to use the Terraform import command to do that.

How do you do that in an auditable way.

So what he's done is he.

I forget what he calls it here.

I think he calls it tasks but he provides a way for you to run one off tasks as part of your CI/CD city.

The other thing he supports is a Sas product right now is you delegate roles to the Sas and then those roles allow this to run without hardcoded credentials which is the big problem right with Terraform cloud as you've got a hard put your credentials.

And tear from the open policy agency.

I posted a bunch of links this week in the Terraform channel related to the open policy agent.

I confess it was it's been brought up several times before blaze I know has been doing some audio seized on this one of our community members, but it was my scenes showing me his demo how he's integrated into space.

Lift using.

OK that got you really excited.

So I started doing some research on that.

And then I saw it's actually pretty straightforward.

How you could integrate this into your CI/CD pipelines.

So if you look at these two links here.

I'll post them as well to the office hours.

Channel Nazi era.

I know I forgot to also announce that office hours have started here in Libya peeing everyone that office hours started.

Yeah, right.

So office hours.

So there's the link to the open policy agent.

So in here they provide an example of basically how this works.

So here's an example of a typical resource that you define and Terraform code and then what you do is you basically generate the plan as Jason using this argument.

Now you have this Jason plan and then that makes it very easy to operate on the output of this in the open policy agent.

So they have some examples down here of where you can now describe what is described the policy for this.

So here here.

It has an example of looking over scanning over the plan.

Let's see what this is.

So number of deletions and resources of a given type.

So here.

He's counting the resources you could have, for example, a policy, which is that, hey, if there are no deletions that reduce maybe the number of approvals that you need or something to get this PR through.

So then someone else in the community posted a link to conf test, which seems even more rad because it adds support for lots of different languages or types.

So it supports out of the box.

Jason Hammel but also they have support for ACF too experimental.

So now you can actually do policy enforcement on your Terraform using totally open source tools.

So that should play well with like your Jenkins pipelines code fresh pipelines sufferable Andrea of you are giving this a shot at all.

I am not able to do much experimentation and learning lately and trust me.

So no less that Carlos.

This sounds like something might be up your alley given the level of sensitivity of some of the things you do operating at a hedge fund.

Carlos idea with you.

And I was muted.

Yeah Yes.

Sounds very nursing.

I haven't seen it until now.

Cool Yeah.

You are you see.

Let me know whether I re posted those links into my companies slack in my Terraform channel.

And somebody said open source told everyone that they heard of.

Oh, yeah.

Yeah because sentinel is Ohashi corpse enterprise offering.

So yeah, there's an open source sentinel was saying that what I was linking was open source sentinel.

Yeah, I know the equivalent of basically open source sentinel.

Gotcha exactly.

I hadn't heard of central I didn't.

Yeah So sentinel is an offering of terror from enterprise on prem.

I don't believe you can have it on the cloud version that host an version.

I believe like the way that opa has kind of created this standardized format for creating these policies using this language.

It's called Rigo right.

Oh is that the name of this.

We have opa opa language is called Rigo.

And the fact that opa has created this standard.

It's going to it's going to be it's going to have to be something super awesome nowadays for me to go with something that doesn't support opa.

Yeah like, you know, I'm looking at said and it's going to be like this you know proprietary thing that works.

You know, for this one aspect.

And that's kind of it versus what's the reason I went to what to Terraform in the beginning.

Well, it's because it was standardized like you know if I go AWS, Azure GCP whatever you know it's all kind of I can use Terraform.

I don't have to learn some new tool.

But I'm feeling is that same kind of mentality.

Now Yeah.

Yeah, I think so.

If for kind of testing or validation of all these formats and testing for us because certainly I mean, let's face it, the tools that we use move bleeding we fast and many of them lack sufficient validation on what they have or features.

But now, like let's say say helm file, for example, using this would be very easy to express.

Now some policies around helm files and some of your best is basically codifying your best practices.

Now are unhelpful in doing the same for Docker file et cetera.

So Yeah like what.

So for helm file what would be a good best practice that everything has been solvable flag that should be enforced in all of your home files.

What else would be a good one that you're pinning image tags or something.

I'm struggling right now on my head.

I was prepared for it.

But yeah, I think we could find some things to validate and help us install a little flag.

Oh so how file has moved in the direction of mirroring the interface of Terraform so like Terraform has a plan phase and an apply is now in Terraform the plan phase actually can generate a plan and then you can execute that plan in hell and file.

It has it something analogous to the plan.

But not an artifact like a plan file.

So what I'm referring to is Helm.

That's right.

So helm Def shows you that the potential changes without making those happen.

Well Ben helm file has two other commands.

One is apply and the other is destroyed in those mirror Terraform.

So the apply flag honors the installed flag versus sync.

I don't believe sync.

So when you're in sync I don't think it'll uninstall things.

But apply will uninstall things when you run home file.

Interesting that's not what the documentation is saying.

But maybe it more.

OK I let's get that right.

I might be wrong.

There is a nuance like that.

Let's see what the documentation is saying that the difference between apply and sync is that apply runs Def and if Def shows that something is different than it runs sync.

Am I getting that right.

So you applied the helm file apply sub command begins by executing.

If that defines that there are changes sync is executed adding the interactive flag instructs helm file to get your confirmation vs. sync sync sink says sink all resources from state file.

Yeah So I don't think it mentions it.

And it doesn't work like that.

It doesn't mention anything about that installed flag.

I suspect before consent either I could be wrong.

Maybe maybe it doesn't work that way.

So you know what I said I made.

Maybe I am misled or maybe the functionality has changed since we started using it.

But the idea that apply is intended to be used as part of your c.I. workflow.

We're getting in town file.

And I'm starting to try to come up with some best practices because know some of us.

And now have more experience with home file than others.

And so we're looking at things like, is it a fight to clean up on fail flag for example.

Is it a best practice to have that on or just leave it as a default. False like.

Yeah Well, so you're guy I would be careful where you about that in production and I'd be I'd be recommending and perhaps and staging.

But it depends.

Maybe not at all staging environments maybe just on your like demo or preview type environments.

This just came up today.

So let me explain a little bit more.

So as we know with home if you do an upgrade it will bail on you if the previous releases failed.

So then you can't move forward.

And if you add the force flag for example, then it will uninstall that failed release and then reinstall.

But that might be pretty disruptive if you're running in a production environment, especially since a failed upgrade or a fail deployment doesn't mean a dysfunctional service.

It just means the surface is in an unknown state.

So this is why you might not want to clean up resources if it'll help you debug what went wrong.

However, in a staging environment or like preview environments where you're deploying frequently and you don't want your builds to constantly fail, especially when you're developing, especially when you know things might be breaking and unstable, then I like the force flag would and possibly the cleanup flag, then to just keep things moving humming along.

Even in staging, I would like I would totally agree with that for dev but in staging.

Wouldn't you want to have like.

My thought process for staging is like, all right, I want to have whatever production has now.

And then run the exact same command that you're going to run in production and see what happens.

Yeah, I'm so sorry we're just overloading the term staging like every company does.

Staging in your context is like a production.

Yeah, we have something that we saw staging to us is an account stage where you will have environments with different designations so one designation would be production and production should be almost identical to production.

But then you have unlimited staging or preview environments.

Those are also running in staging or staging accounts.

But they have a different designation.

So it's just that different companies use these terms differently.

And that.

So we're saying the same thing.

We're just using different terms Eric can correct me if I'm wrong, but it almost still deployed to a field.

The play if there were previous deployment deployments that ass right.

My understanding if the most recent deploy failed the upgrade will fail unless you have the forced flag.

But I don't want to claim expertise on that.

I've been using helm files.

So long.

So help file kind of redefined some of those behaviors and use the force flag on helmet file.

The first thing it does is an uninstall of the release if it has failed to ensure that this seat successfully sells.

But the raw home functionality.

Maybe if somebody knows for sure, please check in I'm almost positive.

That's the case.

Just because I have instances where the initial deploy will fail and then and then I won't be able to deploy to that again until I do like a delete purge.

Right I will I will have long running deployments that will fail and then running like another deploy does work.

And I don't have to like delete that.

Delete that.

So you're saying when there are multiple helm releases you're able to do it.

But if there's only one and release.

You can't.

So for a particular release.

If it's been deployed successfully before and the mostly deploying as long as it was successful before.

OK that's worth that that might be the case.

I believe, at least in my case, I haven't seen that.

But definitely if it's the initial deploy of a released and it's never been deployed before and it fails, you cannot redeploy it until you kind of until you purge the that release.

Yeah So maybe what we're seeing to see my screen here.

And I think it's the atomic flag that helps with this magic here.

Last nice.

By the way Brian, I shared just before you joined I think the resolution to your issue with the random node reboots.

Oh, yeah.

Yeah, just for those that ever have to run into this is to check that easy to system logs.

We'll have all of your kernel logs versus checking on the box, or guess ship it yourself by think just if you're on a date.

Is there a ship for you.

She module is.

There's cool any other questions or insights here.

I need to come up with a demo of infrastructure as code using just my local computer.

So self-contained like a demo.

Yeah Any ideas.

There's like I could do Vagrant with you know spinning up a couple of virtual machines.

I love to use Terraform because that's what I use for.

For real infrastructure as code.

And so if I'm just showing code.

I'm going to show Terraform code, but I need to be able to do a demo of like, hey, look at me spinning up actual infrastructure.

But I won't have access to a native US account or anything because multiple people are going to do this.

I assume you're saying, what does it specifically need.

Siri woke up here.

Does it specifically need to demo kw s functionality or demo.

I see any kind of I see wonderful food for thought.

If it jogs your memory because I know you're working on dad's garage you're equivalent of dude s sick one of the truck true prototypes.

They did here.

It was just AI think it was a Yeah.

So I did a prototype of working with Minikube but then it also works with Docker for Mac and I assume most of your company would be had we'd have Docker for Mac installed right now.

He would think, OK, maybe then the value of what I'm saying is moot.

But what I was going to say is if you're running dog for Mac or probably dogs for Windows 2 you can just check off this box enabled Kubernetes.

That's not a bad idea though.

And that you do to play circuit you could do Terraform code that just spins up Docker containers.

Yeah Yeah.

Because all I really care about is showing like you know look here is code that spun up you know it spun up a bunch too with this exact diversion put up with this exact amount of memory and CPR you power that I told it to set up you know and it spun up three of them because I told it to spin up three of them you know.

And now I'm going to kill them all with one command.

Yeah, that is in that's infrastructure as code right there doctor.

Instead of virtual machines.

But the concept is the same.

Faster demo is lighter weight.

And as you know, there's a doctor providers you know you can also skip what I said about cougar genetics and you could just do plain vanilla doctor as well.

You provide the context of this demo is it like a lecture learn for your co-workers it's for an intro to deficit gops class.

So we've got a section on infrastructure as code and the, the customer wants a live demo of the infrastructure as code.

I did a.

So this demo does include AWS.

But I did talk on using Terraform to deploy and apply a simple web API on this young group in the US and then do a looping deployment of that.

So that's kind of fun.

I hope it's open source I can send it to you.

But it does require a database account.

But I actually was going to do this for my co-workers and provide and we'll just create a dummy database account provide them the credentials and then just ask them to do the Terraform from destroy after.

So you could technically do that right.

I guess it depends on the size of your class.

Yeah, that gives me some ideas.

Like, I could do something like I could do something like you know given the way that you guys currently handle your infrastructure.

How would you handle doing OS version upgrade.

Oh, well, you know we'd have to SSA into each and every one of them, then know apply all the commands and shepherd them through the whole pets versus cattle thing versus I could be like, I'm going to change this variable and then redeploy.

That's a good idea.

I like that.

Well, I did also which was really cool too when I did our people that was really cool when I did my demo was I deployed it to two different regions and two different environments all with like the same Terraform and yet.

It was just a search of a variable and when I did that demo everybody appreciated that.

Yeah, that's going to do well, you have identical environments.

But one is staging and one is of the product one might have t to X large is the staging one might have t to my.

Other than that Alice is wearing the same.

Yeah let's call that reminds me of one of the posts at one of our customers shared with us only because it was a nice summary of most many people's experience with Terraform like when you start to where you are today like when you start you start with small project.

And then you realize you need a staging environment.

So you clone that code in the same project.

And now you have two environments.

But any change is in the blast radius of all of those.

So then you move to move having separate projects and all of that stuff, you find that linking to Terraform channel what was the first time I heard of the term a terror list for Terraform monolith material.

I love it.

Yeah, it's not a perfect description of it.

Everyone knows what you mean, if they've ever done it.

Yes speaking of terror let's.

I was thinking about the wave Yeah.

You had showed.

The way you do the terror form infrastructure for your clients with like each client gets a GitHub organization.

And then there is what's under the organ.

We're going to like do you have is it just one big Terraform Apply or whatever for like the whole thing or is it no go out.

Yeah, it's all decomposed.

So basically, our thing is that we've been using it.

So here's like an organization.

And then each e.w. us account represents basically dedicated application and then so therefore, each one of those has a repository but then you go into each one of these repositories.

And then they have software just like your applications have software.

But your applications.

What do they do.

They pin at library versions.

So that's what we do here.

So you want to run the write the same versions of software in all these environments all these office accounts and we do that just by pinning like to a remote module.

So here we can throw a remote module.

So when we made changes to this directory that triggers a plan and apply workflow with Atlantis there's a plan applied just for the cloud trail directory or whatever except for the whole staging got Catholics did.

Exactly So basically, you open pour requests that modify any one of these projects.

And that will trigger, you can modify multiple projects at the same time.

There's nothing no multiple applies to do that they exist.

Well known in multiple plans multiple plans multiple tries to do that.

But all in the same pull requests and all automatic.

Right OK.

What does that look like.

Does it Lantz ask when you submit a pull request does Atlantis say you know this affected 3 Terraform projects.

And here's the plans for all three of them is that what it does.

Yeah So it would.

So you know the pull request was opened automatically or as the site was open manually by a human a developer.

Then the plan was it kicked off automatically.

And then we see this out here.

This happens to be output from helm file.

Not Terraform but we because we use Atlantis for both my file.

And terrible when we see it or what's going to change.

And then somebody approved it.

And then we can surgically target what part of that we want to apply for.

We're going to say Atlantis apply and it'll apply all genes catch a so one thing to note about Atlantis, which is really nice is that it locks that project that you're in.

So to developers can't be trying to modify it in this case cube system at the same time.

Because otherwise, those planning to apply those applies will be under each other's changes and they'll be at war with each other.

So Brian any interesting projects you're working on.

I am starting my Prometheus initiative.

Oh, nice.

Very cool.

Well, are you going to take a Federated approach or as in like a remote or remote storage.

No So basically, you have a centralized Prometheus but then each cluster runs its own Prometheus and the centralized Prometheus grapes.

The other ones.

That possible.

I'm not sure yet.

I actually just started yesterday looking at the helm file for.

But yeah I'll be honest, I'm not stupid familiar with Prometheus.

It's just I know that it's a lot more lightweight than what we're currently running, which is the Stig is third party.

But takes up takes up 30% of our CPU on our 4x 2x large is on a clause of the military boot and it's just for a mining tool is just asking for too much also is memory intensive as well.

So Prometheus obviously is on the other side of that spectrum where it is not a resource hog which is why I mean, yes.

I mean, it takes a brute force method to monitoring right.

It looks at like a packet inspection and everything happening on that box.

It's but I guess it's a testament to how fast sea views have come and how cheap memory is that they're able to get away with doing that.

What does.

But it's still a problem when you're doing things like you are saying, yeah.

Yeah, I have you guys gone the Federated Federated approach.

What would you say about that.

So we are we're starting a project in the next month or so.

And with a new customer.

And that's going to be doing Federated Prometheus.

The reason for that that cases, they're going to be running like dozens or more single tenant dedicated environments of their stack for customers is basically an enterprise that's so having dozens and dozens of Prometheus is in for fun as would just be too much noise and too hard to centrally manage.

Also scaling that architecture.

If you get if you only had central Prometheus would be dangerous because at some point, you're going to hit a tipping point.

And it's just not going to handle.

So with a Federated approach basically, you can run Prometheus because Prometheus is basically time series database.

In addition to its ability to harvest these metrics.

Now it can use like a real time series database like influx TV or post scripts with time series TV or whatever.

And so forth and a bunch of others.

But the simplest way, I think is just to think about Prometheus basically, it's a database, you know we can offer.

So then what you can do is you can run smaller instances of Prometheus on each cluster with a shorter retention period.

So you can still get real time data in, but you don't need to run massive deployments of Prometheus, which can get expensive because I'm running for methe s with less than 12 15 gigs allocated to the main Prometheus operator is required before for any sort of retention more than like a week or two.

So So in the Federated model basically set up like on a shared services cluster.

What we call core one for me at this instance that then scrapes the other ones.

And it can do that at its own pace and you can also downscale the precision in the process of doing that.

If you need to.

Plus if you do need absolute real time, you could still have Grafana running on all the clusters.

And you can have real time updates on those environments, different ways of looking at you guys use the remote data remote storage.

We don't.

So it's kind of it's been in our backlog of things to solve.

And there's been there's a few options of it.

The one that keeps coming up is Thanos where we look at the architecture.

I mean, it suddenly it increases the scope of which you've got to manage right.

And when is your monitoring system.

It's really important that it's up.

So I understand and appreciate the need to have a reliable back, but also the more that you the more moving pieces the bigger the problem is if something goes down.

And then what monitors the monitoring system and all these things.

So we've gotten away with using EFS and as scary as that sounds is actually less scary these days because DFS has basically x compatibility and you can schedule.

You can reserve.


So the problems that we've had with Prometheus have been related to AI ops and then we just bumped up you pay to play.

And it's not that much more to pay to play compared to engineering side.

So we just bumped up the AI ops and all our problems.

One way.

The other option.

I think Andrew's brought up before is that you can also just allocate more storage and then you get more high ops, which is great because it leaves it.

First of all, it just gives you more credits.

So if you still have first of all, IOPS don't need that higher baseline on of credits you're getting is not high enough given the amount of data you have just store a bunch of random data and the amount of money you a knowing that I got your signals pretty bad Andrew at least for me, is anyone else hearing the feedback a little bit.

Yeah So would I be able to use.

So you know I have a ephemeral cluster type situation would I be able to use ACFS with an outcast cluster.

Oh, yeah Yeah Yeah.

Yes cluster that ties back into the same ACFS.

Yeah Yeah Yeah, you could have a static Yeah.

Yeah, this file system.

For example, and just keep reusing that and unique.

Yes, your family.

Yes clusters.

Awesome Yeah.

That's really nice about CSS is it does not suffer from the same limitations of being it EFS is cross availability zone.

Yes versus IBS, which is not one.

So the issue.

People run into frequently when doing persistent volume storage in Kubernetes is let's say you spin up Jenkins and you have a three node cluster across three availability zones because you like high availability will who Jenkins spins up in USD one and then for some reason Jenkins restarts.

And next time Jenkins spins up in USD 1B.

But that EBS volume is in one a still Jenkins can't spin up because it can't connect to the b.s. volume.

Yeah So your host DFS does not have that problem.

Yeah In fact, we've changed the default.

I forget what the term is this default storyboard class.

The storage class for cabinet is to be cast and make it explicit.

If you want PBS like before certain situations.

But the that's worked out really well for us.

Plus it simplifies backups too, because you can just set up AWS backup on your DFS file system.

You don't have to worry about backups of every single volume provisioned by Kubernetes.

So it's a lot easier just to back up the whole file system.

We don't do that because it would be backing up random crap that we don't need.

But you can definitely do that.

Yeah, and if she's a shameless plug.

If use the cloud posse in DFS module.

It supports database backups.

Well, that's good to know.

Thank you.

I probably wouldn't have looked the VFX.

It sounds like the right thing to do.

But we were running it right now with so far we have maybe three months of retention in it.

And it's fine.

Obviously, we have to wait and Seattle goes up to six months.

But I don't see any problems right now.

Once we address the memory issues the meat is we've been running for way over six months.

And we're fine with it.

It's about.

I hope it's about ops right.

As long as your AI ops handle you know, if you are in production and your Amazon, of course, you shouldn't use EFS or your Google or whoever.

That's dumb, but if you're a small you know if your traffic is small go for it.

And don't listen to all the people saying, oh, you should never run Postgres on EFS because it's the NFL.

It it's fine.

Don't worry about it.

Try it.

And if it causes issues, then figure out something some other something else don't add Nazis don't add complexity.

Right off the bat until you have tried the less complex option and determined that it's not a good option.

Yeah, I think that's a good way of putting it.

Are you guys using the CSI driver for EFS.

I'm using the original or whatever that's you that.

I don't know.

There's a tool called the effect provision here that you deployed your QB native cluster that just provisions perceptive volumes for you given consistent volume claims and it works great.

I just posted what I was looking at in the office hours channel.

OK So you guys are using the offensive provision here a fester Virginia.

I think we have a health file for it too.

Thank you.

Thank you.

Good to know.

Yeah if this provision and here's the file that we used and work straight.

We've been on it over a year.

No problems.

And so you guys are testing out coubertin entities federation.

No, we have not on the federation root for Kubernetes just for clarification.

When I was mentioned federation that was for Prometheus, which is unrelated.

I often look into that or Prometheus sandwich.

Protip with the effects if you're going to go into Governor cloud for any reason, use Governor cloud WF cloud E does not have CSS.

We made that mistake and it has cost us.

Well O'Brien by the way, where I pretty much learned about what this looks like for the Federated Prometheus is from your pal Corey Gail over at gundam so they can definitely shed some light.

They're not doing it under Kubernetes.

They're doing it like on bare metal.

But they can tell you about why they did it.

Yeah And I know that they just started that Prometheus initiative like maybe half a year ago.

Yeah, even that period, which is really cool.

Yeah Cool.

So as any other parting thoughts here before we start wrapping things up for today.

Any other interesting news announcements you guys have seen.

On Hacker News don't you.

Apparently friggin what's it called Whatsapp.

Check if you're Jeff Bezos don't use Whatsapp.

Yeah, that more then that sounds too good.

Also I thought apple backups were encrypted.

But now they're saying they're not encrypted.

I have started using Keybase heavily and I'm in absolutely in love.

Awesome but keep AI mean, that's great for chat and validation or whatever.

And maybe with AWS can creations and whatnot.

But it's not going to help you back up your iPhone right.

I mean, I don't have an iPhone.

I'm using it.

Give me some Keybase has file storage.

And it even has encrypted get repositories.

I saw that.

Yeah What's interesting and underrated are they're exploding messengers.

If you're sending secrets once you get them to exploding messages like like Mission Impossible.

They literally explode in the UI.

I haven't seen that.

How does that work.

I mean that on your message.

Oh, yeah.

Sure enough.

Let's go.

Let's go.

I never use that.

I just don't really use it for the check because none of my friends are on it for me.

It's what we use in-house to send like some of our secrets.

Yeah It's one of our best practices.

Well, we start with a new customer.

I always recommend that they set up a team under key dates were for those situations where you need to share its secrets that don't fit into like the best practices of using a password manager or the best practices of using like how she caught ball.

Just kind of a one off.

Yeah, for the one off things right.

Because a lot of these other tools don't support direct sharing of secrets.

To individuals in an encrypted fashion just put your passwords in Slack.

What could go wrong.

Seriously they cannot get access.

I cannot get this stupid doctor provider to work or forget.

All right, everyone.

Well, on that note, I think we'll leave it.

Let it be.

We'll we're going to wrap up office hours here.

And thanks again for sharing everything.

I always learn so much from these calls reporting on this call is going to be posted in the office hours channel.

Do you guys next week same time, same place Thanks, guys.


Public “Office Hours” (2020-01-15)

Erik OstermanOffice Hours

Here's the recording from our DevOps “Office Hours” session on 2020-01-15.

We hold public “Office Hours” every Wednesday at 11:30am PST to answer questions on all things DevOps/Terraform/Kubernetes/CICD related.

These “lunch & learn” style sessions are totally free and really just an opportunity to talk shop, ask questions and get answers.

Register here:

Basically, these sessions are an opportunity to get a free weekly consultation with Cloud Posse where you can literally “ask me anything” (AMA). Since we're all engineers, this also helps us better understand the challenges our users have so we can better focus on solving the real problems you have and address the problems/gaps in our tools.

Machine Generated Transcript

There we go.

All right, everyone.

Let's get the show started.

Welcome to Office hours.

It's January 15th.

My name is Eric Osterman and I'll be leading the conversation.

I'm the CEO and founder of cloud posse.

We are a DevOps accelerator.

We help startups own their infrastructure in record time.

By building it for you and then showing you the ropes.

For those of you new to the call the format of this call is very informal.

My goal is to get your questions answered.

Feel free to unleash yourself at any time if you want to jump in and participate.

We host these calls every week will automatically post a video recording of this session to the office hours channel as well as follow up with an email.

So you can share it with your team.

If you want to share something in private.

Just ask.

And we can temporarily suspend the recording.

With that said, let's kick it off this Office Hours.

So here's some talking points.

I put together one announcement I'd like to make.

Just today.

I think hashi corp. announced that the Terraform providers are going to be distributed as part of the Hasse corp registry.

So this is like what is available today with Terraform models.

The caveat here is right now, it's only official providers they are working to open it up to third parties and other providers to publish their providers.

I guess for the security implications of this.

They're just being very careful about how they go about that.

Anyways, this exciting news because anyone who's tried to use third party providers.

It's been a real pain.

So if in the near future, they fixed that I'll be very happy.

Another public service announcement.

We said this last week.

Just want to share it again, because I'm pretty sure it's a bunch of companies are going to get bit by it.

The root certificates that effect RDS are raw and documents are expiring in March.

If you're using certificate validation with your instances that's going to break unless you improve upgrade your certificate bundle for those instances or fruit for the nodes that your clients run on those certificate bundles need to be updated.

All right.

So the next thing we have a question from here in our community posted and office hours.

He has also some feedback because he asks in Cuba and Eddie's office hours and got some answers there too.

So we'll see how that reconciles with what we recommend.

And then a demo.

If time permitting, we wanted to do this last week we ran out of time.

I just want to show you how we've been able to use CloudFlare as kind of like this Swiss army knife reverse proxy port injecting H2 MLC ss into sites that we don't necessarily control.

All right, let's start with you here.

So the question you'd asked was how to provide a module or a tool to our application developers to provision the databases for them, including the deployment of the secrets inside of communities and in office hours hours right now in that channel.

If you check.

There's a link there to a Terraform repository where it looks like you started kind of like EPBC with putting something together.

Well, it's basically for me, it's in our infrastructure as code repositories right now.

And we want it to or would like to move SAP basically to the application developers.

This is a private project of my the similar is pretty similar to the infrastructure that we use at work.

So you would go to the committee you just have modifier telephone from file.

Yeah, this is Bono.

All right.

That's another one for post.

I think Costco's Hamilton fire down on the bottom.

So easily these two would be one file and we could give this to our application developers.

And I can just maintain those and currently we get roughly around 2 to 10 requests for requests on our infrastructure scope repository and make some people.

And we still have our own people.

So currently kind of gets to the point where it's cumbersome to maintain all these requests and we were thinking about just like putting it out.

So I asked the same question in the communities office hours where a lot of knowledgeable people on.

They basically said, well versed in their developing experience because the environment is either.

So I don't know much about op's application therapy in general.

And yeah it just reduce their productivity from the application developers, because I didn't know the centerfold.

That's the feedback that I got from them.

But I'm happy to know what if you are doing something somebody.

So there's two things at stake here.

There's one is kind of like your stated approach or objective that you're trying to do right now.

And this is one way to do it.

What you describe here is kind of like how I would automate the production of perhaps your production grade backing services for how low to fire for those of you who don't know how notify or is a project that year working on.

It's a helm.

It's a tool to visualize differences between helm releases or helm charts.

And here is kind of like some of the infrastructure code to deploy that.

So what we have here is some Terraform code that provision that generates some random secrets that provisions a post gross role in the database, and then writes those secrets that were generated into Kubernetes into a specific namespace with those values, including the database, you or I in this case is what he's writing there.

So hold that thought doing this and Terraform is great.

And at some point, you're probably going to want to do that like for your production grade services.

This is what you're going to do.

Well, we don't recommend though, is for preview environments or pull requests environments or what we sometimes call unlimited staging environments.

We want to use Terraform in this case in this case.

Well, we would be using is as part of your helm chart you would have the ability to enable backing services that that chart depends on.

And then that makes it very easy to spin up a fully self-contained environment within Kubernetes just by deploying that chart into a release that.

Then the chart itself can generate those random passwords using the Rand function, whatever it is in home templates and it's easy.

OK So that solves kind of like I think the problem actually that you're trying to address right now.

But it doesn't solve kind of your initial requests like, how is this the right approach that we should take for safe production environments.

If I reframe your question.

And so that was the way I first, they actually interpreted it when I was senior question and it's something that we've been thinking about to a cloud passes like, how can we move the dependencies for backing services of your applications that we need to provision with Terraform closer to the applications themselves and not in like some centralized infrastructure repository.

Because you don't want to have to open one PR in the infrastructure repository like most companies have.

And then another PR free application coordinate that whole process.

Right the main problem because we only on regulatory push our infrastructure is cut to production or even to staging and application developers might be faster than us and want to stay up as a service to staging already.

And they have to wait for SRV to dodge a gift with a message.

That's a good point.

One that I didn't bring up.

So you're also talking about not existing cert not just existing services, but also net new services that are deployed.

What's that process.

Yeah OK.

So be the approach that we are starting to lean towards.

But I can't say that we have fully achieved it.

I think there's some others here on the call that have maybe Dale has done it or Andrew Roth maybe if he's on.

But would be to have the Terraform in each one of your microservices.

You have a Terraform application folder in there and that would have the deep.

Backing services that it needs.

Now what you describe here.

Yes, this is a perfect application of what you just what you have in here.

This should be captured inside of a Terraform module.

And then that module will be version separately and of itself.

And then you might want to have some automated tests related to that separately.

But then when you want to instantiate that.

How do you do that.

Well, that's where you're now in your application full.

Now in your applications repository let's say how modifiers your application repository there'll be a sub folder in there called Terraform and that's where you're going to invoke that module as part of your ICD pipeline.

The question is then what do we do about the integration secrets.

How do we get those there.

So that's going to work well in this case is that depending on what you're using for your CI/CD I'm going to say, let's say you're using Jenkins if you're using Jenkins it's going to use the standard credential store and then your pipeline will indicate that that's the credential store, it's going to use.

And then it can just go ahead and position this stuff from scratch.

But the challenge is if you have some credentials that can be programmatically generated and some that needs to be user generated.

That's the challenge.

And that's very often the case like you're integrating with third party guys and you developing a new service.

How do you get those secrets in there for the first time.

I don't have a brilliant answer for that.

We usually just see people kiddos in as part of that cold start.

Well, this kind of stuff your itself with time from very able.

We just put them in well.

And just put that very terrible variable in top secret configuration.

So that's how we do this kind of stuff that's almost an.

And then that also empowers the developer to set that environment variable in your CI/CD.

OK Yeah.

So this is how we do.

We are doing that or they are using basically the services and just to jq and try to get it somewhere else.

For example, Google Maps.

Someone wrote a best script to generate Google Maps secrets.

OK I don't have Terraform support.

But yeah if someone has ideas about us.

I would be super happy to know how you are doing this kind of stuff if you're doing it.

Mm-hmm It's more like exploratory right now.

So also I'm curious.

I mean, anybody interject in the answer I gave is was there some disconnect I missed there, which is the part that you're struggling with.

It's more like I wanted to know if people are already doing it and what our experience was with it because, well, that's the feedback that I got from the committee's office hours.

Yeah, I tried it.

And developers didn't understand why I'd now have to approve infrastructure as code as part of their deployment process.

So I had to prove they did it this way.

They approved basically the terrorism plan and the terror from Kenya to execute it.

And then the application promotion got executed.

And larger.

So you doubled your approval go gates because they moved to infrastructures called into I think it's going to largely that the organization culturally how they approach it.

I kind of react to that.

I this is this kind of like, oh keep other people's problems as it relates to code reviews.

I kind of reflect that.

Don't you want to grow you know like career wise and learn these things.

And I think requiring code owners in your repositories so that certain when certain files are changed that an approval comes from one of those people.

I think that's a way to mitigate some of that.

But that can be all bundled into the same code review process.

Perhaps it slows it down a little bit initially.

One other thing I wanted to say is.

So the challenge here is that we're spanning tool sets.

Right like if all of this works just through coo cuddle.

So to say or like the Kubernetes API it be a little bit easier versus here where I'm sorry.

Go ahead.

I'm saying long held versus here what we're doing is we're straddling tool sets we're straddling a Kubernetes tool set and Terraform toolset for deployment.

And this is where I think that now.

Now, if we look in this repository you're technically using Digital Ocean so it's a little different here.

Yeah And some of our answers would be probably more obvious focused.

But this is where the e.w. s service operator is also, I think, really exciting is the ability not necessarily this project.

But this technique.

This badges.

So you have a Cuban ID.

Oh, interesting.

It's been archived.

But basically, you have a CRT that you deploy that defines that part of the infrastructure.

So that you can stick within the same toolset to deploy.

That's cool.

I actually today found cube farm, which basically gives Cuba need to see ID for telephone providers.

I posted in the office also.

Oh, I will not exploit get net but I think it could also off my issue where I said, it's a second link not the first one where people now see I want to switch, or maybe it's the same coupon coupon looks like this is they've gotten a lot more in.

Yeah, I guess I have not seen this one because ranchers started with one as well.

You know.

Yeah Yeah.

I have not tried it out yet, but it seems like you get a C see d for your telephone providers and can then do the same stuff.

Mm-hmm Yeah Q4, they also do this.

Let me double check the OK.

So it is just I thought Jim form did more than Terraform but OK.

Never tried it yet.

So I. Just want it today like a couple of hours ago.

But I will take a look into it as anybody kick the tires and in Q4.

Yeah nice.

That is the first one she heard about it.

Yeah or it was this based from the one from Capgemini or is that a for I'm not sure.

I did not investigate too much, to be honest.

Other products as well.

Cute folks keep farmers from like app code.

Yeah Yeah.

That's what it was.

It only went to the website saying absolute they do a handful of things I haven't looked them too deeply.

Interesting looking products.

Yeah, thanks for sharing.

Yeah you nailed it.

That's kind of like what I was thinking is.

I think that's where things are going to get easier and easier.

There's more of this is that we can use communities as the scheduler for everything to do.

Cool well I didn't want to embarrass Dale here a little bit.

Dale just got his certified Cuban his administrator exam K And so pass that.

Congratulations Dale.

Glad to.

Glad you accomplished that.

Great Thanks.

And that means anybody else is serious Cooper related questions, ask Dale any other questions see here got something that chat that someone actually do the Cuban writers application developer certificate.

Oh the.

Now I Katie Yeah.

Is that next on your list.

This one actually came about a little bit.

Maybe I think to go more into the security side of things as well.

Data science.

I'm not sure.

I may be doing them watching.

Yeah by the way Dale posted the resources he used to pass that exam in the humanities channel in suite ops.

Check that out.

Yeah, the first link for Linux Academy really does it good.

Or if you want to brace yourself itself, but the one on you dummy itself actually does this practice test that are pretty spot armed with a set of questions that actually go with it.

So are doing both of them you pretty much cover everything.

I have a question does Michael seem Michael a long time listener.

It looks cold wherever you're at in New York City.

I finally kind of gotten the weeds with this product.

I'm talking about was just started here, which is the rule, you are all service within a code fresh and so you know one of the things I do is I was to create a branch context read your book to create a project.

The project has its share configuration.

So acts like a meadow namespace and then what I'm doing is because we have essentially microservices that need to as an application must not support.

So what I've done is I've used Helen file right to basically install all of the applications as they are from the release branch of the latest basically, from our sort of the next branch of the charts.

Then basically running it again with just a diff image tag.

What have you.

I'm trying to figure out the best way to do that next part do I. I tried to do help while apply with just the you using a selector and with a namespace.

But that would seem to do just it all just ended up being reverting back to all having the same image tag and not like overwriting it with that diff.

I'm not sure how to make it a command line.

I have to write a file.

Unsure about what I needed to do in that circumstance if that makes any sense.

Yeah let me summarize then my own words also add some more context, I think for everyone else.

So see you're using ranchers distribution community rancher.

This concept of projects and you're using code fresh as UCS platform.

You want to create preview environments or like these unlimited staging environments as we call them.

And to do that.

You've defined your entire application architecture inform form as part of multiple helmed charts or releases and all of those charts the releases are in this health file.

And now you're deploying that file as part of your CI/CD pipeline.

And when you build a new image you kind of you want to update that service.

Now, the part or this is the part where I lost you a little bit is the problem like, why don't file apply wouldn't be working for you.

Well, I mean, I think I was running a health file sync at first while I biggest because I guess initially, I'm doing this in the model regrow itself.

Right like so it's not all of the list right now at the moment weren't being published because we still kind of build a pipeline.

So they weren't officially published.

We don't have a version.

I mean, no question when asked the versioning system for this.

I'm trying to figure that out to be able to create canary builds that kind of stuff.

And whatnot with you know I have other tools I use for my libraries that I obviously like NPM libraries that use like auto and semantic versioning, but unsure how to be able to make that work here.

Aside from that what site was a question you asked.

No, I didn't.

So I struggle to see where I didn't understand the problem you're talking about with drifts or patches or things like that.

Were you using tools outside of the home or outside of the home file to do the patching or changing images.

No I mean, you know what I mean.

I just use code fresh to create an image tag.

And then I don't like save that file.

I just kind of pass that as a set you assets as a set to the second.

Like yes I do the home file plan, which applies everything.

And I really do help upgrade with the diff.

You know for the one service that I'm trying to see against the overall and trying to.

I was unsure a what.

Which one to run like Darren helm file sync a second time he'll apply the first time I went.

I'm not entirely clear which I should be running in order to be able to your fly.

It gets to think.

Yeah Yeah.

Yeah, I understand the confusion and so part of the confusion is like hell fire as the tool has been growing really rapidly accepting a lot of cars and things like that.

So sink was the original command that helm file came out with.

But then to kind of more closely aligned with the pattern of Terraform where you have plan and apply and destroy the whole file implement that the same things.

So from what I recall I think helped file apply honours that installed flag inside of your releases so that you can maintain the state of weather services installed or not installed while helm files sink.

I don't believe we'll uninstall services based on that installed flag versus otherwise.

So that's that.

Now what you're doing.

I can't I can't say it's wrong.

Like the two phase apply process.

Well but it's different than what we do.

So what we do is that configuration like 4 4 held file, you would use environments.

For example, in the environments if you like.

You don't want to rewrite that file if you want that dynamic and that can just refer to a go template macro for interpolation function for get in right to get there.

That's exactly what I'm doing actually to be clear.

OK I was doing that itself.

And what I found was just weird stuff like music.

Initially I'd like I have environments like with ephemeral and broad and QA which is where my static aesthetic values live right.

So that's where like you know the main thing is that the image tags is what essentially, I wanted.

All right.

Now, if I tried to float up the end up everything there then everything would have the same very value that I set.

But that just sounds like the schema is wrong if you're held a file.

So like that.

And I'd like, OK.

So not everyone is familiar with how file Alma file is a tool that helps you describe all the helm chart releases that you want to have in your cluster.

And the problem with helm is right that the values file is static.

It doesn't support any templating any interpolation functions.

Anything like that.

So the tool help file came out.

Let you describe now what helm releases you want to have and lets you as a developer kind of define your own schema in the form of what they call environments.

The problem with helm and helm charts is that every Helm chart every developer has their own way of describing how to install this application.

So it looks like a mess when you try and stitch these things together versus the helm file.

Now you can define that schema.

All right.

So in the schema your case, you have somewhere where it says image tag.

But it sounds like you only to find that in one place.

It sounds like when you want to achieve this you want to define the image tag for each service.

In this modern reboot what does difference.

What I'm trying to do for the ephemeral right is not only define I have a default. So falls back to like the next tag.

Yeah which is what we tag everything with.

Once it goes away.

What then I want to override just a single image tag for a single service.

Now what I'm actually doing now because I couldn't figure out how to use it twice.

So I'm just doing the whole file sync on everything.

And then is running the helm itself.

I guess the one released and upgraded that I say.

So Yeah Yeah Yeah like that can work.

Yeah, I think maybe if you're able to perhaps after the call or even during the if we have time.

If you can share some portion of that home file that you have been better at answering.

Also the other day the other concern is that using the next tag while it makes a lot of sense to humans.

It's also really hard to really have assurances of what's running and what we would typically do is we.

So we always tag the images with multiple tags and code fresh makes that very easy.

So Yeah, we may be tagged with next.

But we always tagged with the commit shop and the commit shop.

I'm doing the commission with the code fresh branch normalized thing in front of it.

So yeah.

Which is what my name spaces while I also create a new namespace each thing.

So exact that the image tag is that.

And that's for the federal environment.

And then once it gets promoted to the next environment.

It has that plus the next day.

Yeah, I'm only doing next because it was like, how to like how other than pulling down the latest release from code fresh from the registry.

Once I do that, then it will be that I can use that instead.

So let's see here.

I don't know how I can quickly.

And I go upstairs will pick and go to the computer.

I'm on mobile right now.

OK Yeah.

I don't think of an example.

I can pull up.

Well, the reason we fast enough for this to not waste a lot of time on the call.

I don't mean to bother everybody's questions is probably go while I'm getting upstairs and changing my zoom to the computer.

Sure thing.

Also here.

Are you using helm file as part of your process or not.

Yeah, but not us.

Michael's planning it to do.

So we have basically each application has a helm shots.

And then we have basic infrastructure, which is deployed by hand file.

So basically, our set manage our ingress.

All that stuff was two part time file.

But the other stuff is not thinking about what he's trying to achieve.

I think he could use jq to see which tech is currently report deployed.

And then check basically, if there's a difference.

And then only deploy the same file apply the changes that are actually out there because I think and sync and apply.

Difference is only that there's.

So my understanding if we look at this example of that.

I add here is that you'll have many releases of related to this one monetary one.

One could be bold for example.

And then there's an image and then there's the tag here.

And then he has multiple releases.

So here's bolt and then down here will be another one like refiner.

And let's say he's reusing this syntax everywhere where he has the image required environment variable image tag.

But he has that repeated everywhere.

Then how do you target just want changing the image tag for 0.

Maybe I understand.

So that your problem is that it's not deploying maybe if because you're using the next tag and if you just target one service.

How do you just target that one service to update.

I think the disconnect here is that if you were using to commit shy instead.

And then deploying like refine it with that commit shot.

Then that one will update.

But not the other ones because what you can do is you can combine required and with like a coalesce so you can say like Grafana.

Image tag.

Otherwise default to the default image tag.

So that lets you override individual tags for each individual environment or individual releases.

But I am guessing he is in the staircase right now.

I will I will pause that temporarily.

I'm here now.

Sure can share that.

OK go to office hours.

Was that my thing or should I share my screen.

If we want to let this go.

Yeah, we can.

I'll start right.

I'd rather not you know share my code on a thing because.

OK if you don't mind, I'm sure, you can also just as a reminder, these are syndicated.

Yeah, that's OK.

I don't mind that.

I just you know.

So you see my screen now.

So Here's what I have now.

We have a model repo obviously.

Each of the services up here, and then I just a ploy folder in which case, I have the charts themselves.

Yeah And then I have the releases.

And then I have appetite.

Yeah Yeah.

This is kind of what I'm doing with this right now.

Yep this looks good.

Now show me where the image tag is to find the image tag right now itself would you define for a environment.

Image right now is here.

So define OK next.

And then I override this.

But there's also set in my view, in the chart itself, which is.

Yeah So it's just the that I would do would you want to do this replace line 16 with a go template interpolation.

And there, you're going to look for the value of articles image tag and if articles image tag is empty, then you default it to next.

I would get that by clear.

I'm just doing this right now.

So it would be.

Yeah articles knows the way articles.

No, no, no, no.

So put like required and not require them.

I'm drawing a blank on the exact function name right now.

I think it's getting this.

Or is this just.

OK And not.

And space like this right.

Exactly no.

Yeah, I would stand you know standard environment syntax.

So be or.

Exactly So like that.

Yeah But what I would actually do because you want to surgically target one of your releases not all of you.

So now what I would do is I would change that to be a prefix that with articles image I got it makes total sense.

And then whenever I'm going over something if all value is to be next.

OK So behind idea if you would basically do a pipe and then default next.

Yes, exactly right.

I gotta say.

So the default is here.

That pipe pipe default next is a pipe.

I thought it was the other one I think.

I planning for a pipe.

So sheer pipeline and a string with next guy makes it work.

Yes each of my services in that way, I can.

Exactly So now you can surgically target each one of these with the tag you want to deploy it, which would be in your case, probably the commit shop is right on.

Well, thank you.

I appreciate that.

That's an enormous help because I know that you would do that.

If I do this.

Now I can just run the helm file just once right because yeah you just run it once, then you call start to export or whatever, just set that environment right also.

But to test this, I would just like to just export a ticket image tag and to have high Def and you get a Def and a church only chose this.

Let it change.

Yes it's changed.

So I said export Oracle's image tag.

And then do that locally.

Yeah Yeah.

OK that actually would solve many of issues.

I'm Morgan Brennan to double double the point.

So thanks, guys.

I appreciate that.

So oh come on now stop my share.

But I appreciate it.

You know I was looking forward to this like all week sounds like you finally have it is you that's like hamstringing things that Secretary has a question regarding to get UPS if there's no one.

Yeah, one quick question.

Do I get rid of the initial values that are in charge themselves only use his values.

No you don't.

You can leave those.

It doesn't matter because values that gamble will override whatever you define in those charts.

So if you make your chart something deployable by default, I think that's generally the best practice.

And now what I was doing.

Also thank you.

Next question.

Yeah, I feel that if anyone has any like experiences with pipeline that's code managing with many multiple development teams.

I'm open to hear it.

And I'm at a point now where I'm at, 20, 30, 40 repos and know there's old pipelines code that I have to redo and I'm struggling just working with some offshore teams and they won't let me touch stuff.

They won't accept my peers.

And it's driving me out of my mind.

Well, I'm kind of curious like if anyone's strategically worked around that other than myself.

I'm working on like using outside libraries and some of my pipeline is code and you know you know I keep on thinking I just want to do my own dedicated branch to do that.

You know.

That just seems wrong to me, seems like an empty pattern.

So I'm curious if there's any other.

Experiences out there.

So I can go on talk about it if you want.

So I had this.

I had a similar issue.

We had outsourced one of our projects and the team was really afraid of.

If I do the pipeline stuff and they had a very weird schema.

I think I a similar issue right now, whereas I deployed to production into a branch and has three different benches for environments correct.

Yeah, we are multiple branches we're doing per environment branch at the moment.

So one of the issues that annoyed me the most was that I never knew which features were deployed in which branch.

And so I Justin also had what I was saying goes to mass and we use tick UPS to promote this was the only way for me to manage it.

But also you could have the deployment process in a separate repo and have just updating the array will come across or put up the available talk attacks pushed into a different repo out of the cope update.

And then manage the deployment process in this secondary repository.

OK That's why I'm starting as a viewer toward.

So at least I know I'm not feeling like I'm totally off my rocker and go on that route maybe.

Eric Yeah.

So if I also know that low end under wrath has some experience with this.

I think I might have misinterpreted the original question.

If you want to just restate the problem again.

Well, so I've got a lot of pipeline as code in the repos as it probably should be.

It's supposed to be, and I'm to the point now where I've got several many multiple repos that I'm helping manage as a singular DevOps resource amongst multiple teams and certain teams and projects are more sensitive and they will not allow me to make pipeline changes easily.

It's just seeming like for although I am in a release you know I'm going to try to maintain a cadence with the development teams in some cases.

I need to make changes to their pipelines to energize and improve things.

And then I'll let you know I guess like a better way of describing it.

And it seems like there might be a better way to do so.

OK So they're concerned about changes in flight to kind of these pipelines that could reduce the stability of deployments.

So don't blame them honestly, you know I don't know.

I mean, these things break relative you can break pretty frequently.

OK So one other one other kind of procedural thing or organizational thing is adding code owners right so that when changes to the pipelines happen members of their team that are set as code owners fought for those pipelines have to improve on those changes.

So that's one way to enforce it without disabling kind of like ability to do pipelines as code.

So that's one thing.

And then.

So pipelines.

Yeah, that those themselves have their own lifecycle almost right that you need to be able to validate that that's not breaking things and you have are you are you a what are the common types of changes to the pipelines.

And are those pipeline kit.

Can those changes be validated, but by using preview environments.

And that's the changes that I'm looking at.

So I mean, we put these things together about a year ago.

And there's a voluminous amount of pipeline delivery code and things like that that I personally wrote that I'm not even proud of.

But it was needed at the time right.

Yeah So you they're saying, you know and that is we all know the more could the have and your pipelines the less simple it is and less important is the more it's Yeah breakable.

Right Yeah.

So I mean, you know I'm so I'm trying to simplify things down and make it declarative as possible and going back to some of these old still working pipelines and trying to like remove unneeded elements and things wrong along those lines.

So I am unfortunately stuck in Azure DevOps in this particular case.

But I'll use it to its fullest to do things like pull out pipelines into its own repo and reference that repo externally.

And I greatly simplified a lot of the various repos pipeline code.

But in the same regards I'd point to an outside repo for the code.

So that's better or worse, that's what I'm doing.

But I want to do this more for some of the other for some of the more delicate pipelines and I just I'm just making them tripping up and working with some of the developers on these things and maybe it's just me.

And you know.

But I just wonder if anyone else has dealt with that in any other way.

I keep on thinking, maybe I should just have a dedicated branch for pipeline code you know in each repo that they can't mess with and that sort of strips the control from the team.

And that's not that don't feel right.

You know some way to break other things.

So I don't know.

That's a sort of deep coda.

Are you able or using get up or using another piece.

Yes it's the area of DevOps repositories.

Oh, OK.

I'm not familiar with those.

It is a lot of hassle oh in the audio environments.

OK Yeah.

So the code owners might not apply to that.

Right on.

Maybe you in their branch policies and whatnot that I've been slowly been able to get enforced.

But yeah.

Code owners is they've got something similar.

Yeah, people notify you when there's changes.

But then getting them to approve.

It's a different story.

So what maybe I can have a private conversation with you because it's maybe been over the scope for this conversation or for this talk right now.

But I had an accountant exactly the same issues that you had this application team removed the info as he pipeline stuff that I introduced.

And because I have like 50 to repositories that I'm working on.

Yeah, I wasn't aware of it.

And then they said, well, we are not able to put any more because I restricted the UCLA access.

And yeah.

I wasn't even aware that's a push differently now maybe.

So if you want to 52 plus repositories at the same time right.

Gets hot, especially if you have 120 injured.

Yes stuff.

All of really smart people you know it's not like you know.

Yeah, they'll find a way around things.

Yeah, I'd like to chat you on this.

I appreciate that.

Thank you.

All right.

It's really nice to hear from I I was going to wait to eat or was asking if we were still going to do the CloudFlare demo and I do want to do that.

I was going to give him a second to join here.

I'll get CloudFlare teed up in the meantime.

Any questions.

No, I love puzzles.

Well, I guess if we're just I'm trying to play GitLab on queries and it's going to be publicly facing.

So I'm trying to do it like I wanted to be an exemplary deployment and I've been using their help chart or I should say the instructions for their Helm chart.

And I'm thinking, wait a minute.

There's got to be a better way to do this.

They just seem to be lots and lots of little moving parts.

Like oh you wanted to s three.

Then you got to go over to this demo file and tweak that.

Well, you want to use those three with IAM roles.

Oh, hang on a second.

Now I run it.

So suddenly I find myself trying to configure k I am.

It's just it's not that everyone, or is it just me.

It's not just you.

OK I will.

You'll you'll see all these beautiful demos of my I did.

It's that easy.

But the reality is once you get to like operationalizing something for production, you've got to worry about emeralds.

You've got to get those roles into your services.

You gotta choose if it's like, OK, in this case, you need an S but it'll keep you've got a provision that lucky you got the back.

So that's why we have so much freaking Terraform stuff to support coober neti stuff because there's so much that Cooper needs can't do.

That is like cloud platform specific.

Yeah, this is where you know it's just that we're a little bit early in terms of where things are in humanities but like the operator that terror that cube form that was shared earlier by a peer that looks interesting or like a service operator.

The idea here is that if you can get more of this stuff all in to Kubernetes you're not bridging tool sets.

Yeah then it gets easier because you can have a helm chart that says, hey provision this Terraform code and take the outputs of that and stick it into this community secret.

And then, you know, blah, blah, blah, blah, blah.

Do all this other stuff in Cuban is.

I don't know that many people.

I don't know anyone doing this, though really in production right now typically multiple pipelines that have steps that this part doesn't Terraform this part does the OK robinet is and trying to make this turnkey for net new applications doesn't.

I'll just man up and Yeah, I don't know anybody disagree with what I said there.

I would love to be a rock.

OK So here is CloudFlare CloudFlare workers have gotten a lot of attention CloudFlare workers are a little bit like running lambdas on the edge like AWS lambdas.

I am not like a super expert at doing this stuff.

But I think that's a testament to workers.

But because you don't have to be very hardcore to get a benefit out of it.

So let me go about how we've benefited from using workers.

And I think these are some common problems that others might have too.

So like you know we launched our podcast at a podcast club posse and that's through buzz sprout and buzz sprout offers branded kind of micro sites for your podcast.

Lots of other services do this too.

For example, get review what we use for a newsletter that also offers branded sites.

Thing is how do you have all these different micro sites that they never give you the ability to customize the menu on them.

They never give you the ability to change the CSS on these things yet is your brand.

You want to have control over those things.

So how do you do that.

Well CloudFlare workers makes this pretty awesome because you go over here like to your DNS settings.

And when you set up like the DNS for look load here for a second.

All right.

So let's take a look at the DNS for our podcast.

So here is the podcast.

No, that's the record.

Yeah here's the podcast.

So you see it's a C name to the buzz sprout a year old.

But it's proxy.

So these all requests are going through CloudFlare.

So if I go over here to work here.

I can now set up a rule that says podcast star first goes through this work right here.

And let me just make sure unfortunately, there's no really good secrets management thing here.

So let me just make sure that I'm not going to leak any secrets by opening up this example.

Yeah muted or I hit mute somehow.

Thanks for pointing it out.

All right.

So what we have here is an example of where we're going.

We're rewriting the content on the fly that we get back from the upstream requests from CloudFlare.

So what I'm doing here.

The objective is that buzz sprout.

They offer this podcast stuff.

But they hijack the canonical link.

So that they say buzz Broadcom is canonical for my podcast yet.

I have a branded site that's not cool.

So what did I do.

I just configure the CloudFlare worker to rewrite that content on the fly.

So here I rewrite all a tags linked tags meta tags on the fly when I see these patterns I just replace it with my podcast.

So that it gave me full control over the content there.

Which is cool.

I can also do things like rewrite the rewrite the site map site map XML to 0.2 links in our own site as opposed to their site.

So that was some of the examples there that we did.

Another example.

That's really cool is.

Let's see here, if you need to dynamically install headers like content security policies.

You can do that on the fly.

For example, we're using the Slack in plugin.

So the slack in no JSF for Slack invites.

We could go modify that code.

But then that's just more overhead that we got to do when we just want to do simple things like maybe at the end or so years.

The ability that we can do that on the fly for those requests.

Very simple little script streaming optimizations.

No, I want to go to slack archive redirect.

So many of you run single page application space you deploy those to service like s three, you're limited in what you can do.

So if you wanted to do a redirect it's maybe a little bit harder to do so very kind of pattern is you have a stub file that has a metal refresh equip in there and that made a refresh does the redirect but that kind of sucks because you first see a blank page and then redirect and it's not good.

So here's an example where we fetch the upstream.

And then look for that tag.

And then we do a rewrite automatically at 301 and serve that back to that.

We see that rejects matches.

So that's how we split up our outside.

We actually see the console logs.

Can you see someone somewhere.

Or so.

Yeah So what's cool is the way they've done it.

I mean, if I go here to I got a change domain.

In this case sweet UPS and then go to.

So here's this route I've configured everything to go to slack.

Archive redirects I can launch the editor.

I'm going to open up a slack archive redirects I'm going to go to that domain here.

Mobile are still all the console logs and everything.

My debug output I see here.

OK So this makes it really good that you can just iterate very quickly, you can update the preview as you're updating your code here without deploying it and seeing it.

So it's a safe kind of developed mode.

The downside is they don't provide any access to logs period.

Cloudflare it's not a server.

So what you got to do is you gotta integrate this century or some other third party service where you can basically post your exceptions in your log events and stuff to some third party service be related to this.

There's another similar feature that's existed in CloudFlare for a long time.

But it was implemented using workers.

I believe a behind the scenes.

But not now.

I'm just truly in love with it for these reasons.

So if we go to apps.

They have a marketplace on CloudFlare with all these hacks.

It's like a marketplace of hacks like the different blades of your Swiss army knife and there's a few in here that are really special.

So if I go over to install apps one is you can inject a site.

Now onto any page, you can add CSX onto any page, you can install a Tag Manager on any image, you can add each email to anything age.

So you can see how you can really do anything you want at this point with any page any site that's part of your network.

So like if I go now to cast that cloud posse I have a navigation menu here that's not provided by them.

That's provided by CloudFlare that drops me over to the newsletter.

Would you look at that.

I got a navigation menu that's not provided by them as provided by CloudFlare.

So all these things.

Let me stitch together all these different properties provided by different services that micro sites and get review.

They had a really bad CSX problem that made the stuff look like doesn't look beautiful.

Now Don't get me wrong.

But it looks atrocious before.

And I was able to inject mounts yes to clean this stuff up.

That's called the vertical wow.

This is amazing.

Yeah So CloudFlare is pretty rad in that sense.

But there are other examples.

I'm not going to say names of services that do this, but there are certain sites that can go to a pay up if you want to have a branded hostname.

Not anymore.

If you use CloudFlare you can change the origin of your requests and you can serve it under any domain you want on your site.

So you can do your own branded versions or micro sites of services like that.

Oh, wow.

This is under the free offering too.

Yeah you basically.

And distinctly CloudFlare.

I mean, God bless them you know, they are really fair in their pricing it's not that they have also overage pricing.

So you can start for free.

And then they have a reasonable rate that you can pay per million requests or something like that.

I hate these services that require you suddenly to you go from spent $50 a month.

Then they want a one year commitment at $1,500 a month.

How does the basic fee and compare to front say that when we're done.

It's just a basic TDM capabilities.

How does that compare it to something like we crowdfund.

Well, so this cloud front is pretty easy to beat that.

That's probably the worst CDM functionality wise that I've used.

Now it's gotten a lot better with the edge lambdas on cloud front, but the fact that any change you make to cloud front takes like 30 minutes to propagate.

That's a massive business liability right.

Like oh, we made some mistake an honest mistake.

OK this is down for 30 minutes.

I don't like that was pretty rad with CloudFlare is changes take seconds or less.

Most of the time.

And they manage it to Terraform or manually.

I confess we personally are not for these things.

But CloudFlare is investing a lot into their Terraform providers.

And I believe a lot of this stuff can be done with their form.

They also have a command line tool I for get what it's called, but they have a command line tool to manage this stuff and be more command line driven perhaps somebody can correct me maybe.

But I don't believe CloudFlare has scoped API keys.

And that's kind of like scary.

So Stephen Dunn fought for companies that want to use Terraform with CloudFlare and control kind of the blast radius of changes is you can actually use proxies with Terraform right.

So then there's a company called b.s. very, very good security and very good security is pretty rad because they'll be your man in the middle proxy and that you tokenized is basically that request.

You can use dummy credentials.

And you can scope requests to certain parts basically certain API endpoints.

And it will inject the tokens only for those requests only from IP is that you want only using credentials that you set and those credentials.

If x still traded or leaked are useless to anyone else.

So Yeah Terraform with very good security and CloudFlare is kind of like the pattern there.

We're almost at the end of the hour here.

So I'm not going to do a demo of this, but it's interesting.

We can do something arrange it a better demo of this.

The idea here is like you can create a route for like API that GitHub you can whitelist a certain IP range.

And then when these conditions are met.

For example, the path is such and such, then you can do replacements on fields in that request.

And since most service most things support the HP proxy convention, then you can just have this happen transparent.

All right Erik do you know anything about the kind of phone data protection laws that go into effect beginning of the year.

The GDPR stuff.

Well, I don't know how you call it in California.

But I think it is the equivalent of the same.

Yeah, we haven't been doing that much.

We haven't yet.

We haven't you know buy our customers been asked to help them with any of that stuff too much.

The cool thing, though, is with CloudFlare you can implement that across the board automatically.

You can install those retarded little pop UPS.

You have to have now on sites that say we're exporting cookies as if you didn't know.

And those cookies store information.

Well, with this if they click No CloudFlare can automatically always drop it.

So you can't track them.

Yeah There's some issues.

A bank in Germany implemented Taillefer last weekend and didn't update the data processing agreements on the back page and didn't inform the customers because they're around the details attack and basically counter.

They did so as to ask termination thing.

And now all the customers are pretty pissed off because they gave closure over all their banking information.

But that's the thing that think that happen happened Germany, and it's a pretty big GDP violation when it did, they screwed up the cash rules.

So they were cashing private data or not.

But I think just we'll see initial things that happened.

But they didn't inform the customers that there is now a middleman.

You have to do.

You have to have data processing agreements with the companies and they have to be accessible for your end users.

Yes And they did not do that.

And yet.

Now that 30 is that is really interesting.

So I don't know.

Yeah somebody else is using CloudFlare and you can share more information about that next time.

Maybe it really is.

All right, it looks like we've reached the end of the hour.

And that about wraps things up.

Thanks again for sharing.

I've learned a lot this time and that some others got the problem solved.

So that's pretty cool caught a recording of this call will be posted in the office hours channel.

We'll also be syndicating it to our podcast.

So see you guys next week same time, same place.

Thanks a lot.

All right.

Public “Office Hours” (2020-01-08)

Erik OstermanOffice Hours

Here's the recording from our DevOps “Office Hours” session on 2020-01-08.

We hold public “Office Hours” every Wednesday at 11:30am PST to answer questions on all things DevOps/Terraform/Kubernetes/CICD related.

These “lunch & learn” style sessions are totally free and really just an opportunity to talk shop, ask questions and get answers.

Register here:

Basically, these sessions are an opportunity to get a free weekly consultation with Cloud Posse where you can literally “ask me anything” (AMA). Since we're all engineers, this also helps us better understand the challenges our users have so we can better focus on solving the real problems you have and address the problems/gaps in our tools.

Machine Generated Transcript

All right, everyone.

Let's get the show started.

Welcome to.

Office hours.

It's January 8th 2020.

Can you believe that.

I just can't get over that it's 2020.

My name is Eric Osterman and I'll be leading a conversation.

I'm the CEO and founder of cloud policy where DevOps accelerator we help startups own their infrastructure in record time by building it for you and then showing you the ropes.

For those of you new to the call the format of this call is very informal.

My goal is to get your questions answered.

Feel free to amuse yourself at any time if you want to jump in and participate.

We host these calls every week will automatically post a video of this recording to the office hours channel as well as follow up with an email.

So you can share it with your team.

If you want to share something in private just ask when we can temporarily suspend the recording.

With that said, let's kick this off.

Here are some talking points.

I came up with.

We don't have to get to all of them.

It's just things I want to bring up and first thing is that we now have a syndicated podcast of this Office Hours, which is really cool.

It's so easy to do these days with things like zap gear, which relates to one of the other talking points.

The next thing is we also have the suite ops job board.

There are a lot of companies hiring that are in our suite ops community.

Our goal is to bring everyone together and that this falls in line with that.

So if your company is hiring for DevOps or something adjacent to that.

Let me know and we can post that to the suite ops job site.

Totally free then searchable slack archives.

So if you are in our slack team, you'll know that we are a free team.

So we're limited to the 10,000 messages but we do an export of that data and we posted to the archive, which is archive that sweet UPS and we've just invested in adding Angola to our search index to that.

So it's a lot easier to discover the content.

And we'll continue to invest in that and some other things.

The other thing is that public service announcement go into this in a little bit more detail later, but basically, AWS has announced that the s.a. the rootsy a search for RDA straw raw and documents will expire on the 5th of March.

I expect like a little mini Y2K kind of bug therefore, companies doing cert validation an arduous in March and then Andrew as Andrew Roth on the call.

He is not well Andrew had asked.

So we might push this.

He doesn't show up.

He was curious about how we are doing some of our slack automation using zakir as a pure enemy, which it is.

So I can show that.

And if we have time.

How we use CloudFlare is kind of like the Swiss army knife to fuse sites together inject content and do all kinds of magic with workers.

All right.

So with that said, let's get the open the floor up to anyone.

Anyone have any interesting comments or problems that they're working with.

Hey, everyone.

This is Brandon.

Long time no see it's been a few weeks since I've joined.

Just a quick question for his for people.

Does anyone have an that they can recommend.

I'm looking at our annual PCI external scans coming up and looking for a recommended RSV.

Anybody has one or some people.

And the PCI community, I can probably ask some friends if they have recommendation.

Yeah, just kind of looking for somebody you know trusted and decently well-known.

Yeah which area are you here in Los Angeles.

Well, I'd be happy to make an intro to any ocean.

They've helped out some of our customers and have been very pleased with their services.

It, man.

Brian Neisseria.

I'm seeing you join for a while.

How's it going.

Yeah, I always come to the country in December, I spent time with family.

Did you.

Where were you able to get to the bottom of your issues with Kubernetes and the random pod restarts.

I think the actual cops work or know to be starting oh, actually yes items cops work for nuts restarting correct.

Yeah still I can't quite pin it yet.

Nothing in the logs.

Nothing in the kernel logs.

It's weird like you'll just see like the Cuba logs in the kernel logs like they'll just go.

And then they'll have like one minute of one minute of nothing in the logs and then they'll just say reboot and then have like the new boot on the walker no nothing.

And easy to the system status checks everything good.

It's interesting just because like, we cannot find any logs that show why the reboot happens, we just can't see when it happens.

Did did you go about implementing those resource limits on the pods.

I did.

Yeah So now everything on those boxes have resource limits.

Limit on requests.

OK And Yeah.

Especially on memory.

And now are you tracking how much memory is being consumed on those boxes and can you see it reaching the upper limit.

So it does reach the upper limit.

Sometimes per pod.

Well, I haven't quite figured it out or I haven't quite tuned perfectly.

It is.

We have the historical historical data on not how high a part can go in memory, but now the question is, at what point do we say this pi is API We should kill it vs. this pi does need to use this in our memory for this time being, we should leave it up, because I don't want to add the limit be too low to the point where our clients are running into issues in the app because the party's getting killed right.

You don't want to artificially reduce the amount of memory to the pods to satellite.

On average, it will consume 1 gig of memory sometimes.

But historically, the data shows that sometimes these pods can go up to 5 gigs in memory.

Do I cut it off at 5.

Or is them going to 5 mean that something has gone wrong on that part.

That's a hard one to answer.

Does anyone have any insights on this.

Have you said your workload is it.

Well, welcome to a what kind of workload is it.

Yeah, this is just simple IV application.

You get on with like what kind of data those and all being is it it spike because users are uploading 5 gigs of data like some.

Yeah, we seem to have like the dependent.

So the arbitration dies does stories like Excel files and some of the files aren't started S3, but like a story like a modified version of the Excel file in the database.

So there are probably large amounts of data being sent up periodically, but not consistently.

So that will cause the spike in numbers.

I mean keep if you haven't already found a correlation in the metrics, then keep tuning the metrics until you find one.

Here's what I would say that it's a good strategy.

So going back to what you were saying though you don't know where to set that limit.

I don't know.

I don't know what the impact to the business is by following the advice.

I'm about to give.

That's what you're going to have to decide.

But what I would say is set the let's set it at what seems reasonable on average and not for the peaks.

Allow some pods to crash OR gate kill the reaper every now and then for the purpose of seeing if this address is the stability of the cluster itself.

If you're problems then go away that you aren't killing the servers.

Well, then at least you identify the problem.

I think it's better to not cause a reboot of the entire server than to lose a pod every now and then to know memory bursts.

Yeah, exactly.

Did you set up.

Do you have the Kubernetes autoscaler setup.

No, I don't.

But at the same time, these are these nodes are already massively under scaled in my opinion.

Oh, yeah CPU usage is at 40% And memory usage is at about 50, 0.

So it's like these nodes do have enough resources because my only concern is that when you set the memory limits that.

Now, you might be constraining pods from being scheduled.

I guess the limits are if the limits are too high, that will constrain them from being scheduled using.

Yeah if the limits are too high and you don't have free resources because it's actually the op the requested limit.

It's actually allocating so that that is reserved and can't be used by the parties.

That's the request.

Yes the requirements are not the limit.

Oh, yeah complaining that too.

Unless you have a limit ranger that has a max limit, then the limit shouldn't I have two requests that lower, much lower or and then the limits higher much like the request I have to attend to the average memory usage of a pod.

And then the limits.

I have it to the peak.

OK on the topic of pods I have a question that I've asked in suite ops.

I'm curious if anybody has run into this.

Just I have no 1800 pods on those of a specific type running on 60 nodes.

How do you get those pods to be distributed close to even the on each node.

I feel I've seen something.

Basically you're looking for something that does rebalancing and I'd have to look that up.

I don't remember off my head my issue being like, I have a pretty large difference in memory requests on some node versus others.

Some notes are up to the 90s, and then some notes have 40, 50.

So it's like I want to even out.

I don't think I need to scale up more nodes I just need those nodes with 90% 95% memory to give those type of pods to denote with 40.

I mean Cuba is humanity's tries to do that automatically, but for new parts right.

Not for existing pods is it.

I don't know.

Also I mean, if it falls, it won't kill a pod unless it's told to unlock you know if a pod is Mayor happily running.

And you know unless you have some other functionality that is going through and saying, I want you to go and go around and start rescheduling pods.

By default Cuban edit once you schedule a pod.

And it's there, and it's running on a node.

It's going to keep running.

Yeah if that pod dies and gets restarted there's no guarantee it's going to restart on the same node that it was before.

So one thing one thing that I'm really interested in that, I haven't tried yet myself, but I'm I'm incredibly interested in doing for two reasons.

Number one is what we're talking about with the efficient bin packing of pods across the nodes.

But the other reason being the enforcement of your immutable state is automatically restarting all pods in the cluster, or you know whatever pods you have in a white list every x number of hours.

I was talking to I was in the Ask Me Anything section that Nick's salon did for the DVD and they're doing it in their deficit gaps.

You know, high security clusters that they everybody gets restarted average every four hours.

That's very interesting.

I found not fire right now because I'm too lazy to write a prompter across the grease scopes helm shops on startup.

So I just kill it every couple of minutes, hours.

Well, we're using for that.

It's just time off.

OK Oh, yeah, you're at.

That's a smart trick right.

So what you see is your entry point script if you just add that it just add the timeout command.

Then you can have your pod automatically cycle itself.

This is optimistically right that, of course, they won't all do it at the same time, which is unlikely but good luck.

And also, it's a hard kill of the process.

I imagine.

So I give you all the existing connections.

So we'll just go, Yeah Yeah, it's not great since long timeout is actually six term.

Let's see.

Oh, OK.

So you can actually kind of let in the and past.

Yeah you have handles the sig turn gracefully.

Let's go.

Also I'd say I shared to office hours.

It looks like there's a project out there.

It's called scheduler and it will wake up when it thinks the clusters unbalanced and start killing pods off.

So that they get scheduled I would argue that that is not super necessary.

Because if I mean, yeah, sure if a node is getting full but everything's running fine.

So what Yeah.

More or less.

So right.

So if you have an overloaded node overloaded being a subjective term.

But let's say it's running at 80% capacity everything's hunky Dory, and then Brian's cluster.

This one pod gets that request that causes the memory to spike 5 gigs.

That's what he wants to avoid.

So if on average, it's more balanced.

I think the probability of that causing problems is less and we're assuming that that misbehaving pod doesn't have a limit on it.

Well well now they do.

Now it could still be all year.

And 90 like 99 or 96 percent.

Memory requires you don't have much of a memory left at least for me, I have one or two accidents.

Yeah, I mean so not like it's not like all my notes are the 99 96, 97 percent.

But some of them are.

But then there are on the other side, there are other nodes that are at 30% 40% request.

So I want to be able.

Like I want to get to a point where obviously all of them are like around 80% memory requires a 20% pool of memory for when you do spike up to five peaks in memory.

I do see these spikes but very short.

So that's two I 80% is too hard.

Yeah, I think 8% to my clusters are like 40% memory request, maybe that memory.

40% memory usage memory recall.

I don't know.

I'm not sure what's a request sir.

So for me, the I have nodes between 30% request memory request and 99 percent memory request.

I'm struggling to get them all to like 70 80% memory replace actual memory usage on each node is about 40% 50% actually.

No, I take that back 50 60% I do think this is good.

But I would say you know to make a general statement, I would say, don't sweat it too much.

You know unless things are going well.

Oh so.

Because I have these nodes that just reboot every week and a half randomly.

I have AI.

This is my best theory right now is that every single time.

There's no limits their memory requests are really high.

It's in the 90s every single time.

OK So I mean, because I can't find anything in the kernel logs that Cuba logs.

This is my best bet right now.

That is it's kind of why I've gone down this rabbit hole.

I'm not 100% sure if this is the reason why it's happening.

But it's my best bet right now.

Just because I don't see anything in the logs the other interesting thing just a theory.

But I'm not sure if it would help would be if you did have the autoscaler enabled.

What you will see is nodes more frequently getting replaced.

So that would automatically also impact the rebalancing of pods across the cluster.

Now that may or may not alleviate the problem.

For example, if you're using spot, like the service you would be replacing part nodes all day long every day, throughout the week.

And that would also keep the OS fresh on all those boxes and the pots continually getting rescheduled I have these ephemeral clusters.

So like some of these easy to notes or four or five days old, and then they'll reboot.

So it is.

Previously, we had clusters that were up for longer, maybe three, four months and we'll get that maintenance reboot, which is on the hardware.

But you'll see that in my cloud watch.

You'll say that you see two system health, whatever or whatever our system size check cause a reboot.

This is like a service level reboot.

Not easy to host reboot.

Any other questions related to Terraform DevOp security.

AWS I've been doing.

My good friend.

I'm going to start digging into the AWS SEC s web example a little bit on preparing a eight hour training session for there.

And perhaps the next.

Once I dig into it a little bit more.

I have questions, but I didn't see an ek as examples.

That's why I just swing straight to VCs one.

Oh, we definitely have examples of all of that stuff.

Let me show you that a full screen using this goes for pretty much all of our modules especially those that have been upgraded to Terraform zeroed out.

Well support.

Yeah, that's kind of why I'm not using my packages because I don't really want to bother with China.

And yeah.

What better way than to also start contributing to.

So like here.

Let's take an example, at the UK case cluster.

So one of things we've done is under test and source we've implemented some basic terror test examples that bring up the cluster ensure that the nodes connect.

And then tear it down when everything is done or if it fails.

So here you can see like the logic for waiting until the worker joins the cluster.

This has enabled us then to accept all requests much faster for the modules.

Now, if you go into the examples complete folder here is where we have.

This is not saying this is a reference architecture is a minimally viable example of how to get the UK s cluster up.

That should be sufficient for your class.

So as you know, our modules are very composed tables.

So then when you first bring up the BPC with provisioning subnets then we appreciate the workers and the UK cluster and then the this worker node pool is then connected to this cluster of a suite.

I'll take a look into this.

It's I'm not sure I want to overwhelm my students with all of that at the same time.

But I definitely want to reference if you are interested in using Cooper that is the kind of go down.

Yeah communities.

A lot of complexity there.

And I mean now as it relates to yes, I think they are easy yes implementation is more on the advanced side of how to do.

Yes and isn't it like the Hello World example of this.

Yes So just keep that in mind as well.

I know I'm kind of unsettling our module there.

But the point is that our modules are designed for advanced use cases and have a lot of degrees of variation configuration.

Yeah, I agree.

It can be tough.

I learned yesterday a module.

But I end up writing my own after that, especially like with the test definition that you have.

It's kind of funky.

I hope you do you guys do it, you do a module.

So it might be better to just let them know that this is a complaint.

I have about using ECM s in general, especially with CI/CD.

The fact that you need to use a heavy handed tool like Terraform to deploy a new task.

It's like the wrong tool for the job.

Yeah, I had quite an experience.

Yeah, I work with CEOs deploy, which is a Python script for the front, and it just like don't deactivate your old tasks.

So Terraform farm sells things that they are active in and won't try to modify them.

So yeah, the trick I did for OCD is not ideal.

But it works.

And this is why I fight Yeah.

So this is one of the complaints I have with these.

Yes And Terraform it is that deployment part in this is this would be true if you were also choosing to use Terraform to do your deployments to Cuba natives, but many of us do not use Terraform for those deployments and therefore, we don't have that same challenge.

And the thing is, it could be quite easily solved by Terraform.

The only issue they do is basically, if there is an inactive instead of taking the latest task definition, they just stick to the one they have.

And if it's inactive then consider it dead instead of like looking up inside all of that as they finish an on line and then just taking the latest one.

So I think that's a sharper strategy strategies that it could take that would be nice.

Yeah, I appreciate yells help on that.

And definitely will be checking back in once and start digging into a mine.

Yeah, feel free to ask away in the Terraform general, if you blocked.

I'm sure our Savior Andre will help out.

So then the vacation time I got into an argument with some friends that needed some help.

This community.

This cluster and they wanted to scale their services.

Some tons of close stuff and based on latency and I was like straight extending a latency something that I've never done.

And would never do.

And then we I. Why is this a bad thing.

And I just wanted to know how do you scale.

Like imagine provisioning and HP.

How How do you scale on which kind of metrics.

I guess superior memory is the most common one.

But I think latency if you have it don't seem at to the database.

Exactly I mean, that's why I was like, no, not good.

But they were pretty light persistent on it.

I was like, that's something for an alert but not for getting your pompous Yeah, I just would like to you input because I mean, I haven't.

Well, I guess they are going to open and we decided to play something out.

But I mean, it's kind of like, oh, did you guys like that.

I would just like to know.

First of all, the arguments you have during the holidays are very different than the arguments I have during holidays no politics none of them different kinds of family discussions.

Now you brought up the great point right.

Like in an academic setting like devoid of reality and all the other things that can go wrong scaling a latency seems like a great idea, except for in reality, it's usually because of all these upstream problems that you have no control over.

And like it's the data we a down your latency goes up.

So you scale out more workers, which slammed the database even more as it slows down even more.

And then you have a catastrophic failure.

Everything falls down.

So I think that when you scale or how you scale it.

This is a cop out.

I hate to say it.

It depends a lot on what you're building UK sees a TensorFlow right.

Let's see.

I thought that kicks of memory allocated to them.

I think it was for use.

So it must pretty hefty parts.

Something's wrong.

So let's see here if I remember this correctly.

I was a year and a half ago or so.

We actually we did a project with Caltech and using cops and EPA and TensorFlow and a have a reddish q with all the tasks in there leave or the images or something that they needed to process and they then rotate custom HIPAA as autoscaler there's a grid.

There are a lot of examples on how to do a custom setup for that.

And so they wrote a custom one that just looks at the size of the Redis q and then scales the pods based on that.

Yeah OK.

This one's more like a traditional application thing basically sending data.

Why a post.

And then it was doing language processing while also saves on a lot of lost in processing then a lot of traffic they had like 50,000 requests per second.

I don't know what it is.

Yeah, I don't know what aligning.

I didn't ask.

But it's like, yeah, the spots where image processing that's what the language processing and the eta plus I had 140 notes or something.

All 64 kicks a friend was kind of crazy.

But yeah.

So yeah, just like that.

Yeah, it is.

I was like, I was really interested in it.

But it's like, yeah, my counterpart was a scientist.

And he was like area using two to are aside each and doesn't have time.

I want to have.

I was like, OK, give me a contract.

I will sign and give you money.

And I will look at it.

Wrote quite well.

But it's like, yeah.

Is the argument I had.

But it was kind of like that.

Yeah Yeah.

I think at the scale that they're doing this example kind of isn't going to be helpful.

But for what it's worth here is that auto scaler and that stuff is all defined in this kiosk.

I'll share it in general office hours.

Jim OK.

Any other interesting questions, guys.

I have a question like him regarding communities like if you have servers that require direct mapping.

I have an ambassador in front of it.

So I was wondering like, oh easy or hard it is to get the raw socket there for a couple thuds and then doodle bunting.

You should read these suckers.

Let's see if I just to say to my own words you want to, you want a dedicated IP first service.

I don't really care about it.

Nothing I didn't understand lol.

Basically I have like a couple servers that are only raw socket connections.

Yeah I'm trying to root them using ambassador while location really matter.

But like something in front of it.

So they can act like a unique IP with a different public port and interest not due to the internal port gotcha.

I don't have any feedback on that one.

Somebody else want to chime in.

Maybe with the ambassador experience or alternative energy mix.

Or unbought or what.

Sorry Yeah.

Hey Eric Garner's going to question anyway.

Do you play with kids from Windows.

Does work in those games.

Mp one that I'm about to get I'm working on that.

You're a brave man coming here like that.

I'm kidding.

I know I saw you created the channel windows.

It is sweet ops.

Thank you for that.

I guess it's about time.

This is not the Yeah, it's not the same OS used to be.

But I can't help with that someone else you're beautiful.

Just about anybody stuck in it.

But what I'm asking.

Yeah, I'm dreading the day I will have to do that.

It's coming though.

It's in the future.

Yeah well I'm driving.

So what is it is what it is.

The last time I had to do windows knows I used as Azure Service Fabric and it sucked.

Well, let's see here.

So Andrews as you're on the call and you'd asked a week or so ago or two weeks ago over the holidays kind of like how do we do these chat notifications actually.

Perfect segue way because Carlos was the one who recently created this channel here.

So how do we set up this automatic notification when new channels come up.

So it's a pretty quick and easy demo.

Basically here I have a link here somewhere.

So do many of you guys use active today, or savior.

I never say I use it for our CNN integration with one of our products.

My mouse has.

So yeah.

This is just a deep link into our accounts.

I don't have to search around for it.

This is the app that we have that handles that.

So basically, with the zip here you can have all these different integrations to the products you use Gmail slack MySQL plus grass GitHub et cetera, et cetera.

And in this case, the application is when a new channel is discovered via web hook integration, then it will.

Here's an example.

When I set this up a year ago.

But then here's the metadata that came back and who created that channel.

So then we can just jump over to the next message, which with next step, which is to send a custom Channel message.

And what I love about it is you can really customize everything about that message and how it gets sent.

And it makes it very easy to just add fields that are available from any of that data that was passed from previous step.

And we've taken this to the extreme, you know in our account we have here in our account we handle almost 20,000 zaps every month.

And we have over 111 of these across our company.

And what we do to automate various aspects from things that happen in Gmail things that happen in QuickBooks things that happen.

And build out com all that we automate send it into our Slack channel as simple as view for what's going on with Social Security's app and a task could just be overloading.

Oh, OK.

Yeah, exactly.

Exactly I looking at the pricing and it's like, OK, the free play and you get five apps and 100 task execution.

So a zap is the is like the code is the program.

OK you get unlimited programs.

And then how many executions of that.

OK So it is another way of thinking about it, especially now that they have code steps.

So you can run JavaScript.

You can run Python code.

So it's almost like just a goofy for doing lambda and then they provide all the integrations to all these different services.

And it becomes the glue to tie everything together.

Yeah the downside since this is a ops oriented group.

There's no API.

Ironically there's no API to zap here.

It's really strange and because there's no API.

Exactly there's also no zaps to automate zappia in that way.

Which is unfortunate.

So you can't do zap erase code no Terraform module or anything.

Well, so disappointed.

What happens when you run out of your 20,000 packs or whatever.

It's a shelf.

It sucks.

They require you to upgrade to the next plan.

They don't have an overage model.

So if you.

So if by the 16th.

I max out this 1,000 zaps here.

I'm going to have to upgrade again.

Another $100 a month or whatever.

Yeah Yourself though.

So we create 1000 channels small and well I say reset.

So we create one, I want to go create an idea.

Don't do that.

Look look look.

How can we cost Eric money.

Let's do that.

OK Yeah.

That's funny.

OK How much.

It seems like that.

It seems like it seems like it would be not that hard to write to create, a lambda or something that you hooked together with a broken leg.

A true developer my friend.

Spoken like a fruit to developer.

Could you really just land.

You can with slack.

You can set up web books.

Yeah, no.

And the reality is there's a open source.

Zap your alternatives.

It's feel you know buy versus build.

Mentality it was just posted on Hacker News.

There's in good students or Huggins which h right here can get something.

Well, I see it there.

Hug Im I think it's huge.

Thank you.

The agents that monitor and act on your behalf.

Like I'm so blind here.

Agents that.

Yeah Georgie I and I just skip it.

Just go back to just this is the funny thing.

Yeah, just go back up there.

You got a different hash.

Wow Yeah.

Linus wow.

It wasn't taken back two months ago.

A woman.

Yeah So yeah, I think it's one of those things depending on what you want to accomplish that could be great.

Also a lot of the zap your integrations are actually on their GitHub.

I just realized the other day.

So they're open source as you can PR them if you need to be interesting if something like human actually worked with zap your plugins but I don't think it does.

Yeah, it's also not pretty.

Oh, it's been great.

I was like plugged in any moment that you added.

Yeah So let's hear it.

Here are the agents that it supports.

Where is the screenshot.

So I don't know how many agents.

It supports today.

I think that's the main thing.

The main consideration is like when you go to apps here literally, I think there are thousands of integrations that it has.

1,500 So it's quite a lot.

And anyone who's worked with third party guys knows they break all the time.

They upgrade they change.

So you might get it working on a weekend.

But it's going to break three weeks from now.

And then you're never going to get back to it, because now you're not on holiday break right.

OK, cool.

Thanks Or any other questions about this or something else.

Nobody using pot man yet.

No, I know they came up a few weeks or a month or so ago.

I know officers this is always a serious.

If they once kick the tires on this project.

I think it's in the incubator, a cube sphere.

If any of you guys think it's here.

So when I got this dashboard.

I want.

Yeah Yeah.

But it combines a lot of things.

And it's kind of interesting.

If you look under the hood on the architecture of what it uses.

It's pretty similar to the components that also first of all.

Here's the UI.

So you know pictures are always pretty.

So it's a nice UI of everything operating in the system.

You could say it's maybe somewhere.

It's something like an shift.

But maybe not as opinionated or as complicated.

I believe it deploys entirely on Kubernetes.

I like shift, which you have to have control over the most US, they have a demo environment you can log in and check out.

But here's the architecture.

I thought was interesting.

Let's see.

Why is not living it looks like ranger or ranger.

Yeah like, yeah, I'm looking at the I'm looking at the screenshots going to wait.

This isn't a rancher.

Yeah, looks like they're similar.

It looks like the GitHub Kamau proxy has bungled this image here is like maybe you can I. When you open up in the incognito.

So glad you're in production yet ranch though uses mostly their own in-house components, though, right.

And isn't built on.

Aside from communities, of course, which are built on, but it's a lot of in-house things versus this.

It looks like.

And I can't see what you would see here is that it uses like Jenkins under the hood, it uses Prometheus and the fun.

And I think it's had fluidity and it looked like very much the similar stack that we were using today.

I was going to deploy the not operator and I saw it.

Also, because it's on computers and then it's kind of knowing that everyone and everything is deploying its own dependencies.

A little bit.

I kind of like I it all the time.

And knows me.

I don't know where you're deploying.

What was it nuts.

It pops out alternative and operate off nets also deploys parameters and co-founder and all that stuff.

And I like my reach as operator.

What was it like.

There was one a bundle Prometheus with it.

Yeah Fox is doing the same.

It's like, yeah.

OK, I get it.

You need it, but it's like it's always included right now.

It's like so.

So from experience.

I think I have the answer.

Now to that.

But I am equally frustrated.

So we support a product called cute cost.

We have a channel 4 that's in this meetups that keep cost channel there and cost is a cost.

Like this visualization tool for Kubernetes, and your cloud at your cloud provider.

So like all sports to get and we'll show you how much that's costing in real time estimations of course.

But if two ships its own version of Prometheus and I shook my fist at them and said, why are you doing this to us you know we spent a lot like scaling Prometheus is non-trivial.

You can't just throw Prometheus and expect it to actually work in production.

You can do it for a great example and make your charts very presentable but it's not the way forward.

And then worse yet they didn't support a way to have your bring your own Prometheus so that they made it worse.

Well, we spent probably like a month on the integration effort trying to get there.

Keep costs working with our Prometheus and it just there's so many settings and there's so many assumptions and there's so many things that make everyone's Prometheus installation kind of different.

Especially, I guess if you're not maybe using Prometheus operate.

So I guess the reason why all these vendors are shipping it is because most people a lack for me is operational experience.

We know how to set it up the right way and see they got to have a successful product install in the first five minutes or people are just going to get bored and walk away.

So I guess that's why Nasa is doing that too.

And I can only assume that it's just like, that's fucked up.

I mean, it's not even using that scanning or something just like monitoring it doesn't have any use.

It's not utilized at all.

Only scaling something into just like it's there.

So I guess I guess I guess what we got to get more comfortable or not the right word.

Maybe or so.

I mean, the built in Prometheus the architecture is pretty nice right.

Prometheus can scrape other Prometheus is.

So I guess what.

We have.

So I guess really what we should be adopting is a pattern of more getting more comfortable with just scraping other Prometheus is an aggregating it rather than expecting all these third party services to use our Prometheus Yeah.

Just looking at the memory using obscene.

Can we use it open positive post, for example.

It's not much.

It's a 150 megabyte and being about.

But it's like still it's just an awesome idea Federated and died scraping basically from my Prometheus operator and it still is like one more component that you have to worry about.

And it's the same component.

And if someone tells you.

Hey, my computer since I'm working.

Which one is it.

Yeah Yeah, right.

So if you guys use titan or other like aggregator for Prometheus.

Like storage baggins.

Yeah Yeah.

Yeah, it's more like a federation also.

No, I only belch up.

I Volta as I function to actually save everything to pop search explode in search of search to have it acceptable that this came up recently also when talking with a customer and also I know Andrew Roth.

He does what we are doing and that I was actually going to chime in and give an update.

So now it's been probably two, three months operating.

Yes on the efforts for production cluster.

And it's working well.

But there were some growing pains along the way.

The two biggest problems we had was one was if you don't have the memory limit set correctly on previous overtime it will just crash and when it crashes, it means these wall files around and you can't Auto Recover.

After that, you have to exact into the container and delete as well.

But the other interesting thing we noticed from the Prometheus is.

So I forget what version it is.

But the latest version of Prometheus under Prometheus operator.

This is second hand.

One of my teams team members was relating this to me.

So forgive me if there's some loss of precision here.

Is that it will auto to discover what your memory limits should be for Prometheus when you deploy it, which is cool.

However, we noticed that when it was auto discovering those limits.

If we didn't have at least a minimum of 11 gigs available to Prometheus then it didn't matter what it set those limits to automatically it would always inch up against that hard limit and then get killed by the memory reaper on the Linux kernel.

So once we bump that up to 15 gigs it's now sitting under that limit and being respectful of everything and works.

We didn't dig deeper into why it didn't work when it was just under allocated.

All right.

Well, we have five more minutes here.

Any final thoughts or interesting news articles that you have on your agenda.

Well, we haven't talked about it yet.

There one thing.

But I don't want to I don't want to waste five minutes and rush through it.

I think you guys will dig this this using how to use CloudFlare as a Swiss army knife and kind of like we as the zappia for the same thing.

You can do a lot of nice dirty, nasty hacks with CloudFlare and workers.

And we're doing that now.

So I wanted to show how you can do content injection.

Page rewrites and proxy basically any site you want.

I did a little bit with called fertile not too long ago.

And I was thrilled with the quality of the Terraform provider for CloudFlare.

I was able to set everything up in Terraform yet.

Now I know they're investing a lot in that.

I think a couple of the people involved in that project are actually in suite ops and I know they've actually reached out to cloud posse on their Terraform provider for feedback oh cool.

Well, yeah, we are they we're on the radar one small project have seen this a couple months ago, I just posted the link in the channel.

It's a native policy management for communities.

It seems nice.

It's a company you like.

I don't remember which company, but they didn't give a presentation is seem interesting for, like you can set this policy, for example, every node needs to have limits set in your cluster or they have more complex examples.

And then it evaluates like which boards are in noncompliance with your policies and also blocks new deployments new policies.

So I thought it was cool.

Yeah, pretty simple way to express it.

Exactly Yes.

Well seals are native to communities.

Yeah star someone already tried out.

OK whatever complaints of policy agent because I wanted to do policy agent that basically trails off.

Basically, you need an approval that's only lead happening.

It's just like something you edit the just older proofs and runs through the eye stuff like that.

I wanted to try out.

Let's go down.

Yeah So some plans.

Now this is probably a topic for another office hours, but one of the things I really want to try and do with this Office Hours now in 2020 is more demos and more demos by our members.

So I know a lot of you are doing some cool things cool things that we don't have a chance to do ourselves or see.

I would like to see that.

So if you guys are working, for example, with OK or somebody kicks the tires on you know I would really like to get a demo of that.

Also maybe we can get some nice other speakers onto our person can ask them questions.

I really want to get you on here and wish you as prolific GitHub contributor to the community's ecosystem.

I thought about something cool that my team is starting to do a lot of work with some file and something that I don't think file has yet period that we miss from Terraform is that Terraform Doc's command.

Yes if there was like a hellfire Doc's command where you know where it would generate a table of all the environment variable inputs Yeah, that would be cool.

The cool thought about doing something like that.

Yeah, well, you know, at least the float the idea in the old file channel.

And let's see if there's some more interest in something like this auto documentation for help file.

Yeah, I'm having to write out the table right now because we were doing a big one for four GitLab and they're sticking point.

Yeah And we've got like 60 input variables for everything it can get unwieldy.

I agree with any.

And then the more anything, the more you pramod tries anything the harder it is to actually deploy.

We discovered that the environment's don't really work the way we expected them to and we weren't able to really use them the way we wanted them to.

If you're going to have a helm file and its purpose is to be a sub helm file meaning like I've got a helm file forget lab and the end user is going to create their own helm file and then it's going to have helm files pass you know get Cole and whatever GitLab get Cole and whatever key get go and would you know that if the sub helm file has environments in it your route helm file needs to have those environments declared and it can't have any other environments.

Yeah You have to do that.

You have to.

Exactly it's kind of an artifact of how file is implemented and the path of least resistance.

But yeah, those environments need to be declared if they exist in all hell and files where they get included.

Yeah, so we ended up scrapping that idea and just doing it because what we wanted to.

It was you know we talked earlier about how I don't I don't like how helm charts default to you know insecure by default.

I want secure by default.

So our home file is going to be the opinionated secure way that we want you to deploy helm file by default.

And then there is an environment variable called local mode or whatever that you can turn on that sets all the other variables period for you into the insecure mode.

So instead of having to settle kinds of stuff know by reading documentation and going have to do here.

Yeah run on my fucking laptop just do like local mode equals true.

And then running your laptop.

Yep So we were originally going to do like the default environment was production.

And then, of environment called local but that didn't end up working out.

I be happy.

If you want to share a little bit of that next call or in a future one I'd be happy to look over them and see if there's another way of doing it.

Or some feedback based on our experience.

Sure All right, everyone looks like we've reached the end of the hour.

That about wraps things up.

Thanks again for sharing.

I always learn so much from you guys on these calls.

A recording of this call will be posted in the office hours channel.

See you guys next week.

Same place same time.

Have a nice one.

But if you.

SweetOps Newsletter – Issue #3


Job Board

We've launched our jobs site to keep track of new opportunities posted by companies in our community. Have a job you want to post? Ping @erik on slack.

Check it out here: 👉

“Office Hours” Pod Cast

We've launched a podcast, which is a syndicated version of our weekly “Office Hours”. Tune in on the train or in the car!

Subscribe here: 👉

Slack Archive Search

You can now search

Slack Archive Search

We've added search functionality to our slack archives with Algolia. Now you can easily find solutions to common bugs or problems encountered by our community.

Kubernetes News

A lot of interesting links were shared related to the Kubernetes ecosystem. Here are some that stood out!

A Practical Guide to Setting Kubernetes Requests and Limits

Setting Kubernetes requests and limits effectively has a major impact on application performance, the stability of applications and the cluster, as well as cost. And yet working with many teams over the past year has shown us that determining the right values for these parameters is hard or worse yet teams don't even set them! Kubecost created this short guide to help teams more accurately set Kubernetes requests and limits for their applications.

Excellent Resource to Learn Kubernetes
Excellent Resource to Learn

If you want to learn more about Kubernetes, this is a phenomenal resource for learning the basics. For example, here's how to increase kubectl productivity when looking up resources, customizing the columns in the output format, switching between clusters and namespaces with ease, using auto-generated aliases, and extending kubectl with plugins!

CNCF Cloud Native Interactive Landscape
CNCF Cloud Native Interactive

There's an overwhelming number of cloud native technologies and open source tools. How do they all fit in? Here's a site that let's you filter and sort by GitHub stars, funding, commits, contributors, hq location, and tweets.

Terraform News

We've been hard at work upgrading all of our Terraform Modules to support Terraform 0.12 (HCL2). As part of this, we've added automated tests using a combination of bats and terratest. We publish all of our hundreds of Terraform modules in the official HashiCorp registry.


Terraform module to provision an EKS Fargate Profile to spin up a node group that run pods entirely on Fargate


Terraform module to provision a fully managed AWS EKS Node Group using terraform. Plus it works with all of our other EKS node pool modules so it's possible to run multiple different types of node pools in the same EKS cluster.


We've updated our Terraform modules for provisioning an EKS cluster to support Terraform 0.12. As part of this, we've added terratest to run automated infrastructure tests. Check it out here!

Want more news? Check out our Slack archives to learn what our community is all about.


SpotOn Senior DevOps Engineer
SpotOn Senior DevOps

SpotOn is a cutting-edge software company dedicated to redefining the merchant services industry. SpotOn, is hiring in either Krakow, Poland or Mexico City, MX for a seasoned DevOps engineer. BTW, they use TONS of Cloud Posse Terraform modules and Codefresh! =)

SettleMint Experienced SRE
SettleMint Experienced

SettleMint makes blockchain technology deployment, development and integration practical for the enterprise. Settlement is looking for someone in Belgium for a full on helm, kubernetes, helmfile gig (fulltime) in the Blockchain space.

What is a DevOps Accelerator?

Curious about what well built infrastructure looks like? Check out what we do at Cloud Posse.

Find out if we can help you by taking our quiz.

Free Weekly “Office Hours” with Cloud Posse

You are invited to our weekly Zoom meetings every Wednesday at 11:30 am PST (GMT-8). Join us to talk shop! This is an informal gathering where you get to ask questions and watch demos.

Register here: 👉

After registering, you will receive a confirmation email containing information about joining the meeting.