Public “Office Hours” (2020-02-12)

Erik Osterman

Office Hours


Here's the recording from our DevOps “Office Hours” session on 2020-02-12.

We hold public “Office Hours” every Wednesday at 11:30am PST to answer questions on all things DevOps/Terraform/Kubernetes/CICD related.

These “lunch & learn” style sessions are totally free and really just an opportunity to talk shop, ask questions and get answers.

Register here:

Basically, these sessions are an opportunity to get a free weekly consultation with Cloud Posse where you can literally “ask me anything” (AMA). Since we're all engineers, this also helps us better understand the challenges our users have so we can better focus on solving the real problems you have and address the problems/gaps in our tools.

Machine Generated Transcript

Let's get the show started.

Welcome to Office Hours. It's February 12th, 2020.

My name is Erik Osterman and I'll be leading the conversation.

I'm the CEO and founder of Cloud Posse.

We are a DevOps accelerator.

We help startups own their infrastructure in record time by building it for you and then showing you the ropes.

For those of you new to the call the format is very informal.

My goal is to get your questions answered.

So feel free to unmute yourself at any time if you want to jump in and participate. If you're tuning in from our podcast or YouTube channel, you can register for these live and interactive sessions just by going to cloudposse.com/office-hours. Again, that's cloudposse.com/office-hours.

We host these calls every week. We'll automatically post a recording of this session to the office hours channel, as well as follow up with an email.

So you can share with your team.

If you want to share something in private, just ask and we can temporarily suspend the recording.

With that said, let's kick things off.

So here are some talking points that we can cover to get the conversation going.

Obviously, I first want to cover any of your questions.

So some things that came across or came up in the past week since we had our last call.

Terraform Cloud now supports triggers across workspaces.

John just shared that this morning.

I'll talk about that a little bit.

The new AWS CLI is available with no more Python dependencies.

However, I'm still not celebrating it entirely based on my initial review.

Also, there was a really wise quote posted in our community yesterday, something like: if you can't commit to the overhead required to run something, you're introducing a weakness into the system rather than a strength, as they'll quickly end up in the critical path.

That was the way that Chris Fowles put it, and I want to talk about that some more.

See what reactions we get.

But before we go into those things.

Let's see what questions you guys have.

I have one thing: when you're going through Terraform Cloud, can you also go through your general experiences with it?

We were looking at using it earlier this week and just had a little bit of pain doing so.

Yeah, just some general experience that would be useful.

I can give you some kind of like armchair review of Terraform cloud.

We are not using it in production as part of any customer engagements we've done our own prototypes and pieces.

So I think the best thing would be when we get to that point if the other people on the call that are actually doing it day to day.

I know John Bland has been doing a lot with Terraform Cloud.

Let me ping him on SweetOps.

Let's see if he can join and share some of his experiences.

Do you guys know if you can continue using remote state with S3 with Terraform Cloud?

I couldn't figure out how to do that.

Well, you should be able to let me explain.

Mm-hmm It's a good question.

But I'm not 100 percent sure what you'd put into it.

So yeah.

So I Yeah, I cannot speak from experience trying to do it.

What were your problems when you tried.

I mean, I assume you had the AWS credentials and everything hardcoded.

And if you had that provider block or that backend set up, was it erroring, or does it validate that you have a Terraform workspace backend?

I personally couldn't find the place where you can even put the creds in Terraform Cloud.

That would be in the environment settings.

So you'd end up using it as an environment variable.

Exactly. You have to do that for every single workspace, as awkward as that is.

Exactly we don't like that either.

No awesome.

John's joining us right now.

So John has spent a lot of time with Terraform cloud.

So he can probably answer some of these questions or you and Mark.

Welcome howdy.

Mark, have you gotten to play with Terraform Cloud at all yet?

No, I haven't even browsed the docs.

OK Just curious.

But Brian, you've been dabbling with Terraform Cloud a little bit?

Yeah, just checking it out.

It was because I was working on Terraform provisioning of my EFS, hoping to kill two birds with one stone.

Yeah, I dabbled with it and didn't love it.

So I'm probably just going to do my provisioning on Codefresh.

Yeah, it's a little bit more intuitive for me, especially because I use Terraform CLI workspaces.

Yeah, it'll be a lot easier for me to implement something that's DRY and reusable if I just do, like, a Codefresh step that already does the right workspace commands for me.

Yeah, I'm hoping that maybe in a couple of weeks or a few weeks, we can do a revised Codefresh Terraform demo on this.

We did one about a year ago or more.

But this time Jeremy on my team is working on it.

And we want to kind of recreate some of the constructs of Terraform Cloud, but inside of Codefresh.

So that it integrates more with all the other pipelines and workflows we have there.

On the topic of Terraform, I wanted to ask something.

So, I go back and forth on my decision to use Terraform workspaces.

I love the fact that it was so easy to use the same configuration code for so many different environments, and I've been able to take advantage of that.

What I didn't love was having to kind of hack together a way to get all the backends to point to different S3 buckets and different AWS accounts.

I'm curious if anybody's ever worked with that.

Plus, if they ever switched off of it to go the other route, where we kind of have configuration per AWS account, which might be less DRY.

Such as using Terragrunt.

OK, let's.

Yeah OK, let's table that temporarily.

I see John just joined us here.

Let's start with the first question there on first hand accounts and experiences using Terraform cloud.

I know John has spent a lot of time with it.

So I'd love for you guys to hear from him.

He's also a great speaker.

Well, thank you.

I've seen the check in the mail.

I've actually done a lot with Terraform Cloud as the primary CI for all of our Terraform, and generally I like it, mainly because it's a little malleable: you can use it for the whole CI aspect, or you can run your CI, I mean your Terraform, in Codefresh or anywhere.

And it's just your backend.

And that's it.

Instead of having S3 buckets everywhere, all your state is just stored there.

It's really easy to do remote data things, and terraforming it is really easy to do.

They have a provider that gives you full access.

And I did see on the agenda there, in the talking points, the workspace run triggers. I actually did a video on that already.

Well, yeah, except now.

Yeah, I wanted to just play with it, so I recorded it, and it's decent.

I think they have some improvements to make so you can visualize it.

But we actually do utilize, I forget who was speaking, Brian I think, multiple AWS environments, and from our Terraform scripts where we set it up, we tell each workspace which environment it's going to use.

Now, this is using an access key and secret key. Preferably we'd have something a little cleaner and a lot more secure than just having stale access keys sitting around.

So that's one gripe I do have with it.

But in general, we've had a lot of success running it in Terraform Cloud.

OK, so you're leveraging, I don't want to say a feature of Terraform Cloud, but the best practice in Terraform Cloud of using workspaces, or using lots of workspaces in Terraform. How has that been?

Because while workspaces has existed for some time it wasn't previously recommended as a pattern for separating you know multiple stages like production versus dev.

How's that working out.

It actually worked out really well, because locally where you set it up, you can set up locally.

Sure Yeah.

So I mean, the reflection here once again.

There you go.

There is a difference between Terraform Cloud workspaces and Terraform CLI workspaces.

Yeah, sure.

So this is just my little sample account that I play around with for the tutorials.

But if this was set up locally in the CLI, because this prefix is the same, I can set my prefix locally and my local workspaces would be called integer and separator.

And so locally.

It maps directly to the local CLI actually.

So I can say terraform workspace select integer.

And now I'm on that.

And I can run a plan, and it'll run that plan right on Terraform Cloud.

I don't have to have variables or anything locally.

It'll run everything in there unless it's set up as a local because you have multiple settings here.

Would you be able to do that right now.

Sorry to put you on the spot.

This is exactly what we're trying to do.

And if you're saying that it's actually much easier than I initially thought then I might reinvestigate this.

Let's try it.

But thanks for roll.

It's lights up for everyone else.

Maybe if you just joined.

John has been put on the spot here to do an unplanned demo of Terraform Cloud and workspaces.

Possibly even the new beta triggers function.

But on a different sort as it's set up there and working.

I can definitely walk through the triggers.

The triggers would be especially useful for us as well, because there are scenarios where we run multiple Terraform applies in series.

John, do you want to take like five minutes, and then we can talk about some other things and come right back to this?

Yeah let me get a few of my things to worry about just the connection and all that sort of stuff.

Yeah Cool.

Let's do that.

And we'll just keep the conversation going on, other things.

See what we can talk about there.

All right.

Any other questions.

All right.

So I guess I'm going to skip the Terraform cloud talking point about triggers across workspaces.

I think that's going to be really awesome to get a demo.

Basically, the setup is this: as you decompose your terraliths into multiple projects, how do you still get that same experience, where applying changes in one environment can trigger changes in another environment?

And that's what these triggers are now for.

Moving on: AWS announced this week that there's a new CLI available.

I'm not sure how new it is per se, but they are providing a binary release of this CLI.

I suppose it's still probably in Python and they're just compiling it to bytecode, maybe.

The downside, from when I was looking at it, is that it's not just a single binary you can go download somewhere; there's still an AWS CLI installer.

So they're following the Java pattern, right, where you've still got to download zips and install stuff.

Personally, I've gotten so spoiled by Go projects, which distribute a single self-contained binary.

And I just download that from the GitHub releases page.

And I'm set to go.

So has anybody given this new CLI a shot?

Now you mark calling you out.

All right.

Don't buy it yet.

Cool. And then there was one other thing that came up this week. Somebody was asking, I think the background question was about alternatives to running Vault and whether it's worth it to run Vault, and Chris Fowles responded quite succinctly, so thank you.

We've heard this said before, but I thought this was a really succinct way of putting it.

And that's: if you can't commit to the overhead required to run some new technology like Kubernetes, Vault, or Consul, you're introducing a weakness into the system rather than a strength, as they'll quickly end up in the critical path of your operations.

And I think this really resonated with me, especially since we run this DevOps accelerator and our whole premise is that we want our customers to run in and take ownership of their infrastructure.

But if they don't understand what they have, and what they're running, then it is a massive liability at the same time, which is why we only work with customers ultimately that have some in-house expertise to take over this stuff.

And also, yeah, getting some thumbs up here from Alex Siegman and Brian, both agreeing with this statement.

Yeah, that was actually a response to something I mentioned.

So the original person asking a question that came about came at then and also get.

From what I understand.

And I just thought that maybe I'd remind them, you know, maybe you want to just give centralized management a shot. Not that I want to sound like I'm pushing Vault that much.

But the reality is, if you look at HashiCorp, who created Terraform and created Vault, they make most of their money from Vault.

They really do put a lot of product hours into widening the feature set of that product.

So really, it's a mature solution.

Yeah, 100 percent. When it comes to Chris's response, it's very true.

What actually happens a lot of the time, if you don't commit, is that people take the root token and distribute it to everybody, and it becomes more of a security hole than a security feature, really.

Yeah. And really, it's reminiscent of Kubernetes as well.

In my opinion, you really need a large team of people putting energy into that to actually make full use of it.

So it's not a burden anymore.

It's actually something that can help you pick up velocity.

Exactly. You want to take these things on when they give you a competitive advantage for your business or the problems you're trying to solve.

Not just because it's a cool toy or sounds interesting. But yeah, that's a really good summary.

Thank you for it.

For a peer.

For just secrets management, it probably doesn't have all the bells and whistles of Vault, obviously.

But I went with AWS Secrets Manager.

Or you can use Parameter Store, which is much easier to maintain.

Yeah. Are you making copious use of the Lambdas as well with Secrets Manager to do automatic rotations?

Not yet, but it's definitely something that I wish I had time for.

Yeah, because it also requires application changes, right? All right, John, do you need some more time?

No, ready.

All right.

Awesome Let's get the show started.

So this is going to be a Terraform Cloud demo and possibly a demo of triggers across workspaces, which is a beta feature in Terraform Cloud.

So this isn't going to actually give me a plan, because I don't have the actual code for these repos locally on this computer.

But this is the time to show how the workspaces actually work.

So essentially, you set up your backend as remote. Then the hostname.

You don't really have to have it.

That's the default. Then the organization, and in this case, I'm setting a prefix.

So if I actually change this to, say, name set to random.

And if I did an init on this.

It no.

Yes, I need to.

Yes, there because I already initialize.

That's why so by setting the name essentially, it's supposed to.

What did I miss.

It worked just a minute ago, I promise.

It's always this way.

Let me clean up this dirt one.

But by setting a name, and I kind of found this by accident.

I didn't mean for it to do it.

But there you go: it'll actually create the workspace for you.

So you technically don't have to do anything to create a workspace.

It'll do it for you.

In that case, it doesn't give all the configuration there.

But if you utilize prefix here instead of name just wipe it.

What it does is basically go to Terraform Cloud and say, hey, everything with this as a prefix is going to be my workspaces.

So in this case, I can say, let me select the integer workspace.

That's awesome.

And so if I do a workspace list, you'll see that same list there.

And then you can select separator, or any one of those.
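The backend setup John describes here could look roughly like the following. This is a minimal sketch, not the configuration from the demo: the organization name `example-org` and the prefix `random-` are assumptions chosen to match the workspace names mentioned in the session (integer, separator).

```hcl
terraform {
  backend "remote" {
    # Optional: app.terraform.io is the default hostname for Terraform Cloud.
    hostname     = "app.terraform.io"
    organization = "example-org" # hypothetical organization name

    workspaces {
      # Every Terraform Cloud workspace whose name starts with this prefix
      # appears locally with the prefix stripped, so "random-integer"
      # is selected locally as "integer".
      # Use `name = "..."` instead of `prefix` to pin a single workspace.
      prefix = "random-"
    }
  }
}
```

With a block like this in place, running terraform init, then terraform workspace list and terraform workspace select integer, should map the local CLI workspaces onto the remote ones, and terraform plan then runs remotely and streams its output back to the terminal.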

You can also.

And I have an alias for terraform, by the way.

So that's why I'm just typing tf.

You can also get your state.

Of course, if you have access to that.

So we can say show we can pull that locally.

And so it'll output the actual state here.

And then my favorite part is actually planning.

So I don't have any variables or anything.

Mind you this workspace doesn't have a lot anyway.

But it's actually running this plan on Terraform cloud.

It's piping the same output.

It's common just like you would normally expect.

So it's piping everything to my local CLI here.

But for console.

But it's basically this.

So you can see the output matches.

But the beauty of this part is I can have all of my variables in here completely hidden.

Any secrets that I want and none of my developers ever see them.

They never know that they exist or anything locally, but they can plan all day and do whatever they want.

And so this is destroying because I don't have the code.

So it's like, well, it's gone.

This random resource integer, but that's the quick run through of utilizing those workspaces locally.

I mean, it's really just this.

And I have a tee up bar set with my token locally.

So that's how I have that working.

Can you also go back to the Terraform Cloud UI, where we had the settings?

The variables, just because it came up a little while ago.

Brian was asking about environment variables. You see them there at the bottom.

Brian? Yeah.

If you need AWS credentials, you can stick them there.

OK. You're not using the AWS provider right now.

No, I'm not going to put them off.

Yeah, no.

OK. And nothing precludes you from using the AWS provider, so long as you still provide the credentials.

Yeah. Yeah.

So, you know, your random workspace that got created: do you manually go in and add the AWS creds for those?

I actually Terraform the entire thing.

So that's all done through Terraform so basically terraforming Terraform cloud.

So essentially I'll generate a workspace, and that'll have my general settings there, and then I'll just add the last two or three variables.

And you can do environment variables this way too.
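As a rough illustration of "terraforming Terraform Cloud", the `tfe` provider can manage a workspace and its environment variables. This is only a sketch under assumptions: the workspace and organization names are hypothetical, the credentials are passed in as input variables purely for illustration, and the exact attributes accepted can vary between provider versions.

```hcl
provider "tfe" {
  # The API token is read from the TFE_TOKEN environment variable
  # when it is not set explicitly here.
}

# Hypothetical inputs; in practice these would come from a secret store.
variable "aws_access_key_id" {
  type = string
}

variable "aws_secret_access_key" {
  type = string
}

resource "tfe_workspace" "integer" {
  name         = "random-integer" # hypothetical workspace name
  organization = "example-org"    # hypothetical organization name
  auto_apply   = false            # require a manual confirm before apply
}

# category = "env" makes these environment variables for the remote run,
# which is how the AWS provider picks up credentials in the workspace.
resource "tfe_variable" "aws_access_key_id" {
  key          = "AWS_ACCESS_KEY_ID"
  value        = var.aws_access_key_id
  category     = "env"
  workspace_id = tfe_workspace.integer.id
}

resource "tfe_variable" "aws_secret_access_key" {
  key          = "AWS_SECRET_ACCESS_KEY"
  value        = var.aws_secret_access_key
  category     = "env"
  sensitive    = true # write-only in the UI once set
  workspace_id = tfe_workspace.integer.id
}
```

Managing variables this way avoids adding credentials by hand to every workspace, which was the pain point raised earlier in the call.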

So you can kind of tie this in too with like your token refreshes and things of that sort.

Especially those I've mentioned or.

So there's the Vault provider, and you can actually tap into Vault here to get a token from AWS, or however you want to do your authentication there: get your access key, et cetera.

And then plug that into Terraform Cloud.

So that way it's all automated.

And you're not just pasting variables in there.

Do you have to run your own Vault, or do they run that for you?

No, you'd have to be running your own.

So assume someone doesn't have Vault.

How do you plumb AWS credentials or STS token generation in?

So we just use KMS.

So we just utilize KMS.

You don't want to put that stuff in code, right?

And so we use KMS to encrypt the values manually.

We built a little internal tool to do it.

But encrypt those values put them in code and then once the workspace actually runs, it'll actually create manage update all the other workspaces.

So in essence, you have one workspace that has all the references to all the other workspaces it's supposed to create, and it'll configure everything.

And so in that one, it'll decrypt it with KMS and then add it to the project, or the specific workspace, as an environment variable.
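One way to sketch the decrypt-then-inject step is with the AWS provider's `aws_kms_secrets` data source feeding a `tfe_variable`. This is an assumption about how such a setup could be wired, not the internal tool mentioned in the session; the ciphertext below is a placeholder and the workspace ID is a hypothetical input.

```hcl
variable "workspace_id" {
  type        = string
  description = "ID of the Terraform Cloud workspace to configure (hypothetical input)"
}

# Decrypts a value that was encrypted ahead of time with KMS and
# committed to the repository as base64 ciphertext.
data "aws_kms_secrets" "workspace" {
  secret {
    name    = "aws_secret_access_key"
    payload = "REPLACE_WITH_BASE64_CIPHERTEXT" # placeholder, not a real value
  }
}

# Adds the decrypted value to the target workspace as an environment variable.
resource "tfe_variable" "decrypted_secret_key" {
  key          = "AWS_SECRET_ACCESS_KEY"
  value        = data.aws_kms_secrets.workspace.plaintext["aws_secret_access_key"]
  category     = "env"
  sensitive    = true
  workspace_id = var.workspace_id
}
```

Note that the decrypted value still ends up in the state of the managing workspace, so access to that workspace needs to be restricted accordingly.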

And when you say KMS, are you using SSM? Do you use that system to store the product of the encryption, the blob?

No, we just encrypt the value with KMS.

But we actually do use SSM for Fargate secrets.

But this repo here is where I actually have a video of it, where I kind of walk through how to do the full thing, terraforming your own workspaces and then using the remote data as well to pull from.

And so, the pipeline feature that I was playing with earlier: essentially this repo, I mean this workspace, is going to trigger this one, and that's going to trigger this one.

And so the way it's set up and they definitely say do not use this in production yet.

But these run triggers here.

So you can specify all your workspaces that you want to actually trigger something here.

And so anytime they trigger oh it's a loop because I already have that one set anytime they trigger they will actually trigger the next one.
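In the demo the run triggers are configured through the workspace settings in the UI. For reference, the `tfe` provider also has a `tfe_run_trigger` resource that expresses the same relationship; the sketch below assumes two hypothetical workspace IDs as inputs, and attribute names have changed between provider versions, so check the documentation for the version in use.

```hcl
variable "integer_workspace_id" {
  type        = string
  description = "Upstream workspace whose successful apply starts the next run (hypothetical input)"
}

variable "separator_workspace_id" {
  type        = string
  description = "Downstream workspace that gets a run queued (hypothetical input)"
}

# After a successful apply in the integer workspace,
# Terraform Cloud queues a run in the separator workspace.
resource "tfe_run_trigger" "integer_to_separator" {
  workspace_id  = var.separator_workspace_id
  sourceable_id = var.integer_workspace_id
}
```

As the loop John runs into here suggests, it is worth keeping the trigger graph acyclic.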

Let me delete these real quick, and I'll just queue a run.

And so if I queue this one, it finally kicks off.

There we go.

So that's going to go through the plan.

And this is just going to generate a random integer.

It has an output and all it does is output the results of random integer.
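The code for this first workspace is not shown in the transcript, but a configuration matching the description (generate a random integer, expose it as an output) could be as small as the following. The bounds and the output name `result` are assumptions.

```hcl
# Generates a random integer within the given bounds (bounds are assumed).
resource "random_integer" "this" {
  min = 1
  max = 100
}

# Exposes the value so downstream workspaces can read it from remote state.
output "result" {
  value = random_integer.this.result
}
```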

And so once it finishes the plan is actually going to show me which or any workspaces that it will trigger next.

In this case, random separator. So if I go ahead and confirm, this one is applying. If I come over to random separator, nothing's happening here.

I haven't hit anything; I'm not pressing buttons.

This apply finished, and there's random separator, which was triggered automatically.

And so you can see like it's essentially going to go down the line there.

The good thing is it will tell you here that the run was triggered from this workspace.

And it was triggered from that run.

So you can kind of rabbit hole your way backwards into finding where and what actually triggered that one.

And so if I confirm and apply on this one, that one is actually going to trigger the last one, which is random pet.

I'll pull up the code real quick as well, so when that's finished, random pet is here.

And there's random pet running.

Quick, quick question.

Do you guys ever use it with the VCS integrations? Because I know you can actually kick off a plan and apply from GitHub, for example.

Yes, exclusively.

Yeah. And so these repos are actually tied to GitHub as well.

You do the confirm here through the UI, or there's a CLI you can tie in, and you can do it through any CI, really.

And so there's the end of the pipeline.

But as you can kind of tell, you will rabbit hole, right? Like, you're here.

And then it's like, well, was generated from here and you go back there and it's like, well, one was generated from another one.

And so then you end up having to go back there.

So a good visual tool would be really useful.

Something like Jenkins Blue Ocean or CircleCI, where it kind of shows you the path of something, would be really useful. But it's kind of interesting.

I'm sure that's coming.

Yeah. So this code, just to show this real quick, is using the remote state.

And so I set up variables manually for this demo, but utilizing these variables.

And it just uses the remote state data to get the value from the integer workspace.

And then the pet basically uses two remote data sources to get the workspace state for both the separator and the integer workspace, and then it just uses them down here in the random pet.
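The pet workspace described here could be sketched as follows. The organization name, the workspace names, and the output name `result` are assumptions; only the overall shape (two `terraform_remote_state` lookups feeding a `random_pet`) comes from the description above.

```hcl
# Reads the outputs of the integer workspace from Terraform Cloud.
data "terraform_remote_state" "integer" {
  backend = "remote"

  config = {
    organization = "example-org" # hypothetical organization name
    workspaces = {
      name = "random-integer" # hypothetical workspace name
    }
  }
}

# Reads the outputs of the separator workspace from Terraform Cloud.
data "terraform_remote_state" "separator" {
  backend = "remote"

  config = {
    organization = "example-org"
    workspaces = {
      name = "random-separator" # hypothetical workspace name
    }
  }
}

# Builds a pet name whose word count and separator come from upstream state.
resource "random_pet" "this" {
  length    = data.terraform_remote_state.integer.outputs.result
  separator = data.terraform_remote_state.separator.outputs.result
}
```

Because the pet workspace only reads upstream outputs, it needs a fresh run whenever those outputs change, which is the gap the run triggers are meant to close.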

So it's decent.

I'm liking where they're going.

Yeah, I think this has some potential, especially to minimize the configuration drift and simplify the effort of ensuring that change is promoted through multiple workspaces.

And the good thing is like if you saw on separator I actually had to confirm and of course, that depends on your settings.

Of course.

Because you can tell it if you want to auto apply or manual apply. In this case, I'm set to manual.

But if it was auto, it would have just triggered all the way down the line.

And so the practicality of it is, if you separate your networking stack from your application, and you update your networking stack, and for whatever reason it needs to run the application Terraform as well, you can kind of automate that now, as opposed to running that one and then having the one person in the company who knows the order of things go in and manually hit queue on something else.

So yeah.

So I think there was some questions in the chat here.

Let's see.

Alex Siegman asks: how do you handle the chicken-and-egg problem with bootstrapping, say, an AWS account and then Terraform Enterprise to have creds and such?

Actually a good question.

So it would have to come from somewhere, right?

So like especially if you set up like AWS Organizations.

And you had your root account that you were set up with, you can utilize that root account.

And you can actually Terraform AWS Organizations, and then once that new account is set up, you can assume role and those sorts of things in order to access that other client.
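A minimal sketch of that bootstrap flow, assuming it is run with credentials for the organization's management account: the account name, email address, and region are hypothetical, and the second provider is shown with the new account ID passed in as a variable because it is normally used in a later run, once the account exists.

```hcl
# Step 1: run with credentials for the organization's management (root) account.
provider "aws" {
  region = "us-east-1" # assumed region
}

resource "aws_organizations_account" "dev" {
  name      = "dev"                 # hypothetical account name
  email     = "aws+dev@example.com" # hypothetical root email for the new account
  role_name = "OrganizationAccountAccessRole"
}

output "dev_account_id" {
  value = aws_organizations_account.dev.id
}

# Step 2 (typically a separate configuration or later run): assume the role
# that AWS Organizations created inside the new account.
variable "dev_account_id" {
  type        = string
  description = "Account ID produced by step 1"
}

provider "aws" {
  alias  = "dev"
  region = "us-east-1"

  assume_role {
    role_arn = "arn:aws:iam::${var.dev_account_id}:role/OrganizationAccountAccessRole"
  }
}
```

Resources meant for the new account would then set provider = aws.dev. The root email and password reset steps discussed next remain manual.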

I mean that other account.

But there is still some manual aspects of that right.

Like, you have to set your email address, and then that email address is your root account.

And then you want to kind of lock that down.

So you can use some service control policies and things of that sort.

But there's still a little bit of a manual piece to bootstrap a full account.

That's the part that really sucks and we go through this in every engagement right.

Because if you don't reset the password and have MFA added, anybody with access to the email of that root account, or of that sub account for that matter, can do a password reset.

And take over the account.

Yeah, exactly.

So there was a question about automated destruction.

So Terraform Cloud actually requires you to set CONFIRM_DESTROY.

So if you automate it with CONFIRM_DESTROY set to 1, then yes, you can.

You can delete from Terraform Cloud.

But you can't queue a destroy unless you actually have that environment variable set. So you can set it and have it as part of your workspace, and then a destroy will actually destroy it.
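If workspaces are managed with the `tfe` provider, the variable described here could be set as part of the workspace definition. This is a sketch based on the behavior described in the session; the workspace ID is a hypothetical input.

```hcl
variable "workspace_id" {
  type        = string
  description = "ID of the workspace that should allow destroy runs (hypothetical input)"
}

# Environment variable that, per the session, Terraform Cloud checks
# before it will queue a destroy plan for the workspace.
resource "tfe_variable" "confirm_destroy" {
  key          = "CONFIRM_DESTROY"
  value        = "1"
  category     = "env"
  workspace_id = var.workspace_id
}
```

Leaving this unset on production workspaces and only adding it to short-lived ones is one way to keep automated destruction opt-in.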

Nice cool.

But yes, that aspect of the chicken and egg is definitely something that could be cleaned up on either side, just to help the bootstrapping, especially for the clients that have like 78 AWS accounts.

Yeah which isn't as abnormal as it sounds it's one enterprise.

Yeah. Any other questions related to Terraform Cloud?

Cost, value, opinions?

Now it's way better than before.

It was rough like multiple Tens of thousands of dollars for the Enterprise version.

And so now it's actually to where you can basically sign up and utilize it now for free.

You have to keep in mind that it is still a subset of different features.

But it is really good.

And it is.

Obviously, if you're a small team and you don't have $100,000 to spend that yet.

But I figure that it is a subset.

Yep. And so, of the main things that you do miss, cost estimation is actually pretty cool.

It will tell you, if you're starting up something like a t3.micro or a large, how much that's going to cost; it'll kind of give you those pieces as an estimate.

But it kind of helps.

And you can utilize Sentinel, which is basically policy as code, and say no one can create a project if the cost is over $1,000 or whatever.

Or you can say, hey, notify somebody, or it requires approval.

And so syncing it was actually pretty useful.

And then, of course, you get the normal SAML.

And this is the private install, with SAML and clustering as you go up.

But you can.

But it's funny how everything else is unlimited, with unlimited workspaces, and on Enterprise it's not unlimited anymore, just 100 plus.

But I mean, really, this free tier of up to five users is pretty much all you really need, unless you are on a larger team that needs roles, and the roles are basically plan, read, write, and admin.

But the private registry is actually pretty cool too.

I think we, as a profession, need to push back on enterprises that try to make you pay for security. Here, SAML SSO is pretty much the only thing controlling the keys to your castle, and I don't think it's right for people to hold security as a tool for making money.

Yeah, that's like always there 1, 2 right.

Yeah, but I hope we get the industry aligned with security as a first-class citizen in all products, not just if you're willing to pay.

Yeah, there were others.

We've shared this before, the SSO tax website: go to sso.tax, and it says it all.

Yet it's funny.

It's the wall of shame and the price increases with pleasure.

Areas things I need to add terrible glad to know.

Exactly base price pressure.

So price.

It's just insane.

The gouging that goes on.

Look at HubSpot my Jesus.

63 percent increase 586 call us.

Well, you want two-factor? That's going to cost you.

Yeah Yeah.

Two factors another.

Yeah Well, I mean that comes usually with whatever you picked as you say.

Right, but doesn't Terraform Cloud offer two-factor?

It does.

Yeah So I have one set up here just normal.

I use all the networks, but then again, I'm also I have my day jobs account on theirs.

And it's paid.

So maybe that's good.

Yeah, maybe that's where it comes from maybe that's not over for.

Any other questions on Terraform Cloud?

For a small team, do you guys think $7 a month is worth it for just Sentinel?

It depends. What kind of rules do you have in place for your infrastructure?

My team is a team of one right now.

So there is no like actual rules automated but obviously being proactive about it.

Just when I'm speaking of infrastructure.

But as the team grows.

And I think we're growing our security function here too.

I think a lot of security engineers I'm talking to are doing it manually, where they go into your AWS console and check whether, you know, your S3 buckets are public.

I was saying we could automate this with Sentinel.

I was curious: if I have, say, a team of five, is $7 a month worth it if you have those sorts of rules in place?

Yes. Yeah.

This is basically pre-emptive: you can choose to block, or you can just warn.

And so in this case, it was essentially a function and they're adding to this resources and they end up pulling it back.

Right And so you can basically take those things.

And as you validate them you can give specific messages that you want.

And basically say yay or nay if it's approved and it'll basically block the run.

So it is a good way to catch it ahead of time.

And you can catch some of those things.

Another thing that you can do as a team of security.

We talked about Open Policy Agent integration with Terraform, which can also do some of this stuff, and someone else recommended conftest, which is built on top of OPA and adds support for HCL and Terraform plans as well.

Yeah, there's a little library that's like a linter as well, tfsec, this one.

And it's pretty decent too.

And it can catch things like S3 buckets, or HTTP where it should be HTTPS.

Yes And it also provides a way here to where you can take these rules and you can actually ignore it for like a specific line like if you want an ingress here and you don't care about this.

This rule here.

It's just it's a requirement.

You have to have it mean you can ignore it.

But you can tie this directly in with CI, or you can just run it locally with Docker, so I would probably start there as opposed to going to Sentinel, because then you have to write the Sentinel policy and you need to manage that.

And then you assume a lot of that risk at that point too; you know, you have to develop all those opinions on what you need.

Right, well, that makes a lot of sense when you have SecOps that focus on that; if you're a team of one, it suddenly just adds to your plate.

Can you share the GitHub link for tfsec in the office hours channel?

And let's see.

So we got 15 minutes left or so 15.

There were some other questions here unrelated to Terraform Cloud; I want to see if we can get to those.

Alex, do you still want to talk about your Prometheus question?

See I can't.

He is chatting in the Zoom chat.

Looks like Zac helped you out with the answer a little bit.

I guess I'll just read it for everyone else's benefit.

How do you know.

Let's see.

Assuming you have Prometheus already running with the Prometheus Operator, and you run kubectl get prometheus --all-namespaces, you'd set up a ServiceMonitor.

Oh, this is from Zach.

I did not.

Yeah, I gave him like a 1,000-foot view of an answer for how to set up a ServiceMonitor in Prometheus for a custom service running in a cluster.

I wouldn't have answered so quickly if I weren't doing that at the exact same moment.
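To make the ServiceMonitor idea concrete, here is a small Python sketch that builds the manifest as a plain dictionary and prints it as JSON (which `kubectl apply -f -` accepts). The service name, namespace, label, and port name are placeholders invented for the example; the `monitoring.coreos.com/v1` fields are the ones the Prometheus Operator defines.

```python
import json
from typing import Dict


def service_monitor(name: str, namespace: str, app_label: str,
                    port_name: str, interval: str = "30s") -> Dict:
    """Build a ServiceMonitor that scrapes Services labeled app=<app_label>."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # The Prometheus resource's serviceMonitorSelector must match
            # these labels, or the monitor is silently ignored.
            "labels": {"app": app_label},
        },
        "spec": {
            # Selects the Kubernetes Service objects to scrape.
            "selector": {"matchLabels": {"app": app_label}},
            # "port" is the *name* of a port on the Service, not a number.
            "endpoints": [
                {"port": port_name, "path": "/metrics", "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical custom service exposing metrics on a port named "metrics".
    print(json.dumps(service_monitor("my-api", "default", "my-api", "metrics"),
                     indent=2))
```

Checking `kubectl get prometheus --all-namespaces` first, as suggested in the chat answer, tells you which selector labels the running Prometheus instance expects.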

That's cool.

Yeah, maybe we can.

The essence, you don't have to make today.

Let's punt on the question to next week if you're able to join us, and we can talk more about service monitor stuff.

I'm curious if anyone else is using anything other than custom rules, you know, if there's any other tooling out there for service monitoring, or, when people have multiple teams and multiple microservices, if there are any organizational strategies around tooling this in a declarative manner.

I can answer how we do it.

But I'm interested first, before I talk, in what other people are doing.

We've talked about just monitoring the individual services oh Yeah.

Just Prometheus right.

Hangs in multiple services and you know there could be a thumb roll.

You know that some of them come and go ensuring that generic monitoring gets put in place and teams that they want to put extra and additional monitoring and you know, for items you know that those are also able to be deployed.

Yeah I'm just really struggling with the getting a good template going I thought.

Yeah Are you using helm your.

I am.

Yeah then sorry I missed my computer's not responsive here.

So then yeah, I can kind of show you, because this came up recently, for example, with Sentry.

That's the good example, my ship is going to invest one but I'll show it.

So we've talked about in the past that we use this chart called the monochart that we developed.

Zach, are you familiar with the monochart?

Dude, I am so familiar with the monochart.

OK created my own version of it.

So yeah.

Well, thank you.

Yeah So the pattern there that we have.

And then are you familiar with the service monitors, like the Prometheus bindings that we have in the monochart?

You know, I probably should go revisit that.

Honestly, I haven't looked at it.

OK So I will give an example of that here and I'm getting it cued up in my browser.

So let me rephrase the question or let me rephrase.

But let me restate the question and add some additional context.

So in my own words, I think what you're describing is how do you offload the burden of how an application is monitored to the application developers themselves, or at least to the teams responsible for that service.

In the old-school model it would kind of be like: you deploy your services, and then you throw it over the fence to ops and say, hey, it was deployed.

Update Nagios or some archaic system like that for monitoring.

And that never worked well.

And it's like very much like this data center mentality static infrastructure.

And then you have a different model, which is kind of like Datadog, where it will maybe auto-discover some of the things running there and figure out a strategy for monitoring them, which is magical, but it isn't very scalable, right?

Magic doesn't scale.

So you want something that allows configuration, but also doesn't bottleneck on some team to roll that stuff out.

So this is why I think the Prometheus Operator is pretty rad.

Because you can deploy your monitoring configuration alongside your apps themselves.

So this just came up, kind of like what you said, Zack, about how you were just actively working on this other problem that Alex has.

That's why it was fresh in your memory.

So this is something that we did yesterday, actually.

So we run Sentry on-prem.

We've had some issues lately with Sentry where it stops ingesting new events while everything seems totally normal.

So it's passing health checks.

Everything's running, everything's green and hunky-dory.

But we wanted to catch the situation where it stopped processing events.

So at the bottom here we're using the monochart; you don't have to create a separate release for this technique, but we're doing that here.

What we do is then we add the Prometheus rules.

So we can monitor the rate, or the delta here, of jobs started over a five-minute period, and check whether it is zero.

So that's when we alert k minus 1.

My point here.

So let's see, are we using monochart to deploy this?

Do you have something that keeps a baseline level of jobs starting in a busy cluster, a busy environment?

So like is this generalizable no.

But in this environment.

So here's the thing.

Oh, I think it's generalizable because you could make that Cron job you know it does.

So in our case, we have sentry-kubernetes deployed.

So we have a pod inside the cluster that is ingesting all of the events from the Kubernetes API and sending those to Sentry.

So you could say that, just by having that installed, we have our own event generator, because Kubernetes is always doing something, right?

So when we ran this query, we identified the two times over the past month that we had the outage.

So we deployed it, and went live with that.

But I just want it.

So this monochart, though, is this pattern where you can define one chart that describes the interface of a microservice in your organization.

This happens to be ours that we use in customer engagements.

But you can fork it, or you can create your own that does the same kind of thing.

And let me go over to our charts here and see if I can find a different example that we have.

So here's an example, like a simple example, of deploying an NGINX container using our monochart.

And the idea is: what does everything you deploy to Kubernetes need?

Well, it needs.

Well, OK, if you're pulling private images you're going to need maybe possibly you'll need a secret.

So we define a way of doing pull secrets.

Everything is going to need an image.

So we define a way of specifying the image.

Most things are going to need config maps.

We define a simple way of having consistent config maps and all of these things are a lot more terse than writing the raw Kubernetes resources.
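The "one terse interface, many raw resources" idea can be sketched in Python as follows. This is not the monochart's actual values schema; the `values` keys (`name`, `image`, `pull_secret`, `config`) and the `render` function are hypothetical, chosen only to show how a few lines of input expand into a ConfigMap and a Deployment.

```python
from typing import Dict, List


def render(values: Dict) -> List[Dict]:
    """Expand a terse service description into raw Kubernetes manifests."""
    name = values["name"]
    labels = {"app": name}
    manifests = []

    container = {"name": name, "image": values["image"]}
    config = values.get("config")
    if config:
        # One consistent ConfigMap per service, exposed as env variables.
        manifests.append({
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {"name": name, "labels": labels},
            "data": dict(config),
        })
        container["envFrom"] = [{"configMapRef": {"name": name}}]

    pod_spec = {"containers": [container]}
    if values.get("pull_secret"):
        # Only needed when pulling from a private registry.
        pod_spec["imagePullSecrets"] = [{"name": values["pull_secret"]}]

    manifests.append({
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": values.get("replicas", 1),
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    })
    return manifests


if __name__ == "__main__":
    for m in render({"name": "nginx", "image": "nginx:1.17",
                     "config": {"LOG_LEVEL": "info"}}):
        print(m["kind"], m["metadata"]["name"])
```

The point of the pattern is that every team fills in the same small set of keys, and the chart owner decides once how those keys map to resources.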

But you can also then start adding other things in here; like, we provide a simple way of defining affinity rules.

So you can specify an inline affinity rule, which is very verbose like this, or you can just use one of the macros, the placeholder ones that we define here, like should-be-on-different-node.

And this is an example of how you can kind of create a library of common kinds of alerts that we deploy.

Now I'm talking.

I'm conflating two things: affinity rules with alerts.

I just happen to have an example here of affinity in helmfiles.

Do you mean to share your screen?

Oh my god.

I can do that again.

But I'm just always used to having my screen shared.

So I. Yeah So sorry.

OK So this makes it a little less handwaving then by seeing my screen here.

Here's what I had open, which was just an example of using our monochart to define the Prometheus rules to alert on Sentry.

So here it's saying Sentry jobs started minus Sentry jobs started five minutes ago.

And if that's zero, we alert.
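The rule described above (current value minus the value five minutes ago equals zero) can be written down as a PrometheusRule resource. The Python sketch below only builds that manifest; the metric name `sentry_jobs_started`, the alert name, and the labels are assumptions for illustration, since the transcript does not give the exact identifiers.

```python
import json
from typing import Dict


def stalled_counter_rule(alert: str, metric: str, window: str = "5m",
                         severity: str = "warning") -> Dict:
    """Alert when a counter has not moved over the given window."""
    # PromQL: the "offset" modifier binds to the selector, so this reads as
    # (value now) - (value <window> ago).
    expr = f"({metric} - {metric} offset {window}) == 0"
    return {
        "alert": alert,
        "expr": expr,
        "for": window,
        "labels": {"severity": severity},
        "annotations": {
            "summary": f"{metric} has not increased in the last {window}",
        },
    }


def prometheus_rule(name: str, rule: Dict) -> Dict:
    """Wrap a single alerting rule in a PrometheusRule custom resource."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name},
        "spec": {"groups": [{"name": f"{name}.rules", "rules": [rule]}]},
    }


if __name__ == "__main__":
    rule = stalled_counter_rule("SentryNotProcessingJobs", "sentry_jobs_started")
    print(json.dumps(prometheus_rule("sentry-ingestion", rule), indent=2))
```

This kind of rule only works when something guarantees a steady trickle of events, which is the role the Kubernetes event forwarder plays in the discussion above.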

So monochart itself.

Here we're using monochart just to define some rules.

But monochart allows you to define your deployment.

So here's where we're deploying NGINX: we're setting some config map values, we're setting some secrets, some environment variables.

Here's the definition of the deployment.

But also, unfortunately, we don't have a JSON schema spec for this yet.

So you've kind of got to look at our examples of how we use monochart.

And that's a drawback. If I search for this here, we'll find a better example.

So for example, where we use monochart frequently is with upstream charts that we depend on.

They don't always provide the resources we need.

So then we can use monochart, much like we used the raw chart, to define the rules.

So here is where we're deploying Cam for a k I.

This is a controller that pulls metrics out of k I am and sends them to Prometheus.

So somebody provided a container for us.

But the chart was apart.

So we just used our monochart instead.

So here we.

Define a bunch of service RGB monitor rules to monitor.

In this case kIm so this is complicated like using the raw expressions for Prometheus but I don't want to say that like in your case.

Zack, what I would do is define canned policies like this that you can enable in your chart for typical types of services.

OK, so that is the route I'm going.

And so this was a sanity check that I'm not going the wrong route.
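One way to picture "canned policies you can enable per service type" is a small library of alert templates keyed by type. The sketch below is an assumption-heavy illustration: the service types, alert names, thresholds, and metric names (`http_requests_total`, `jobs_started_total`) are invented for the example and would need to match whatever your services actually export.

```python
from typing import Dict, List

# Hypothetical library of canned alerts. "{job}" is filled in per service;
# doubled braces are literal braces in the resulting PromQL.
CANNED_ALERTS: Dict[str, List[Dict[str, str]]] = {
    "http": [{
        "alert": "HighErrorRate",
        "expr": 'sum(rate(http_requests_total{{job="{job}",code=~"5.."}}[5m]))'
                ' / sum(rate(http_requests_total{{job="{job}"}}[5m])) > 0.05',
    }],
    "worker": [{
        "alert": "NoJobsStarted",
        "expr": '(jobs_started_total{{job="{job}"}}'
                ' - jobs_started_total{{job="{job}"}} offset 5m) == 0',
    }],
}


def rules_for(job: str, service_types: List[str]) -> List[Dict]:
    """Return concrete alert rules for one service from the enabled types."""
    rules = []
    for service_type in service_types:
        if service_type not in CANNED_ALERTS:
            raise ValueError(f"no canned alerts for type {service_type!r}")
        for template in CANNED_ALERTS[service_type]:
            rules.append({
                "alert": template["alert"],
                "expr": template["expr"].format(job=job),
                "labels": {"job": job},
            })
    return rules


if __name__ == "__main__":
    for rule in rules_for("sentry-worker", ["worker"]):
        print(rule["alert"], "=>", rule["expr"])
```

A team then opts in with a short list of types in its chart values, and can still append purpose-built rules of its own alongside the canned ones.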

I know it's just it seems like a lot of work.

It is.

But the thing is like so.

But nothing else does.

There's no one else is doing this.

So this is like I say, I don't see.

I haven't seen any other option out there, aside from magical auto discovery of things running for monitoring this thing where applications, deploy their own configuration for monitoring very.

I don't know of any SaaS product that does that.

And it's very specific to the team and organization and the labeling that you have in place.

Yeah So.

All right.

Well, I mean, the monochart is the right route in my mind as well.

I've been going that route.

I call it chart architecture.

Yeah but I'm using that to do a bunch of other deployments and not forget to microservice.

So this will be rolled into it.

So thank you for the answer.

Yeah, no, I want to just add one other thing that came up just for to help contrast the significance of what we're showing here is yes, this stuff is a bit messy.

I wish this could be cleaned up.

And it wasn't as dense, but when you compare this to like let's say, Datadog and Datadog has an API.

There's a Terraform provider for Datadog.

But I would say that's the classic way of setting up monitoring.

It's a tad better than using Nagios, because there's an API and you can use Terraform, but it's not much better than using Nagios, because there's still this thing where you deploy your app and then this other thing has to run to configure monitoring for that, versus what we're saying here: we deploy the app and the monitoring of the app as one package to Kubernetes.

Well, we're almost out of time today.

Are there any last thoughts or questions related to perhaps this Prometheus stuff?

I didn't check if you posted anything else here.

Alex and Chad thank you for the Terraform cloud demo.

Thank you.

That was all a demo.

Thanks, man.

No problem.

Was month month this year.

I think it's half on sales next month.

The lies and generalizations which can be hard.

I suppose for most HTTP REST APIs.

You could do some kind of anomaly detection or basic five minute alerts but there is Yeah, there's not a general, there's no general metrics across all kinds of services.

So Yeah, that's right, Alex.

So that's what all these other services do, like Datadog: the thing is, they'll provide you some good general kinds of alerts, but nothing purpose-built for your app.

All right then let's see.

I'm just going to bring up the closing slide here.

Well well there you go.

There's my secret sauce.

That's what we're doing here.

We're at the end of the hour.

There are some links for you guys to check out.

If you enjoyed office hours today, go ahead and join our Slack team if you haven't already joined that.

You can go to cloudposse.com/slack, and you can sign up for our newsletter by going to cloudposse.com/newsletter.

If you haven't yet registered for office hours, definitely go there and sign up.

So you get the calendar invite for future sessions.

We post these every single Wednesday.

We syndicate these to our podcast.

If you go to cloudposse.com/podcast, you can find the links where you can subscribe to this, like iTunes or whatever podcast player you use.

Connect with me on LinkedIn.

And thanks again for.

Yeah, for all your input and participation area.

This is awesome.

It's what makes these meetups possible.

Thank you, job for that presentation.

And I'll see you guys all on the call next week.

Take care.

Thank you guys.

Thank you.

Public “Office Hours” (2020-02-05)

Erik OstermanOffice Hours

1 min read

Here's the recording from our DevOps “Office Hours” session on 2020-02-05.

Machine Generated Transcript

Let's get the show started.

Welcome to Office hours.

It's February 5th 2020.

My name is Erik Osterman and I'll be leading the conversation.

I'm the CEO and founder of Cloud Posse.

We are a DevOps accelerator.

We help startups own their infrastructure by building it for you and then showing you the ropes.

For those of you new to the call the format is very informal.

My goal is to get your questions answered.

So feel free to unmute yourself at any time if you want to jump in and participate.

If you're tuning in from our podcast or on our YouTube channel, you can register for these live and interactive sessions by going to cloudposse.com/office-hours.

We host these calls every week and will automatically post a video recording of this session to the office hours channel as well as follow up with an email.

So you can share it with your team.

If you want to share something in private.

Just ask.

And we can temporarily suspend the recording.

And with that said, let's kick it off.

So I don't have really any demos prepared for today or more interesting talking points.

One of the things that we didn't get to cover on the last call was maybe what some of the interesting DevOps interview questions are. What are some of your favorites?

Interestingly enough.

This has come up a couple of times lately in the jobs channel either people going in for hiring or people using some of these questions in their recent interviews.

So I'd like to hear those. Then there's Brian Tyler, who's a regular here.

He hasn't joined yet.

So we'll probably cover that as soon as he joins.

He has been working on adopting Prometheus with Grafana on EFS, but he has a very unique use case compared to others.

He does ephemeral blue-green clusters for their environments, and he's had a challenge using the EFS provisioner on ephemeral clusters.

So you need a different strategy.

So he's going to talk about what that strategy could look like.

But I'm going to wait until they grow quicker.

Ender I work with Brian, I'll grab him and love.

Now OK, awesome.

Thanks Oh, yeah Andrea.

All right.

In the meantime, let's turn mic over.

Anybody else have some questions something interesting.

They're working on something, I want to share.

It's really open ended.

Well, I'm putting together.

I've been doing this as a skunk works at the office for the last couple of months.

But I'm putting together like a showcase of open source DevOps.

They don't have to be open source, but they have to be DevOps tools.

OK, so if anybody has something they want to contribute, or any experiments they want to run, or anything like that,

they're welcome to do that.

Cool, can you just mention that in the office hours channel so that they know how to reach you? So Mike, I'm going to put you on the spot here.

What are you working on.

How'd you end up in office hours?

Yeah, so my company recently started using Terraform, and we found the Cloud Posse templates to be very helpful, especially when learning it all.

And so I just came here to kind of find out some practices beyond what's available online, and a bunch of us on our team actually read Terraform: Up & Running, the second edition.

Yes, that's a good one.

Yeah Yeah.

So we're just trying and more media specifically under trying to find best practices.

And I guess one of my questions was going to be, how does everybody kind of lay out their Terraform? Like, I understand the concept of modules being reusable, but the next step is defining different environments.

So like we're going to be using separate AWS accounts for different environments.

And so I just wanted to get more expert advice from you guys, and also just learn more about DevOps in general.

Yeah, it's a broad definition.

Nobody has an accurate their own.

Everybody has their own definition of DevOps.

So Yeah, it's loaded.

Really we don't even have to talk about that.

I go there to block all.

So Yeah, it's a hot topic.

I'm sure a lot of people here you can share kind of what their structure is.

There's no canonical source of truth, and I would also like to say that there has been a pretty standardized convention for how to lay out projects for Terraform that I think is being challenged with the release of Terraform Cloud.

And what I want to point out here is that.

So the HashiCorp docs have actually gotten a lot better, especially in becoming a little bit more opinionated or idiomatic on how to arrange things.

And that's a good thing, because they're the maintainers.

So one of the things I came across a few months ago, when I was looking to explain this to a customer, was what the different strategies are.

And here, they kind of lay out, for example, the common repo architectures as my mouse pings and I can't scroll down.

This is the problem of screen sharing with do it.

All right.

I'll wait for that to wake up and continue talking what it lays out here is basically three strategies.

One is the strategy that HashiCorp is now recommending, which you can briefly see at the top of my screen here, which is using the monorepo, or using kind of poly-monorepos.

What I mean by that is maybe breaking it out.

Maybe by a purpose.

So maybe you have one repository for networking one repository for some teams project or something like that.

But what you do is you don't have environment.

Let me see if I can.

It's waking up; my mouse is going crazy. There you go.

So multiple workspaces per repo.

So what HashiCorp started doing was saying, hey, let's use the workspaces concept for environments.

Even though originally, they said in their documentation don't use workspaces this way.

So I don't know if you saw it anywhere but now they've done a mea culpa on that an about face and we've been taking a second look at this.

And I think that it's making projects a lot easier actually for developers to understand when you just have the code and you just have the configuration and you don't conflate the two.

So the other approach is more traditional in development: you have maybe a branch for each environment.

That's a controversial thing as well.

I don't like it because you have long-lived branches, and keeping them in sync, dealing with merge conflicts, and making sure they don't diverge still takes extra effort.

Also, if you're an organization that believes in trunk-based development, then the branch model with long-lived branches like this isn't ideal.

And then there's the directory structure.

This is what I was going to get to.

This has been the canonical way of organizing Terraform projects, maybe in large part because, you know, Gruntwork has a lot of influence in this area, with the usage of Terragrunt and tools like that.

This has been a good way of doing it, but there's some problems with this.

So what I like about it:

First of all, you have the separate modules folder, right? The modules folder has what we call root modules, the top-level implementations.

And those are reusable.

That's like your service catalog for developers.

And then you have your environments here, and those kind of describe everything that you have for production and everything that you have for staging, and these might be broken out more; like, you should have project folders under prod.

You wouldn't have them.

This would be considered like a monolithic project, or terralith.

You don't want terraliths, right?

So under prod you'd maybe have VPC, you'd maybe have EKS, and you'd maybe have some other project or service that you have.

And then under there all the Terraform code.

The problem is that these things still end up diverging, because you have to remember to open pull requests to promote those changes to all the environments, or you have to create one heavy pull request that modifies all those environments.

This has been a big pain for us.

Even at Cloud Posse.

So we started off with an interesting approach at Cloud Posse, which is to treat accounts like applications.

And in that case when I say accounts.

I mean, Amazon accounts.

So like the root account or your master account is one configuration you have your staging configuration, your core configuration, your data configuration.

And what I love about this is you have like a strict shared nothing approach.

Even the Git history shares nothing, and share-nothing has kind of been this holy grail to reduce the blast radius.

The other thing is that when your webhooks fire, they only fire on the dev account.

And because we have these strict partitions there's no accidental mistakes of updating the wrong environment.

And every change is explicit.

Now there was a great quote in a podcast that just came out the other week on The Changelog, where they interview Kelsey Hightower; I think the topic was like "monoliths are the future."

And this is the constant battle of bundling and unbundling and bundling and unbundling is like basically, I guess you get anywhere you go to consolidate and then expand.

And then you expand and you realize that that can work well you consolidate and you expand it and you say so.

But my point here is more that one of the things he said was that the problem with microservices is that it requires the developers to have a certain level of rigor.

I'm paraphrasing in my own words.

It's asking it of an organization that wasn't practicing it before.

So how are they going to get it right.

This time by moving to microservices.

I want to find the exact quote somewhere; maybe somebody can post it in office hours if they have it handy.

But that was it.

So that's the thing here.

What I describe here.

This is beautiful.

And when you have a well oiled machine that is excellent at keeping track and promoting changes to all your environments and no change is left behind, then it works well, but this is an unrealistic expectation.

So that's why we're considering the practice HashiCorp recommends now, using multiple workspaces per repo, and under this model, when you open up a pull request, what you see is: OK, here's what's going to happen in production, staging, and anywhere else you're using that code, because inadvertently you might have drift, either from, you know, humans monkey-patching by going into the console, or maybe applies that failed in some environments.

And that was never addressed.

So now you have diverged that way.

That's a valid error or maybe you have other Terraform code that is imported to state or something.

And it's manipulating those same resources.

There have been bugs in Terraform providers and all that stuff.

So I want to see, when I open up a pull request, what's going to happen to all those environments; that's what I really like about this workspace approach.
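A minimal sketch of "show me the plan for every environment on one pull request" could look like the following. It only builds the command lines rather than running them; the workspace names, the `envs/<workspace>.tfvars` layout, and the plan file names are assumptions for the example, while `terraform workspace select` and the `-var-file` and `-out` flags of `terraform plan` are standard CLI features.

```python
from typing import List


def plan_commands(workspaces: List[str], var_dir: str = "envs") -> List[List[str]]:
    """Build the terraform commands to plan one codebase against each workspace."""
    commands = []
    for workspace in workspaces:
        # Same code for every environment; only the workspace (state) and
        # the variables file (configuration) change.
        commands.append(["terraform", "workspace", "select", workspace])
        commands.append([
            "terraform", "plan",
            f"-var-file={var_dir}/{workspace}.tfvars",
            f"-out={workspace}.tfplan",
        ])
    return commands


if __name__ == "__main__":
    for argv in plan_commands(["staging", "production"]):
        print(" ".join(argv))
```

A CI job could run these in order with `subprocess.run(argv, check=True)` and attach each plan to the pull request, so drift in any one environment shows up before the change is merged.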

That's the opinion we keep getting like we'll see people that broke it out in a similar way that I can do Adam cat home experts where you'll have like the different end of US accounts and then it's free and it's interesting to me that you're like, well, now we're going to try this multiple workspaces so tight.

Where do I go.

Where do you go from here.

And you're right to feel frustrated and you know the ecosystem is in frustration towards you.

No no I know, just in general.

I would say, you know, in software development, more or less, the best practices for how to do your pipelines and promote code are very well understood.

And we're trying to adapt some of those same things to infrastructure.

The problem is that we're different.

We're operating at different points in the software development lifecycle and the capabilities at our disposal are different.

So let's take, for example, infrastructure infrastructure.

But if you listen to this podcast you talk to us.

I always talk about this.

It's like the four layers of infrastructure: you've got your foundational infrastructure.

So you've got your platform and then on top of your platform you've got your shared services.

And then the last layer is your applications, most people working with this stuff.

Assume layers one through 3 exists and don't have to worry about it.

It's a separation of concerns.

All they care about is deploying their app.

They don't have any sympathy or empathy for how you manage Layers 1 through 3.

But if you're in our industry, that's one of the biggest problems is that we have to maintain stability of the platform while also providing the ability to run multiple environments for developers and stuff like that.

So my reason for bringing up all of this is that like Terraform as a tool doesn't work like deploying your standard microservice your standard go rails node app or whatever.

Like if you deploy a Node app and it fails, you can roll back, or you didn't even need to expose that error at all because your health checks caught it and you're running parallel versions of all that, et cetera. When you're dealing with infrastructure and using a tool like Terraform, it's a lot more like doing database migrations without transactions.

So there is no recourse.

So how do you operate when you're dealing with this really critical stuff in a place where you have poor recourse.

So you need good processes for that.

And let's see here.

So that's why we're still trying to figure this out.

I think as relates to DevOps what is the best course of action.

There's been Atlantis.

There's been Terraform cloud a lot of people are using Jenkins to roll this out.

Some people combine things like Spinnaker to it.

You can orchestrate the whole workflow, because the problem is your standard CI/CD systems aren't the answer for this, I don't think.

I don't think anybody's just nailed it yet.

We're all so that it's a fair question if you go back to your cat home experts and you can click on one of those accounts.

I'm just curious now: are you still referring to the Cloud Posse modules, or do you build modules within here?

No It's really all just config.

Both both.

So let me describe that a little bit more of what our strategy has been kind of like food for thought.

So here I have one of these AWS accounts, and under conf we have the projects.

So this is how we had kind of like the microservices of Terraform.

So it's not a monolith; it's a microservice architecture, so to say.

If I look at any one of these, we've been using a pattern similar to what you'd get with Gruntwork, which is where you have a centralized module repository.

So here we have the terraform-root-modules at Cloud Posse.

So where you pull those root modules from really doesn't matter; you could have a centralized service catalog at Cloud Posse, which works well for our customers because our customers are implementing our way of doing things.

Now, depending on how much experience and opinions you have about this stuff, you could fork our modules or you could start your own.

And what we have typically is the customer has their own modules.

We have our own modules and you kind of pick and choose from those. Here what we have is we're using environment variables with Terraform, and this pattern worked really well in Terraform 0.11, but in Terraform 0.12, for example, Terraform broke a fundamental way we were using the product.

So when you run terraform init -from-module and you point at a module, you used to be able to initialize your current working directory with a remote module, even if your working directory had a file in it.

So that works really awesome.

We didn't need any wrappers we didn't need any tools.

But then Terraform 0.12 comes out and they say, no, the directory has to be empty.

Oh, and by the way, many Terraform commands, like terraform output, don't allow you to specify where your state directory is so that it could get the outputs.

So with most commands you can say terraform plan in this directory, terraform apply this plan file, or terraform init in this directory.

But other commands, like terraform show, I think, and terraform graph and terraform output, don't allow you to specify the directory.

So there's an inconsistency in the interface of Terraform and then they broke the ability to initialize the local directory.

So anyways, my point with this is saying perhaps there is some confusion on our side as well, because the interface was changed underneath us.

So going back to your question under here.

So then if you go to this folder here and you go to, like, terraform-root-modules, here are a bunch of opinionated implementations of modules for this company.

So here's the idea as the company grows in sophistication you're going to have more opinions on how you're going to do EMR how you're going to do Postgres how you do all those things.

And that's the purpose of this.

And then a developer all they need to know is that look, I can follow best practices.

If I point to that.

And then you DevOps team or people stakeholders in that with that label.

So to say, can be deciding what are the best practices for implementing those modules and then you've seen quite possibly them.

Here we have our terraforming modules.

Now our modules.

I want to preface this with.

We have these as kind of boiler plate examples for how to implement common things, but they're not necessarily the canonical best practice for implementing.

So our EKS root module here implements a basic EKS cluster with a single auto scaling group, basically a single node pool.

Now that gets complicated, right, because startup companies, they're going to need a node pool with GPUs.

And they're going to need a high-memory node pool, they're going to need a high-CPU pool, and how you mix and match that, that's too opinionated for us to generalize.

Well, that's all I can say is yes.

Now, I guess.

And then what.

Yeah, and that's kind of what we're going through now: just figuring out what works for us, deciding on the different structures of everything, and definitely taking advantage of, or looking at, what you guys have already done and looking at a lot of things.

And just reading all over.

So yeah.

Well, there are good Reddit posts.

You know, everyone happens to blog on Medium.

So check it out.

Exactly. And I think that's the other thing, just finding documentation on 0.12 compared to 0.11.

And, you know, refining my Google searches and only searching the past, like, five months type of thing.

That's a good point.

Yeah, there's not.

You can end up on a lot of outdated stuff, especially with how fast this stuff moves.

So, you know, I was reading a blog from July of 2019 and blindly assumed that they were talking about 0.12 when, in fact, they were talking about 0.11.

So yeah, but I got to move on to the next question.

I see Brian Tyler joined us.

He's over at AuditBoard, and he's been working on setting up Prometheus with EFS on ephemeral, short-lived clusters.

Kind of an interesting problem.

For those of you who attended the last few calls, this Prometheus EFS thing has come up quite a lot.

I want to just kind of tee this question up a little bit and frame it for everyone.

So at Cloud Posse, we've been supporting a lot of different monitoring solutions, but we always come back to this: Prometheus with Grafana is pretty awesome, the Kubernetes community provides a lot of nice dashboards, and it's proven to scale.

So one of the patterns we've been using, which others here have used as well and which works pretty well, is to actually host the Prometheus time series database on EFS.

And I guess your mileage will vary, and if you're running at Facebook scale, yeah, you're probably going to need to use Thanos or whatever, some bigger things.

But EFS is highly scalable.

It's POSIX compliant and it works ridiculously well with Prometheus for simple implementations.

The problem that Brian has is that his clusters are totally ephemeral.

Like, when they do a rollout, they bring up a whole new cluster and deploy a whole new stack to that.

And then validate that that works.

And shut down the old one.

And with Prometheus on EFS, well, what we've been using is the EFS provisioner, and with the EFS provisioner it'll automatically allocate a PVC, a persistent volume claim, on your EFS file system.

The problem is those persistent volume IDs are unique.

They're generated by the Kubernetes platform.

So if you have several clusters, how do you reattach to that EFS file system if you're using the EFS provisioner?

Well, the kicker is, if you are doing it this way, then maybe the EFS provisioner isn't the right tool.

You can still use EFS, but the provisioner isn't going to be the right tool.

And instead, what you're going to need to do is mount the EFS file systems to the host using the traditional methods of mounting EFS for your operating system.

So if you're using kops, you're going to use the hooks for when the service starts up and add a systemd hook that's going to mount that EFS file system to the host OS.

And now with Kubernetes, what you can do is mount the actual host volume that's been mounted there into your container, and then you are the decider of the naming convention at that point, how you keep it secure and everything. Brian, is that clear?

I can talk more on that.

Yeah, no, that makes sense.

Yeah, it's unfortunate that I have to go that route.

But sometimes you don't get the turnkey solution.

Yeah, I mean, so before the EFS provisioner we were using this pattern, like, on CoreOS with Kubernetes for some time.

And it worked OK.

So I just wanted to point out, for those who are not familiar with the kops manifest and its capabilities, the kops manifest.

So under Cloud Posse, reference-architectures is kind of what we've been using in our consulting engagements to implement Kubernetes clusters with kops.

I'm mostly speaking here to you, Mike, who is looking for inspiration.

This here is doing that whole poly repo architecture, which I'm undecided on at this point, after being totally gung ho.

So we go here now to the kops private topology, which is our manifest for kops.

What I love about kops is that kops basically implemented a Kubernetes-like API for managing Kubernetes itself.

So if you look at any CRD, they're almost the same as these.

But then there's this escape hatch that works really well, which is that you can define systemd units that get installed at startup.

So here you go, Brian: you'll create a systemd unit that just calls ExecStart to mount the file system.
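Editor's sketch: the unit described here could look roughly like what the following Python renders. It is only an illustration under assumptions: the helper name, the /mnt/efs mount point, the binary paths and the example file system ID are all hypothetical, and how the unit text gets wired into a kops hook is deliberately left out.

```python
def efs_mount_unit(file_system_id: str, region: str, mount_point: str = "/mnt/efs") -> str:
    """Render a oneshot systemd unit that mounts an EFS file system over NFSv4.1.

    All names and paths are illustrative assumptions, not the speaker's config.
    """
    # EFS exposes a regional DNS name of the form <fs-id>.efs.<region>.amazonaws.com
    endpoint = f"{file_system_id}.efs.{region}.amazonaws.com"
    lines = [
        "[Unit]",
        f"Description=Mount EFS file system {file_system_id}",
        "Wants=network-online.target",
        "After=network-online.target",
        "",
        "[Service]",
        "Type=oneshot",
        "RemainAfterExit=yes",
        # Binary locations differ between distributions; adjust as needed.
        f"ExecStartPre=/bin/mkdir -p {mount_point}",
        f"ExecStart=/bin/mount -t nfs4 -o nfsvers=4.1 {endpoint}:/ {mount_point}",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(efs_mount_unit("fs-12345678", "us-west-2"))
```

Because the mount point is chosen by you rather than generated, a pod on a brand-new cluster can mount the same host path and find the existing Prometheus data.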

Yeah. Have you guys looked at the EFS CSI driver?

No. So maybe there are more elegant implementations.

I pasted it in the chat.

It's what I was looking into next, which I believe could solve my problem for me, just because I create the PV itself.

Then I can define the path for that persistent volume.

Oh, OK.

Yeah. If you can define a predictable path, then you're around it.

Yeah, it's just that as long as you're in the territory of having random IDs, then yeah.

I believe, if I'm creating that persistent volume myself instead of doing the PVC route, then I can create the volume.

And then when I'm provisioning Prometheus I can just tell it which volume to mount, instead of going and creating a PVC through the Prometheus Operator.

OK.
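Editor's sketch: a statically defined persistent volume of the kind Brian describes might be generated like this. The function name, volume name, storage class and file system ID are assumptions for illustration; the driver name efs.csi.aws.com is the one the AWS EFS CSI driver registers. The capacity field is required by Kubernetes even though EFS does not enforce it.

```python
import json


def efs_static_pv(name: str, file_system_id: str, capacity: str = "5Gi") -> dict:
    """Build a PersistentVolume manifest that points at an existing EFS file system."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": capacity},
            "volumeMode": "Filesystem",
            "accessModes": ["ReadWriteMany"],
            # Retain, so that tearing down one cluster does not delete the data.
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": "efs-sc",  # hypothetical storage class name
            "csi": {
                "driver": "efs.csi.aws.com",
                # The volume handle is the file system ID you already know,
                # so every new cluster can reattach to the same data.
                "volumeHandle": file_system_id,
            },
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML manifests.
    print(json.dumps(efs_static_pv("prometheus-data", "fs-12345678"), indent=2))
```

The point of the design is that the identifier comes from you, not from a provisioner, which is what makes reattaching from a fresh cluster predictable.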

I think that's the route I'm going to try first.

If not, then obviously the kops systemd one is the one that for sure will work.

Yeah, that's your plan B.

Andrew Roth, you might have some experience in this area.

Do you have anything to add to it?

I've never used a CSI driver.

I've only used the EFS provisioner.

Our very own Certified Kubernetes Administrator: any insight here, Will?

Your initial thought was the tree my have to have decreased itself be attached.

And then have to continue from there.

The CSI driver before.

I've only just started to mess around with it.

But not enough experience to really say anything much.

Playing around with rook sort of when it's set.

All right.

Well, I guess that's the end of that question then unless you have any other thing you wanted to add to that, Brian.

No thank you.

Yeah Any other questions.

We got quite a lot of people on the call.

I just have a general question.

It's worn.

So I know in the company that I work for we use a lot of you guys' repos.

I know a lot of it is revolving around AWS.

Now, do you have any, or do you plan on doing any, repos for, like, maybe Digital Ocean?

That's a good question.

Fun fact is that we use a bit of Digital Ocean, kind of for our own POCs, to keep our costs low and manageable.

I don't know if we're going to really be investing directly in Digital Ocean, because of who we cater towards, which are more like well-heeled startups.

And they need serious infrastructure.

So I think Digital Ocean is awesome for POCs.

But I wouldn't want to base my $100 million startup on it.

Cool. I do have, like, a little comment also: we actually do use that process of using Terraform Cloud with the workspaces.

How's that working out for you?

Anything you want to add?

Like a use case or firsthand experience so far?

Pretty good.

You know, anytime we have any issues with a certain, I guess, stage, we just have to do a pull request for that particular stage.

OK. You just go from there.

And it's pretty simple.

You know, because I've only been involved with the company a little over a year.

So which of the repo strategies are you guys using with Terraform Cloud?

The repo strategy, I guess, depends on the stage; like, we have stages like prod, dev, UAT, QA.

So, like, maybe UAT and support, they all branch off from the master branch.

OK. So you're using the branching strategy, then, where UAT would be a separate branch?

Not in the git repo; it'll all be on the same branch in the git repo.

OK. So anything that's merged into master. Any directory structures, then, or are you using the official approach that they describe up here?

Once again, it sounds like a single branch, like a master, and we branch from that.

And then merge back, yes, in the long run.

Yeah, but the environments: is UAT just a workspace, or is UAT a project folder like this?

UAT will just be a workspace.

OK. So yeah, you're using that strategy.

The first strategy we were talking about here, multiple workspaces.

I wanted to expand on that, in that we have an older Terraform repo that has a few of them that are using workspaces, but we have a lot of stuff where we're using Terragrunt to manage it.

And I haven't sat down really to think about it.

But is there a best practice for managing workspaces via Terragrunt?

I was exploring that the other day, just in terms of a POC, and out of the box Terragrunt does not support workspaces.

It's an anti-pattern for them.

And I would say, based on what I said earlier, that it used to be the official anti-pattern.

Like, you don't use workspaces for this, but even HashiCorp has done an about-face on this.

So Terragrunt, I'm not sure if they're going to be adding support for it or even addressing this topic.

However, I was kind of able to get it to work using, and I don't know what the term was in Terragrunt, but I think the equivalent of hooks, right?

So I had a pre-hook, yeah, for either the plan or the init or whatever.

I just used a hook to select the workspace, or create the workspace if the selecting failed.

And that seemed to more or less work for me.

So the challenge is you need to call this other command, which is terraform workspace select or terraform workspace new, and using the hooks was one way to achieve that.
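Editor's sketch: the select-or-create step a hook would run can be expressed as a small wrapper. This is a minimal sketch, assuming terraform is on the PATH; the function names are hypothetical, and the injectable runner exists only so the logic can be exercised without Terraform installed.

```python
import subprocess
from typing import Callable, Sequence


def run_terraform(args: Sequence[str]) -> int:
    """Run `terraform <args>` and return its exit code."""
    return subprocess.run(["terraform", *args]).returncode


def ensure_workspace(name: str, run: Callable[[Sequence[str]], int] = run_terraform) -> str:
    """Select the workspace, creating it only if the select fails."""
    if run(["workspace", "select", name]) == 0:
        return "selected"
    # `terraform workspace new` creates the workspace and switches to it.
    if run(["workspace", "new", name]) == 0:
        return "created"
    raise RuntimeError(f"could not select or create workspace {name!r}")


if __name__ == "__main__":
    import sys

    print(ensure_workspace(sys.argv[1] if len(sys.argv) > 1 else "dev"))
```

Running a step like this before plan or apply is what keeps the command pointed at the intended workspace.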

You see here in a terminal window.

I might still have that question on those using Terraform workspaces.

And I haven't been keeping too up to date with Terraform 0.12.

Have they made the change so that you can interpolate your state, like, the location, or not?

It's still not a thing. I can talk to that as well, actually.

Another vehicle that kind of.

So the OK.

So for everyone's clarification.

Let's see here.

That's a good example.

So the Terraform state backend with S3 requires a number of parameters, like the bucket name, the role you need to assume to use that bucket, key prefixes, all that kind of stuff.

And one thing that we've been practicing at Cloud Posse has been this thing like share nothing, right?

So here's an example of what Brian's talking about.

So you see this bucket name here.

You see this key prefix here.

You see the region.

Well, with the share-nothing approach, right, you really don't want to share this bucket, especially if you're having a multi-region architecture.

And to be able to administer multiple regions when that one region is down, you definitely have to have multiple state buckets, right?

However, because Terraform to this day does not support interpolation, basically using variables, for this, that's kind of limited how you can initialize the backend.

So one of the things that they introduced in a relatively late version of Terraform was the ability to set all the backend parameters with command-line arguments.

So you could say terraform init, and I think the argument was like -backend-config equals, yeah, bucket equals whatever, something.

But then there was still no way to standardize that.

So that's why at Cloud Posse we were using, there's this thing that you can use.

I'm going to type it out here.

So Terraform supports certain environment variables.

So you could say export TF_CLI_ARGS, it was kind of weird, TF_CLI_ARGS_plan I think it was, equals, and then here you could say, like, -backend-config=bucket=my-bucket.

So because you had environment variables, you could kind of get around the fact that you can't use interpolation.
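Editor's sketch: the environment-variable workaround can be illustrated as follows. Since -backend-config is an argument to terraform init, this sketch sets TF_CLI_ARGS_init rather than the plan variant recalled on the call. The helper names and the example bucket and region values are hypothetical.

```python
import os
import shlex
from typing import Dict, Mapping, Optional


def backend_config_args(settings: Mapping[str, str]) -> str:
    """Turn backend settings into a string of -backend-config=KEY=VALUE flags."""
    return " ".join(
        shlex.quote(f"-backend-config={key}={value}") for key, value in settings.items()
    )


def terraform_init_env(
    settings: Mapping[str, str], base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment with TF_CLI_ARGS_init populated."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_CLI_ARGS_init"] = backend_config_args(settings)
    return env


if __name__ == "__main__":
    env = terraform_init_env({"bucket": "my-bucket", "region": "us-west-2"})
    print(env["TF_CLI_ARGS_init"])
```

A wrapper script or Makefile target can then launch terraform init with that environment, so the backend block itself never needs variables.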

Are we clear up to this point, before I kind of do my soul-searching moment here?

For me, this is clear.

Yeah, OK.

So I'm just one voice.

Eric, are you trying to share your screen?

Yeah, you might need to.

It's definitely being shared because I see the green arrow around my window.

If you're in Zoom you might need to tab around to see it.

Yeah, I can see it in here.

Yeah, Zoom is funky like that.

I see it now.

Yeah. So you see here.

I might be off by some camel case, some capitalization here somewhere, maybe.

But it is more or less what you can do to get around it.

So you can stick that in your Makefile.

You could have users, you could have your wrapper script.

OK. And then the last option.

Yeah, if you're using Terragrunt, Terragrunt will help set this up for you by passing these arguments for you in your terragrunt.hcl.

So there are options right.

But a lot of these are self-inflicted problems.

Like I said, self-inflicted problems we've created for ourselves, right?

The problem we created for ourselves at Cloud Posse was this strict adherence to: do not share the Terraform state bucket across stages and accounts, always provision a new one.

This made it very difficult when you wanted to spin up a new account, because you had to wire up the state. For managing Terraform state with Terraform, we have a tfstate-backend module; we use it all the time.

It works, but it's kind of messy when you've got to initialize Terraform state without Terraform state, using the module.

So it creates the bucket, the DynamoDB table and all that stuff.

Then you run some commands via script to import that into Terraform, and now you can use Terraform the way it was meant to be used, with this bucket that was created by Terraform. But it's real.

And you see the catch-22 here.

And if you're doing this in every account, you get the idea.

So Gruntwork took a different approach with Terragrunt.

They have the tool, Terragrunt, provision the DynamoDB table and the S3 bucket for you.

So it's kind of ironic that your whole purpose with using Terraform is to provision your infrastructure as code, and then they're using this escape hatch to create the bucket for you.

And so.

So let's just say that that was a necessary evil.

No, I get it.

It's a necessary evil.

Well, on the topic of necessary evils, let's talk about the alternative to this, to eliminate, like, all this grief that we've been having.

Screw it.

Single state bucket, and just use IAM policies on path prefixes on workspaces to keep things secure.

The downside is, yes, we've got to use IAM policies and make sure that there's never any problems with those policies, or we can leak state.

But it makes everything 10x 100x easier.

It's like using.

So one of the things Terraform Cloud came out with was managed state.

And that's free forever, or whatever, you believe.

But just having the managed state made a lot of things easier with Terraform Cloud.

I still like the idea of having control over it.

With Terraform state in S3 on Amazon, we manage all the policies and access to that.

So that's that.

All right.

So when you're using workspaces together with the state storage bucket, the other thing you've got to keep in mind is using this extra parameter here.

So workspace_key_prefix.

So if you're using the shared S3 bucket strategy, then you're going to want to always make sure you set the workspace_key_prefix so that you can properly control IAM permissions on that bucket based on workspaces.
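Editor's sketch: an IAM policy scoped to one workspace under a shared state bucket might be generated along these lines. The bucket, prefix and workspace names are hypothetical, and it assumes state objects live under the workspace_key_prefix followed by the workspace name. A real backend will likely need more than this (for example permissions to enumerate workspaces and to use the DynamoDB lock table), so treat it as a starting point rather than a complete policy.

```python
import json


def workspace_state_policy(bucket: str, key_prefix: str, workspace: str) -> dict:
    """Build an IAM policy document limited to one workspace's state objects."""
    prefix = f"{key_prefix}/{workspace}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, restricted to the prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reading and writing is granted only on objects under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(workspace_state_policy("my-state-bucket", "eks", "dev"), indent=2))
```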

So a workspace might be dev.

A workspace might be prod.

Thank you very much for explaining that; that cleared up some confusion when it comes to workspaces.

But you said, where do you keep the state.

But if you divide that all up with IAM roles and policies, it can be done by keeping it in a single state.

A single bucket, I should say.

One of the reasons why we haven't liked this approach is that.

OK. Let's say, OK.

And I just want to correct one thing I was kind of saying wrong or misleading.

So the workspace key prefix would be kind of your project.

So if your project is EKS, the workspace key prefix would be eks, and then Terraform automatically creates a folder within there for every workspace.

So there'll be a workspace folder for prod, a workspace folder for dev.
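Editor's sketch: the resulting object layout in the bucket can be written down as a small function. The function name and example values are illustrative; the layout itself follows the S3 backend's documented behavior, where non-default workspaces are stored under the prefix, then the workspace name, then the key, and the prefix defaults to "env:".

```python
def state_object_key(key: str, workspace: str = "default", workspace_key_prefix: str = "env:") -> str:
    """Return the S3 object key where the S3 backend stores a workspace's state."""
    if workspace == "default":
        # The default workspace is stored directly at the configured key.
        return key
    return f"{workspace_key_prefix}/{workspace}/{key}"


if __name__ == "__main__":
    for ws in ("default", "dev", "prod"):
        print(ws, "->", state_object_key("terraform.tfstate", ws, "eks"))
```

With the prefix set to the project name, each workspace gets its own folder, which is what makes per-workspace IAM rules on path prefixes possible.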

So there that.

Now why we haven't liked this approach is this.

So IAM is production grade; we're using IAM in production. And let's say this bucket is in our master AWS account, and we're using IAM there to control access to dev, or control access to the staging workspaces, or control access to some arbitrary number of workspaces. Well, what we're actually doing is modifying production IAM policies without ever testing those IAM policies in another environment by procedure. Like, we're not enforcing it.

You can still test these things.

But it's on your own accord that you're testing that somewhere else.

And that's what I don't like about it: you're kind of doing the equivalent of cowboy IAM in production, obviously with Terraform as code, but nonetheless.

You almost need a directory server for this sort of thing.

Yeah.

That's interesting.

Is there IAM directory integration?

I haven't looked at trying to do that.

But Yeah.

So sorry, I've got some comments here by Alex: but if you update the referenced version in dev, but.

But then get pulled away, and it gets forgotten; later environments, come back to it, it's like default.

So then coming back and being like, what did we drop the ball on, requires some fancy diff work and just tedious investigation.

I kind of want a dashboard that tells me what version of every module I'm referencing in each environment.

This doesn't cover everything, but doing cool stuff like this is just messy. So, yeah.

So Alex Siegmund is actually one of our recent customers and one of our recent engagements.

So one of the problems that we have in the approach we've taken up to this point has been that you might merge PRs for dev and apply those in dev, but those changes never get promoted through the pipeline.

And then they get lost and forgotten.

And that is the whole problem, I argue, with the poly repo approach that we've been taking.

But it's also the problem with the directory structure approach that's been the canonical way of doing things in Terraform for some time.

The prod directory, the dev directory, all of those things have the same problem, which is that you forget what's been rolled out.

So that's why I like this approach where you have one PR and you never merge.

OK, there are variations of this implementation.

One variation is that the PR is never merged until there is a successful apply in every environment.

So then that PR is that dashboard that Alex is talking about.

He wants a dashboard that tells what version has been deployed everywhere.

Well, this is kind of like that dashboard: when that PR is open, then you know that there's an inconsistency across environments.

And based on status checks on your pull request, you see success in dev, success in staging, and no update from production.
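Editor's sketch: the "PR as dashboard" rule can be stated as a tiny check. Everything here is hypothetical (the function name, the environment names and the status strings); it only illustrates the rule that a pull request is mergeable once every environment reports a successful apply, and that the missing ones are what the open PR surfaces.

```python
from typing import Dict, List, Mapping, Sequence, Union


def rollout_summary(
    checks: Mapping[str, str], environments: Sequence[str]
) -> Dict[str, Union[bool, List[str]]]:
    """Summarize per-environment status checks for one pull request."""
    # An environment with no reported status counts as not yet applied.
    pending = [env for env in environments if checks.get(env) != "success"]
    return {"mergeable": not pending, "pending": pending}


if __name__ == "__main__":
    checks = {"dev": "success", "staging": "success"}
    print(rollout_summary(checks, ["dev", "staging", "prod"]))
```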

OK, that's pretty clear now where it's been updated.

Versus this approach where you then merge it.

And then you need to make sure that whatever happens after that has been orchestrated in the form of a pipeline where you systematically promote that change to every environment that it needs to go to after that.

But now the onus is back on you: you have to implement something to make sure that happens.

In Terraform Cloud, they have this workflow where it will plan in every environment.

And then when you merge, you can set up rules, I believe, on what happens when you merge.

So when you merge, maybe it goes automatically out to staging, and then you have maybe a process of clicking the button to apply to the other environments.

What's nice about this is you'll still see that that has not been applied to those other environments.

And you need that somewhere.

So whether you use the pull request to indicate that that's been applied or whether you use a tool like Terraform cloud to see that it's been applied or whether you use a tool like Spinnaker to create some elaborate workflow for this.

That's open ended.

Let's see.

I think we've got some more messages here.

You've removed the need for such a dashboard by making part of your process ensuring that it's all repositories or environments.

Yes So I'm still not 100% sure.

OK, awesome.

So Yeah, Alex says that this alternative strategy we're proposing eliminates some of these needs for doing that because in the other model the process is basically the onus is on you to have the rigor to make sure these things get rolled out everywhere.

And especially the larger your team gets, the less oversight there is ensuring that that stuff happens.

And so one thing I'm personally taking away from this is to try the workspace approach recommended in the Terraform docs first, and then share and report back.

All these pitfalls.

Yes also I am.

And to be candid, I have not.

I have not watched or listen to it yet.

And you're going to find detractors for this, right.

But that's good.

Any rich ecosystem has conflicts, and Anton Babenko, was it at FOSDEM, or I forget what the conference is, he has just done a presentation unselling everything I said and saying why you have to have the directory approach.

So that might be a video to watch.

Is it good.

Anton is a great speaker.

I know I'll look, I with the I actually met this guy upstairs.

Yeah Yeah.

He was there last year.

Yeah super nice guy.

He's awesome one.

Yeah, I really like.

I like him.

I met him up in San Francisco and at re:Invent; I think this guy goes to, like, 25 conferences a year or something.

It's a.

So you were saying he kind of has the recommendation, even with 0.12, to stay with directories?

I think so.

So he's a very big Terragrunt user.

And Terragrunt is a community in and of itself with their own ways of doing things.

So therefore I suspect this will be very much promoting that, because in the abstract it was something like, why you have to do this this way.

I'll find it though while this is going.

Any other questions?

I have a bit of a question that is a little bit higher level.

My experience with Terraform is that when you use it, it's kind of very low level.

It's barely abstracted from the API.

And you have, of course, you know, the built-in kind of semantics that gives you rails, as it were, sort of like, you know, this is how we do state transitions.

So we do this.

So we do that.

And it's kind of like, you know, you operate inside of that construct.

Yeah. What's your experience with, or thoughts around, using higher-order constructs like what's available with the AWS CDK, for example, and some of the things you could do with that in a Turing-complete language?

Yeah Yeah.

It's good.

I like the question.

And let's see how I answer it.

So this has come up recently with one of our most recent engagements, and the developers on that team kind of challenged us, like, why are we not using CDK for some of the stuff?

Let's be totally honest.

CDK is pretty new.

And we've been doing this for a long time.

So our natural bias is going to be that we want to use Terraform because it's just a richer experience.

But there are a lot of other reasons why I think one can argue for using something like Terraform and not something that's more Turing complete like CDK or Pulumi, and that's that requirements don't always translate to awesome outcomes or products.

And the problem is that when you can do anything possible every way possible, you get into this scenario of why people hate Jenkins, and why people hate, like, Groovy pipelines in Jenkins, because you develop these things that start off quite simple, quite elegant.

They do all these things you need, and then 25 people work on it and it becomes a mush of pipeline code, a mush of infrastructure code.

If we're talking the CDK, right.

And things like that.

This is not saying you can't use it.

I'm thinking more like there's a time and place for it.

So if we talk about Amazon in the Amazon ecosystem.

An example I like to bring up is ECS.

ECS has been wildly popular as a quick way to get up and running with containers in a reliable way that's fully integrated with AWS services.

But it's been a colossal failure when you look at reusable code across organizations.

And this is where Kubernetes just kicks butt over ECS.

So in Kubernetes, Helm quickly became the dominant package manager in this ecosystem.

Yeah, there's been a lot of hatred towards Helm for some security things or whatnot, but it's been incredibly successful because now there are literally hundreds and hundreds of Helm charts, many written by the developers of these things to release and deploy their software.

The reason why I bring up Helm and Kubernetes is that's proved to be a very effective ecosystem.

We talk about Docker: same thing, incredibly productive ecosystem.

And so with Docker Hub, there's this registry.

And, you know, security implications aside, there's a container for everything.

People put them out there; your mileage may vary and some might be exploitable, but that's part of the secret of why Docker has been so successful.

It's an easy DSL, an easy distribution platform, and it works everywhere.

Same pattern, like going back in the days to RubyGems and then, you know, Python modules and all these things.

This is very effective.

Then we go to Amazon and we have ECS and there's none of that.

So we get all these snowflake infrastructures built in every organization to spec.

And then every time you come into a new company, at least for us as contractors, you know, no two environments look the same.

They're using the same tool stack.

But there are too many degrees of variation and I don't like that.

So this is where I think that part of the success of Terraform has been that the language isn't that powerful and that you are constrained in some of these things.

And then the concept of modules is that registry component, which allows for tremendous reusability across organizations.

And that's what I'm all for.

And that's like our whole mission statement at Cloud Posse: to build reusable infrastructure across organizations that's consistent and reliable.

So back to the CDK question, and the answer that I gave the customer was this.

Let's do this.

Let's continue to roll out Terraform for all the foundational infrastructure, the stuff that doesn't change that much, the stuff that's highly reusable across organizations.

And then let's consider letting your developers use CDK for the last layer of your infrastructure.

What I'm talking about there.

And I'm not sure at what point you joined, but in the very beginning of the call I talked about what I always talk about, which are the four layers of infrastructure.

Layer 1 is foundational infrastructure, layer 2 is your platform, layer 3 is your shared services, and layer 4 is your applications. For your applications, go for it.

Go nuts. You may use CloudFormation, you may use the Serverless Framework; like, if somebody is using the Serverless Framework, which is purpose-built for doing Lambdas and providing the structure, other rails but for Lambdas, use it.

I'm not going to say, because we use Terraform in this company, you're not going to be able to use Serverless; that's not the right answer.

So the answer is that it's going to depend on where you're operating.

Yeah, I really like that four-layer model.

I missed that from the beginning of the call, but that really makes a lot of sense, because you want to have your foundations a little bit more rigid; you don't want to have that mush that you described earlier.

And that's where I think, at a lower level, the tight constructs that Terraform gives you, the high opinionation, I should say, that makes sense, because you can only do so much.

And moreover, it used to be with Terraform, you know, you'd have terraform plan, and then terraform apply could be quite different.

But I think this ecosystem has become much more mature at this point.

Yeah And they and they really do a good job predicting when they're going to destroy your shit.

Yeah And yeah.

And they added just enough more functionality to HCL to make it less painful.

Which I think is going to quell some of the arguments around Turing completeness.

And then the other thing I wanted to say related to that is, like, the problem we had pre-HCL2 all the time was "count cannot be computed".

That was the bane of our existence.

And one of our top linked articles in our documentation was, like, all the reasons why count cannot be computed.

Now we don't see it as much anymore.

So I'm a lot happier with that.

The only other thing I was going to add, and I'm not sure I'm 100% on this anymore.

I was alone.

Well, I wasn't 100.

So I was maybe 50, 60% before.

Now maybe 30, 40, but I was wondering, like, maybe HCL is more of, like, a CSS language and you need something like Sass on top of it to give a better developer experience.

But for all the reasons I mentioned about CDK, my concern is that we would get back into this problem of non-reusable, vendor lock-in kind of solutions, and unless it comes from HashiCorp you run the risk of running afoul of kind of the vision they see for the product.

Also, Alex Siegmund shared it in the Zoom chat; Alex, can you post it to the SweetOps office-hours channel as well?

Yeah, this is the.

This is the talk that Anton Babenko did at FOSDEM; the recording has been posted, and he just posted it on LinkedIn.

I'll look it up after this call and share it.

I do think, actually, though, boy, this has been a long conversation today.

I think we're already at the end here.

Are there any last parting questions?

No, more just to thank you.

Thanks for calling me out earlier.

And then taking the whole hour to talk about Terraform.

I appreciate that.

Well, that's great.

I really enjoyed today's session.

As always. So let's see, I'm going to just wrap this up here with the standard spiel.

All right, everyone, looks like we reached the end of the hour.

That about wraps things up.

Remember to register for our weekly office hours if you haven't already: go to cloudposse.com/office-hours.

Again, that's cloudposse.com/office-hours.

Thanks again, everyone, for sharing your questions.

I always get a lot out of this, and I hope you learned something from it.

A recording of this call will be posted to the office hours channel and syndicated to our podcast at podcast.cloudposse.com.

So see you guys next week.

Same place same time.

Thanks a lot.

Thank you, sir.

Public “Office Hours” (2020-01-29)

Erik Osterman | Office Hours

1 min read

Here's the recording from our DevOps “Office Hours” session on 2020-01-29.


Jenkins Pros & Cons (2020)

Erik Osterman | CI/CD, Cloud Technologies, DevOps

9 min read

I spent some time this weekend to get caught up on the state of Jenkins in 2020. This post will focus on the pros and cons of Jenkins (not Jenkins X – which is a complete rewrite). My objective was to set up Jenkins following “Infrastructure as Code” best practices on Kubernetes using Helm. As part of this, I wanted to see a modern & clean UI throughout and create a positive developer experience. Below is more or less a braindump of this experiment.

Cons

  • Jenkins has a lot of redundant plugins. Knowing which one to use takes some experimentation and failed attempts. The most common example cited is “docker”. Personally, I don't mind the hunt – that's part of the fun.
  • Jenkins has many plugins that seem no longer maintained. It's important to make sure whatever plugins you choose are still receiving regular updates (as in something pushed within the last ~12 months).
  • Not all plugins are compatible with Declarative Pipelines. IMO using Declarative Pipelines is the current gold standard for Jenkins. Raw imperative Groovy pipelines are notoriously complicated and unmanageable.
  • No less than a few dozen plugins are required to “modernize” Jenkins. The more plugins, the greater the chance there will be problems during upgrades. This can be somewhat mitigated by moving towards using command-line driven tools run inside containers as opposed to installing some of the more exotic plugins (credit: Steve Boardwell).
  • There's no (maintained) YAML interface for Jenkins Pipelines (e.g. Jenkinsfile.yaml). Most modern CI/CD platforms today have adopted YAML for pipeline configuration. In fact, Jenkins X has also moved to YAML. The closest thing I could find was an alpha-grade prototype with no commits in 18 months.
  • The Kubernetes Plugin works well but complicates Docker Builds. Running the Jenkins Slaves on Kubernetes and then building containers requires some trickery. There are a few options, but the easiest one is to modify the PodTemplate to bind-mount /var/run/docker.sock. This is not a best practice, however, because it exposes the host OS to bad actors. Basically, if you have access to the docker socket, you can do anything you want on the host OS. The alternatives like running “Podman”, “Buildah”, “Kaniko”, “Makisu”, or “Docker BuildKit” on Jenkins have virtually zero documentation, so I didn't try them.
  • The PodTemplate approach is problematic in a modern CI/CD environment. Basically, with a PodTemplate you have to define, before the Jenkins slave starts, the types of containers you're going to need as part of your pipeline. For example, you define one PodTemplate with docker, golang and terraform. When the Jenkins slave starts up, a Kubernetes Pod will be launched with 3 containers (docker, golang and terraform). One nice thing is that all those containers will be able to share a filesystem and talk over localhost since they are in the same Pod. Also, since it's a Pod, Kubernetes will be able to properly schedule where that Pod should start, and if you have autoscaling configured, new nodes will be spun up on-demand. The problem with this, however, is subtle. What if you want a 4th container to run that is a product of the “docker” container and share the same filesystem? There's no really easy way to do that. These days, we frequently will build a docker container in one step, then in the next step run that container and execute some tests. I'm sure this can be achieved, but nowhere near as easily as with Codefresh.
  • It's still not practical today to run “multi-master” Jenkins for High Availability without using Jenkins Enterprise. That said, I think it's moot when operating Jenkins on Kubernetes with Helm. Kubernetes is constantly monitoring the Jenkins process and will restart it if unhealthy (and I tested this inadvertently!). Also, when using Helm if the rollout fails health checks, the previous generation will stay online allowing the bugs to be fixed.
  • Docker layer caching is non-trivial if running with Ephemeral Jenkins Slaves under Kubernetes. If you have a large pool of nodes, chances are that every build will hit a new node, thus not taking advantage of layer caching. Alternatively, if using the “Docker in Docker” (dind) build-strategy, every build will necessarily pull down all the layers. This will add considerably to both transit costs and build times, as docker images are easily 1 GB these days.
  • There's lots of stale/out-of-date documentation for Jenkins. I frequently stumbled on how to implement something that seemed pretty basic. Anyways, this is true of any mature ecosystem that has a tremendous amount of innovation, lots of open source contributions, and has been around for 20 years.
  • The “yaml” escape hatch for defining pod specs is sinfully ugly. In fact, I think it's a horrible precedent that will turn people off from Jenkins. It's part of what gives it a bad rap. The rest of the Jenkinsfile DSL is rather clean and readable, but embedding raw YAML into my declarative pipelines is not a practice I would encourage for any team. To be fair, some of the ugliness could be eliminated by using readFile or readTrusted steps (credit: Steve Boardwell), but again it’s not that simple.


Pros

I would like to end this on a positive note. All in all, I was very pleasantly surprised by how far Jenkins has come in the past few years since we last evaluated it.

  • Helm chart makes it trivial to deploy Jenkins in a “GitOps” friendly way
  • Blue Ocean + Material Theme for Jenkins makes it look like any other modern CI/CD system
  • Rich ecosystem of Plugins enables the ultimate level of customization, much more than any SaaS
  • Overall simple architecture to deploy (when compared to “modern” CI/CD systems). No need to run tons of backing services.
  • Easily extract metrics from your build system into a platform like Prometheus. Centralize your monitoring of things running inside of CI/CD infrastructure. This is very difficult (or not even possible) to do with many SaaS offerings.
  • Achieve greater economies of scale by leveraging your existing infrastructure to build your projects.  If you run Jenkins on Kubernetes, you immediately get all the other benefits. Spin up node pools powered by “ Ocean” and get cheap compute capacity with preemptible “spot” instances. If you're running Prometheus with Grafana, you can leverage that to monitor all your build infrastructure.
  • Integrate with Single Sign-On without paying the SSO tax levied by enterprise software.
  • Arguably the Jenkinsfile Declarative Pipelines DSL is very readable, in fact, it looks a lot like HCL (HashiCorp Configuration Language). To some, this will be a “Con” – especially if YAML is a requirement.
  • Jenkins “Configuration as Code” plugin supports nearly everything you'd need to run Jenkins itself in a “GitOps” compliant manner. And if it doesn’t there is always the configuration-as-code-groovy plugin which allows you to run arbitrary Groovy scripts for the bits you need (credit: Steve Boardwell).
  • Jenkins can be easily deployed for multiple teams. This is an easy way to mitigate one of the common complaints that Jenkins is unstable because lots of different teams have their hands in the cookie jar.
  • Jenkins can be used much like the modern CI/CD platforms that use container steps rather than complicated Groovy scripts. This is to say, yes, teams can do bad things with Jenkins but with the right “best practices” your pipelines should be about as manageable as any developed with CircleCI or Codefresh. Stick to using container steps to reduce the complexity in the pipelines themselves.
  • Jenkins Shared Libraries are also pretty awesome (and also one of the most polarizing features). What I like about the libraries is the ability for teams to define “Pipelines as Interfaces”. That is, applications or services in your organization should almost always be deployed in the same way. Using versioned libraries of pipelines helps to achieve this without necessarily introducing instability.
  • Just like with GitHub Actions, with Jenkins, it's possible to “auto-discover” new repositories and pipelines. This is sweet because it eliminates all the ClickOps associated with most other CI/CD systems including CircleCI, TravisCI, and Codefresh. I really like it when I can just create a new repository, stick in a Jenkinsfile, and it “just works”.
  • Jenkins supports what seems like an unlimited number of credential backends. This is a big drawback with most SaaS-based CI/CD platforms. With the Jenkins credential backends, it's possible to “plug and play” things like “AWS SSM Parameter Store”, “AWS Secrets Manager” or HashiCorp Vault. I like this more than trusting some smaller third-party to securely handle my AWS credentials!
  • Jenkins PodTemplates support annotations, which means we can create specially crafted templates that will automatically assume AWS roles. This is rad because we don't even need to hardcode any AWS credentials as part of our CI/CD pipelines. For GitOps, this is a holy grail.
  • Jenkins is 100% Free and Open Source. You can upgrade and get commercial support from CloudBees which also includes a “tried and tested” version of Jenkins (albeit more limited in the selection of plugins).

To conclude, Jenkins is still one of the most powerful Swiss Army knives to get the job done. I feel like with Jenkins anything is possible, albeit sometimes with more effort and 3 dozen plugins. As systems integrators, we're constantly faced with yet-unknown requirements that pop up at the last minute. Adopting tools that provide “escape hatches” provides a kind of “peace of mind” knowing we can solve any problem.

Parts of it feel quite dated, like the GUI configurations, but that is mitigated by Configuration as Code and GitOps. I wish some things, like building and running Docker containers inside of pipelines on Kubernetes, were easier. Let's face it. Jenkins is not the cool kid on the block anymore and there are many great tools out there. But the truth is few will stand the test of time the way Jenkins has in the Open Source and Enterprise space.

Must-Have Plugins

  • kubernetes (we tested 1.21.2) is what enables Jenkins Slaves to be spun up on demand. It comes preconfigured when using the official Jenkins Helm chart.
  • workflow-job (we tested 2.36)
  • credentials-binding (we tested 1.20)
  • git (we tested 4.0.0)
  • workflow-multibranch (we tested 2.21) is essential for keeping Jenkinsfiles in your repos. The multi-branch pipeline detects branches, tags, etc, from within a configured repository, so Jenkins works more like Circle, Codefresh or Travis.
  • github-branch-source (we tested 2.5.8) – once configured will scan your GitHub organizations for new repositories and automatically pick up new pipelines. I really wish more CI/CD platforms had this level of autoconfiguration.
  • workflow-aggregator (we tested 2.6)
  • configuration-as-code (we tested 1.35) allows nearly the entire Jenkins configuration to be defined as code
  • greenballs (we tested 1.15) because I've always thought green is the color of success =P
  • blueocean (we tested 1.21.0) gives Jenkins a modern look. It clearly depicts stages and progress, as well as most other systems we've seen like CircleCI, Travis or Codefresh.
  • pipeline-input-step (we tested 2.11)
  • simple-theme-plugin (we tested 0.5.1) allows all CSS to be extended. Combined with the “material” theme for Jenkins you get a complete facelift.
  • ansicolor (we tested 0.6.2) – because many tools these days have ANSI output like Terraform or NPM. It's easy to disable the color output, but as a developer, I like the colors as it helps me quickly parse the screen output.
  • slack (we tested 2.35)
  • saml (we tested 1.1.5)
  • timestamper (we tested 1.10) – because with long-running steps, it's helpful to know how much time elapsed between lines of output

Pro Tips

  • Add the following nginx-ingress annotation to make Blue Ocean the default: /blue/organizations/jenkins/pipelines
  • Use this Material Theme for Jenkins with the simple-theme-plugin to get a beautiful looking Jenkins
  • Hide the [Pipeline] output with some simple CSS and the simple-theme-plugin:
    .pipeline-new-node {
      display: none;
    }


When researching this post and my “Proof of Concept”, here are some of the links and articles I referenced.

Public “Office Hours” (2020-01-23)

Erik Osterman | Office Hours

1 min read

Here's the recording from our DevOps “Office Hours” session on 2020-01-23.


Machine Generated Transcript

Let's get the show started.

Welcome to Office hours.

It's January 22nd 2020.

My name is Erik Osterman and I'll be leading the conversation.

I'm the CEO and founder of Cloud Posse.

We are a DevOps accelerator.

We help startups own their infrastructure in record time by building it for you and then showing you the ropes.

For those of you new to the call the format of this call is very informal.

My goal is to get your questions answered.

Feel free to unmute yourself anytime if you want to jump in and participate.

Excuse my dog.

He's having a fun time downstairs here.

We host these calls every week.

We'll automatically post a video recording of this session to the office hours channel as well as follow up with an email.

So you can share it with your team.

If you want to share something in private, just ask and we'll temporarily suspend the recording.

With that said, let's kick it off.

Here's the agenda for today.

Some talking points that we can cover just to get the conversation started.

One of the announcements, which is pretty good for all of us using EKS, is that the master nodes have now come down in cost.

It's ugly.

It's a 50% reduction across the board.

Again, this is only on the cluster itself, like the master nodes.

It has no bearing on your worker nodes themselves.

So there's a link to that.

The other good news is terraform-docs has finally come out with a release: they have an official 0.8.0 release.

It's actually up to 0.8.1 already this morning, and it supports HCL2.

If your team isn't already using terraform-docs, I highly recommend it.

It's a great way to generate markdown documentation for all your Terraform code automatically.

Some people use pre-commit hooks.

We have our own README generator that we use it for; other people use it to validate that your inputs have descriptions set and stuff like that.

Also, we launched our references section.

This is if you go to cloudposse.com/references.

This is links to articles and blog posts around the web that have been written about how to use some of our stuff, since I think it's interesting to see how others have come to use our modules.

This is a great resource for that.

And the first technical topic of the day that I'd like to talk about, if there are no other questions, would be using Open Policy Agent with Terraform, and then I'm also curious about firsthand accounts using Jenkins X. So with that said, let's kick this off.

Any questions?

Are you sharing the right screen?

Oh my god.

I did not share my screen.

All right.

Let's do that alone.

Let's see here.

Sure I am not showing the right screen.

You see my notes.

All right.

The magic is gone.

All right share this window.

Hey you.

All right.

Thanks Ed.

There we go.

All right.

Well, if there are no other questions, I want to share a status update on behalf of Brian Tye. He's the guy over at AuditBoard (or SOXHUB) that had the challenge with those random Kubernetes nodes getting rebooted.

Well, he finally got to the bottom of it, and everything else he did to try and treat it wasn't actually related to the problem.

So they rely on... shoot, not Datadog.

What's that monitoring tool?

It started off as an open source project tracking all the traffic in your cluster.

Andrew, help me out here.

Sorry, what is this doing?

Yeah, I think Sysdig is what they're using.

So, Sysdig for monitoring.

So they have Sysdig installed on all their nodes in the cluster, and he was finally able to find the stack trace that was getting thrown by going to the AWS web console.

And looking at the terminal output of one of the nodes that had rebooted.

We thought we'd actually done this got to actually do it.

So this is where the actual exception was finally caught.

The problem was a null pointer exception in Sysdig running on their nodes.

That was causing the nodes to crash.

So it was actually their monitoring platform itself.

A SaaS platform, albeit, but nonetheless that was what was causing the nodes to crash.

So they got that fixed, and now their nodes are not crashing anymore.

They have a cluster that some people spun up, and they just kind of gave everyone access to it, like, you know, this is just kind of a playground, whatever.

And which in theory sounds nice, but what's happening is.

People are not being good stewards of using the cluster properly and setting resource limits and requests and everything.

And so it's causing like starvation on the nodes and it's causing not just the one team to have issues, but like anyone using the cluster to have issues and everyone's going what's going on with the cluster.

My stuff's not... I don't think my stuff is hurting the cluster.

And it's everyone pointing fingers at each other.

That's all it's turning into.

And it's just a giant.

So what are your thoughts on how to resolve that?

Don't do that.

Well, yeah, I agree that everyone should play fairly with each other.

But maybe there are some pretty simple ground rules, like don't give anyone cluster admin.

And if a team wants to use the cluster as a sandbox, create a namespace, then create a service account for that namespace that has whatever access they want within that namespace.

Create a limit range on that namespace so that if they don't set a request or a limit, it sets a default; then they don't have access to the entire cluster, they just have access to the namespace itself.
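As a rough illustration of the sandbox setup described above, here is a minimal Python sketch that builds the three Kubernetes manifests (Namespace, ServiceAccount, LimitRange) as plain dictionaries. The `sandbox-<team>` naming scheme, the `<team>-deployer` account name, and the CPU and memory defaults are all assumptions made up for this example, not values from the call. A Role and RoleBinding scoping the service account to the namespace would still be needed and are left out here.

```python
import json
from typing import Dict, List


def sandbox_manifests(team: str) -> List[Dict]:
    """Build Namespace, ServiceAccount, and LimitRange manifests for one team."""
    namespace = f"sandbox-{team}"  # hypothetical naming scheme
    return [
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": namespace}},
        {
            "apiVersion": "v1",
            "kind": "ServiceAccount",
            # Hypothetical account name; bind it to a namespaced Role, not a ClusterRole.
            "metadata": {"name": f"{team}-deployer", "namespace": namespace},
        },
        {
            "apiVersion": "v1",
            "kind": "LimitRange",
            "metadata": {"name": "defaults", "namespace": namespace},
            "spec": {
                "limits": [
                    {
                        "type": "Container",
                        # Used when a container sets no resource request.
                        "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                        # Used when a container sets no resource limit.
                        "default": {"cpu": "500m", "memory": "512Mi"},
                    }
                ]
            },
        },
    ]


if __name__ == "__main__":
    # Print each manifest as JSON, which kubectl accepts as well as YAML.
    for manifest in sandbox_manifests("payments"):
        print(json.dumps(manifest, indent=2))
```

The point of the LimitRange is that a pod deployed without requests or limits still gets bounded defaults, so one careless workload is less likely to starve the node for everyone else.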

And we get people bitching about it.

Well, I need to set, you know, I need to set a cluster role or whatever.

It's like, well, if you're on cluster.

Yeah OK.

I agree with that.

That's basically what was going to be my recommendation, exactly what you said there.

The only other augmentation to that, though I think it's going to take more effort, would be to deploy one of these Open Policy Agent based solutions that require limits to be set and so on.

Now, for the roles, that can perhaps be solved if you had GitOps for this cluster: you can add the roles, but that has to go through a different process.

Right, anything that is global to the cluster, no one has access to it.

And if they want something in the cluster, it goes through CI/CD, like Jenkins or whatever.

OK gotcha.

You know that.

So they register you know so they say I have this app you know I'm going to push up.

I'm going to push up new images to this container registry, you know, and then set up Harness or Flux or Argo or whatever on the other end, you know, and it requires kind of an ops team, which is not super DevOps.

But at the same time, it's like if I give access to the cluster to everyone.

It's just going to cause tons of problems.

So like, yes, I have an ops team.

And you are the dev team.

But like I want you to come work with me to set up your Argo stuff.

And then we'll be in communication on Slack and we can join each other's standups or whatever, and do that for, you know, DevOps collaboration, versus here's access to the cluster, do whatever you want, which is a free-for-all.

But there are consequences.

So one of the things you said just jogged a memory: I met with Marcin.

He's the founder of Spacelift; I shared it in SweetOps earlier this week.

And what's pretty neat with what he's doing is two things that come from experience doing GitOps.

One is addressing a big problem with Terraform Cloud.

The problem is there's no good way for you to bring your own tools.

You have to bake those into your git repo, and if you have like 200 git repos that all depend on a provider, you have to bake binaries into 200 repos, and that's no fun.

Or if you depend on other tools and use local-exec provisioners and stuff like that, there's no fun way to do that.

It's terrible, so in Spacelift he took a different approach.

You can bring your own Docker container.

Kind of like what you have, Andrew, and what we have with Geodesic, with your toolchain in there.

But the other thing that he does.

And this is what jogged my memory with what you're saying is sometimes you need escape hatches.

Sometimes you need a way to run ad hoc commands.

But we also want those to be auditable.

An example is you mess up the Terraform state and you need to unlock it, like the Terraform state is locked somehow.

You need to run force-unlock.

How do you do that?

The other example is you're running an actual real live environment and you need to refactor where your projects are.

So you need to move resources between Terraform states.

So you need to use the Terraform import command to do that.

How do you do that in an auditable way.

So what he's done is, I forget what he calls it here.

I think he calls it tasks, but he provides a way for you to run one-off tasks as part of your CI/CD.

The other thing he supports, since it's a SaaS product right now, is that you delegate roles to the SaaS, and then those roles allow this to run without hardcoded credentials, which is the big problem with Terraform Cloud, as you've got to hardcode your credentials.

And now, Terraform and the Open Policy Agent.

I posted a bunch of links this week in the Terraform channel related to the Open Policy Agent.

I confess it's been brought up several times before; Blaze, one of our community members, I know has been doing some POCs on this, but it was Marcin showing me his demo of how he's integrated it into Spacelift using OPA that got me really excited.

So I started doing some research on that.

And then I saw it's actually pretty straightforward how you could integrate this into your CI/CD pipelines.

So if you look at these two links here.

I'll post them as well to the office hours channel.

I know I forgot to also announce that office hours have started, so let me ping everyone that office hours started.

Yeah, right.

So, office hours.

So there's the link to the Open Policy Agent.

So in here they provide an example of basically how this works.

So here's an example of a typical resource that you define in Terraform code, and then what you do is you basically generate the plan as JSON using this argument.

Now you have this JSON plan, and that makes it very easy to operate on the output of this in the Open Policy Agent.

So they have some examples down here of where you can now describe the policy for this.

So here it has an example of scanning over the plan.

Let's see what this is.

So, the number of deletions of resources of a given type.

So here he's counting the resources. You could have, for example, a policy which is that, hey, if there are no deletions, then maybe reduce the number of approvals that you need to get this PR through.
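To make the "count the deletions, then decide on approvals" idea concrete, here is a small Python sketch. It is a stand-in for a Rego policy, not OPA itself. It assumes the plan has already been exported as JSON (in Terraform 0.12 this is what `terraform show -json <planfile>` produces) and reads the `resource_changes` list, where each entry carries a `type` and a `change.actions` array. The sample plan and the one-versus-two approvals rule are invented for illustration.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed plan in the shape of Terraform's JSON plan output.
SAMPLE_PLAN = """
{
  "resource_changes": [
    {"address": "aws_instance.web", "type": "aws_instance",
     "change": {"actions": ["create"]}},
    {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
     "change": {"actions": ["delete"]}},
    {"address": "aws_instance.old", "type": "aws_instance",
     "change": {"actions": ["delete", "create"]}}
  ]
}
"""


def deletions_by_type(plan: dict) -> Counter:
    """Count planned deletions (including replacements) per resource type."""
    counts = Counter()
    for resource_change in plan.get("resource_changes", []):
        if "delete" in resource_change["change"]["actions"]:
            counts[resource_change["type"]] += 1
    return counts


def required_approvals(plan: dict) -> int:
    """Hypothetical rule: one approval if nothing is deleted, otherwise two."""
    return 1 if not deletions_by_type(plan) else 2


if __name__ == "__main__":
    plan = json.loads(SAMPLE_PLAN)
    print(dict(deletions_by_type(plan)), required_approvals(plan))
```

In a real pipeline the same check would more likely live in a Rego policy evaluated by OPA or conftest; the Python version only shows how little logic is involved once the plan is JSON.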

So then someone else in the community posted a link to conftest, which seems even more rad because it adds support for lots of different languages or types.

So it supports JSON and YAML out of the box, but they also have experimental support for HCL2.

So now you can actually do policy enforcement on your Terraform using totally open source tools.

So that should play well with your Jenkins pipelines, Codefresh pipelines, et cetera. Andrew, have you given this a shot at all?

I have not been able to do much experimentation and learning lately, trust me.

So, no. What about Carlos?

This sounds like something that might be up your alley, given the level of sensitivity of some of the things you do operating at a hedge fund.

Carlos, are you with us?

Sorry, I was muted.

Yeah, yes.

Sounds very interesting.

I haven't seen it until now.

Cool Yeah.

You are you see.

I reposted those links into my company's Slack, in my Terraform channel.

And somebody said open source Sentinel, which I hadn't heard of.

Oh, yeah.

Yeah, because Sentinel is HashiCorp's enterprise offering.

So yeah, they were saying that what I was linking was open source Sentinel.

Yeah, the equivalent of, basically, open source Sentinel.

Gotcha, exactly.

I hadn't heard of Sentinel.

Yeah, so Sentinel is an offering of Terraform Enterprise on-prem.

I don't believe you can have it on the cloud version, the hosted version.

I believe, like, the way that OPA has kind of created this standardized format for creating these policies using this language.

It's called Rego, right?

Oh, is that the name of it?

Yeah, the OPA language is called Rego.

And the fact that OPA has created this standard.

It's going to have to be something super awesome nowadays for me to go with something that doesn't support OPA.

Yeah, like, you know, I'm looking at Sentinel, and it's going to be like this, you know, proprietary thing that works.

You know, for this one aspect.

And that's kind of it, versus what's the reason I went to Terraform in the beginning?

Well, it's because it was standardized, like, you know, if I go AWS, Azure, GCP, whatever, you know, I can use Terraform.

I don't have to learn some new tool.

I'm feeling that same kind of mentality now.

Yeah.

Yeah, I think so.

It's good for kind of testing or validation of all these formats, because certainly, I mean, let's face it, the tools that we use move bleedingly fast and many of them lack sufficient validation in what they have for features.

But now, let's say Helmfile, for example: using this, it would be very easy to express some policies around Helmfiles, basically codifying your best practices.

Now you can do that around Helmfile, and do the same for Dockerfiles, et cetera.

So, yeah, like what?

So for Helmfile, what would be a good best practice? That everything has an installed flag; that should be enforced in all of your Helmfiles.

What else would be a good one? That you're pinning image tags or something.

I'm struggling right now off the top of my head.

I wasn't prepared for it.

But yeah, I think we could find some things to validate in Helmfile, like the installed flag.
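As a sketch of what codifying such best practices might look like, the Python below checks a Helmfile document that has already been parsed from YAML into a dictionary. It flags releases that do not set the `installed` flag explicitly or do not pin a chart `version`. Checking the chart version is a stand-in for the image tag pinning mentioned above, since where image tags live depends on each chart's values. The rule set and the message wording are assumptions for illustration; in practice this would more likely be a Rego policy run through conftest.

```python
from typing import Dict, List


def check_releases(helmfile: Dict) -> List[str]:
    """Return a list of policy violations for the releases in a parsed Helmfile."""
    violations = []
    for release in helmfile.get("releases", []):
        name = release.get("name", "<unnamed>")
        # Rule 1 (assumed): every release states explicitly whether it is installed.
        if "installed" not in release:
            violations.append(f"{name}: 'installed' flag is not set explicitly")
        # Rule 2 (assumed): every release pins a chart version.
        if not release.get("version"):
            violations.append(f"{name}: chart version is not pinned")
    return violations


if __name__ == "__main__":
    # Hypothetical parsed helmfile.yaml content.
    document = {
        "releases": [
            {"name": "api", "chart": "stable/api", "version": "1.2.3", "installed": True},
            {"name": "web", "chart": "stable/web"},
        ]
    }
    for violation in check_releases(document):
        print(violation)
```

A CI step could fail the build whenever the returned list is non-empty.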

Oh, so Helmfile has moved in the direction of mirroring the interface of Terraform. Terraform has a plan phase and an apply phase. Now, in Terraform the plan phase can actually generate a plan, and then you can execute that plan.

In Helmfile, it has something analogous to the plan.

But not an artifact like a plan file.

So what I'm referring to is helm diff.

That's right.

So helm diff shows you the potential changes without making those happen.

Well, then Helmfile has two other commands.

One is apply and the other is destroy, and those mirror Terraform.

So the apply command honors the installed flag, versus sync.

I don't believe sync does.

So when you run sync, I don't think it'll uninstall things.

But apply will uninstall things when you run Helmfile.

Interesting, that's not what the documentation is saying.

But maybe it does more.

OK, let's get that right.

I might be wrong.

There is a nuance like that.

Let's see. What the documentation is saying is that the difference between apply and sync is that apply runs diff, and if diff shows that something is different, then it runs sync.

Am I getting that right?

So, for apply: the helmfile apply sub-command begins by executing diff.

If diff finds that there are changes, sync is executed. Adding the interactive flag instructs Helmfile to request your confirmation. Versus sync: sync says sync all resources from the state file.
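The control flow just read from the documentation (run diff first, and only run sync when diff reports changes) can be sketched in a few lines of Python. This is an illustration of the described behavior, not Helmfile's implementation. The function name `apply_like` and the two callables are hypothetical placeholders for whatever actually computes the diff and performs the sync.

```python
from typing import Callable, List


def apply_like(diff: Callable[[], List[str]], sync: Callable[[], None]) -> bool:
    """Run diff first and call sync only if diff reports changes.

    Returns True when sync was executed, False when there was nothing to do.
    """
    changes = diff()
    if not changes:
        return False  # nothing differs, so sync is skipped entirely
    sync()
    return True


if __name__ == "__main__":
    # Hypothetical stand-ins: a diff that reports one change, a sync that logs.
    ran = apply_like(
        diff=lambda: ["release api: image tag changed"],
        sync=lambda: print("syncing all releases"),
    )
    print("sync executed:", ran)
```

A plain sync, by contrast, corresponds to calling `sync()` unconditionally, which is why apply is the cheaper choice in a CI workflow when nothing has changed.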

Yeah So I don't think it mentions it.

And it doesn't work like that.

It doesn't mention anything about that installed flag.

I suspect before consent either I could be wrong.

Maybe maybe it doesn't work that way.

So, you know, what I said, maybe I am misled, or maybe the functionality has changed since we started using it.

But the idea is that apply is intended to be used as part of your CI workflow.

We're getting into Helmfile.

And I'm starting to try to come up with some best practices, because, you know, some of us now have more experience with Helmfile than others.

And so we're looking at things like the cleanup-on-fail flag, for example.

Is it a best practice to have that on, or just leave it as the default, false?

Yeah. Well, so I would be careful about that in production, and I'd be recommending it perhaps in staging.

But it depends.

Maybe not in all staging environments, maybe just on your demo or preview type environments.

This just came up today.

So let me explain a little bit more.

So as we know, with Helm, if you do an upgrade it will bail on you if the previous release has failed.

So then you can't move forward.

And if you add the force flag, for example, then it will uninstall that failed release and then reinstall.

But that might be pretty disruptive if you're running in a production environment, especially since a failed upgrade or a failed deployment doesn't mean a dysfunctional service.

It just means the service is in an unknown state.

So this is why you might not want to clean up resources, if it'll help you debug what went wrong.

However, in a staging environment or preview environments where you're deploying frequently and you don't want your builds to constantly fail, especially when you're developing, especially when you know things might be breaking and unstable, then I'd like the force flag, and possibly the cleanup flag, to just keep things humming along.

Even in staging? I would totally agree with that for dev, but in staging?

Wouldn't you want to have like.

My thought process for staging is like, all right, I want to have whatever production has now.

And then run the exact same command that you're going to run in production and see what happens.

Yeah, I'm sorry, we're just overloading the term staging, like every company does.

Staging in your context is like a pre-production.

Yeah, staging to us is an account stage where you will have environments with different designations, so one designation would be pre-production, and pre-production should be almost identical to production.

But then you have unlimited staging or preview environments.

Those are also running in staging or staging accounts.

But they have a different designation.

So it's just that different companies use these terms differently.

And that.

So we're saying the same thing.

We're just using different terms.

Erik can correct me if I'm wrong, but Helm will still deploy over a failed deploy if there were previous deployments that passed, right?

My understanding is that if the most recent deploy failed, the upgrade will fail unless you have the force flag.

But I don't want to claim expertise on that.

I've been using Helmfile so long.

So Helmfile kind of redefines some of those behaviors, and if you use the force flag on Helmfile, the first thing it does is an uninstall of the release if it has failed, to ensure that the release successfully installs.

But the raw Helm functionality, maybe if somebody knows for sure, please chime in.

I'm almost positive that's the case.

Just because I have instances where the initial deploy will fail, and then I won't be able to deploy to that again until I do like a delete purge.

Right. I will have long-running deployments that will fail, and then running another deploy does work.

And I don't have to like delete that.

Delete that.

So you're saying when there are multiple Helm releases, you're able to do it.

But if there's only one release, you can't.

So for a particular release, you can if it's been deployed successfully before; the most recent deploy can have failed, as long as it was successful before.

OK, that might be the case.

I believe, at least in my case, I haven't seen that.

But definitely, if it's the initial deploy of a release and it's never been deployed before and it fails, you cannot redeploy it until you purge that release.

Yeah So maybe what we're seeing to see my screen here.

And I think it's the atomic flag that helps with this magic here.

Last nice.

By the way Brian, I shared just before you joined I think the resolution to your issue with the random node reboots.

Oh, yeah.

Yeah, just for those that ever run into this: check the EC2 system logs.

They'll have all of your kernel logs, versus checking on the box or, I guess, shipping them yourself, but I think if you're on AWS, they're shipped for you.

She module is.

That's cool. Any other questions or insights here?

I need to come up with a demo of infrastructure as code using just my local computer.

So self-contained like a demo.

Yeah Any ideas.

There's like I could do Vagrant with you know spinning up a couple of virtual machines.

I'd love to use Terraform, because that's what I use for real infrastructure as code.

And so if I'm just showing code.

I'm going to show Terraform code, but I need to be able to do a demo of like, hey, look at me spinning up actual infrastructure.

But I won't have access to an AWS account or anything, because multiple people are going to do this.

I assume you're saying, does it specifically need...

Siri woke up here.

Does it specifically need to demo AWS functionality, or demo IaC?

IaC, any kind of IaC.

Well, one food for thought:

If it jogs your memory because I know you're working on dad's garage you're equivalent of dude s sick one of the truck true prototypes.

They did here.

It was just AI think it was a Yeah.

So I did a prototype of working with Minikube, but then it also works with Docker for Mac, and I assume most of your company would have Docker for Mac installed, right?

You would think.

OK, maybe then the value of what I'm saying is moot.

But what I was going to say is, if you're running Docker for Mac, or probably Docker for Windows too, you can just check off this box, Enable Kubernetes.

That's not a bad idea though.

Or you could do Terraform code that just spins up Docker containers.

Yeah Yeah.

Because all I really care about is showing, you know, look, here is code that spun up Ubuntu with this exact version, with this exact amount of memory and CPU power that I told it to set up, and it spun up three of them because I told it to spin up three of them.

And now I'm going to kill them all with one command.
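
The idea behind that demo, declaring a desired state and letting a tool work out what to create or destroy, can be modeled in plain Python. This is a conceptual sketch only, not how Terraform or its Docker provider is implemented; the `Container` fields, the `demo-N` naming and the `ubuntu:18.04` image tag are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Container:
    name: str
    image: str
    memory_mb: int


def desired_state(count: int, image: str, memory_mb: int) -> List[Container]:
    """Expand a declarative spec (count, image, memory) into concrete resources."""
    return [Container(f"demo-{i}", image, memory_mb) for i in range(count)]


def plan(desired: List[Container], actual: List[Container]) -> Tuple[List[Container], List[Container]]:
    """Return (to_create, to_destroy) so that actual would match desired."""
    to_create = [c for c in desired if c not in actual]
    to_destroy = [c for c in actual if c not in desired]
    return to_create, to_destroy


if __name__ == "__main__":
    running: List[Container] = []
    create, destroy = plan(desired_state(3, "ubuntu:18.04", 512), running)
    print("create:", [c.name for c in create])
    running = create
    # "Kill them all with one command": the desired state becomes empty.
    create, destroy = plan([], running)
    print("destroy:", [c.name for c in destroy])
```

Changing the count or the image in the spec and planning again yields the difference, which is the part of the workflow the live demo is meant to show.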

Yeah, that's infrastructure as code right there.

Docker instead of virtual machines.

But the concept is the same.

Faster demo, lighter weight.

And as you know, there's a Docker provider, so you can also skip what I said about Kubernetes and just do plain vanilla Docker as well.

Can you provide the context of this demo? Is it like a lunch and learn for your co-workers?

It's for an intro to DevSecOps class.

So we've got a section on infrastructure as code and the, the customer wants a live demo of the infrastructure as code.

So this demo does include AWS.

But I did a talk on using Terraform to deploy a simple web API on an Auto Scaling group in AWS and then do a rolling deployment of that.

So that's kind of fun.

It's open source, so I can send it to you.

But it does require an AWS account.

But I actually was going to do this for my co-workers, and we'll just create a dummy AWS account, provide them the credentials, and then just ask them to do the terraform destroy after.

So you could technically do that right.

I guess it depends on the size of your class.

Yeah, that gives me some ideas.

Like, I could do something like, you know, given the way that you guys currently handle your infrastructure,

how would you handle doing an OS version upgrade?

Oh, well, you know, we'd have to SSH into each and every one of them, then apply all the commands and shepherd them through, the whole pets versus cattle thing. Versus I could be like, I'm going to change this variable and then redeploy.

That's a good idea.

I like that.

What I also did, which was really cool when I did my demo, was I deployed it to two different regions and two different environments, all with the same Terraform.

It was just a switch of a variable, and when I did that demo everybody appreciated that.

Yeah, that's a good one as well: you have identical environments.

But one is staging and one is prod; the prod one might have t2.xlarges, the staging one might have t2.micros.

Other than that, all else is the same.

Yeah, that reminds me of one of the posts that one of our customers shared with us, only because it was a nice summary of many people's experience with Terraform, from when you start to where you are today: when you start, you start with a small project.

And then you realize you need a staging environment.

So you clone that code in the same project.

And now you have two environments.

But any change is in the blast radius of all of those.

So then you move to having separate projects and all of that stuff.

I'll find that and link it in the Terraform channel. That was the first time I heard the term terralith, for Terraform monolith.

I love it.

Yeah, it's not a perfect description of it.

Everyone knows what you mean, if they've ever done it.

Yes, speaking of terraliths,

I was thinking about the way you had showed that you do the Terraform infrastructure for your clients, where each client gets a GitHub organization.

And then what's under the org?

Like, is it just one big terraform apply or whatever for the whole thing, or is it...

No, no.

Yeah, it's all decomposed.

So basically, our thing is that we've been using it.

So here's like an organization.

And then each AWS account represents basically a dedicated application, and so each one of those has a repository. But then you go into each one of these repositories.

And then they have software just like your applications have software.

But your applications.

What do they do.

They pin library versions.

So that's what we do here.

So you want to run the same versions of software in all these environments, all these AWS accounts, and we do that just by pinning to a remote module.

So here we can throw a remote module.

So when we make changes to this directory, that triggers a plan and apply workflow with Atlantis.

So there's a plan and apply just for the cloudtrail directory or whatever, rather than for the whole staging account?

Exactly. So basically, you open pull requests that modify any one of these projects.

And that will trigger it; you can modify multiple projects at the same time.

So there are multiple applies to do that?

Well, multiple plans and multiple applies to do that.

But all in the same pull requests and all automatic.
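
To make the "one pull request, several projects, one plan each" idea concrete, here is a small Python sketch that maps the files changed in a pull request to the project directories they belong to. It is an illustration of the concept rather than Atlantis's actual detection logic, and the directory names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def affected_projects(changed_files: Iterable[str], project_dirs: Iterable[str]) -> List[str]:
    """Return the project directories that contain at least one changed file."""
    projects = [PurePosixPath(p) for p in project_dirs]
    hit: Set[str] = set()
    for changed in changed_files:
        parents = PurePosixPath(changed).parents
        for project in projects:
            if project in parents:
                hit.add(str(project))
    return sorted(hit)


if __name__ == "__main__":
    changed = ["conf/cloudtrail/main.tf", "conf/kube-system/helmfile.yaml", "README.md"]
    projects = ["conf/cloudtrail", "conf/kube-system", "conf/vpc"]
    for project in affected_projects(changed, projects):
        print("plan needed for", project)
```

Each directory returned would get its own plan, and later its own apply, all reported back on the same pull request.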

Right OK.

What does that look like.

When you submit a pull request, does Atlantis say, you know, this affected three Terraform projects, and here are the plans for all three of them? Is that what it does?

Yeah, so, you know, the pull request was opened, not automatically; it was opened manually by a human, a developer.

Then the plan was kicked off automatically.

And then we see this output here.

This happens to be output from Helmfile, not Terraform, because we use Atlantis for both Helmfile and Terraform.

And we see here what's going to change.

And then somebody approved it.

And then we can surgically target what part of that we want to apply.

Or we can just say atlantis apply and it'll apply all the changes.

One thing to note about Atlantis, which is really nice, is that it locks the project that you're in.

So two developers can't be trying to modify, in this case, kube-system at the same time.

Because otherwise, those plans and applies would be undoing each other's changes, and they'd be at war with each other.
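
The locking behavior described here can be pictured with a tiny in-memory model. This is a sketch of the concept only, assuming a lock keyed by project name and owned by a pull request number; it is not Atlantis's implementation, and the class and method names are made up for the example.

```python
from typing import Dict


class ProjectLocks:
    """One pull request at a time may hold the lock for a given project."""

    def __init__(self) -> None:
        self._owner: Dict[str, int] = {}

    def acquire(self, project: str, pull_request: int) -> bool:
        """Take the lock if it is free; return True if this PR holds it afterwards."""
        owner = self._owner.setdefault(project, pull_request)
        return owner == pull_request

    def release(self, project: str, pull_request: int) -> None:
        """Release the lock, but only if this PR is the one holding it."""
        if self._owner.get(project) == pull_request:
            del self._owner[project]


if __name__ == "__main__":
    locks = ProjectLocks()
    print(locks.acquire("kube-system", 101))  # first PR gets the lock
    print(locks.acquire("kube-system", 102))  # second PR has to wait
    locks.release("kube-system", 101)         # e.g. after merge or close
    print(locks.acquire("kube-system", 102))
```

With the lock in place, the second pull request cannot plan against a state that the first one is about to change.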

So Brian any interesting projects you're working on.

I am starting my Prometheus initiative.

Oh, nice.

Very cool.

Well, are you going to take a federated approach, or use something like remote storage?

So basically, you have a centralized Prometheus, but then each cluster runs its own Prometheus, and the centralized Prometheus scrapes the other ones.

That's possible.

I'm not sure yet.

I actually just started yesterday looking at the helmfile for it.

But yeah, I'll be honest, I'm not super familiar with Prometheus.

It's just that I know it's a lot more lightweight than what we're currently running, which is Sysdig, which is third party.

But it takes up 30% of our CPU on our four 2xlarges [inaudible], and for a monitoring tool that's just asking for too much; it's also memory intensive.

So Prometheus obviously is on the other side of that spectrum where it is not a resource hog which is why I mean, yes.

I mean, it takes a brute-force approach to monitoring, right?

It does like packet inspection on everything happening on that box.

But I guess it's a testament to how fast CPUs have become and how cheap memory is that they're able to get away with doing that.

What does.

But it's still a problem when you're doing things like you are saying, yeah.

Yeah, have you guys gone the federated approach?

What would you say about that.

So we're starting a project in the next month or so with a new customer, and that's going to be doing federated Prometheus.

The reason for that in that case is they're going to be running dozens or more single-tenant dedicated environments of their stack for customers; it's basically enterprise SaaS.

So having dozens and dozens of Prometheuses and Grafanas would just be too much noise and too hard to centrally manage.

Also, scaling that architecture: if you only had a central Prometheus, it would be dangerous, because at some point you're going to hit a tipping point.

And it's just not going to handle it.

So with a federated approach, basically, you can run Prometheus, because Prometheus is basically a time series database, in addition to its ability to harvest these metrics.

Now, it can use a real time series database like InfluxDB, or Postgres with TimescaleDB, or whatever, and a bunch of others.

But the simplest way, I think, is just to think about Prometheus as basically a database.

So then what you can do is run smaller instances of Prometheus on each cluster with a shorter retention period.

So you can still get real-time data, but you don't need to run massive deployments of Prometheus, which can get expensive, because when running Prometheus, no less than 12 to 15 gigs allocated to the main Prometheus is required for any sort of retention longer than a week or two.

So in the federated model, you basically set up, on a shared services cluster, what we call core, one Prometheus instance that then scrapes the other ones.

And it can do that at its own pace, and you can also downscale the precision in the process of doing that,

If you need to.
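
As a rough picture of what "downscaling the precision" means, the Python sketch below averages fine-grained samples into coarser time buckets. How the precision is actually reduced in a real federated setup is not spelled out in the discussion (it may simply be a longer scrape interval on the central instance), so the bucket-averaging here is an assumption chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def downsample(samples: List[Sample], step: int) -> List[Sample]:
    """Average samples into buckets that are `step` seconds wide."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % step
        buckets[bucket_start].append(value)
    return [(start, sum(values) / len(values)) for start, values in sorted(buckets.items())]


if __name__ == "__main__":
    # Samples every 15 seconds on the per-cluster instance...
    per_cluster = [(0, 1.0), (15, 3.0), (30, 5.0), (45, 7.0), (60, 10.0)]
    # ...kept at one-minute resolution on the central instance.
    print(downsample(per_cluster, 60))
```

The per-cluster instances keep the detailed, short-lived data, while the central instance stores fewer points for a longer period.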

Plus, if you do need absolute real time, you could still have Grafana running on all the clusters.

And you can have real-time updates on those environments.

Different ways of looking at it. Do you guys use the remote storage?

We don't.

So it's been in our backlog of things to solve, and there are a few options for it.

The one that keeps coming up is Thanos, but when we look at the architecture, I mean, it suddenly increases the scope of what you've got to manage, right?

And when it's your monitoring system, it's really important that it's up.

So I understand and appreciate the need to have a reliable backend, but also, the more moving pieces, the bigger the problem is if something goes down.

And then what monitors the monitoring system, and all these things.

So we've gotten away with using EFS, and as scary as that sounds, it's actually less scary these days, because EFS has basically x compatibility and you can schedule.

You can reserve.


So the problems that we've had with Prometheus have been related to IOPS, and then we just bumped them up; you pay to play.

And it's not that much more to pay to play compared to engineering time.

So we just bumped up the IOPS and all our problems went away.

The other option, which I think Andrew's brought up before, is that you can also just allocate more storage, and then you get more IOPS, which is great.

First of all, it just gives you more credits.

So if the baseline of credits you're getting is not high enough given the amount of data you have, just store a bunch of random data.

Your signal's pretty bad, Andrew, at least for me. Is anyone else hearing the feedback a little bit?
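
The storage-padding idea can be turned into a quick back-of-the-envelope calculation. The sketch below assumes the commonly cited EFS bursting-mode baseline of 50 MiB/s per TiB stored; that figure is not stated in the conversation, the speakers talk about IOPS while EFS is metered in throughput, and the function name is made up, so treat the numbers as illustrative and check current AWS documentation before relying on them.

```python
import math

# Assumption: bursting-mode baseline throughput earned per TiB of stored data.
BASELINE_MIB_S_PER_TIB = 50


def padding_gib(target_mib_s: float, stored_gib: float) -> int:
    """GiB of filler data needed so the baseline throughput reaches the target."""
    required_gib = math.ceil(target_mib_s * 1024 / BASELINE_MIB_S_PER_TIB)
    return max(0, required_gib - math.floor(stored_gib))


if __name__ == "__main__":
    # 20 GiB of real data, but a 5 MiB/s baseline is wanted.
    print(padding_gib(target_mib_s=5, stored_gib=20), "GiB of padding")
```

Comparing the monthly cost of that padding with the cost of provisioned throughput shows which of the two options discussed here is cheaper for a given workload.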

Yeah. So, you know, I have an ephemeral cluster type of situation. Would I be able to use EFS with an EKS cluster?

Oh, yeah.

An EKS cluster that ties back into the same EFS.

Yeah, you could have a static EFS file system, for example, and just keep reusing that with your ephemeral EKS clusters.

Awesome Yeah.

What's really nice about EFS is it does not suffer from the same limitations as EBS; EFS is cross availability zone.

Yes, versus EBS, which is not.

So the issue people run into frequently when doing persistent volume storage in Kubernetes is, let's say you spin up Jenkins and you have a three-node cluster across three availability zones, because you like high availability. Well, Jenkins spins up in us-east-1a, and then for some reason Jenkins restarts.

And next time Jenkins spins up in us-east-1b.

But that EBS volume is still in 1a, so Jenkins can't spin up because it can't connect to the EBS volume.
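
The Jenkins scenario boils down to a simple placement rule, sketched below in Python. It is a toy model rather than Kubernetes scheduler code: an EBS volume is treated as usable only from its own availability zone, while an EFS file system is treated as reachable from any zone, which assumes a mount target exists in each zone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Volume:
    kind: str                   # "ebs" or "efs"
    zone: Optional[str] = None  # only meaningful for EBS


def can_mount(volume: Volume, node_zone: str) -> bool:
    """Can a pod scheduled in node_zone use this volume?"""
    if volume.kind == "ebs":
        return volume.zone == node_zone
    if volume.kind == "efs":
        return True  # assumes a mount target in every zone
    raise ValueError(f"unknown volume kind: {volume.kind}")


if __name__ == "__main__":
    jenkins_ebs = Volume("ebs", "us-east-1a")
    jenkins_efs = Volume("efs")
    for zone in ("us-east-1a", "us-east-1b"):
        print(zone, "ebs:", can_mount(jenkins_ebs, zone), "efs:", can_mount(jenkins_efs, zone))
```

The EBS-backed pod is only schedulable in one zone, whereas the EFS-backed pod can restart anywhere in the cluster.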

Yeah, so EFS does not have that problem.

In fact, we've changed the default, I forget what the term is, the default storage class.

We changed the storage class for Kubernetes to be EFS and made it explicit if you want EBS, like for certain situations.

That's worked out really well for us.

Plus it simplifies backups too, because you can just set up AWS Backup on your EFS file system.

You don't have to worry about backups of every single volume provisioned by Kubernetes.

So it's a lot easier just to back up the whole file system.

We don't do that because it would be backing up random crap that we don't need.

But you can definitely do that.

Yeah, and a shameless plug: if you use the Cloud Posse EFS module, it supports AWS Backup.

Well, that's good to know.

Thank you.

I probably wouldn't have looked at EFS.

It sounds like the right thing to do.

We're running it right now; so far we have maybe three months of retention in it.

And it's fine.

Obviously, we have to wait and see how it goes up to six months.

But I don't see any problems right now.

Once we addressed the memory issues, Prometheus has been running for way over six months.

And we're fine with it.

It's all about IOPS, right?

As long as your IOPS can handle it. You know, if you are in production and you're Amazon, or Google, or whoever, of course you shouldn't use EFS.

That's dumb. But if your traffic is small, go for it.

And don't listen to all the people saying, oh, you should never run Postgres on EFS because it's NFS.

It's fine.

Don't worry about it.

Try it.

And if it causes issues, then figure out something else. Don't add complexity right off the bat, until you have tried the less complex option and determined that it's not a good option.

Yeah, I think that's a good way of putting it.

Are you guys using the CSI driver for EFS.

I'm using the original, or whatever that is.

I don't know.

There's a tool called the EFS provisioner that you deploy to your Kubernetes cluster that just provisions persistent volumes for you given persistent volume claims, and it works great.

I just posted what I was looking at in the office hours channel.

OK, so you guys are using the EFS provisioner.

I think we have a helmfile for it too.

Thank you.

Thank you.

Good to know.

Yeah, the EFS provisioner, and here's the helmfile that we use, and it works great.

We've been on it over a year.

No problems.

And so are you guys testing out Kubernetes federation?

No, we have not gone the federation route for Kubernetes, just for clarification.

When I mentioned federation, that was for Prometheus, which is unrelated.

I often look into that or Prometheus sandwich.

Pro tip with EFS: if you're going to go into GovCloud for any reason, use GovCloud West; GovCloud East does not have EFS.

We made that mistake and it has cost us.

Oh, Brian, by the way, where I pretty much learned what this looks like for federated Prometheus is from your pal Corey Gail over at gundam, so they can definitely shed some light.

They're not doing it under Kubernetes.

They're doing it like on bare metal.

But they can tell you about why they did it.

Yeah And I know that they just started that Prometheus initiative like maybe half a year ago.

Yeah, even that period, which is really cool.

Yeah Cool.

So, are there any other parting thoughts here before we start wrapping things up for today?

Any other interesting news announcements you guys have seen.

On Hacker News, apparently, what's it called, WhatsApp: if you're Jeff Bezos, don't use WhatsApp.

Yeah, that more then that sounds too good.

Also I thought apple backups were encrypted.

But now they're saying they're not encrypted.

I have started using Keybase heavily and I'm in absolutely in love.

Awesome, but Keybase, I mean, that's great for chat and validation or whatever.

And maybe with AWS can creations and whatnot.

But it's not going to help you back up your iPhone right.

I mean, I don't have an iPhone.

I'm using it.

Keybase has file storage.

And it even has encrypted Git repositories.

I saw that.

Yeah. What's interesting and underrated are their exploding messages.

If you're sending secrets, you can set them to exploding messages, like Mission Impossible.

They literally explode in the UI.

I haven't seen that.

How does that work.

I mean that on your message.

Oh, yeah.

Sure enough.

Let's go.

Let's go.

I never use that.

I just don't really use it for chat, because none of my friends are on it.

For me, it's what we use in-house to send some of our secrets.

Yeah, it's one of our best practices.

When we start with a new customer, I always recommend that they set up a team under Keybase for those situations where you need to share secrets that don't fit into the best practices of using a password manager or the best practices of using something like HashiCorp Vault.

Just kind of a one off.

Yeah, for the one off things right.

Because a lot of these other tools don't support direct sharing of secrets.

to individuals in an encrypted fashion.

Just put your passwords in Slack.

What could go wrong.

Seriously they cannot get access.

I cannot get this stupid Docker provider to work.

All right, everyone.

Well, on that note, I think we'll leave it.

Let it be.

We're going to wrap up office hours here.

And thanks again for sharing everything.

I always learn so much from these calls. A recording of this call is going to be posted in the office hours channel.

See you guys next week, same time, same place. Thanks, guys.