Public “Office Hours” (2020-02-21)

Erik Osterman | Office Hours

Here's the recording from our DevOps “Office Hours” session on 2020-02-21.

We hold public “Office Hours” every Wednesday at 11:30am PST to answer questions on all things DevOps/Terraform/Kubernetes/CICD related.

These “lunch & learn” style sessions are totally free and really just an opportunity to talk shop, ask questions and get answers.

Register here: cloudposse.com/office-hours

Basically, these sessions are an opportunity to get a free weekly consultation with Cloud Posse where you can literally “ask me anything” (AMA). Since we're all engineers, this also helps us better understand the challenges our users have so we can better focus on solving the real problems you have and address the problems/gaps in our tools.

Public “Office Hours” (2020-02-12)

Erik Osterman | Office Hours

Here's the recording from our DevOps “Office Hours” session on 2020-02-12.

Public “Office Hours” (2020-02-05)

Erik Osterman | Office Hours

Here's the recording from our DevOps “Office Hours” session on 2020-02-05.

Public “Office Hours” (2020-01-29)

Erik Osterman | Office Hours

Here's the recording from our DevOps “Office Hours” session on 2020-01-29.

Public “Office Hours” (2020-01-23)

Erik Osterman | Office Hours

Here's the recording from our DevOps “Office Hours” session on 2020-01-23.

Machine Generated Transcript

Let's get the show started.

Welcome to Office hours.

It's January 22nd 2020.

My name is Erik Osterman and I'll be leading the conversation.

I'm the CEO and founder of Cloud Posse.

We are a DevOps accelerator.

We help startups own their infrastructure in record time by building it for you and then showing you the ropes.

For those of you new to the call, the format of this call is very informal.

My goal is to get your questions answered.

Feel free to unmute yourself anytime if you want to jump in and participate.

Excuse my dog.

He's having a fun time downstairs here.

We hold these calls every week.

We'll automatically post a video recording of this session to the office hours channel, as well as follow up with an email so you can share it with your team.

If you want to share something in private, just ask and we'll temporarily suspend the recording.

With that said, let's kick it off.

Here's the agenda for today.

Some talking points that we can cover just to get the conversation started.

One of the announcements, which is pretty good for all of us using EKS, is that the master nodes have now come down in cost.

It's ugly.

It's a 50% reduction across the board.

Again, this is only on the cluster itself, like the master nodes.

It has no bearing on your worker nodes themselves.

So there's a link to that.

The other good news is that terraform-docs has finally come out with a release: they have an official 0.8.0 release.

It's actually up to 0.8.1 already this morning, and it supports HCL2.

If your team isn't already using terraform-docs, I highly recommend it.

It's a great way to generate markdown documentation for all your Terraform code automatically.

Some people use pre-commit hooks.

We have our own README generator that we use it with; other people use it to validate that your inputs have descriptions set and stuff like that.

Also, we launched our references section.

This is if you go to cloudposse.com/references.

These are links to articles and blog posts around the web that have been written about how to use some of our stuff, since I think it's interesting to see how others have come to use our modules.

This is a great resource for that.

And the first technical topic of the day that I'd like to talk about, if there are no other questions, would be using Open Policy Agent with Terraform, and then I'm also curious about firsthand accounts of using Jenkins X.

So with that said, let's kick this off.

Any questions?

Are you sharing the right screen?

Oh my god.

I do not share my screen.

All right.

Let's do that alone.

Let's see here.

Sure I am not showing the right screen.

You see my notes.

All right.

The magic is gone.

All right share this window.

Hey you.

All right.

Thanks Ed.

There we go.

All right.

Well, if there are no other questions, I want to share a status update on behalf of Brian Tye, the guy over at AuditBoard (or SOXHUB) who had the challenge with those random Kubernetes nodes getting rebooted.

Well, he finally got to the bottom of it, although everything else he did to try and treat it wasn't actually related to the problem.

So they rely on... shoot, not Datadog.

What's that monitoring tool?

It started off as an open source project tracking all the traffic in your cluster.

Andrew, help me out here.

Sorry, what is this doing?

Yeah, I think Sysdig is what they're using.

So, Sysdig for monitoring.

So they have Sysdig installed on all their nodes in the cluster, and he was finally able to find the stack trace that was getting thrown by going to the AWS web console and looking at the terminal output of one of the nodes that had rebooted.

We thought we'd actually done this got to actually do it.

So this is where the actual exception was finally caught.

The problem was a null pointer exception in Sysdig running on their nodes.

That was causing the nodes to crash.

So it was actually their monitoring platform itself, a SaaS platform nonetheless, that was causing it to crash.

So they got that fixed, and now their nodes are not crashing anymore.

They have a cluster that some people spun up, and they just kind of gave everyone access to it, like root.

You know, this is just kind of a playground, whatever.

Which in theory sounds nice, but what's happening is people are not being good stewards of the cluster: they're not using it properly and setting resource limits and requests and everything.

And so it's causing starvation on the nodes, and it's causing not just the one team to have issues, but anyone using the cluster to have issues, and everyone's going, what's going on with the cluster?

My stuff's not... I don't think my stuff is hurting the cluster.

And it's everyone pointing fingers at each other.

That's all it's turning into.

And it's just a giant.

So what are your thoughts on how to resolve that?

Don't do that.

Well, yeah, but I mean, I agree that everyone should play fairly with each other.

But maybe there are some pretty simple ground rules, like don't give anyone cluster admin.

And if a team wants to use the cluster as a sandbox, create a namespace and create a service account for that namespace that has whatever access to that namespace they want.

Create a limit range on that namespace so that if they don't set a request or a limit, it sets a default. Then they don't have access to the entire cluster; they just have access to the namespace itself.
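As a rough illustration of the limit range idea above, here is a minimal Python sketch that builds a Kubernetes LimitRange manifest for a sandbox namespace. The namespace name, the object name `default-limits`, and the CPU and memory values are hypothetical placeholders, not values from the call; only the LimitRange fields themselves (`default`, `defaultRequest`) are standard Kubernetes.

```python
import json


def limit_range_manifest(namespace, default_cpu="500m", default_memory="256Mi",
                         request_cpu="100m", request_memory="128Mi"):
    """Build a LimitRange manifest as a plain dict.

    Containers in the namespace that omit requests or limits get these
    defaults applied. All values here are illustrative assumptions.
    """
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "default-limits", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Applied as the limit when a container sets none.
                    "default": {"cpu": default_cpu, "memory": default_memory},
                    # Applied as the request when a container sets none.
                    "defaultRequest": {"cpu": request_cpu, "memory": request_memory},
                }
            ]
        },
    }


if __name__ == "__main__":
    # JSON is accepted by kubectl, so the output could be piped to `kubectl apply -f -`.
    print(json.dumps(limit_range_manifest("team-sandbox"), indent=2))
```

The service account and role binding scoped to the same namespace would be separate objects and are left out of this sketch.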

And we get people bitching about.

Well, I need to set, you know, I need to set a cluster role or whatever.

It's like, well, if you're on cluster.

Yeah OK.

I agree with that.

That's basically what was going to be my recommendation, exactly what you said.

The only other augmentation to that, though I think it's going to take more effort, would be to deploy something like one of these Open Policy Agent based solutions that require limits to be set, that require all of these.

Now, for the roles, that can kind of be solved, perhaps, if you had GitOps related to this cluster: you can add the roles, but that has to go through this different process.

Right, for anything that is global to the cluster, it's going to be:

No one has access to it.

And if they want something in the cluster, it goes through CI/CD, like Jenkins or whatever.

OK gotcha.

You know that.

So they register, you know, so they say, I have this app, I'm going to push up.

I'm going to push up new images to this container registry, you know, and then set up Harness or Flux or Argo or whatever on the other end, and it requires kind of an ops team, which is not super DevOps.

But at the same time, it's like, if I give access to the cluster to everyone,

it's just going to cause tons of problems.

So like, yes, I have an ops team.

And you are the dev team.

But I want you to come work with me to set up your Argo stuff.

And then we'll be in communication on Slack, and we can join each other's standups or whatever and do that for, you know, DevOps collaboration, versus here's access to the cluster, do whatever you want, it's a free-for-all.

But are the consequences.

So one of the things you said just jogged a memory: I met with Marcin.

He's the founder of Spacelift.

I shared it in the GitOps channel earlier this week.

And what was pretty neat about what he's doing is two things that come from experience doing GitOps.

One is addressing a big problem with Terraform Cloud.

The problem is that there's not a good way for you to bring your own tools.

You have to bake those into your Git repo, and if you have like 200 Git repos that all depend on a provider, you have to bake in 200 binaries in those repos, and that's no fun.

Or if you depend on other tools and use local-exec provisioners and stuff like that,

there's no fun way to do that.

Terrible. So in Spacelift,

he took a different approach.

You can bring your own Docker container.

Kind of like what you have, Andrew, and what we have with Geodesic, with your toolchain.

But the other thing that he does,

and this is what jogged my memory with what you're saying, is that sometimes you need escape hatches.

Sometimes you need a way to run ad hoc commands.

But we also want those to be auditable.

An example is you mess up the Terraform state and you need to unlock it; the Terraform state is locked somehow.

You need to run force-unlock.

How do you do that?

The other example is you're running an actual live environment and you need to refactor where your projects are.

So you need to move resources between Terraform states.

So you need to use the terraform import command to do that.

How do you do that in an auditable way?

So what he's done is,

I forget what he calls it here,

I think he calls it tasks, but he provides a way for you to run one-off tasks as part of your CI/CD.

The other thing he supports, since it's a SaaS product right now, is that you delegate roles to the SaaS, and then those roles allow it to run without hardcoded credentials, which is the big problem, right, with Terraform Cloud: you've got to hardcode your credentials.

And on to Terraform and the Open Policy Agent.

I posted a bunch of links this week in the Terraform channel related to the Open Policy Agent.

I confess it's been brought up several times before; Blaze, one of our community members, I know has been doing some POCs on this. But it was Marcin showing me his demo of how he's integrated it into Spacelift using OPA that got me really excited.

So I started doing some research on that.

And then I saw it's actually pretty straightforward

how you could integrate this into your CI/CD pipelines.

So if you look at these two links here,

I'll post them as well to the office hours channel.

I know I forgot to also announce that office hours have started; here, let me ping everyone that office hours started.

Yeah, right.

So office hours.

So there's the link to the Open Policy Agent.

So in here they provide an example of basically how this works.

So here's an example of a typical resource that you define in Terraform code, and then what you do is you basically generate the plan as JSON using this argument.

Now you have this JSON plan, and that makes it very easy to operate on the output of this in the Open Policy Agent.

So they have some examples down here where you can now describe the policy for this.

So here it has an example of scanning over the plan.

Let's see what this is.

So: the number of deletions of resources of a given type.

So here it's counting the resources. You could have, for example, a policy which is that, hey, if there are no deletions, that reduces maybe the number of approvals that you need or something to get this PR through.
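To make the deletion-counting idea concrete, here is a minimal sketch written in Python rather than Rego, so it is a stand-in for an OPA policy and not the example shown on the call. It assumes the plan was exported with `terraform show -json`, whose output lists each change under `resource_changes` with a `change.actions` array. The approval rule and its thresholds are hypothetical.

```python
import json
from collections import Counter


def count_deletions(plan):
    """Count resources scheduled for deletion, grouped by resource type.

    A replacement shows up as both "delete" and "create" in the actions
    list, so it is counted as a deletion here too.
    """
    deletions = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            deletions[rc["type"]] += 1
    return deletions


def approvals_required(plan, default=2):
    """Hypothetical policy: one approval if nothing is deleted, else the default."""
    return 1 if sum(count_deletions(plan).values()) == 0 else default


if __name__ == "__main__":
    # Inline sample shaped like `terraform show -json tfplan` output (trimmed).
    sample = json.loads("""
    {"resource_changes": [
        {"type": "aws_instance", "change": {"actions": ["delete"]}},
        {"type": "aws_s3_bucket", "change": {"actions": ["update"]}}
    ]}
    """)
    print(dict(count_deletions(sample)), approvals_required(sample))
```

In a pipeline, a check like this would run between the plan and apply steps and fail the build or request extra reviewers when the policy is not met.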

So then someone else in the community posted a link to conftest, which seems even more rad because it adds support for lots of different languages or types.

So it supports, out of the box,

JSON and YAML, but they also have experimental support for HCL2.

So now you can actually do policy enforcement on your Terraform using totally open source tools.

So that should play well with, like, your Jenkins pipelines, Codefresh pipelines, et cetera. Andrew, have you been giving this a shot at all?

I have not been able to do much experimentation and learning lately, trust me.

So, no.

Carlos, this sounds like something that might be up your alley, given the level of sensitivity of some of the things you do operating at a hedge fund.

Carlos, are you with us?

I was muted.

Yeah. Yes.

Sounds very interesting.

I haven't seen it until now.

Cool Yeah.

You are you see.

I reposted those links into my company's Slack, in my Terraform channel.

And somebody said "open source Sentinel," which they had heard of.

Oh, yeah.

Yeah, because Sentinel is HashiCorp's enterprise offering.

So yeah, they were saying that what I was linking was an open source Sentinel.

Yeah, the equivalent of, basically, an open source Sentinel.

Gotcha, exactly.

I hadn't heard of Sentinel.

Yeah, so Sentinel is an offering of Terraform Enterprise on-prem.

I don't believe you can have it on the cloud version, the hosted version.

I really like the way that OPA has kind of created this standardized format for creating these policies using this language.

It's called Rego, right?

Oh, is that the name of this?

Yeah, OPA's language is called Rego.

And with the fact that OPA has created this standard,

it's going to have to be something super awesome nowadays for me to go with something that doesn't support OPA.

Yeah, like, you know, I'm looking at Sentinel and it's going to be like this, you know, proprietary thing that works,

you know, for this one aspect.

And that's kind of it. Versus, what's the reason I went to Terraform in the beginning?

Well, it's because it was standardized: you know, if I go AWS, Azure, GCP, whatever, I can use Terraform.

I don't have to learn some new tool.

My feeling is that it's that same kind of mentality

now. Yeah.

Yeah, I think so.

It's good for kind of testing or validation of all these formats, because certainly, I mean, let's face it, the tools that we use move bleeding-edge fast, and many of them lack sufficient validation of what they have, or features.

But now, let's say Helmfile, for example: using this, it would be very easy to express

some policies around helmfiles, basically codifying your best practices.

And doing the same for Dockerfiles, et cetera.

So yeah, like what?

So for Helmfile, what would be a good best practice? That everything has the installed flag; that should be enforced in all of your helmfiles.

What else would be a good one? That you're pinning image tags or something.

I'm struggling right now off the top of my head.

I wasn't prepared for it.

But yeah, I think we could find some things to validate in helmfiles, like the installed flag.

So Helmfile has moved in the direction of mirroring the interface of Terraform. Terraform has a plan phase and an apply phase. Now, in Terraform, the plan phase can actually generate a plan, and then you can execute that plan. In Helmfile,

it has something analogous to the plan,

but not an artifact like a plan file.

So what I'm referring to is helm diff.

That's right.

So helm diff shows you the potential changes without making those happen.

Well, then Helmfile has two other commands.

One is apply and the other is destroy, and those mirror Terraform.

So apply honors the installed flag, versus sync.

I don't believe sync does.

So when you run sync, I don't think it'll uninstall things.

But apply will uninstall things when you run Helmfile.

Interesting, that's not what the documentation is saying.

But maybe there's more.

OK, let's get that right.

I might be wrong.

There is a nuance like that.

Let's see: what the documentation is saying is that the difference between apply and sync is that apply runs diff, and if diff shows that something is different, then it runs sync.

Am I getting that right?

So, the helmfile apply sub-command begins by executing diff.

If diff finds that there are changes, sync is executed. Adding the interactive flag instructs Helmfile to request your confirmation before sync. Versus sync, which says: sync all resources from the state file.
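The flow just read from the documentation (diff first, sync only when diff reports changes, optional confirmation in between) can be sketched as plain control flow. This is a Python illustration of that description, not Helmfile's actual implementation; the function name, its callable parameters, and the returned status strings are all hypothetical.

```python
def helmfile_apply_flow(run_diff, run_sync, confirm=None):
    """Mimic the documented apply behavior.

    run_diff: callable returning True when the diff reports changes.
    run_sync: callable that performs the sync.
    confirm:  optional callable standing in for the interactive prompt;
              returning False cancels the sync.
    """
    if not run_diff():
        return "no-changes"  # nothing differs, so sync is never executed
    if confirm is not None and not confirm():
        return "cancelled"   # the user declined at the prompt
    run_sync()
    return "synced"


if __name__ == "__main__":
    result = helmfile_apply_flow(
        run_diff=lambda: True,
        run_sync=lambda: print("syncing releases"),
    )
    print(result)
```

Whether the installed flag is honored by sync as well is exactly the open question in the conversation, so it is deliberately not modeled here.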

Yeah, so I don't think it mentions it.

And it doesn't work like that.

It doesn't mention anything about that installed flag.

I suspect before consent either I could be wrong.

Maybe maybe it doesn't work that way.

So, you know, with what I said, maybe

I was misled, or maybe the functionality has changed since we started using it.

But the idea is that apply is intended to be used as part of your CI workflow.

We're getting into Helmfile.

And I'm starting to try to come up with some best practices because, you know, some of us

now have more experience with Helmfile than others.

And so we're looking at things like the cleanup-on-fail flag, for example.

Is it a best practice to have that on, or just leave it as the default, false?

Yeah. Well, I would be careful about that in production, and I'd be recommending it perhaps in staging.

But it depends.

Maybe not in all staging environments; maybe just in your demo or preview type environments.

This just came up today.

So let me explain a little bit more.

So as we know, with Helm, if you do an upgrade, it will bail on you if the previous release failed.

So then you can't move forward.

And if you add the force flag, for example, then it will uninstall that failed release and then reinstall.

But that might be pretty disruptive if you're running in a production environment, especially since a failed upgrade or a failed deployment doesn't mean a dysfunctional service.

It just means the service is in an unknown state.

So this is why you might not want to clean up resources, if keeping them will help you debug what went wrong.

However, in a staging environment, or like preview environments where you're deploying frequently and you don't want your builds to constantly fail, especially when you're developing, especially when you know things might be breaking and unstable, then I like the force flag, and possibly the cleanup flag, to just keep things humming along.

Even in staging? I would totally agree with that for dev, but in staging,

wouldn't you want to have, like...

My thought process for staging is like, all right, I want to have whatever production has now.

And then run the exact same command that you're going to run in production and see what happens.

Yeah, I'm sorry, we're just overloading the term staging like every company does.

Staging in your context is like a pre-production.

Yeah, staging to us is an account stage where you will have environments with different designations. So one designation would be pre-production, and pre-production should be almost identical to production.

But then you have unlimited staging or preview environments.

Those are also running in the staging account.

But they have a different designation.

So it's just that different companies use these terms differently.

And that's it.

So we're saying the same thing.

We're just using different terms. Erik, correct me if I'm wrong, but Helm will still deploy over a failed

deploy if there were previous deployments that passed, right?

My understanding is that if the most recent deploy failed, the upgrade will fail unless you have the force flag.

But I don't want to claim expertise on that.

I've been using Helmfile

for so long.

So Helmfile kind of redefined some of those behaviors. If you use the force flag in Helmfile,

the first thing it does is an uninstall of the release if it has failed, to ensure that it successfully installs.

But as for the raw Helm functionality,

maybe if somebody knows for sure, please chime in. I'm almost positive

that's the case.

Just because I have instances where the initial deploy will fail, and then I won't be able to deploy to that again until I do like a delete --purge.

Right, but I will have long-running deployments that will fail, and then running another deploy does work.

And I don't have to delete that.

Delete that.

So you're saying when there are multiple Helm releases, you're able to do it.

But if there's only one release,

you can't?

So, for a particular release:

if it's been deployed successfully before, you can keep deploying, as long as it was successful before.

OK, that might be the case.

I believe, at least in my case, I haven't seen that.

But definitely, if it's the initial deploy of a release and it's never been deployed before and it fails, you cannot redeploy it until you purge that release.

Yeah, so maybe what we're seeing... can you see my screen here?

And I think it's the atomic flag that helps with this magic here.

Last nice.

By the way, Brian, just before you joined I shared, I think, the resolution to your issue with the random node reboots.

Oh, yeah.

Yeah, just for those that ever have to run into this: check the EC2 system logs.

We'll have all of your kernel logs versus checking on the box, or guess ship it yourself by think just if you're on a date.

Is there a ship for you.

She module is.

That's cool. Any other questions or insights here?

I need to come up with a demo of infrastructure as code using just my local computer.

So, self-contained, like a demo.

Yeah. Any ideas?

Like, I could do Vagrant with, you know, spinning up a couple of virtual machines.

I'd love to use Terraform because that's what I use

for real infrastructure as code.

And so if I'm just showing code,

I'm going to show Terraform code, but I need to be able to do a demo of like, hey, look at me spinning up actual infrastructure.

But I won't have access to an AWS account or anything because multiple people are going to do this.

I assume you're saying... does it specifically need...

Siri woke up here.

Does it specifically need to demo AWS functionality, or demo IaC?

IaC, any kind of IaC. Wonderful. Food for thought:

if it jogs your memory, because I know you're working on Dad's Garage, your equivalent of Geodesic: one of the prototypes

I did here,

it was just a, I think it was a... yeah.

So I did a prototype of working with Minikube, but then it also works with Docker for Mac, and I assume most of your company would have Docker for Mac installed, right?

Now that I think of it, OK, maybe then the value of what I'm saying is moot.

But what I was going to say is, if you're running Docker for Mac, or probably Docker for Windows too, you can just check off this box: Enable Kubernetes.

That's not a bad idea though.

And to simplify it, you could do Terraform code that just spins up Docker containers.

Yeah. Yeah.

Because all I really care about is showing, like, you know, look, here is code that spun up a bunch of Ubuntu with this exact version, with this exact amount of memory and CPU power that I told it to set up, and it spun up three of them because I told it to spin up three of them, you know.

And now I'm going to kill them all with one command.

Yeah, that's infrastructure as code right there, with Docker

instead of virtual machines.

But the concept is the same.
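As a sketch of what such a local demo might look like, the Python below generates a Terraform JSON configuration (a `main.tf.json` file) describing three identical Docker containers. It assumes the Terraform Docker provider and its `docker_container` resource with `name`, `image`, and `memory` arguments; attribute names and provider setup can differ between provider versions, and the resource label `demo`, the image tag, and the memory size are hypothetical placeholders.

```python
import json


def docker_demo_config(count=3, image="ubuntu:18.04", memory_mb=256):
    """Build a Terraform JSON config that declares `count` identical containers.

    Provider configuration is omitted; the resource label and values are
    illustrative assumptions, not taken from the call.
    """
    return {
        "resource": {
            "docker_container": {
                "demo": {
                    "count": count,
                    # Terraform evaluates this template per instance: demo-0, demo-1, ...
                    "name": "demo-${count.index}",
                    "image": image,
                    "memory": memory_mb,  # memory limit in MB (assumed provider semantics)
                    # Keep the container alive so it is visible during the demo.
                    "command": ["sleep", "infinity"],
                }
            }
        }
    }


if __name__ == "__main__":
    with open("main.tf.json", "w") as fh:
        json.dump(docker_demo_config(), fh, indent=2)
```

With a file like this in place, the demo would presumably be `terraform init`, `terraform apply`, change the count or the image, apply again, and finish with `terraform destroy` to remove everything with one command.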

Faster demo, lighter weight.

And as you know, there's a Docker provider, so you can also skip what I said about Kubernetes and you could just do plain vanilla Docker as well.

Can you provide the context of this demo? Is it like a lunch-and-learn for your co-workers? It's for an intro to DevSecOps class.

So we've got a section on infrastructure as code, and the customer wants a live demo of infrastructure as code.

I did a...

So this demo does include AWS.

But I did a talk on using Terraform to deploy a simple web API on an Auto Scaling group in AWS and then do a rolling deployment of that.

So that's kind of fun.

It's open source; I can send it to you.

But it does require an AWS account.

But I actually was going to do this for my co-workers, and we'll just create a dummy AWS account, provide them the credentials, and then just ask them to do the terraform destroy after.

So you could technically do that right.

I guess it depends on the size of your class.

Yeah, that gives me some ideas.

Like, I could do something like, you know, given the way that you guys currently handle your infrastructure,

how would you handle doing an OS version upgrade?

Oh, well, you know, we'd have to SSH into each and every one of them, then apply all the commands and shepherd them through, the whole pets versus cattle thing. Versus I could be like, I'm going to change this variable and then redeploy.

That's a good idea.

I like that.

Well, what I did also, which people thought was really cool when I did my demo, was I deployed it to two different regions and two different environments, all with the same Terraform.

It was just a switch of a variable, and when I did that demo everybody appreciated that.

Yeah, that's going to do well: you have identical environments,

but one is staging and one is prod. The prod one might have t2.xlarge; the staging one might have t2.micro.

Other than that, all else is the same.

Yeah, that reminds me of one of the posts that one of our customers shared with us, only because it was a nice summary of many people's experience with Terraform, from when you start to where you are today. When you start, you start with a small project.

And then you realize you need a staging environment.

So you clone that code in the same project.

And now you have two environments.

But any change is in the blast radius of all of those.

So then you move to having separate projects and all of that stuff. You'll find that link in the Terraform channel. That was the first time I heard the term "terralith," for Terraform monolith.

I love it.

Yeah, it's not a perfect description of it.

Everyone knows what you mean, if they've ever done it.

Yes, speaking of terraliths,

I was thinking about the way

you had showed

how you do the Terraform infrastructure for your clients, where each client gets a GitHub organization.

And then what's under the org?

Is it just one big terraform apply or whatever for the whole thing, or is it...

No. Yeah, it's all decomposed.

So basically, our thing is that we've been using it.

So here's like an organization.

And then each AWS account basically represents a dedicated application, so each one of those has a repository. Then you go into each one of these repositories,

and they have software, just like your applications have software.

But your applications,

what do they do?

They pin library versions.

So that's what we do here.

So you want to run the same versions of software in all these environments, all these AWS accounts, and we do that just by pinning to a remote module.

So here we can pull in a remote module.

So when we make changes to this directory, that triggers a plan and apply workflow with Atlantis. So there's a plan and apply just for the cloudtrail directory or whatever, instead of for the whole staging account?

Exactly. So basically, you open pull requests that modify any one of these projects.

And that will trigger it. You can modify multiple projects at the same time.

Does that need multiple applies?

Well, multiple plans and multiple applies to do that.

But all in the same pull request and all automatic.
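The per-directory triggering described above can be pictured as mapping the files changed in a pull request onto project directories. The Python below is a simplified sketch of that idea, not how Atlantis itself is implemented; the directory names in the example are hypothetical.

```python
from pathlib import PurePosixPath


def affected_projects(changed_files, project_dirs):
    """Return the project directories that contain at least one changed file."""
    affected = set()
    for changed in changed_files:
        parents = PurePosixPath(changed).parents
        for project in project_dirs:
            # A file belongs to a project if the project dir is one of its ancestors.
            if PurePosixPath(project) in parents:
                affected.add(project)
    return sorted(affected)


if __name__ == "__main__":
    changed = ["conf/cloudtrail/main.tf", "conf/kube-system/helmfile.yaml", "README.md"]
    projects = ["conf/cloudtrail", "conf/kube-system", "conf/vpc"]
    # Each affected project would get its own plan, all within the same pull request.
    for project in affected_projects(changed, projects):
        print("plan:", project)
```

A change to a shared remote module would not show up this way, which is one reason pinning module versions inside each project directory fits this workflow: bumping the pin is itself a change to that directory.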

Right OK.

What does that look like.

Does it Lantz ask when you submit a pull request does Atlantis say you know this affected 3 Terraform projects.

And here's the plans for all three of them is that what it does.

Yeah So it would.

So you know the pull request was opened automatically or as the site was open manually by a human a developer.

Then the plan was it kicked off automatically.

And then we see this out here.

This happens to be output from helm file.

Not Terraform but we because we use Atlantis for both my file.

And terrible when we see it or what's going to change.

And then somebody approved it.

And then we can surgically target what part of that we want to apply.

Or we can just say "atlantis apply" and it'll apply all the changes.

Gotcha.

So one thing to note about Atlantis, which is really nice, is that it locks the project that you're in.

So two developers can't be trying to modify, in this case, kube-system at the same time.

Because otherwise, those plans and applies will be undoing each other's changes and they'll be at war with each other.
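The locking behavior described here can be sketched as a tiny in-memory lock table. This is only an illustration of the idea under the assumption that a lock is keyed by repository, directory and workspace and is held by one pull request at a time; the class and method names are hypothetical and are not Atlantis's API.

```python
class ProjectLocks:
    """In-memory sketch of per-project locking.

    A lock is keyed by (repo, directory, workspace) and is owned by the
    pull request that planned the project first.
    """

    def __init__(self):
        self._held = {}  # (repo, directory, workspace) -> pull request number

    def try_lock(self, repo, directory, workspace, pull):
        """Take the lock if it is free; return True if `pull` holds it afterwards."""
        key = (repo, directory, workspace)
        holder = self._held.setdefault(key, pull)
        return holder == pull

    def unlock_pull(self, pull):
        """Release every lock held by a pull request, e.g. when it is merged or closed."""
        self._held = {k: v for k, v in self._held.items() if v != pull}


if __name__ == "__main__":
    locks = ProjectLocks()
    # Pull request 101 plans kube-system first and gets the lock.
    print(locks.try_lock("staging", "conf/kube-system", "default", pull=101))
    # Pull request 102 touches the same project and is refused.
    print(locks.try_lock("staging", "conf/kube-system", "default", pull=102))
    # Once 101 is merged, 102 can plan.
    locks.unlock_pull(101)
    print(locks.try_lock("staging", "conf/kube-system", "default", pull=102))
```

The second pull request has to wait until the first one is merged or closed, which is what keeps two sets of applies from undoing each other.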

So Brian, any interesting projects you're working on?

I am starting my Prometheus initiative.

Oh, nice.

Very cool.

Well, are you going to take a federated approach, or, as in, like, remote storage?

No. So basically, you have a centralized Prometheus, but then each cluster runs its own Prometheus and the centralized Prometheus scrapes the other ones.

That's possible.

I'm not sure yet.

I actually just started yesterday looking at the helmfile for it.

But yeah, I'll be honest, I'm not super familiar with Prometheus.

It's just that I know it's a lot more lightweight than what we're currently running, which is Sysdig, which is third party.

But it takes up 30% of our CPU on our 4x 2xlarges, and for a monitoring tool that's just asking for too much. It's also memory intensive as well.

So Prometheus obviously is on the other side of that spectrum, where it is not a resource hog, which is why... I mean, yes.

I mean, it takes a brute-force approach to monitoring, right?

It does, like, packet inspection of everything happening on that box.

But I guess it's a testament to how fast CPUs have become and how cheap memory is that they're able to get away with doing that.

What does.

But it's still a problem when you're doing things like you are saying, yeah.

Yeah. Have you guys gone the federated approach?

What would you say about that?

So we're starting a project in the next month or so with a new customer.

And that's going to be doing Federated Prometheus.

The reason for that in their case is that they're going to be running dozens or more single-tenant, dedicated environments of their stack for customers. It's basically enterprise SaaS. So having dozens and dozens of Prometheuses and Grafanas would just be too much noise and too hard to centrally manage.

Also, scaling that architecture if you only had a central Prometheus would be dangerous, because at some point you're going to hit a tipping point and it's just not going to handle it.

So with a federated approach, basically, you can run Prometheus, because Prometheus is basically a time series database, in addition to its ability to harvest these metrics.

Now, it can use a real time series database like InfluxDB or Postgres with TimescaleDB or whatever.

And so forth, and a bunch of others.

But the simplest way, I think, is just to think about Prometheus as, basically, a database.

So then what you can do is you can run smaller instances of Prometheus on each cluster with a shorter retention period.

So you can still get real-time data in, but you don't need to run massive deployments of Prometheus, which can get expensive, because no less than 12 to 15 gigs allocated to the main Prometheus under the Prometheus operator is required for any sort of retention of more than a week or two.

So in the federated model, you basically set up, on a shared services cluster, what we call core, one Prometheus instance that then scrapes the other ones.

And it can do that at its own pace and you can also downscale the precision in the process of doing that.

If you need to.

Plus if you do need absolute real time, you could still have Grafana running on all the clusters.
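As a rough illustration of the federated setup described above, the sketch below builds the scrape job that a central Prometheus could use to pull from each cluster's Prometheus through its /federate endpoint. The field names (honor_labels, metrics_path, params with match[], static_configs) follow the federation example in the Prometheus documentation; the host names, the 60s interval and the catch-all matcher are assumptions for illustration, not values from this call. The output is JSON, which YAML parsers generally accept.

```python
import json


def federate_job(targets, scrape_interval="60s", matchers=('{job=~".+"}',)):
    """Build one scrape job that federates from per-cluster Prometheus servers.

    targets         -- host:port of each per-cluster Prometheus (hypothetical names)
    scrape_interval -- the central server can scrape at its own, slower pace
    matchers        -- series selectors passed as match[]; the default selects
                       every series that has a job label
    """
    return {
        "job_name": "federate",
        "scrape_interval": scrape_interval,
        "honor_labels": True,  # keep the labels set by the cluster-level Prometheus
        "metrics_path": "/federate",
        "params": {"match[]": list(matchers)},
        "static_configs": [{"targets": list(targets)}],
    }


if __name__ == "__main__":
    # Hypothetical per-cluster endpoints.
    clusters = [
        "prometheus.cluster-a.example.com:9090",
        "prometheus.cluster-b.example.com:9090",
    ]
    print(json.dumps({"scrape_configs": [federate_job(clusters)]}, indent=2))
```

A longer scrape_interval on the central server is one simple way to reduce precision there, while each cluster keeps full-resolution data for its own short retention window.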

And you can have real-time updates on those environments. Different ways of looking at it.

Do you guys use the remote storage?

We don't.

So it's kind of been in our backlog of things to solve.

And there are a few options for it.

The one that keeps coming up is Thanos, but when we look at the architecture,

I mean, it suddenly increases the scope of what you've got to manage, right?

And when it's your monitoring system, it's really important that it's up.

So I understand and appreciate the need to have a reliable backend, but also, the more moving pieces there are, the bigger the problem is if something goes down.

And then what monitors the monitoring system, and all these things?

So we've gotten away with using EFS, and as scary as that sounds, it's actually less scary these days because EFS basically has POSIX compatibility and you can reserve IOPS on EFS.

So the problems that we've had with Prometheus have been related to IOPS, and then we just bumped it up. You pay to play.

And it's not that much more to pay to play compared to the engineering side.

So we just bumped up the IOPS and all our problems went away.

The other option, which I think Andrew's brought up before, is that you can also just allocate more storage and then you get more IOPS, which is great.

First of all, it just gives you more credits.

So if the baseline of credits you're getting is not high enough given the amount of data you have, just store a bunch of random data and the amount of...

Your signal's pretty bad, Andrew, at least for me. Is anyone else hearing the feedback a little bit?
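The "store more data to get a higher baseline" trick discussed above can be put into numbers. The sketch below assumes EFS bursting mode, where AWS documents the baseline as throughput (not IOPS) that scales with stored data, at roughly 50 MiB/s per TiB; treat that constant as an assumption to verify against current AWS documentation. The function names are hypothetical.

```python
# Assumed baseline for EFS bursting mode: 50 MiB/s per TiB stored
# (equivalently 50 KiB/s per GiB). Verify against current AWS docs.
BASELINE_MIB_S_PER_TIB = 50.0
GIB_PER_TIB = 1024.0


def baseline_mib_s(stored_gib):
    """Baseline throughput in MiB/s for a file system holding `stored_gib` GiB."""
    return stored_gib * BASELINE_MIB_S_PER_TIB / GIB_PER_TIB


def padding_needed_gib(target_mib_s, stored_gib):
    """Extra GiB of filler data needed to reach `target_mib_s` of baseline throughput."""
    required_gib = target_mib_s * GIB_PER_TIB / BASELINE_MIB_S_PER_TIB
    return max(0.0, required_gib - stored_gib)


if __name__ == "__main__":
    stored = 100.0  # GiB of real data (example value)
    print(f"baseline with {stored:.0f} GiB stored: {baseline_mib_s(stored):.2f} MiB/s")
    print(f"filler needed for 10 MiB/s baseline: {padding_needed_gib(10.0, stored):.1f} GiB")
```

Whether padding the file system is cheaper than paying for provisioned throughput depends on current pricing, so this is only a way to size the trade-off.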

Yeah. So, you know, I have an ephemeral cluster type of situation. Would I be able to use EFS with an EKS cluster?

Oh, yeah. Yeah, yeah.

An EKS cluster that ties back into the same EFS?

Yeah, yeah. You could have a static EFS file system, for example, and just keep reusing that with your ephemeral EKS clusters.

Awesome. Yeah.

What's really nice about EFS is that it does not suffer from the same limitations as EBS. EFS is cross availability zone, versus EBS, which is not.

So the issue people run into frequently when doing persistent volume storage in Kubernetes is, let's say you spin up Jenkins and you have a three-node cluster across three availability zones because you like high availability. Well, Jenkins spins up in us-east-1a, and then for some reason Jenkins restarts.

And next time Jenkins spins up in us-east-1b.

But that EBS volume is still in 1a. Jenkins can't spin up because it can't connect to the EBS volume.

Yeah. So EFS does not have that problem.
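The Jenkins scenario above boils down to a simple rule: an EBS volume lives in one availability zone, while an EFS file system can be mounted from any zone where it has a mount target. A minimal sketch of that rule follows; the types and names are hypothetical, the zone names are illustrative, and it assumes the EFS file system has a mount target in every zone the cluster uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Volume:
    name: str
    kind: str                   # "ebs" or "efs"
    zone: Optional[str] = None  # only meaningful for EBS, which is zonal


def can_mount(volume, node_zone):
    """Return True if a pod on a node in `node_zone` can use the volume.

    Assumption: the EFS file system has a mount target in every zone,
    so it is reachable from any node; EBS must match the node's zone.
    """
    if volume.kind == "efs":
        return True
    return volume.zone == node_zone


if __name__ == "__main__":
    jenkins_ebs = Volume("jenkins-home", "ebs", zone="us-east-1a")
    jenkins_efs = Volume("jenkins-home", "efs")
    for zone in ("us-east-1a", "us-east-1b"):
        print(zone, "ebs:", can_mount(jenkins_ebs, zone), "efs:", can_mount(jenkins_efs, zone))
```

With the EBS-backed volume, the restart in the second zone fails the check; with EFS it passes in both zones.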

Yeah. In fact, we've changed the default...

I forget what the term is. The default storage class.

We've changed the default storage class for Kubernetes to be EFS, and you make it explicit if you want EBS, like for certain situations.

But that's worked out really well for us.

Plus it simplifies backups too, because you can just set up AWS Backup on your EFS file system.

You don't have to worry about backups of every single volume provisioned by Kubernetes.

So it's a lot easier just to back up the whole file system.

We don't do that because it would be backing up random crap that we don't need.

But you can definitely do that.

Yeah, and here's a shameless plug: if you use the Cloud Posse EFS module, it supports AWS Backup.

Well, that's good to know.

Thank you.

I probably wouldn't have looked at EFS.

It sounds like the right thing to do.

But we're running it right now, and so far we have maybe three months of retention in it.

And it's fine.

Obviously, we have to wait and see how it goes up to six months.

But I don't see any problems right now.

Once we addressed the memory issues, Prometheus has been running for us for way over six months.

And we're fine with it.

It's all about IOPS, right?

As long as your IOPS can handle it. You know, if you're in production and you're Amazon, of course you shouldn't use EFS, or if you're Google or whoever.

That's dumb. But if you're small, you know, if your traffic is small, go for it.

And don't listen to all the people saying, oh, you should never run Postgres on EFS because it's NFS.

It's fine.

Don't worry about it.

Try it.

And if it causes issues, then figure out something else. Don't add complexity right off the bat until you have tried the less complex option and determined that it's not a good option.

Yeah, I think that's a good way of putting it.

Are you guys using the CSI driver for EFS?

I'm using the original, or whatever that is.

I don't know.

There's a tool called the efs-provisioner that you deploy to your Kubernetes cluster that just provisions persistent volumes for you given persistent volume claims, and it works great.

I just posted what I was looking at in the office hours channel.

OK. So you guys are using the efs-provisioner?

I think we have a helmfile for it too.

Thank you.

Thank you.

Good to know.

Yeah, efs-provisioner, and here's the helmfile that we use, and it works great.

We've been on it over a year.

No problems.

And so you guys are testing out Kubernetes federation?

No, we have not gone the federation route for Kubernetes, just for clarification.

When I mentioned federation, that was for Prometheus, which is unrelated.

I'll have to look into that for Prometheus, then.

Pro tip with EFS: if you're going to go into GovCloud for any reason, use GovCloud West. GovCloud East does not have EFS.

We made that mistake and it has cost us.

Well, Brian, by the way, where I pretty much learned about what this looks like for federated Prometheus is from your pal Corey Gail over at gundam, so they can definitely shed some light.

They're not doing it under Kubernetes.

They're doing it like on bare metal.

But they can tell you about why they did it.

Yeah. And I know that they just started that Prometheus initiative like maybe half a year ago.

Yeah, even that period, which is really cool.

Yeah. Cool.

So, any other parting thoughts here before we start wrapping things up for today?

Any other interesting news announcements you guys have seen?

On Hacker News: don't use, apparently, friggin, what's it called, WhatsApp.

If you're Jeff Bezos, don't use WhatsApp.

Yeah, that more then that sounds too good.

Also I thought apple backups were encrypted.

But now they're saying they're not encrypted.

I have started using Keybase heavily and I'm absolutely in love.

Awesome. But Keybase, I mean, that's great for chat and validation or whatever.

And maybe with AWS can creations and whatnot.

But it's not going to help you back up your iPhone, right?

I mean, I don't have an iPhone.

I'm using it.

I mean, Keybase has file storage.

And it even has encrypted Git repositories.

I saw that.

Yeah. What's interesting and underrated are their exploding messages.

If you're sending secrets, you can set them to be exploding messages, like Mission Impossible.

They literally explode in the UI.

I haven't seen that.

How does that work.

I mean that on your message.

Oh, yeah.

Sure enough.

Let's go.

Let's go.

I never use that.

I just don't really use it for the chat because none of my friends are on it.

For me, it's what we use in-house to send, like, some of our secrets.

Yeah. It's one of our best practices.

When we start with a new customer, I always recommend that they set up a team under Keybase for those situations where you need to share secrets that don't fit into the best practices of using a password manager or the best practices of using, like, HashiCorp Vault.

Just kind of a one off.

Yeah, for the one-off things, right?

Because a lot of these other tools don't support direct sharing of secrets to individuals in an encrypted fashion.

Just put your passwords in Slack.

What could go wrong?

Seriously they cannot get access.

I cannot get this stupid doctor provider to work or forget.

All right, everyone.

Well, on that note, I think we'll leave it.

Let it be.

We're going to wrap up office hours here.

And thanks again for sharing everything.

I always learn so much from these calls. A recording of this call is going to be posted in the office hours channel.

See you guys next week, same time, same place. Thanks, guys.

Derek