Show HN: PlayBooks – Jupyter Notebooks style on-call investigation documents

chasinglogic · 2024-06-05T08:21:13 1717575673

Whenever I see tools like this I always think "that wouldve been great at my old job where we didn't do post mortems"

But nowadays I think if I can automate a runbook can I not just make the system heal itself automatically? If you have repeated problems with known solutions you should invest in toil reduction to stop having those repeated problems.

What am I missing? I think I must be missing something because these kinds of things keep popping up.

everforward · 2024-06-05T15:53:17 1717602797

A lot of on call teams lack the capability to do that automation, either because ops takes the pages and can't code (or can't code well enough) or because dev takes the pages and have no access or knowledge about the infra APIs they could use for self-healing.

These platforms can form a sort of "common ground" where dev can see the infra APIs and the "code" is simple enough for ops people that don't code to rig stuff up.

I don't think these platforms are built for the kind of places where being able to write a Python script to query logs from CloudFront are just table stakes for all ICs regardless of role.

ionrock · 2024-06-05T16:02:39 1717603359

Writing post-mortems is generally pretty kludgy. You might have a Slack bot that records the big picture items, but ideally, a post-mortem would include connections to the nitty-gritty details while maintaining a good high-level overview. The other thing most post-mortems miss is communicating the discovery process. You'll get a description of how an engineer suspected some problem, but you rarely get details as to how they validated it such that others can learn new techniques. At a previous job, I worked with a great sysadmin/devop who would go through a concise set of steps when debugging things. We all sat down as a team, and he showed us the commands he ran to confirm transport in different scenarios. It was an enlightening experience. I talked to him and other DevOps folks about Rundeck, and it was clear that the problem isn't whether something can be automated, but rather whether the variables involved are limited enough to be represented in code. When you do the math, the time it would take to write code to solve some issues is not worth the benefit.

Iterating on the manual work to better communicate and formalize the debugging process could fit well into the notebook paradigm. You can show the scripts and commands you're running to debug while still composing a quality post-mortem, as the incident is happening where things are fresh.

The other thing to consider is how often you get incidents and how quickly you need to get people up to speed. In a small org, devs can keep most things in their head and use docs, but when things get larger, you need to think about how you can offload systems and operational duties. If a team starts by iterating on operational tasks in Notebooks, you can hand those off to an operations team over time. A quality, small operations team can take on a lot work and free up dev time for optimizations or feature development. The key is that devs have a good workflow to hand off operational tasks that are often fuzzier than code.

The one gotcha with a hosted service IMO is that translating local scripts into hosted ones takes a lot of work. On my laptop, I'm on a VPN and can access things directly, where you need to figure out how to allow a 3rd party to connect to production backend systems. That can be a sticky problem that makes it hard to clarify the value.

TheBengaluruGuy · 2024-06-05T10:08:21 1717582101

> if I can automate a runbook can I not just make the system heal itself automatically

The runbooks are still codified by a human in the current scenario. We are experimenting with some data to see if we can generate accurate runbooks for different scenarios but haven't found much luck with it yet. I do think that some % of issues will be abstracted in near future with machines doing the healing automatically.

> you should invest in toil reduction to stop having those repeated problems.

Most teams I speak to say that they try their best to avoid repeating the same issue again. Users typically use PlayBooks for:

(a) A generic scenario where you have an issue reported / alerted and you are testing 3-4 hypotheses / potential failure reasons at once.

(b) You want to run some definitive sequence of steps.

vvoruganti · 2024-06-04T16:46:06 1717519566

This is really cool! Love seeing more tools to help SREs and hopefully lessen the burden of on calls.

The notebook style interface for logging and taking notes is appealing too.

Seen a similar approach with https://fiberplane.com/

Haven't been able to play around too much but watching the space

TheBengaluruGuy · 2024-06-04T16:54:25 1717520065

Thank you.

If you get a chance to play around, would love to hear your thoughts on it :)

debarshri · 2024-06-04T20:49:19 1717534159

Reminds me of Rundeck and the time we were trying to build something similar. There are more modern take like fiberplane and moment.dev. Not sure about their adoption.

At one point, we were building something like this on top of kubernetes. I think tech is the easy part here. Getting people to leave their existing workflows and use your product is hard.

Secondly, difficult part of our journey was integrations. Until you have integrated all the tools an org uses, product is useless.

Thirdly, it is great that there are building blocks, but users understand use cases. So, expecting end users to build playbooks themselves is tricky. There has to be an intrinsic motivation within the platform.

Fourthly, it is super competitive space if you see it from an internal tool building perspective. There are lot of internal tool builders like appsmith, retool, tooljet, django admin you are competing with where you could run bash scripts, sql queries etc.

Best of luck, with you journey.

cdchn · 2024-06-05T21:06:54 1717621614

I was looking at using moment.dev for a very similar (internal) application but the lift of using TypeScript and how the whole tool worked was very daunting. Having a simple Jupyter notebook interface (in Python) is much more approachable for a devops background.

debarshri · 2024-06-05T23:29:05 1717630145

In my experience, getting devops and infrastructure engineers to use Jupyter notebook specifically for SRE stuff is hard. What is working for is, in our new pivot is, you have to meet where the engineers are at. It could be Jetbrain's tools, VSCode or terminal. Otherwise the lift is always too much. In my opinion, Jupyter way might be better but still not good enough to cross over.

delano · 2024-06-04T21:35:12 1717536912

If it works like Jupyter, as a file that can be version controlled, and like Deepnote where multiple people can be viewing/working on it at the same time, my mind would be blown.

samuelstros · 2024-06-04T21:55:15 1717538115

here, be blown away https://github.com/opral/monorepo/tree/main/lix

solving version control for files like jupyter notebooks brings collaboration to those files without the need to give up files in favor of the cloud. playbooks could leverage lix in 1-2 years to build a file-based version of their tool

TheBengaluruGuy · 2024-06-05T03:32:11 1717558331

this is quite interesting. I'll surely keep it in mind while we build out deeper collaborative features.!

delano · 2024-06-05T16:54:14 1717606454

Wow, yeah. "Bringing backend features to files."

This feels a bit like that time we saw Etherpad playback for the first time. I'm just not sure if I've grokked the big picture yet.

https://news.ycombinator.com/item?id=495336

samuelstros · 2024-06-05T19:59:06 1717617546

big picture is that cloud-based apps/saas is getting disrupted.

there is no value in a cloud-based solution that locks users and customers in if collaboration can be solved in the data (file) level. turns out that version control solves collaboration on the data level and is awesome to build apps.

mcintyre1994 · 2024-06-04T22:05:16 1717538716

You might also like Elixir Livebook! :) https://livebook.dev/

TheBengaluruGuy · 2024-06-05T03:30:57 1717558257

Thanks for your feedback.

> as a file that can be version controlled

PlayBooks are created using a UI and all state changes are tracked but we currently don't support moving back to a previous version of the PlayBook.

> where multiple people can be viewing/working on it at the same time

This is currently picked wherein we will be creating sessions for each time PlayBooks are run and sessions will have the data persistent in each cell for everyone with the link to see.

lcfcjs6 · 2024-06-04T20:10:11 1717531811

This is awesome, i've seen so many static runbooks (like confluence) and SREs will scan it once, not find what they need and then go wake up a senior dev. Pre-programmed scripts could go a long way in giving the SRE the ability to go that extra step, which could be vital to solving the problem faster.

TheBengaluruGuy · 2024-06-05T03:35:22 1717558522

Yes, we also support webhook based triggers so investigations can get initiated even before the SRE is on the laptop and by the time they reach there, they receive a summary upfront.

everforward · 2024-06-05T16:02:42 1717603362

Isn't that already possible via normal Python scripts? I've worked a couple places where dev had a "don't wake us up" script that was programmed to detect known and common issues and either fix it or offer recommendations on next steps (including a couple of code paths that led to a "page everyone, immediately and repeatedly").

From the SRE side, far and away the most common reason I end up paging devs is because the issue is somewhere deep inside the system and I lack that depth. I'm supporting half a dozen services and can't keep track of the churn that happens at high enough granularity. Eg I know the app's downstreams and most of its upstreams, but if there's an issue with a particular field in an API response I'm unlikely to know whether that field comes from our database, a downstream, summoned by voodoo, etc.

Still interesting to see, I'd love to be proven wrong.

shanemhansen · 2024-06-05T01:51:45 1717552305

I saw this used from time to time at Google. There were occasional utility SRE notebooks (colabs). Also the cloud support team seemed to make more use of them.

bckr · 2024-06-04T19:47:07 1717530427

Great to see this launch! I’m looking forward to trying this when our startup is a bit more mature.

taeric · 2024-06-04T18:08:30 1717524510

Reminds me of https://nathanielhoag.com/blog/2022/interactive-runbook/. Fun space to play in. Good luck on this!

TheBengaluruGuy · 2024-06-04T18:21:19 1717525279

This is quite an interesting evaluation, thanks for sharing. We are piloting with a large enterprise (100+ SREs). Before us, they started implementing Jupyter Notebooks in a similar direction.

Writing one playbook in Jupyter is easy but creating a framework to enable their 100+ product teams to self-serve and create playbooks has been so intensive for them, they even started working on their internal SDK for it.

It was a lot of code and the lead felt like the Jupyter visual interface was harder to follow for instructions/runbooks.

With PlayBooks, we have tried to abstract out the entire execution engine and configuration to a intuitive user experience (our architecture is explained here -- https://slender-resolution-789.notion.site/PlayBooks-Documen... )

anonme4ever · 2024-06-04T19:28:54 1717529334

You should check out Nurtch[0] with Rubix integration[1]. Gitlab have some docs on how to use it[2].

Your project seems nice! I'll give it a try ;-) Only thing, the Jupiter-like part is not clear enough.

0: https://www.nurtch.com/

1: https://docs.nurtch.com/en/latest/rubix-library/index.html

2: https://docs.gitlab.com/ee/user/project/clusters/runbooks/

TheBengaluruGuy · 2024-06-04T19:58:38 1717531118

Thanks for sharing about Nurtch & Rubix, I have come across it before in the Gitlab Runbooks.

The Jupyter part is reference to the cellular execution of tasks as per the preference of the users + being able to get execution / code next to each other. Both have been design principles for us from the get-go.

Just like how variables can be reused across cells in Jupyter, we plan to shortly introduce rules / conditionals creating interdependencies between variables in the PlayBooks steps.

Edit: Adding the a sample Playbook link here for reference -- https://sandbox.drdroid.io/playbooks/14

dennisy · 2024-06-04T20:36:11 1717533371

This is a great idea! But I feel better served by an existing workflow tool, such as Airflow?

TheBengaluruGuy · 2024-06-05T03:44:31 1717559071

I'd like to get a bit more context on what you're thinking. How would Airflow help SRE teams with on-call investigations?

perpil · 2024-06-04T18:10:51 1717524651

I like the integration with slack and the inline execution of steps. I've been working on a similar product with https://speedrun.cc but it just piggybacks on GitHub markdown and most of the execution is done via a deeplink. Reach out if I can help, I've been messing around in this space for awhile.

TheBengaluruGuy · 2024-06-04T18:26:09 1717525569

Slack has become so central to every on-call investigation, that it was like a dealbreaker for my cofounder, Dipesh, to have a fully functional Slack workflow in our MVP.

I did come across Speedrun a while back and was planning to give it a spin. Thanks for dropping a note, I'll drop you a mail sometime in the near future to discuss more on the topic. :)

pimlottc · 2024-06-05T12:39:07 1717591147

Feedback on the sample playbook:

- The “rename step” functionality is not intuitive. I expecting tapping on the step name to “unfold” the step and show me the full details, not start the renaming process. After tapping it, I still didn't realize what was happening; i thought perhaps it had executed the step, which the check mark indicating completion or success. It wasn’t clear that it was an input box since it didn’t have focus, and it wasn’t clear that the check mark was a button.

I would have guessed that the pencil icon perhaps was the rename action, though it still did not put focus on the input box. There shouldn't be a second step needed to focus the input box.

- It’s not clear what defines the “type” of each step; eg whether it’s a log filter, or dh query, or shell command, etc. It seems like it’s the “Data” field, although the name doesn’t make much sense. The field does not seem to be editable; I would have expected it to be a dropdown list with other possible step types listed. If it is intended not to be changeable, then it probably shouldn’t be an input element. There’s a “reload”(?) icon next to it, but I have no idea what that does.

TheBengaluruGuy · 2024-06-05T13:11:28 1717593088

> “rename step” functionality is not intuitive.

We deployed the change to make it intuitive (similar to what is suggested) yesterday. It's still in integration branch so awaiting merging in main on this.

> i thought perhaps it had executed the step, which the check mark indicating completion or success.

Noted.

> It wasn’t clear that it was an input box since it didn’t have focus, and it wasn’t clear that the check mark was a button.

Noted.

> It seems like it’s the “Data” field, although the name doesn’t make much sense.

It is indeed a dropdown list but we had hard-coded it for sandbox so user can't change the source of an existing step. It is changeable when you host your own version or when you add a new step in sandbox.

> There’s a “reload”(?) icon next to it, but I have no idea what that does.

In case user decides to add a new source on-the-go (say in another tab), reload helps fetch the same list again.

Overall, I do understand that some parts of it are unintuitive and is a focus area for us to improve it asap.

alaintno · 2024-06-05T07:24:38 1717572278

It would be so cool to also have access to GCP resources!

Great job nonetheless!

TheBengaluruGuy · 2024-06-05T07:26:49 1717572409

Connecting to GKE for k8s events/deployment info is WIP, we plan to pick up Stack Driver too soon.

ystad · 2024-06-04T18:45:12 1717526712

Nice. Similar solution https://github.com/1xyz/pryrite

Shubham_Bhard · 2024-06-05T06:29:51 1717568991

Great! I love ChatGPT but have found it has limited utility when I am trying to debug/resolve issues which involve intricate business/domain/customer logic and modelling. This seems to provide me the solution! Thanks folks!