
I work in this space. We manage thousands of e2e tests. The pain has never been in writing the tests. Frameworks like Playwright get the UX right, and code editors like Cursor make writing them even easier. Now, if I could show Cursor the browser it would be even better, but that doesn’t work today since most multimodal models are too slow at understanding screenshots.

It used to be that frontend testing was very fragile: Xvfb, Selenium, ChromeDriver, and the like were constant sources of pain. But recently, frontend frameworks and browser automation have been solid; headless Chrome hardly ever lets us down.

The biggest pain in e2e testing is that tests fail for reasons that are hard to understand and debug. This is a very, very difficult thing to automate and requires AGI-level intelligence to really build a system that can go read the logs of some random service deep in our service mesh to understand why an e2e test fails. When an e2e test flakes, in a lot of cases we ignore it; I have been in other orgs where this is the case too. I wish there were a system that would follow up and generate a report that says, “This e2e test failed because service XYZ had a null pointer exception on this line,” but that doesn’t exist today. Most of the companies I’ve been at had complex enough infra that the error message never made it to the frontend, so the only place to find it was in the logs. OpenTelemetry and other tools are promising, but again, I’ve never seen good enough infra that puts that all together.

Writing tests is not a pain point worth buying a solution for, in my case.

My 2c. Hopefully it’s helpful and not too cynical.




While I agree with your primary pain point, I would argue that it really isn't specific to tests at all. It sounds like what you're really saying is that when something goes wrong, it's very difficult to determine which component in a complex system is responsible. From what you've described (and from what I've experienced as well), you would have the same problem, if not a harder one, if a user hit a bug on the front end and you had to find the root cause.

That is, I don't think a framework focused on front-end testing is where the solution to your problem should be implemented. You say "This is a very, very difficult thing to automate and requires AGI-level intelligence to really build a system that can go read the logs of some random service deep in our service mesh to understand why an e2e test fails." I would argue that what you really need is better log aggregation and distributed tracing. I'm not saying this to be snarky (at scale, with a bunch of different teams managing different components, it can be difficult to get everyone onto the same aggregation/tracing framework and practices), but that's where I'd focus, since you'll get the dividends not only in testing but in runtime observability as well.


Agreed. Is there a good tool you'd recommend for this?


It's been quite some time, but New Relic is a popular observability tool whose primary goal (at least its original primary goal, I'd say) is tying together lots of distributed systems to make request tracing and root-cause analysis easier. I was a big fan of New Relic when I last used it, but if memory serves, it was quite expensive.


"OpenTelemetry and other tools are promising, but again, I’ve never seen good enough infra that puts that all together."

It's a two-paragraph comment and you somehow missed it.


I did read it, and I don't understand why you feel the need to be an asshole.

Like I said in my comment, I do think getting everyone on the same page in a large, diverse organization is difficult. That said, it's not rocket science; it's usually difficult because there aren't organizational incentives in place to ensure teams actually prioritize making system-wide observability work.

FWIW, the process I've seen at more than one company is that people bitch about debugging being a pain, a couple of half measures get put in place to improve things, and then it finally becomes so much of a pain that they say "fine, we need to get all of our ducks in a row," execs make it a priority, and a system-wide observability process that actually works finally gets implemented.


Exactly! I've never seen a 5,000+ engineer org that has all its ducks in a row when it comes to telemetry. It's one of those things where you can't just put a single team in charge and get results; everyone has to be on the same page, which in a big org is hardly ever the case.


There are silly things that trip up e2e tests, like a cookie pop-up or a transient network failure, and an AI can plow through these in a way that a purely coded test can't.
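To make the contrast concrete, here's roughly what the purely coded version tends to look like in Playwright: an explicit if-it-shows-up-then-dismiss guard, hand-written for every known interruption. The locator and button label below are made up for illustration, not taken from any real app.

  import { test, expect, Page } from '@playwright/test';

  // Hypothetical helper: the locator and button label are invented for
  // illustration; every site needs its own variant of this guard.
  async function dismissCookieBannerIfPresent(page: Page): Promise<void> {
    const banner = page.locator('#cookie-consent');
    try {
      // Give the banner a short window to appear, but don't fail if it never does.
      await banner.waitFor({ state: 'visible', timeout: 2_000 });
      await banner.getByRole('button', { name: 'Accept' }).click();
    } catch {
      // Banner never showed up; carry on with the test.
    }
  }

  test('checkout works despite the cookie banner', async ({ page }) => {
    await page.goto('https://example.com/shop');
    await dismissCookieBannerIfPresent(page);
    await page.getByRole('link', { name: 'Cart' }).click();
    await expect(page.getByRole('heading', { name: 'Your cart' })).toBeVisible();
  });

Every one of these guards has to be anticipated in advance, which is exactly the gap the AI approach is trying to cover.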

Those types of transient issues aren't something you'd want to fail a test over, given that a human could still get the job done if they hit them in the field.

This seems like the most useful part of adding AI to e2e tests. The world is not deterministic, which AI handles well.

Uber takes this approach here: https://www.uber.com/blog/generative-ai-for-high-quality-mob...


I predict an all-out war over deterministic vs. non-deterministic testing, or at least a new buzzword for fuzzy testing. Product people understand that a cookie banner "shouldn't" prevent the test from passing, but an engineer would entirely disagree (see the rest of the conversations below).

Engineers struggle with non-deterministic output. It removes the control and "truth" that engineering is founded upon. It's going to take a lot of work (or, again, a tongue-in-cheek buzzword like "chaos testing") to get engineers to accept non-deterministic behavior.


Thanks for your thoughtful response! Agree that digging into the root cause of a failure, especially in complex microservice setups, can be incredibly time-consuming.

Regarding writing robust e2e tests, I think it really depends on the team's experience and the organization’s setup. We’ve found that in some organizations—particularly those with large, fast-moving engineering teams—test creation and maintenance can still be a bottleneck due to the flakiness of their e2e tests.

For example, we’ve seen an e-commerce team with 150+ mobile engineers struggle to keep their functional tests up-to-date while the company was running copy and marketing experiments. Another team in the food delivery space faced issues where unrelated changes in webviews caused their e2e tests to fail, making it impossible to run tests in a production-like system.

Our goal is to help free up that time so that teams can focus on solving bigger challenges, like the debugging problems you’ve mentioned.



To be fair, this is NOT the case with native mobile apps. There are projects like Detox that are trying to make e2e tests easier, but the tests themselves can be painful to write, run fairly slowly on emulators, etc.

Maybe someday the tooling for mobile will be as good as headless chrome is for web :)

Agreed, though, that the follow-up debugging of a failed test could be hard to automate in some cases.


I think we can claim that at Waldo.

Check for yourself: I've just recorded this [1] scripted test on the Wikipedia mobile app, and it yields this [2] Replay. In less than a minute we spin up a fresh virtual device, install your app on it, and execute the eight steps of the script.

As a result, you get the Replay of the session: video synchronized with the interaction timeline, plus device and network logs, so you can debug in full context.

[1]: https://github.com/waldoapp/waldo-programmatic-samples/blob/... [2]: https://share.waldo.com/7a45b5bd364edbf17c578070ce8bde220240...


Do you have any pricing info available? All I can see is "get started for free," but there's no info on what it might cost later.


I think either you're overselling the maturity of the ecosystem or I've been unfortunate enough to get stuck with the worst option out there: Cypress. I regularly run into tooling limitations and issues, only to eventually find an open GitHub issue with no solution, or some such.


Sorry if it's a stupid idea, but can't you log all messages to a separate file for each test (or attach a test ID to the messages)? Then if the test fails, you can see where the error occurred.


Where I work there are 1,500 microservices. How do I get the logs from all of those services -- filtered to just my test's requests -- into one file?

I know there are solutions for this, but in the real world I have not seen it properly implemented.


This works easily enough in the major cloud environments, since logging tends to be automatic and centralized. The only thing you need to do is make sure that a common request ID or similar propagates to all the services, which is not that difficult.
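As a rough sketch (the header name, route, and downstream URL are all hypothetical, and Express is used only as an example), the propagation piece can be as small as a middleware that reuses or mints an ID, stamps it on every log line, and forwards it on downstream calls:

  import express from 'express';
  import { randomUUID } from 'node:crypto';

  const app = express();

  // Reuse the caller's x-request-id, or mint one if this is the entry point.
  app.use((req, res, next) => {
    const requestId = req.header('x-request-id') ?? randomUUID();
    res.locals.requestId = requestId;
    res.setHeader('x-request-id', requestId);
    next();
  });

  app.get('/checkout', async (req, res) => {
    const requestId = res.locals.requestId as string;
    console.log(JSON.stringify({ requestId, msg: 'checkout started' }));

    // Forward the same ID on every downstream call (URL is hypothetical),
    // so centralized logging can stitch the whole request back together.
    const inventory = await fetch('http://inventory.internal/reserve', {
      method: 'POST',
      headers: { 'x-request-id': requestId },
    });

    console.log(JSON.stringify({ requestId, msg: 'inventory responded', status: inventory.status }));
    res.json({ ok: inventory.ok });
  });

  app.listen(3000);

An e2e test can then send its own known x-request-id and, when it fails, filter the aggregated logs on that single value instead of digging through every service by hand.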


As you said, OpenTelemetry and friends can help. I've had great success with these.
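For anyone wondering what that looks like in practice, a minimal Node-side setup is roughly the sketch below; the service name and collector endpoint are placeholders, and exact package names and options can vary by SDK version:

  // tracing.ts: loaded before the rest of the app starts
  import { NodeSDK } from '@opentelemetry/sdk-node';
  import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
  import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

  const sdk = new NodeSDK({
    serviceName: 'checkout-service', // placeholder service name
    traceExporter: new OTLPTraceExporter({
      // Point this at whatever collector/backend you run; 4318 is the default OTLP/HTTP port.
      url: 'http://localhost:4318/v1/traces',
    }),
    // Auto-instruments http, Express, gRPC, common DB clients, etc., so trace
    // context propagates across service calls without manual plumbing.
    instrumentations: [getNodeAutoInstrumentations()],
  });

  sdk.start();

The snippet itself is the easy part; the hard part is getting every team to run something like it and agree on where the traces go.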

I'm curious: what implementation issues have you encountered?


I doubt that screenshot methods are the bottleneck, considering that's the method Microsoft and Anthropic are using.


It's absolutely not the bottleneck. OpenAI can process a full-resolution screenshot in about 4 seconds.


You're totally right here, but "debugging failed tests" is a mature problem that assumes you have working tests and people to write them. Most companies don't have the resources to dedicate an engineer full-time to QA, and if they do, nobody maintains the tests.

Debugging failed tests is a "first-world problem."


> ... "debugging failed tests" is a mature problem that assumes you have working tests and people to write them.

I am reminded of an old software engineering law:

  Developers can test their solution or Customers will.
  Either way, the system will be tested.



