Hacker News
Show HN: Laminar – Open-Source DataDog + PostHog for LLM Apps, Built in Rust (github.com/lmnr-ai)
203 points by skull8888888 4 months ago | 45 comments
Hey HN, we’re Robert, Din and Temirlan from Laminar (https://www.lmnr.ai), an open-source observability and analytics platform for complex LLM apps. It’s designed to be fast, reliable, and scalable. The stack is RabbitMQ for message queues, Postgres for storage, ClickHouse for analytics, Qdrant for semantic search - all powered by Rust.

How is Laminar different from the swarm of other “LLM observability” platforms?

On the observability part, we’re focused on handling full execution traces, not just LLM calls. We built a Rust ingestor for OpenTelemetry (Otel) spans with GenAI semantic conventions. As LLM apps get more complex (think agents with hundreds of LLM and function calls, or complex RAG pipelines), full tracing is critical. With Otel spans, we can: 1. Cover the entire execution trace. 2. Keep the platform future-proof. 3. Leverage OpenLLMetry (https://github.com/traceloop/openllmetry), an amazing open-source package for span production.
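To make "cover the entire execution trace" concrete, here is a minimal, stdlib-only sketch of an agent trace as a tree of spans: one root span for the whole agent run, with child spans for an LLM call and a tool call. The `gen_ai.*` attribute keys follow the OpenTelemetry GenAI semantic conventions; the `Span` class, span names, model name, token counts and the `tool.arguments` key are hypothetical, and this is not Laminar's or OpenLLMetry's actual data model.

```python
import itertools
import time
from dataclasses import dataclass, field
from typing import Optional

_ids = itertools.count(1)

@dataclass
class Span:
    """Simplified stand-in for an OpenTelemetry span (hypothetical, not the OTel SDK)."""
    name: str
    parent_id: Optional[int] = None
    attributes: dict = field(default_factory=dict)
    span_id: int = field(default_factory=lambda: next(_ids))
    start_ns: int = field(default_factory=time.time_ns)

def build_agent_trace() -> list[Span]:
    # Root span: the whole agent run, not just a single LLM call.
    root = Span("agent.run")
    # Child span for an LLM call, annotated with GenAI semantic-convention keys.
    llm = Span(
        "chat gpt-4o",
        parent_id=root.span_id,
        attributes={
            "gen_ai.system": "openai",
            "gen_ai.request.model": "gpt-4o",
            "gen_ai.usage.input_tokens": 412,
            "gen_ai.usage.output_tokens": 96,
        },
    )
    # Child span for a plain function/tool call made by the agent (hypothetical key).
    tool = Span("tool.lookup_menu", parent_id=root.span_id,
                attributes={"tool.arguments": '{"item": "combo meal"}'})
    return [root, llm, tool]

def children_of(spans: list[Span], parent: Span) -> list[Span]:
    """Return the direct children of `parent` within one trace."""
    return [s for s in spans if s.parent_id == parent.span_id]

trace = build_agent_trace()
root = trace[0]
for child in children_of(trace, root):
    print(child.name, child.attributes.get("gen_ai.request.model", "-"))
```

Because every step carries a parent ID, a backend can reconstruct the full call tree and attach metrics to any node in it, not only to LLM calls.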

The key difference is that we tie text analytics directly to execution traces. Rich text data makes LLM traces unique, so we let you track “semantic metrics” (like what your AI agent is actually saying) and connect those metrics to where they happen in the trace. If you want to know whether your AI drive-through agent made an upsell, you can design an LLM extraction pipeline in our builder (more on that later), host it on Laminar, and let Laminar handle everything from event requests to output logging. Processing requests simply arrive as events in the Otel span.
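A minimal sketch of tying a semantic metric to a place in the trace. `detect_upsell` is a hypothetical stand-in for the hosted LLM extraction pipeline (a keyword check here, so the example runs offline), and `SpanRecord` is an illustrative type, not Laminar's API. The extracted result is recorded as an event on the span that produced the text, so the aggregate number can be traced back to where it happened.

```python
from dataclasses import dataclass, field

@dataclass
class SpanRecord:
    span_id: str
    output_text: str
    events: list = field(default_factory=list)

def detect_upsell(text: str) -> bool:
    """Hypothetical stand-in for an LLM extraction pipeline.

    A real setup would prompt an LLM ("Did the agent offer an upsell?");
    a keyword check keeps this sketch self-contained.
    """
    lowered = text.lower()
    return any(p in lowered for p in ("make it a large", "add a", "would you like"))

def annotate(span: SpanRecord) -> None:
    # Attach the semantic metric as an event on the span that produced the text.
    span.events.append({"name": "upsell", "value": detect_upsell(span.output_text)})

spans = [
    SpanRecord("a1", "Would you like to make it a large for $1 more?"),
    SpanRecord("b2", "Your total is $8.50. Please pull forward."),
]
for s in spans:
    annotate(s)

# Aggregate metric computed from the per-span events.
upsell_rate = sum(e["value"] for s in spans for e in s.events if e["name"] == "upsell") / len(spans)
print(f"upsell rate: {upsell_rate:.0%}")
```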

We think it’s a win to separate core app logic from LLM event processing. Most devs don’t want to manage background queues for LLM analytics processing but still want insights into how their Agents or RAGs are working.

Our Pipeline Builder uses a graph UI where nodes are LLM and utility functions and edges show data flow. We built a custom task execution engine that supports parallel branch execution, cycles, and conditional branches (it’s overkill for simple pipelines, but it’s extremely cool, and we’ve spent a lot of time designing a robust engine). You can also call pipelines directly as API endpoints. We found them extremely useful for iterating on and separating LLM logic. Laminar also traces pipelines directly, which removes the overhead of sending large outputs over the network.
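Laminar's engine isn't shown in the post, so the following is only a toy sketch of the general idea, assuming each node is a plain function that reads a snapshot of shared state and returns `(updates, next_node_names)`. Independent successors run in parallel via a thread pool, a router node picks a branch, and cycles are allowed but capped by a hypothetical `max_steps` budget. All node names are made up and the "LLM calls" are deterministic stubs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# A node reads a snapshot of the state and returns (updates, names of next nodes).
Node = Callable[[dict], tuple[dict, list[str]]]

def run_pipeline(nodes: dict[str, Node], start: str, state: dict, max_steps: int = 20) -> dict:
    """Toy executor: parallel fan-out, conditional routing, and bounded cycles."""
    state = dict(state)
    frontier = [start]
    steps = 0
    with ThreadPoolExecutor() as pool:
        while frontier:
            steps += len(frontier)
            if steps > max_steps:
                raise RuntimeError("step budget exceeded (runaway cycle?)")
            snapshot = dict(state)
            # Every node in the current frontier runs in parallel on the same snapshot.
            results = list(pool.map(lambda name: nodes[name](snapshot), frontier))
            next_nodes: list[str] = []
            for updates, successors in results:
                state.update(updates)
                next_nodes.extend(successors)
            # Deduplicate successors while keeping their order.
            frontier = list(dict.fromkeys(next_nodes))
    return state

def draft(s: dict):
    attempt = s.get("attempt", 0) + 1
    # Stand-in for an LLM call: the "draft" grows by five words per attempt.
    return {"attempt": attempt, "draft": " ".join(["word"] * (attempt * 5))}, ["check"]

def check(s: dict):
    # Conditional routing: loop back to "draft" (a cycle) until the draft is long
    # enough, then fan out to two branches that run in parallel.
    if len(s["draft"].split()) < 12:
        return {}, ["draft"]
    return {}, ["summarize", "classify"]

def summarize(s: dict):
    return {"summary": s["draft"][:9]}, []

def classify(s: dict):
    return {"label": "long" if len(s["draft"]) > 40 else "short"}, []

NODES = {"draft": draft, "check": check, "summarize": summarize, "classify": classify}
result = run_pipeline(NODES, "draft", {})
print(result["attempt"], result["label"], result["summary"])
```

Merging updates only after each parallel batch finishes keeps parallel branches from seeing each other's half-written state; a real engine would also need per-node error handling and timeouts.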

One thing missing from all LLM observability platforms right now is an adequate search over traces. We’re attacking this problem by indexing each span in a vector DB and performing hybrid search at query time. This feature is still in beta, but we think it’s gonna be a crucial part of our platform going forward.
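For readers unfamiliar with hybrid search, here is a small sketch of one common way to do it: rank spans once by keyword overlap and once by vector similarity, then merge the two rankings with reciprocal rank fusion (RRF). The `embed` function is a toy term-count vector standing in for a real embedding model and vector DB such as Qdrant, and the post doesn't say which fusion method Laminar uses, so RRF here is an assumption.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy embedding: L2-normalized term counts over a fixed vocabulary.

    Stand-in for a real embedding model; it captures word overlap, not meaning.
    """
    counts = Counter(tokens(text))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def keyword_rank(query: str, docs: dict[str, str]) -> list[str]:
    # Lexical signal: number of distinct query terms present in each span.
    q = set(tokens(query))
    scores = {d: len(q & set(tokens(t))) for d, t in docs.items()}
    return sorted(docs, key=lambda d: scores[d], reverse=True)

def vector_rank(query: str, docs: dict[str, str], vocab: list[str]) -> list[str]:
    # Vector signal: cosine similarity (vectors are already normalized).
    qv = embed(query, vocab)
    sims = {d: sum(a * b for a, b in zip(qv, embed(t, vocab))) for d, t in docs.items()}
    return sorted(docs, key=lambda d: sims[d], reverse=True)

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking contributes 1 / (k + rank); spans ranked well by both lists win.
    fused: Counter = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]

spans = {
    "s1": "User asked for a refund on a broken blender",
    "s2": "Agent offered an upsell to a large combo meal",
    "s3": "Tool call timed out while fetching refund policy",
}
vocab = sorted({tok for text in spans.values() for tok in tokens(text)})
query = "refund policy timeout"
results = reciprocal_rank_fusion([keyword_rank(query, spans), vector_rank(query, spans, vocab)])
print(results)
```

RRF only needs ranks, not comparable scores, which is why it is a popular way to combine a lexical index with a vector index.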

We also support evaluations. We loved the “run everything locally, send results to a server” approach from Braintrust and Weights & Biases, so we did that too: a simple SDK and nice dashboards to track everything. Evals are still early, but we’re pushing hard on them.

Our goal is to make Laminar the Supabase for LLMOps - the go-to open-source comprehensive platform for all things LLMs / GenAI. In its current shape, Laminar is just a few weeks old and developing rapidly. We’d love any feedback, or for you to give Laminar a try in your LLM projects!




Everything is LLMs these days. LLMs this, LLMs that. Am I really missing out something from these muted models? Back when it was released, they were so much capable but now everything is muted to the point they are mostly autocomplete on steroids.

How can adding analytics to a system that is designed to act like humans do any good? What is the goal here? Could you clarify why would some need to analyze LLMs out of all the things?

> Rich text data makes LLM traces unique, so we let you track “semantic metrics” (like what your AI agent is actually saying) and connect those metrics to where they happen in the trace

But why does it matter? In their current state, these are muted LLMs overseen by big companies. We have very little control over their behavior, and whatever we give them, the output will mostly be 'politically' correct.

> One thing missing from all LLM observability platforms right now is an adequate search over traces.

Again, why do we need to evaluate LLMs? Unless you are working in security, I see no purpose, because these models aren't as capable as they used to be. Everything is muted.

For context: I don't even need to prompt engineer these days because the default prompt gives similar results anyway. My prompts these days are literally three words, because that gets more of the job done than an elaborate prompt with precise examples and context.


They're not "muted". You just got used to them and figured out that they don't actually generate new knowledge or information; they only give a statistically average summary of the top Google query. (I.e., they are super bland, boring and predictable.)


LLMs are pretty bland but they don’t just summarize the top Google result. They can generate correct SQL queries to answer complex questions about novel datasets. Summarizing a search engine result does not get you anywhere close to that.

It may be fair to characterize what they’re doing as interpolative retrieval, but there’s no reason to deny that the “interpolative” part pulls a lot of weight.

P.S. Yes, reliability is a major problem for many potential LLM applications, but that is immaterial to the question of whether they're doing something qualitatively different from point lookups followed by summarization.


> They can generate correct SQL queries to answer complex questions about novel datasets.

"Correct" is a big overstatement, unless by "SQL" you mean something extremely basic and ubiquitous.


The output can be explicitly constrained to a formal syntax (see outlines.dev).

In many cases this is enough to solve some hard problems well.
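To illustrate what constraining output to a formal syntax means, here is a toy sketch of grammar-constrained greedy decoding: at each step, tokens that would make the output an invalid prefix are masked out before the highest-scoring token is picked. The "model" is a fixed, hypothetical scoring function and the "grammar" is a tiny finite set of `SELECT <column> FROM <table>` sentences; this is not the Outlines API, which applies the same kind of masking to a real model's vocabulary using compiled regexes or grammars.

```python
from itertools import product

VOCAB = ["SELECT", "DROP", "*", "name", "price", "FROM", "products", "orders", "<eos>"]
# The "grammar": every valid complete output as a token sequence.
LANGUAGE = {("SELECT", col, "FROM", table)
            for col, table in product(["name", "price"], ["products", "orders"])}

def is_valid_prefix(seq: list[str]) -> bool:
    return any(s[: len(seq)] == tuple(seq) for s in LANGUAGE)

def fake_model_scores(prefix: list[str]) -> dict[str, float]:
    """Hypothetical model with fixed preferences that ignore the prefix.

    Unconstrained, greedy decoding would pick "DROP" first.
    """
    prefs = {"DROP": 5.0, "*": 4.0, "price": 3.0, "SELECT": 2.0, "products": 1.5,
             "orders": 1.2, "FROM": 1.0, "name": 0.5, "<eos>": 0.0}
    return {tok: prefs[tok] for tok in VOCAB}

def constrained_greedy_decode(max_len: int = 10) -> str:
    out: list[str] = []
    while len(out) < max_len:
        scores = fake_model_scores(out)
        # Mask: keep only tokens that leave the output a valid prefix,
        # and allow <eos> only once the output is a complete sentence.
        allowed = {
            tok: score for tok, score in scores.items()
            if (tok == "<eos>" and tuple(out) in LANGUAGE)
            or (tok != "<eos>" and is_valid_prefix(out + [tok]))
        }
        if not allowed:
            break  # dead end: no valid continuation
        best = max(allowed, key=allowed.get)
        if best == "<eos>":
            break
        out.append(best)
    return " ".join(out)

print(constrained_greedy_decode())
```

The model's preferences still decide among valid options (here `price` over `name`), but it can never emit something outside the grammar.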


Honestly I think the reason it is “extremely basic” is because while it has been trained on “the entire internet” it doesn’t know anything about your specific database schema beyond what you provided in your prompts.

If these LLMs were cheap and easy to train (or is it fine-tune?) using your own schema and codebase on top of their existing “whole internet” training data… they could almost certainly do more than just provide “basic stuff”.

Of course, I think the training on your own personal stuff would need to be “different” somehow, so it knows that while most of its training is general, the stuff you feed it is special, and it should apply the general training as a means of understanding your personal stuff.

Or something like that. Whatever the case, it would need to be cheap, quick and easy to take a generalist LLM and supplement it with the entirety of your own personal corpus.


I found a LOT more value in personal Python-based API tools once I employed well-described JSON schemas.

One of my clients must comply with a cyber risk framework with ~350 security requirements, many of which are so poorly written that misinterpretation is both common and costly.

But there are other, better-written and better-described frameworks that include "mappings" between the two frameworks.

In the past I would take one of the vague security requirements and read the mapping to the well-described framework to understand the underlying risk, the intent of the question, and likely mitigating measures (security controls). On average, that took 45-60 minutes per question. Multiplied out, that's ~350 × 45 minutes, or around 262 hours (about 350 hours at 60 minutes each).

My first attempts to use AI for this yielded results that had some value, but lacked the quality to provide to the client.

This past weekend, using Python, Sonnet 3.5, and JSON schemas, I managed to document all ~350 questions at a quality level exceeding what I could achieve manually.

It cost $10 in API credits and approx 14 hrs of my time (I'm sure a pro could easily achieve this in under 1 hour). The code itself was easy enough, but the big improvements came from the schema descriptions. That was the change that gave me the 'aha' moment.
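The commenter doesn't share their code, so this is only a sketch of the general technique: a JSON schema whose `description` fields spell out exactly what each field should contain, embedded in the prompt, plus a small stdlib-only check on the model's JSON reply. The field names, `build_prompt`, and the sample reply are hypothetical; a real setup would more likely validate with a library such as `jsonschema` or use a provider's structured-output / tool-use feature.

```python
import json

# The descriptions do the heavy lifting: they tell the model what each field means.
REQUIREMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "requirement_id": {"type": "string",
                           "description": "ID of the vague requirement being interpreted, copied verbatim."},
        "underlying_risk": {"type": "string",
                            "description": "The risk the requirement addresses, per the mapped control in the better-described framework."},
        "intent": {"type": "string",
                   "description": "One or two sentences on what an auditor is actually asking for."},
        "mitigations": {"type": "array", "items": {"type": "string"},
                        "description": "Concrete security controls that would satisfy the requirement."},
    },
    "required": ["requirement_id", "underlying_risk", "intent", "mitigations"],
}

PY_TYPES = {"string": str, "array": list, "object": dict}

def build_prompt(requirement: str, mapping: str) -> str:
    return (
        "Interpret the requirement using the mapped framework text.\n"
        f"Requirement: {requirement}\nMapping: {mapping}\n"
        "Reply with JSON only, matching this schema:\n"
        + json.dumps(REQUIREMENT_SCHEMA, indent=2)
    )

def validate(reply: str) -> dict:
    """Minimal check: required keys are present with the right top-level types."""
    data = json.loads(reply)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    for key in REQUIREMENT_SCHEMA["required"]:
        if key not in data:
            raise ValueError(f"missing field: {key}")
        expected = PY_TYPES[REQUIREMENT_SCHEMA["properties"][key]["type"]]
        if not isinstance(data[key], expected):
            raise ValueError(f"wrong type for {key}")
    return data

# A made-up model reply, for illustration only.
reply = json.dumps({
    "requirement_id": "REQ-042",
    "underlying_risk": "Brute-force attacks on login pages",
    "intent": "Limit repeated failed logins.",
    "mitigations": ["Account lockout after 5 failed attempts"],
})
print(validate(reply)["mitigations"])
```

Rejected replies can simply be retried, which is cheap compared to a human re-reading every answer.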

I read over the final results for dangerous errors (and ended up changing nothing at all), but just in case, I also ran the results through GPT-4o, which found no issues that would prevent sending them to the client.

I would never have gotten that job done manually; it's simply too much of a grind for a human to do cheaply or reliably.


Have you tried BAML (https://github.com/boundaryml/baml)? It's really good at structured output parsing. We integrated it directly into our pipeline builder.


Not yet, but the weekend is just beginning. Thanks for the tip.


(BAML founder here) feel free to jump on our Discord or email us if you have any issues with BAML! Here's our repo (with docs links) https://github.com/BoundaryML/baml and a demo: https://boundaryml.wistia.com/medias/5fxpquglde

People have used it to do anything from simple classifications to extracting giant schemas.


You are welcome! The easiest way to get started with BAML on Laminar is with our pipeline builder and Structured Output template. Check out the docs here (https://docs.lmnr.ai/pipeline/introduction)


Hey there, apologies for the late reply.

> Could you clarify why would some need to analyze LLMs out of all the things?

When you want to understand trends in the output of your Agent / RAG at scale, without manually looking at each trace, you need another LLM to process the output. For instance, say you want to know the most common topic discussed with your agent. You can prompt another LLM to extract this info; Laminar will host everything and turn this data into metrics.
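A small sketch of turning per-trace extractions into a metric, as described above. `extract_topic` is a hypothetical placeholder for the "another LLM" step (a real version would prompt a model with a fixed list of topics); here it is a keyword lookup so the sketch runs offline. The aggregation at the end is the part that would become a dashboard metric.

```python
from collections import Counter

def extract_topic(agent_output: str) -> str:
    """Hypothetical stand-in for an LLM call that assigns one topic to a trace."""
    text = agent_output.lower()
    for topic, keywords in {"billing": ("invoice", "charge"),
                            "shipping": ("delivery", "tracking"),
                            "returns": ("refund", "return")}.items():
        if any(k in text for k in keywords):
            return topic
    return "other"

def topic_metrics(outputs: list[str]) -> dict[str, float]:
    counts = Counter(extract_topic(o) for o in outputs)
    total = sum(counts.values())
    # Share of traces per topic, most common first.
    return {topic: n / total for topic, n in counts.most_common()}

traces = [
    "Your tracking number is 1Z999.",
    "I've issued a refund to your card.",
    "The delivery is scheduled for Friday.",
    "Happy to help with anything else!",
]
print(topic_metrics(traces))
```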

> Why do we need to evaluate LLMs?

You're right: devs who evaluate the output of their LLM apps truly care about quality or some other metric. For those cases, evals are invaluable. Good examples would be AI drive-through agents or AI voice agents for mortgages (use cases we've seen on Laminar).


Topic modelling and classifications are real problems in LLM observability and evaluation, glad to see a platform doing this.

I see that you have chained prompts. Does that mean I can define agents and functions inside the platform without having them in the code?


Yes! Our pipeline builder is pretty versatile. You can define conditional routing, parallel branches, and cycles. Right now we support LLM nodes and util nodes (e.g. a JSON extractor). If you can define your logic purely with those nodes (and in the majority of cases you will be able to), then great, you can host everything on Laminar! You can follow this guide (https://docs.lmnr.ai/tutorials/control-flow-with-LLM); it's a bit outdated but gives you a good idea of how to create and run pipelines.


> Everything is LLMs these days. LLMs this, LLMs that. Am I really missing out something from these muted models? Back when it was released, they were so much capable but now everything is muted to the point they are mostly autocomplete on steroids.

That was my experience too, then I tried out that Cursor thing, and it turns out a well-designed UX around Claude 3.5 is the bee's knees. It really does work; I highly recommend the free trial. YMMV of course depending on what you work on; I tested it strictly on Python.


LLMs and python don't sound good together.


You're thinking about consumer use cases. Commercial use cases are not "muted" by any means. The goal is to produce domain-specific JSON when fed some contextual data, and LLMs have only gotten better at doing so over time.


I’m always game for an LLM observability platform that is potentially affordable, at least during the early phases of development.

I was using DD at work and found it to be incredibly helpful but now that I am on my own, I am much more price sensitive.

Still, having a low friction way to see how things are running, check inputs/outputs is a game changer.

One challenge I have run into is a lack of support for Anthropic models. The platforms that do have support are missing key pieces of info like the system prompt. (Prob a skill issue on my end).

Also, they all seem to be tightly coupled to LangChain etc., which is a no-go.

Will check this out over the next week or two. Very exciting!


Totally agree, observability is a must for LLM apps. We wanted to build something of extremely high quality that is also affordable for solo devs; that's why it's open-source and why our managed version has a very generous free tier.

Regarding Anthropic instrumentation, we support it out of the box! You don't even need to wrap anything: just initialize Laminar and you should see detailed traces. We also support images! Hit me up at robert@lmnr.ai if you need help onboarding or setting up the local version.


How will you distinguish Laminar as "the Supabase for LLMOps" from the many LLM observability platforms already claiming similar aims? Is the integration of text analytics into execution traces your secret sauce? Or, could this perceived advantage just add complexity for developers who like their systems simple and their setups minimal?


Hey there! Good question. Our main distinguishing features are:

* Ingestion of Otel traces

* Semantic events-based analytics

* Semantically searchable traces

* High performance, reliability and efficiency out of the box, thanks to our stack

* High quality FE which is fully open-source

* LLM pipeline manager, the first of its kind, highly customizable and optimized for performance

* Ability to track the progress of locally run evals, combining the full flexibility of running code locally with no need to manage data infra

* Very generous free tier. Our infra is so efficient that we can accommodate a large number of free-tier users without scaling it much.

And many more to come in the coming weeks! One of our biggest next priorities is high-quality docs.

All of these features can be used as standalone products, similar to Supabase. So devs who prefer to keep things lightweight might just use our tracing solution and be very happy with it.


Why are SaaS products all moving to a pricing model that’s $0, $50, Custom? What about a $5 or $10 plan… or maybe a sliding scale where you pay for what you use?


Hey there, we priced it that way because of our very generous free tier. Your suggestion of usage-based pricing also makes sense. In our case it might be something like: once you pass 50k spans, you pay something like $0.001 (not a final number) per span. Would you imagine something like this?
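A quick sketch of the usage-based scheme floated above, using the reply's own illustrative numbers (50k free spans, then $0.001 per span, explicitly not final). `monthly_bill` is a hypothetical helper, and `Decimal` avoids floating-point rounding in money math.

```python
from decimal import Decimal

FREE_SPANS = 50_000                 # illustrative threshold from the reply above
PRICE_PER_SPAN = Decimal("0.001")   # "not a final number"

def monthly_bill(spans: int) -> Decimal:
    """Pay only for spans beyond the free allowance."""
    billable = max(0, spans - FREE_SPANS)
    return billable * PRICE_PER_SPAN

for usage in (10_000, 50_000, 120_000, 1_000_000):
    print(f"{usage:>9,} spans -> ${monthly_bill(usage):.2f}")
```

With these numbers, 120k spans in a month would cost $70 and 1M spans $950, with no jump at a tier boundary.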


To my eye this looks quite a bit more serious and useful than the naive buzzword bingo test would suggest.

I really like the stack these folks have chosen.


Thank you! We thought a lot about what would make a great title but couldn't find anything else that would convey info as densely as the current title. We also love our current stack :). I think Rust is the perfect language to handle span ingestion, and it marries perfectly with the rest of our stack.


Langtrace core maintainer here. Congrats on the launch! We are building OTEL support for a wide range of LLMs, vectorDBs and frameworks - crewai, DSPy, langchain etc. Would love to see if the langtrace’s tracing library can be integrated with Laminar. Also, feel free to join the OTEL GenAI semantic working committee.


Thank you! If Langtrace sends Otel spans over HTTP or gRPC, we can ingest them! How would one join the OTel GenAI committee?
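For anyone wondering what "sending Otel spans over HTTP" looks like on the wire, here is a minimal stdlib sketch that builds an OTLP/HTTP JSON export request for one span. The field names follow the OTLP JSON encoding (lowerCamelCase keys, hex-encoded trace and span IDs, the `/v1/traces` path). The endpoint `http://localhost:4318/v1/traces` is the conventional local collector address and an assumption, not Laminar's ingestion URL; auth headers are omitted, and the service and scope names are hypothetical. The request is built but not sent.

```python
import json
import os
import time
import urllib.request

def attr(key: str, value) -> dict:
    """Encode one attribute as an OTLP AnyValue."""
    if isinstance(value, bool):
        return {"key": key, "value": {"boolValue": value}}
    if isinstance(value, int):
        return {"key": key, "value": {"intValue": str(value)}}  # int64 as a JSON string
    return {"key": key, "value": {"stringValue": str(value)}}

def build_export_request(endpoint: str = "http://localhost:4318/v1/traces") -> urllib.request.Request:
    now = time.time_ns()
    span = {
        "traceId": os.urandom(16).hex(),  # 32 hex chars
        "spanId": os.urandom(8).hex(),    # 16 hex chars
        "name": "chat gpt-4o",
        "kind": 3,                        # SPAN_KIND_CLIENT
        "startTimeUnixNano": str(now - 1_500_000_000),
        "endTimeUnixNano": str(now),
        "attributes": [
            attr("gen_ai.system", "openai"),
            attr("gen_ai.request.model", "gpt-4o"),
            attr("gen_ai.usage.output_tokens", 96),
        ],
    }
    payload = {"resourceSpans": [{
        "resource": {"attributes": [attr("service.name", "my-llm-app")]},
        "scopeSpans": [{"scope": {"name": "example"}, "spans": [span]}],
    }]}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_export_request()
# urllib.request.urlopen(req) would send it to a running OTLP/HTTP receiver.
print(req.get_method(), req.full_url)
```

In practice an OpenTelemetry SDK exporter produces this payload (usually protobuf-encoded rather than JSON), which is why any OTel-emitting library can in principle feed an OTLP-compatible backend.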


> One thing missing from all LLM observability platforms right now is an adequate search over traces.

Why did you decide to build a whole platform and include this feature on top, rather than adding search to (for example) Grafana Tempo?


Valid point. For us searchable and especially semantically searchable, traces / spans really make sense only in the context of tracing LLM apps. We view it as a powerful feature, but primarily in the context of an AI/LLM-native observability platform. For us, the ultimate goal is to build a comprehensive platform with features that are extremely useful for observability and development of LLM/GenAI apps.


>For us searchable and especially semantically searchable, traces / spans really make sense only in the context of tracing LLM apps.

I know LLM is the new shiny thing right now. Why is semantic search of traces only useful for LLMs?

I've been working in CI/CD and at a large enough scale, searchability of logs was always an issue. Especially as many tools produce a lot of output with warnings and errors that mislead you.

Is the search feature only working in an LLM context? If so why?


> warnings and errors that mislead you

Now that you mention it, it really makes sense. I guess what I was getting at is that when you have really rich text (in your case, error descriptions), semantic search over it is a must-have feature.

But you are right, being an output of LLM is not a requirement.


I think you might find that gets prohibitively expensive at scale. There are various definitions of "semantic", from building indexes on OTel semantic conventions all the way to true semantic search over data in attributes. I'd be curious how you're thinking about this at the scale of several million traces per second.


Hey there, by semantic we mean embedding text and storing it in the vector DB. Regarding scale, we thought a lot about it, and that's why we process spans in a background queue. The tradeoff is that indexing / embedding will stop being real-time as scale grows. We will also use tiny embedding models, which keep getting better.
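A minimal sketch of the tradeoff described above: ingestion only enqueues spans and returns immediately, while a background worker embeds and indexes them, so search lags behind ingestion under load. The `embed` function and the in-memory `index` dict are hypothetical placeholders for a small embedding model and a vector DB, and a production system would use a durable broker (the post mentions RabbitMQ) rather than an in-process `queue.Queue`.

```python
import queue
import threading

span_queue = queue.Queue()
index: dict[str, list[float]] = {}
index_lock = threading.Lock()

def embed(text: str) -> list[float]:
    """Placeholder for a tiny embedding model: simple counts, not real semantics."""
    return [float(sum(c.isalpha() for c in text)),
            float(sum(c.isdigit() for c in text)),
            float(len(text.split()))]

def ingest(span_id: str, text: str) -> None:
    # Hot path: cheap and non-blocking for the caller; embedding happens later.
    span_queue.put((span_id, text))

def indexing_worker() -> None:
    while True:
        item = span_queue.get()
        if item is None:  # sentinel: shut down
            span_queue.task_done()
            break
        span_id, text = item
        vector = embed(text)
        with index_lock:
            index[span_id] = vector
        span_queue.task_done()

worker = threading.Thread(target=indexing_worker, daemon=True)
worker.start()
for i in range(5):
    ingest(f"span-{i}", f"agent step {i} completed")
span_queue.join()  # wait for the backlog; until then, search is only eventually up to date
span_queue.put(None)
worker.join()
print(sorted(index))
```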


Awesome launch! Just curious, what does the "run everything locally, send results to a server" approach mean and why do you love it?


Thank you! By that we mean that all processing (the data, the forward run, the evaluator run) is done locally in, let's say, a Jupyter notebook. While it runs locally, it sends all the run data / stats to the server for storage and progress tracking.

We love it because we tried putting things into the UI, but found that much more limiting than letting users design evals and run them however they want.
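A minimal sketch of the "run everything locally, send results to a server" pattern described above. The dataset, `my_app` (the forward run), and `exact_match` (the evaluator) all run in the local process; only per-datapoint results are shipped out. `ResultSink` is a hypothetical stand-in for the SDK client that sends results to the server (here it just buffers JSON records), and none of these names are Laminar's actual API.

```python
import json
import time
from dataclasses import dataclass, field

@dataclass
class ResultSink:
    """Hypothetical stand-in for an SDK client that ships eval results to a server."""
    sent: list = field(default_factory=list)

    def send(self, record: dict) -> None:
        # A real client would POST this JSON to the backend; here we just keep it.
        self.sent.append(json.dumps(record))

def my_app(question: str) -> str:
    # Forward run: your own LLM app, executed locally (stubbed here).
    return {"2+2": "4", "capital of France": "Paris"}.get(question, "unknown")

def exact_match(output: str, target: str) -> float:
    # Evaluator: also runs locally, so it can be any Python you like.
    return float(output.strip().lower() == target.strip().lower())

def run_eval(dataset: list[dict], sink: ResultSink) -> float:
    scores = []
    for i, row in enumerate(dataset):
        output = my_app(row["input"])
        score = exact_match(output, row["target"])
        scores.append(score)
        # Results are streamed as they happen, so a dashboard can track progress live.
        sink.send({"index": i, "input": row["input"], "output": output,
                   "score": score, "ts": time.time()})
    return sum(scores) / len(scores)

dataset = [
    {"input": "2+2", "target": "4"},
    {"input": "capital of France", "target": "paris"},
    {"input": "capital of Peru", "target": "Lima"},
]
sink = ResultSink()
print(run_eval(dataset, sink))
```

Because the server only stores results, swapping in a different app, dataset, or evaluator never requires changes on the platform side.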


Does it do event sourcing like Inngest, where I can do the “saga pattern”?


You mean like triggering another processing pipeline from the output of current processing pipeline?


How do you compare to say, Langfuse?


Hey there, answered similar question here https://news.ycombinator.com/item?id=41453674

We really like langfuse, the team and the product.

Compared to it:

* We send and ingest Otel traces with GenAI semconv

* Provide semantic-event based analytics - you actually can understand what's happening with your LLM app, not just stare at the logs all day.

* Laminar is built to be high-performance and reliable from day 0, easily ingesting and processing spikes of 500k+ tokens per second

* Much more flexible evals, because you execute everything locally and simply store the results on Laminar

* Go beyond simple prompt management and support Prompt Chain / LLM pipeline management. Extremely useful when you want to host something like Mixture of Agents as a scalable and trackable micro-service.

* Searchable trace / span data (not released yet)


Hey, looks cool. Trying to understand the prompt management a bit better. Is it like a GUI that publishes to an API?


Thank you! Yes, our pipeline builder is a way to build LLM pipelines and expose them as API endpoints! All hosted and managed by Laminar.


Looks cool, I wish I'd had this when I started YC.


Thank you!


[flagged]


Turns out there are quite a few companies / projects named Laminar. I really like the name, but couldn't buy a .com or .ai domain, so we settled on lmnr.ai. But it's been growing on me.


Lmnop would have had a ring to it with laminar ops!



