
The lack of a semantic layer and join limitations are what made me pass on Superset, but that was a couple of years ago, so maybe those features have been added since.

I built my own semantic layer instead. I use this in production at my company, but obviously use it at your own risk as it's a one-man show.

https://github.com/totalhack/zillion


This looks interesting to me, but I'd really like more detail about the architecture and deployment in the docs.

There is this:

> A final SQL query against the combined data from the DataSource Layer

> The Combined Layer is just another SQL database (in-memory SQLite by default) that is used to tie the datasource data together and apply a few additional features such as rollups, row filters, row limits, sorting, pivots, and technical computations.

But it leaves me with questions - how/when does this get populated? What other options are there besides in-memory SQLite? (I presume that's just a convenience for development and would use something else in production?)

Or is it just what Superset calls a 'metastore' i.e. data about the data, and the queries are run against the data source layer?


Zillion queries happen in two layers...

It first runs one or more queries against your DataSources in a drill-across fashion. You can think of DataSources as one or more completely separate databases: you could have one MySQL, one PostgreSQL, one DuckDB, etc., all in the same Warehouse (not saying this is common in production, just an example). Within those DataSource queries it also joins all needed tables together for you, i.e. joining multiple tables in each database to meet your required grain.

It then takes the results of all those queries and combines that data in another layer, which is currently an in-memory SQLite database. The purpose of that layer is to join the data for presentation as well as apply some additional features like rollups, technicals, formula fields, etc.
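To make that concrete, here is a rough sketch of a Zillion query following the pattern in the project README. The config path, metric, and dimension names are placeholders, so check the docs for the exact setup:

    from zillion import Warehouse

    # A Warehouse can wrap one or more DataSources (e.g. a MySQL DB and a DuckDB file).
    # The config path and field names below are placeholders.
    wh = Warehouse(config="warehouse_config.json")

    # Zillion runs drill-across queries against each DataSource that can satisfy
    # the request, joining tables within each source to reach the required grain,
    # then combines the per-source results in the in-memory SQLite layer.
    result = wh.execute(
        metrics=["revenue", "leads"],
        dimensions=["partner_name"],
        criteria=[("campaign_name", "=", "Campaign 2B")],
    )
    print(result.df)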

I'm not familiar with what Superset does under the hood or exposes as an API, so I don't know how to compare it, if there is some similar backend piece. But I suspect no part of Superset is quite the same as this, based on what its front end can do.

Happy to answer any other questions you have.


Or from a comment elsewhere in this thread about Superset:

> Superset lets you join tables within the same database. If you want to do cross-DB joins, we have a new (beta) in-memory meta-DB that lets you do this

...is it this?


I thought this was some jargon I didn't know haha.


Superset would be on my shortlist if I had to use something else, but the join limitations were part of why I passed.


Where is the data center located? The fastest response time I could get from some quick testing from the northeast US, having it output just one letter, was 670ms. Just wondering if that's an expected result, as it's on par with or slower than GPT 3.5 via API.


West Coast US. You would have been placed in our queuing system because with all the attention we are getting we are very busy right now!


Thanks! I did notice the queue count showing up occasionally but not every time. Maybe someone who has access without the queue could repeat the test so we can get an understanding of the potential latency once scaled and geo-distributed. What I'm really trying to understand is whether the time to first token is actually faster than GPT 3.5 via API, or just the rate of token output once it begins.
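For anyone who wants to separate the two, a quick way to measure time to first token vs. token throughput is to time a streaming request. A minimal sketch with the OpenAI Python client; the model name is a placeholder, and you can point base_url at any OpenAI-compatible endpoint:

    import time
    from openai import OpenAI

    client = OpenAI()  # or OpenAI(base_url=..., api_key=...) for other providers

    start = time.perf_counter()
    ttft = None
    chunks = 0
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[{"role": "user", "content": "Output a single letter."}],
        stream=True,
    )
    for chunk in stream:
        # Time the first non-empty content chunk, then count the rest for throughput.
        if chunk.choices and chunk.choices[0].delta.content:
            if ttft is None:
                ttft = time.perf_counter() - start
            chunks += 1
    total = time.perf_counter() - start
    print(f"time to first token: {ttft:.3f}s, chunks: {chunks}, total: {total:.3f}s")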


I don't know about GPT 3.5 specifically, but on this independent benchmark (LLMPerf) Groq's time to first token is also lowest:

https://github.com/ray-project/llmperf-leaderboard?tab=readm...


There’s a queueing system if too many requests are being processed at once. You may have hit that.


What are the performance implications, if any? I assume it's just translating queries as needed into the database engine language, in which case I wouldn't expect much of a performance issue.


I've also been frustrated when testing out tools that kinda keep you locked into one predetermined view, table, or set of tables at a time. I made a semantic data modeling library that puts together queries (and of course joins) for you using a drill-across querying technique, and it can also join data across different data sources in a secondary execution layer.

https://github.com/totalhack/zillion

Disclaimer: this project is currently a one-man show, though I use it in production at my own company.


zillion looks fantastic. I've wanted for a long time to build something similar for warehousing the education and social science research data I work with, but have found it difficult to spend time building infrastructure as someone currently spending most of their time in academia. What does your company do, if you don't mind me asking? Any interest in expanding into research data management? I'd love to chat with you sometime if you're at all interested... (my contact is in my profile)


Thanks! I'm not sure if I'll have the bandwidth to help yet, but interested to hear about the problem you are facing. I'll reach out this week.


Have you considered a semantic layer instead, or does it have to be a natural language interface?


Maybe you could utilize views to make your EAV schema more friendly for the LLM? Whether that's realistic depends on the specifics of your situation of course.
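For example, an EAV table can be pivoted into a wide, self-describing view so the LLM sees normal columns instead of attribute/value rows. A minimal SQLite sketch; the table and attribute names are made up:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE attributes (entity_id INTEGER, attr TEXT, value TEXT);
    INSERT INTO attributes VALUES
        (1, 'name', 'Widget'), (1, 'price', '9.99'),
        (2, 'name', 'Gadget'), (2, 'price', '24.50');

    -- Pivot the EAV rows into a wide view the LLM can query directly.
    CREATE VIEW products AS
    SELECT
        entity_id,
        MAX(CASE WHEN attr = 'name'  THEN value END) AS name,
        MAX(CASE WHEN attr = 'price' THEN value END) AS price
    FROM attributes
    GROUP BY entity_id;
    """)

    print(conn.execute("SELECT * FROM products").fetchall())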


Sounds like you are describing a semantic layer. You don't need AI to achieve that, though it is fun when it works. Example of a semantic layer I made below, but there are others out there with more support behind them.

https://github.com/totalhack/zillion


This somewhat already exists in the form of semantic layers, which if done well can handle many of the queries necessary in the course of business without business users needing to know any SQL. There will still be cases where you need direct SQL, and AI tools can help the development process there at a minimum.


Yes, there will be cases where you need SQL knowledge. There will also always be cases where knowing exactly how a database works under the hood is necessary. I think this is somewhat of a weak argument because you can always construct an example of how something may be helpful for a small group of people.

The relevant question is: How many people who work with databases need to have a lot of experience with SQL? My argument is that while the answer today is "most," the answer in a couple of years might be "very few."


Sure, AI will assist more and more in the cases where people must write SQL or manage a database, perhaps to the point you suggest.

But my point was actually that more people think they need to know SQL today than is actually the case. Excluding people that manage databases or cases that go direct to SQL for things like complex ETL, your average business user / marketer / etc should not be asked to write SQL or have to ask someone else to write SQL for them. Use a semantic layer instead with a UI on top and it's almost as easy as natural language.

Here is an example of one I made below, but there are others out there with more support behind them. At my company, and the last few I've worked for, we use this approach for ~all day-to-day querying and a chunk of backend SQL replacement.

https://github.com/totalhack/zillion
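To make that concrete, the request a business user (or a thin UI on top) ends up expressing is just a pick-list of fields, and the semantic layer works out the joins and aggregation. A sketch with made-up field names and config path:

    from zillion import Warehouse

    wh = Warehouse(config="warehouse_config.json")  # placeholder config

    # Roughly equivalent to hand-writing a multi-join GROUP BY query, but the
    # user only names the metrics and dimensions they care about.
    result = wh.execute(
        metrics=["revenue"],
        dimensions=["partner_name", "month"],
        criteria=[("date", ">=", "2024-01-01")],
    )
    print(result.df)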

