Sort of off topic, but every cool BI tool I can find (prompted by recently seeing metabase and AWS quicksight) seems to work with one database at a time.
When working with microservices, your data is spread across some N databases, rendering most BI tools, from what I can see, completely useless for reporting on more than one service at a time.
Are there any solutions for this? Or is the only one right now just dumping all your service DBs into a single DB for analytics?
edit: thanks friends. ETL it is. don't know why I thought it would make sense to have a tool that reports across multiple databases since the performance would be horrific...although maybe there's space for a hosted BI tool that does ETL automagically. just a thought
More precisely, it is a common pattern to split the analytics into three parts: 1) collecting the data (using lots of adapters) into a "data lake", 2) filtering and preprocessing that data lake into a uniform data structure whose shape is determined by the analysis goal rather than the operational system, 3) analyzing that uniform data with statistical and other tools.
It usually makes no sense to combine those phases into a single overall tool. First, these are very different tasks, for which different specialized tools will evolve anyway. Second, you want to keep the intermediate results - for caching as well as for an audit trail and reproducibility of the results.
For example, you don't want the performance of the operational system to be affected by how many analysis tools are hitting it at any given moment. Also, you don't want to work on a constantly changing dataset while fine-tuning your analysis methods.
Thanks for your input. The point of a BI tool is to allow flexible analysis of your data. What types of transformations are generally required for making this work without actually _doing_ the analysis during the warehousing step?
The data lake contains datasets in their original structure, raw data, with all warts and stuff, defined by the operational systems.
The data warehouse contains datasets in a unified (and usually simplified and reduced) form, defined by the needs of analysis tools.
If your analysis tool accesses the data lake directly, it will almost certainly contain "parsers" for various operational data formats. Also, it will perform those transformations over and over again, every time it runs. And multiple analysis scripts may contain multiple versions of those parsers. The idea is to separate these "parsers" out of the analysis step and to "cache" the cleaned-up intermediate result. That "cache of clean data" is usually called a "data warehouse", and since you can create good indexes on that data, multiple runs of your analysis tools get very fast access.
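To make that concrete, here's a minimal sketch of the idea (in Python, with made-up file, table and field names): the raw export from the lake is parsed once, written to an indexed warehouse table, and every analysis script just queries the clean table.

```python
import json
import sqlite3

# Hypothetical raw export sitting in the "data lake": one JSON object per line,
# in whatever shape the operational service happened to write it.
RAW_EVENTS = "lake/orders_service/2015-10-20.jsonl"

def parse_raw_order(line):
    """The "parser": all knowledge of the operational format lives here, once."""
    rec = json.loads(line)
    return (
        rec["order"]["id"],
        rec["order"]["customer"]["id"],
        float(rec["order"]["total_cents"]) / 100.0,  # normalize units
        rec["order"]["created_at"][:10],             # keep just the date
    )

def load_warehouse(db_path="warehouse.db"):
    """One-time (or nightly) load into the cache of clean data."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS orders
                    (order_id TEXT, customer_id TEXT, total REAL, order_date TEXT)""")
    with open(RAW_EVENTS) as f:
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                         (parse_raw_order(line) for line in f))
    # The indexes are what make repeated analysis runs fast.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_orders_date ON orders(order_date)")
    conn.commit()
    return conn

if __name__ == "__main__":
    conn = load_warehouse()
    # Analysis never touches the raw format again; it just queries the clean table.
    for row in conn.execute("SELECT order_date, SUM(total) FROM orders GROUP BY order_date"):
        print(row)
```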
got it. so the idea is generally people want to do sort-of 2nd derivative queries on the data, so it's best to get those first stats out of the way in the warehousing step
Yes, although I wouldn't describe this as "2nd derivative queries", but more like "put the code that you need anyway into two separate layers (tools) with clean boundaries and a persistent intermediate result".
This is why data warehousing exists. That way you can use many different BI tools against a consistent set of data and pre-calculate a lot of commonly-used summary data.
ETL also isn't something you can just do automagically. It requires an understanding of the data and your goals, because you essentially have to build both your data model and your reporting requirements into your ETL process. You could probably do it automagically for some simple cases, but for most real-world scenarios it's just going to be easier to write a Python/Perl script to run your ETL for you.
BI / reporting requires a lot of plumbing to work correctly. You have to set up read-only clones of your DBs for reporting (because you don't want to be running large queries against production servers) and generally an ETL process that dumps everything into a data warehouse. From there, you can push subsets of that data out to various BI tools that provide the interface.
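A rough sketch of what that kind of ETL script might look like (all host names, credentials and table/column lists here are made up, and a naive full reload like this only makes sense for small data): read from the read-only replicas of each service DB and load into a single warehouse database.

```python
import psycopg2

# Hypothetical read-only replicas of two service databases, plus one warehouse DB.
# Credentials would come from .pgpass or environment variables in practice.
SERVICES = {
    "users_service":  dict(host="users-replica.internal",  dbname="users"),
    "orders_service": dict(host="orders-replica.internal", dbname="orders"),
}
WAREHOUSE = dict(host="warehouse.internal", dbname="analytics")

def copy_table(src_conn, dst_conn, table, columns):
    """Dump one table from a service replica into the warehouse (full reload)."""
    placeholders = ", ".join(["%s"] * len(columns))
    col_list = ", ".join(columns)
    with src_conn.cursor() as src, dst_conn.cursor() as dst:
        src.execute(f"SELECT {col_list} FROM {table}")
        rows = src.fetchall()
        dst.execute(f"TRUNCATE {table}")
        dst.executemany(
            f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", rows)

def run_etl():
    dst = psycopg2.connect(user="etl", **WAREHOUSE)
    for service, conn_info in SERVICES.items():
        src = psycopg2.connect(user="readonly", **conn_info)
        # In practice the table/column lists come out of your data model,
        # not hard-coding like this.
        if service == "users_service":
            copy_table(src, dst, "users", ["id", "email", "created_at"])
        else:
            copy_table(src, dst, "orders", ["id", "user_id", "total", "created_at"])
        src.close()
    dst.commit()
    dst.close()

if __name__ == "__main__":
    run_etl()   # typically run from cron on whatever schedule the reports need
```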
If anyone's looking for a straightforward ETL framework, check out DataDuck http://dataducketl.com/ The `dataduck quickstart` command is as close to automagic as you can get, and then you can customize your ETL after that.
I recently joined a competitor (SnapLogic). It's a lot nicer IMHO - HTML5 drag-n-drop interface, unified platform for big data, API integrations, IoT etc., and supports executing integrations either in the cloud or on-premise.
Microsoft's SQL Server Integration Services has a lot of adapters and transformations out of the box. The tooling is in Visual Studio, and you can use C# or F#. Warning about F# - you will never want to touch other languages after you try it.
Yeah. It's really friggin good. I highly recommend it. And I'm an entrenched libre open source kinda character.
I recommend reading Packt Publishing's books about Business Intelligence on the Microsoft SQL Server stack. They bring it all together in a fantastic, example-driven way.
Metabase is nowhere close to being a business-ready BI tool. I wish them luck, but their current product is basically a tech demo compared to the real, commercial products out there -- none of which require you to put your data anywhere in particular either (you can run Tableau against pretty much any DB platform out there).
NiFi looks very promising for numerous chunks of the ETL capability space, but I'm not sure it can stand alone on the transformation piece. I've been looking into coupling it with python for a full ETL stack.
JPKab - As you explore NiFi more and pair it with your own scripts, I'd be curious to hear if you think there are things we can and should do better to be more complete. Let us know at dev@nifi.apache.org. Good luck!
If you are looking for an open source technology to meet this need, Apache Drill is a distributed SQL engine that can talk to multiple datasources and make them available in a single namespace. Joins across sources are supported and can actually perform well when you use enough nodes for your data volume.
Because Drill exposes ODBC and JDBC interfaces, it can very easily be used with BI tools to expose all kinds of data to analysts.
The architecture is fully pluggable so anyone can write a connector to a new datastore and just include it on the classpath when starting the cluster. In the current 1.2 release Drill can connect to any database that exposes JDBC, like MySQL, Postgres, Oracle, as well as a number of non-relational datastores like HDFS, MongoDB and HBase.
To work with all of the non-relational stores, Drill supports complex types like maps and lists. The syntax for working with this complex data is the same as JavaScript's, e.g. a.b.c[3], so this is a compliant extension of SQL, but a non-standard one.
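As a rough illustration (the storage-plugin names, tables and DSN below are all hypothetical and depend entirely on how your Drill cluster is configured), a cross-source join over Drill's ODBC interface might look something like this from Python:

```python
import pyodbc

# Assumes a Drill ODBC DSN has already been configured on this machine.
conn = pyodbc.connect("DSN=DrillDSN", autocommit=True)
cur = conn.cursor()

# Join a table from a MySQL storage plugin against a MongoDB collection,
# reaching into a nested field with the JavaScript-style path syntax.
cur.execute("""
    SELECT o.order_id,
           o.total,
           c.profile.country AS country   -- nested field on the Mongo document
    FROM   mysql.shop.orders o
    JOIN   mongo.crm.customers c
      ON   o.customer_id = c.customer_id
    WHERE  o.total > 100
""")

for row in cur.fetchall():
    print(row.order_id, row.total, row.country)
```

Because it's plain ODBC/JDBC underneath, the same query can be issued from most BI tools directly.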
The solution is to use an ETL (Extract, Transform, and Load) tool to copy data to a warehouse.
You then point the BI tool at the warehouse.
There are lots of tools for this already, and even more custom solutions and other things. CSV files tend to rule here... scheduled exports that are picked up by the ETL tools.
The transformations that take place are usually normalisation of data from systems with differing schemas or different types of data (perhaps changing strings into corresponding numbers, e.g. Risk Level = Orange could be normalised as 50% in the data warehouse).
Businesses have taken this approach for a long long time as the databases that drive applications, services, and now micro-services have always been tuned for the performance of the production use-case and not for ad-hoc reporting.
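A toy example of that kind of transformation step (the mapping itself is obviously something you'd define for your own domain; these values are purely illustrative):

```python
# Map categorical risk levels from the source system onto the numeric scale
# the warehouse (and the BI tool's charts) expect. Orange -> 0.5, i.e. 50%.
RISK_LEVELS = {"Green": 0.25, "Orange": 0.50, "Red": 0.75, "Black": 1.00}

def normalise_risk(record):
    """Transform step: replace the source string with a comparable number."""
    record = dict(record)
    record["risk_score"] = RISK_LEVELS[record.pop("risk_level")]
    return record

print(normalise_risk({"id": 42, "risk_level": "Orange"}))
# {'id': 42, 'risk_score': 0.5}
```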
(Full disclosure, co-founder of Periscope Data here.)
Periscope Data supports cross-DB joins. We cache the data in our own clusters so queries run really fast, and also so you can do things like join across databases and upload CSVs.
Some BI tools can connect to multiple databases, however the right solution is to put all your data into a single data warehouse and run the analysis off of that. Query performance is one reason, and here's a blog post I've written with some more:
I agree, the "one database at a time" approach is the way vendors ignore the harder problem of connecting data across platforms.
I used to work for a company that built the "automagic ETL" kind of solution. You could write a single straight SQL statement across virtual "tables" (if you had non-tabular data, say NoSQL, you could query it through a table-like projection). It could literally join data across heterogeneous systems by shuffling data between them or to a dedicated "analytical processing platform" for join processing. Or you could do things like, create a table on one system, as the result of a select from a join of tables on three other systems. At the time, this was way ahead of what anyone else was doing.
However, it is a hard problem to solve, the company is/was small, and funding was a problem because inventing the tech took a long time. Also, it was an enterprise solution and closed source - when it really probably needed to be open source to be able to support the diversity of data sources.
These days, between Apache Drill, Spark, Ignite, etc., and any number of other commercial solutions, we're starting to see solutions to the problem you're talking about.
I bet this Metabase UI on top of Apache Spark and your databases would be a killer combination. That's a common pattern (BI tool on top of Spark) - see Apache Zeppelin for how it uses Spark, for example.
That said, as long as your data isn't truly ridiculously huge - if it can be centralized, centralization still works just fine.
This is something we saw at Chartio very early on. It's why we built the concept of layers directly into our product to support cross-DB networked joins:
I've worked in this space a lot and in my experience, the only true solution is some sort of layer below your reporting tool to aggregate the data. That being said, it's not always a database that physically stores the data. It simply needs to be a semantic layer that models the way in which the data will be queried. Additionally, your OLTP systems are not optimized for reporting and you'll quickly choke yourself out trying to perform complex analysis on data structures attempting to solve a problem for which they weren't designed.
So in practical terms, you'll eventually want to bring everything back in and create an analytics DB or data warehouse. Depending on what your schemas look like in each microservice, you might just be able to add each DB into Metabase and hit them up.
We don't do cross DB joins in the query builder. You can still use something like PG foreign data wrappers to create an uber-db with all the tables from each microservice.
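If you go the foreign data wrapper route, the setup is just a handful of statements per service database. A rough sketch (all hosts, names and credentials are placeholders; IMPORT FOREIGN SCHEMA needs PostgreSQL 9.5+, otherwise you define each foreign table by hand), run here from Python against the "uber-db":

```python
import psycopg2

# Connect to the "uber-db" that will hold foreign tables for each microservice DB.
conn = psycopg2.connect(host="uber-db.internal", dbname="analytics", user="admin")
cur = conn.cursor()

# One-time setup per remote service database (placeholder names throughout).
cur.execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw")
cur.execute("""
    CREATE SERVER orders_service
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'orders-db.internal', dbname 'orders')
""")
cur.execute("""
    CREATE USER MAPPING FOR CURRENT_USER
        SERVER orders_service
        OPTIONS (user 'readonly', password 'secret')
""")
# Pull the remote tables in under a local schema; Metabase (or any BI tool)
# then sees them as ordinary tables in this one database.
cur.execute("CREATE SCHEMA IF NOT EXISTS orders")
cur.execute("IMPORT FOREIGN SCHEMA public FROM SERVER orders_service INTO orders")
conn.commit()
```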
Regardless of whether you use MB or another tool, there will be some legwork. Part of the tradeoffs involved in the great monolith vs microservices decision.
The non-cool BI tools work alright with multiple sources, integrating them into a common OLAP model that takes care of automating the relevant backend queries. This is a valid approach if you do not want to move all your data into a DW before reporting on consolidated views, but there is no magic bullet in terms of which model is optimal. I am talking about Oracle BI, SAP BO, Cognos BI.
Tool-level data integration is supported by tools like Spotfire, but for analyzing high volumes of data you need to maintain a data warehouse, using ETL code to integrate the data from all the different sources.
For better performance you may also need to implement an OLAP tool.
A lot of BI tools provide connectivity via ODBC and JDBC, so they work with multiple databases at the same time. Check http://www.infocaptor.com for example - it serves both ODBC and JDBC, on-prem and in the cloud.
I like the idea and the use of AGPL. This license lets companies release useful software and still keep the door open for selling alternative licenses, if that is what Metabase does. I have been thinking of writing something similar for natural language access (similar to what I used in my first Java AI book fifteen years ago) to relational and RDF/OWL data stores, and Metabase provides a nice example of how to distribute a general purpose standalone query tool.
Also +1 for being a Clojure app. I am going on vacation tomorrow and am loading the code on my tiny travel laptop for reading.
Those companies are missing out on some really great tools then, like this one. Perhaps as more and more awesome software is released as AGPL, they will change their position.
That seems like a very silly policy. Do these companies do any kind of analysis on what they'd need from their software licenses and what they'd be giving up by using AGPL?
You'd think from their attitudes that the FSF is a patent troll.
Let's look at the advantages of using software under "permissive" licenses for those companies:
- embedding somebody else's code in a closed source product and potentially competing against it
- patenting somebody else's code and potentially suing the author
For the upstream author, losing this type of hostile user by using AGPL might be a good idea.
It's not a silly policy. Using AGPL software (without getting a commercial license) in any modified form and/or depending on it would require all applications that use it across the network to be open sourced.
Of course, this requires it to be modified, which could happen accidentally and completely put a closed source application in jeopardy.
One very selfish request: Google Bigquery connector.
I think a lot of the companies you'd like to serve are going to be folks who might start getting into large quantities of data, but are too small to have people to build out and maintain OLAP schemas. This, to me, is where BigQuery comes in.
FYI, Metabase is a Thomson Reuters platform. I bet you anything they hold the trademark for the term.
You have a pretty good market opening right now - Crystal Reports just went to a SAP Business Objects-type model. Pentaho doesn't even have a community edition anymore.
It looks like you don't support MDX/snowflake schemas/other OLAP reporting standards, so I'm guessing you're going for the lower end of the market. So two things you should do off the bat - import data from Google Sheets (maybe offer to "unlock" the service for tweets and/or links), and offer an Excel plug-in (free up to X rows).
Also, offer to charge, if only to allow Bob to go to his manager and say you offer support. In corporate environments, no support is often a no-go.
Everyone in the low end of the business world just uses Tableau or QlikView - neither company is very militant about enforcing their trial licensing; and full versions aren't really all that expensive for a business anyway (~$500 per license per year, but a small company can get away with a single license). Both tools are also much more mature and powerful than this product.
Not saying there's no room for an open source BI platform, but I don't know there's a lot of money in it given the competitive pressure Tableau and Qlik are putting on the SAPs/Oracles of the world. Anything over $100 is going to put you squarely in the realm of real BI tools with MUCH more capability, a much simpler interface, and standardized training programs.
That's for the personal edition, which can only connect to CSV, Excel, MS Access, OData and Tableau's proprietary Data Extract format (TDE). Tableau Online requires your data to be stored in their cloud (which might not be acceptable to some IT departments).
Tableau Online doesn't require you to save your data on our platform if it's stored in AWS, Google, or Azure (example: Amazon Redshift or Aurora). You can maintain live data connections with datasources hosted on those ecosystems, and the source row-level data doesn't ever need to land in the Tableau Online platform. More info here: http://www.tableau.com/learn/whitepapers/tableau-online-keep...
And also you need the desktop product to publish to it. Tableau Online is a cloud alternative to the on-premise dashboarding server, not to the desktop client.
Yup, TR has a bioinformatics DB they sell which I think is based on metabase.org. We're in a sufficiently different space that we're hoping it's not a conflict.
We don't explicitly support snowflake schemas, but folks are definitely using MB with them. We have a limited set of joins accessible through the query builder, and snowflake schemas work really well.
I always find database & "data helper" tools fun to experiment with. After playing with this with a MySQL database for 10 minutes on my Mac I have 2 initial reactions:
1. The GUI is quite nice, and very simple. It is a glorified query tool that knows your tables and helps you make queries with visualizations and gives the results. It still allows for raw queries if you're into that and just want their GUI for queries.
2. The tool (or the Mac app, at least) still has plenty of bugs to iron out:
- I tried creating a "Dashboard" and it wouldn't actually create or close the modal window. Then I refreshed the application and it had created 10 of the same Dashboard.
- I tried deleting the database and the button just doesn't work.
- Many of the queries I ran on my own tiny sample db seemed to just not run. Closing & reopening the app didn't help.
I feel like the bugs could largely be specific to the OS X binary rather than the actual platform. Quite interested to see how this develops, and I'm going to put a bigger database in to play with later when I have more time.
re:dash is very cool, and we actually used an early version way back.
The primary difference is audience. Most of the impetus to build it came from the difficulties we had getting non-analysts to do their own data pulls and ad hoc queries. While some number of people were able to edit SQL others had written for them, it rapidly became unsustainable.
As a consequence of the audience, we have lots of ancillary features (like the ability to see detail records and walk across foreign key links, the data reference, etc) that fell out as we learned how people used the tool.
Looks like a nice tool. I wrote https://github.com/epantry/django-sql-explorer, which is sort-of similar in terms of the problem it solves, except embraces SQL front-and-center. Excited to check this out in more depth.
Personally, what I think is really exciting is that this is all written in Clojure! We need more Clojure web apps out there to learn from! An initial glance at the source makes it seem very readable as well - a great learning resource for us amateur Clojurists. It also makes it clear that the ecosystem is pretty immature; there is a lot of frameworky boilerplate to support a straightforward REST API.
Yes, writing the app in Clojure has been very interesting. Some things were downright magical, while other aspects felt very rough coming from other languages and frameworks. Ironically the prototype was in django =)
Yeah -- it seems crazy that you have to reinvent things like your defendpoint macro.
I know Clojure is all about composable libraries, not frameworks, etc, etc but just as compojure has become the de facto standard for routing, I imagine more libs will emerge higher up the web stack. It'll mean opinionated libraries, but that is the only way to get more people writing apps that deliver value instead of spending time creating macros that ensure (for instance...) the presence of int IDs in API requests.
How would we compare Splunk and Palantir with this company? I know their focus may be a bit different: Splunk is geared towards "logs" and Palantir towards govt datasets, etc... but at the essence of each company is the task of bringing diverse datasets together and making sense of it. Which makes me think that the companies that are successful at this focus on a particular use-case - as opposed to trying to build for a generic one. thoughts?
Splunk is mainly about log analysis, though they have a new BI product. It's primarily about making sense of your logs. We primarily work with a database you already have (and eventually the data warehouse you'll spin up).
Palantir is a wide set of tools and really about much more powerful analysis than we're targeting.
Our whole bag is letting non-technical people in your company get answers to their common questions by themselves.
This looks awesome, and is something I've been looking for. I'm aiming to start offering a specific service to businesses creating a back-end for BI platforms, but would like to offer the service as cheaply as possible by having a license free BI layer.
Would be interested in what you can offer when you're ready. I'm helping businesses implement BI more from a strategic perspective. I don't get involved technically; it's just part of my consulting.
The googlability of the name is terrible, since it's so heavily overloaded for DB-related things.
Also, while it seems to work tolerably well with the included sample database, connecting it to a local Postgres instance seems to work to the level of connecting and getting all the schema information, but every attempt to run a "question" -- even the simplest pick-a-table-and-go, or the questions it automatically suggests -- results in the "We couldn't understand your question. Your question might contain an invalid parameter or some other error." message, so it seems like there are some pretty significant rough edges.
Really appreciate you taking the time to try it out!
If you're still around, are you using the default schema ('public') for your tables? We've found a few bugs around non-default schema we're working on.
Yes, we're mainly focused on making it easy for non-technical people to pull data out. We have had embedded applications that hit a REST api to expose actions on the data that was embedded though.
Redshift, Druid and Presto are in planning, but in general, we're going to build drivers in the order of community demand. If you have a specific db, open an issue or chime in on one of the open driver issues letting us know it's important.
Support for some of the more common JDBC drivers would be nice, like SQLite, MS SQL and (shudder) Oracle. I know Korma supports all of these (http://www.sqlkorma.com/docs#db), and it seems like you use that.
Oh, that's an old version tag. Keep an eye out for a new version in a few hours, the app should offer you a chance to update.
Regarding FKs, we let you filter or group by any attribute on the table you're working with, or any attribute on an FK in that table. Are you using the sample DB or your own?
Maybe I'm an idiot, but here's me selecting from a customer table. Customers have a FK to 'address' via ship_address_id. I want to filter on address.state, but can't figure out how to do it. 'state' is not an available field in the filters and I can't seem to drill into any of the FK entities.
It looks like the field isn't marked as an FK in our semantic model. Could you go to the admin, click on "Metadata", and check what "special type" that field is marked as? If it's not "Foreign Key", you can set it to that and see if you can filter by linked fields.
What type of database are you pointing this at? Do you have FKs/constraints set?
Thanks! Absolutely awesome tool. It could do almost all of the charts & metrics I need. I couldn't figure out how to do a chart that shows me the SUMs of something per day and per category. Also, number "questions" in the dashboard could benefit from better styling (maybe center them horizontally/vertically).
Very cool. How does this compare to something like Microsoft Power BI? Will I have any issues running it on Linux? (Note: I'm a fairly non-tech user.)
I have some clients that need BI reporting. We've been using Cyfe so far, but it's only good for summaries of data. It's not very good for custom reporting.
These are very different product offerings. Power BI offers a mature in-memory columnstore database with a custom functional query language, DAX, with full dimensional modelling facilities. Additionally, if taking advantage of the (freemium) Power BI web portal, you get natural language query and cross-platform mobile apps.
This is essentially a SQL query builder on top of a RDBMS with some nice visualizations.
I am not trying to disparage the project - in fact I'm often of the opinion that a good query builder is all most organizations need for BI, but the two products in question are vastly different. Power BI has tooling to perform ETL, data modelling, and reporting. Metabase only covers the last piece.
In terms of performance, Metabase, being on top of a SQL RDBMS will have the potential to be incredibly speedy, but with large data volumes, you will need a DBA to make sure reporting scales appropriately. With Power BI, that conversation comes much later in the game. 10Ms of rows are trivial with a star schema and sufficient RAM in terms of performance tuning. With a SQL solution, there is a lot more back-end work to tune for an ad-hoc BI workload even at the 1Ms of rows level.
We run it on linux in production and it's the primary platform we recommend for company wide deployments.
Power BI is pretty awesome, but pretty complicated when we tried it out. With enough attention it can do very cool things. Our primary audience is folks who don't need the level of customization, and have a lot of non-technical coworkers that want to ask their own questions.
I couldn't run the Docker image; it seems that the Java process running in the Docker image doesn't bind port 3000. I also tried to deploy to Heroku, but when I try to access the application via its URL, Heroku throws a 503 error.
Curious. I run the docker image all the time and haven't had any issues with the port binding. What kind of system are you using? and what was your launch command?
Heroku sometimes times out if the application takes too long to start. Could you try again and see if it works? We're working on making Heroku deploys more robust.
It seems that the Docker issue is a memory overflow in the container. I tried to run the Docker image in a 512MB DigitalOcean droplet that already runs several other processes that also consume memory, so I think it's expected, considering Metabase runs on the JVM.
Are you on virtualbox? Sounds like a port binding issue. If you open an issue at github.com/metabase/metabase or http://discourse.metabase.com/ we'll try to debug exactly what's happening.
Just tried this locally and on AWS at Change Heroes and it's great! We've been looking to become more metrics-driven as a company, and this might be the tool that makes it happen!
Thanks for building it and making it open source :)
I'm unable to deploy to Heroku, though it seems to be a problem on Heroku's end. Their Rollbar snippet is throwing an unhandled error once I arrive from your deploy button.
I'm able to get past that and get the app deployed, but then I just get an application error trying to launch it. Then I went and looked at the build log. I'm not sure it's even doing anything:
We've been using it at Color Genomics for a few months and we absolutely love it. Easy to set up, no data sharing with external parties, and a simple/straightforward UI.
Thanks - nice simple tool. I started it in AWS (t1.micro) as a .jar. I had to do "hostname localhost" to get the database (h2) to run, but then it came right up.