Apache Jena (apache.org)
140 points by pplonski86 on March 18, 2019 | 65 comments



A lot of "blast from the past" comments! But as a counter example, we still use Jena extensively (in combination with its server Fuseki) to deal with biomedical ontologies and taxonomies for use in Natural Language Processing (e.g. Human Phenotype Ontology, Gene Ontology, etc). We even recently made the step to add some PROLOG-ish inference rules [1]. I have nothing but love for the RDF ecosystem, in the sense that I love their ideas even if some of the implementations are a bit wonky. For example, the performance of property paths (i.e. the + and * operators in sparql) leave things to be desired sometimes. Not to mention the funny looks from some devs when you say you use RDF, but I take that as a badge of honor! It took me a while to get it, so I finally decided to write down what it all means a couple of years ago [2].

[1]: https://jena.apache.org/documentation/inference/index.html

[2]: https://joelkuiper.eu/semantic-web
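
To give a flavour of both the rules and the property paths, here's a rough sketch with Jena's Java API. The file name, the example.org predicates and the specific HPO term IRI are just placeholders for illustration, not our actual setup:

  import org.apache.jena.query.QueryExecution;
  import org.apache.jena.query.QueryExecutionFactory;
  import org.apache.jena.query.ResultSet;
  import org.apache.jena.rdf.model.InfModel;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;
  import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
  import org.apache.jena.reasoner.rulesys.Rule;
  import org.apache.jena.riot.RDFDataMgr;

  public class HpoSketch {
      public static void main(String[] args) {
          // Load a local copy of an ontology (file name is a placeholder).
          Model data = RDFDataMgr.loadModel("hp.ttl");

          // One forward-chaining rule in Jena's rule syntax [1]: anything with a
          // (made-up) hasPhenotype property gets tagged with a (made-up) Phenotypic class.
          String rules = "[tagPhenotype: (?s <http://example.org/hasPhenotype> ?p) -> "
                       + "(?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Phenotypic>)]";
          GenericRuleReasoner reasoner = new GenericRuleReasoner(Rule.parseRules(rules));
          InfModel inf = ModelFactory.createInfModel(reasoner, data);

          // A property-path query: walk rdfs:subClassOf transitively (the '+' operator).
          String q = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
                   + "SELECT ?ancestor WHERE { <http://purl.obolibrary.org/obo/HP_0001250> rdfs:subClassOf+ ?ancestor }";
          try (QueryExecution qe = QueryExecutionFactory.create(q, inf)) {
              ResultSet rs = qe.execSelect();
              while (rs.hasNext()) {
                  System.out.println(rs.next().get("ancestor"));
              }
          }
      }
  }

In our case the data actually sits in Fuseki and the query goes over HTTP, but the shape is the same.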


A blast from the past. The Semantic Web used to be a good buzzword for receiving EU research grants, and RDF triples have been useful in some areas. What else remains?


  > What else remains?
SPARQL. It's useful beyond the Semantic Web, as evidenced by the rise of projects such as GraphQL. It can replace proprietary REST APIs and make the Web more connected and open. See, e.g., this short talk by Ruben Verborgh [1] and projects like Solid [2].

[1] https://youtu.be/LUF7plExdv8

[2] https://solid.mit.edu/
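
As a concrete (if toy) illustration of "query the open Web without a bespoke API client", here's a minimal Jena sketch; the endpoint and query are examples I picked, not anything from the talk:

  import org.apache.jena.query.QueryExecution;
  import org.apache.jena.query.QueryExecutionFactory;
  import org.apache.jena.query.ResultSet;

  public class OpenEndpointSketch {
      public static void main(String[] args) {
          // No bespoke REST client, no proprietary response format: just a standard
          // query language against a standard protocol endpoint.
          String query = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
                       + "SELECT ?label WHERE { "
                       + "  <http://dbpedia.org/resource/Tim_Berners-Lee> rdfs:label ?label "
                       + "} LIMIT 5";
          try (QueryExecution qe = QueryExecutionFactory
                  .sparqlService("https://dbpedia.org/sparql", query)) {
              ResultSet results = qe.execSelect();
              while (results.hasNext()) {
                  System.out.println(results.next().get("label"));
              }
          }
      }
  }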


I chuckled at the EU research grants part. My God, when I remember the monstrosities I was thinking up just to bodge in Semantic Web/Linked Data. Those guys are purely fueled by buzzwords. Maybe we met somewhere :) .

"Semantic data is the future and always will be" - unknown author.


"Semantic data is the future and always will be" - unknown author."

- Pretty sure it was Peter Norvig


Wikidata is built mostly on RDF ideas. For example, there's a SPARQL query interface for it: https://query.wikidata.org/

Schema.org is heavily used by search engines, and metadata in its vocabulary is found on many websites. It's RDF-based.
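
For anyone who hasn't seen the connection spelled out: the JSON-LD snippets that search engines read off web pages parse straight into triples. A small, hedged sketch with Jena (the snippet is made up; it also assumes jena-arq's JSON-LD support and fetches the schema.org context over the network):

  import java.io.ByteArrayInputStream;
  import java.nio.charset.StandardCharsets;

  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;
  import org.apache.jena.riot.Lang;
  import org.apache.jena.riot.RDFDataMgr;
  import org.apache.jena.riot.RDFFormat;

  public class SchemaOrgSketch {
      public static void main(String[] args) {
          // A made-up Schema.org snippet of the kind search engines scrape from web pages.
          String jsonld = "{ \"@context\": \"http://schema.org/\", "
                        + "\"@type\": \"SoftwareApplication\", \"name\": \"Apache Jena\", "
                        + "\"url\": \"https://jena.apache.org/\" }";

          Model model = ModelFactory.createDefaultModel();
          RDFDataMgr.read(model,
                  new ByteArrayInputStream(jsonld.getBytes(StandardCharsets.UTF_8)),
                  Lang.JSONLD);

          // Same data, now as plain RDF triples.
          RDFDataMgr.write(System.out, model, RDFFormat.TURTLE_PRETTY);
      }
  }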


What do you think Google, Facebook, Siri and Alexa use for their underlying knowledge graphs?


They have knowledge graphs? I thought it was all machine learning now.

(That was a glib comment, but could we be a bit more specific about what's being represented with RDF and other technologies, and for what purposes it's used and how?)


Ya right, machine learning. Machine learning over what? Comments on the internet? Donald Trump would automatically be classified as an orange.

The knowledge graph is a massive manual effort of encoding triples - https://en.wikipedia.org/wiki/Freebase.


Freebase wasn't built on RDF though, but had its own formats and query language.



The fact that they exported their data in the form of RDF for others' convenience doesn't mean they built it on RDF.


A dead effort?


Ha, my sole experience of RDF was on an EU-funded research project :)


Same here! :)


Mine on a non-funded research project. Must have done something wrong...


I'm seeing several people asking what RDF is useful for. If you're curious, I use it for basketball analytics: https://github.com/andrewstellman/pbprdf

Here's an article about my system, pbprdf: https://www.zdnet.com/article/nba-analytics-and-rdf-graphs-g...

And an example of its use: https://gist.github.com/andrewstellman/4872dbb9dc7593e56abdd...

Here's an example of what the RDF files generated by pbprdf look like:

Here's the ontology, which defines the vocabulary it uses: https://github.com/andrewstellman/pbprdf/blob/master/generat...

And this is what the data looks like:

  <pbprdf/games/2017-11-29_Warriors_at_Lakers/230> pbprdf:shotPoints "3"^^xsd:int ;
        pbprdf:shotAssistedBy <pbprdf/players/Klay_Thompson> ;
        pbprdf:shotType "26-foot three point jumper" ;
        pbprdf:shotMade "true"^^xsd:boolean ;
        a pbprdf:Shot ;
        pbprdf:shotBy <pbprdf/players/Stephen_Curry> ;
        a pbprdf:Play ;
        pbprdf:forTeam <pbprdf/teams/Warriors> ;
        pbprdf:inGame <pbprdf/games/2017-11-29_Warriors_at_Lakers> ;
        pbprdf:time "10:23" ;
        pbprdf:period "3"^^xsd:int ;
        a pbprdf:Event ;
        rdfs:label "Warriors: Stephen Curry makes 26-foot three point jumper (Klay Thompson assists)" ;
        pbprdf:secondsIntoGame "1537"^^xsd:int ;
        pbprdf:secondsLeftInPeriod "623"^^xsd:int .
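
(Not from the repo itself, but to give a rough idea of how you can query those files with Jena; the file path is a placeholder, and the real queries are in the gist linked above:)

  import org.apache.jena.query.QueryExecution;
  import org.apache.jena.query.QueryExecutionFactory;
  import org.apache.jena.query.QuerySolution;
  import org.apache.jena.query.ResultSet;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.riot.RDFDataMgr;

  public class PbprdfSketch {
      public static void main(String[] args) {
          // Load one of the generated Turtle files (file name is a placeholder).
          Model model = RDFDataMgr.loadModel("2017-11-29_Warriors_at_Lakers.ttl");

          // Reuse whatever namespace the file declares for the pbprdf prefix
          // instead of hard-coding it here.
          String ns = model.getNsPrefixURI("pbprdf");

          // Count made shots per player.
          String q = "PREFIX pbprdf: <" + ns + "> "
                   + "SELECT ?player (COUNT(?shot) AS ?made) WHERE { "
                   + "  ?shot a pbprdf:Shot ; "
                   + "        pbprdf:shotBy ?player ; "
                   + "        pbprdf:shotMade true . "
                   + "} GROUP BY ?player ORDER BY DESC(?made)";

          try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
              ResultSet rs = qe.execSelect();
              while (rs.hasNext()) {
                  QuerySolution row = rs.next();
                  System.out.println(row.get("player") + "\t" + row.get("made"));
              }
          }
      }
  }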


Hi Andrew

Thanks for posting your project in full. Too much of RDF/linked data is in the abstract, too big to see the moving parts, or behind proprietary doors. I'm at the beginning of the learning curve so it's much appreciated, and quite a number of things about the data workflow clicked - nice to see a graph and instance creation process in rdf4j.

I'm wondering if you've come across an approach to push the outputs of quantitative SPARQL queries, such as your shot points percentage, to a visualization tool... but I'm looking for a semantically aware approach.

So as to be informative to this forum... what do I mean? I'm not talking about a basic flow of outputting a flat file (e.g. CSV) and digesting it with a generic tool - take your pick of zillions of libraries here, but Power BI is my current bugbear, where Microsoft has sold the promise of self-serve BI but leaves everyone else to manage the chaos of cleaved, chewed and duplicated data and a fragile and disconnected calculation (DAX measure) code base.

So what am I looking for ?

Let's call them "measures", but in the RDF construct they are SPARQL queries, as you've documented so well. The measure operates on data that meets constraints on its type and cardinality, amongst other things, but which has, if required, been automatically changed to the conforming "pattern" using constraint rules. I then build my client application with visuals, e.g. a chart or map that displays the SPARQL query results. The visual changes based on properties or constraints on that data. Moreover, the actual measure is stored with the data and encapsulated in the client application. Plus it has full provenance also included. I note here that general ontology and instance visualization tools abound, but not what you could call BI tools for charting etc.

I know these have been conceived and prototyped before. See: https://composing-the-semantic-web.blogspot.com/search?q=cha...

I have been building my skills and workflow in a team that's adopting SHACL and SPIN rules to drive data ingestion through to interfaces in the TopBraid tool set. The space is coming along, but for this use case of charting and visualizing it seems to have stalled, with the above UISPIN work now deprecated and waiting... maybe for SHACL and some SHACL JavaScript mappings to come to the rescue.

I've found some interesting new work using web components (Polymer/LitElement) that makes sense: https://blog.resc.info/reboot-of-using-interface-encapsulati... But it feels a long way off for me to tackle, conceptually and skills-wise, with yet another code framework to get on top of.

Hoping you've seen some potential paths mate.

Cheers

Simon


I think Resource Description Framework (RDF) is overkill.

The underlying idea, the triple store (reminiscent of the Entity-Attribute-Value model), is a good one because it allows you to model data with less overhead than a graph database would have over the same problem. Think of a list of items attached to a node, or hypergraphs. All of that is easier to do in a triple store.

Actually, I think that triple stores are not given enough buzz. Most of the RDF buzz is around ontologies (i.e. standard vocabularies for describing things). Datomic proves that the triple store is a great idea in itself.
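
To make the "list of items attached to a node" point concrete, here's a minimal sketch in Jena (names and namespace are throwaway); the list ends up as ordinary triples using RDF's built-in list vocabulary, no schema migration required:

  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;
  import org.apache.jena.rdf.model.RDFList;
  import org.apache.jena.rdf.model.RDFNode;
  import org.apache.jena.rdf.model.Resource;
  import org.apache.jena.riot.RDFDataMgr;
  import org.apache.jena.riot.RDFFormat;

  public class TripleStoreSketch {
      public static void main(String[] args) {
          Model m = ModelFactory.createDefaultModel();
          String ns = "http://example.org/"; // throwaway namespace for the sketch
          m.setNsPrefix("ex", ns);

          // Entity-Attribute-Value style facts: just triples, nothing else needed.
          Resource order = m.createResource(ns + "order/42");
          order.addProperty(m.createProperty(ns + "status"), "shipped");

          // An ordered list of items attached to the node, using RDF's list vocabulary.
          RDFList items = m.createList(new RDFNode[] {
                  m.createResource(ns + "item/widget"),
                  m.createResource(ns + "item/gadget")
          });
          order.addProperty(m.createProperty(ns + "items"), items);

          RDFDataMgr.write(System.out, m, RDFFormat.TURTLE_PRETTY);
      }
  }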

Datomic is RDF in disguise. That is, it implements a versioned triple store and a language similar to SPARQL (based on core.logic, Clojure's miniKanren).

When you think about it, a versioned database is a gem when it comes to debugging. Versioning a database is next to the best idea of the decade and that would not have been possible with another model than the triple store model.

The idea of database versioning, or more generally versioning of structured data, especially versioning à la git, is making its way through academia (see https://project-hobbit.eu/) and outside academia (cf. https://qri.io/) to help with the vast amount of data that is flowing.


> When you think about it, a versioned database is a gem when it comes to debugging. Versioning a database is next to the best idea of the decade and that would not have been possible with another model than the triple store model.

This is a pretty bold claim. Why isn't it possible with another model?


Well, you are correct. It is possible to implement historisation / audit trail in other database models. Sorry. Maybe I should blog about it :)


I actually stumbled across this a couple weeks ago, while thinking about the design for a project. I'm working with threat intelligence/security data and I need a nice way for users to save/share that data with the team, as well as query it in a simple fashion. People are pushing for a graph database. So I did some research and learned about RDF, OWL, Jena, etc.

I'm still not sure if this is the right road to go down. Just from reading papers it's hard to tell how much of this is hype and over-engineering and how much is solid. Anyone have some RDF/OWL/Jena/SPARQL stories they want to share?


I had a project to map out a whole bunch of relationships between various software configuration parameters of a vendor product. This was a lot of XML but also some CSV and JSON data. There's probably a ton of ways to do it, but I just spent some time extracting "facts" from this data into RDF. I didn't know what any of it was or how it all fit together but there were a lot of UUIDs and such.

Once I had "everything" in RDF, using SPARQL queries to nose around was a really great way to find arbitrary relationships. I then used GraphViz to map out everything and it was very helpful to see how everything fit together. Could've probably used a regular database, or a key value store, but working with just flat files transformations at the command line was nice.

I've also since used it to manage a knowledge base and generate documentation and presentations. I haven't even used much of OWL or the higher levels of abstraction yet, but just from a hack-something-up standpoint I think it is a pretty nice set of ideas for getting basic graph operations just about anywhere you may need them: in the shell, in the browser, or in a backend project. Large sets of data may require something else, but anything less than 1 GB of data is rather performant IMHO.
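
If it helps, the "nose around with SPARQL, then hand it to GraphViz" step boils down to something like this (a simplified sketch, not my actual code; facts.ttl stands in for whatever you extracted):

  import org.apache.jena.query.QueryExecution;
  import org.apache.jena.query.QueryExecutionFactory;
  import org.apache.jena.query.QuerySolution;
  import org.apache.jena.query.ResultSet;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.riot.RDFDataMgr;

  public class ConfigGraphSketch {
      public static void main(String[] args) {
          // facts.ttl stands in for whatever was extracted from the vendor's XML/CSV/JSON.
          Model facts = RDFDataMgr.loadModel("facts.ttl");

          // Dump every edge as GraphViz DOT so the whole thing can be drawn.
          String q = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 1000";
          System.out.println("digraph config {");
          try (QueryExecution qe = QueryExecutionFactory.create(q, facts)) {
              ResultSet rs = qe.execSelect();
              while (rs.hasNext()) {
                  QuerySolution row = rs.next();
                  System.out.printf("  \"%s\" -> \"%s\" [label=\"%s\"];%n",
                          row.get("s"), row.get("o"), row.get("p"));
              }
          }
          System.out.println("}");
      }
  }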


The EU is developing a vocabulary editor called VocBench [0]. The UK is making something similar: registry-core [1]. Here in the Netherlands there's an active group exchanging RDF user stories [2].

To get a good start with RDF etc, I recommend 'Semantic Web for the Working Ontologist' (second edition or later!) [3]. It explains how inferencing works by doing inferencing with SPARQL queries.

RDF is a powerful technology, but takes time to get acquainted with.

What I dislike is that there aren't many libraries (that I know of) for working with it. I started a Rust library (Rome) for working with Turtle files and would like to have time to turn it into a SPARQL engine.

[0] http://vocbench.uniroma2.it/ [1] https://github.com/UKGovLD/registry-core [2] http://www.pilod.nl/wiki/Platform_Linked_Data_Nederland [3] http://workingontologist.org/
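
To give a taste of the "inferencing with SPARQL queries" idea from the book: the RDFS type-propagation rule can be written as a single CONSTRUCT query and run with Jena. A minimal sketch with toy data (not an excerpt from the book):

  import java.io.StringReader;

  import org.apache.jena.query.QueryExecution;
  import org.apache.jena.query.QueryExecutionFactory;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;
  import org.apache.jena.riot.RDFDataMgr;
  import org.apache.jena.riot.RDFFormat;

  public class SparqlInferenceSketch {
      public static void main(String[] args) {
          // Toy data: a tiny class hierarchy plus one instance.
          String ttl = "@prefix ex: <http://example.org/> . "
                     + "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . "
                     + "ex:Beagle rdfs:subClassOf ex:Dog . "
                     + "ex:Snoopy a ex:Beagle .";
          Model data = ModelFactory.createDefaultModel();
          data.read(new StringReader(ttl), null, "TURTLE");

          // The RDFS type-propagation rule written as a CONSTRUCT query:
          // if x is of class C and C is a subclass of D, then x is also of class D.
          String rule = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
                      + "CONSTRUCT { ?x a ?super } WHERE { ?x a ?sub . ?sub rdfs:subClassOf ?super }";
          try (QueryExecution qe = QueryExecutionFactory.create(rule, data)) {
              Model inferred = qe.execConstruct();
              RDFDataMgr.write(System.out, inferred, RDFFormat.TURTLE_PRETTY); // ex:Snoopy a ex:Dog
          }
      }
  }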


Unfortunately a lot of stuff going on in the RDF domain is behind firewalls so it's a bit hard to give a lot of details. But I can contribute some public and some private use-cases of where RDF is used:

The Refinitiv (formerly Thomson Reuters Financial and Risk) knowledge graph is built completely on the RDF stack: https://www.refinitiv.com/en/products/knowledge-graph-feed

When I talked to them in late 2017, they told me they have 100 billion triples in their database, plus more in a versioning back-end. Their triplestore is open source: https://github.com/CM-Well/CM-Well

Several government agencies all over the world are starting to build public RDF knowledge graphs. I'm closely involved in the one from the Swiss government; see my presentation from last week: http://presentations.zazuko.com/Swiss-LOD-Platform/

There are similar projects in other countries like the Netherlands, Belgium, UK, etc. This stack makes a lot of sense for open data, as you can do some pretty crazy queries without spending 2 days on preparing your data. See for example the Swiss Open Data Advent Calendar of 2018: https://twitter.com/linkedktk/status/1076064066525949952

As I said, there are many "behind the firewall" use cases where people use the stack exactly because of features like OWL. Yes, it comes at a price (bootstrapping is not really super easy) but this is stuff we will still run 40 years from now. I see it in:

Finance: Fraud detection, compliance, customer 360° views, ... Stardog (https://www.stardog.com/) lists Moody's, BNY Mellon and National Bank of Canada as customers; last week I met someone from Credit Suisse who is Mr. RDF there.

Production: You have a ton of databases containing products you create, but there is no way to figure out what a final product consists of, as the data is scattered across at least 5 of them. The automotive supplier I talk about here is using RDF to get that view.

Life sciences: The largest RDF dataset available to the public is UniProt and related datasets. In total they provide a SPARQL endpoint (RDF database) with 50 billion (!!) triples available. This is a highly popular dataset and is used in pretty much every larger pharmaceutical enterprise as well. See https://www.uniprot.org/ as a starting point. I know at least of one large life sciences company that just recently decided that RDF will be the base of all future data unification standards within the organization.

Insurance business: One of our customers is using RDF to unify a ton of different systems and get the 360° view as well about their customers.

RDF is an absolutely amazing stack and I do not see anything else available that gets remotely close to the power of it. The day I find something more powerful, I will be the first using it. But most of the time people dismissing RDF have zero clue about what it really can do.


I am part of a team building an RDF database to be used for environmental footprinting and industrial ecology (https://github.com/BONSAMURAIS), and am also slowly becoming part of the Swiss open data scene - I would appreciate a chance to chat with you about your experiences!

For us, RDF seems like the only technology that can easily adapt to the large number of data types that we envision collecting.


sure, more than happy to. You will find me at @linkedktk on twitter or adrian.gschwend @ zazuko . com


You sound like you may understand RDF well enough to answer.

If I have transcribed voice convo data, with date/time, names, location, sentiment and extracted subject matter keywords, would Jena/RDF and/or related tools be appropriate for exploring relationships and trends between data points?

Thank you.


Yes, that sounds like a pretty good fit for RDF. Do you have some examples of the data? It would probably be pretty straightforward to do an RDF model that could be used for analytics later.
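
Just to sketch what such a model could look like (completely made-up namespace and values, and only one of many reasonable modellings):

  import org.apache.jena.datatypes.xsd.XSDDatatype;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;
  import org.apache.jena.rdf.model.Resource;
  import org.apache.jena.riot.RDFDataMgr;
  import org.apache.jena.riot.RDFFormat;

  public class TranscriptSketch {
      public static void main(String[] args) {
          Model m = ModelFactory.createDefaultModel();
          String ns = "http://example.org/calls/"; // throwaway namespace
          m.setNsPrefix("call", ns);

          // One utterance with speaker, location, sentiment, keywords and a timestamp.
          Resource utterance = m.createResource(ns + "utterance/123")
                  .addProperty(m.createProperty(ns + "speaker"), m.createResource(ns + "person/alice"))
                  .addProperty(m.createProperty(ns + "location"), "Berlin")
                  .addProperty(m.createProperty(ns + "sentiment"), m.createTypedLiteral(0.8))
                  .addProperty(m.createProperty(ns + "keyword"), "pricing")
                  .addProperty(m.createProperty(ns + "keyword"), "renewal") // multi-valued, no schema change
                  .addProperty(m.createProperty(ns + "timestamp"),
                          m.createTypedLiteral("2019-03-18T10:23:00", XSDDatatype.XSDdateTime));

          RDFDataMgr.write(System.out, m, RDFFormat.TURTLE_PRETTY);
      }
  }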


I have been a fan of RDF/RDFS/OWL for a long time. I started adding RDF in various forms to my main web site shortly after TBL et al. wrote the original Scientific American article. I have also written two semantic web/linked data books.

The uptake for the semantic web has been spotty. I was hired as a contractor at Google to work with the Knowledge Graph and over the years I have had a lot of consulting work in related areas.

That said, if I were building a custom Knowledge Graph for a company or organization today, I would likely use a graph database like Neo4J. Maybe not though - it would depend on the application.


Really? Why not Jena, or Marmotta with a Postgres backend? I have little experience but just started setting up the latter, so I'm curious.


I can't speak for OP, but in my humble opinion, they don't really serve the same purpose. Jena and Marmotta are implementations of standards, while Neo4J is more of a proprietary system, and the underlying paradigm differs between triple graphs (for Jena and, to a lesser extent, Marmotta) and property graphs (for most graph databases, e.g. Neo4J, Apache TinkerPop, OrientDB...).

To give a very short summary, RDF is more concerned with the possibility of linking data, so each piece of information is identified by a dereferenceable URI and can be described with an explicit model called an ontology (a fancy word for a vocabulary used to describe data). On the other hand, Neo4J is more concerned with performance, but does not consider linking data across the Web or using an explicit schema. As for Marmotta, it is in kind of a bad place right now: development seems a bit stalled, and the standard it implements is quite complicated compared to the majority of the problems it solves. This might evolve, however, since Linked Data Platform (said standard) is now promoted by Solid (https://solid.inrupt.com/), a new initiative by Tim Berners-Lee et al. to enable a truly distributed Web.


I don't see what you can't do with Neo4j, Neptune, or Datomic. Neptune offers a SPARQL interface. You could easily specify an explicit schema. If you wanted to make an ontology that's globally unique, you could force it to be defined at a centrally defined application level. If you wanted an inferring traversal, for example applying some type of hierarchy, you could write your own iterator in Java in Neo4j if you wanted to, or just apply multiple types to the vertex.


The biggest problem with Solid IMO has always been that very few people care about privacy enough to deal with the pain of managing their own data.


If performance matters in any way then Dgraph and Neo4J are much, much better. Jena is a relic of the semantic web past which you need if you want to do academic stuff like ontologies, OWL, SPARQL and inference. But hardly for building 'real' applications for actual users and datasets.


Different use cases. I think RDF/RDFS/OWL excel at combining different linked data sources with SPARQL queries. I think graph databases like Neo4J may be easier to work with if you are building a local knowledge base. These are just my opinions.


What even is Semantic Web in modern terms?


Things like json-ld, microdata, rdfa https://developers.google.com/search/docs/data-types/product

Wikidata also has data in rdf https://wikidata.org/wiki/Wikidata:Database_download#JSON_du...

I wonder what the killer feature is supposed to be. If you want highly connected data, you can use graph databases like Neptune, Neo4j, or Datomic. If you want logic programming, you still have SWI-Prolog, or something like Picat, ECLiPSe, or Mercury, which can easily model triples or a custom ontology. There's also Apache TinkerPop and similar, which give querying a more object-oriented feel. I see Prolog can interop with Jena, but if it can, why not parse & query RDF/OWL in Prolog itself? Can't Prolog do everything SPARQL can?

What is the value proposition of Apache Jena?


RDF, and in particular OWL2 (a reformulation of RDF tech based on description logic), is about decidable fragments of first-order logic, whereas Prolog is an existential Horn fragment on terms with Turing-complete extra-logical additions such as negation-as-failure and "cut", and hence has undecidable decision problems. I actually think the EU-granted research belittled in another comment did a good job of carving out fragments of FOL for relevant applications with desirable complexities of decision problems. But from a practical PoV, RDF, OWL, and SPARQL are bordering on unusable (starting from the fact that open-world semantics alone isn't applicable in many real-world scenarios), though Jena and rdflib work fine. Think of OWL2/description logic as a variable-free representation of axioms with just two logical variables (plus some with three vars, such as the axiom of transitivity).

Today there's a renewed interest in Prolog and Datalog, which makes me happy after RDF had captured the field for almost two decades.


Not 100% certain, but I think the killer feature is the ability to manage your own data while exposing semantic representations that external apps can interact with.

The guiding idea behind RDF was (at one point) that users manage their own data while apps they give permission to can use that data.


Ha, yes, JSON-LD is RDF in JSON but no one admits that so it doesn't look ancient.


Knowledge Graph.


And what is that exactly?


- Are there any examples of using Apache Jena for reasoning/inference over CMU's NELL/RTW [0] data?

- Are there any tools for ontology inference like Apache Jena, but simpler and preferably built with js or python?

[0] http://rtw.ml.cmu.edu/rtw/


Big fan of Jena. The command-line tools alone, for running SPARQL queries and converting between formats, are great. Can anyone compare and contrast Jena in a Java project and/or operating against flat files with Neo4J or other graph systems?
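
For the flat-file side of the comparison: the same conversion the riot command-line tool does is a few lines with the Java API. A small sketch (file names are placeholders):

  import java.io.FileOutputStream;
  import java.io.OutputStream;

  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.riot.RDFDataMgr;
  import org.apache.jena.riot.RDFFormat;

  public class ConvertSketch {
      public static void main(String[] args) throws Exception {
          // Parse one serialization, write another (file names are placeholders).
          Model model = RDFDataMgr.loadModel("data.rdf");             // RDF/XML in
          try (OutputStream out = new FileOutputStream("data.ttl")) {
              RDFDataMgr.write(out, model, RDFFormat.TURTLE_PRETTY);  // Turtle out
          }
      }
  }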


Jena is an RDF triple store whereas Neo4J, Memgraph, or JanusGraph, for example, are labeled property graphs. Two different data models with distinct use cases. I would compare Jena to other RDF engines like Stardog or AnzoGraph.

From a java project perspective, it doesn't really matter which one you choose as they all expose JDBC.


What do you think of http://rdf4j.org ?


Interesting. I'll have a look. Thanks for sharing :D


http://rya.incubator.apache.org/

How does it compare to rya?


Looks like Rya is a triple store on top of Accumulo. Jena has SDB (stores data in a relational DB) and TDB (a single-machine, single-JVM, transactional, on-disk store [with mmap, journal, cache, etc.]).

The Jena API is also extensible, and I remember someone on the mailing list talking about an example with HBase, I believe, as the backend for the triple store.

Both HBase and Accumulo can be scaled to multiple nodes. So you should have some pros and cons of using either (or rya), as well as pros and cons of using Jena.

Finally, I believe Jena is not simply a triple store. It has a web layer via Fuseki, services for managing graphs, command-line tools, and other tools useful for data processing with semantic web technologies.
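
For completeness, a minimal sketch of what using TDB from Java looks like (directory and file names are placeholders):

  import org.apache.jena.query.Dataset;
  import org.apache.jena.query.ReadWrite;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.riot.RDFDataMgr;
  import org.apache.jena.tdb.TDBFactory;

  public class TdbSketch {
      public static void main(String[] args) {
          // A persistent, transactional, on-disk store in a local directory (path is a placeholder).
          Dataset dataset = TDBFactory.createDataset("/tmp/my-tdb");

          dataset.begin(ReadWrite.WRITE);
          try {
              Model model = dataset.getDefaultModel();
              RDFDataMgr.read(model, "facts.ttl"); // load some triples into the store
              dataset.commit();
          } finally {
              dataset.end();
          }
      }
  }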


My comment is a repost of one I made in another RDF-related thread. It is still valid, so here we go:

I'm an engineer that used to do RDBs for a long time. One day a customer of a friend came with an issue that was, in my opinion, impossible to solve with relational DBs: he described data that is in flux all the time, and there was no way we could come up with a schema that would fit his problem for more than one month after we finished it. Then I remembered that another friend once mentioned this graph model called RDF and its query language SPARQL, and I started digging into it. It's all W3C standards, so it's very easy to read into it and there are competing implementations.

It was a wild ride. At the time I started there was little to no tooling, only a few SPARQL implementations, and SPARQL 1.1 was not released yet. It was a PITA to use, but it still stuck with me: I finally had an agile data model that allowed me and our customers to grow with the problem. I was quite sceptical whether it would ever scale, but I still didn't stop using it.

Initially one can be overwhelmed by RDF: it is a very simple data model, but at the same time it's a technology stack that allows you to do a lot of crazy stuff. You can describe the semantics of the data in vocabularies and ontologies, which you should share and re-use; you can traverse the graph with its query language SPARQL; and you have additional layers like reasoning that can figure out hidden gems in your data and make life easier when you consume or validate it. And most recently people started integrating machine learning toolkits into the stack, so you can directly train models based on your RDF knowledge graph.

If you want to solve a small problem, RDF might not be the most logical choice at first. But then you start thinking about it again and you figure out that this is probably not the end of it. Sure, maybe you would be faster using the latest and greatest key/value DB and hacking some stuff together in fancy web frameworks. But then again, there is a fair chance the customer will want you to add stuff in the future, and you are quite certain that at one point it will blow up because the technology cannot handle it anymore.

That will not happen with RDF. You will have to invest more time at first, you will talk about things like the semantics of your customer's data, and you will spend quite some time figuring out how to create identifiers (URIs in RDF) that are still valid years from now. You will have a look at existing vocabularies and just refine things that are really necessary for the particular use case. You will think about integrating data from relational systems, Excel files or JSON APIs by mapping them to RDF, which again is all defined in W3C standards. You will mock up some data in a text editor, written in your favourite serialization of RDF. Yes, there are many serializations available, and you should most definitely throw away any book/text that starts with RDF/XML; use Turtle or JSON-LD instead, whatever fits you best.

After that you start automating everything, you write some glue code that interprets the DSL you just built on top of RDF and appropriate vocabularies, and you start to adjust everything to your customer's needs. Once you go live it will look and feel like any other solution you built before, but unlike those, you can extend it easily and increase its complexity once you need it.

And at that point you realize that this is all worth it and you will most likely not touch any other technology stack anymore. At least that's what I did.

I could go on for a long time; in fact I teach this stack in companies and gov organizations over several days, and I can only scratch the surface of what you can do with it. It does scale, I'm convinced of that by now, and the tooling is getting better and better.

If you are interested, start by having a look at the Creative Commons course/slides we started building. There is still lots of content that should be added, but I had to start somewhere: http://linked-data-training.zazuko.com/

Also have a look at Wikipedia for a list of SPARQL implementations: https://en.wikipedia.org/wiki/Comparison_of_triplestores

Would I use other graph databases? Definitely not. The great thing about RDF is that it's open; you can cross-reference data across silos/domains and profit from work others did. If I create another silo in a proprietary graph model, why would I bother?

Let me finish with a quote from Dan Brickley (Google's schema.org) and Libby Miller (BBC) in a recent book about RDF validation:

> People think RDF is a pain because it is complicated. The truth is even worse. RDF is painfully simplistic, but it allows you to work with real-world data and problems that are horribly complicated. While you can avoid RDF, it is harder to avoid complicated data and complicated computer problems.

Source: http://book.validatingrdf.com/bookHtml005.html

I could not have come up with a better conclusion.


I think the semantic web never worked because of SEO spam. The closest it got to adoption in any form was the keywords meta tag. We know how that ended up.


Less that and more the combination of bad tools, labyrinthine “standards” which were usually poorly implemented by said tools, and lack of concrete use-cases. Almost nobody had a case where implementing it showed a clear business value so most of the “you should do this” advocacy was along the lines of “this will become useful later” and that was a hard sell because it took many years before you could go to a webpage, read some decent instructions, mark something up, and run it through a validator or another client and get the expected data back out.


That is exactly what is happening: people and companies marking up their pages with JSON-LD and RDFa metadata because it impacts SEO: https://developers.google.com/search/docs/guides/intro-struc...


That was my point: to the extent that the semantic web has happened at all, it's because someone can go to their boss and say “If we do this, here's a non-hypothetical benefit we'll see immediately” and because Google invested in high-quality documentation and tools which make it easy to do correctly.


I prefer rdf4j as the RDF API for clients and GraphDB from Ontotext as the RDF store. I never liked programming with Jena.


How is semantic web data being used today? Has there been any research to build a general AI using this?


What is the origin of the name / What does it mean? Couldn’t find an answer on the website.


Probably a reference to Frege, a founder of modern logic.

https://en.wikipedia.org/wiki/Gottlob_Frege


Jena is a city in the eastern part of Germany. It is historically famous for its university and optics innovation, and as an epicenter of Bauhaus architecture.


That is why people from AKSW Leipzig built a similar thing for PHP named Erfurt (https://github.com/AKSW/Erfurt) for their semantic wiki OntoWiki (https://github.com/AKSW/OntoWiki).


why Java?


Because it is an enterprise language? Jena originates from Hewlett-Packard Labs.


Easier to build something quickly, but definitely not the best option for DBs. It's always about tradeoffs, I guess.


For quickly, I would have chosen a dynamic language (Python, PHP).



