An Introduction to Knowledge Graphs (stanford.edu)
244 points by umangkeshri on May 22, 2021 | 72 comments



I can't stress enough how important Wikidata (and its predecessor DBpedia) is, not only as a public example of a huge knowledge graph (versus the ones hidden at big tech companies) but also as a Wikipedia-like collaborative project to organise all the knowledge among existing semantic web / linked data publishers, government open data, libraries, galleries, archives...

Also remember that the software behind Wikidata, Wikibase, is open source, and you can fire up your own knowledge graph as Docker containers on your laptop: https://wikiba.se/

If you have been disappointed by RDF-based technologies before, I would say Wikidata/Wikibase have significantly innovated on top of them. For example, they allow each statement to have qualifiers, references, and deprecated/preferred ranks attached in a user-friendly way, while also keeping simple queries simple.
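To see what that statement-level metadata looks like in practice, here is a small sketch using Python's requests library against the public per-item JSON (Douglas Adams, Q42, chosen arbitrarily; the User-Agent string is just a placeholder):

    import requests

    data = requests.get(
        "https://www.wikidata.org/wiki/Special:EntityData/Q42.json",
        headers={"User-Agent": "kg-example/0.1"},
    ).json()

    # Each claim carries a rank, optional qualifiers and references
    # alongside the plain subject-property-value triple.
    for statement in data["entities"]["Q42"]["claims"]["P31"]:  # P31 = "instance of"
        print(
            statement["mainsnak"]["datavalue"]["value"]["id"],  # e.g. Q5 (human)
            statement["rank"],                                   # normal/preferred/deprecated
            list(statement.get("qualifiers", {})),               # qualifier property IDs
            len(statement.get("references", [])),                # number of references
        )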


What is RDF?


I think it stands for _Resource Description Framework_, https://en.wikipedia.org/wiki/Resource_Description_Framework


Knowledge graphs for text (the focus of the article) seem narrowly-scoped since they require "objective" facts and relations to be practical. Capturing the subjective and transient perspective of observations made by multiple observers (which is what we actually have access to) is more complicated.

For example, asking the same person the same question may yield different answers based on their mood or other environmental or situational factors. Who's asking the question can also matter, as does the specific phrasing of the question.


Recent work suggests it's possible to generate knowledge graphs from large corpora of text encoded with a language model: https://arxiv.org/abs/2010.11967
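(Not the method from that paper, but for intuition, here is a toy rule-based subject-verb-object extractor over spaCy's dependency parse; real systems need span merging, coreference resolution and heavy filtering on top of this.)

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm

    def extract_triples(text):
        """Yield naive (subject, relation, object) triples from raw text."""
        for token in nlp(text):
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
                for s in subjects:
                    for o in objects:
                        yield (s.text, token.lemma_, o.text)

    print(list(extract_triples("Marie Curie discovered polonium and radium.")))
    # e.g. [('Curie', 'discover', 'polonium')] -- note how crude this is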


Similar work (I'm the first author): generating temporal graphs from text using language models https://arxiv.org/pdf/2010.10077.pdf.


Cool! I'm an incoming undergrad transfer to CMU this fall interested in graph-based NLP. Mind if I email you in the next couple days?


Sure :)


I wouldn't recommend using this work to generate knowledge graphs, as it requires a lot of rule-based filtering. I suspect it wouldn't be any better than generating from a constituency parse tree. Disclaimer: I implemented this work a while ago and reached my current conclusion then; I'm still waiting for the official code to be released.


You could use knowledge graphs to help solve those problems too. Objective fact is just a seemingly simpler domain to work in. Subjectivity is still based on objective facts; it's just that there are a whole lot more of them, and they're subtler and harder to detect.


I'd urge all of you not to fall into the trap of ontologies. I worked very hard in this domain, and my conclusion is that all ontologies eventually fail to scale. I would recommend people in the field to go towards "perspectivism".


Is this[1] an example of ontological perspectivism? Can you point us at a good place to start?

[1]: https://link.springer.com/article/10.1007/s11406-021-00371-1


Great start; I have presented this point of view myself. No clue what "perspectivism" really means, though.


I'm not understanding what you mean. The latter portion of your comment seems to disagree with the former.



Seconding the wisdom shared by u/julienreszka and you.

Stuff like schemas and data dictionaries and reuse are Chinese finger traps for us geeks. Exquisite problems we can't look away from.

I eventually decided to treat most data ingestion (ETL) as screen scraping. Honoring Postel's Law. Pull out the interesting relevant bits as needed. Ignore the rest.

There's still an internal model, natch. But it's the smallest, most obvious model to support my immediate use cases. Nothing more.


And how does "perspectivism" somehow scale?

Any NLP application directly encoding knowledge (objects, perspectives, whatever) is going to have major scaling constraints, since it's impossible right now to encode human-level understanding automatically.


I don't think they're a trap if you admit ahead of time that the relevant knowledge and its associations are limited. You always have to qualify the result in context. I agree that they won't scale perfectly, but they're still a useful tool in many cases.


Could you (please, pretty please) elaborate?


There's no such thing in practice as "California is in the US." What you can get in reality is "Alice told the computer in 2008 that Bob wrote in his 1999 book that California was in the US in 1470BC."

Perspectivism is the understanding that it's impossible to interpret semantic "knowledge" without knowing the limitations and implicit context carried with the fallible, partial transcription of the truth (set in a world that obeys quantum mechanics, for one thing) into words (which run to about 1MB tops before the author gets bored.)
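To make that concrete, here's a rough sketch (my own ad-hoc model, not any standard vocabulary) of what a statement carries once you keep the perspective with it instead of flattening it to a bare triple:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Statement:
        subject: str                          # "California"
        predicate: str                        # "located_in"
        obj: str                              # "US"
        asserted_by: str                      # "Alice"
        asserted_at: int                      # 2008
        according_to: Optional[str] = None    # "Bob's 1999 book"
        valid_time: Optional[str] = None      # "1470BC"

    claim = Statement("California", "located_in", "US",
                      asserted_by="Alice", asserted_at=2008,
                      according_to="Bob's 1999 book", valid_time="1470BC")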


> partial transcription of the truth (set in a world that obeys quantum mechanics, for one thing)

Quantum mechanics actually says there's less information needed to describe X area of space than classical physics implies there is.

https://en.wikipedia.org/wiki/Bekenstein_bound


Both limits are unimaginably far above the amount of text you could reasonably write to describe something.


I’d like to read more about what you worked on specifically if you’re willing to share!


Gene Ontology is a pretty useful tool. What are you referring to?


> I would recommend people in the field to go towards "perspectivism".

what does this mean in this context?


Is this just a way of saying that no relations are absolute?


He most likely means that reasoners, and databases that provide reasoning abilities, do not scale. This makes sense, especially for OWL ontologies. For most OWL reasoners, if you feed them the ontology together with a large set of instance data (class instances connected by edges labeled with properties defined in said ontology), it will likely take way more time than you would like to produce results (if it produces anything at all).

The reason for that is twofold:

1. Many of the tools created for reasoning are research-first tools. Some papers were published about the tool and it really was a better and more scalable tool than anything before it. But every PhD student graduates and needs to find a job or move on to the next hyped research area.

2. Tools are designed under the assumption that the whole ontology, all the instance data and all results fit in main memory (RAM). This assumption is de facto necessary for the more powerful entailment regimes of OWL.

Reason 2 has a secondary sub-reason: OWL ontologies use URIs (actually IRIs), which are really inefficient identifiers compared to 32/64-bit integers. HDT is a format that fixes this inefficiency for RDF (and is thus applicable to ontologies), but by the time it came about nearly all reasoners were already abandoned, as per reason #1 above.

Newer reasoners that actually scale quite a bit are RDFox [1] and VLog [2]. They use compact representations and try to be nice to the CPU cache and pipeline. However, they are limited to a single shared-memory machine (even if NUMA).

There are a lot of mostly academic distributed reasoners designed to scale horizontally instead of vertically. These systems technically scale, but vertically scaling the centralized systems mentioned above will be more efficient. The intrinsic problem with distributing is that (i) it is hard to partition the input aiming at a fair distribution of work, and (ii) inferred facts derived at one node are often evidence that multiple other nodes need to know about.

Distributed reasoners also lose out against modern single-node systems because computing all inferred edges of a knowledge graph involves a great deal of communication: an inference found by one node is often evidence required by another processing node.

[1]: https://www.oxfordsemantic.tech/product [2]: https://github.com/karmaresearch/vlog/
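To make the identifier point concrete, here is a rough sketch of the dictionary-encoding idea behind HDT and engines like RDFox/VLog (not their actual code): map each IRI to a small integer once, then store and join triples as integer tuples instead of long strings.

    class Dictionary:
        def __init__(self):
            self._id_of = {}    # IRI -> int
            self._iri_of = []   # int -> IRI

        def encode(self, iri: str) -> int:
            if iri not in self._id_of:
                self._id_of[iri] = len(self._iri_of)
                self._iri_of.append(iri)
            return self._id_of[iri]

        def decode(self, ident: int) -> str:
            return self._iri_of[ident]

    d = Dictionary()
    triples = {
        (d.encode("http://example.org/Pluto"),
         d.encode("http://www.w3.org/1999/02/22-rdf-syntax-ns#type"),
         d.encode("http://example.org/DwarfPlanet")),
    }  # joins and indexes now work over fixed-width ints, not long strings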


On a side note, I love the idea of researchers writing “articles” in this format. No paywall, no complex two-column format, no PDFs. As a researcher myself, I wish this were what my “productivity” was judged on; I'd probably have a lot more fun and more motivation to work and produce!


Didn't you hear? Having two columns makes it twice as sciencey.


>no complex two-column format,

Two columns is for the reader's benefit - your eyes can keep their place on the page much more easily when jumping half the distance to the beginning of the next line


Two columns makes inappropriate assumptions about the reader's choice of media. The aging publisher assumes everyone prefers paper. A reader attempting to scroll through the article on a tablet suffers.

We blew it, making a global effort to perfect fixed format technical typesetting rather than flowable text. Technical flowable text is just now becoming viable, and it is certainly not the norm for journal articles.


That's what they tell you.


Love to see that relationships (edges) are directional. I hate to see graph models where relationships are bi-directional as it loosens up data rules far too much with very little benefit.

I've worked with systems where the relationships are typed and can have attributes just like the vertices, allowing the system to model data in a more intuitive fashion.
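For illustration, a rough sketch of that kind of model (often called a property graph), with attribute-bearing vertices and typed, directed, attribute-bearing edges; all names here are made up:

    node_pluto = {"id": "n1", "label": "Pluto", "attrs": {"discovered": 1930}}
    node_class = {"id": "n2", "label": "dwarf planet", "attrs": {}}
    edge = {
        "src": "n1", "dst": "n2",      # direction matters
        "type": "instance_of",         # typed relationship
        "attrs": {"since": "2006"},    # attributes on the edge itself
    }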


The definition seems faulty to me, since the pair (E: subset(N × N), f: E → L) does not admit of multiple edges with different labels, connecting the same ordered pair of nodes. Of course this is most often allowed in practical KG's.


Yeah it's not ideal. Also with the current preference for category theory it would probably make more sense to not phrase E as a subset of NxN but rather to define two functions pre: E -> N and post: E -> N.

You could then choose to add labels as either an additional function label: E -> L, or as several graphs layered on top of one another (especially helpful when you view each graph as an arrow N -> N, which in turn makes more sense when you have several different classes of nodes).

If anyone's interested I'm loosely basing this description on the description of petri nets as outlined in Tai-Danae Bradley's interesting (and readable) Applied Category Theory: https://arxiv.org/pdf/1809.05923.pdf
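A small sketch of that formulation in code (my own notation, loosely following the above): edges form their own set with pre/post/label maps, so several differently labelled edges between the same pair of nodes are fine.

    from dataclasses import dataclass, field

    @dataclass
    class MultiGraph:
        nodes: set = field(default_factory=set)
        edges: dict = field(default_factory=dict)   # edge id -> (pre, post, label)

        def add_edge(self, eid, pre, post, label):
            self.nodes |= {pre, post}
            self.edges[eid] = (pre, post, label)

    g = MultiGraph()
    g.add_edge("e1", "Alice", "Bob", "knows")
    g.add_edge("e2", "Alice", "Bob", "works_with")  # parallel edge, no problem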


Indeed multiple edges (with different labels) are quite useful, particularly when you want to represent RDF graphs. But since there is no restriction on the form of L, you can still represent those by, e.g., letting L be a set of sets of IRIs, and thus labelling your edges with sets of IRIs, which you then interpret as a set of RDF triples (i.e., as a set of edges).


Slightly off-topic: can anyone recommend any open-source ANSI C implementations of knowledge graphs/graph databases? I had toyed around a bit with Neo4j and am interested in something in a language I'm more familiar with.

I've created some useful output from Wikidata's dataset with some space-saving decisions, like focusing on certain languages and an arbitrary data structure. The dump is quite big and space is at a premium.


In the RDF world, there is Redland: https://librdf.org/

But I don't know if it scales up to Wikidata size.



So this is the ‘semantic web’ from ~15 years ago?


Yes and no.

Yes because the web is already a kind of knowledge graph but it's mostly written in natural language, and thus it's very hard for machines to traverse and reason about it. The Semantic Web was an attempt to formalize some ways to make the web's inherent knowledge graph nature more explicit and thus easier for programs to understand.

No, because knowledge graphs predated the web by many years and KGs are a bigger topic than just the web.


Think of this as a programming language? The semantic web is another choice.

Programming languages are valued both when they reveal inevitable design, and when they enjoy widespread adoption. We continue to have many programming languages, because there is no consensus on inevitable design.

The design of "this" reveals no deep secrets about the nature of the universe; the only parts that seem inevitable are the parts that seem obvious. And all of what one sees seems obvious; the choices involve what one doesn't see, what is left out of the system.

The value here is in widespread adoption. The system is good enough that people can agree to use it.


IMHO not quite, because SemWeb around 15 years ago (OWL2) was already built on a logical rather than graph-based foundation, though duals/isomorphisms/parallels between the two exist.


schema.org is certainly getting used more. Not sure how trusted it is wrt search.


More or less.


KGs are cool, but I haven't found a practical framework for combining simple logical predicates with temporal facts (things that are true at a certain moment in time) and information provenance (the truthiness of information given its origin). There might be ways to encode this information in a hypergraph, but they are far from practical.


Wikidata statements (which roughly correspond to the edges in the knowledge graph) have quite a bit of metadata associated with them: they can refer to sources that state that particular bit of knowledge, they have a so-called rank that allows distinguishing preferred and deprecated statements, and they can be qualified by another statement in the graph. Temporal validity is encoded using a combination of rank and qualifiers, as for, e.g., Pluto[0], where the instance-of statement saying that “Pluto is a planet” is deprecated and has an “end time” qualifier, and the preferred statement says “Pluto is a dwarf planet,” with a corresponding “start time” qualifier.

In principle, all of this information is available through the SPARQL endpoint or as an RDF export (there is also a simplified export that contains only “simple” statements, lacking all of that metadata), so reasoning over this data is not entirely out of reach, but the sheer size (the full RDF dump is a few hundred GBs) makes it not particularly practical to deal with.

[0] https://www.wikidata.org/wiki/Q339#P31
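For what it's worth, a sketch of pulling exactly those ranked and qualified statements for Pluto from the public endpoint with Python's SPARQLWrapper (the wd:/p:/ps:/pq:/wikibase: prefixes are predeclared there; the agent string is just a placeholder):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                           agent="kg-example/0.1")
    sparql.setQuery("""
    SELECT ?class ?rank ?start ?end WHERE {
      wd:Q339 p:P31 ?stmt .              # Pluto's "instance of" statements
      ?stmt ps:P31 ?class ;
            wikibase:rank ?rank .
      OPTIONAL { ?stmt pq:P580 ?start }  # qualifier: start time
      OPTIONAL { ?stmt pq:P582 ?end }    # qualifier: end time
    }
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["class"]["value"], row["rank"]["value"],
              row.get("start", {}).get("value"), row.get("end", {}).get("value"))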


The size of the Wikidata knowledge base / relevant graph (as well as the Linked Open Data Cloud KBs and other large KBs) certainly presents some challenges. However, I think that the largest challenge, and in fact the main obstacle for practical programmatic solutions, is the use of essentially meaningless alphanumeric identifiers for entities and properties. All the corresponding identifiers first have to be discovered manually in order to construct the relevant SPARQL queries. Needless to say, these queries are not particularly human-readable (or, rather, human-interpretable) either.


Why manually, when you have APIs to find Wikidata items and properties based on their labels, descriptions, aliases, data, metadata and use?

If you mean autocomplete UI or tooltips, look no further than the query editor and its Ctrl+Space at https://query.wikidata.org/
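For example, a sketch with the requests library: the MediaWiki wbsearchentities action looks items (or properties) up by label and alias, no manual ID hunting required. The User-Agent string is just a placeholder.

    import requests

    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbsearchentities",
        "search": "dwarf planet",
        "language": "en",
        "type": "item",       # use "property" to search for properties instead
        "format": "json",
    }, headers={"User-Agent": "kg-example/0.1"})
    for hit in resp.json()["search"]:
        print(hit["id"], hit["label"], hit.get("description", ""))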


I meant APIs, not UI or tooltips. And while Wikidata entities and properties can be accessed using the MediaWiki API, arguably there are at least two issues with this: 1) you have to know the exact names of all the relevant metadata, which is quite overwhelming* (and the SPARQL query editor's autocomplete feature does not seem to help with this, except for top-level attributes); 2) entity disambiguation: yes, it can be implemented programmatically, but it has to rely on knowing exact names (values), which brings us back to point #1.

*) Here is an example of the number of attributes for a single entity: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=....


>Wikidata

https://en.wikipedia.org/wiki/Wikidata

Thanks for that! TIL.

It seems a fascinating project in epistemology!


Might be worth looking at Sowa's Conceptual Graphs. E.g., [1] talks about time and links to his book.

[1] http://www.jfsowa.com/ontology/process.htm


Check out Datomic. It's a temporal database that uses Datalog as its query language. There's also DataScript, which does the same thing.


... there's a whole bunch of Datomic-likes these days:

https://github.com/simongray/clojure-graph-resources#datalog


Unfortunate name for a product, I can't find anything called Dynamic on DDG, only dynamic things with a lowercase d. Do you have a link to the project?


not dynamic but Datomic


Dyslexic moment on my part, thank you


Lysdexia makes fools of us all ;)


Check out Nexus, which was designed with versioning in mind and solves this kind of challenge at the Blue Brain Project:

https://bluebrainnexus.io/


Is this really the semantic web debacle all over again? Curation doesn't scale and if you're going to do it just do it in a database already; standards lead to committees that end up choking progress because we have all the standards we really need already; NLP only pretends to understand written text but all it does really is tokenise badly. Just cache in a database already and move on!


SQL might be a good fit for modeling knowledge graphs, since FOREIGN KEYs can be named using the CONSTRAINT constraint_name FOREIGN KEY … syntax. We thus have support for labelling edges.

Nodes = Tables

Edges = Foreign keys

Edge labels = Foreign key constraint names
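A minimal sqlite3 sketch of that mapping (table and column names are made up for illustration); traversing the named "located_in" edge is then just a join along the constraint:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE city (
        id INTEGER PRIMARY KEY,
        name TEXT,
        country_id INTEGER,
        CONSTRAINT located_in FOREIGN KEY (country_id) REFERENCES country(id)
    );
    INSERT INTO country VALUES (1, 'US');
    INSERT INTO city VALUES (1, 'Sacramento', 1);
    """)

    print(db.execute("""
        SELECT city.name, country.name FROM city
        JOIN country ON city.country_id = country.id
    """).fetchall())   # [('Sacramento', 'US')]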


This kind of approach is pretty common, including in compute engines like Spark's graphx. I suspect a lot of teams using graph DBs would be better off realizing this: it's good for simple and small problems

it does fall down for graphy tasks like multihop joins, connect the dots, and supernodes. So for GB/TBs of that, either you should do those outside the DB, or with an optimized DB. Likewise, not explicitly discussed in the article, modern knowledge graphs are often really about embedding vectors, not entity UUIDs, and few/no databases straddle relational queries, graph queries, and vector queries


> it does fall down for graphy tasks like multihop joins, connect the dots, and supernodes.

These can always be accomplished via recursive SQL queries. Of course any given implementation might be unoptimized for such tasks. But in practice, this kind of network analytics tends to be quite rare anyway.

One should note that even inference tasks, which are often thought of as exclusive to the "semantic" or "knowledge-based" paradigm, can be expressed very simply via SQL VIEWs. Of course this kind of inference often turns out to be infeasible in practice, or to introduce unwanted noise in the 'inferred' data, but this has nothing to do with SQL per se and is just as true of the "knowledge base" or "semantic" approach.
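For instance, a sketch of multihop reachability over a generic edge table with a recursive CTE, in sqlite3 (schema and data made up for illustration):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE edge (src TEXT, label TEXT, dst TEXT);
    INSERT INTO edge VALUES
      ('Sacramento', 'located_in', 'California'),
      ('California', 'located_in', 'US'),
      ('US',         'located_in', 'North America');
    """)

    # Everything reachable from 'Sacramento' over 'located_in', any number of hops.
    print(db.execute("""
    WITH RECURSIVE reach(node) AS (
        VALUES ('Sacramento')
        UNION
        SELECT e.dst FROM edge AS e JOIN reach AS r ON e.src = r.node
        WHERE e.label = 'located_in'
    )
    SELECT node FROM reach
    """).fetchall())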


I mean performance breaks: asymptotics catch up to you on bigger data. And again, most graphs are small so often fine, or can be done out-of-DB by sending the client the subgraph


Yes, you can always map most structures into tables, or even Excel.

But I think "good fit" is a stretch. When designing systems you generally want to look at data access patterns and pick a data-execution approach that aligns with them.

In tech, unfortunately, RDBMSes are the "hammer" in "if your only tool is a hammer then every problem looks like a nail."


Honestly I think people's assumption that graph databases must be better at representing binary relations might be a bit optimistic. After all, there's no reason relational databases (named after the n-ary relations that tables represent) couldn't handle binary relations.

The one thing that's definite is that SQL is a bad choice for particular kinds of queries, though most graph databases don't seem to go much further than improving (?) the syntax a little bit and adding transitive closure (which is also present in several SQL databases). A few graph databases do allow for more complex (even arbitrary) inference, but this somehow never seems to make the headlines.


Modern SQL can express arbitrary queries (including transitive closure ofc.) since it allows for recursive table expressions.


That's still some ways off from what I'd call "arbitrary", but yeah, you can go quite far.


> That's still some ways off from what I'd call "arbitrary"

How so? What's practically missing?


Well the most useful way I've found of classifying stuff is by analogy with formal grammars. If you view the kind of paths you can query as the set of words in a language then you get something like:

Basic (non-recursive) SQL: Finite paths

Recursive SQL / Transitive closure (with negation): Regular paths

However the hierarchy of languages doesn't end there. And some graph databases allow you to add arbitrary grammar rules, which make it possible to add some complex rules like:

- If conditions X, Y and Z are satisfied then person A and B are the same person

- Equality is transitive

- If two people are equal then each property of one is also a property of the other.

The part that makes this tricky is that figuring out two people are equal can cause other people to now suddenly satisfy the condition for equality.

You could also construct some examples by creating conditions which aren't "regular" (e.g. person A and B have an ancestor which is the same number of generations back for both of them).
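A toy sketch (my own, not any particular engine's semantics) of why this needs a fixpoint: every merge propagates properties, and the enlarged property sets can enable further merges.

    def fixpoint_same_person(people, same):
        """people: {name: set of properties}; same(props_a, props_b) -> bool."""
        parent = {p: p for p in people}   # union-find forest
        props = dict(people)              # root -> accumulated properties

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        changed = True
        while changed:                    # iterate until nothing new is derived
            changed = False
            names = list(people)
            for i, a in enumerate(names):
                for b in names[i + 1:]:
                    ra, rb = find(a), find(b)
                    if ra != rb and same(props[ra], props[rb]):
                        parent[rb] = ra                    # "A and B are the same person"
                        props[ra] = props[ra] | props[rb]  # properties flow both ways
                        changed = True                     # may enable new matches
        return {p: find(p) for p in people}

    # Toy rule: any shared property means "same person". A and B share an email;
    # the merged record then shares a phone number with C, so all three end up equal.
    print(fixpoint_same_person(
        {"A": {"email:x"}, "B": {"email:x", "phone:1"}, "C": {"phone:1"}},
        same=lambda p, q: bool(p & q)))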


Graph databases likely are more optimized for this sort of data storage, but you've hit it on the head that SQL databases can be used to represent node/edge style data.


I'm surprised there aren't many knowledge graph databases, but there are graph databases, e.g. Neo4j, and there is GraphQL. It seems they are more popular.



