The author mentions the Humanities and TerminusDB at the end of their post. I work on a Humanities-based project which uses the Semantic Web (https://github.com/cyocum/irish-gen) and I have looked at TerminusDB a couple of times.
The main factor in my choice of technologies for my project was the ability to infer new data from existing data. OWL was the defining solution for my project. This is mainly because I am only one person, so I needed the computer to extrapolate data that was logically implied but that I would otherwise have been forced to encode by hand. OWL is what made my project tractable for a single person (or a couple of people) to work on.
The author brings up several points that I have also run into myself. The Open World Assumption makes things difficult to reason about and makes it hard to understand what is meant by a URL. Another problem that I have run into is that debugging OWL is a nightmare. I have no way to hold the reasoner to account, so when I run a SPARQL query I have no way of knowing whether what is presented is sane. I cannot ask the reasoner "how did you come up with this inference?" and have it tell me. That means that if I run a query, I must go back to the manuscript (MS) sources to double-check that something has not gone wrong and fix the database if it has.
Another problem that the author discusses is what I call "Academic Abandonware". There are things out there, but only the academic who worked on them knows how to make them work. The documentation is usually non-existent, and trying to figure things out can take a lot of precious time.
I will probably have another look at TerminusDB in due course, but it will need a reasoner as powerful as the OWL ones, plus an ease-of-use factor, to entice me to shift my entire project at this point.
> I work on a Humanities-based project which uses the Semantic Web (https://github.com/cyocum/irish-gen) and I have looked at TerminusDB a couple of times.
I had never come across anything like this before, but this is a wonderful project.
"Reasoning" capability can be added to any conventional database via the use of views, and sometimes custom indexes. The real problem is that it's computationally expensive for non-trivial cases.
As you put the word Reasoning in quotation marks, I might misunderstand your bottom line here (I am Autistic, so please do not get quirky on natural language semantics), but the bare statement "Reasoning can be added to any conventional database" is just not right. Reasoning is a well-defined notion from logic that is based on formal languages, semantics, and a relation called entailment (inference, in proof theory). None of that exists natively in a database. In the literature, there are two well-known ways of integrating a notion of reasoning into a database. Firstly, Datalog was invented to create recursive queries; Datalog's relation to reasoning was a side-effect, and it only covers a fragment based on Horn clauses. On the other hand, there is OWL-DL, a (limited) fragment of OWL, which encodes some kind of reasoning via query expansion over vanilla SQL queries. So maybe you can elaborate on the notion of "using views, and sometimes custom indices, to add reasoning to a conventional database".
You can think of views as modelling a particular sort of implication, which is nevertheless somewhat restricted. Despite the restriction, it may be sufficient to cover many usages of OWL, but you may need to squint a bit -- what I mean is that it is not exactly an implementation of implication, but it may be used to model one, so some degree of reinterpretation of the resulting set of tables and views may be needed. The type of implication supported is roughly "a result of a given SQL query over (a combination of) existing tables and views => a new record in a fresh table/relation".
I hardly see how you can define in an RDBMS that a resource that has both an engine and four wheels should be seen as a car.
Without going into a nightmare of unbearable SQL...
The SQL for describing "resources that contain other resources" gets a bit unidiomatic, but defining a query for those that have e.g. an engine and four wheels is quite easy. Then you can add that as a custom view, so that your inferred data is in turn available and queryable on an equal basis with raw input to the knowledge base.
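As a minimal sketch of what I mean, using Python's built-in sqlite3 (the `has_part` table and the part names are made up for illustration), the view plays the role of the inferred "car" class:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A toy "resource has part" table. Names are illustrative only.
cur.execute("CREATE TABLE has_part (resource TEXT, part TEXT)")
cur.executemany(
    "INSERT INTO has_part VALUES (?, ?)",
    [("thing1", "engine")] + [("thing1", f"wheel{i}") for i in range(4)]
    + [("thing2", "engine"), ("thing2", "wheel1")],
)

# The "inference": anything with an engine and four wheels is (viewed as) a car.
cur.execute("""
    CREATE VIEW car AS
    SELECT e.resource
    FROM has_part e
    JOIN (SELECT resource, COUNT(*) AS n
          FROM has_part WHERE part LIKE 'wheel%' GROUP BY resource) w
      ON e.resource = w.resource
    WHERE e.part = 'engine' AND w.n = 4
""")

# Inferred data is queryable like any other relation.
print(cur.execute("SELECT resource FROM car").fetchall())  # [('thing1',)]
```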
Sure. But maintaining the coherence between your business data model definitions and their implementation in the RDBMS can quickly become a massive headache, don't you think?
Very cool topic... and not the article I was expecting!
I actively work with teams making sense of their massive global supply chains, manufacturing processes, sprawling IT/IOT infra behavior, etc., and I personally bailed from RDF to Bayesian models ~15 years ago... so I'm coming from a pretty different perspective:
* The killer apps for the semantic web were historically paired with painfully manual taxonomization efforts. In industry, that's made RDF and friends useful... but mostly in specific niches like the above, and coming alongside pricey ontology experts. That's why I initially bailed years ago: outside of these important but niche domains, Google search is way more automatic, general, and easy to use!
* Except now the tables have turned: knowledge graphs for grounding AI. We're seeing a lot of projects where the idea is transformer/gnn/... <> knowledge graph. The publicly visible camp is folks sitting on curated systems like Wikidata and OSM, which have a nice back-and-forth. IMO the bigger iceberg is AI tools getting easier colliding with companies that have massive internal curated knowledge bases. I've been seeing them go the knowledge graph <> AI route for areas like chemicals, people/companies/locations, equipment, ... . It's not easy to get teams to talk about it, but this stuff is going on all the way from big tech cos (Google, Uber, ...) to otherwise stodgy megacorps (chemicals, manufacturing, ...).
We're more on the viz (JS, GPU) + ai (GNN) side of these projects, and for use cases like the above + cyber/fraud/misinfo. If into it, definitely hiring, it's an important time for these problems.
Generally agree. There is a lot of discussion concerning the technical difficulties, RDF flaws and road blocks, but little acknowledgement of other, non-technical impracticalities. Making something technically feasible does not ensure adoption. Changing a bunch of code over time will always be preferable to redefining ontologies and reprocessing the data.
Funnily enough, the "why the semantic web is good" section is the section that actually identifies why it failed.
We are going to have an ultra flexible data model that everyone can just participate in?
That never works. Protocols work by restricting possibilities not allowing everything. The more possibilities you allow, the more room for subtle incompatibilities and the more effort you have to spend massaging everything into compatibility.
That's discussed in the article though. The open world assumption is untenable. Having shareable, interoperable schemata that can refer to each other safely would be a godsend, however. And that's what is currently very hard, but needn't be.
Whether you trust the URIs or the data that was placed there is not a problem for the semantic web. The fact that you _can_ state these things and relate to other resources and concepts on the web is already wonderful and useful in itself. Google is reading this metadata and relating it to their trust/ranking-graph. The semantic web 'community' could do the same later also, in a more decentralized way (blockchain web IDs perhaps?). For now it all works fine.
The reason why the semantic web failed is even more fundamental: You can't get everyone to agree on one schema. Period. Even if everyone is motivated to, they can't agree, and if there is even a hint of a reason to try to distinguish oneself or strategically fail to label data or label it incorrectly, it becomes even more impossible.
(I mean, the "semantic web" has foundered so completely and utterly on the problem of even barely working at all that it hasn't even had to face up to the simplest spam attacks of the early 2000s, and it's not even remotely capable of playing in the 2022 space.)
Agreement here includes not just abstract agreement in a meeting about what a schema is, but complete agreement when the rubber hits the road such that one can rely on the data coming from multiple providers as if they all came from one.
Nothing else matters. It doesn't matter what the serialization of the schema that can't exist is. It doesn't matter what inference you can do on the data that doesn't exist. It doesn't matter what constraints the schema that can't exist specifies. None of that matters.
Next in line would be the economic impracticality of expecting everyone to label their data out of the goodness of their hearts with this perfectly-agreed-upon schema, but the Semantic Web can't even get far enough for this to be its biggest problem!
Semantic web is a whole bunch of clouds and wishes and dreams built on a foundation that not only does not exist, but can not exist. If you want to rehabilitate it, go get people to agree (even in principle!) on a single schema. You won't rehabilitate it. But you'll understand what I'm saying a lot more. And you'll get to save all the time you were planning on spending building up the higher levels.
Wikidata is already providing a nearly globally accepted store of concept IDs. Wikipedia adds a lot of depth to this knowledge graph too.
Schema.org has become very popular and Google is backing this project. Wordpress and others are already using it.
Governments are requiring not just "open data", but also "open linked-data" (which can then be ingested into a SPARQL engine), because they want this data to be usable across organizations.
The financial industry is moving to the FIBO ontology, and on and on...
There are a lot of wrong perspectives on the topic in this thread, but this one I like the most. When someone starts to talk about "agreeing on a single schema/ontology", it's a solid indicator that that someone needs to get back to RTFM (which, I agree, is a bit too cryptic).
The point here is that in semantic web there're supposed to be lots and lots of different ontologies/schemas by design, often describing the same data. The SW spec stack has many well-separated layers. To address exactly that problem, OWL/RDFS were created.
"The point here is that in semantic web there're supposed to be lots and lots of different ontologies/schemas by design, often describing the same data."
Then that is just another reason it will fail. We already have islands of data. The problem with those islands of data is not that we don't have a unified expression of the data, the problem is the meaning is isolated. The lack of a single input format is little more than annoyance and the sort of thing that tends to resolve itself over time even without a centralized consortium, because that's the easy part.
Without agreement, there is no there there, and none of the promised virtues can manifest. If what you say is the semantic web is the semantic web (which certainly doesn't match what everyone else says it is), then it is failing because it doesn't solve the right problem, though that isn't surprising because it's not solvable.
If what you describe is the semantic web, the Semantic Web is "JSON", and as solved as it ever will be.
A "knowing wizard correcting the foolish mortals" pose would be a lot more plausible if the "semantic web" had more to show for its decades, actual accomplishments even remotely in line with the promises constantly being made.
so if it tries to have a unified ontology, that's why it's destined to fail, but if it's designed to work with many small ontologies… that's why it will fail! lol, but you can't have it both ways.
In SW, the "semantic" part is subjective to an interpreter. You can have different data sources, partially mapped using OWL to the ontology that an interpreter (your program) understands. That allows you to integrate new data sources independently of the program: seamlessly if they use a known ontology, or by creating a mapping from a set of concepts into a known ontology (which you would have to do anyway in any other approach). So, in theory, data consumption capabilities (and reasoning) grow as your data sources evolve.
> If what you describe is the semantic web, the Semantic Web is "JSON", and solved.
It has nothing to do with JSON, JSON-LD, XML, Turtle, N3, RDFa, microdata, etc. RDF is a data model; those are serialisation formats. That's another interesting point, because half of the people talk only about formats and not the full stack. That's not a reasonable discussion.
> which certainly doesn't match what everyone else says it is
> if it tries to have a unified ontology that's why it's destined to fail, but if it's designed to working with many small ontologies… that's why it will fail! lol, but you can't have it both ways.
You're only supposed to say "you can't have it both ways" about contradictory things. It can both be a hopeless endeavor because it is impossible to agree on ontologies and a useless endeavor if you don't agree on ontologies.
> The point here is that in semantic web there're supposed to be lots and lots of different ontologies/schemas by design, often describing the same data.
This is incredibly problematic for many reasons, not the least of which is the inevitable promulgation of bad data/schemas. I remember one ontology for scientific instruments where I, a former chemist, identified multiple catastrophically incorrect classifications (I forget the details, but something like classifying NMR as a kind of chromatography. Clear indicators the OWL author didn't know the domain).
The only thing worse than a bad schema is multiple bad schemas of varying badness, and not knowing which to pick. Especially if there are disjoint aspects of each which are (in)correct.
There may have been advancements in the few years since I was in the space, but as of then, any kind of probabilistic/doxastic ontology was unviable.
It doesn't, which is exactly the problem. Ontologies inevitably have mistakes. When your reasoning is based on these "strong" graph links, even small mistakes can cascade into absolute garbage. Plus manual taxonomic classification is super time consuming (ergo expensive). Additionally, that assumes that there is very little in the way of nebulosity, which means you don't even have a solid grasp of correct/incorrect. Then you have perspectives - there is no monopoly on truth.
It's just not a good model of the world. Soft features and belief-based links are a far better way to describe observations.
Basically, every edge needs a weight, ideally a log-likelihood ratio. 0 means "I have no idea whether this relation is true or false", positive indicates truthiness and negative means the edge is more likely to be false than true.
Really, the whole graph needs to be learnable. It doesn't really matter if NMR is a chromatographic method. Why do you care what kind of instrument it is? Then apply attributes based on behaviors ("it analyses chemicals", "it generates n-dim frequency-domain data")
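A rough sketch of what such belief-weighted edges could look like, treating each accumulated log-likelihood ratio as log-odds with a neutral prior (the relation names and weights here are invented for illustration):

```python
import math
from collections import defaultdict

# Each edge carries a log-likelihood ratio: 0 = no idea, >0 = probably true, <0 = probably false.
edges = defaultdict(float)

def assert_edge(subj, rel, obj, llr):
    """Accumulate evidence for (subj, rel, obj) as a log-likelihood ratio."""
    edges[(subj, rel, obj)] += llr

def belief(subj, rel, obj):
    """Turn the accumulated log-odds into a probability (neutral prior assumed)."""
    llr = edges[(subj, rel, obj)]
    return 1.0 / (1.0 + math.exp(-llr))

# Invented example: conflicting evidence about how to classify an NMR spectrometer.
assert_edge("NMR", "is_a", "chromatography_method", -2.0)      # strong evidence against
assert_edge("NMR", "analyses", "chemicals", +3.0)              # behavioural attribute, well supported
assert_edge("NMR", "produces", "frequency_domain_data", +2.5)

print(belief("NMR", "is_a", "chromatography_method"))  # ~0.12
print(belief("NMR", "analyses", "chemicals"))          # ~0.95
```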
Yes, that's not solvable with just OWL (though it might help a little) or any other popular reasoner I know. There are papers, proposals and experimental implementations for generating probability-based inferences, but nothing one can just take and use; still, there are tons of interesting ideas on how to represent that kind of data in RDF or reason about it.
I think the correct solution in SW context would be to add a custom reasoner to the stack.
I've been part of 4 commercial projects that used the semantic web in one way or another. All of these projects, or at least their semantic web parts, were failures. I think I have a good idea of where the misunderstanding about the semantic web originates. The author does seem to have a good understanding and is right about the semantic web forcing everything into a single schema. Academia sells the straitjacket of the semantic web as a lifelong free lunch at an all-you-can-eat buffet, but instead you are sentenced to life in prison. Adopting RDF is just too costly because it is never the way computers or humans structure data in order to work with it. Of course everything can be organised in a hypergraph, there is a reason why Stephen Wolfram also uses this structure, they are just so flexible. At the end of the day I don't agree with the author's opinion that the semantic web has much of a future. I did my best but it didn't work out; time for other things.
> semantic web forcing everything into a single schema
I don't think "forcing" is the right word here; I think the right one would be "expects it to converge under practical incentives". That's a gentler statement that reflects the fact that it doesn't have to converge for SW tech to work.
Also, the term "schema" is a bit off, because there's really no such thing in there. You can have the same graph described differently using different ontologies at the same moment, without changing the underlying data model, accessible via the same interface. It's a very different approach.
> never the way computers or humans structure data in order to work with it
If you hadn't mentioned that you had that experience, I would say you were confusing different layers of technology, because a graph data model is a natural representation of many complex problems. But because you have, can I ask you to clarify what you mean here?
> Academia sells the straitjacket of the semantic web as a lifelong free lunch at an all-you-can-eat buffet
I disagree, bc I in fact think that academia doesn't sell shit, and that's the problem. There's no clear marketing proposal, and I don't think they really bother or are equipped to make one. There's a lack of human-readable specs and docs; it's insane how much time you need to invest in this topic even just to be able to estimate whether it's reasonable to consider using SW in the first place. Also, the lack of a conceptual framework, "walkthroughs" and tools, plus outdated and incorrect information, drops the survival chance of a SW-based project by at least 100x. But it can really shine in some use-cases, which unfortunately have little to do with the "web" itself.
RDF is just an interoperability format. You aren't supposed to use it as part of your own technology stack, it just allows multiple systems to communicate seamlessly.
I don't think it's impossible to agree on one schema but it's very expensive to do so and requires tools from the study of philosophy.
While I don't work in the domain, the ontologies in the OBO Foundry and all the ones deriving from Basic Formal Ontology[0] have some level of compatibility that make their integration possible. Still far from "one schema to rule them all" but it shows that agreement can be achieved.
There are other initiatives that I'm aware of that could also qualify as a step in the right direction: "e-Government Core Vocabularies" and the "European Materials Modelling Ontology".
I hope and want to believe that, sooner than later, we will have formalized definitions for most practical aspects of our lives.
I'm building a front end app for Wikipedia & Wikidata called Conzept encyclopedia (https://conze.pt) based on semantic web pillars (SPARQL, URIs, various ontologies, etc.) and loving it so far.
The semantic web is not dead, it's just slowly evolving and growing. Last week I implemented JSON-LD (RDF embedded in HTML with a schema.org ontology), super easy, and now any HTTP client can comprehend what any page is about automatically.
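To make that concrete, here is a minimal sketch of the kind of snippet this produces, built with nothing more than Python's json module (the article values here are invented, not Conzept's actual markup):

```python
import json

# Invented example data; "@context" points at the schema.org vocabulary.
metadata = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Why the Semantic Web Is Not Dead",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2022-01-01",
}

# Any HTTP client that understands JSON-LD can now read what the page is about.
snippet = '<script type="application/ld+json">\n%s\n</script>' % json.dumps(metadata, indent=2)
print(snippet)
```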
See https://twitter.com/conzept__ for many examples what Conzept can already do. You won't see many other apps do these things, and certainly not in a non-semantic-web way!
The future of the semantic web is in: much more open data, good schemas and ontologies for various domains, better web extensions understanding JSON-LD, more SPARQL-enabled tools, better and more lightweight/accessible NLP/AI/vector compute (preferably embedded in the client also), dynamic computing using category theory foundations (highly interactive and dynamic code paths, let the computer write logic for you), ...
Having worked for an Academic Publisher that had intense interest in this, I finally came to the following conclusions as to why this is DOA.
1. Producers of content are unwilling to pay for it (and neither are consumers BTW)
2. It is impossible to predict how the ontology will change over time so going back and reclassifying documents to make them useful is expensive.
3. Most pieces of info have a shelf life so it is not worth the expense of doing it.
4. Search is good enough and much easier.
5. Much of what is published is incorrect or partial anyway.
In the end I decided this is akin to discussing why everybody should use Lisp to program, but the world has a different opinion.
Semantic Web lost itself in fine details of machine-readable formats, but never solved the problem of getting correctly marked up data from humans.
In the current web and apps people mostly produce information for other people, and this can work even with plain text. Documents may lack semantic markup, or may even have invalid markup, and have totally incorrect invisible metadata, and still be perfectly usable for humans reading them. This is a systemic problem, and won't get better by inventing a nicer RDF syntax.
In language translation, attempts at building rigid formal grammar-based models have failed, and throwing lots of text at machine learning has succeeded. The Semantic Web is most likely doomed in the same way. GPT-3 already seems to have more awareness of the world than anything you can scrape from any semantic database.
Comparing "rigid formal grammar-based models" (whatever that might actually mean for now) to machine learning is like comparing apples to bananas. The former is a rigorous syntactical formalization, aimed at being readable by machines and humans alike. The latter is a learned interpolation of a probability distribution function. I do not see a single way to compare these two "things". Nevertheless, I may guess what you are actually trying to say: annotating data by hand (the syntax is completely irrelevant) is inferior to annotating data by machine learning. And this claim is at least debatable and domain-dependent. There are domains where even a 3% false-positive rate translates to "death of a human being in 3 out of 100 identified cases", and there are domains where it's too much work to formalize every bit and piece of the domain and extracting (i.e. learning) knowledge is a feasible endeavor. I have experience in both fields, and I dare say that extracting concepts and relations out of text in a way that they can be further processed and used for some kind of decision process is way more complicated than you might imagine, and GPT-3 et al. do not achieve that.
Sure, but there are still a lot of decisions being made behind the curtain, when it comes to producing a model like GPT-3. How was the training data ontologized? Where did it come from? To some extent, these are the same problems facing manual curation.
GPT may have had some manual curation to avoid making it too horny and racist, but on a technical level for such models you can just throw anything at it. The more the better, shove it all in.
On our podcast, The Technium, we covered the Semantic Web as a retro-future episode [0]. It was a neat trip back to the early 2000s. It wasn't a bad idea, per se, but it depended on humans doing-the-right-thing for markup and the assumption that classifying things is easy. Turns out neither is true. In addition, the complexity of the spec really didn't help those that wanted to adopt its practices. However, there are bits and pieces of good ideas in there, and some of it lives on in the web today. Just have to dig a little to see them. Metadata on websites for fb/twitter/google cards, RDF triples for database storage in Datomic, and knowledge-base-powered searches all come to mind.
I was hired by a BIG company to help their data governance, and a pragmatic semantic web is giving pretty interesting results.
Just to add some hotness/trollness to the discussion, Neo4J was a mind opener for many people [both technical and non-technical]
> The systems that have succeeded at scale have made simple implementation the core virtue, up the stack from Ethernet over Token Ring to the web over gopher and WAIS. The most widely adopted digital descriptor in history, the URL, regards semantics as a side conversation between consenting adults, and makes no requirements in this regard whatsoever: sports.yahoo.com/nfl/ is a valid URL, but so is 12.0.0.1/ftrjjk.ppq. The fact that a URL itself doesn’t have to mean anything is essential – the Web succeeded in part because it does not try to make any assertions about the meaning of the documents it contained, only about their location.
>
> There is a list of technologies that are actually political philosophy masquerading as code, a list that includes Xanadu, Freenet, and now the Semantic Web. The Semantic Web’s philosophical argument – the world should make more sense than it does – is hard to argue with. The Semantic Web, with its neat ontologies and its syllogistic logic, is a nice vision. However, like many visions that project future benefits but ignore present costs, it requires too much coordination and too much energy to effect in the real world, where deductive logic is less effective and shared worldview is harder to create than we often want to admit.
>
> Much of the proposed value of the Semantic Web is coming, but it is not coming because of the Semantic Web. The amount of meta-data we generate is increasing dramatically, and it is being exposed for consumption by machines as well as, or instead of, people. But it is being designed a bit at a time, out of self-interest and without regard for global ontology. It is also being adopted piecemeal, and it will bring with it all the incompatibilities and complexities that implies. There are significant disadvantages to this process relative to the shining vision of the Semantic Web, but the big advantage of this bottom-up design and adoption is that it is actually working now.
"However, like many visions that project future benefits but ignore present costs, it requires too much coordination and too much energy to effect in the real world" ... Wikipedia, Wikidata, OpenStreetMaps, Archive.org, ORCID science-journal stores, and the thousands of other open linked-data platforms are proofing Clay wrong each day. He has not been relevant for a long time IMHO. Semweb > tag-taxonomies.
If natural language processing (NLP) can indeed understand unstructured text, then, according to (2), the "Semantic Web" is not needed, except perhaps for caching NLP outputs in machine-readable form.
(1) is more fundamental: a lot of value-add annotation (in RDF or other forms) would be valuable, but because there is work involved, those that have it don't give it away for free. This part was not sufficiently addressed in the OP: the Incentive Problem. Either there needs to be a way for people to pay for the value-add metadata, or there has to be another benefit for the provider, a reason why they would give it away. Most technical articles focus on the format, or on some specific ontologies (typically without an application).
A third issue is trust. In Berners-Lee's original paper, trust is shown as an extra box, suggesting it is a component. That's a grave misunderstanding: trust is a property of the whole system/ecosystem; you can't just take a prototype and say "now let's add a trust module to it!" In the absence of trust guarantees, who ensures that the metadata that does exist is correct? It may just be spam (annotation spam may be the counterpart of Web spam in the unstructured world).
No Semantic Web until the Incentive Problem and the Trust Problem are solved.
"No Semantic Web until the Incentive Problem and the Trust Problem are solved."
No. The semweb is already functional as is (see my other comments here). Trust is orthogonal and can be (and is being) solved in different ways (centralized/decentralized, as in Wikidata/ORCIDs/org-ID-URIs).
Talking about “the incentive problem” as if it’s some minor fixable issue ignores all of human psychology and economics.
The climate crisis is a somewhat comparable example - it requires changing behavior on a massive scale for abstract benefit. In the climate case the benefit is much more fundamental than what semweb promises. And despite massive pain and effort we are very very far from addressing it. Thinking semweb would happen just cuz it sounds cool is super naive.
- SPARQL is _a lot better_ than the many different forms of SQL.
- Adding some JSON-LD can be done through simple JSON metadata. Something people using Wordpress are already able to do. All this will be more and more automated.
- The benefit is ontological cohesion across the whole web. Please take a look at the https://conze.pt project and see what this can bring you. The benefit is huge. Simple integration with many different stores of information in a semantically precise way.
2) AI/NLP is never completely precise and requires huge resources (which require centralization). The basics of the semantic web will be based on RDF (whether created through some AI or not), SPARQL and ontologies, and extended/improved by AI/NLP. It's a combination of the two that is already being used for Wikipedia and Wikidata search results.
> The benefit is ontological cohesion across the whole web
This has no benefit for the person who has to pay to do the work. Why would I pay someone to mark up all my data, just for the greater good? When humans are looking/using my products, none of this is visible. It's not built into any tools, it doesn't get me more SEO, and it doesn't get me any more sales.
Why are people editing Wikipedia and Wikidata? What would it bring you if your products were globally linked to that knowledge graph and Google's machines would understand that metadata from the tiny JSON-LD snippet on each page? The tools are here already, the tech is evolving still, but the knowledge graph concept is going to affect web shop owners too soon enough.
It’s unclear to me at this point why people are contributing to Wikipedia and certainly Wikidata, but they’re getting something out of it (perhaps notoriety), and a lot probably has to do with contributing to the greater good. It’s all non-profit. The rest of the web is unlike these stand-out projects.
Meanwhile, why would, say, Mouser or Airbnb pay someone to mark up their docs? WebMD? Clearly nothing has been compelling them to do so thus far, and when you’re talking about harvesting data and using it elsewhere, it’s a difficult argument to make. Google already gets them plenty of traffic without these efforts.
They do it because it benefits them too. OpenStreetMaps links with WD, GLAMs link with WD, journals/ORCIDs link with WD, all sorts of other data archives link with WD. Whoever is not linking with it may see a crawler pass by to collect license-free facts.
Also, I just checked: WebMD is using a ton of embedded RDF on each page. They understand SEO well as you said :)
A refinement on your second point is that the groups who would have benefited the most from semantic web were the googles of the world, but they were also the ones who needed it the least. Because they were well ahead of everybody else at building the NLP to extract structure from the existing www. In fact the existence of semantic web would have eroded their key advantage. So the ones in a position to encourage this and make it happen didn’t want it at all. So it was always DOA.
Working for a company, 100% semantic web, integrating many, many parties for many years now, all of it RDF.
- you get used to turtle. one file can describe your db and be ingested as such. handy.
- interoperability is really possible. (distributed apps)
- hardest part is getting everyone to agree on the model, but often these discussions are more about resolving ambiguities surrounding the business than about translating it to a model. (it gets things sharp)
- agree on a minimum model, open world means you can extend in your app
- don't overthink your owl descriptions
- no, please no reasoners. data is never perfect.
- tooling is there
- triple stores are not the fastest
pls, not another standard to fix the semantic web. Everything is there. More maturity in tooling might be welcome, but this is a function of the number of people using it.
Very well written introduction to some of the problems with semantic web dev.
Personally I think the reason it died was there were no obvious commercial applications. There are of course commercial applications, but not in a way that people realize what they're using is semantic web. Of all the 'note keepers' and 'knowledge bases' out there, none of them are semantic web. Thus it has languished in academia and a few niche industries in backend products, or as hidden layers, ex. Wikipedia. Because there wasn't something we could stare at and go "I am using the semantic web right now", there was no hype, and no hype means no development.
Very hard to make a business case, for the reasons you mentioned, plus the costs are very front-loaded because ontologies are so damn hard to build, even for very well-contained problems. Without a clear payoff, why bother?
Yes, because that is about formalizing all human thought and knowledge. In principle that has nothing to do with computers and is something everybody working in science and the humanities has always been trying to do, starting with Socrates, or was it Pythagoras. It is about "building theories".
Now computers can help in that of course but it doesn't really make it easy to create a consistent stable "theory of everything". As we used to say "garbage in garbage out".
Semweb people got burned out by the stress of making new standards which means that standards haven't been updated. We've needed a SPARQL 2 for a long time but we're never going to get it.
One thing I find interesting is that description logics (OWL) seem to have stayed a backwater in a time when progress in SAT and SMT solvers has been explosive.
> Semweb people got burned out by the stress of making new standards which means that standards haven't been updated.
True. But also, web standards beyond just the semantic web seem to have mostly been abandoned or died. I am not sure how to explain it, but there was a golden age of making interoperable higher-level data and protocol standards, and... it's over. There's much less standards-making going on. It's not just SPARQL that could use a new version but has no standards-making activity going on.
I can't totally explain it, and would love to read someone who thinks they can.
A recent paper connects SHACL (mentioned in OP) to description logic and OWL: https://arxiv.org/abs/2108.06096 . This is a surprising link which seems to have been missed by SemWeb practitioners when SHACL was proposed.
That's a very good point re SAT/SMT. F* (https://www.fstar-lang.org/) has done truly amazing things by making use of them, and it's great to be able to get sophisticated correctness checks while doing basically none of the work.
I'm going to have to go away and think about how one could effectively leverage this in a data setting, but I'd love to hear ideas.
It doesn't have anything directly to do with SAT but I'd say the #1 deficiency in RDFS and OWL is this.
Somebody might write
:Today :tempF 32.0 .
or
:Today :tempC 0.0 .
The point of RDFS and OWL is not to force people into a straitjacket, the way people think it is, but rather to make it possible to write a rulebox after the fact that merges data together. You might wish you could write
:tempC rdfs:subPropertyOf :tempF .
but you can't; what you really want is to write a rule like
?x :tempC ?y -> ?x :tempF ?y*1.8 + 32.0
but OWL doesn't let you do that. You can do it with SPIN but SPIN never got ratified and so far all the SPIN implementations are simple fixed point iterators and don't take advantage of the large advances that have happened with production rules systems since they fell out of fashion (e.g. systems in the 1980s broke down with 10,000 rules, in 2022 1,000,000 rules is often no problem.)
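For what it's worth, one way to get that kind of after-the-fact conversion rule today is a SPARQL CONSTRUCT query with a BIND expression (roughly the territory SPIN standardised on top of SPARQL). A small rdflib sketch, reusing the :tempC/:tempF idea with an invented example namespace:

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")

g = Graph()
g.bind("ex", EX)
g.add((EX.Today, EX.tempC, Literal(0.0)))  # someone asserted the Celsius reading

# The "rule": every ex:tempC triple implies a corresponding ex:tempF triple.
rule = """
PREFIX ex: <http://example.org/>
CONSTRUCT { ?x ex:tempF ?f }
WHERE {
  ?x ex:tempC ?c .
  BIND (?c * 1.8 + 32.0 AS ?f)
}
"""

# Materialise the inferred triples back into the graph.
for triple in g.query(rule):
    g.add(triple)

# ex:Today now carries both ex:tempC and the derived ex:tempF value.
print(g.serialize(format="turtle"))
```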
Wikidata is quite usable, though, with SPARQL over REST. To me the biggest problem seems to be the lack of documentation, but for small-scale experiments interesting stuff can be done with it (with enough caching, probably with SQL). Running my own triple store seems like a lot of work though; even just choosing which one to use, actually.
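If anyone wants to try that, a toy example against the public endpoint using the SPARQLWrapper package looks roughly like this (the query itself is just an illustration):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="semweb-example/0.1 (toy script)",  # Wikidata asks for a descriptive user agent
)
# Toy query: a handful of items that are instances of "programming language" (Q9143).
sparql.setQuery("""
SELECT ?lang ?langLabel WHERE {
  ?lang wdt:P31 wd:Q9143 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["langLabel"]["value"])
```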
Semantic web is data science for the browser. Most people can’t even figure out how to architect HTML/JS without a colossal tool to do it for them, so figuring out data science architecture in the browser is a huge ask.
There are two camps: one that thinks you should use tools to generate HTML/JS, and that those tools should generate strict XML and any extra semantic data. The problem is that the actual users of these tools either don't care about, or don't know about, semantic HTML or semantic data.
Then there is the other camp, which thinks HTML should be written by hand, which makes it small, simple and semantic (layout and design separated into CSS), without any div elements.
Hand-writing the semantic data in addition to the semantic HTML becomes too burdensome.
I only skimmed the article so maybe I missed it, but at a glance it seemed to completely miss the biggest issue: people will intentionally mislabel things. If chocolate is trending, people will add "chocolate" to their tags for bitcoin.
You can see this all over the net. One example is the tags on SoundCloud.
Another issue is agreeing on categories, say women vs men or male vs female. For the purpose of identity the fluidity makes sense, but less so for search. To put it another way, if I search for brunettes I'd better not see any blondes. If I search for dogs I'd better not see any cats. And what to do about ambiguous stuff? What's a sandwich? A hamburger? A hotdog? A gyro? A taco?
Looks like it's been temporarily suspended, but worth mentioning: The Cambridge Semantic Web meetup, which I attended frequently around 2010-2013. It was cofounded by Tim Berners-Lee, and I got to meet him there a couple times. In fact, I think its earliest iteration was Berners-Lee and Aaron Swartz.
Met once a month in the STAR room at MIT. The best part was staying after to schmooze and drink with older programmers at the Stata Center bar down the hall from the STAR room. What a cool building, the Stata Center! And what cool topics we would discuss every week. Since Cambridge has so many pharma companies, a lot of the talks were regarding practical ontologies for pharmacology.
edit, a spandrel: Isn't w3c based out of MIT? And Swartz and Berners-Lee were in Boston at the same time.
Well, in one sense they are directly interconvertible. The documents in TerminusDB are elaborated to JSON-LD internally during type-checking and inference.
However, it's not just a question of whether one can be made into another. The use of contexts is very cumbersome, since you need to specify different contexts at different properties for different types. It makes far more sense to simply have a schema and perform the elaboration from there. Plus, without an infrastructure for keys, IDs become extremely cumbersome. So beyond just type decorations on the leaves, it's the difference between:
In the late 1990s, I worked on lowercase-semantic Web problems.
I used descriptions like "the Web as distributed machine-accessible knowledgebase".
Some of the problems I identified were already familiar or hinted at from other domains (e.g., getting different parties to use the same terms or ontology, motivating the work involved, the incentive to lie (initially mostly thinking about how marketers stretch the facts about products, though propaganda etc. was also in mind), provenance and trust of information, mitigations of shortcomings, mitigating the mitigations, etc.).
One problem I didn't tackle... I got into distributing computation among huge numbers of humans, and probably stopped thinking about commercial organization incentives. I don't recall at that time asking "what happens if a group of some kind invests lots of effort into a knowledge representation, and some company freeloads off of that, without giving back?". But we had seen examples of that in various aspects of pre-Web Internet and computing. Maybe I was thinking something akin to compilation copyright, or that the same power that generated the value could continue to surprise and outperform hypothetical exploiters. Also, in the late 1990s, every crazy idea without traditional business merit was getting funded, and it was all about usefulness (or stickiness) and what potential/inspiration you could show.
I think the general message here is that complex and complete architectures tend to fail in favor of simpler solutions that people can understand and use to get things done in the here and now.
It's interesting to me that the recent uptick in the personal knowledge management space (aka tools for thought)[0] is all around the bi-directional graph, which is basically a 2-tuple, simplified version of the RDF 3-tuple. You lose the semantics of a labelled edge, but it's easier for people to understand (a quick sketch of the contrast follows the footnote).
[0] See Roam Research, Obsidian, LogSeq, Dendron et al.
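A tiny sketch of the contrast, with made-up note titles:

```python
# Tools-for-thought style: unlabelled, bi-directional links (2-tuples).
backlinks = {
    ("Semantic Web", "RDF"),
    ("RDF", "SPARQL"),
}

# RDF style: labelled edges (3-tuples) -- the extra slot is the predicate.
triples = {
    ("Semantic Web", "isBuiltOn", "RDF"),
    ("RDF", "isQueriedWith", "SPARQL"),
}

# With 2-tuples you only know *that* two notes are related;
# with 3-tuples you also know *how* they are related.
for a, b in sorted(backlinks):
    print(f"{a} <-> {b}")
for s, p, o in sorted(triples):
    print(f"{s} --{p}--> {o}")
```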
We're trying to make semantic web models easier to use with a project called TreeLDR...I think usability has been one of the biggest issues of this ecosystem and OSS in general. Think programmer-friendly data structure definitions that compile to JSON-LD contexts, jsonschemas, and beyond.
While I love the semantic web I see two major issues with it:
1. Standardization with regard to (globally) unique identifiers and ontologies. Most things in the semantic web have multiple identifiers and, based on personal preferences, attributes linked to different ontologies. There are several projects that try to gather data for the same thing from various ontologies, but sometimes the same attributes have differing values because of conversions, or simply from extracting data points from different publications where different methods have been used to measure stuff.
2. Performance of large datasets gets really bad since distributing graphs is still a problem that lacks good solutions. One of the solutions is to store data in distributed column stores. But there's still a ton of unsolved graph traversal performance issues.
I strongly believe that the technological barriers need to be solved first. Until then there will always be the person in meetings asking why not use relational or NoSQL tech because of performance...
Many of the biggest companies in world are using semweb tech: http://sparql.club
Open linked-data has been growing very fast over the last few years. Many governments are now demanding LD from their executive/subsidized organizations. These data stores are then made accessible using REST and/or SPARQL.
Interesting writeup. I'm of the opinion that the naming problem (what to call "things"?) sits in the idea that going from structured documents to structured data is one abstraction level too deep (i.e., people don't agree on what to call "things"). I believe this can be solved by similarity search, if we can approximate the data and represent the structure in embeddings. Hopefully, this might be a step in the 2nd try, as mentioned in the MD :)
> It would be like wikipedia, but even more all encompassing, and far more transformational.
My take is that we know a lot of this already but refuse to accept the solutions. The way to exchange data and the way to relate and query data are both known to a large extent: canonical S-expressions and datalog-ish expressivity. I just can't understand why no one thinks datalisp.is is a persuasive foundation.
I think there is a lot of fussing about technical solutions to what is ultimately a cultural problem.
Suppose we had the perfect technology to define ontologies over real data.
This doesn't address the fact that Anglo-American culture is hostile to alternative ontologies. The idea of "one Truth" is baked into the national consciousness, from classical Western religion+philosophy to the liberal-democratic Constitution to Wikipedia and the current Fact-Checking™ Brought To You By Lockheed-Martin™ news-media regime.
With this worldview, there is no reason to invest in designing or implementing Semantic Web technologies. It's like building a monument to a god that you don't believe exists. Waste of time.
To be clear, I spend a lot of time thinking about the technical side too and implementing enterprise solutions. I just think it's naive to frame it as primarily a technical problem when it comes to wider public deployment.
> My experience in engineering is that you almost always get things wrong the first time.
Probably the oldest gem I can remember, harvested from a more senior mentor type, was the quip “It takes 3 times to get it right. And that’s an average. Get failing.”
Now, I’m that older guy. I still think this holds.
This is a really fascinating analysis. I have wondered why the semantic web never took off, and I am finding myself interested in being able to create data sources in a federated way. The author’s mention of Data Mesh and of his own project, TerminusDB, looks like what I had been looking for, for a side project.
One adjacent project I did not see mentioned is XMPP. The extensibility of XMPP comes from being able to refer to schemas within stanzas of the payload. It’s also an interesting case study on an ecosystem built from a decentralized, extensible protocol. One of the burdens plaguing the XMPP ecosystem is spam, and I wonder to what extent we might see that if the semantic web revives again.
Building a trust relationship between commercial entities isn't automatable; it nearly always requires a contract to be carefully hand-written and argued over by high-priced lawyers before any meaningful exchange of value can take place.
Sure, this is an unfortunate level of friction, and overkill in many cases, but think about it from a cost/benefit perspective: I can spend $10k on legal fees and successfully avoid not just a lot of uncertainty, but very infrequently, the contract also protects me from losses that can be orders of magnitude larger than it cost me to negotiate the contract.
Look at EDIFACT. Huge standardization effort, but it was still not possible to automate system-to-system communication, because ultimately you need to rely on some words, and words are flexible. I was working with multiple companies that understood "through-invoicing" in EDIFACT differently, but the differences were so subtle they needed a third party to clarify them.
Lately, in various sectors, such as finance, there are commercially available reference data models. These are extremely complex, because they need to cover all the possible alternatives businesses might have, in various countries. Just to gain basic understanding of such a model is a huge effort. To have people to label things properly would probably involve learning a similar system.
Sort of reminds me of the original idea behind REST. IMO automated system-to-system is a dead end... you're always going to need humans in the loop for any useful non-trivial data.
The web is already semantic and machine-readable. The machine reads and interprets the HTML code and displays the semantic meaning of the page to the user.
If you want the machine to read the same meaning out that the human does, you need a smarter machine, not a different format.
For a year and a half, I worked on a project called OSLC: Open Services for Lifecycle Collaboration [0] which became an Oasis Open Project. It's an open community building practical specifications for integrating software. For software tools that adopt and provide OSLC enabled APIs, data integration and supported use cases become really easy.
As an example, if your department prefers Tool A for defining requirements (Aha, etc ...), Tool B for change management (bugzilla, etc ...) and Tool C for test management and they aren't already a unified platform, it can be hard to gain semantic context across them. I've seen many situations where dev teams prefer a specific FOSS/vendor change management tracking tool while testers prefer a different thing and are unwilling to change because of historical test automation investment. To illustrate, imagine I run a test and it fails. I want to open a bug and have it linked to this failing test and also associate it with an existing requirement. If all 3 tools are OSLC API enabled consumers/producers, then their data can be integrated together trivially and experiences can be far more seamless and pleasant to all involved (e.g. testers can have popups to query (find/select reqmnts) or delegated creates (open new bug)) without leaving their own familiar test tool's UI. Nice. Anything can have an OSLC enabled API adapter from existing servers to spreadsheets (with an associated proxy server). It has great promise in bringing FOSS/vendor tooling together.
In a nutshell, it's a set of standards around building a digital thread for tools to integrate together. Workstreams are focused per domain (quality management, change management, requirements management, etc ...) [1]. Linked Data and RDF are its core tech underpinning [2]
A lot of the semantic web has evolved, spurred on by SEO and the need to accurately scrape data from web pages. The old semantic web seemed to be more of a solution in search of every problem. I'm not surprised that searches for "semantic web" are down, as most interest now is focused on structured data via microformats, JSON-LD and standards published at schema.org.
I am surprised no one has mentioned schema.org. It is a much simpler standard and more widely used than RDF/OWL.
Another point, I think, is that it is not in any publisher's interest to publish structured data, as it is easily copyable. For example, neither Amazon nor Wikipedia publishes using schema.org. It would make their data susceptible to 3rd-party aggregators.
I love that a post on why we need the semantic web has a subheading titled “Key Innvoations”, because really the reason the semantic web died is because we need automated agents capable of dealing with the web as it is, not a web designed for automated agents.
Nit: Datalog isn't as powerful as Prolog, that's the whole point of it as a decidable fragment of first order logic (and it's seeing increased use in SMT/fixpoint solvers and databases)
But yeah, if getting rid of the whole SemWeb stack, triples, and their many awful serialization formats, design-by-committee query and constraint languages (keeping just the good parts) means we can finally return to focus on Prolog, Datalog, and simple term encodings of logic, I'm all for it.
The semantic web is a notion for defining data relationships. Datalog and SQL are languages for queries. These have little to do with one another. It's like saying that HTML is failing as a format, so the answer is HTTP.
SQL is not only for queries; it includes a Data Definition Language as part of it, I believe. You use SQL to define the schema to which the data in the database must conform.
Similarly for Prolog and Datalog. You can define the form, the schema of data in a simple way and then the relations between the data by declaring simple inference rules between them.
Datalog has the benefit of being simpler than Prolog. Simpler is better if that is all we need.
I'm always surprised when articles like these don't talk about the meaning behind the original phrase. The semantic web came to be because there existed a need for computers to understand the contents of webpages, which were invariably human-generated.
We now have a huge set of tools for that under the broader AI/ML umbrella - ML is obviously imperfect, but its cautious utilization across various industries is, to me, a step in the right direction. There's simply no need to pigeonhole ourselves into a "semantic web" data model that might not fit a particular topic or application.
I personally think that embeddings (https://milvus.io/docs/v1.1.0/vector.md) will eventually take over semantic search applications. NLP, for better or worse, has seen great success naively throwing text into massive pre-trained models (I say naively because a lot of these models still think Obama or Trump is president). We've also made great progress unifying the architecture of NLP and CV models via transformer architectures, and we're now seeing lots of CV applications follow suit.
The future of web standards will be structured in neural network high dimensional spaces. Accessibility to that future web will be built in models that exist across a decentralized environment similar to blockchain/smart-contract architectures.
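As a concrete (if simplified) illustration of the embeddings-for-semantic-search idea above, here is a sketch using the sentence-transformers package and cosine similarity; the model choice and documents are just examples:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# A small, commonly used general-purpose embedding model (example choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "NMR spectroscopy measures the magnetic properties of atomic nuclei.",
    "Gas chromatography separates compounds in a mixture.",
    "The 2022 World Cup was held in Qatar.",
]
query = "Which technique analyses chemicals by their nuclear spin?"

doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors.
scores = doc_vecs @ query_vec
best = int(np.argmax(scores))
print(docs[best])  # expected: the NMR sentence
```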
That GitHub repo was created 2 days ago; wasn't this article discussed elsewhere someplace? It looks very recognizable. Was it on a blog or something and just made a new home on GitHub, or was it some other similar article I may be thinking of?
"Semantic web" and "dead" have, I think, been paired up in a few places previously. Thanks for cataloging all the subject matter on GitHub. I have a background interest in this subject that's been kicked around a bit through the years.
You can debate syntax forever but the semantic web will never rise without the proper incentives. Not only is there no incentive for industry to participate in it, there's in fact an anti-incentive to do so.
Say you've built a weather app/website. Being a good citizen, you publish "weatherevent" objects. Now anybody can consume this feed, remix it, aggregate it, run some AI on it, new visualizations, whichever. A great thing for the world.
That's not how the world works. Your app is now obsolete. Anybody, typically somebody with more resources than you, will simply take that data and out-compete you, in ways fair or unfair (gaming rankings). You may conclude that this is good at the macro level, but surely the app owner disagrees at the micro level.
Say you're one of those foodies, writing recipes online with the typical irrelevant life story attached. The reason they do this is to gain relevance in Google (which is easily misled by lots of fluffy text), which creates traffic, which monetizes the ads.
Asking these foodies instead to write semantic recipe objects destroys the entire model. Somebody will build an app to scrape the recipes and that seals the fate of the foodie. No monetization therefore they'll stop producing the data.
In commercial settings, the idea that data has zero value and is therefore to be freely and openly shared is incredibly naive. You can't expect any entity to actively work against their own self-interest, even less so when it's existential.
As the author describes, even in the academic world, supposedly free of commercial pressure, there's no incentive or even an anti-incentive. People would rather publish lots of papers. Doing things properly means fewer papers, so punishment.
Like I said, incentives. The incentive for contributing to the semantic web is far below zero.
As my Reinforcement Learning professor said: "It's all about incentives, people."
This is the kind of idea that begs me to reconsider crypto as a possible real-world-problem-solving-tool. But I've yet to see an example of crypto working in a way that feels like it'll take off for anything other than (1) another form of "stock" at best, or (2) a grift at worst. I suppose we're in the market for another solution.
I think the fundamental issue in the digital world is that you compete with the entire damn world.
When I open a bakery, competition is limited to just a few miles of space. Provided I offer a decent product, I can exist. This idea allows for millions of independent bakeries to exist around the world, which is awesome. It provides great diversity in products, genuine creativity, cultural differentiation, meaningful local employment.
When you need to compete with the entire world, it's a different game altogether. Everything you do digitally can fairly easily be replicated at low cost. This creates an unstoppable force of centralization fueled by capital but also consumer preference: they rather have one service that has it all.
So even if you found a way to pay for data use (via crypto or not) all power will continue to flow to a dominant party.
The answer to almost any question beginning with "why don't they" (or why didn't they), is almost always "money".
Producing, aggregating, storing, or otherwise adding value to information costs money. Operating the internet costs money. Providing access to data costs money.
People are lazy. Businesses on the internet have learned that they can extract more money from this vast pool of lazy people by presenting information rather than just providing information. By this, I mean that the value-add and/or lock-in of many internet businesses is tied to how the information is presented; adopting a standard format would be effort that would not be financially rewarded.
(by "lazy", I mean "looking for local minima in effort to accomplish whatever task that they're trying to do")
Finally, the web envisioned itself as a hypermedia system that incorporated presentation (and subsequently active content) instead of just semantic content. Since presentation is a property of the web, it was quickly adopted for the reasons described above and evolved into the modern web (which replaced the blink tag with shit tons of javascript, don't get me started).
Therefore the "semantic web" could never exist because "semantics" is fundamentally incompatible with "web". Once you invent the web, you can't have the semantic web anymore because money.
When the phrase "The King is dead, long live the King" is used, the two kings are different people: the one that just passed and the one that replaced him. If the King is replaced by a Queen then the phrase is "The King is dead, long live the Queen". This is not some life-after-death thing. You aren't saying the King will live on in the hearts and minds of the people; you're stating your support for the successor.
That original use of the phrasal template is still valid, but today’s common use is when X goes through some sort of transition, is “reborn” in a new form, or when people realize that something which has generally been presumed dead is still around.
If you think of “King” as a role, this modern use is not that different than the original use. The capital-K role of King didn’t die, but continues in another form.
The entire movement felt like a massive tragedy of the commons. There is just no incentive for any single player to push the standard forward and the commercial players are already reaping enough benefits from Web 2.0 that putting more money in Semantic Web makes no sense.
Semantic Web was supposed to be the Web 3.0. It's so dead now that even its name is stolen by the blockchain. RIP.