Tech giants let the Web's metadata schemas and infrastructure languish (threadreaderapp.com)
301 points by timhigins on Aug 7, 2020 | 73 comments



Actual title: Google and other tech giants are happy to have control over the Web's metadata schemas, but they let its infrastructure languish

I know that hating on Google is fashionable, but that's a bit too much editorializing, especially considering the content of the post, with Google being just a small side note.

---

On-topic: I recently looked into using schema.org types as the basis for an information-capturing system, but many of the types are somewhat outdated, of questionable quality, or just missing. Development indeed seems slow, while changes that are needed by one of the larger involved companies get pushed through quickly.

I think a big part of that stagnation is a lack of interest though. The whole semantic web domain has been pretty much inactive.

It's a real shame: having canonical types for most things in existence, and having those actually supported as import/export formats or for cross-app integrations, would be immensely valuable! But there is absolutely no business incentive there - rather the opposite. Easy portability of data is not something most companies would want.


> But there is absolutely no business incentive there - rather the opposite. Easy portability of data is not something most companies would want.

That depends on what kind of data it is. For example, your home address is not part of your bank's primary business model, but keeping it up to date is important to the bank. If data portability in and out of the bank makes it more likely that you'll keep it up to date, that's useful for your bank as well.

Legislation and customer demand are also making it more and more palatable. If some data is not critical to your business model, but being its sole guardian is a legal/reputational liability, then handing control of that data over to someone else and re-using it is very useful.


That's interest on the side of the data consumer, not the data provider, for lack of better words.

If the bank were the one owning the information, it would not want it shared with others, as that would allow its clients to easily migrate to another bank, which it definitely does not want.

But as the one receiving the data, sure, it would be nice to have others share it with me, they'd say.

I'm afraid without legislation data sharing is never going to be a thing.


At this moment the bank is the entity that keeps this data. Their challenge, however, is that the data gets outdated. But if they give other parties the ability to access that data, then the consumer will have more motivation to keep it up to date, and the bank will now have access to more accurate address data.

(Note that the bank is an example - it could be another party.)


The solution for this would be for banks to use the government as single source of truth - in Germany we have the Melderegister anyway, it's mandatory to register your primary address.

Unfortunately it's not allowed by law that a consumer gives "push access" to e.g. banks, health insurance or employers.


somehow all this makes me think that well maintained metadata would be a real boon to the more untrustworthy elements of the web.


A fair concern - this really depends on legislation like GDPR that only allows for the data sharing with the consumer's explicit consent.


It's a textbook collective action problem. Everyone would gain from having a high-quality shared ontology, but nobody gains enough individually (it's a public good).

The typical solutions to collective action problems are (1) benefactors who subsidise production (either privately or through taxation), or (2) direct command and control. Google was apparently filling the role of benefactor.


I'm not even sure if everyone would gain from having a high-quality shared ontology, because as soon as you go beyond trivial examples, the details of the data model inevitably have competing incompatible needs which require some compromise.

I could certainly imagine that for many companies the disadvantages of using a model that's not simply a copy of their specific view of that problem domain are larger than the hypothetical benefits of interoperability, so even if such a shared ontology would exist, many would intentionally choose to use their own ontology instead of adapting to that standard.


At one point I worked at a company founded in 2005ish, so one of their core things was an ontology. We found that while some very generic things were reusable (person and address, say) almost everything that drove business logic was different, use case to use case.


Yeah, trying to standardize this seems like it would quickly turn into one of those "things people believe about time" rabbit-holes of edge cases and differences.

Even inside a single company I get scared now when someone says "if only we had just one standard way to handle this sort of thing"... if it's rarely simple for just one company, how would it work globally?


This is a really good point.

Without substantial benefits of a large universal ontology and without the ability to painlessly diverge from an ontology to not compromise on accurately modeling a particular domain, it’s hard to see the net benefits. Everyone will want to customize things for their domain, or point of view. An ontology should be easy to fork, like a repo.


I think even in those cases there are benefits.

I would certainly want to extend, modify and replace the data models for my core business as I see fit. But beyond those, there are still going to be a whole lot of models I need in order to run the company but want to keep low-maintenance. E.g. for hiring I might not have strong opinions on what a job post, a candidate or an application should look like, so I'd be happy sticking to the standard in those cases and benefiting from it being easier to mix and match tooling and pass around the data.

Also, I reckon that a partially customized ontology, which is inevitable, is still easier to map between orgs than one each of them builds completely from scratch.

Or maybe see it less as a standardized ontology and more as a standardized way to create ontologies.


this is where NIST or a similar agency could step in.


> I recently looked into using schema.org types as the basis for a information capturing system, but many of the types are somewhat outdated, of questionable quality or just missing.

It grew out of the semantic web community so this was roughly what I expected. That space just seems cursed to have these lofty ideals which are never realized because it’s hard to justify spending time on something which has no known consumer. Schema.org seemed poised to change that but they only use a couple of types and then only for a few types of searches.


Spent time with schema.org years ago. It's just not needed/useful for most scenarios, and the amount of work it takes to convince most groups to use it isn't worthwhile, so continuing to extend it isn't worth the effort.


Not needed nor useful for whom? It's obviously useful for a more robust and open web, for our collective society, so I'm not sure who the subject of your statement is.


Not to be curt, but it clearly isn't obviously useful; otherwise the project wouldn't be languishing such as it is. The notion of creating a single overarching conceptual map to regulate the representation of the varied manifold of human experience on the web is almost certainly a deeply misguided idea, and even if it's philosophically sound (a big if), it's not clear that schema is anything like the correct approach. I'm open to being convinced of its value if you'd like to elaborate, but I'll just say it's far from obvious.


Ah, I think we might be talking about different things. I think the larger promise of the semantic web is a categorically different thing from adding a bit of metadata to pages to know basic things like author, content type, description, etc.

It’s the latter I think is clearly valuable, in order for us to have competition for the likes of google and Facebook. It lowers the barrier for creating competing search engines, modern rss readers, and even things like distributed social networks.


Schema.org is designed to be useful to Bing and Google but not other entities. It is enough to help them compile better training sets to extract that kind of metadata without schema.org, but not enough to build a simple extractor that would be useful to a smaller software company.


Yes, despite its agnostic branding and a name indicating essentially maximal scope, schema.org has basically the features Google is interested in supporting for pulling things out of pages and emails.

To the extent that other uses can basically piggyback on data that sites added to target Google, it does provide some value, but I don't see it as really even attempting to be a generally useful "semantic web" or linked data vocabulary in the sense of interoperating with other things.


The Dutch startup I work at [1] is active in the semantic web technology space. It's not pretty much inactive. The industry is simply not in the foreground of things.

[1] triply.cc


We've edited the title to a different subset of the tweet. It's not always obvious how to condense those into 80 chars.

Submitted title was 'Google is happy to control core Web schemas, but they neglect project'


> Development indeed seems slow, while changes that are needed by one of the larger involved companies get pushed through quickly.

Whichever company did that would be accused of trying to "take over" the web.

Ideally large companies should be sponsoring open efforts to define things that affect how the web works rather than doing the work themselves. Smaller open teams that move fast to define structures that work for as many people as possible, even if they're not perfect for Google, Microsoft, etc, would be more useful to the internet industry as a whole.


i recently went through schema.org a bit while putting together a blog, and it was a long list (for a human to digest), but relative to all the objects in the world, tiny. google's vested interest and stamp on it was pretty evident.

i also went through microformats, which seems to be much smaller and more tightly focused on blogs and structuring data shared among federated sites.


deleted


This reminds me of the tragic situation where if you process XHTML locally using XML tools that incidentally fetch the DTD, then things block and become absolutely dirt slow, because the W3C sysadmins are permanently pissed off by that: https://stackoverflow.com/a/13865692/82
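If you're hitting this locally, a rough sketch of the workaround, assuming Python's lxml (the catalog path at the end is just an example):

    from lxml import etree

    # Keep libxml2 from phoning home to w3.org for the XHTML DTD:
    # don't load the DTD, don't resolve external entities, forbid network access.
    parser = etree.XMLParser(load_dtd=False, no_network=True,
                             resolve_entities=False)
    tree = etree.parse("page.xhtml", parser)

    # If the DTD is genuinely needed (e.g. for &nbsp;), point libxml2 at a
    # locally cached copy via an XML catalog instead of the network:
    #   export XML_CATALOG_FILES=/path/to/local/xhtml-catalog.xml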


This is great. It's a perfect solution too, because it's not super long, but it's long enough that nobody is going to go into production without figuring out how to cache it


To be fair - it seems that caching in general, when it comes to web resources, has been neglected collectively on the web. Two big factors include Web 2.0 and the rise of YouTube, which actively fought to have their content un-cacheable via normal software & existing mechanisms.


Just the other day I was looking for a full XHTML DTD to find the list of built-in entities and was surprised at how slow it was. This explains a lot!


It was once put to me that Google's promotion system creates this dynamic.

Starting a new project that garners widespread attention looks good in a promotion package, but replacing lightbulbs and scrubbing floors doesn't. Folks create a splash, get promoted, then move on and are not replaced.

I've never worked at Google, so I do not know if this dynamic is real. I would be interested to hear from Googlers about incentives to work or not work on something.


I can't speak for the entire org, but when I was in Display Ads, promotion went to people who moved the needle on certain metrics, e.g. processing latency, revenue, throughput, resource usage, etc. More often than not, impacts like that were accomplished by shipping new features (while deprecating legacy code) instead of fixing bugs. If you're going for promotion, having launched something that makes an impact (appropriate to the level you're going for) is a must. That said, it's a fairly common occurrence that once you get promoted, you move to a different team, leaving your old projects to your former teammates.


Former Googler, 16 hours ago, stated: https://news.ycombinator.com/item?id=24077692


He says incentives and recognition began to shift by 2010, which is good to hear.

But that leaves me wondering how the linked situation occurs. It's a cliché that Google shutters projects or loses interest. I don't know whether that's particular to google or if it's due to an availability heuristic (Google is well-known, so its wanderings-off are widely publicised).

But if there are such dynamics and incentives, they are worthy of attention. Google exerts enormous gravity on the fabric of the technology industry, so it would be helpful to avoid hurtful externalities arising from internal incentives.


My theory is that Google is notably worse at this because their core advertising business is so profitable. Most companies would have been forced to become good at managing projects by necessity but until ad revenues substantially decline Google can subsidize a ton of inefficiency and still report good numbers to Wall Street, just like Microsoft's various write-offs in the 90s and 2000s.


It certainly doesn't help. I refer to this as the "Mississippi of Money" problem. If a river is deep, wide and fast-flowing, you can do pretty much anything and still get somewhere.

The rest of us have to make do with leaky canoes, going up a certain creek, often sans a paddle.


When I left in 2015, these promotion incentives were also a problem.


Not sure a former Googler's estimation of what may have started to happen after he left is much of a signal.


Isn't schema.org supposed to be an "industry-wide" collaborative effort? In which case we must also remark on the disinterest shown by players like Microsoft, Apple or Google, or even Facebook, Twitter and so on, all of whom benefit from this semantic markup.


My opinion is that the disinterest from bigger players is because of the lack of traction/interest from the broader community. Maybe there's a chicken-and-egg problem that comes with bootstrapping any new standard.


IMO, the reason for the lack of interest from the broader community is that it's unnecessarily complicated, and somehow simultaneously incomplete to the point of being almost unusable for some projects.

It was clearly designed by bureaucrats who enjoy making rules and sub-rules and sub-sub-rules. It doesn't matter if it works, or is useful, as long as there are plenty of rules.

There's a reason nobody wants to play with the jerk dungeonmaster.


That's fair. I can see that being the issue. Speed and ease of use don't go hand in hand with having a well-defined ontology and processes for updating it.


I think Facebook only cares about OpenGraph


Google is dropping the ball here, as they stand to benefit the most from a single central ontology for the web. It does illustrate that this approach doesn't work if you are looking to innovate quickly and not be dependent on the goodwill of a single institution that doesn't even know who you are.

Maybe we can finally stop using ontologies for the semantic web and start solving the hard problem of language pragmatics.


A single ontology would level the search/structured-web playing field. Right now Google has a huge advantage because they leverage their ML/NN knowledge and funding to "extract" structured information from the unstructured web. New players just can't do that without a lot of time & funding, which wouldn't be the case if a lot of the data was in a structured and agreed-upon format/schema.


ML/NN methods and knowledge only become more effective and powerful if accurate input data is available. With structured data you can not just "extract" information, but also perform complex queries, inference, etc. over it.


I'd love to see schema.org updated and used more. As someone still doing linked data work, albeit in academia, I mainly use it simply to provide more context to self-created, domain-specific properties within ontologies, using things like rdfs:seeAlso, skos:related, etc.

Ideally it'd be nice (imo) if schema.org had more domain-specific extensions, similar to the bib[0] one, which allows for things like comic book properties to be described.

[0] https://schema.org/docs/bib.home.html
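A rough sketch of what that kind of linking looks like in Turtle (the ex: namespace and property are made up; the schema.org terms are from the bib extension above):

    @prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
    @prefix schema: <https://schema.org/> .
    @prefix ex:     <https://example.org/ontology#> .

    # A self-created, domain-specific property, pointed at schema.org for context
    ex:comicIssueNumber a rdf:Property ;
        rdfs:label   "comic issue number" ;
        rdfs:seeAlso schema:issueNumber ;
        skos:related schema:ComicIssue .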


I don't understand. This is a tool for Google to extract information from websites and keep potential visitors on Google instead. Every use case for and future progress on it will be measured on that metric.

They don't care to address any of the issues or "fix the infrastructure" because this isn't an "organize all the information in the world!" project at all. The guys that take Google visitor retention stats into their next performance meeting are probably poking fun at all the ontology nerds that have descended on their metric-driven scheme.


From the official website:

> Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.

> A shared vocabulary makes it easier for webmasters and developers to decide on a schema and get the maximum benefit for their efforts. It is in this spirit that the founders, together with the larger community have come together - to provide a shared collection of schemas.

If this isn't an "organize all the information in the world" project, then Google and the other companies involved are branding it in a horribly dishonest way. In which case, they should be criticized for presenting a company-specific visitor retention strategy like it's some kind of altruistic gift to the world.

Sites like Facebook and Twitter have their own 'lite' metadata schemas that they use to help identify and render links. Hardly anyone criticizes them over it, because they haven't registered a generic domain like 'schema.org' and presented their work like it's some kind of community-driven collaboration. They're upfront that it's just a simple API for their website.
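For reference, those 'lite' schemas amount to a handful of meta tags; something like this (placeholder values) is all Facebook and Twitter actually read when rendering a link:

    <!-- Open Graph (Facebook) -->
    <meta property="og:title" content="Example article">
    <meta property="og:type" content="article">
    <meta property="og:url" content="https://example.com/article">
    <meta property="og:image" content="https://example.com/cover.png">
    <!-- Twitter Cards -->
    <meta name="twitter:card" content="summary_large_image">
    <meta name="twitter:title" content="Example article">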


The Schema.org vision certainly is not dead within Google. See the Google-backed DataCommons project at http://datacommons.org/ which heavily relies on the schemas defined by schema.org. Headed by the creator of schema.org.


My solution was to reverse engineer highly ranked web pages. I used a subset of the schema that seemed to be universal to those pages. Schema.org just gave me the proper file formats.


Extrapolating, is this possibly a “JavaScript: The Good Parts” sort of problem?

Care to elaborate further as to which of the schema you kept/found most useful?


care to share it?


I think their choice of JSON-LD as the recommended format, and not being transparent in how it affects results, is the biggest issue. JSON-LD requires duplication of content, whereas microdata is inline with existing content.
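A trimmed-down illustration of the difference (placeholder values): microdata rides along on the markup you already have, while JSON-LD restates the same facts in a separate block:

    <!-- Microdata: annotations inline on the visible content -->
    <article itemscope itemtype="https://schema.org/Article">
      <h1 itemprop="headline">Example headline</h1>
      <span itemprop="author">Jane Doe</span>
    </article>

    <!-- JSON-LD: the same facts duplicated in their own block -->
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "Example headline",
      "author": "Jane Doe"
    }
    </script>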


Microdata used to be the recommended format. Google eventually switched recommendations to JSON-LD, and I updated my tools to reflect that.

JSON-LD is far easier to work with. Having to mix metadata with markup was a pain. Half the required elements would have to be hidden in CSS because they didn't make sense in context.


JSON-LD is also much easier to generate in scenarios where you can access post metadata and output scripts but can't necessarily filter HTML markup output (WordPress, corporate CMS, etc.).

And you can generate it dynamically: https://developers.google.com/search/docs/guides/generate-st...


A lot of websites are now generated using static site generators, and it is much easier to do so inline than to have to duplicate the content, which also makes the pages much bigger. Like I said, the issue is more about the lack of transparency about how it may affect ranking.


Why not use XML-RDF? ::troll face::


Actual thread for anyone wanting to look at the images which Thread Reader stripped:

https://twitter.com/alkreidler/status/1291509746000855040


The images show up just fine for me? If you're like me and don't like Twitter, use nitter instead; it works just fine without JavaScript (the entire thread is there on one page).


Their lazy loader appeared to fail - the first image rendered but none of the rest did.


So fork? It's not the big G's responsibility to solve all the internet's problems, and honestly most other web metadata standards have failed; the only difference is that this one has a big name attached we can all blame.


There is one question I always have about the semantic web schemes: what if it finally catches on and the end sites just immediately start lying their ass off for selfish purposes? Like many of the earlier search engine optimizations to try to land common hits on a massive page that doesn't actually provide what you are looking for.

The only way around that is for somebody to do the processing of the real data to validate that it isn't just bullshit for a nefarious purpose. From what I've heard, the Semantic Web seems a bit skeuomorphic as a concept.


The nice thing about the newer semantic web standards is that they're a lot more detailed than the "description" and "keywords" standards of old. It's more obvious if a website is providing misleading information for self-serving purposes.


All the information is already out there. Ontology is a crutch.


@dang there's a typo in the name: should say "infrastructure", not "infrastrucure"

It's missing the T, as-in: infrastrucTure


How do you avoid the Bike-Shedding problem?

Would forcing the proposer to quantify costs and benefits help?


How does the IETF manage it? It operates by seeking broad consensus. That might work. Also, getting a good compromise done should be in the interest of everyone. Plus there could be iterations every few years. (Like there are with JS/ECMAScript via TC39.)


It's "langushing" and they should do it for us? It's flourishing and they're doing it for us and they have lots of open issues and I want more for free without any work.

Wow! Nobody else does anything to collaboratively, inclusively develop schema and the problem is that search engines aren't just doing it for us?

1) Search engines do not owe us anything. They are not obligated to dominate us or the schema that we may voluntarily decide to include on our pages.

We've paid them nothing. They have no contract for service or agreement with us which compels them to please us or contribute greater resources to an open standard that hundreds of people are contributing to.

2) You people don't know anything about linked data and structured data.

Here's a list of schema: https://lov.linkeddata.es/dataset/lov/ .

Here's the Linked Open Data Cloud: https://lod-cloud.net/

Does your or this publisher's domain include any linked data?

Does this article include any linked data?

Do data quality issues pervade promising, comparatively-expensive, redundant approaches to natural-language comprehension, reasoning, and summarization?

Here, in contributing this example PR adding RDFa to the codeforantarctica web page, I probably made a mistake. https://github.com/CodeForAntarctica/codeforantarctica.githu... . Can you spot the mistake?

There should have been review.

https://schema.org/ClaimReview, W3C Verifiable Claims / Credentials, ld-signatures, and lds-merkleproof2017.

Which brings us to reification, truth values, property graphs, and the new RDF* and SPARQL* and JSON-LD* (which don't yet have repos with ongoing issues to tend to).

3) Get to work. This article does nothing to teach people how to contribute to slow, collaborative schema standards work.

Here's the link to the GitHub Issues so that you can contribute to schema.org: https://github.com/schemaorg/schemaorg

...

"Standards should be better and they should pay for it"

Who are the major contributors to the (W3C) open standard in question?

Is telling them to put up more money or step down going to result in getting what we want? Why or why not?

Who would merge PRs and close issues?

Have you misunderstood the scope of the project? What do the editors of the schema feel with regard to more specific domain vocabularies? Is it feasible or even advisable to attempt to out-schema domain experts who know how to develop and revise an ontology, or even just a vocabulary, with Protégé?

To give you a sense of how much work goes into creating a few classes and properties defined with RDFS in RDFa in HTML: here's the https://schema.org/Course , https://schema.org/CourseInstance , and https://schema.org/EducationEvent issue: https://github.com/schemaorg/schemaorg/issues/195

Can you find the link to the Use Cases wiki (which was the real work)? What strategy did you use to find it?

...

"Well, Google just does what's good for Google."

Are you arguing that Google.org should make charitable contributions to this project? Is that an advisable or effective way to influence a W3C open standard (where conflicts of interest by people just donating time are disclosed)?

Anyone can use something like extruct or OSDS to extract RDFa, Microdata, and/or JSON-LD from a page.
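For example, a minimal sketch using the Python extruct library (the URL is a placeholder):

    import extruct
    import requests

    # Fetch a page and pull out whatever RDFa, Microdata and JSON-LD it embeds.
    url = "https://example.com/some-page"
    html = requests.get(url).text
    data = extruct.extract(html, base_url=url,
                           syntaxes=["json-ld", "microdata", "rdfa"])
    print(data["json-ld"])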

Everyone can include structured data and linked data in their pages.

There are surveys quantifying how many people have included which types in their pages. Some of that data is included on schema.org types pages.

...

Some written interview questions:

> Which issues have you contributed to? Which issues have you seen all the way to closed? Have you contributed a pull request to the project? Have you published linked data? What is the URL to the docs which explain how to contribute resources? How would you improve them?

https://twitter.com/westurner/status/1291903926007209984

...

After all that's happened here, I think Dan (who built FOAF, which all profitable companies could use instead of https://schema.org/Person ) deserves a week off to add more linked data to the internet now please.


I think that might be fair, but when schema.org came out it pitched itself as "trust us, we will take care of things", so yeah, they don't owe us anything, but track record matters for trust in future ventures by these search engine orgs.


schemaorg/schemaorg/CONTRIBUTING.md https://github.com/schemaorg/schemaorg/blob/main/CONTRIBUTIN... explains how you and your organization can contribute resources to the Schema.org W3C project.

If you or your organization can justify contributing one or more people at full or part time due to ROI or goodwill, by all means start sending Pull Requests and/or commenting on Issues.

"Give us more for free or step down". Wow. What PRs have you contributed to justify such demands?

https://schema.org/docs/documents.html links to the releases.


What's in it for the tech giants? Google is merely interested in peddling its ads to its Chrome users. Use Brave instead.


Google neglect a project? No...I don't believe it.

Isn't this SOP for Google?




