Hacker News
Stephen Wolfram on a .data TLD (stephenwolfram.com)
81 points by LisaG on Jan 10, 2012 | 35 comments



Wolfram's proposal sounds completely backward to me. You'd have to consider: does google.data apply to google.org or google.com? Should we have google.org.data and google.com.data?

I think the right way is to put things under the domain: data.google.com, or google.com/data, or even a META tag on a web page that tells the browser the URL for the data relevant to that page.


They tried that with ".mobi", while the rest of the world went with "m.".


> Wolfram's proposal sounds completely backward to me.

Yep. Just call it a New Kind of Internet.

Edit: Relevant: http://shell.cas.usf.edu/~wclark/ANKOS_humor.html

Edit2: Amazon removed the review! Search for "A new kind of review" on that page.


> Wolfram's proposal sounds completely backward to me. You'd have to consider: does google.data apply to google.org or google.com? Should we have google.org.data and google.com.data?

As for that notion, maybe we should switch to naming things the way we do for Java packages :D

Google.com would be com.google.search, com.google.mail, etc :P


It has crossed TBL's mind:

"I have to say that now I regret that the syntax is so clumsy. I would like http://www.example.com/foo/bar/baz to be just written http:com/example/foo/bar/baz where the client would figure out that www.example.com existed and was the server to contact. But it is too late now."

http://www.w3.org/People/Berners-Lee/FAQ.html#etc


"If a human went to wolfram.data, there’d be a structured summary of what data the organization behind it wanted to expose. And if a computational system went there, it’d find just what it needs to ingest the data, and begin computing with it."

This sounds to me like a high-level description of how the web is supposed to work today, only implemented using a new TLD instead of HTTP headers.
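
To make the "HTTP headers instead of a new TLD" point concrete, here is a minimal sketch of content negotiation (Python standard library; the URL and the served format are purely illustrative, not any real endpoint):

    from urllib.request import Request, urlopen

    # Ask the same URL for a machine-readable representation instead of HTML.
    req = Request(
        "http://example.com/population",            # hypothetical resource
        headers={"Accept": "application/rdf+xml"},  # "send me data, not a page"
    )
    with urlopen(req) as resp:
        print(resp.headers.get("Content-Type"))     # whatever the server chose to send
        body = resp.read()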

It sounds odd to me, coming from someone whose major web service sends all results -- even text and tables -- as GIF.


Good point re Accept: headers, though I think discoverable formats (such as <link rel="alternate" type="application/rdf+xml" href="data.rdf" />) are even better.
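
A minimal sketch of consuming such a discoverable format (Python standard library; the page URL is hypothetical and the markup is the example from the comment above):

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class AlternateFinder(HTMLParser):
        # Collects (media type, href) pairs from <link rel="alternate"> tags.
        def __init__(self):
            super().__init__()
            self.alternates = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "link" and a.get("rel") == "alternate":
                self.alternates.append((a.get("type"), a.get("href")))

    page = '<link rel="alternate" type="application/rdf+xml" href="data.rdf" />'
    finder = AlternateFinder()
    finder.feed(page)
    base = "http://example.com/page"  # hypothetical page the markup came from
    print([(t, urljoin(base, h)) for t, h in finder.alternates])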


The problem with this idea is that .data will encourage "data servers" but not create a "data web".

Allow me to explain.

RDF[1] was created to solve the "data web" problem. However, the challenge has been representing and modeling "things" such that we can cross-link "data" on the "web". The language for creating such shared representations (the Web Ontology Language[2]) is difficult to use and standardize. Nevertheless, this approach has been hugely successful in knowledge-intensive domains such as biology and health care.

On the Wild Wild Web, microformats[3] have gotten wide support from search engines and web publishers.

1. http://www.w3.org/RDF/

2. http://www.w3.org/TR/owl-features/

3. http://support.google.com/webmasters/bin/answer.py?hl=en&...
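
To illustrate the cross-linking idea (not anything from the article), here is a minimal sketch using the rdflib Python library and SPARQL; the ex: URIs and the figure are made up, and the DBpedia URI is just an example of pointing at someone else's identifier:

    from rdflib import Graph

    # Two facts about one subject; the second is a link to an external identifier.
    turtle = """
    @prefix ex: <http://example.org/> .
    ex:london ex:population 8173941 ;
              ex:sameTopicAs <http://dbpedia.org/resource/London> .
    """
    g = Graph()
    g.parse(data=turtle, format="turtle")

    # The cross-link is just another triple, so a query can follow it.
    query = """
    SELECT ?pop ?link WHERE {
      ?city <http://example.org/population> ?pop ;
            <http://example.org/sameTopicAs> ?link .
    }
    """
    for pop, link in g.query(query):
        print(pop, link)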


Don't forget the Linking Open Data initiative (http://linkeddata.org/); they've been building a huge distributed data set.


Are we going to transfer hypertext? No? Then use data://google.com and define a protocol to GET, PUT, POST, and DELETE data over the wire using standard data formats (how about INSERT, SELECT, UPDATE, and DELETE?).

The index page would give you all the discoverability, and from there you could go to google.com/employees or bestbuy.com/products, etc., showing whatever data is public (or private, given OAuth mechanisms) and what can be created, modified, and deleted according to roles and security levels.

This has been tried before but the well was poisoned when they dropped SOAP in it.
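
A hedged sketch of that idea (nothing here is standardized; the dataset names and paths are invented for illustration):

    # Map the HTTP verbs onto data operations...
    VERB_TO_OPERATION = {
        "GET": "SELECT",     # read records
        "POST": "INSERT",    # create records
        "PUT": "UPDATE",     # replace or modify records
        "DELETE": "DELETE",  # remove records
    }

    # ...and let an index document at the root describe what a host exposes.
    index = {
        "datasets": [
            {"path": "/employees", "formats": ["json", "csv"], "auth": "oauth"},
            {"path": "/products",  "formats": ["json"],        "auth": None},
        ]
    }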


> Are we going to transfer hypertext?

I certainly hope so; linked data (of which RDF is the main implementation nowadays) is much more useful than having disconnected silos. Of course, we won't transfer HTML, but that's just one implementation of hypertext.

Besides, even if we weren't, why would we replace HTTP by something that accomplishes the same? Doesn't make much sense to me.


TLDs should help to identify the type of organization that has control of the domain, not some arbitrary thing about the (hypothetical) website. Why are people making this so difficult?


So is .data just another way of pointing to an API? If Hacker News had an API, would you call up news.ycombinator.data all of the time? It would be awesome if that were the case, but even better if people with .data domains came to a consensus on how to document their data, e.g. www.domain.data/docs, with a common layout among websites to make it easier for programmers and scholars alike to figure out how to access the information they need.


[deleted]


Maybe I misread the article then. If data.google.com or data.google.org makes more sense than google.data (which I agree it does), then what would the application of .data be? Would google.data be an all-in-one resource for all Google API calls?


I was trying to think of a situation where the various TLDs are owned by different organizations, like woot.com and woot.net or back when whitehouse.com was a porn site.


An API is a more general term for a programming interface whose function is not necessarily obtaining data (it could also be updating or deleting data). This .data TLD, as I understand it, would be just for obtaining data from a website in a structured way. For example, if Google had a .data domain, it would work something like this: you enter google.data?search=lady+gaga and it returns a page in JSON or some other format with the results of that Google search.
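
As a purely hypothetical illustration of that example (no .data TLD exists, so the hostname will not resolve):

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    url = "http://google.data/?" + urlencode({"search": "lady gaga"})
    try:
        with urlopen(url) as resp:     # fails today: the name cannot be resolved
            results = json.load(resp)  # the point: structured results, not an HTML page
    except OSError as err:
        print("no .data yet:", err)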


Yes. Also, almost all APIs are designed around small bits of user or query data, just like your example.

This seems more intended for bulk data, which is likely going to be some pregenerated chunk in the MB, GB, TB range and so less suited to the JSON API call paradigm, and more likely to involve a simple lookup to disk rather than being computed on the fly from some database.


Isn't this what the semantic web strove for? I don't know if a new top-level domain would create enough momentum for it.


> I think introducing a new .data top-level domain would give much more prominence to the creation of the data web—and would provide the kind of momentum that’d be needed to get good, widespread, standards for the various kinds of data.

I'd say that's a pretty good reason for using the new TLD, technicalities aside.


"And my concept for the .data domain is to provide a uniform mechanism—accessible to any organization, of any size—for exposing the underlying data."

Who would be the standards body for defining and regulating such a uniform mechanism?


And why would the people with valuable data, e.g. the FT, bankers, or LexisNexis, do this?


They could put a paywall in front of it.


Why did OData never catch on?


Microsoft made a pretty big push for OData with their WCF Data Services. I feel like there's a pretty decent community around it too.

http://msdn.microsoft.com/en-us/data/bb931106
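
For anyone unfamiliar with it, here is roughly what an OData query looks like; the $filter/$select/$top options are standard OData URI conventions, but the service host below is made up:

    # Hypothetical OData service; only the query options are real conventions.
    url = ("http://example.com/odata/Products"
           "?$filter=Price%20gt%2010"  # server-side filtering
           "&$select=Name,Price"       # return only these properties
           "&$top=5")                  # first five matching entities
    print(url)  # a GET here would return an Atom or JSON feed of entities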


This is less a technical discussion than speculation on human factors. Will a special TLD inspire people to offer their services differently?

It’s just a namespace, one of many possible choices. But I wouldn’t discount its importance as a protocol, or an expectation. “.com” has a very important non-technical meaning.


Bringing everyone’s data as close to “computable” as possible is an all-round win, so I hope this takes off.

A big problem is how to ETL these datasets between organizations, and I think Hadoop is a key technology there. It provides the integration point for both slurping the data out of internal databases, and transforming it into consumable form. It also allows for bringing the computations to the data, which is the only practical thing to do with truly big data.

Currently there are no solutions for transferring data between different organizations’ hadoop installations. So some publishing technology that would connect hadoop’s HDFS to the .data domain would be a powerful way for forward-thinking organizations to participate.

Another path towards making things easier is to focus on the cloud aspect. Transferring terabytes of data is non-trivial. But if the data is published to a cloud provider, others can access it without having to create their own copy, and it can be computed upon within the high-speed internal network of the provider. Again, bringing the computation to the data.


I read your comment several times, but I still don't understand why you think Hadoop is the key technology for data interchange between organizations. I don't mean to be harsh, but your comment is a bit like buzzword soup (hadoop, etl, cloud, bring the computation to the data).

> [Hadoop] provides the integration point for both slurping the data out of internal databases, and transforming it into consumable form

Hadoop does no such thing. It doesn't "slurp data out of internal databases". It's just a DFS coupled with a MapReduce implementation. Perhaps you're thinking of Hive?

> Currently there are no solutions for transferring data between different organizations’ hadoop installations.

Not all data is "big data". By being myopically Hadoop-focused, you're ignoring the real problem, which is data interchange. XML was supposed to be the gold standard; it's debatable how far it has achieved its initial goal.

> So some publishing technology that would connect hadoop’s HDFS to the .data domain

So basically, forsake all internal business logic, access control, and just pipe your database to the net? When you have a hammer...

> Transferring terabytes of data is non-trivial. But if the data is published to a cloud provider, others can access it without having to create their own copy, and it can be computed upon within the high-speed internal network of the provider

See AWS public datasets for exactly this, but it's still a long shot. It also ignores the problem of data freshness (i.e., once a provider uploads a dataset, they also need to keep updating it). http://aws.amazon.com/publicdatasets/


Let me unpack it for you then.

There is a reason XML, the semantic web, and linked data failed to really change the data world, whereas Hadoop did. The reason is computation.

The problem isn't data interchange formats and ideal representations; the problem is being able to compute with data. Distributed computation can then be used to solve all the other problems.

Case in point: Slurping data out of databases. Apache Sqoop leverages the primitives provided by Hadoop, in terms of partitioning and fault tolerance, to make it easier to do massive data transfers out of existing databases.

Another example of a solution coming from the Hadoop perspective: Avro. It beats the pants off of XML as a data interchange format, precisely because it makes computing with the data (which is the ultimate point) easier.
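
As a small, hedged illustration of that point: an Avro schema is itself just data (JSON), and records are plain structures that match it. The field names below are invented for the example:

    # A record schema in Avro's JSON schema format.
    population_schema = {
        "type": "record",
        "name": "CityPopulation",
        "fields": [
            {"name": "city", "type": "string"},
            {"name": "year", "type": "int"},
            {"name": "population", "type": "long"},
        ],
    }

    # A record that conforms to it; libraries such as the official avro package
    # or fastavro take a schema plus records like this and handle the compact
    # binary encoding and schema evolution.
    record = {"city": "London", "year": 2011, "population": 8173941}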

Now, there is a reason I called Hadoop the integration point. It is becoming a general-purpose computation system, which at the same time is also the data warehouse for organizational data. So rather than dealing with the details of proprietary commercial systems, programmers can target applications to the open-source Hadoop ecosystem, and have those solutions be reusable and customizable on a large scale.

The "publishing solution" would of course deal with access control, business logic, freshness, etc. That is exactly what I'm advocating be built.

Individual pieces of data may not be big data, but the aggregate problem still is. In fact this is exactly the Wolfram Alpha case: tons and tons of little datasets that add up to a lot of headache.


Actually, Hadoop is seeing adoption because it can be used within a company's data silo. The semantic web and linked data address heterogeneous data from a technical perspective, but the data still needs to be shared between diverse actors, and that isn't something commercial entities are in the habit of doing.


I think this is unfair to linked data. Linked data could be hidden behind layers upon layers of distributed SPARQL queries, much like how the human-readable web works today, with each entity playing its part; but with Hadoop you have to have something like 15 different ports opened up between each box in a setup before you can even begin.


Hadoop is an ops and usability disaster. Yet companies large and small are adopting it because it does "something people want".

RDF and ontologies are just more data. Without computation, that data is not useful, and all the things one "could do" with it will not come to pass without a credible computational platform that people actually want to use.

So IMHO I would like to see that community focus less on standards and ontologies and RDF-as-panacea, and more on the infrastructure needed to put the data to work.


Thanks for clarifying; I find this much more insightful.


CouchDB might offer some interesting stuff here, but it's not optimized for structured data.


How do you embed Flash ads in structured data?

Are people willing to make micropayments for access?


Wolfram.com/data, data.wolfram.com etc.



