1). Don't use hard coded values for types
>GET /apiv2/entries?cat=6 -- List entries that are datasets
>GET /apiv2/entries?cat=5 -- List entries that are papers
These could be written as:
/apiv2/entries/datasets
/apiv2/entries/papers
2). You may not need path elements like entries, entry, collection, and collection name. For example, further simplification would leave:
/datasets
/papers
3). Don't use capitals like this, switch to lowercase:
/apiv2/entry/INFOHASH
4). Use HTTP verbs in a standard, semantic way. For example, this:
POST /apiv2/collection -- create a collection
POST /apiv2/collection/collection-name/update
POST /apiv2/collection/collection-name/delete
POST /apiv2/collection/collection-name/add
POST /apiv2/collection/collection-name/remove
This could all be collapsed into one form and would in turn be more familiar to developers.
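A rough sketch of that collapsed form (the exact paths are illustrative, not the actual API):
POST   /apiv2/collections                                   -- create a collection
PUT    /apiv2/collections/collection-name                   -- update a collection
DELETE /apiv2/collections/collection-name                   -- delete a collection
PUT    /apiv2/collections/collection-name/entries/infohash  -- add an entry
DELETE /apiv2/collections/collection-name/entries/infohash  -- remove an entry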
I strongly agree with these suggestions, if the creator(s) happen to read this.
And while these suggestions may seem like nitpicking, they are not.
While not everything around REST APIs is entirely standard, there's a great deal of agreement about proper resource naming and use of HTTP verbs. If you follow these widely used conventions, not only will developers find it much easier to interact with your API, but there is also a lot of tooling and client libraries built specifically to work with APIs that use HTTP verbs and URIs in the standard ways.
Hi. I found this site late last night while looking for material for a hobby project. Really, really awesome site.
I also found the creators here [1] and here [2]...maybe you can help shoot them an e-mail so that they can acknowledge a lot of the praise on this thread and the flood of helpful suggestions. Additional contributors can also be found here [3] as well as related publications/presentations [4] including a massively interesting Reddit discussion [5].
Question about your first point - doesn't it depend on the use case?
What would happen if I wanted to get a list of datasets and papers? (Maybe in this case it's nonsensical, but it's a problem with some other APIs I've used and I've never figured out a good way to work around it.)
GET /apiv2/entries?cat=5&cat=6 vs 2 separate requests and client logic to combine results?
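One compromise I could imagine, purely as a sketch (paths are hypothetical), is keeping the path form for the single-type case and a filter on the collection for mixed queries:
GET /apiv2/datasets
GET /apiv2/papers
GET /apiv2/entries?type=datasets,papers
but I'm not sure that's actually cleaner than the cat= version.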
I wouldn't get too dogmatic about REST. I've been working with it for years, and it seems that every time somebody makes a semantic HTTP API, they run into a URI length limit at some point.
Every semantic API ends up with an overriding parameter like _action=PUT because well intentioned API developers don't know that broken middleboxes exist on the internet until their beautiful creation collides with reality.
Most REST API frameworks that I'm aware of (e.g., Django REST framework) easily allow you to accept either the semantic HTTP actions, or an action parameter in the query string. Some also let you fall back to the `X-HTTP-Method-Override` header. I think these are even on by default in many frameworks. So fallback is pretty easy if you're using one of those, and even if you're not, it's still just an extra line or two in your routing, plus a little extra error handling in your client for error code 405.
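Roughly, the fallback amounts to something like this (a minimal sketch of the idea, not any particular framework's actual code): tunnel the real verb through a POST, via either the X-HTTP-Method-Override header or an _action query parameter.
# Minimal WSGI middleware sketch of the method-override fallback.
# Header and parameter names follow the convention mentioned above;
# nothing here is tied to a specific framework.
from urllib.parse import parse_qs

ALLOWED_OVERRIDES = {"PUT", "PATCH", "DELETE"}

class MethodOverrideMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") == "POST":
            # Prefer the header, fall back to an _action query parameter.
            override = environ.get("HTTP_X_HTTP_METHOD_OVERRIDE", "").upper()
            if not override:
                qs = parse_qs(environ.get("QUERY_STRING", ""))
                override = qs.get("_action", [""])[0].upper()
            if override in ALLOWED_OVERRIDES:
                environ["REQUEST_METHOD"] = override
        return self.app(environ, start_response)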
I do agree that you should have a fallback, but for reference, I've been consuming semantically-defined REST APIs for 4-5 years now, and I've never run into this problem. I agree with you that it can happen because of firewalls, etc, but it must be incredibly rare. Maybe in some corporate environments, for the usual BS reasons.
Also, if you use SSL/TLS, you won't run into this problem unless you're being MITM'd, like in some corporate environments (because no one in between server and client can tell what HTTP method is used). Use of SSL/TLS is probably why I haven't run into this in recent years.
I'm pretty glad that torrents have started to break out of the "it's only for warez" stereotype. It's a useful technology, regardless of what made it popular.
Even now I do the same. In India, direct downloads are slightly slower; I guess it has to do with the mirror location or something else. Torrents make it much faster.
Same in China. The Chinese firewall throttles TCP connections, making direct downloads very slow. It's not uncommon to have an unbearably slow connection while browsing web pages, and then when starting a torrent download, you immediately get 2 megabytes per second download speeds here.
>My name is Eriberto and I am not a C developer. I imported Axel from its old repository[1] to GitHub (the original homepage and developers are inactive)
Uh oh. Thanks to the repo owner for updating the README, but that's not a good situation.
Oh, I know that programmers and hackers and whatnot knew torrents were cool; I'm talking about non-technical people.
For example, until fairly recently, if I mentioned a "torrent" to my non-technical mom, she would assume I meant ThePirateBay or something like that. Nowadays, she knows it as just another means to download files.
That's a fair point; I guess what I was trying to get at is that when I worked for NYU, I think it would have been an incredibly tough sell to use torrents in any capacity, because of the stigma of piracy. However, I think if I were to pitch it now, there would be serious consideration.
A huge number of games used torrents for patching since around that time, notably in the MMO scene. WoW and every Nexon game come to mind. AFAIK the Battle.net launcher still downloads updates via torrent.
The Blizzard updater is actually a very cool download utility, worth hacking/poking at.
AFAIK it pioneered the concept of "web seeds", using HTTP GETs with a Range: header to download specific chunks from a CDN that were not healthy/available in the swarm.
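The mechanism itself is plain HTTP; roughly like this (the URL and byte offsets are made up for illustration):
# Fetch one piece of a file from a web seed with an HTTP range request.
import urllib.request

req = urllib.request.Request(
    "https://cdn.example.com/patch.tar",       # hypothetical web seed URL
    headers={"Range": "bytes=262144-524287"},  # one 256 KiB piece
)
with urllib.request.urlopen(req) as resp:
    piece = resp.read()
    # A server that honours ranges answers 206 Partial Content.
    print(resp.status, len(piece))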
This could take off if only a big player like Ubuntu pushed it. I don't see why we depend on a set of centralized servers for a bunch of files that a huge number of people download on a very regular basis.
And yet, the idea has been stagnant for years.
Edit: By which I mean, it works, but not enough people use it.
It's all listed in your link, but the biggest reason is that BitTorrent was inherently built to distribute static content, and the package list is ever-changing. That means upstream servers would be constantly calculating new torrent files and distributing them. Moreover, BitTorrent makes it really easy for an observer to know what packages you're installing, and thus what version you have.
See this alternative that uses a similar idea but not real BitTorrent; they worked around the first problem: http://www.camrdale.org/apt-p2p/
I still don't see why Ubuntu/Debian/et al don't take this (or something like it) up in a more official manner. I can see why it's not a default of course, but it could be made a question during installation for example.
An additional benefit would be that you'd be able to source packages from machines on your local network, with fallback to the internet, and it would all be pretty much automatic and configuration-free.
apt-p2p and debtorrent are entirely dead. The person who created them seems to be MIA from Debian. The bootstrap nodes are dead. Both packages are orphaned. apt-p2p will not be in the next release of Debian and debtorrent was removed from the last release of Debian.
I'm pretty sure the security issue (everybody knows you have old packages) is a good point against this system, but it's true that I'd love to see it more widespread.
For the local network part at least, it's really not that complicated to implement: all you have to do is listen for announces on the network and ask those peers before asking remotely. There is a standard example for Arch Linux in pacserve (http://xyne.archlinux.ca/projects/pacserve/) along with my own very crude reimplementation (https://github.com/rakoo/paclan)
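Roughly the shape of the idea (a toy sketch, not pacserve's actual wire format): broadcast a query for the package on the LAN, collect replies for a moment, and only fall back to the remote mirror if nobody answers.
# Toy LAN peer discovery for a package cache: send a UDP broadcast asking
# who has a package, gather the peers that reply within a short timeout.
# The port and message format are made up for this sketch.
import json
import socket

ANNOUNCE_PORT = 9999

def find_local_peers(package, timeout=1.0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.settimeout(timeout)
    sock.sendto(json.dumps({"want": package}).encode(),
                ("255.255.255.255", ANNOUNCE_PORT))
    peers = []
    try:
        while True:
            data, addr = sock.recvfrom(4096)
            if json.loads(data).get("have") == package:
                peers.append(addr[0])
    except socket.timeout:
        pass
    return peers  # empty list means: go to the remote mirror as usual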
I totally agree; what really bothers me about current OS's is that they all depend on something centralized, so if, for example, Canonical went broke, I wouldn't be able to install new packages on my Ubuntu laptop.
I'm aware that you can swap out the PPAs as needed, but I would really like something distributed and decentralized.
Agreed...although it is rather ironic that "CrackStation's Password Cracking Dictionary"[1] tops their list of most downloaded by a respectable 9% over 2nd most downloaded "Arizona State University Twitter Data Set"[2].
Are you serious? This is the original purpose of BitTorrent. It was not invented as a "warez technology", but as a way for service providers to save on bandwidth costs when serving huge files. Back when the BitTorrent spec was written, hosting a collection of large datasets like this meant paying thousands of bucks in hosting expenses, making it prohibitively expensive for most independent developers.
Torrents and mesh networks were breaking out to be the future of the internet during the mid and late 90s (also why Skype was P2P and why it was successful). I'm not sure what happened to that, though.
How so? It's meant to be used by researchers distributing the datasets for their own papers. (Notwithstanding the fact that people seem to already be uploading textbooks/lectures, which seems against the stated intent).
If only WebTorrent (https://github.com/feross/webtorrent) worked with the standard bittorrent protocol instead of a custom one on top of WebRTC, we would have live access to all the papers and "displayable" data directly, instead of firing up a torrent downloader just for a small file.
I don't think you understand how WebTorrent works. WebTorrent in fact works with the regular BitTorrent network if you run it from node, and falls back to use WebRTC when used in the browser.
So you can seed those torrents directly in the browser with something like instant.io.
I do know how WebTorrent works. The problem is that it creates a completely parallel network of nodes, which on the surface happen to exchange the same framing of messages, so:
- When WebTorrent runs on the standard bittorrent network from node, that doesn't change anything: it's still not available from the browser.
- When WebTorrent runs on the WebRTC network through instant.io or anything else, it will only work if somebody else is also seeding the same torrent in their browser. Which they can only have in the browser if they first got it somewhere else. Oh, and I'm willing to bet that none of the nodes who currently have the content (i.e., on the bittorrent network) also share it on the WebTorrent network.
I don't expect classic bittorrent peers to ever implement the mess that is WebRTC just to accommodate browsers, unfortunately.
I think the problem lies with the browser vendors, who aren't implementing the bittorrent protocol in the browser.
If WebTorrent were to do that itself, it would have to become a "plugin" rather than just an extension.
So start asking Mozilla/Google to implement the bittorrent protocol in the browser (or even better, implement IPFS directly, as that's a more wholesome technology specifically made for the browser).
> I think the problem lies with the browser vendors, who aren't implementing the bittorrent protocol in the browser.
Browser vendors shouldn't have to implement it. They should expose posix-like APIs (bsd sockets, file IO) or process management + IPC via plain pipes (talk to a native bittorrent client) so it could be provided through an extension.
The problem with browsers is that they create a backwards-incompatible API stack. This is understandable for web content. Not so for extensions.
> In node.js, the webtorrent package only connects to normal TCP/UDP peers, not WebRTC peers.
That's why there's this webtorrent-hybrid client which runs a hidden electron process to communicate with WebRTC peers and normal TCP/UDP peers. According to the readme, there's (understandably; it's running chromium in the background) a lot of overhead with this method so they're working toward a non-electron version of WebRTC in Node.
So, what are the rights and licenses for this data? I see that one of them is Yelp photo data from a Kaggle contest [0]. Yelp distributes another Academic data set, but you have to fill out a form and agree to their TOS [1]. So they're OK with the data being available like this?
Another random datapoint: When EdX/Harvard released a dataset showing how students performed/dropped out, I uploaded a copy to my S3 to mirror and linked to it from HN. I got a polite email the next day asking for it to be taken down. Academics are (rightfully, IMO) protective of their data and its distribution (particularly its attribution).
One thing I would love to see on here is stuff from ICPSR, such as its mirror of the FBI's National Incident-Based Reporting System [2]. As far as I can tell, it's free for anyone to download after you fill out a form. It should also be free to distribute in the public domain, but for all I know, ICPSR has an agreement with the FBI to only distribute that data with an academic license.
(The FBI website has the data in aggregate form, but not the gigabytes that ICPSR does)
A lot of these "data sets" appear to be Coursera courses. I'm not sure those are legal to redistribute. It also clutters the browse function, since a lot of results aren't data sets.
I like this idea! In my research we deal with relatively large amounts of sequence data, all of which needs to be associated with GEO (https://www.ncbi.nlm.nih.gov/gds/). While GEO is in many ways a good thing, it is not the most pleasant to use - I would love it if we could use something like torrents instead.
I feel like there is a danger, however, that using torrents would facilitate the thousands of nonstandard (often redundant) formats bioinformaticians seem to create.
I'm wondering if torrents as such are actually useful for this. I'd figure some kind of virtual file system (perhaps based on BitTorrent) would be very useful. You'd simply pass a file path to an open() routine in your scientific code and data would get opened transparently. You currently have this with URLs and HTTP but there's no useful caching or data distribution.
BTSync and SyncThing are also tools to do this, and I'm sure there are FUSE things to work with BT and block chains ("bittorrent fuse" google results look promising).
Magnet links from torrents already do this. You just need to have the SHA1 reference of the torrent, working effectively as a URL, to load the metadata, and then the files described by the metadata payload.
The P2P nature of the network then helps the decentralization of sources, populating several clones of the dataset.
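For what it's worth, constructing such a reference is trivial; it's just a magnet URI built from the infohash (values below are made up):
# Build a magnet URI from a torrent's SHA1 infohash.
from urllib.parse import quote

infohash = "c12fe1c06bba254a9dc9f519b335aa7c1367a88a"  # 40 hex characters, made up
name = "example-dataset.tar"
magnet = "magnet:?xt=urn:btih:{}&dn={}".format(infohash, quote(name))
print(magnet)
# A client resolves this through the DHT: first the torrent metadata,
# then the files that metadata describes.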
It's a great initiative. One more step to help bring science collaboration into the modern Internet world.
How much data do you have? How much storage do you project is needed? I'm wondering how practical it would have been to use centralized storage, which has its own advantages.
- this is a mechanism for sharing files and directories (e.g., zipped csv files), whereas noms defines its own structured data model that is much more granular
- noms has versioning built in; you can track the history of a particular dataset
- this is firmly based on bittorrent. you could maybe run noms on top of bittorrent, but it's more intended to be run like git, where you talk directly to a server that you want to collaborate with
Is there anything preventing the abuse of this system to share copyrighted content? Just call it "Population.Data.torrent" where the actual data is a movie?
.com? Is that wise? I imagine the domain will be seized as soon as the site becomes popular enough. They should at least have some contingency plan to deal with that.
Their content is not illegal whatsoever. They are not like Sci-hub where the content is actually owned by someone else. On Academic Torrents, the submissions are made by people who have chosen to share their work on this platform rather than submitting to an academic journal.
Academic journals* rarely host large datasets (it's fairly unusual to even see some PDF supplements with vital summaries or results), and if you want to share data >10MB, you're mostly stuck doing it yourself through your own website or some institution's, or not sharing it at all.
Also, academic datasets aren't free of copyright concerns. Consider the famous ImageNet dataset for image classification. It's made of a million images pulled from Google Images. Did they get each photograph's creator's permission for such unlimited redistribution? Of course not. But there's no way the 'implied license' of posting a photo online extends that far... Like so much of the Internet, it's only possible in the absence of enforcement of copyright law.
* which is particularly frustrating because academic publishers make such enormous profits and hosting large datasets is exactly the sort of thing they should be doing if they were remotely interested in supporting science rather than making more money
In his defence, it doesn't really matter whether the content is legal or not. There are people out there who see torrents as evil. These people have political power and have been known to disrupt and seize .com domains on the scantiest of evidence. A more resilient TLD would be a good idea.
Dr. Cohen! But I don't care about titles. Thanks for posting about it! Did you know a Hacker News post launched the site in 2014? I was chatting about it at an event in Boston and someone posted it on here and it went crazy. The machines we had hosting data had their 20Mbps connections maxed out for weeks. Now we are prepared and it's just another day!
Woah! That screengrab is a mess at the tail end: massive traffic, not to mention that this went on for weeks. Regardless, this is an amazingly useful tool that a n00b like myself finds useful (going forward). Thank you, Daktari (Swahili for "doctor").