Perkeep – Open-source data modeling, storing, search, sharing and synchronizing (perkeep.org)
255 points by noncoml on Dec 15, 2017 | 105 comments



Since people are confused about what this is, I'll write a summary (from old memory, so it's probably 80% correct).

It is a consumer-oriented storage system that is:

- Content addressable

- Indexed

- Tag-oriented (vs. hierarchical)

- Permissions, encryption, compression, sharing, etc.

- Spans storage across machines and clouds

- FUSE mountable

- Has CLI and Web interfaces built-in

The intent is to be a personal data dumpster that you can throw all of your files and other data (tweets, etc.) into for search and backup.

The website could be better organized to convey this information quickly.


Camlistore (renamed to Perkeep) author here.

It is true that the website needs some love & updated docs. We've been working on Camlistore for 8 years now (with a few drier spells) but our focus has never been marketing. If anything, we didn't want too many non-nerd users for a number of years because it wasn't ready for non-developer usage. That's starting to change.

We have pretty good docs for configuration and such, but we lack some concise high-level text about what the project is and why.

I'll prioritize that.


For everyone else reading this, here's more context. I once tried creating durable physical storage that spanned multiple external hard-disks with a single logical schema, but then discovered Camlistore and git-annex and decided to let more competent people build it.

The idea is that we should be able to own and manage our personal data - which runs into terabytes across one lifetime - without having to trust and/or pay the big cloud companies. So Camlistore from its earliest days had an integrated photo gallery, since multimedia is where most of the bytes are consumed.

The whole thing once went under the label of the IndieWeb movement (which we should revive), and Wired wrote about it here: https://www.wired.com/2013/08/indie-web/

Brad Fitzpatrick is also the creator of LiveJournal where he wrote the original version of Memcached in Perl. He also wrote OpenID, and then went on to work with Rob Pike and team on the Go Programming language. Camlistore was one of the earliest projects written in Go (before Hashicorp made it cool) and I imagine that had something to do with him getting into the language itself, but that's for Brad to clarify :)


Brad also wrote MogileFS (omg files!) at LiveJournal, a self-hosted precursor to cloud object stores like AWS’ S3.

https://code.google.com/archive/p/mogilefs/


It would be interesting to have a line or two about the differences and potential synergies with Upspin (https://upspin.io).


Brad, the Perkeep author, previously answered this briefly here: https://news.ycombinator.com/item?id=13700968


Thanks a lot for chiming in.

Sorry to say I'm still confused by what Camlistore does.

Would it be fair to say it's similar to Syncthing[1]?

[1] https://syncthing.net/


Some previous threads on Camlistore/Perkeep:

* [2014 Jun] https://news.ycombinator.com/item?id=7842629

* [2011 Jan] https://news.ycombinator.com/item?id=2156374


The thing is that nothing is good enough to keep for a lifetime. Hardware might break, a supply might be discontinued, and a software maintainer might disappear. You'll need to keep refreshing the data from one device to another for the rest of your life. That said, I'm curious how easily this system can handle porting from one device or service to another, in varying formats and architectures. The only way to stay relevant is to constantly keep changing/adapting to new things.


A huge focus of the project is on human-readable schemas and formats. Even if all specs & source code of the project are lost, the data should still be recoverable by a curious archaeologist.

Between replicating between several companies as well as your own hardware & having friends & family mirror your stuff (encrypted or not), the idea is that some copies will continue to exist.

Hardware failures are a given. Companies failing and friends & family dying are also a given. Natural disasters too. The only option seems to be trusting nothing and replicating all your data to lots of places, in future-friendly formats, and that's what Perkeep aims to do. And then a ton of tooling on top of that.


Interesting. I thought plaintext + .tar.gz or .zip format on either FAT or ext2 fs is the best bet for forward compatibility, and anything beyond that is too complex or obscure for future archaeologists. The obvious problem is the searchability, but I'd imagine in future that indexing a few TB of text/image will be a breeze.


Looks like there's been some nice progress since I last looked at Camlistore! The importers from cloud services like Twitter look really interesting.


Camlistore & Brad Fitzpatrick's original writings are what initially got me into decentralized web advocacy. Since then, I've moved on from this project, since it seems to move at a very slow pace and the authors do not seem very interested in widespread user adoption.

With this name change, I'm slightly more interested again. We'll have to see in the coming months whether they become ready to displace actual large social media platforms or whether it remains a toy project.


How does this work?


That was my first question, too. After clicking through a few links and even opening up an intro presentation I was left unsatisfied and closed the tab. This project desperately needs an FAQ or overview video up-front.


The video demo on the front page is a great place to get an overview of what it's all about.

https://www.youtube.com/watch?v=8Dk2iVlc67M


Is there an overview that's less "an hour long" and more "three sentences"?



This 24 minute overview gives a good idea as to the fundamentals of their system. https://www.youtube.com/watch?v=yxSzQIwXM1k


It downloads and catalogs a bunch of crap onto a local hard drive.


... but hard drives don't last forever? And if that is all it's doing, why not just save the stuff to your hard drive in the first place.

I am so confused by what these people do.


I spent about ten minutes on the site, so hardly a domain expert. I’m still confused, too. But as best I can understand, you have the option of storage on S3, Azure, and the like. I assume that with a plug-in/driver, you could store anywhere you like.

But non-local storage does seem to be designed in, because there is text like “if there’s a daemon running rsync in the background, you’re doing it wrong” and “if your UI requires marking folders to be synched/not synched, it’s broken”, so there appears to be an assumption of putting your data elsewhere.


It's like your own personal google drive / dropbox / git repo

It's a content addressable storage system. There's plugins to import or export from various major services like foursquare, twitter, etc. and plugins let you store stuff in S3 or mongo or google cloud storage, etc.


I suppose technically running something like OwnCloud with a plugin to fetch your content (text/photos etc) from the various social networking APIs would look near identical from the outside?


Yeah, that part is confusing. If you can throw local drives at it plus a cloud service or two, and it'll just make backups on the cloud, that's actually pretty interesting. It's not clear if that's what's going on though.


As much as possible the design attempts to be agnostic about where the stuff is actually stored. Usually you want to be storing your stuff in multiple different places. Among others, Perkeep lets you choose one or more (or all) of the following:

- Your own hard drive local to the Perkeep instance

- Your own hard drive local to another Perkeep instance you run

- Your friend's hard drive local to another Perkeep instance _they_ run, with encryption to ensure privacy

- A removable hard drive you can periodically sync to and from

- A cloud service like S3, etc.


It has sync and backup options, tools for non-destructive changes, robust search, and a web UI so it can be used like a Dropbox. Just because it's not useful to you doesn't mean it's useful to no one. Oh, and importers for cloud things: they show off automatic sync pulling down a Twitter feed and some Foursquare check-ins.

Not an effort to sway your opinion, just pointing out that it's not really as simple as just saving images to a hard disk.


That is a completely misleading / outright wrong description.

1. It can work with any data storage method you want afaict. S3, B2, GCS, local, etc.

2. Its primary goal is to store your data forever, regardless of hard drive failures / storage companies folding / whatever.


It's content addressable storage - as used by git and plan9's fossil/venti.

https://perkeep.org/doc/prior-art

https://en.wikipedia.org/wiki/Fossil_(file_system)
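
For readers unfamiliar with the term, here is a minimal sketch of what "content addressable" means, assuming SHA-256 as the hash and a plain in-memory dict as the backend. The `BlobStore` class and the `sha256-` reference prefix are hypothetical names chosen for illustration, not Perkeep's API or on-disk format.

```python
import hashlib


class BlobStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self):
        self._blobs = {}

    @staticmethod
    def ref_for(data: bytes) -> str:
        # The reference is derived purely from the bytes being stored.
        return "sha256-" + hashlib.sha256(data).hexdigest()

    def put(self, data: bytes) -> str:
        # Identical content always maps to the same reference,
        # so storing the same bytes twice keeps only one copy.
        ref = self.ref_for(data)
        self._blobs[ref] = data
        return ref

    def get(self, ref: str) -> bytes:
        data = self._blobs[ref]
        # Re-hash on read so silent corruption is detected.
        if self.ref_for(data) != ref:
            raise ValueError("blob does not match its reference")
        return data

    def __len__(self) -> int:
        return len(self._blobs)
```

The design point is that a blob has no name of its own: you ask for it by the hash of its contents, which gives deduplication and integrity checking more or less for free.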



I've been watching Camlistore for a few years. I peek in on it every once in a while, long enough between that I usually can't remember the name. I like the look of it, but haven't been convinced to go from my decade old ZFS setup to Camlistore.

I feel like OwnCloud is more compelling, from a glance. Anyone use one or both and able to comment?


Camlistore author here.

If you only store files, sure, use ZFS.

Perkeep (Camlistore) doesn't write to a block device. It has storage backends for a filesystem (which can be ZFS) and any number of cloud object storage providers (S3, GCS, etc).

Perkeep's main value over a fancy POSIX filesystem is storing nameless things (tweets, other social media content + interactions, bookmarks) in common schemas, and permitting search over it all, and then having a variety of ways to browse it (CLI, FUSE, API, web UI, etc).

It's also good at sync to & from things any which way without merge conflicts.
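
One plausible reading of "without merge conflicts", sketched under assumptions: if every stored item is immutable and keyed by a hash of its contents, then syncing two stores reduces to copying whichever keys the other side lacks. The functions `blob_ref` and `sync` below are hypothetical, the stores are plain dicts, and this is not a description of Perkeep's actual sync protocol.

```python
import hashlib


def blob_ref(data: bytes) -> str:
    """Key a blob by the SHA-256 of its contents (assumed scheme)."""
    return "sha256-" + hashlib.sha256(data).hexdigest()


def sync(src: dict, dst: dict) -> int:
    """Copy every blob in src that dst lacks; return how many were copied.

    Because a key fully determines its value, two stores can never hold
    different data under the same key, so there is nothing to merge.
    """
    missing = src.keys() - dst.keys()
    for ref in missing:
        dst[ref] = src[ref]
    return len(missing)


# Usage: two stores that each hold something the other lacks.
laptop = {blob_ref(b"photo"): b"photo", blob_ref(b"tweet"): b"tweet"}
cloud = {blob_ref(b"tweet"): b"tweet", blob_ref(b"note"): b"note"}
sync(laptop, cloud)
sync(cloud, laptop)
```

After syncing in both directions the two stores hold the same set of blobs, regardless of which direction ran first.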


How is this any better than just burning your data to a blu-ray, which lasts centuries when stored under proper conditions (theoretically, anyway)? I need to give this a closer look.


This is such a classic hacker news comment


Always a good time for linking the Show HN for Dropbox: https://news.ycombinator.com/item?id=8863


The second comment is the important one, where they completely miss the point of ease of use. (The first comment is right, installing stuff on a corporate computer is tricky.)

So let's look at ease of use! You need to have a server and separately manage GPG keys. Looks like an archival blu-ray wins on that front. (And yes I see where it's a goal to make this easy to use for everyone, it's not there yet.)

So whether tabeth is wrong or right to think it's of limited use, they are not fundamentally missing the point.


Not having to worry if there will be any Blu-Ray readers available in a century.


Seriously. The only device I have which can read a CD-ROM is my car. The PS4 can read Blu-Ray and DVD but not CD-ROM.


I would actually be far more comfortable with storing my data for archival on CD-ROM or DVDs than BluRay, since the former standards have been publicly and freely documented[1][2] from the physical properties up to the logical bits and bytes, while I don't believe the same exists for the latter.

In other words, anyone can, with enough engineering resources, create a drive capable of reading those discs, which is more than can be said of more proprietary formats.

[1]http://www.ecma-international.org/publications/standards/Ecm...

[2]https://www.ecma-international.org/publications/standards/Ec...


If you really need to read a CD-ROM then getting your hands on a SATA DVD-Drive, which usually are able to read CD-ROM, shouldn't be that big of a problem. Without looking hard I'd probably come up with 3 spare ones in my basement alone.

Tho I don't think that many of the self-burned CD's from 2 decades ago are still any good, I know mine usually ain't.


> Tho I don't think that many of the self-burned CD's from 2 decades ago are still any good, I know mine usually ain't.

You'd be surprised. I went recently through several of mine and lo and behold they could all be read. I guess it depends a lot on your storage conditions.


Then I’d need to find a computer with a SATA interface... looking 10 years in the future it’d be even less easy.


I have friends who work in the IT section for an under-resourced cultural institute focused on the preservation and recording of disappearing cultures and ethnic groups, including the preservation of speech and utterances in languages that now have no living native speakers. They discovered recently, to their alarm, that the only surviving copies of some recordings were now on old 3½ and 5¼ inch floppies that had somehow been stored without accurate cataloguing. They are struggling to find equipment that can 1) read the discs, 2) interface with the disc drives, 3) tell them what is actually on each disc and what file formats are in use (they have good guesses, but no certainty) and 4) find software that will be compatible with those formats.

They have neither the skills nor the budget to do in-house or outsourced forensics for this. At this point they don't even know what exactly might be lost to humanity's knowledge, and to the descendants of these people, forever.


Hi, Jason Scott of the Internet Archive. Let's talk. jscott@archive.org.


Please! Put them in touch with the Internet Archive. The equipment you're looking for is called the Kryoflux - https://www.kryoflux.com/


So, I probably shouldn't have been surprised, considering it's HN, but the number of people who reached out to me on this was surprising and touching. I think I've responded to everyone directly, my apologies if I missed you.

I'm glad to report that, since last I'd spoken to them about the problem, they've figured out what to do and consider the matter solved. The material is recovered, recatalogued, and in good order.

Thanks again everyone for the concern, interest, and offers! I know where to come if something like this comes up again!



Floppy + SD card reader -- looks handy. But still has an IDE connector. Recent motherboards only have SATA. The review here says you can't access the floppy via USB, fwiw: https://www.newegg.com/Product/Product.aspx?Item=N82E1682019...

Maybe something like this: (USB floppy) https://www.amazon.com/External-Floppy-Portable-Windows-Requ...


Yes, with a perfect floppy disk. There aren't any perfect floppy disks left. You need a kryoflux or similar.


I have a device that interfaces SATA (and a few other formats) to USB. A few of my friends have been very happy to borrow it, so it's definitely a tool I'm going to hang on to. With no moving parts, it should last a while.


I have one as well, and it even does IDE too. I do computer repair in my spare time, so I'm not exactly the typical user, but I've gotten a lot more use out of mine than I would have expected as well.

For those looking to buy one, I would personally recommend just going for a simple one that has an external AC adapter. I've found a lot of the ones that attempt to power the drive straight from USB ports can unexpectedly have the drive power-off due to the USB ports not being able to supply enough power, which is obviously a huge issue (Even my eSATA with a Y-cable has this issue with some drives, making it basically unusable). Powering it externally is a lot more reliable.


I store backups on BluRay and bought a USB BD drive for this reason. I’m not planning on keeping the backups in this format forever, if something better comes out in 10 or 20 years I’ll move to that. I only use the drive once or twice a year, so it should last that long, and there will still be adapters for USB type A then.

My biggest concern is getting the disks. I can walk into my local supermarket and buy a DVD-R or CD-R no issues, but BD-Rs (especially high capacity discs) are hard to find even now.


Out of curiosity, have you seen M-Disc?

https://en.wikipedia.org/wiki/M-DISC

It's supposed to provide a decent (not degrading over time) DVD and BluRay backup approach.

Years ago, with CD's I used to use Kodak "Gold" for archival things. M-Disc seems like the modern version.


There are USB disc readers, USB floppy readers, USB to RS-232 cables, you name it. No need for SATA. They usually aren't very expensive either.


In a century I don’t even think we’d have CD/Blu-Ray. By then most of us would be dead already, so why worry?


I’m sure a graph of Time vs Value for data would have a significant dip shortly after creation, but on the scale of centuries it only goes up. (just look at the Dead Sea Scrolls).


>> Not having to worry if there will be any Blu-Ray readers available in a century.

Century? Startup sites like the one above last on average 6 months, that is, until they find out that their $6/mo DigitalOcean droplet suddenly costs... $10/mo! Or $100/mo or whatever and then they find out they cannot fund their $100/mo droplet and call it quits.

So... if you need the data to be around for 100 years, maybe not give it to the random startup.


It's an open-source project that has been around for nearly a decade, not some new startup.


This project started development 8 years ago.

https://github.com/camlistore/camlistore


M-DISC is even better. Burnable discs use an organic dye which oxidizes over time. M-DISC uses a "glassy carbon" layer that is inert to oxidation.

They adhere to DVD-R, BD-R, and BD-XL standards so it's readable in standard disc drives. You need a special drive to burn them, however (requires a high-power laser).


> Burnable discs use an organic dye which oxidizes over time.

This is only true of DVDs and a rare variant of Blu-Ray called LTH. Even cheap shitty Blu-Rays from Chinese manufacturers use inorganic dyes these days.

Also, the French Archives did a test of a variety of DVDs for longevity in adverse conditions and found that M-DISC didn't last significantly longer than competitors, even those with inorganic dyes: https://documents.lne.fr/publications/guides-documents-techn...

The US DoD also did a similar test under different conditions and found it performed much better than the competition though: http://www.esystor.com/images/China_Lake_Full_Report.pdf

I suspect the difference between the French and US tests might be the French using a longer test duration and the Americans using light. The French went up to 1000h while the Americans only went to 24 as far as I can tell.

And unlike DVDs, I haven't seen any studies of longevity for M-DISC Blu-Rays.


It's different (better?) in that it doesn't rely on you remembering to actually burn that data, then store it safely. It comes with an app you can run on your phone to upload all your photos immediately, for instance. It has importers to archive all your tweets automatically, for example. It allows you to outsource the task of "Keep this blu-ray safe" to a cloud provider (or a friend) while encrypting your data to keep it private.


I've been keeping an eye on this project for years, because it seems well-designed, and the authors are very capable developers.

The biggest problem I found was getting documentation on replication. Having two or more servers mirror each other across the internet seems like a good idea, given that otherwise you have a single point of failure as you import all your media/files.


I’d be interested in a system for converting existing stuff from, for example, the Firefox “ScrapBook” plugin, to this format. (The ScrapBook plugin is not compatible with Firefox 57’s plugin API, so anyone who upgrades to Firefox 57 immediately loses all their saved ScrapBook pages.)


I have no idea how compatible this is, but someone is working on a new version. https://addons.mozilla.org/en-US/firefox/addon/scrapbookq/


The perfect tool for a digital hoarder like myself. Will follow this with attention.


So, it's just a document server that can be run over multiple computers? I was expecting something peer to peer. If I understand correctly, you can think of this as a dropbox that you can self host?


What is the target audience of this? What are the intended use cases?

Is this supposed to be used directly by users or as an API for a user-facing application? How is this different from a document DB like MongoDB?


Long time follower of the project here... So far it's been aimed at geeks who want to archive their content from the cloud, eg tweets, but it also stores files. Because of the way it is designed I've always thought there is a compelling use case for its use as a file and object store for organizations where auditing of data records is expected and sharing of data is a requirement.


So is this ready for prime time yet? I used to follow camlistore, and it was still a little rough even for CLI nerds.


So I just downloaded it and played around and as far as I can tell there is no way to delete files. Or, more specifically there is a way but it's not implemented or otherwise accessible as far as I can figure from the rather sparse documentation.

If someone would like to explain to me how (if?) the garbage collection works I'd appreciate it, because I like the concept and kinda want to use this, but deleting stuff is a rather important feature for me. All I could find searching was a post by the devs saying it was already mostly implemented but not finished and not a priority...

https://github.com/camlistore/camlistore/issues/792

Like, I understand that this is a spare time project (I think) but not considering deleting/pruning files to be an important feature is really confusing to me. In its current state, if I accidentally upload the wrong file, am I now stuck with it forever?

Edit: ok I figured out how to at least delete things in the UI (clicking the check mark opens a side menu apparently, `camput delete` doesn't seem to do anything), but as far as I can tell it doesn't actually delete them from the database without running a garbage collect, which isn't implemented so it just hangs around in purgatory.


Is this possibly a Dropbox replacement? Do I have to host the files on my own server?


Alternatively: "Hard-drives let you permanently keep your stuff, for life"


Hard drives are an especially bad choice for lifetime reasons, and SSDs don't solve the problem either :P


Tape is usually the preferred magnetic media for long-term storage.


I don't agree — that's why things like redundancy are commonplace. :D


“You're weak on logic, that's the trouble with you. You're like the guy in the story who was caught in a sudden shower and who ran to a grove of trees and got under one. He wasn't worried, you see, because he figured when one tree got wet through, he would just get under another one.”

http://multivax.com/last_question.html


Huh, I don't think I've seen a reference to that story in years, but just emailed it to a coworker a couple hours ago.


One of my all time favorites. Though it took me a while to remember the source of the quote. I thought it had been used in the context of global warming so google didn’t turn up much. Then I remembered it’s actually from a story about universal cooling.


That's not a hard drive. That's a system built on top of hard drives.

And so is perkeep.


That's incorrect. RAID is a system built on top of hard drives for redundancy. Redundancy (for this use of the word) is simply duplication across multiple hard-drives, which doesn't require a system at all.


Really, this is a very simplistic view of long term storage.

A RAID is not magically more reliable than a single drive, it needs a bunch of infrastructure and it needs to be duplicated to some other location far away enough to ensure that a single catastrophe such as a fire does not destroy your entire raid.

You are missing the wood for the trees: hard drives and RAID devices are storage mechanisms that fall far short of the boundary conditions set to keep something permanently. At worst you will store your data for a couple of hours like that, and in ideal conditions maybe for a couple of years, but on a scale of decades or centuries they are useless as a complete solution, though they could be part of such a solution.


I wonder how terrifying it would be to get a notification every time a single underlying storage device on something like Dropbox or S3 failed. We all know there is some kind of redundant system but how often does your data get moved around because of failures?


At that scale it's probably like a slow rain.

You could be made to feel better if the alert only came when it concerned your data. But even then, going by the NAS sitting under my desk, you could go months without any activity and then suddenly two drives fail in two weeks. It's a nice little random data generator.


Ah yeah I was unclear there, I meant notifications of the devices under your own data.

e: Also, is it so surprising that your drives failed around the same time? It’s likely they were purchased together!


Past a certain point it would be just a part of the job. As long as you have hot spares, you'd just go around replacing failed drives every day or whenever.


Just saying to yourself "I save all my stuff on two drives" is a system. It's just kind of a crappy one that's really prone to failure.


redundancy is not helpful if your system consistently fails after some regular interval



It's not clear how much better they are than regular media, since there haven't been many tests. There are two that I'm aware of, one by the French Archives (who've done this a few times, as it so happens) and one by the US DoD.

The French found that M-DISC didn't perform much better than regular DVDs and that a weird kind of glass DVD beat everything else hands down.

The Americans found no errors at all in their tests of M-DISC while all other disks encountered them.

I suspect the important differences were:

- The Americans tested the discs after light exposure, the French did not. It may be that the light caused the regular DVDs to fail but not the M-DISC.

- The French tests were far longer (1000h) than the Americans' (24h). It may be that M-DISC can't survive the adverse conditions past a certain point that the Americans didn't reach.

Also as far as I'm aware, there are no tests of the Blu-Ray variant of M-DISC.

Personally, given the cost of M-DISC, I'd buy a few cheap terrible Blu-Rays instead and just make sure they're not exposed to too much light.

French test: https://documents.lne.fr/publications/guides-documents-techn...

American: http://www.esystor.com/images/China_Lake_Full_Report.pdf


> While the exact properties of M-DISC are a trade secret

If long-term accessibility is the goal, not off to a good start...


They can be read with any standard DVD or Blu-ray drive. Not that anyone has one of those any more.


Spent the whole morning burning them, as it so happens.

Trying to get data in to an air-gapped environment is a true PITA.


Question, if anybody gets to this: I'm taking a break from work and computers for a year. How would you guys suggest I store my kdbx data securely, in a failsafe manner, without worrying about forgetting passwords or losing paper chits or USB keys?

Edit: after seeing some good suggestions about physical storage, I've decided to increase the difficulty of the question, hard mode- How would you do this without physical stuff? (more, new answers about physical welcome too)


For something on the timescale of a year I would just keep the system that you already have up and running. If it were much longer than that I'd go with a bank vault that contains the access keys and something like tarsnap and yet another backup with another cloud provider.


I'm assuming all my electronics fries, papers burn and memory goes away. (to be safe)

Bank vault might be a good idea (assuming they id me fine)


> Edit: after seeing some good suggestions about physical storage, I've decided to increase the difficulty of the question, hard mode- How would you do this without physical stuff? (more, new answers about physical welcome too)

Store one copy in a gmail account, and another on imgur.

> assuming [...] memory goes away. (to be safe)

And tattoo the site+username+pass on your thigh.


Good thinking!


I wonder if a system like this would be good for your general problem:

Generate a random seed sentence of so many words. From the secret seed + site domain name generate a password

Store piece of paper with:

- Algorithm (could be public in github too)

- Seed word

- Site names
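
The comment above does not name a derivation algorithm, so the following is only a sketch of the "secret seed + site domain name" idea under stated assumptions: HMAC-SHA256 keyed by the seed, URL-safe base64 for the output alphabet, and a hypothetical `site_password` helper with an arbitrary default length of 20.

```python
import base64
import hashlib
import hmac


def site_password(seed: str, domain: str, length: int = 20) -> str:
    """Derive a per-site password from a secret seed and a domain name.

    Assumed scheme: HMAC-SHA256 with the seed as key and the lowercased
    domain as message, encoded as URL-safe base64 and truncated.
    """
    digest = hmac.new(
        seed.encode("utf-8"),
        domain.lower().encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]


# Usage: the same seed and domain always give the same password.
pw = site_password("correct horse battery staple", "example.com")
```

Nothing needs to be stored besides the seed and the list of site names, which is the appeal of the approach. A plain HMAC is fast to compute, so anyone adopting something like this would probably want a long seed, and sites with unusual password rules would need special handling.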


Wouldn't biometrics be good use case here since OP doesn't want to remember it?


Yes if you can get a biometric that you can keep secret, and can easily access.

Finger prints aren't very good (they end up everywhere!). Retina scan? Not very cheap I'm guessing. Face? Definitely not secret.


Yes!! Is there a solution like that?


For a year? A burned CD in a safe deposit box. Also a USB key there for convenience. Basically paying for physical security of the devices/data.


I was gonna say, this sounds like Camlistore.


Because it is! (edit: oh I see it's in the header)



