xz in multithreaded mode supports random access too, at least theoretically. But there's no reasonable way with xz to actually find the file you want in a tarball; that's the bit pixz provides.
I was thinking about that "no reasonable way" comment. When you uncompress the first block, you find the first tar header. From that you know the uncompressed offset of the next tar header. If the compressed stream supports random access, you should be able to uncompress just the block containing that offset (assuming the uncompressed block size is a multiple of 512 bytes) to read the next tar header. You can repeat this until you get to the file you are looking for.
With large files, this approach would be of huge value. If the files tend to be no larger than block_size - 512 bytes, there will be no speedup.
Of course, this would need to be implemented directly in tar, not by piping the output of a decompression command through tar.
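To make that concrete, here is a minimal Python sketch of the header walk. The read_at(offset, length) callback is hypothetical and stands in for whatever random-access decompression the container format provides; GNU long-name and pax extension headers are ignored for brevity.

    BLOCK = 512

    def octal(field: bytes) -> int:
        # Tar header numeric fields are NUL/space-terminated octal strings.
        return int(field.split(b"\0", 1)[0].strip() or b"0", 8)

    def find_member(read_at, wanted_name: str):
        # Walk tar headers without reading any file data. read_at(offset, length)
        # must return `length` uncompressed bytes starting at `offset`, e.g. by
        # decompressing only the compression block(s) covering that range.
        offset = 0
        while True:
            header = read_at(offset, BLOCK)
            if len(header) < BLOCK or header == b"\0" * BLOCK:
                return None  # end of archive
            name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
            size = octal(header[124:136])
            if name == wanted_name:
                return offset + BLOCK, size  # data offset and length
            # Skip the member's data, padded to the next 512-byte boundary.
            offset += BLOCK + ((size + BLOCK - 1) // BLOCK) * BLOCK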
I like almost everything about LXD, except the attitude towards networking. The project has rejected simple solutions such as port forwarding, saying that managing the network shouldn't be LXD's job. Instead, they'd like the user to manually configure their own bridges or routes or iptables chains.
I can kinda understand their point of view; there's no simple solution that will please everyone. But most developers or IT folks aren't networking experts, and LXD won't be an intuitive tool for them without a simpler mode of operation.
I have just finished a small LXC deployment, and it did take a little bit of time to figure out which direction to go with the networking.
I ended up choosing to use bridging to connect all the physical network adapters to the ones in the containers. This is nice because I set up each container with its own IP address, which travels with it when the container is moved to a new host.
> [...] most developers or IT folks aren't networking experts, and LXD won't be an intuitive tool for them without a simpler mode of operation.
Really, no expertise is required, just a basic understanding of how the heck the network works. If somebody can't grasp what a bridge interface is or how NAT operates, he apparently doesn't have the qualifications for writing software.
Unfortunately it took quite a long time; the Mozilla process is confusing to a newcomer, even one with a lot of open source project experience. I definitely second the recommendation of finding a mentor.
I don't think I would have been able to get my two patches into Mozilla if it wasn't for the mentored bugs program. It's incredibly useful and I wish more projects would implement it. *ahem* OpenStack *cough*.
OpenStack is "commercial" open source. Several (if not most) developers who contribute are paid by "companies", so typically new "contributors" find mentors within their "company".
lzma parallelizes very well! My implementation of parallel xz ( https://github.com/vasi/pixz ) works quite well. There are others as well, including pxz and the alpha version of the standard xz. All these tools produce compressed files that conform to the xz format and can be decompressed by any xz utility.
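As a toy illustration of why this parallelizes (this is not how pixz is implemented internally; pixz writes a single stream with a block index so it can also do random access), you can compress independent chunks in separate processes and concatenate the resulting .xz streams, since the standard xz tool decompresses concatenated streams. A rough Python sketch, with the chunk size and worker count as arbitrary assumptions:

    import lzma
    from concurrent.futures import ProcessPoolExecutor

    CHUNK = 8 * 1024 * 1024  # 8 MiB per chunk, an arbitrary choice

    def compress_chunk(chunk: bytes) -> bytes:
        # Each chunk becomes an independent .xz stream.
        return lzma.compress(chunk, format=lzma.FORMAT_XZ, preset=6)

    def parallel_xz(src_path: str, dst_path: str, workers: int = 4) -> None:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            chunks = iter(lambda: src.read(CHUNK), b"")
            with ProcessPoolExecutor(max_workers=workers) as pool:
                # map() preserves chunk order; note it submits everything up
                # front, so this reads the whole file into memory: fine for
                # an illustration, not for huge files.
                for compressed in pool.map(compress_chunk, chunks):
                    dst.write(compressed)

The output decompresses with plain xz -d (or Python's lzma.decompress); it just lacks the index pixz adds for random access.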
Your first point is interesting, but I'm not so sure about the second one:
Given certain search strings it should be possible for amazon to detect that this [person] engages in piracy.
Amazon does not see the search as coming from an individual. Rather, the Ubuntu servers act as an intermediary. All Amazon can see is "some unidentifiable Ubuntu user is searching for this". That's hardly something they could report to any authorities.
>All Amazon can see is "some unidentifiable Ubuntu user is searching for this". That's hardly something they could report to any authorities.
It's surprisingly easy to take anonymous search data and figure out who generated it. You might remember the mess that happened when AOL released anonymized search data (hint: people's identities were compromised). http://en.wikipedia.org/wiki/AOL_search_data_leak
Consider the simple example of files that are named after the person doing the search.
Anonymization of queries is really, really hard, and I see no system, academic or otherwise, that would protect users from being identified.
For instance, if someone with an Amazon account were to accidentally click on a link to an Amazon product, that would immediately link the person and the query. Imagine someone downloads a movie, searches for it to find the file, and then mistakenly clicks the Amazon result instead of the pirated copy.
The #1 reason I want Gmail backup is to be able to migrate in case something goes wrong with Gmail. Does Gmvault make this possible? I don't see any documentation about exporting from Gmvault to another IMAP server, or to a common format like Maildir or mbox so that I could run my own server.
If the Gmvault on-disk format is well defined, why not produce a separate tool to convert it to Maildir or whatever? Gmvault does one thing well (allegedly, I've not used it); "export to some other format" just seems like feature creep, especially when conversion can be done offline.
Gmvault stores each email as an individual text file (an EML file), so it should be pretty easy to add some export functionality (it might be a separate tool). I will add that to the roadmap for v2.0.
Currently Gmvault gives you the ability to save your emails on disk and restore them to any Gmail account with all their features; for example, labels will be restored identically. Many email backup tools are very generic, and you will lose quite a lot of information when restoring your emails to a plain IMAP server. All emails are stored in individual EML files with unique filenames, so the format is quite open and it should be pretty easy to create a Maildir structure from it. I will add that to the roadmap for v2.0. I am not convinced by mbox because it is a single file with all the emails concatenated in it. I will see what to do with it.
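For what it's worth, turning a directory of EML files into a Maildir is only a few lines with Python's standard mailbox module. A minimal sketch, assuming one RFC 822 message per .eml file and hypothetical paths (Gmail-specific metadata such as labels is not carried over):

    import mailbox
    from pathlib import Path

    def eml_dir_to_maildir(eml_dir: str, maildir_path: str) -> None:
        # Copy every .eml file in eml_dir into a Maildir mailbox,
        # creating the Maildir if it does not exist yet.
        md = mailbox.Maildir(str(Path(maildir_path).expanduser()), create=True)
        try:
            for eml in sorted(Path(eml_dir).expanduser().glob("*.eml")):
                with open(eml, "rb") as fh:
                    md.add(mailbox.MaildirMessage(fh))
        finally:
            md.close()

    # Hypothetical paths:
    # eml_dir_to_maildir("~/gmvault-db/db/2012-10", "~/Maildir")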
I also realize it might not make sense to use Maildir as the native DB for Gmvault, since you don't want to store multiple copies of each email like OfflineIMAP does. A 'gmvault export' option would be sufficient for my purposes.
I use OfflineIMAP to back up Gmail and Google Apps mail. I have it set up to run every evening from cron, backing up a Gmail account as well as a couple Apps accounts. It uses Maildir format.
Yeah, I'm currently using OfflineIMAP until I find something better. But it's really not a perfect solution: it downloads emails multiple times (once per label), it doesn't restore to Gmail terribly well, it's pretty slow, and it crashes sometimes. If Gmvault adds some sort of export facility, I'd switch.
As said above, OfflineIMAP will create a plain copy of your emails, and you will lose quite a lot of very useful features when copying them to a standard IMAP server.
That said, I understand the need and will think about an approach to allow users to leave Gmail for another IMAP server in v2.0.
Another point is that Gmvault is meant to be easy to use. OfflineIMAP is a very good tool, but it is meant for advanced users like us, whereas Gmvault is meant for users with very little computer knowledge.
Gmvault v2.0 will go further, as it will have a GUI (while still having a CLI mode) to allow my Grandma to back up her emails :-).
There's a rough open source clone of Filelight for Macs[1]. Disclaimer: I'm the original author, but no longer involved.
A commercial app in the same genre, DaisyDisk[2], is much more flashy and polished. You can also get the KDE/Mac version of Filelight from MacPorts, as part of the 'kdeutils4' package.
Remember that compressors, including xz, support multiple compression levels. The default level for xz is 6, which is perhaps too far on the small-but-slow side. Levels 2 and lower tend to give compression ratios similar to bzip2, and are considerably faster.
Also, note that bzip2 decompression is very slow; xz usually beats it by a factor of two or more.
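If you want to check the size/speed trade-off on your own data, here's a quick-and-dirty Python comparison of a few xz presets against bzip2 (single-threaded; the sample file name is hypothetical):

    import bz2
    import lzma
    import time

    def benchmark(data: bytes) -> None:
        # Compare compressed size and wall-clock time for a few xz presets
        # and for bzip2 at level 9 (its default and maximum).
        for preset in (1, 2, 6, 9):
            t0 = time.perf_counter()
            out = lzma.compress(data, preset=preset)
            print(f"xz -{preset}: {len(out)} bytes in {time.perf_counter() - t0:.2f}s")
        t0 = time.perf_counter()
        out = bz2.compress(data, compresslevel=9)
        print(f"bzip2 -9: {len(out)} bytes in {time.perf_counter() - t0:.2f}s")

    if __name__ == "__main__":
        # Substitute something representative of your own workload.
        with open("sample.tar", "rb") as fh:
            benchmark(fh.read())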
I agree that the default level (6) for xz probably errs too much on favoring file size over speed. My tests with compression levels 1-2 do indeed show modestly improved size and speed performance relative to single-threaded bzip2.
The fact remains, however, that I can't seem to find a simple way to install a parallelized version of xz. Perhaps I'll post an issue in the GitHub issue tracker for pixz and see if we can't resolve that. :)
Another nice thing about pixz is that it does parallel decompression as well as compression.
(Disclaimer: I'm the original author of pixz.)