One area that may be impacted by this strategy is bulk-downloading your photos. My wife set her phone to automatically back up all her photos to Flickr. Thinking she could get them back at any time, she started deleting them off her phone. After three months, I discovered she was doing this and begged her to also back up her files to an external hard drive.
Every year we sync all our family photos to have redundant backups. When she went to get the three months of backups from Flickr, she got one "download error" after another. She sent me this link and hypothesized that the bulk-download feature is no longer working because the files now need to be decompressed before transmission.
Luckily, she was able to get the 25 gigs of family photos down using a third-party application, but it's another reminder to never wholly trust the "cloud."
Really interesting that (if I'm reading the article right) you can take an already-compressed JPEG, recompress it losslessly using another technique to get better compression than the original compressed JPEG, and then decompress it to the original JPEG again.
The concept makes sense but I'd never thought of that before.
JPEGs use Huffman coding to convert a series of bytes (DCT coefficients) into a series of bits. Huffman coding is only optimal if each symbol's probability is of the form 1/2^n (e.g., if half of the input is 0, you can represent 0 with a single bit).
Arithmetic coding can spend a fractional number of bits per symbol, so it can match arbitrary probability distributions. It's usually at least 10% more efficient than Huffman coding.
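You can see the gap with a few lines of Python. Here's a minimal sketch (the skewed three-symbol distribution is made up) comparing the expected Huffman code length against the Shannon entropy, which is the limit an ideal arithmetic coder approaches:

```python
import heapq
import math

def huffman_lengths(probs):
    """Build a Huffman tree; return each symbol's code length in bits."""
    # Heap entries: (probability, unique tiebreaker, {symbol: depth})
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Made-up distribution whose probabilities are NOT powers of 1/2
probs = {"a": 0.6, "b": 0.3, "c": 0.1}

lengths = huffman_lengths(probs)
huffman = sum(probs[s] * lengths[s] for s in probs)
entropy = -sum(p * math.log2(p) for p in probs.values())

print(f"Huffman: {huffman:.3f} bits/symbol")  # 1.400
print(f"Entropy: {entropy:.3f} bits/symbol")  # ~1.295, arithmetic coding's limit
```

Huffman is forced to spend a whole bit on "a" even though it only deserves about 0.74 bits, which is where the ~8% gap in this toy example comes from.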
Once you look at the JPEG format it's kinda obvious that this should be possible: JPEG essentially compresses blocks independently, so intuitively each block boundary is stored twice.
What they're saying is that you can take a JPEG-compressed image, decompress it to raw pixels, and then recompress it with JPEG more efficiently (if you're careful; there are no specifics on how this is done), and you save space.
That's why they mention they're doing it very carefully: you've got to make sure that when you decompress the new optimized image, it matches the original decompressed image pixel for pixel.
As far as I can tell, 'raw pixels' is not quite accurate. Both PackJPG and Lepton seem to decompress as far as the frequency-domain coefficients, but no further. That means they don't repeat the lossy stages in JPEG encoding - the transformation of pixel data to the frequency domain, and the discarding of high-frequency information. It looks like PackJPG and Lepton do some interesting tricks with the coefficients to essentially make them more amenable to the following lossless compression. Both PackJPG and Lepton have their own file format, so they're definitely not outputting standard JPEG images.
That's not how lossless compression of JPEGs works.
Besides removing information from the file that doesn't affect the rendered image (like EXIF data), lossless recompressors typically replace the Huffman coding of DCT coefficients with a more efficient arithmetic coder. So you don't start over from raw pixels; you replace the entropy coding with a more modern and efficient algorithm. That means ordinary software can't read the result (since you've essentially created a new format), but you can decompress back to a standard JPEG whenever someone wants to look at the image.
> Besides removing information from the file that doesn't affect the rendered image
You can do this if the goal is pixel perfect accuracy, but Flickr can’t do this since they have “a long-standing commitment to keeping uploaded images byte-for-byte intact”…
I bet a lot of those ICC color profiles are the same across many images though... You could strip the metadata, keep it in a separate deduplicated database, and reassemble the file when the user accesses it.
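A minimal sketch of the bookkeeping half of that idea using Pillow (the storage layout is my own assumption, and the hard part for byte-for-byte fidelity, splicing the profile back into the file exactly, isn't shown):

```python
import hashlib
from PIL import Image  # pip install Pillow

profile_store = {}  # hypothetical deduplicated database: sha256 -> ICC bytes

def dedup_icc_profile(path):
    """Store one copy of each unique ICC profile; return the key
    needed to reattach it when the user accesses the file."""
    icc = Image.open(path).info.get("icc_profile")
    if not icc:
        return None
    key = hashlib.sha256(icc).hexdigest()
    profile_store.setdefault(key, icc)  # stored once, however many images share it
    return key
```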
Lepton (one of the examples they mentioned) losslessly compresses a JPEG to Lepton format, then losslessly decompresses it back to JPEG. The pixels are never decompressed and in fact a JPEG decompressed from Lepton is bit-exact the same as the original. I've tested and verified this on several million images.
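If you want to check that on your own photos, here's a minimal round-trip script (assuming Dropbox's lepton binary is on your PATH and takes `lepton <input> <output>` arguments, as its README describes):

```python
import hashlib
import subprocess
import sys

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

original = sys.argv[1]  # e.g. photo.jpg

# JPEG -> Lepton -> JPEG round trip
subprocess.run(["lepton", original, "compressed.lep"], check=True)
subprocess.run(["lepton", "compressed.lep", "restored.jpg"], check=True)

# Bit-exact means the hashes must match
assert sha256(original) == sha256("restored.jpg"), "round trip was not bit-exact"
print("bit-exact round trip confirmed")
```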
It could be either. For example, I believe zpaq or 7z or one of the cutting-edge compression tools actually compresses JPEGs losslessly and gives you back the bit-for-bit identical binary file you put in (while saving 10-20% inside the archive).
Some large organisations store as many as 15 copies of each bit of user data...
When you have a database which keeps redundant master-slave replicas and mutation logs, a backup system which keeps many previous backups on-site and offshore on tape, a storage system with multiple RAID mirrors per host, a distributed filesystem which stores replicated chunks of data across an entire datacenter to handle host failure, and copies of the entire lot in multiple datacenters for accessibility during planned downtime and natural disasters, it adds up quickly.
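Concretely, a toy calculation (every number below is an illustrative assumption, not any real company's setup):

```python
# Hypothetical layers of redundancy, multiplied together
db_replicas = 3   # master plus two replicas
raid_copies = 2   # mirrored disks inside each host
datacenters = 2   # a full copy in a second DC for failover

serving_copies = db_replicas * raid_copies * datacenters  # 12

tape_backups = 3  # e.g. three backup generations, on-site and offshore

print(serving_copies + tape_backups)  # 15 copies of each byte
```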
Another way to reach this goal is to make sure features of the service remain unfriendly to users. Case in point, when providing a search interface, only return as many results as you want to, not as many as you have. And don't bother coalescing spammy results all from the same account into one expandable item; instead let them flood the results because they used keyword tag spam, and then cut off the search results after a few pages.
Make it hard to navigate by hiding everything behind hashes, to prevent fair use downloads. Keep tags in beta for 15+ years.
Of course, when usage goes down, that helps with the problem quite a bit. A poor experience, even for viewing content, lessens engagement and leads to lower usage and fewer uploads.
Sadly, I'm afraid a much more extreme data storage reduction approach awaits faithful users of Flickr.
When Yahoo! bought a large photo blogging site in Taiwan, it simply shut it down with about six months notice, deleting everything as it did.
I'll tell you what Flickr isn't spending money or time on: Support.
I have a Flickr Pro account from 6 or so years ago with hundreds of photos on it. I've tried over 10 times in the past year to contact their support, only to get handed over to tech support in India that won't even read your case!
Of course, the original email address I used for my Flickr account was deleted, so none of the avenues on Yahoo Help (which is where they redirect you) work. Not to mention the password may have been reset after all the leaks Yahoo had.
So when I see these people on @FlickrHelp on Twitter (no replies) and Flickr having office parties, it really makes me feel quite disappointed! Yeah sure, real human touch! A former paying customer who just wants to log in to his account with tons of priceless photos. And they have a thread of thousands of people who can't get into their accounts [1]
At least the employees are having fun with data compression. Sad I can't talk to an actual human to get access to my account!
I was also locked out of my Yahoo / Flickr account last year (all I use my Yahoo account for is Flickr, so I consider them one and the same). After 10 years of Flickr Pro subscription without a hitch.
Agreed, their support is terrible, and they don't seem to distinguish between free or paying users (and in terms of account access issues, at least, there isn't any dedicated Flickr support, just horrible Yahoo support).
I had access to the email address registered with my account, so at least that wasn't a drama. But still, took about 3 months of assertively-worded support emails before they got me back in and I could upload Flickr photos again.
I think Flickr is a great service, and I've been faithful to it for many years. But that episode jaded my opinion significantly. And with Yahoo now going under, I'm not expecting much re: the future of Flickr. I'm considering moving my primary online photo storage to S3, Github Pages, or some other alternative.
OH OH, have you looked into any such tools? I've been considering writing something jekyll-esque for photo galleries (static, hostable on s3/wherever), or a plugin for jekyll or another static site generator that can do the same (i.e., cloud-backed images, host the static site wherever you want)... have you seen anything like this?
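I haven't seen exactly that, but the core of it is small enough to sketch. Something like this (the folder name and layout are made up) walks a directory of JPEGs and emits a static index.html you could push to S3 or anywhere else:

```python
import html
from pathlib import Path

PHOTO_DIR = Path("photos")  # hypothetical local folder of JPEGs
OUT = Path("index.html")

figures = []
for img in sorted(PHOTO_DIR.glob("*.jpg")):
    name = html.escape(img.name)
    figures.append(f'<figure><img src="{img.as_posix()}" loading="lazy">'
                   f"<figcaption>{name}</figcaption></figure>")

OUT.write_text(
    "<!doctype html><title>Gallery</title>"
    "<style>figure{display:inline-block;margin:8px}img{max-width:300px}</style>"
    + "\n".join(figures)
)
print(f"wrote {OUT} with {len(figures)} photos")
```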
The other thing they don't spend money on is making their app work. For me it often fails to load images and has to be closed and restarted before it will display anything. I mean, c'mon, that's your core task, don't fail at this.
Also, I found an issue where they replaced the Flash-based profile avatar chooser with HTML5 but forgot to test on computers without Flash installed, so the chooser still wasn't shown. Apparently nobody noticed for months. To their credit, I spoke to one of their engineers and it was solved very quickly.
Fortunately I can log in if I remember exactly which email I used for my Yahoo! account. My browser fortunately remembers it, and the credentials in Lightroom still allow me to put stuff up there.
If your browser has the password saved then you can extract it now and save it to a more secure place, like a password manager. Presumably you could then use it to update your Yahoo! account email to a current, working address.
For a less ad hoc approach to reducing storage costs, I suggest looking into the ZFS filesystem. Compression in ZFS is completely transparent: once you enable it, all of your files are automatically compressed when written and decompressed when read.
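For example, a sketch (the dataset name tank/photos is a placeholder; lz4 is a common choice because it costs almost no CPU):

```python
import subprocess

def zfs(*args):
    """Thin wrapper around the zfs CLI (needs root/sudo in practice)."""
    out = subprocess.run(["zfs", *args], check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

# Enable transparent compression on a dataset
zfs("set", "compression=lz4", "tank/photos")

# Later: see how much you're actually saving
print(zfs("get", "-H", "-o", "value", "compressratio", "tank/photos"))
```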
I am currently managing a Postgres cluster with a petabyte of data in it. We found ZFS to be a great way to reduce overall storage costs. When we switched our machines over to ZFS, we were suddenly using a third of the disk space. Although it took us a while to learn all of the gotchas of ZFS, it wound up saving us a huge amount of $$$.
(As I understand it, ZFS would not have helped in Flickr's case. Since JPEGs are already compressed, ZFS would not have provided any benefit. Flickr was able to save storage by using an ad hoc compression algorithm.)
When we first switched to ZFS, we had some ingestion issues. It turned out the problems were caused by poor architecture decisions we had made and were merely exacerbated by ZFS. So yes, there was a noticeable increase in latency, but ZFS was only a tiny piece of it.
How much RAM do you need for a Petabyte ZFS cluster?
I investigated ZFS for my home server and the recommendation was 1GB of RAM for every TB, and much more if you enabled deduplication/compression (I forget which).
The 1GB of RAM per TB rule is wrong. More memory definitely helps, especially if you add an L2ARC cache, which can use a lot of memory just for its mappings. But in most cases you are probably fine with just 1-2GB of RAM.
I would never use deduplication on ZFS, it is very slow even if you have enough RAM. And in most cases, savings are less than 10%.
> "There are several accepted resize algorithms, but to retain the Flickr “look”, we implemented the same Lanczos resize and kernel sharpening algorithms that we’ve used for years in CUDA."
Ordinary image sharpening tends to just make existing edges look sharper; DoF blurriness obliterates the edges. Strong sharpening on a strongly DoF-blurred image might actually make the visual contrast between in-focus and out-of-focus areas stronger.
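For anyone who wants to experiment, Pillow can approximate the pipeline (this uses Pillow's Lanczos filter plus an unsharp mask, not Flickr's CUDA implementation, and the parameters are guesses):

```python
from PIL import Image, ImageFilter  # pip install Pillow

img = Image.open("photo.jpg")

# Lanczos downscale to a thumbnail width, preserving aspect ratio
w = 640
h = round(img.height * w / img.width)
thumb = img.resize((w, h), Image.LANCZOS)

# Kernel sharpening approximated with an unsharp mask; numbers are guesses
sharp = thumb.filter(ImageFilter.UnsharpMask(radius=2, percent=130, threshold=3))
sharp.save("thumb.jpg", quality=85)
```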
Can B2 reliably handle something the size of Flickr, with its consumer access demands (vs. an enterprise/business-focused service with far fewer requests and lower bandwidth usage), the way we know for certain that AWS can? That has to be a rather dramatic consideration for a service the scale of Flickr. It's a near certainty that AWS isn't going away for the next decade at least; the same cannot be said with even remotely the same confidence for Backblaze. It's the difference between 99% and 90% confidence that your provider will still be around and offering a high-quality service ten years out.
Is there a name for an exploit where a malicious client requests rarely-accessed content that has been tucked/compressed away, in order to overwork the server?
Most large web services today would die a horrible death if someone pulled off an effective cache pollution attack.
For example, if a piece of client-side malware loaded a random rarely-viewed video in the background for every Vimeo video a user watches, it would probably lead to days of downtime.
The file sizes of the different thumbnail sizes. Baseline shows all the different sizes they used before, and current thumbnails shows the space saved by switching to just two thumbnail sizes.
The way I read it is that they eliminated intermediate thumbnail sizes. So the graph shows the before and after of storage space used by generated thumbnails.
>On a very high-traffic day, Flickr users upload as many as twenty-five million photos. These photos require an average of 3.25 megabytes of storage each, totalling over 80 terabytes of data.
>increasing camera resolution, burst mode and the addition of short animations (Live Photos) have increased bytes-per-image rapidly enough
>Users only rarely delete or change images once uploaded.
I'm very curious how much of all this tr... sorry, these sweet memories, is never viewed again after, say, one week from upload.
Regarding the lossless JPG compression change: the review strategy... was that done manually by eye or automatically using some sort of image comparison library?
I have no personal experience, but I think an LTO tape would last longer than an HDD for long-term storage. LTO tapes are rated for 30 years of storage, and I doubt anyone would be using an HDD to store data that long.
With drives, you are typically checking every few hours that data is readable. If it becomes unavailable, you create a new replica right away.
With tapes, even though they might last longer, you typically wouldn't scan and check the data as regularly. That means, if it does go unavailable, there is a larger window of time for other replicas to fail in before re-replication completes.
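A back-of-envelope version of that risk (the failure rate and check intervals below are invented purely for illustration):

```python
import math

annual_failure_rate = 0.05          # assume 5% chance a given copy dies in a year
lam = annual_failure_rate / 8760.0  # per-hour failure rate

def p_loss_in_window(window_hours):
    """P(another copy fails before we even notice the first one is gone),
    modeling failures as a Poisson process."""
    return 1 - math.exp(-lam * window_hours)

print(f"disk, scrubbed every 6h: {p_loss_in_window(6):.6f}")        # ~0.000034
print(f"tape, checked every 90d: {p_loss_in_window(90 * 24):.6f}")  # ~0.012
```

Same assumed failure rate, but the rarely-checked copy leaves a window hundreds of times larger for a second failure to sneak in.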
Indeed. They spent ~$1m of engineering talent and took on some significant technical debt in order to save recurring annual datacenter cost.
I'd say if they reduced their datacenter annual cost by at least $5m it was money well spent. From the sound of it they actually saved a lot more than that. And by increasing their overall storage efficiency it's a feature that pays dividends even when they do ultimately add more storage in the future.
I'm sort of surprised to see you're getting downvoted by all the greedy wanna-be zillionaire founders. If they had any brains, they'd be the ones developing this compression code and getting the million-dollar bonuses.
Hah, yeah, I was surprised by the downvotes. My point was that if you were in finance, these sorts of numbers would generate massive bonuses... but not in other fields, sadly.
Erasure coding isn't something new, unknown, or unique to Minio; it's built into Ceph, GlusterFS, and OpenStack Swift, the largest distributed object stores.
I want to see a distributed erasure coding system. For example, the data is distributed across 10 datacenters, and every object is available in one datacenter without requiring slow costly network roundtrips, but if that datacenter becomes unavailable it can be recovered from a combination of the other 9.
I'm not aware of any company doing that (I know Backblaze does something similar, but at the pod level, not at the datacenter level... because they only have one).
You may be interested in Tahoe-LAFS though (https://tahoe-lafs.org/trac/tahoe-lafs). It has many good things in it, one of them is that all files get erasure-encoded so that k nodes out of n are needed to restore the file. When you set a node to be a storage provider (such as S3, GCS, ...), then you effectively have erasure encoding over providers: If S3 is down, you can still retrieve your data from the rest of the providers.
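The simplest flavor of the idea is single parity, which fits in a few lines (a toy sketch; real systems like Tahoe-LAFS use Reed-Solomon codes so they can survive more than one simultaneous loss):

```python
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_shares(chunks):
    """n data shares plus one XOR parity share; any single share can be lost."""
    return chunks + [reduce(xor, chunks)]

def recover(shares, missing):
    """Rebuild the missing share by XOR-ing every surviving share."""
    return reduce(xor, (s for i, s in enumerate(shares) if i != missing))

# Pretend each equal-sized chunk lives in a different datacenter
shares = make_shares([b"AAAA", b"BBBB", b"CCCC"])

# "Datacenter 1" goes dark; its chunk is rebuilt from the other shares
assert recover(shares, 1) == b"BBBB"
print("recovered:", recover(shares, 1))
```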
Minio is an interesting new project but they seem to be favouring "features" over stability. For now it would be better to go with a proven system like WD Active Archive or Scality.
TL;DR: the long tail (rarely accessed images) becomes really expensive. So to save storage (and thus cost), they both heavily compress and dynamically generate rarely accessed images.
I always wondered if YouTube hates people who upload lots of unlisted family videos for friends that get, say, 10 or so views. It's costing them storage but bringing in virtually no ad revenue.
Disks have hit 10TB per 3.5" drive, which means you could have petabytes in each rack. I think we're at the point now where the materials are almost "free" compared to the amount of information you can fit on them.
Plus, they mine the shit out of that data. Even if they're not earning ad revenue on it, they're tracking location and usage statistics, and I wouldn't be surprised if their user agreement includes the ability to have AI "watch" the video and try to mine data about what's happening in it, the same way they mine your email to build smarter and more effective consumer profiles about you. So even with nobody watching your video, I wouldn't put it past the "G-men" to find a way to eke some profit out of it.
Our automated systems analyze your content (including emails) to provide you personally relevant product features, such as customized search results, tailored advertising, and spam and malware detection. This analysis occurs as the content is sent, received, and when it is stored. [1]
Not saying you're wrong, but the first sentence of the article is "One of the largest cost drivers in running a service like Flickr is storage." Sure doesn't sound "free" or even "almost free" to me.
I'd say 4K videos are already here. My Lumia 950, released late 2015, records in 4K by default, and the iPhone 6S, released around the same time, also supports it.
They aren't the norm, but I imagine YouTube sees a fair number of 4K uploads.
I do agree, although I think 4K adoption will take longer than it may seem. For instance, 99% of videos uploaded to YouTube are smaller than 4K.
And the trend is not improving much (or at all). Comparing video uploads to YouTube over the last two years, the share is roughly the same: only 1.1% of videos are in 4K [620M videos, 7M in 4K for 2016; 322M videos, 3.6M in 4K].
> So in the end, storage is growing a lot faster than the bitrate of videos.
Is that true? I've been pricing hard drives recently, and they seem to have stagnated for the last few years, staying at about the same price, with the best value per GB at around 4TB.
I thought h265 took about half the space of h264 for the same quality, but a 4k screen is actually 4x the pixels of 1080p...
So I'm guessing 4k h265 would be about double the bitrate of 1080p h264.
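The back-of-envelope arithmetic, as a sanity check:

```python
# Both ratios below are rules of thumb, not measurements
pixel_ratio = (3840 * 2160) / (1920 * 1080)  # 4K has 4x the pixels of 1080p
h265_factor = 0.5                            # h265 ~ half the bitrate of h264

print(pixel_ratio * h265_factor)  # 2.0 -> about double the 1080p h264 bitrate
```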
More importantly, don't just look at a year or two. 1080 availability is significantly more than a decade old. In fact, you should probably be comparing against 1080 mpeg-2. That gets you nearly the same bitrate between then and now. In that time hard drives have gone from ~1GB per dollar to 35GB per dollar.
Technically you could almost do that with spinning disks. I've seen chassis that fit 60 3.5" drives in 4U.
With 10TB drives that's 600TB per chassis, so roughly ten chassis per rack gets you 6PB or more in a single rack.
For original photo storage, which is primarily write-once with rare-if-ever deletes and infrequent reads, you could potentially get by with SMR (shingled) drives, further reducing the cost.
I guess my general assumption is they don't mind because these same people are then probably popping around through other videos, which do have ads on them.
Agreed. (Anecdote but) the number of people I know in my personal life who upload to YouTube could be counted on one hand, and all of them are heavy YT consumers as well.
Well, there are not that many of them. Only around 2% of videos uploaded to YouTube are private (roughly 14M in the last year). Another 8% are public videos with under 1k views (50M in the last year).
I do believe that YouTube is not hyper-excited about these videos, but they do have some optimization techniques in place. For instance, they don't automatically generate every possible video profile until it's requested. By default they produce around 6, while very popular videos may have up to 40.
Perhaps in the future Google could make YouTube videos count towards your Drive quota? Though this would be problematic for accounts that already have tons of footage.
There are plenty of other ways it creates value: user retention, analytics/data mining, and more I'm sure. In my experience, people who upload family videos are more likely to not have AdBlock installed, so maybe it balances out for them.
Good reason to not think of Flickr as any sort of photo backup mechanism? I always presumed when you accessed a photo in "original" dimensions, it was the original file uploaded (even if renamed). There's probably some fine print TOS that clearly states otherwise...
While the wording may not have been the best, the "Indian call centre" is just a place where people have been turned into drones (these exist in the US too, but the fact that it's been out-sourced to India reflects more on the "cost-cutting" mentality around support that the company itself has). They have a script to follow and not much agency to do anything (and possible penalties for escalating issues rather than "resolving" them at the first tier of support).
To be fair, most issues that call centers deal with are rather simple and don't really justify employing highly qualified people, who wouldn't be interested in that job anyway.
Of course I'm mad too when I have to deal with a clueless call center rep, especially when they don't move you up to higher-tier support even though they have no idea how to handle the issue. But as long as they do escalate, I'm fine with the concept.
The problem is that every issue that is escalated costs more money to resolve, so there is pressure from managers to keep the number of escalated calls minimal. The reality is that this is a horrible metric to go by. The issue that a caller wants resolved is what dictates the level of support they need, and the company really has no control over the volume of calls that require more than tier-1 support (other than building a decent product, I guess).
How do you know they are in India? Accent? Asking because the Indian offshore support people used by Dell, Walmart, etc. are all given white Christian names - like Mary, John, Adam - and also undergo 3 months of rigorous 'American Accent' training. I know because 2 of my Indian cousins (I am American-Indian, born and raised here) work at such call centers in Mumbai and Chennai respectively.
So it's quite difficult to discern that they are Indian, because the companies that hire them spend millions of dollars trying to disguise their voice and tone to make them sound local/American.
I am not a native English speaker, and it's pretty easy to notice the accent.
The American/English-sounding names they introduce themselves with are all the more jarring because of that. I've got no problem talking to Prakash or Dharmesh (or whoever from wherever), but comically-obviously-not-Steve instantly gets me annoyed (it doesn't help that so far, 100% of these occurrences have been spam calls).
>Asking because the Indian offshore support people used by Dell, Walmart, etc. are all given white Christian names - like Mary, John, Adam - and also undergo 3 months of rigorous 'American Accent' training.
Yeah, like that would really help against 20+ years of speaking with an accent.
I've been to India twice, and I don't have anything at all against the people, but for some reason their English accent just drives me completely insane. And I'm not even a native English speaker myself.
3 months of training is obviously not close to sufficient to mask that accent...
As an Indian, I find US/UK accents difficult to understand too; it doesn't drive me crazy though. The thing is, you don't get to hear the Indian English accent much, but we hear yours all the time, via movies and TV shows etc.
I've never found accent to be a huge issue in general with first line technical support. It's more like 95% of the time (OK, I've done something dumb now and then) they're going to take me through scripts that are increasingly non-responsive and orthogonal to my problem before promising to call back--which they never do.
If I can get to someone in a US support center, I've just found that they tend to be better enabled to actually address my problem--even if they have a southern accent :-)
I'm not too annoyed to quickly jump through their useless hoops to get to the next step if I have to, but I find it super hard to cope with the accent.
I'm in Australia and I've noticed more of them being Filipino which I find just slightly less difficult to understand than Indian, but still difficult and irritating.
But I agree, it's such a relief if you win the lottery and get a native English speaker, things always seem to move along so much quicker.
Do you really disagree with brokenmachine's premise that poor English makes communication impossible?
Or is it, rather, that Sam and YC {and venture capital in general}, by proxy through you, are annoyed at decreasing access to low-wage workers from a developing country notorious for unintelligible accents?
Pretty sure I have no low-wage worker agenda, but perhaps you see the dark corners of my soul.
From a moderation point of view, it's not a question of disagreeing. We can agree with someone and still ask them to stop. In fact that's common. In this case it was about asking people not to go on offtopically about peeves, because that's not interesting, and also about HN being an international community. Respect across national divides is important here.
All I said was that I had trouble understanding heavily accented English. I never said I don't like the accent. It's irrelevant if I like it or not. It's that I can't understand. Being able to understand is IMO a basic requirement for effective telephone support.
I have no problem with the people themselves. I actually don't care what their nationality is or where they are physically located. But they need to be able to do their job. The system is broken if I can't understand the person who is trying to support me over the phone.
The only reason I'm writing this is in case some future CEO is reading these comments; hopefully they won't be blinded by the cost savings and fall into the same trap as so many companies who roll out such ineffective and irritating phone "support".
It'd be a whole lot cooler if they'd spend those millions just hiring Americans in the first place. Three months of voice training isn't going to trump having English as your first/primary language your whole life, and those VoIP connections are the worst.
Many of them probably have been speaking English, if not their whole life, then from a very young age. OTOH, that probably doesn't make it any easier to make them sound American, as Indian English is quite different from American English (or even British English, to which it is much more similar).
And it's a matter of far more than just tone and accent, though, IME, plenty of big companies are not succeeding in doing much about even those two elements in Indian call-center staff serving US customers, even when they are spending lots of money on it -- and I'm not talking about disguising the fact that they are Indian, but just getting them to the point where accent and tone aren't an impediment to communication with Americans.
But then, plenty of companies with US-based call center staff aren't doing much to select people that are great at clear oral communication with other Americans, either.
I'm about as anti-Trump as they get, my friend. The notion that the world deserves better than native tech support, however, is something I disagree with. When you are frustrated to the point of putting up with sitting on hold to get help, the last thing you want to be confronted with is a comprehension barrier. Between accents and regional colloquialisms, this is a likely reality of foreign tech support.
Also, your assumption is wrong. 95%+ of Americans speak English, while only 12.5% of Indians do. Even with India's much larger population, that still amounts to roughly a third as many English speakers as the U.S. There is also a knock-on effect: when only 12% of the population speaks a language, they tend to avoid using it as a primary conversational language and thus only get practice in professional settings, which lowers their fluency. Add to that all of the regional oddities and it's quite different from American English. Consider, for example, phrases like "Kindly do the needful" or "out of station" - phrases they say daily that are completely foreign to American ears.
The next time you have a chance, compare someone from the Philippines (92% English-speaking) speaking English vs. someone from anywhere in India, and it'll become very apparent that the percentage of the population that speaks the language makes a huge difference in the intelligibility of a given speaker.
Counter: How many of those who are willing in India can do it as well as those who are willing in the US? I expect the population who speak flawless and accentless (to the US ear) English in India who are also willing to do call center work is infinitesimally small.
It's not about nationalism. The job is talking to people, so it's important to be mutually intelligible.
Willingness to work saves the company money. It doesn't help the user. What helps the user are things like clear speaking and good phone connections. It's possible to do that in any country, but the relative cost varies, and outsourcing implies cutting corners.
I don't care at all where they are located or their nationality. I just haven't come across many instances where the outsourced phone support did the job as well as the companies that are using local call centres.
I worked with a woman who was born in India, learned British English, and then learned American English. She had impeccable accents in all 3.
While I agree that 3 months of training isn't going to cut it, it isn't necessary to have spoken a language your whole life if you are serious about your accent.
FYI - Indians who have emigrated to the Americas are known as Indian-Americans, while Native Americans (who were here before any of us) assume the title of American Indians.
Well sure, it's not so hard to go a year without adding another byte of storage if you're Flickr ;)... let's see Instagram or Facebook do it. Are people even still uploading things to Flickr?
Photographers still use and make projects or series on stuff like Flickr or Behance.
Instagram and Facebook are more "blog-like" and aimed at snapshots of daily life, holidays etc to share with friends. It's normal imho that these are much higher in volume.
For an image junkie like me Flickr is still my favorite internet black hole, after all these years. Nothing comes close to it. No matter how much time you have spent on Flickr, you will always find some wonderful collection of images you have never seen before, on any subject imaginable.
I really dread that some day Yahoo will manage to casually destroy it.