It took us only a few weeks to write our home-brew
analytics package. Nothing super fancy yet, but now we have
an internal dashboard that shows the entire company much
of what we used analytics for anyway - and with some
nice integration with some of our other systems too.
I never quite grasp how the above isn't just a matter of intuition to anyone working in the tech sector. Google Analytics thrives on developers' laziness in my opinion.
And to echo other posters: SpiderOak deserve thanks. If I find myself with any need for a service like theirs, I know I'll be looking at them.
>>> I never quite grasp how the above isn't just a matter of intuition to anyone working in the tech sector. Google Analytics thrives on developers' laziness in my opinion.
Ah, the "not invented here" syndrome!
There are tons of things that you could do "in a couple of weeks" that more or less work. However, it doesn't mean you have to or even that it would be a good idea.
If all developers adopted the attitude that you have expressed, there would be thousands of sad, sad developers who need to maintain shitty in-house analytics systems because someone once said "I could do it in a week". There are tons of awful CMSes already because someone once said "I could do better than wordpress" / "I could create a better framework" / etc.
In a lot of cases, GA is just good enough. Sure, you might need to spend some time exploring its features (custom dimensions, etc); there's more to GA than the number of pageviews for a given day. There are cases when GA is not enough. Fair enough. But that's definitely not the majority of cases.
Sure, it makes sense for SpiderOak given its target audience. However, there's no need to make such a generic statement about 'anyone working in the tech sector'.
Then the question is: do you really want to maintain the infrastructure required to run the analytics smoothly? Especially if your company has tens of millions of pageviews a month and needs real-time reporting (extra infrastructure to support that).
Are you familiar enough with the stack that you could have a high degree of confidence in fixing the production issues which are inevitable? Quite often, an honest answer here is 'no'. Then can you afford to lose a few hours/days/weeks (whatever it would take to fix the issue) of data? Again, often the answer here is 'no'.
Of course, you have hosted solutions. But they are no better than GA in terms of privacy.
Paid support exists too but the cost can skyrocket pretty quickly, on top of paying for the infrastructure and maintaining it.
Processing logs is a lot cheaper than the javascript download and other additional http requests needed for google analytics, not to mention the privacy costs. Cheaper for the website, the user, and the web in general.
Not to mention you get perfectly accurate analytics, with no loss due to request blockers or disabled javascript.
The code for this is generic. An open source solution costs nothing beyond some CPU to process the logs and a database to store the analytics.
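As a rough illustration of the log-processing approach described above, here's a minimal sketch that counts pageviews per day and path from an access log, assuming the standard Apache/nginx "combined" log format (the regex and field names are illustrative, not taken from any particular open source package):

```python
import re
from collections import Counter

# Apache/nginx "combined" log format:
# IP ident user [timestamp] "request" status bytes "referer" "user-agent"
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def pageviews(log_lines):
    """Count successful GET requests per (day, path)."""
    counts = Counter()
    for line in log_lines:
        m = LINE.match(line)
        if not m:
            continue  # skip malformed lines
        if m.group("status").startswith("2") and m.group("method") == "GET":
            counts[(m.group("day"), m.group("path"))] += 1
    return counts
```

This is the "costs nothing beyond some CPU" case: no JavaScript shipped to visitors, no extra HTTP requests, and blocked or script-disabled visitors still show up.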
It's been a while since I've used GA, but being able to segment into age, gender, and interests(1) is something you can't do without paying a marketing aggregator hundreds of thousands of dollars a month or using GA. You can do some geolocation classification and things like campaign effectiveness, bounce rate, etc, but since Google has so much aggregate data on hand, being able to classify user-x as "Male, 40s, Interests-similar-to-demographic-we-sell-to"(2) is invaluable whether you're selling seats of enterprise software, high-fashion luxury items, or cheapo stoner knick-knacks. You can't really market segment with your own software.
Obviously, they're using the same information that's helping you calibrate your campaigns to add to the hive-mind, so they can further data-mine. You're sacrificing the anonymity of your end-users in doing so. Obviously they're offering it so that they can refine their profile of you more accurately to sell ads / direct more relevant traffic to you better. I'm not an industrial engineer but I've been reading about it for the last few weeks. I turned off Adblock for a while and even with my opt-out plugins(a,b) I started getting ads for $4,500 Fluke multimeters. The combination of one's search history plus a fairly comprehensive history of the sites you visit(b) profiles you to a terrifying degree, but at the same time, the average business with only a few million dollars a year going towards both sales and marketing can't really approach Quantcast and ask for access to their API.
a: https://tools.google.com/dlpage/gaoptout
b: https://chrome.google.com/webstore/detail/do-not-track/ckdcp...
b: I don't have the study off-hand, but IIRC some guy after finishing his masters at Stanford wanted to assess how much information Google had re: an average user's browser history. The findings, based off Common Crawl data of the top 100k sites + presence of GA.js, yielded something like ~> 75% of the web being tracked (not to be confused with how much of an end-user's traffic is tracked; that number will be far higher) based on sites with a GA.js history factoring in Referer tags. Those were unweighted numbers, i.e., I bet more than one out of two 45 year old women's traffic can be analyzed to a 95% degree of completeness based entirely off of Pinterest, Facebook, search history and the outbound links from her e-mail.
Interesting points. I think there are many ways to use Google Analytics that go beyond what many people want from "visitor data". Some of the questions GA can answer can only be answered if one is willing to collude in destroying (meaningful) privacy.
I've had "simple foss analytics" on my todo-list for quite some time. I'm hoping one can build on what Piwik has collected wrt bot agent strings, IPs etc - and combine that with a simpler collector (adding PHP to the stack just for analytics isn't very appealing, never mind a PHP codebase of somewhat questionable quality).
Snowplow looks good, but I'm not sure if they have a supported "self-host" stack yet (they started out very awz/s3 centric).
I actually think there's room for a new product, one that puts a little bit more thought into what questions it makes sense to ask, and how best to answer them (eg: does collecting metrics on every visitor even make sense if you can answer the same questions just as well by doing random sampling? You might want to quantify where your bandwidth goes - but simple log analysis might do that easily enough - and it might have very little to do with your human visitors etc).
If you make decisions with money riding on the answers, it costs a lot more than CPU and DB.
Perhaps systems administration is somehow very cheap for you, but I'm willing to bet it is still not "nothing" - even if the cost is you personally not watching a TV show you like because you're patching the web server on your analytics box for your personal vanity domain, that's still a cost.
For most operations, sysadmins are somewhat expensive, and because of that, busy. This is why Urchin was such a good idea, and why Google bought them - the proposition is to trade your users' privacy for the admin time it takes to support another internal app. There's an absolute no-brainer, assuming you don't care about your users' privacy (IIRC, they were going to sell the service before Google ate them, but that's ancient and trivial history).
>because you're patching the web server on your analytics box
If your business is so small that an additional low-volume web server just to display your analytics (you don't need one for the actual tracking) is a big deal, then the same web server that serves your product can serve your analytics. Not a big deal.
I don't think it's a matter of laziness. It's more a question of where it's best to spend your expensive/valuable developer resources: on the product, or on some home-baked analytics framework?
I applaud SpiderOak, but they are much different from most other sites. They have privacy conscious customers to begin with, this is something that is good press for them and probably a net positive on their bottom line for doing it, not the case with most other sites. Also it's something they are doing after having a very mature product for many years, clearly not the first or most important thing they needed to tackle as a company.
Agreed - for some cases just pasting the GA snippet onto a site is sufficient. For others you should add events and such. For others you must roll your own.
It's not laziness, it's opportunity cost. For SpiderOak, it makes sense to spend a few weeks of a few developers' time to roll their own analytics. For me, it doesn't. Our customers aren't privacy-focussed. In fact, our app depends on them explicitly sharing [quite a lot of very personal] data with us. I would rather spend that time building something that delivers value to them and us than indulging my personal beliefs about privacy.
Piwik is incredible. But it should be noted that it does provide a scaling challenge for high traffic use cases (> hundred million actions per month), and hosting your own analytics is expensive.
I bring this up because people had been slamming moot for using GA on 4chan instead of piwik without understanding why.
We have much lower traffic than that and our Piwik servers, with paid support from the Piwik team, often struggle to generate reports etc. Not convinced Piwik is that easy to scale.
People have scaled it to over a billion actions per month. No clue how much of that includes customizations though... It sounds way past the out-of-the-box limit.
Look at the comments from sandfox and afterlastangel in this thread. afterlastangel is pushing a billion, sandfox is around 300 MM per month.
I'm looking into replacing GA Premium ever since Easylist blocked GA tracking for Adblocked users and self-hosted Piwik seems like the best solution. I'd be well into the billions.
They're using an open source analytics software package to analyse the very data it was designed to analyse.
I don't find its use of poorly implemented hashing in the administrative interface to be at all relevant to what they're doing, or a reason why they shouldn't be using it.
Information on who visits WikiLeaks - and what they read and upload - is an incredibly high value target. I don't see how you can argue otherwise, when Britain's top intel agency has an expensive line item in their budget just to get at that info.
Given these known security flaws, it's not a stretch to assume anyone who can see the GCHQ's Piwik server can have that data too, regardless of whether they are authorized.
See below for a small preview of what an attacker could exfiltrate (dissident IPs redacted for a reason):
While we're talking about poor security practices: the privileged username in the screenshot is apparently still the default ("admin"), so I hope the password isn't still "changeMe" ... http://piwik.org/faq/how-to/faq_191/
Strangely, Microsoft's offering is missing: Application Insights.
Pretty much works like Google Analytics but utilises both client JavaScript and embedded runtime code to generate a richer picture of what is going on.
Too bad the interface on the Azure Portal is terrible. They spent too much time making it look fancy, and not enough time getting the 101s of usability right (which is a criticism I'd lay at the feet of the new Azure portal in general).
Probably the vendors of the software concerned. Perhaps it started out as a list of three with a major bias towards a particular product. And then the competitors responded, moderators did their thing, and eventually an accurate list evolved.
Self-hosted means that it will be served from your own servers, and thereby your own domain. So unless your domain is on a block list, it will be loaded.
EDIT: Sorry, I've been dealing with uBlock Matrix for too long, and forgot how advanced the other blockers' pattern matching is. See the many responses to this for better information.
The EasyPrivacy block list contains an entry that will block the piwik.js file. Of course, when you're self-hosting, it's trivial to serve that file with a non-default name.
That's an interesting choice. I mean, it's not like you can hide from the web server that you are making the request. But then again, I'm assuming -- by the sheer necessity of having a JS file -- that they are collecting some additional metrics not available to the server in the request.
And I never quite grasp why many people working in the tech sector are insistent on reinventing things that already exist. Such thinking thrives on developers' personal sense of exceptionalism in my opinion.
Yeah, a nontrivial app comprises so many parts that if you tried to reinvent a few of them yourself you'd never get anywhere. Also, try looking at the commit history and issue lists of seemingly trivial libraries. It's incredibly easy to underestimate how complex something that looks simple at first can be.
That starts going down the path of the "not invented here" mindset. You could then attribute not hand-rolling every bit of infrastructure yourself as "laziness". Yes, I am lazy to the point that I don't want to hand-roll an industrial-strength RDBMS myself, or the operating system, or the networking protocol, or the key/value store, etc etc.
If all you want to know is who accessed a site, with which browser, how long for, and which pages they looked at then you could get all that from your webserver's log files without writing any code. On the other hand, to build something that's robust, relatively scalable, works across browsers and devices, and can give you an event watching platform like GAnalytics gives you (eg the useful bit), that is far from trivial.
Most developers don't develop (major) libraries, languages and OSs in house, it doesn't mean they are lazy, it means the company need to focus limited resources on their core business.
>> Google Analytics thrives on developers' laziness in my opinion.
Every service does. Pingdom, GA, Olark, Github...
It took them a few weeks to write their own analytics. What features did they not implement? How many people worked on it?
Does your 1 or 2 person startup have 4 weeks to write their own analytics package or do you have more important stuff to do? (I'm betting you do. Like launching your product instead of re-inventing the wheel with analytics)
Isn't GA's main draw its close integration with adwords and whatnot? The dashboard and UI seem pretty clearly aimed at someone who needs to manage their spending on google marketing services, not on someone who needs to count pageviews.
So it's not hard to imagine marketing wanting it; presumably it provides them a lot of value that wouldn't be easy to recreate in-house.
If you can reimplement GA in a few weeks, you need to do this over December, then enjoy your FU money.
GA is rather deep, with tons of integration and ways to slice and segment data.
Yeah, maybe in a few weeks you can get _something_ that'll give you something that'll make some manager not too unhappy. Seems like a terrible value prop for almost all companies since, unfortunately, approximately no one cares (or they run adblock anyways).
I mean implementing an analytics tool that does what you need. If you do it just for yourself, you don't need all those fancy things, so it is often doable in a few weeks.
If it takes you more than a few days to put together a basic analytics platform and reporting system, you're a script kiddie.
Not hard to track page hits, time on, time off, and arbitrary events.
EDIT:
Seriously? Folks, it's a table for analytics events, a few SQL queries to do basic reporting (at least in Postgres), a little bit of client-side JS to post the events, and a bit of server-side code to create the routes and maybe display the report page.
I guess if it doesn't include Kafka, Mesos, Kubernetes, Neo4j, and Docker, it isn't delivering business value.
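For what the comment above describes - an events table, a few reporting queries, and a bit of server-side code - a minimal sketch might look like this, using SQLite in place of Postgres just to keep it self-contained (the table layout and event names are made up for illustration):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # stand-in for a Postgres connection
conn.execute("""CREATE TABLE events (
    ts REAL, visitor TEXT, name TEXT, page TEXT)""")

def track(visitor, name, page):
    # A server-side route would call this when the client-side JS
    # posts an event.
    conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
                 (time.time(), visitor, name, page))

def pageviews_by_page():
    # One of the "few SQL queries" for basic reporting.
    return conn.execute(
        "SELECT page, COUNT(*) FROM events "
        "WHERE name = 'pageview' GROUP BY page ORDER BY 2 DESC").fetchall()

track("v1", "pageview", "/")
track("v2", "pageview", "/")
track("v1", "pageview", "/pricing")
```

This is the "few days" version; scaling it (see the replies below about write volume) is where the real work starts.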
It is quite costly to write to the database for each hit; I guess most downvotes are because of this. If you limit writes by keeping them in some memory cache, it's doable for slightly higher loads.
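The memory-cache idea above can be sketched roughly as follows; the class name, threshold, and `flush_fn` hook are all illustrative, with the assumption that `flush_fn` does one bulk INSERT per batch:

```python
import threading

class EventBuffer:
    """Buffer hits in memory and flush them in batches, so each
    pageview doesn't cost a database write. Note the trade-off the
    parent comment implies: events still in the buffer are lost on
    a crash."""

    def __init__(self, flush_fn, max_events=500):
        self._events = []
        self._flush_fn = flush_fn
        self._max = max_events
        self._lock = threading.Lock()

    def add(self, event):
        with self._lock:
            self._events.append(event)
            if len(self._events) < self._max:
                return
            batch, self._events = self._events, []
        # Flush outside the lock: one bulk write instead of many small ones.
        self._flush_fn(batch)
```

A production version would also flush on a timer so low-traffic periods don't leave events sitting in memory indefinitely.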
You criticize people for _premature optimization_ while in the same breath advocating rolling your own, shitty implementation for page views? Right...
Frankly, most of what I read out of the tech world these days seems to be about pandering to developer laziness.
All manner of APIs and services seem to exist in their current form simply to extract rent from developers that don't want to do back end "dirty work".
Being paid for doing work has nothing to do with extracting rent, which is the practice of inserting yourself as a middleman so other people have to pay you "rent"[1] where none should be required.
The entire idea behind writing a Service as a Software Substitute[2] is about extracting rent.
I understand Stallman's dislike of SaaSS in [2], but I fail to see how it meets any definition of rent-seeking. People who provide SaaSS are using economies of scale to offer services that are desirable to some, because they're offered at a cost that is less than the cost of developing and maintaining their own private solution. There is certainly a loss of freedom in using these services, as Stallman points out. But rent-seeking, not so much. Users of SaaSS need to decide whether the cost savings of using SaaSS is outweighed by the freedom they give up. Nothing more, so far as I can tell.
Perhaps you should have read that wikipedia page before so helpfully linking to it. There's nothing about "middlemen" there. "Rent" is political economy jargon; it's not just a synonym for "distasteful practices". Adam Smith wasn't complaining about shopkeepers or shipping companies, and he certainly wasn't talking about "back end" software services. There is no royal decree enforcing how such services shall be provided. If you don't like AWS then use GCP.
I feel like someone needs to rewrite Stallman's missives to eliminate the term redefinition and the connotation management. His usage of these rhetorical techniques is far too ham-handed to be persuasive to those who aren't already convinced, even when his message is important.
I would add wisdom to that list. Wisdom to know which modifications will allow you to be lazy in the future and produce the best results before the user realizes they needed them. I think wisdom is a very important one.
> I never quite grasp how the above isn't just a matter of intuition to anyone working in the tech sector. Google Analytics thrives on developers' laziness in my opinion.
Unless I'm mistaken, one big difference is that not using Google Analytics means you don't know which Google search pages people used to access your website. That can be a really important difference for some websites.
A lot of people are replying to the suggestion of implementing your own analytics by calling out its NIHness.
I've recently been faced with this problem, and a solution doesn't have to be too complex.
There are roughly two parts to an analytics solution: event logging and, well, the actual analytics.
Writing your own logger in javascript is super simple; you're just sending off JSON objects to be inserted into an Elasticsearch cluster. Since you have to define that logging anyhow, the only extra work you need to do is the layer to do the actual AJAX requests.
What's left is running and defining your queries in elasticsearch.
BAM! Analytics
I realize it's not fit to be used for every situation, but it can do some pretty complex things this way without the hugest amount of effort ...
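The ingestion half of the approach above can be sketched against Elasticsearch's `_bulk` API, which takes newline-delimited JSON (one action line, one document line per event). The index name and event shape here are assumptions for illustration:

```python
import json

def bulk_payload(events, index="events"):
    """Format a batch of client events as an Elasticsearch _bulk
    request body. POST the result to http://<cluster>:9200/_bulk
    with Content-Type: application/x-ndjson."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(event))
    # The bulk API requires a trailing newline.
    return "\n".join(lines) + "\n"
```

The "actual analytics" half is then Elasticsearch aggregation queries over the indexed events.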
I get what you are trying to say and I was one of the NIH-sayers, it totally makes sense in some cases and looks like it made sense in your case. Great! :)
I don't think anyone was saying that GA is always better, it's just more often than not it is. It takes some skill and quite a bit of experience to draw the line at a reasonable place and correctly recognize the trade offs.
I've replaced Google Analytics in all my projects with my CouchDB-only web analytics service, Microanalytics[1], which I could access from a CLI[2] and worked very well.
But then I started to fall short on disk space for storing too many events. This is a problem.
Don't the ad blockers disable Google Analytics by default? If I am not wrong, I think uBlock Origin does.
So, I think, as more and more people will start using ad blockers, site owners will start getting less and less accurate stats from Google Analytics, forcing them to implement their own solutions. Hopefully, open source solutions will start providing the best features that Google does.
Anything that is widely used (open source or not) will be blocked because of common names or other patterns that can be recognized and blocked. If you need exact statistics you need to roll your own sooner or later. Or at least heavily customize some other product.
And GA is inscrutable. I don't use it very much because it's got way too many layers of abstraction. It was fine before as Urchin. Maybe this is a category like email clients — there should be a sustainable paid product that doesn't suck.
Not strictly on topic so I apologise if this is unwanted but I thought I'd share my experience with SpiderOak in case anyone here was thinking of purchasing one of their plans.
In February SpiderOak dropped its pricing to $12/month for 1TB of data. Having several hundred gigabytes of photos to backup I took advantage and bought a year long subscription ($129). I had access to a symmetric gigabit fibre connection so I connected, set up the SpiderOak client and started uploading.
However I noticed something odd. According to my Mac's activity monitor, SpiderOak was only uploading in short bursts [0] of ~2MB/s. I did some test uploads to other services (Google Drive, Amazon) to verify that things were fine with my connection (they were) and then contacted support (Feb 10).
What followed was nearly __6 months__ of "support", first claiming that it might be a server side issue and moving me "to a new host" (Feb 17) then when that didn't resolve my issue, they ignored me for a couple of months then handed me over to an engineer (Apr 28) who told me:
"we may have your uploads running at the maximum speed we can offer you at the moment. Additional changes to storage network configuration will not improve the situation much. There is an overhead limitation when the client encrypts, deduplicates, and compresses the files you are uploading"
At this point I ran a basic test (cat /dev/urandom | gzip -c | openssl enc -aes-256-cbc -pass pass:spideroak | pv | shasum -a 256 > /dev/zero) that showed my laptop was easily capable of hashing and encrypting the data much faster than SpiderOak was handling it (Apr 30) after which I was simply ignored for a full month until I opened another ticket asking for a refund (Jul 9).
I really love the idea of secure, private storage but SpiderOak's client is barely functional and their customer support is rather bad.
Many of these types of services seem to intentionally cap upload speeds to reduce their potential storage liability (since they're likely over-selling storage to be able to offer 1 TB for $12 with the level of redundancy, staffing costs, etc, needed).
I wonder if that is happening in this specific case? Although if it were the case the vendor should still be honest about it. Just saying they limit uploads to 2 Mbps is better than giving the run-around.
It's to reduce the maximum bandwidth capacity they need. I don't see it as a problem, considering their price points. They're selling you storage, not "slam 1TB of your data into our storage system in a day". If you're looking for that, ship a hard drive to Iron Mountain.
EDIT: Even AWS limits how fast you can upload to S3, and built an appliance for you to rent and ship back and forth if you need to move data faster. That station wagon full of tape is still alive and well.
> Even AWS limits how fast you can upload to S3...
I'm on gigabit fiber and use S3 to back up hundreds of gigs per month. I've never seen them limit upload speeds; it clearly saturates the connection for the entire duration of my upload. I would expect that because I am paying for the storage, they would be happy to let me write data to their machines as fast as I like. Is there a citation you can provide from their docs that supports your statement? Genuinely curious, because my experience has been different.
To the point that some of these sync or backup providers limit bandwidth, I have definitely experienced that. Tested SpiderOak and Dropbox and upload speed was horrid. Dropbox in particular was disappointing because they can't even claim to have the extra encryption overhead SpiderOak does, it was just shit speed every day. I'm paying a premium for gigabit fiber to the home and you really can tell who over-promises and under-delivers quickly. Fortunately my 'roll your own' backup + sync works well and is price competitive so I'll stick with that.
> I would expect that because I am paying for the storage, they would be happy to let me write data to their machines as fast as I like.
I don't understand why you'd think this. You're paying for storage, not an SLA as to how fast you can fill it.
> I'm paying a premium for gigabit fiber to the home and you really can tell who over-promises and under-delivers quickly. Fortunately my 'roll your own' backup + sync works well and is price competitive so I'll stick with that.
This is the preferred solution if a) commercial services are too slow for you and b) you're willing to spend the time to implement and manage it. It appears, based on commercial services out there, that there is no competition based on upload speeds.
He thinks this because it's in Amazon's interest to let him dump as much data as possible. It's not a matter of an agreement, it's a matter of aligned incentives.
> It's to reduce the maximum bandwidth capacity they need.
They should be looking to partner with someone who has bandwidth problems in the other direction. By combining a backup service's upload bandwidth and a streaming video service's download bandwidth into one AS, you can get a more balanced stream, and qualify for free peering.
Yeah, agreed. The problem is, you're limited to partners in the same DC as you (unless you're going to bite the bullet and start using fiber loops between datacenters to accomplish this). Backblaze (for example only) is only in one DC in Northern California if I recall, which limits them to whomever is in that datacenter.
A great model would be to partner with CDNs; they pour content out to eyeball networks, but you could run a distributed network of your storage system across all of their POPs.
> if they are selling 1TB of storage, shouldn't we get 1TB of storage?
You do, they're just not allowing you to store it in 24 hours. Some services (Backblaze, if I recall) allow you to ship a drive to get around this limitation.
Notice that all services do this? If you can do better, build one! Prepare to go broke from the peak bandwidth requirements you'll need to build your networking architecture to support such transfer rates, but I always encourage experimentation and learning lessons over complaints.
The appliance is so that you don't need to send terabytes of data over a 10 Gbit/sec connection for example to their datacenter.
The limitation is actually the pipe that connects you to Amazon, not an inherent limitation within S3 or other services within Amazon on connection speed. If you have a good enough connection, or peering with Amazon things go amazingly fast.
When I worked at an ISP, we slammed about 20 Gbit/sec into S3 without issues, but even then the data we were backing up -- about 300 TB a day -- took 1.4 days to upload at that rate, so we ended up backing it up in-house instead. (we needed to store the data for 7 days, after that it went bye bye).
> When I worked at an ISP, we slammed about 20 Gbit/sec into S3 without issues, but even then the data we were backing up -- about 300 TB a day -- took 1.4 days to upload at that rate, so we ended up backing it up in-house instead. (we needed to store the data for 7 days, after that it went bye bye).
Seems like the perfect usecase for S3; inbound transfer is free, and you're only paying for a rolling 7 day window of storage with lifecycle rules :/
A good upsell, yes. But initial seeding to "affordably priced" online services at full data rate can never be economically viable to the provider. Bandwidth is cheap(er) these days, but routers which can handle big bandwidth are still big bucks.
Hold on, this is hacker news. VCs, this is a great idea!
No, no of course it's not. Initial seeding is a competitive moat for the first mover. Moving a few hundred gigs to a new backup company just to save a few bucks? I don't think I could be bothered, because I KNOW how long it will take.
Pricing is falling rapidly for storage. Consider that S3 - IA is $15/mo for a TB, and backblaze B2 can offer 1 TB for $5/mo. I would assume both are making some profit at those price points, so $12/TB/mo should be workable if the service is doing their own hardware.
Backup services especially have low operational requirements for their hardware and network connection, since once the files are uploaded they only need to be verified periodically.
> Many of these types of services seem to intentionally cap upload speeds to reduce their potential storage liability (since they're likely over-selling storage to be able to offer 1 TB for $12 with the level of redundancy, staffing costs, etc, needed).
SpiderOak is definitely overselling the 1TB, as well as another plan that pops up once in a while called the "unlimited" plan for $149 a year. This is clear from the disproportionate pricing structure - $79 a year for 30GB that jumps to $129 a year for 1TB and then to $279 a year for 5TB - which entices users to go for the higher amounts because they appear to be great deals. What people with residential broadband connections may not realize is that a) uploading even 1TB of data will take a long time and b) SpiderOak cannot, and does not, provide any minimum guarantees on the upload or download speeds (assuming everything else in between SpiderOak and the user looks fine).
The thing that is silly about that relates to the cost of acquiring and retaining customers. A company that can take in data faster is more valuable to the customer and will most likely be used and retained. Organizations offering storage as a solution while trying to minimize its cost by minimizing utilization are exchanging fixed costs associated with storage (that should be easily built into pricing) for large variable costs related to customer acquisition, retention, and branding.
Yup, I've noticed the same with Wuala. The uploads were pretty slow. I've heard similar complaints from people using OneDrive. I would be very willing to switch to a smaller competitor even if it meant paying more than I do at Dropbox. But from my experience Dropbox is the only provider capable of synchronizing large amounts of data 24/7.
It's definitely possible to offer that on a monthly basis if you model that each customer stays for 36-39 months. Also, I doubt that they are using replicated storage, but are using erasure coding instead. Also, they dedupe before upload, so more cost savings there.
Spideroak can and does dedupe client side before uploading. It can't dedupe across multiple clients, but it does dedupe within the client. It also tracks syncs so that data synced between multiple client machines only has to be stored once (with appropriate redundancy).
That doesn't sound good. On the other hand, I use SpiderOak with not a lot of cloud storage use, with clients on OS X, Linux, and until this morning Windows 10. The only problem I ever had was more or less my fault - trying to register a new laptop with a previously named setup.
BTW, why store photos and videos on encrypted storage? For that I use Office 365's OneDrive: everyone in my family gets a terabyte for $99/year and I really like the web versions of Office 365 because when I am on Linux and someone sends me an Excel or Word file, no problem, and I don't use up local disk space (with SSD drives, something to consider).
I prefer to store photos and videos on encrypted storage because I want to control who sees them. Storing them on unencrypted storage means I don't have that control, the storage provider does and is kind enough to let me make suggestions.
As for OneDrive, I tried it for a while but it didn't work out. Their clients and web interface were terrible and their API was severely lacking. I expect more functionality when I'm sacrificing my privacy.
I ended up going with Google Drive in the end, as you can get 1TB for $9/month with an Apps for Work Unlimited account (I actually seem to have Unlimited under that plan, which isn't supposed to happen until 4 users). That of course means sacrificing encryption but I trust Google enough to make the privacy tradeoff in exchange for extra features (OCR, Google Photos etc.).
I also buy extra storage from Google but I have had some problems downloading large backup files (50 GB, or so) that I have stored on Google Drive, so no system is perfect.
A little off topic, but Google really seems to be upping their consumer game lately with Google Music, YouTube Red, Google Movies + TV, etc. I am now less a user of other services like Gmail and Search, but Google gets those monthly consumer app payments from me. I have the same kind of praise for Microsoft with Office 365.
This has been my experience as well, not to mention how much the client slowed down my machine. It's been really slow going but the client is getting better.
I never tried doing the encryption on my side, though. They also do diffs on each file you upload, so I imagine that has something to do with the lag.
I still use SpiderOak; they're the only company I'm aware of that encrypts locally, and they've done a lot to advance personal security for all of us.
So I've gotten used to the slow speeds and buggy software; it keeps getting better, so that's a big plus :)
I was going to post a comment about how cloud storage is more of a means to move data around rather than back it up, until I dug a little deeper and saw that SpiderOak actually pitches itself primarily as a backup provider. I agree, it needs to be much faster than that.
Is it possible that they are working on batches, and not doing any hashing/compression in parallel with the uploading? It seems feasible from your screenshot that they are getting ~10GB of data at a time, compressing(?) and hashing, and then uploading, and then starting on the next ~10GB.
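If that's what's happening, overlapping the CPU-bound work with the network-bound upload is the usual fix. A minimal sketch, assuming a producer/consumer split; the batch handling and `upload` callback are hypothetical, not SpiderOak's client:

```python
import hashlib
import queue
import threading

def process_batches(batches, upload):
    """Hash batches on one thread while uploading on another, so the
    CPU-bound work overlaps with network I/O instead of alternating."""
    q = queue.Queue(maxsize=2)  # hash at most 2 batches ahead of the upload

    def hasher():
        for data in batches:
            q.put((hashlib.sha256(data).hexdigest(), data))
        q.put(None)  # sentinel: no more batches

    t = threading.Thread(target=hasher)
    t.start()
    while (item := q.get()) is not None:
        digest, data = item
        upload(digest, data)  # network-bound; runs while hasher works ahead
    t.join()

# Example with a stand-in "upload" that just records the digests it sent:
sent = []
process_batches([b"chunk-a", b"chunk-b"], lambda digest, body: sent.append(digest))
print(sent == [hashlib.sha256(b"chunk-a").hexdigest(),
               hashlib.sha256(b"chunk-b").hexdigest()])  # True
```

A strictly sequential compress-then-upload loop leaves the network idle during hashing and the CPU idle during uploads, which would match the stop-and-go pattern described.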
The only issue I have, which is similar to what I see with some other providers, is that the first non-free plan is a huge jump in storage space and price. If I want a Dropbox replacement, I'd be looking at a 25GB or 50GB plan (just comparing with what I have after all kinds of free storage bonuses accumulated over the years). Having some more "in-between" plans that are more linear in storage and price would've been an incentive to try this out, since I'm not willing to fork over $49 a year for 500GB while knowing that my Dropbox usage is less than one-tenth of that.
It is off-topic, yes. For me personally it was very valuable however since I’m in the market for a backup application, and I will definitely take Veratyr’s comment into consideration when choosing between the available offerings.
> easily capable of hashing and encrypting the data much faster than SpiderOak was handling it
I can believe that there was upstream congestion somewhere outside my network (speeds to Google and Amazon indicated there were no issues inside) or that their server was overloaded, but the engineer who investigated seemed to attribute it to the client:
> "Additional changes to storage network configuration will not improve the situation much. There is an overhead limitation when the client encrypts, deduplicates, and compresses the files you are uploading."
Trivial to set up, immune to adblockers affecting the completeness of data, prevents tracking cookies from being written, and leaves the data and utility of the GA dashboard mostly intact (you lose client capabilities and some session-based metrics).
One may argue that Google will still be aware of page views, but the argument presented in the article is constructed around the use of the tracking cookie and that would no longer apply.
I'm shifting to server-push to restore completeness; I'm presently estimating that client-side GA captures barely 25% of my page views (according to a quick analysis of server logs for a 24hr period). I'm looking for insight into how my site is used rather than the capabilities of the client, so this works for what I want.
I agree. Server-side analytics were actually fairly mature before Google came along. It's just more complicated in some cases, but manageable. The biggest downside these days would be SPAs, since they don't necessarily touch the server in any regular way.
People don't care about the cookie or any of the details of the implementation. They care about being tracked across the whole internet. If you are still contributing to that then you are disrespecting your customers. I hope that I am not one of them.
Except that basically nobody cares about "being tracked across the whole internet", as shown by GA, Facebook, etc being on virtually every popular website and nobody noticing or caring at all. If you care enough to make even the most trivial change in behavior, then you're optimistically 1 in 1000.
I said that I care and that I hope that I am not a customer of businesses that track me and contribute to Google's tracking. In that case I am 1 in 1000, and if a site doesn't work without GA and I don't have to use it (as in, have to in order to file my taxes), then I won't; I'll purchase from a competitor.
EDIT
Most people do notice and do care; this has come up in countless conversations. They just accept it as a necessary evil that they can't do anything about, and (wrongly) accept that as individuals they can't change the world.
You will have no GA cookie from any of my sites; I am not recording client-identifying things or capabilities. It is a server-side push to GA and avoids all client-side interactions.
It is merely, "A page has been viewed, this one: /foo/bar?bash".
There's nothing in there that is tracking you. I'm not even embracing the session management aspect.
I get to use the tool that is best-in-class, in a way that lacks capability to track you.
Without any "client identifying things", how would GA be able to chain several page hits into a session? That is, do a basic visits-vs-hits split.
If you are in fact anonymizing everything about a client as you claim you do, then it won't be able to. Unless, of course, you are feeding GA some opaque client ID that you then internally map to and from actual clients that hit your server. However something tells me that you aren't doing that, or you would've mentioned it already.
(edit) I re-read your comment. You aren't apparently interested in session counts. But what good is the GA summary then if you can't tell 10 bounced visitors from one visitor with 10 hits? This makes no sense. If you want to look at just page hit numbers, there are dramatically simpler ways to do that.
In the test I've done, sending no session/user data over, I lose all sense of a "session".
But I do retain insight into what content has been viewed, how much, what is rising and falling, etc.
The question really is: what info are you really reporting on? AdBlockers make us blind and tracking is horrible, but I get to have a far more complete view of the simple stuff Urchin used to be great at.
Ah, so you are passing some client IDs over the GA after all. An IP address perhaps? You know that's a leading question, right?
Incidentally, I ran a similar experiment with gaug.es a few years ago - pulled on their tracking API from our server side. While it worked as expected, these sorts of shenanigans are good for only one thing - hiding the fact that you are using 3rd party analytics from your visitors.
On a more general note - the thing is that you either care about other people's privacy or you don't. It's not a grayscale, it's binary. And if you do, there's no place for GA in the picture.
I am not passing IP. I am not passing a client-id. I am not passing any kind of correlation identifier from which a session can be inferred or created. I am not passing user-agent information. I am not passing a cookie ID.
I am only passing a page view event. "Page /foo/bar?bash has been viewed".
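For reference, GA's Measurement Protocol formally requires a client id (`cid`), so a server-side hit along these lines would have to send a throwaway one. A hedged sketch of what such a hit might look like; the tracking id is a placeholder and the function is illustrative, not the commenter's actual code:

```python
import os
import urllib.parse

def pageview_payload(tracking_id, page_path):
    """Build a GA Measurement Protocol v1 pageview hit. The protocol
    requires a client id (cid); generating a fresh random one per hit
    means no stable identifier ever leaves the server."""
    params = {
        "v": "1",                     # protocol version
        "tid": tracking_id,           # property id, e.g. "UA-XXXXX-Y"
        "cid": os.urandom(16).hex(),  # throwaway client id, new every hit
        "t": "pageview",
        "dp": page_path,              # the only real data: the page viewed
    }
    return urllib.parse.urlencode(params)

# POST this body to https://www.google-analytics.com/collect
payload = pageview_payload("UA-XXXXX-Y", "/foo/bar?bash")
print("dp=%2Ffoo%2Fbar%3Fbash" in payload)  # True
```

With a random `cid` per hit, GA cannot chain hits into sessions, which is consistent with the trade-off discussed above.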
But isn't that the same kind of data you could extract from Apache logs? What you describe is basically a log of all your requests.
GA has many uses: mainly to follow the user and see the funnel they go through, and second to monitor marketing campaigns. If you don't need these, then Apache logs + Webalizer is perfect for everyone.
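Indeed, plain page-view counts come straight out of the access log. A rough sketch assuming the default common/combined log format; the sample log lines are made up:

```shell
# Count page views per URL from an Apache access log; field 7 ($7) is the
# request path in the default common/combined log format.
cat > /tmp/sample_access.log <<'EOF'
1.2.3.4 - - [01/Jan/2016:00:00:01 +0000] "GET /foo HTTP/1.1" 200 123
1.2.3.4 - - [01/Jan/2016:00:00:02 +0000] "GET /bar HTTP/1.1" 200 456
5.6.7.8 - - [01/Jan/2016:00:00:03 +0000] "GET /foo HTTP/1.1" 200 123
EOF
awk '{ print $7 }' /tmp/sample_access.log | sort | uniq -c | sort -rn
```

This is essentially what log-based tools like Webalizer automate, with bots and asset requests filtered out on top.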
I persist with GA, because every now and then I work with partners who would like to verify the activity on my websites (and yes my user agreements and privacy policy allow this) and have a means to compare this with historical data or data from other sites.
Those partners frustrate me, in that they won't trust me to provide stats generated from server logs, but they all trust GA by default.
This technique allows me to use GA, produce the view of the content they need, export the PDF, and share that... and they trust it.
GA is the de facto store of trusted data when it comes to web site activity. For my sites that is tracking content page views.
This whole conversation started with you asking why abandon GA when you can use it without compromising clients' privacy. The exchange that followed shows that one can't actually derive the same function from GA that way, or indeed virtually any function at all. Yes, you can feed data in, but the usefulness of what you can get back out is next to zero. What am I missing?
From your opening comment:
> Why not move to push GA data server-side?
Because it renders GA largely useless if clients' privacy is actually observed.
> I am only passing a page view event. "Page /foo/bar?bash has been viewed".
I would like to say, as someone extremely hostile to tracking of any kind, that if this is all you're sending to Google, that sounds perfectly fine from a privacy perspective. (Google gets your information, but that's between you and Google.)
Thank you for choosing a method that respects the privacy of your readers.
> (edit) I re-read your comment. You aren't apparently interested in session counts. But what good is the GA summary then if you can't tell 10 bounced visitors from one visitor with 10 hits? This makes no sense. If you want to look at just page hit numbers, there are dramatically simpler ways to do that.
I do not care to track users/sessions, page views are enough for me. I am tracking content and content views... and I get this big tool that is awesome at slicing data and presenting trend information... for free.
The only issue I can see with this is a lot of HTTPS connections to your analytics platform from your web service. If you choose to use a work queue/proxy to do it, that's additional work, another point of failure, etc. It's not as 'simple' as adding a JS snippet at the bottom of your page.
How about open-sourcing your product before worrying about improving other products? SpiderOak has been "investigating a number of licensing options, and do expect to make the SpiderOak client code open source in the not-distant future" for a very, very long time now. It's no trivial thing to have a closed source client for a "zero knowledge" service.
I came here for this exact thing. They said they were going to go open source in 2014 IIRC, and failed to deliver. I have stopped using SpiderOak - how am I supposed to trust them with my most private files when I can't verify that they're not doing anything shady on my machine?
The opening line of this post is amusing. They ought to give thought to fixing their core product first.
I am also concerned about that. That message has been there unchanged for some time now. To be fair, there's a lot of stuff on the GitHub page, including the Android client under an Apache license. Although as far as I can tell, the desktop client is not there yet.
The other thing is that Google Analytics is on many adblockers' lists, precisely for that reason. As adblockers become widespread, the analytics are going blind.
I've been running a blocker to block GA and other junk on my PC, but I imagine I'm in a statistically insignificant minority. And I still can't block them on my iPhone unless I disable JavaScript entirely (though I'm running iOS 9, I'm not able to install a blocker for some reason; I guess Apple arbitrarily doesn't support them on my older iPhone model).
Ah, is that the differentiator? I see. Still strikes me as somewhat arbitrary, though - is content blocking such a strenuous task that it requires a 64-bit CPU? Wouldn't using a blocker cause the CPU to do less work in most cases since it doesn't have to download so many ad media files or execute as much JavaScript?
Yeah, I guess it's just time to get a friggin' new phone already, but this one ain't broke yet, ya know?
If anyone is looking for a good blocker for stuff like this, I recommend ghostery. I set it to block everything by default, and whitelist the few things I want. It doesn't block scripts served by the site you are on, so it doesn't totally break your browsing experience, like others do.
If your device is jailbroken (not sure if there's a jailbreak for iOS 9), you could add entries for GA to its hosts file. I use these on my desktop PC:
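Typical hosts-file entries for blocking GA look like this; these are the commonly blocked hostnames, and the commenter's actual list may have differed:

```
0.0.0.0 google-analytics.com
0.0.0.0 www.google-analytics.com
0.0.0.0 ssl.google-analytics.com
0.0.0.0 stats.g.doubleclick.net
```

Pointing the hostnames at `0.0.0.0` makes the tracker lookups fail fast without relying on an in-browser blocker.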
Eh. The analytics data is pretty low value as far as hacker targets, and this can be mostly mitigated anyways by sane segregation of the admin backend from the publicly accessible site.
There's an open ticket for it, but it looks like it hasn't been addressed in a while since they don't want to break all existing passwords.
A low value target maybe, but having a critical security ticket open for seven years is unforgivable. If they don't want to break compatibility it's pretty simple: use something like PHPass and upgrade the hash when the user next logs in. i.e. what every halfway sensible web app did at least five years ago.
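A minimal sketch of that upgrade-on-login pattern, using the standard library's PBKDF2 in place of PHPass; the function names, storage format, and the unsalted-MD5 legacy scheme are assumptions for illustration, not Piwik's actual code:

```python
import hashlib
import hmac
import os

def pbkdf2_hash(password, salt=None):
    """Produce a salted, slow hash in a self-describing format."""
    salt = salt or os.urandom(16)
    dk = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return "pbkdf2$" + salt.hex() + "$" + dk.hex()

def verify_and_upgrade(password, stored, save):
    """Accept either a legacy MD5 hash or a PBKDF2 hash; on a successful
    legacy login, rehash with PBKDF2 and persist the stronger hash."""
    if stored.startswith("pbkdf2$"):
        _, salt_hex, dk_hex = stored.split("$")
        dk = hashlib.pbkdf2_hmac("sha256", password.encode(),
                                 bytes.fromhex(salt_hex), 100_000)
        return hmac.compare_digest(dk.hex(), dk_hex)
    # Legacy path: unsalted MD5, as in old installs.
    if hmac.compare_digest(hashlib.md5(password.encode()).hexdigest(), stored):
        save(pbkdf2_hash(password))  # transparent upgrade, no password reset
        return True
    return False

# Usage: a legacy user logs in once and is silently upgraded.
db = {"hash": hashlib.md5(b"hunter2").hexdigest()}
ok = verify_and_upgrade("hunter2", db["hash"], lambda h: db.update(hash=h))
print(ok, db["hash"].startswith("pbkdf2$"))  # True True
```

No existing password breaks: old hashes keep verifying until their owners log in, at which point the plaintext is available to rehash.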
I'm not interested in further dehumanizing myself with participation in a bug bounty program.
I'll write an exploit for it (the general case, not just Piwik in particular) and drop it on OSS Sec some day, but here's a theoretical attack:
1. Guess a username somehow. Maybe "admin"? Whatever, we're interested in the security of the hash function. Let's assume we have the username for our target.
2. Calculate a bunch of guess passwords, such that we have one hash output for each possible value for the first N hexits.
3. Send these guess passwords repeatedly and use timing information to make an educated guess at the leading hexits of the stored MD5 hash.
4. Iterate steps 2 and 3 until you have the first N bytes of the MD5 hash for the password.
5. Use offline methods to generate password guesses against a partial hash.
The end result: A timing attack that consequently allows an optimized offline guess. So even if their entire codebase is immune to SQL injection, you can still launch a semi-blind cracking attempt against them.
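The attack hinges on a comparison that short-circuits at the first mismatching byte. A hedged sketch of the vulnerable pattern and the standard fix; this is illustrative, not Piwik's actual code:

```python
import hmac

def naive_equal(a, b):
    # Returns at the first mismatching byte, so comparison time leaks
    # how long the matching prefix is - the signal the timing attack
    # above accumulates over repeated requests.
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True

def constant_time_equal(a, b):
    # Examines every byte regardless of where the first mismatch falls,
    # so timing reveals nothing about the prefix.
    return hmac.compare_digest(a, b)

print(naive_equal(b"deadbeef", b"deadbee0"))          # False
print(constant_time_equal(b"deadbeef", b"deadbeef"))  # True
```

Constant-time comparison closes the timing channel, but it does nothing about unsalted fast hashes; both fixes are needed.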
It's nowhere near as centralized as Google Analytics though - at least if you're self-hosting, that data is confined to the silo of your own analytics, rather than Google being able to aggregate it with a user's behaviour on every other site they visit.
That silo is still aggregating data. Trying to argue it's "less" centralized by quantifying the amount of centralization is still cognitive dissonance. Clearly people here don't agree with this, but that's to be expected when the topic is so polarizing. Traffic analytics must be important, so we rationalize our actions, or inactions, around how we collect them.
Any centralized solution, at any scale, can possibly violate someone's privacy. Period. If we want to really fix things, we should stop circle jerking ourselves and do something about it.
Not at all. The entire point is that Google is able to track one person across many, many sites. That is simply not possible if each site had its own self-hosted analytics.
It's more than just the tracking cookie, though. It's also about Google aggregating all its website data into a unified profile. The data they have on everyone is frightening—all because of free services like GA.
Yes, thank you SpiderOak, even though I don't use you: high-profile companies quitting GA makes us aware of alternative solutions. Today, I've learned about http://piwik.org .
SpiderOak user here. I stopped using Dropbox and started using SpiderOak about 18 months ago. I really like the product. It's not as good as Dropbox in some ways (like automatically syncing photos from my phone) but it really is easy to use. I still have a mobile client on Android and I can keep my files in sync across multiple computers. I pay for the larger storage size and I'm not even close to using it all.
It syncs fast too. Just thought I'd share my experience with people.
It is. It's no big deal to stop using Google Analytics. It is, however, a big deal not to use Google Search, something I am considering for my company.
Well, the title says they stopped using Google analytics, and the article explained that they stopped using Google analytics, why they did it, and what they're doing instead. You may not find it interesting, but the title clearly reflects the content, so I'm not sure how it's click bait.
> Like lots of other companies with high traffic websites, we are a technology company; one with a deep team of software developer expertise. It took us only a few weeks to write our home-brew analytics package.
I'm a little curious why they decided to go this route instead of using one of the open-source solutions. Aren't there good solutions to this problem already?
I was curious as well and just assumed the usual NIH (not invented here) syndrome. Web analytics was already mature before Google bought Urchin and turned it into Google Analytics. Since that time countless open source projects have sprung up (Piwik was the first that came to mind). Googling for open source alternatives brings up thousands of pages of projects.
Writing your own is easy for the basic stuff. When you want to move beyond the basics, as SpiderOak will find, it becomes much more difficult.
I'm doing my part. I'm moving to DuckDuckGo for searching more and more. It's a process. Google does have better results. For work I still rely on Google, for private stuff I use https://duckduckgo.com/
And for the sake of ducks, I'm eating less meat as well. No more chicken - too many antibiotics - and as little meat as possible, only when it's worth it: great taste and good quality.
I'm a big DDG fan too. I don't really notice their results being "worse" than Google's (but maybe that's just because I haven't used Google for so long). The Bang feature is also very handy once you get in the habit of remembering to use it. https://duckduckgo.com/bang
Do you also experience slower response times at DDG?
Here in Europe 'ping -c 5' gives an average of about 10ms for google.com and 30ms for duckduckgo.com. Since search is such a fundamental part of browsing, this is very noticeable.
> Sadly, we didn’t like the answer to that question. “Yes, by using Google Analytics, we are furthering the erosion of privacy on the web.”
The only thing "wrong" with using an analytics service to better understand your customers is that it places all knowledge of visits, including ones that wished to be private, in a centralized location. This can be useful in providing correlation data across all visitors in aggregate, such as which browser you should make sure your site supports most of the time.
In other words, there exists some data in aggregate that is valuable to all of us, but the cost is a loss of privacy for smaller sets of personal data.
If individuals don't want certain behaviors analyzed by others, then they shouldn't use centralized services which exist outside their realm of control. These individuals would be better off using a "website" that is hosted by themselves, inside their own four walls, running on their own equipment. A simple way for SpiderOak to address this is to put their website on IPFS or something similar.
I appreciate the fact that SpiderOak is thinking about these things. It's important!
>why does Google and their advertisers need to know about it I would ask
Google is pretty clear about this. The only reason they track you is for advertising, and there isn't any evidence of them using the info for anything else. In fact there is a lot of evidence pointing the other way, such as their insistence on encrypting data flowing between their datacenters.
This is Google we are talking about, not Kazakhstan, China or Russia.
Google could eventually use this information to determine your eligibility for a home loan. They have already dipped their toes in this area [1]. With all this data, we have to ensure that it is used fairly (or not at all). There is enough concern about digital redlining that a 2014 report to the White House covers it [2]. As we know, machine learning is quite capable of inferring sensitive attributes [3].
This inference doesn't even need to be intentional; machine learning is capable of accidentally picking up on latent variables. Even if your neighborhood (the basis of the original redlining) isn't a feature in the model, it could be inferred from the other variables.
TL;DR: Your surfing behavior could be used to deny you a home loan one day.
Lots of speculation about what they might do. You could also say that the US government could use all of your data to spy on people who criticise the government, so they shouldn't have any of that data either.
Their other purpose is "don't be evil". There may be some debate about that at times, but they certainly aren't going to screw their customers. They know that their customer base would evaporate pretty quickly if they tried.
> It took us only a few weeks to write our home-brew analytics package.
Unfortunately, there's no way to replicate what Google Analytics currently offers (for free!) within a couple of weeks (or even months). Not with big data sets. Yes, GA enforces sampling if you don't pay for GA Premium, but the free edition is still one hell of a deal (if you don't care about privacy).
If you only use Google Analytics as a hit counter, sure, you can do that yourself within a couple of minutes. The advanced features are way more complicated, though (think segmentation and custom reports).
I suspect most of the people saying "you don't need Google Analytics! Do it yourself!" have never used GA for anything that meaningful. As you begin to really familiarize yourself with your website traffic and understand how to look at your clickstream data in a more investigative and analytical way, you'll start to see how nice GA is and how easy it is to answer your questions.
You also underestimate how ubiquitous GA is because it's free and extremely popular. I'd consider myself an intermediate to advanced user of GA, but for people less experienced, I can easily share stuff with them for complicated tasks or they know how to do a lot of the basics themselves.
In hiring digital marketing people, GA is pretty much on par with Word in terms of familiarity. It's something a lot of people have a basic competence with.
To me, it is the cost that matters. Most other analytics services cost $30 - $50 per million pageviews/datapoints. To me this is expensive. Even when you scale to 100M it will still cost ~$20/million.
Piwik doesn't scale. At least, it doesn't scale unless you spend lots of resources tinkering with it. Its cloud edition is even more expensive than GoSquared, which I consider to be a much better product.
What we basically need is a simple, effective, and cheap enough alternative to GA. And so far there are simply none.
Instead of rolling your own look at Piwik. It works very well and is basically a GA clone. I actually like it better than GA in some ways. It's easy to set up and you can run it on your own site so you're not contributing to a global tracking fabric.
I don't get it. SpiderOak states that they dropped GA because it furthers "the erosion of privacy on the web", but then they just started tracking in-house.
How is tracking in house more private than GA? The user is still being tracked.
I believe their point was that they still want to track their traffic, but when they use a third party like Google, Google doesn't just provide tracking services for SpiderOak - it also tracks you for its own purposes, which SpiderOak has no control over.
With it in house it is under their control: they can anonymize it, they can decline to collect certain information, and they can't cross-index it with your traffic from other sites, etc.
I haven't checked my GA in months, since it became clear that Google won't bother doing anything to fix the referrer spam problem that makes the stats useless if you don't have a high-volume site. It's not like these abusers are hard to track down, but I'll be damned if I'm going to manually add filters to get rid of them every time they come in from a new domain.
For anyone here looking for a really good, free, self-hosted, hackable, open source alternative to Google Analytics that's been around for a long time, please consider Piwik.org.
I've been using it for prob 8-10 years and it has never missed a beat. I use it on all my personal / business sites as well as some client websites that are super high traffic.
Analytics, fonts, CSS - we include them everywhere by default. Then I realized, hey, we are all giving away too much. My sites have happily run self-hosted Piwik for the last six or so months.
I won't be surprised if in the coming years we hear much more about Google Fonts being used as a basis for counting site accesses where no analytics is in place.
It should also be noted that SpiderOak has open-sourced many components of their product stack, including Crypton, the encryption framework underpinning many of their clients.
Usually I start with Google Analytics but continue to add to our own in-house analytics solution targeting the specific metrics we're interested in tracking. GA often doesn't provide us with the real insights we're looking for, but it's good for the vanity stats.
Random fact: the GA cookie is distinct from the AdWords (google.com) cookie, and it is illegal for Google to join them (not sure if it is even technically feasible).
At Cloudron, our vision is to let companies host their own apps easily. We dogfood this and don't use Google. We don't use analytics on our website (a conscious decision). Our email is based on IMAP servers and we use Thunderbird. We self-host everything other than email (which is on Gandi).
Cloudron and Sandstorm are similar projects. I think the main difference is the user experience (also how we handle domains, how apps are packaged, etc). You can see a demo of the Cloudron here - https://my-demo.cloudron.me/ (username: cloudron, password: cloudron). All apps use the same credentials (because of single sign-on).
How exactly would they stop themselves from being listed on Google? A right to be forgotten request? Their decision was to ditch Google Analytics, not to disappear from the web.
Robots.txt. I think it's interesting that you say removing yourself from Google is the same as "disappearing from the web", implying that the web is Google. It's not. Perhaps they should use alternative forms of outreach, something I am considering with my company.
Google is not the web, you are correct; but for all intents and purposes, it's the web's phonebook. Remove yourself from the phonebook and you make it very difficult for people to find you or your business.
There are other ways to promote a business outside of search engines. The point is you can't say you have "ditched Google" while still being part of their systems that collect user data.
For a less obnoxious answer: if users choose to use Google, it's not Spideroak's problem. By dropping GA, they're no longer forcing users to be subject to Google's tracking.
How does being listed on a search engine affect their users’ privacy? The users chose to use Google themselves, they could well use something like DuckDuckGo if they wanted to. They’re avoiding 3rd party tracking on their users, not boycotting Google.
I don't understand your complaint. 1) They freely admit that they use Google products, and 2) Google search is a way to be discovered by curious users. SpiderOak is looking to avoid giving excessive user data to Google, which is a totally separate issue.
The irony for me is that I am mostly invisible to Google Analytics, and thus the companies that rely on GA, because I mostly browse with JavaScript disabled. (When I need JavaScript, I usually crank up an incognito instance of Chrome and close it immediately when done, so I'm mostly anonymous to GA even then.)
When they go "old fashioned" and datamine their web server logs, they uncloak me. :-/