Moving The New York Times Games Platform to Google App Engine (nytimes.com)
297 points by spyspy on Aug 23, 2017 | 143 comments



Two things keep coming up while comparing GCP and AWS:

* This accomplishment would not have been possible for our three-person team of engineers without Google Cloud (AWS is too low level, hard to work with and does not scale well).

* We’ve also managed to cut our infrastructure costs in half during this time period (per-minute billing, seamless autoscaling, performance, sustained-usage discounts, ...).


This thread came at the right time. I spent the whole day today in DynamoDB training. Honestly, the one thing I took away is that its cost is based on reads and writes per second. Regardless of the amount of data per operation (whether it's 1 byte or 100 bytes), you're always charged for a full 1 KB. So, as a workaround, what they suggested is using Kinesis, a Lambda and another service to batch the write operations so that each write is always near 1 KB. He pitched it like that's the perfect way to do it. The problem I see is too many moving pieces for a simple thing to achieve. If the Dynamo team charged based on the actual data, we'd be all set.
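
If I understood the pitch right, the Lambda end of that pipeline would look roughly like this (a minimal Python/boto3 sketch of the batching idea, not their reference code; the table name is made up):

    import base64, json, boto3

    table = boto3.resource('dynamodb').Table('GAME_EVENTS')  # hypothetical table

    def handler(event, context):
        # Pack the many small Kinesis records into one item, so each write
        # consumes close to a full 1 KB write capacity unit.
        records = [json.loads(base64.b64decode(r['kinesis']['data']))
                   for r in event['Records']]
        table.put_item(Item={'batch_id': context.aws_request_id,
                             'payload': json.dumps(records)})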


Yep, same problems with many other services:

* Kinesis Streams: Writes limited to 1K records/sec and 1 MB/sec per shard, reads limited to 2 MB/sec per shard. Want a different read/write ratio? Nope, not possible. Proposed solution: use more shards (back-of-envelope math after this list). It does not scale automatically. There is another service, Kinesis Firehose, that does not offer read access to streaming data.

* EFS: Cold-start problems. If you have a small amount of data in EFS, reads and writes are throttled. Ran into some serious issues due to write throttling.

* ECS: Two containers cannot use the same port on the same node. An anti-pattern for containers.
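
To make the fixed-ratio point concrete, the shard math works out like this (workload numbers invented):

    # Per-shard quotas: 1 MB/s in, 2 MB/s out. Shard count is driven by
    # whichever side of your (fixed) ratio is hungrier.
    write_mb_per_s, read_mb_per_s = 10, 100
    shards_for_writes = write_mb_per_s / 1.0
    shards_for_reads = read_mb_per_s / 2.0
    print(max(shards_for_writes, shards_for_reads))  # 50 shards, all for the reads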

AWS services have lots of strings attached and minimums for usage and billing. Building such services (based on fixed quotas) is much easier than building services that are billed purely pay-per-use. This complexity, plus the cost-optimization pressure it creates, requires more human resources and time as well. AWS got a good lead in the cloud space, but they need to improve their services without letting them rot.


Totally agree. The solution to overcome those shortcomings in AWS is to sort of put band-aids on them with more services (at least that's their suggestion). I do understand it's not feasible to provide a service that fits everyone, but it would be good if they solved the fundamental problem.

One more to add to the list.

In DynamoDB, during peak (or rush hour) you can scale up, which increases the underlying replicas (or partitions) to keep reads smooth. However, after the rush hour there is no way to drop those additional resources. Maybe someone can correct me if I'm wrong.


You can scale down, but it is limited.

You start each 24-hour period with 4 chances to scale down, and after those are depleted you can scale down once every 4 hours regardless.


You can scale DynamoDB down the same way you scale it up.

Perhaps you're thinking of some kind of autoscaling that only works in one direction?


Thanks. Not the autoscaling part. I thought that even if you scale up manually with new replicas, you can't scale down. I should read the manual and get a clearer picture.


> * ECS: Two containers cannot use the same port on the same node. An anti-pattern for containers.

Could you elaborate on this? I'm not sure I understand: are you saying that 2 containers cannot be mapped to the same host port? Because that would seem normal; you can't bind to a port where there's already something listening. But I guess I must be missing something.


The OP is talking about how, when using a classic load balancer in AWS, your containers will all be deployed exposing the same port, kind of like running "docker run -p 5000:5000" on each EC2 instance in your cluster. Once the port is in use, you can't deploy another copy of that container on the same EC2 node.

The solution is to use AWS's Application Load Balancers instead, which allow you to dynamically allocate ports for your containers and route traffic to them as ECS Services.
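
Concretely, the trick is a hostPort of 0 in the task definition, which tells ECS to pick an ephemeral host port that the ALB target group then tracks. A rough boto3 sketch (family and image names made up):

    import boto3

    ecs = boto3.client('ecs')
    ecs.register_task_definition(
        family='web',
        containerDefinitions=[{
            'name': 'web',
            'image': 'myorg/web:latest',
            'memory': 256,
            # hostPort 0 = dynamic port mapping, so many copies of this
            # container can share one EC2 node behind an ALB.
            'portMappings': [{'containerPort': 5000, 'hostPort': 0}],
        }])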


I'm not familiar with the details of AWS here, but maybe the OP means mapping two different host ports to the same port on two different containers? That's all I can imagine that would be a container anti-pattern in the way described.


That is perfectly possible with ECS, so I don't know what the OP was referring to. The thing I remember, though, is that you have to jump through a lot of hoops, like making 4 API calls (or worse, with pagination) for what should have been a single call to make such a system work on ECS.


Nowadays you would often run containers with a container network (Flannel, Calico, etc.) that assigns a unique IP per container, thus avoiding conflicting port mappings regardless of how many containers with the same port run on a single host.


Or you have them on a physical private network, each bound to a separate IP, but yes.


> * ECS: Two containers cannot use the same port on the same node. An anti-pattern for containers.

That's solved if you use an Application Load Balancer instead of classic ELB for your ECS service.


>too many moving pieces for a simple thing to achieve

Maybe put all the user's data in a blob with a JSON.stringify type thing and then JSON.parse it when you get it back?
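
Something like this, if you're in Python rather than JS (table and key names invented):

    import json, boto3

    table = boto3.resource('dynamodb').Table('USERS')

    # One write per user, whatever the shape of the data...
    table.put_item(Item={'user_id': 'u123',
                         'blob': json.dumps({'scores': [9, 17], 'streak': 4})})

    # ...and one read to get it all back.
    item = table.get_item(Key={'user_id': 'u123'})['Item']
    data = json.loads(item['blob'])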


> AWS . . . does not scale well

GCP may have some advantages over AWS, but the reverse is also true and it's hard to take what you say seriously when you say something like that.


Adding more context; sorry for missing it in the first place. I mostly work in the big data space. Google Cloud's big data stack is built for streaming / storing / processing / querying / machine learning on internet-scale data (Pub/Sub / Bigtable / Dataflow / BigQuery / Cloud ML). AWS scales to terabyte-level loads, but beyond that it's hard and super costly. Google's services autoscale smoothly to petabyte levels and millions of users (for example BigQuery and the load balancers). On AWS, that requires pre-warming and allocating capacity beforehand, and that costs tons of money. In companies working at that scale, the usual saying is "to keep scaling, keep throwing cash at AWS". This is not a problem with Google.


You need petabyte scale for the NYT Crossword?


Quoting from the article: "This accomplishment would not have been possible for our three-person team of engineers to achieve without the tools and abstractions provided by Google and App Engine."

Talking about the use case from the article: they release the puzzle at 10PM and need to have the infrastructure ready to serve all the requests. On AWS, you need to pre-warm load balancers, increase the quota of your DynamoDB tables, scale up instances so that they can withstand the wall of traffic, ... and then scale down after the traffic. All this takes time, people and money. Add the few other things the author mentioned (monitoring/alerting, local development, combined access and app logging, ...) and it will take focus away from developing great apps and toward building out the infrastructure for them.


I concur.

Currently, I am working on projects that use both Amazon and Google clouds.

In my experience, AWS requires more planning and administration to handle the full workflow: uploading data; organisation in S3; partitioning data sets; compute loads (EMR-bound vs. Redshift-bound vs. Spark (SQL) bound); establishing and monitoring quotas; cost attribution to different internal profit centres; etc.

GCP is - in a few small ways - less fussy to deal with.

Also, the GCP console - itself not very great - is much easier to use and operate than the AWS console.

Of course, YMMV!


Could you please post the URL for the resource and the number of hits it receives? I'm interested in high load websites and I have a hard time picturing how this could lead to petabytes.


The impression I'm getting is not that GCP scales better, but it scales with less fuss - the anecdotes here all suggest that with AWS, once you hit any meaningful load (i.e. gigabytes), you need to start fiddling with stuff.

I don't know if this is actually true, I've never done any serious work in AWS.


Hold on, please do not say Google Cloud scales well. Yes, they do have services that make a ton of claims, but unlike AWS, things don't work as promised, which is magnified by the fact that their support is way worse.

Additionally, BigQuery is far more expensive than Athena, since you have to pay a huge premium on storage.

The biggest difference is that what Amazon provides you is infrastructure, whereas Google provides you a platform. While App Engine is certainly easier to use than Elastic Beanstalk, you have very little control over what is done in the background once you let Google do its thing.


GCP support can sometimes be bad, but these other claims don't add up. What isn't working as promised? BigQuery can do a lot more than Athena, and its storage pricing is the same as or cheaper than S3.

We've used 5 different providers for a global system, and GCP has won on both performance and price. We still use Azure and AWS for some missing functionality, but the core services are much more solid and easy to deal with on GCP, which is also far more than just App Engine.


Totally. Also…

> AWS is too low level

It seems very strange to paint AWS with such a broad brush, considering that AWS has tons of services at various levels of abstraction (including high-level abstractions like Elastic Beanstalk and AWS Lambda).


Sorry for not adding context. I was referring to the use case the author of the article was talking about: running a website. You need to stitch together ELB / EC2 / a database / caching / service splitting / auth / scaling / ... whereas on Google Cloud, App Engine covers most of those points.


Jeez, you're taking a lot of hits. Thanks for the patient answers


Thanks for your kind comment!


As a DevOps consultant, I've actually worked with clients migrating stacks to and from GCE/AWS (yeah, both ways, not the same client).

What I've found in aggregate is that GCE is a bit easier to use at first, as AWS has a LOT of features and terminology to learn. When it comes down to it, though, many GCE services felt really immature, particularly their CloudSQL offering.

One client recently moved from GCE to AWS simply because their CloudSQL (fully replicated with a failover setup, according to GCE recommendations) kept randomly dying for several minutes at a time. After a LOT of back and forth, Google finally admitted that they had updated the replica and the master at the same time, so when it failed over, the replica was also down.

There were other instances of downtime that were never adequately explained, but overall that experience was enough for me (and the client) to totally lose faith in the GCE team's competence. Even getting a serious investigation into intermittent downtime and an explanation took over a month. By that time our migration to AWS was in progress.

GCE never did explain why they would choose to apply updates to the replica and master SQL at the same time, and as far as I know they are still doing this. I asked if we could at least be notified of update events and was told that's not possible.

There were other issues as well that, taken together, just made GCE seem amateurish. I'm sure things will get better as they mature a bit, and it is cheaper, which is why I wouldn't necessarily recommend against them for startups just getting going today. By the time you are really scaling, it's likely they'll have more of the kinks worked out.


(GCP support here) This is a known bug; I've worked at least a few cases where this happened. There is a feature coming out soon that will allow different maintenance schedules to be set for masters/replicas, which will likely be automatically set to different times. And once the kinks get worked out, hopefully we'll be able to re-deploy the feature that shifted traffic to failovers while the master is being updated, eliminating maintenance downtime altogether.


Yeah, Azure is like that too.

Stuff is just really immature and you'll get blocked by lots of things that make no sense.

If you have to get vendor-locked in your quest to not have to manage servers while learning unintuitively misnamed parts of a computer, choose Amazon.


I use Azure all the time (App Services, Storage, Cloud Services, VMs, SQL, CDN, etc.) and almost never run into this issue. Can you share some examples of what you mean?


I use all that stuff and constantly run into it - are you using it in anger?

Here's a few I can remember off the top of my head:

- A (relatively) huge 6ms lag between the website and the DB

- One (random) site will mysteriously max out on memory on app-pool startup and take all the others down

- Their scheduler has no concept of timezones

- Their scheduler uses your local time when setting up the job, but UTC for other parts (this has been an open issue for over a year I think)

- Files will get mysteriously locked in deployments and the deployment process will silently fail

- Deployments will suddenly take an absolute age for no reason

- The entire admin UI will slow to an absolute crawl for hours on end

- Some admin tasks always claim they've failed, even though they've succeeded

- Their API wrapper is just wrong on almost every level

Add on top of that the worst management UI I've ever had to deal with, and it makes Azure very painful to use at times. Someone thought nesting menus in a standardised format was a good idea. It wasn't. Everything is fairly terribly named too. Want to see how your deployment's doing? That's under "Deployment Options".

Performance is also dogshit compared to the cost; my 4-year-old laptop is faster than their "premium" offerings.


No, I'm using it happily. We don't use the scheduler, and we create new deployment slots when deploying (which maybe prevents the locking issues). Sometimes I experience oddness in the Azure portal and have to refresh, but I have never had it slow down. As for SQL latency, it's been insignificant to us, so I'm not sure if what we experience is better or worse than yours.

The portal, I agree, is partly confusing/messy, but we set things up once and then do deploys via CI infrastructure. Even the initial setup we try to automate using PS instead (to make it reproducible).

When it comes to pricing I agree, but my laptop does not do multi-datacenter so well.

I'm not questioning anything you say, of course. Maybe I've gone blind, or don't see the issues as critical as you do, or maybe I'm just more lucky.


Hi, I'm not sure you will read this as it's 10 hours after posting. Cloud SQL updating both copies sounds like a bug. If you want to email me your case number, I can look into it. I know you don't work with GCP anymore, but I'd like to resolve the issue for other users.

Email: tsg@google.com

Disclosure: I work on GCP support. Not paid to be here.


As a database engineer working for one of the largest e-commerce companies in the market, I can clearly see your point. AWS RDS is definitely very mature compared to CloudSQL. I think CloudSQL only provides MySQL and Postgres (still in beta?), so GCE needs to build out their database arsenal soon.

Next, your client faced issues with replication in GCE; that's not good to hear, but we face issues in our AWS RDS MySQL and Aurora very frequently: RDS MySQL error logs not generated properly, Aurora having weird memory leaks, connection spikes, behaving sporadically when memory crosses 80%, and so on. We are still working with AWS to figure out the issue (credit to AWS Support for trying to help us). So, to conclude: whether you are on AWS or GCE, this is the trade-off of "cloud". We need to live with that if we're moving to the cloud!


Cloud Postgres is also hilariously hard-limited to 100 simultaneous connections (the default). It doesn't matter how much RAM you give it.

My experience with GCP in the past 4 months has led me to revise my "friends don't let friends use App Engine" motto to "friends don't let friends use Google Cloud". There isn't a single service I touched (except maybe Compute Engine) that didn't have half-baked client libraries, documentation gaps, bugs in the server part, or a complete failure by Google to even have their engineers use the competition's tooling before inventing their own shitty clone (DNS).


I was super keen to switch to GCP (for cost savings etc.) but this mirrors a lot of my experiences. Deploys to App Engine took 20 minutes, not 2 minutes, and I have absolutely no faith in their firewall settings actually working. I have no idea what the problem is, but it's basically impossible to boot a Rancher master node on Compute Engine, even with all ports open. In the end I just bailed on the platform as a whole, and I'm moving to a hybrid approach on smaller providers like Packet.net and Digital Ocean.

And I would have been totally fucked over by that Postgres connection limit when we went into production; I'm glad I dodged that bullet! I hadn't bumped into it when playing with dev environments, and I haven't seen that limit mentioned anywhere.


App Engine Flex takes a long time to deploy, and always has. App Engine Standard is what deploys quickly, and it also scales quicker.

Firewall settings work just fine for our platinum clients with complex network architectures; I don't see why they wouldn't in your case unless something was misconfigured.


Isn't Flex the newer of the two platforms? Is there a reason why it's so slow to deploy? I deployed an app via it that deploys in a couple of minutes anywhere else, including build time, but it took an insane amount of time on GAE, and I never managed to find a good reason why.

Normally I'd think I had configured something wrong, except in this case it was insanely simple: a network label that allows all ports, both ingress and egress, to any destination/source, definitely applied to the servers, and yet they had constant connection issues with each other.

It probably was something I did, but the combination of those issues, surprise egress bills, a very laggy UI, and various other little niggles just made it not worth my time for now. I'm keen to avoid vendor lock-in anyway, so GCP and AWS don't have that many extra features over smaller providers for me.


Why CloudSQL instead of Cloud Spanner? For existing SQL workloads I can understand it, but for new services I'd favor Spanner over DynamoDB.


Cloud Spanner is unnecessarily expensive if you don't have the kind of performance requirements it was designed for.


It looks like Spanner is a relational database, while DDB is just a key-value store. So is it fair to compare them, or am I missing something?


Why wouldn't it be "fair"?


Because a key-value store is a fundamentally simpler data structure (it's a hash) than a relational database, which tracks the relations between different data types. If you make advanced use of a key-value store, you end up with a lot of logic in the application (for example, key management and cascade operations between related data) which a relational database would do for you. It's not a fair comparison because there is a development cost to using the key-value store that you are ignoring.


Dynamo is not a k-v store.


Apart from the points @dullgiulio mentioned:

DDB -> NoSQL, no automatic backups, no support for ad-hoc querying, eventual consistency (though you can opt into strong consistency with a few tradeoffs).

Spanner -> RDBMS, automatic backups, enriched SQL, strong consistency.

Let me know if you still think it's fair to compare these 2 databases.


Hey community, let me share my experience with AppEngine. I work at a small firm where we've developed a massive software application comprising 12 medium-sized apps. I went with Phoenix 1.3 with the new umbrella architecture.

With AppEngine, the beauty is that you can have many custom-named microservices under one AppEngine project, and each microservice can have many versions. You can even decide what percentage of traffic should be split between the versions of each service.
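
For the curious, the split is a single Admin API call; here's a sketch with google-api-python-client (project, service and version IDs are made up, and I believe the gcloud CLI's "app services set-traffic" does the same thing):

    from googleapiclient import discovery

    appengine = discovery.build('appengine', 'v1')
    appengine.apps().services().patch(
        appsId='my-project', servicesId='default',
        updateMask='split',
        body={'split': {'allocations': {'v1': 0.9, 'v2': 0.1}}},
    ).execute()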

What's awesome is that, in addition to the standard runtimes (Ruby, Python, Go, Java, etc.), Google also provides custom runtimes for AppEngine, meaning you can push Docker-based setups into your AppEngine service with basically any stack you want. This alone is a HUGE incentive to move to AppEngine, because usually a custom stack will require you to maintain the server side of things, but with Docker + AppEngine, zero devops. Their network panel also makes it very intuitive to add/delete rules to keep your app secure.

I've been using AppEngine for over 4 years now, and every time I tried a competing offering (such as AWS Elastic Beanstalk, for example) I've only been disappointed.

AppEngine is great for startups. For example, a lesser-known feature within AppEngine is their real-time image processing service API. This allows you to scale/crop/resize images in real time, and the service is offered free of charge (except for storage).

Works really well for web applications with basic image manipulation requirements.

https://cloud.google.com/appengine/docs/standard/python/imag...

The best part is, you call your image with specific parameters that do transformations on the fly. For example, <image url>=s120 will return a 120px image. Appending -c will give you a cropped version, etc.
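
In code it's one call to mint the base URL (a sketch for the Python standard environment; the blob key is whatever your Blobstore/GCS upload handed you):

    from google.appengine.api import images

    def serving_urls(blob_key):
        base = images.get_serving_url(blob_key)
        return base + '=s120', base + '=s120-c'  # 120px; 120px cropped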

I really hope to see AppEngine get more love from startups, as it's a brilliant platform, much more performant than its competitors' offerings. For example, I was previously a huge proponent of Heroku, and upon comparing numbers I realized AppEngine is way more performant (in my use case). I'm so glad we made the switch.

If you're considering a move to AppEngine, let me know here and I'll try my best to answer your questions.


Full disclosure: I DON'T work with Google and I DON'T sell their services/products. I run a startup myself, with a SaaS on top of AppEngine. This is just my documented (positive) experience with the stack above. I get paid nothing by Google, neither in $ nor credits; this is just my own personal documented experience.

As a long-time HN member, I would responsibly disclose if I were somehow affiliated with Google (trust me, I wish I was).


The way you've delivered this "experience" makes you sound like you either work for Google or were asked to make a sponsored statement - for credits or $.


Accusations of astroturfing or shilling are against the HN guidelines, so please don't comment like this here.

https://news.ycombinator.com/newsguidelines.html

The reason is that the accusation is false orders of magnitude more often than it is true—because people falsely assume someone else can't possibly be holding an opposing view in good faith—and false accusations damage the community.


Or a passionate user who had a great experience?! I love finding solutions which require less 'square peg, round hole'. Unfortunately, that's rare these days when piecing together a stack with the myriad of platforms/frameworks/etc.


I'm usually not skeptical of comments but this comment definitely feels "artificial".

I think Google has better things to do than to pay people to comment on HN, but I do think this person is trying too hard to sell us on Google Cloud because they like it (which isn't a bad thing per se).

Edit: I thought about it, and they probably aren't affiliated; they're probably just really enthusiastic about it (a good thing), but they want to sell us on it (eh, not sure how I feel about that).


Yep - I guess I'm a bit empathetic to the comment. I'm always trying to sell what I'm using to others, to get more into that camp, to generate more discussion and innovation. But it's all just like ice cream [1] anyway.

1- https://twitter.com/adamlaz/status/900621343347146752


My bad, I've added the disclosure.


The way you've delivered this "comment" makes you sound like you either work for AWS or were asked to make a sponsored reply - for credits or $.


I would hope someone on HN would disclose a conflict of interest when presenting their opinion.

Disclosure: Work for one of the cloud providers, but not on cloud itself.


Sadly, this is rarely true. Many of the people defending AMP, for example, turn out to be Google employees who didn't disclose it (found via their comment history), and the people who complain about AMP are not just privacy activists but often also involved with publishers or advertisers that lose money from it.

I doubt it would be much different in this thread.


I migrated a big data stack to GCP from AWS. Reasons: GCP has better documentation, the AWS console and various services confuse the heck out of me (I guess I'm getting too old), and the security integration between GCP services saves a huge amount of time. It's super easy and very fast to use the Google Compute Engine VMs. Given that the company I work for uses G Suite, it's a piece of cake to implement SSO and other integration pieces. It's also cheaper for us than AWS, and more performant.


Wait until GCP gets older, with BigQuery and the other services maturing. Also, GCP does not have many regions ready, and it takes forever to get a new region. I've been waiting for the Mumbai region for more than 3 to 4 months now.

But one thing I like about GCP is that it allows you to set a cost limit and ensures you won't cross it. AWS can give alerts, but say for some reason your whole team is in one location, there is an emergency like a flood, and you don't check email - then you are done. I stopped using AWS after I learned that there is simply no way to set a limit. Waiting for GCP to open their Mumbai region. Sigh.

Also, AWS is very deceiving with the free tier: there is simply no way to understand which products are free, and the worst case is that after the free tier ends, you will get charged.


AWS doesn't add a new region every 3-4 months either. Adding a new region is very complicated. Normally the vendor does not actually build the DC themselves; they source from other data centers in the region whenever possible. Building a new DC is not something that can be taken lightly. And then, finally, there are local laws.


"Due to the inelastic architecture of our AWS system, we needed to have the systems scaled up to handle our peak traffic at 10PM when the daily puzzle is published."

WT... I had to reread this to make sure I didn't misunderstand... why not work on making the current architecture elastic?! #cloudPorn


The "inelastic" might have been a shot at AWS. When pressed, the AWS people do use phrases like "pre-warming", "over provisioning" and "advance notice" around their ELB/ALB setup and ECS.

Google's cloud salespeople pitch that they don't require any of that.


I think it depends on the nature of your spike.

AppEngine instances can typically start in 30 seconds or so. So if your spike is because your video went viral on Facebook and lots of people are looking at it, that's fine.

If your spike is because you have 10 million clients with an app set to do an HTTP request at exactly 10:00:00pm, and they all arrive within a quarter second, that's a problem.


30 seconds is actually a gross overestimate for AE startup. Your description of our problem is very accurate, though. We still need to over-provision just before the spike, which we do with a cron that scales up to several hundred instances 5 minutes before 10pm and then back down to normal levels 5 minutes after. Autoscaling takes care of the rest.
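
For the curious, a cron like that can be tiny. This isn't our actual code, just a sketch of the shape (assuming a manually scaled service; the instance count and service name here are invented):

    # cron.yaml:
    #   - description: pre-scale before the 10pm puzzle release
    #     url: /tasks/scale-up
    #     schedule: every day 21:55
    import webapp2
    from google.appengine.api import modules

    class ScaleUp(webapp2.RequestHandler):
        def get(self):
            modules.set_num_instances(500, module='puzzle-api')

    app = webapp2.WSGIApplication([('/tasks/scale-up', ScaleUp)])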


It's funny they'd mention that, and then later in the same article say they had to use a cron job to scale up their GCP solution for a daily spike.


Using a cron to scale an AE service to 500 instances 5 minutes before 10PM > bugging the infra team to bring up our 19th MySQL replica.


I'm curious whether the need for pre-warming ELB/ALB still applies. Last time this came up, an AWS employee mentioned it is no longer necessary (https://news.ycombinator.com/item?id=14052079), but it would be nice if this were documented.


I don't want to dox myself, but about a year ago, when my employer forgot to notify AWS about switching our production traffic (about 5K rps at that time) from one ELB to another, we failed requests for several minutes before we decided to just switch back to the old ELB, then asked them to do a pre-warming before we switched again.


Possible. It is still in the documentation: https://aws.amazon.com/articles/1636185810492479#pre-warming

The "advance notice" and "over provision" advice is still being given for things that could scale up fairly large. (where fairly large isn't anything that exciting, really)


The poster was talking about ALBs. I'm not experienced enough with them to know whether the claim is true.

I do know that on ELBs, though, pre-warming is essential for high throughput.


It definitely still applies.


Imo, one really killer bit is the first piece they mentioned:

"Google provides an SDK that enables users to run a suite of services along with an admin interface, database and caching layer with a single command."

I really wish AWS had a decent local dev story, rather than relying on 10 separate half-baked OSS solutions


Have you tried the documentation for google's sdk? I stopped in disgust trying to track down what I needed and then trying to manage through multiple different admin interfaces.


I've been using the Google bits recently. I agree that the docs for e.g. all the Python SDKs need a lot of work.

That said, boto3 has thorough docs, but I wouldn't consider them particularly well organized. I can only really navigate the AWS SDK docs because I already know what I want to do and can google the specific terminology.


Are you talking about this Python SDK doc? https://googlecloudplatform.github.io/google-cloud-python/la...


Yeah. In particular, I was really looking for something for "here's how to do most of the basics for Cloud Storage".

If you look at https://googlecloudplatform.github.io/google-cloud-python/la..., for example, there's not even a top level navigation index that I can read through to guess what function I might need by name.


(Google Cloud support here) The pages you linked to are supposed to serve as a client library reference only. If you want higher-level instructions and examples, always start with the main Google Cloud docs first. On any page that offers instructions, the top of the code window offers a selection of the client library languages, CLI tools and REST APIs available to do that task. For Cloud Storage, start here:

https://cloud.google.com/storage/docs/how-to

Choose a topic, and select "Python" at the top. It should provide instructions and examples using the Python libraries.

Also, we have a repo of demo projects and examples for nearly every GCP product/service, and then some. Great examples to be found here (some might be out of date though):

https://github.com/GoogleCloudPlatform/python-docs-samples


Heya,

Thanks for the link! You're right that this is what I was looking for. Unfortunately, that hadn't shown up in a convenient place while I was googling around. Would be good to add direct links to those from the client lib references, because those pop up for e.g. "google storage python" first. (Unless they're already there and I didn't see them).

Links from the READMEs of the repo might be useful too to help people get there? https://github.com/GoogleCloudPlatform/google-cloud-python


That's not a bad idea, I'll bring it up with our client library maintainers today.


Of course not :)


It's up there right at the top of the README:

https://github.com/GoogleCloudPlatform/google-cloud-python


The main story here is probably this line: "The system is generally at that peak traffic for only a few minutes a day".

AWS charges by the hour, Google by the minute. If your peak traffic only lasts a few minutes, hourly billing is inflexible, simple as that.


Hourly billing is equally inflexible if you need 61 minutes, or 74, or 110 - paying for 120 in any of those cases isn't too rad.

(Disclosure: I work on GCP)


AWS Lambda charges per 100ms of execution, scaled by memory size (i.e., in GB-seconds).

I'd bet they could have seen significant cost savings on AWS by migrating to lambda, and gotten continuous scaling to boot.
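
Rough math with Lambda's public pricing (about $0.00001667 per GB-second plus $0.20 per million requests; the workload numbers below are invented):

    gb, seconds, calls = 0.5, 0.2, 10 * 1000 * 1000
    compute = gb * seconds * calls * 0.00001667  # ~$16.67 in GB-seconds
    requests = calls / 1000000.0 * 0.20          # $2.00 in request fees
    print(round(compute + requests, 2))          # ~$18.67 for 10M 200ms calls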


Serverless is probably far too immature for companies like NYT.


I don't disagree with you, but it's also true that you don't have to go all-in. I run services that run server-ful-ly on EC2, while requests to the "front-end" are handled by Lambda. If you've got one or two very hot endpoints and the majority of your server load spike is thanks to the processing for those endpoints, moving just those endpoints to Lambda could give you reassurance that you don't have all of your eggs in one basket (not having to go full-serverless) while also getting the benefit of being able to scale up and down essentially effortlessly.

Of course, there are lots of other considerations you should weigh when going even partially serverless, and it could be that the NYT chose not to experiment with Lambda for other reasons. For instance, you're effectively tied to CloudWatch for logging and monitoring, which could be a deal-breaker. Much of the processing could be happening in the DB, which would make Lambda moot. It may simply be that their estimated usage of Lambda was too costly.


Three devs seems like a bit of a bottleneck... Why maintain two separate architectures for a single problem if there's another option?


I'm not sure why people are so enamored with ELB -- just terminate SSL at your web boxes using nginx and publish the public IPs of all of these machines in your DNS records.

You remove a bunch of ELB per-request costs doing it this way and you can scale it however you see fit.


That works great until a machine starts failing health checks and you need to take it out of rotation ASAP. It's also the case that DNS gets cached and not all users create the same load: one user making many requests will burden one server instead of having the load evenly distributed.


Clients will try another IP if they can't connect -- partial failures may be a problem, but in my experience, as long as nginx is alive, you can load-balance to a different web backend if the app processes on that machine are wedged.

I've deployed this solution in 50k req/s environments and not seen a single user be a problem like you mention -- any motivated bad actor could cause problems in either scenario I expect.


Clients might not fail to connect. That's even worse. They connect and the server hangs and returns no response (perhaps due to bad configuration changes). Now you're stuck, and your server is too oversaturated to SSH in and fix it.

It depends on your application and users. Building a website? Probably not much of an issue. Building a low-latency API? YMMV. Keeping your load evenly balanced across your front-end cluster can also keep your costs low, since you are able to distribute load more evenly.

That's another point: if you scale your cluster size up and down frequently to accommodate load, doing that with DNS is a nightmare.


> It's also the case that DNS gets cached and not all users create the same load

It's also, also the case that DNS gets cached and propagating the removal of a broken server could take ages.


Interesting that they are using Medium instead of in-house publishing tools. It's the first time I've noticed the open.nytimes.com articles.


Didn't you hear? Medium pays per clap now, so the NYT authors are all jumping ship /s


Getting pro-GCP articles to the top of HN must no doubt be a high priority for the Google marketing team. This is the nature of modern advertising: sneakily trying to subvert your thinking by masquerading as something else.


We actually spent the entire day at the Giants game today. ¯\(ツ)/¯

There's no incentive for high ranking HN posts, or any HN posts, actually. If there were, you wouldn't see others continually submit our news here before we do. This was a nice and unprompted post for everyone in GCP to read, as well.

(Disclosure: I work on GCP as a product marketer.)


Hey I was at the giants game today with my company as well! ¯\(ツ)/¯

</noise>


I’m now realizing this might have come off as sarcasm. It wasn’t! I was buzzed from the game when I wrote this lol!


Since you are disclosing your affiliation: does GCP actually have any account managers or similar business devs in EMEA?


A GCP product marketer responding to negative commentary on HN within ~20 minutes of it being posted strikes me as automated.

EDIT: It is a double standard, though; HN readers want access and responses from people on the GCP team, but at the same time cry tinfoil-hat subliminal marketing, etc.


> A GCP product marketer responding to negative commentary on HN within ~20 minutes of it being posted strikes me as automated.

Nonsense, there are many Googlers on HN and all it would take is a colleague poking another with a relevant thread.


I would be reading this every 10 minutes for new comments if it was my product up for discussion.


Ten minutes even seems like a lot. If I saw my website on HN, I'd have alerts set up for new comments.


The first statement is partially true (it's certainly nice), but that's true of any marketing team. To be clear, I don't think the NYT people were pushed to do this. Engineering blogs commonly say what they did to take pride in their work. I'm not in marketing, so I don't know if we were involved. Maybe they sent it over for review, but I kind of doubt it. (To the comment below: we don't pay people for content, and it's demeaning to the engineers at NYT to suggest that.)

Disclosure: I work on Google Cloud (as an engineer, not in marketing, despite how much I like HN).


I sometimes ask Google Cloud clients who post this type of stuff, what's your motivation? The one consistent theme is that telling the world that you're doing cool stuff with cool tech raises your organization's profile and helps in recruitment.

From my experience working with NY Times they are certainly a top-notch engineering organization. They should be free to advertise that.

I don't know if these engineers are being forced to release these types of blogs, but the far more likely (and respectful to said engineers) scenario is that they just want to talk about their work. This isn't the first time NY Times has done this [0][1].

(Work at Google Cloud on same team as boulos and worked with NY Times on some of their migration pieces like BigQuery)

[0] https://thenewstack.io/caching-hadoop-new-york-times-embrace...

[1] https://open.nytimes.com/faster-simpler-workflow-analytical-...


I imagine the resulting discussion may also open the team's eyes to alternatives that others have succeeded and failed with, which in turn helps the team in further refining their process.


> Engineering blogs commonly say what they did to take pride in their work

Yep, engineering blogs are also marketing, aimed right at HN readers. Attracting good engineering talent isn't easy, so companies have to do marketing on that front as well.

But, as usual, people being marketed to (in this case, HN readers) don't realize they are being marketed to.


Heaven forbid a software engineer post about a major technical change their team just accomplished. It couldn't be because they want to raise their team's profile for hiring, or discuss the pros and cons of their experience, or raise their own profile as an engineer.


That's the nature of conspiracy theories: sneakily insinuating a well-organised plot to harm, while masquerading as someone who thinks Occam's razor is a hairdresser.

PS: I don't work for Google

<#org.google.subversion#ref-hn5!impact-8!factor-5!aID636TZ>


Really why would they care? HN is a tiny, tiny slice of the programming world.


It is over-saturated with people who either make spend decisions or seriously affect spend decisions


Google likely offers significant discounts to companies which write these sorts of pieces for them. And obviously a lot of Googlers participate actively here, so getting upvotes naturally is likely not a challenge.


This doesn't really make much sense to me. How many peak users are there? What's the number of requests per second?

I can't imagine that the load would be so high that it wouldn't be possible to do it without GCP with three developers.

It would be way more interesting with performance details. :)


Has anybody had a successful experience deploying Docker containers on AppEngine? Last time I tried, I had such a bad experience in terms of deployment speed (time to build the image, then upload it, then waiting for the stuff to deploy) that I reverted to managing my own GCE instance.

But maybe I had bad luck...


I just tried to move a Rails app to AppEngine. It uses the flexible environment (i.e., Docker). It took like 10 or 15 minutes to deploy, each time.

Heroku does my deploy in about 5 mins.

https://groups.google.com/forum/#!topic/google-appengine/hZM...

GCloud is cheaper though.

Also, VMs spin up in GCloud amazingly fast - like 5 seconds. It feels like somebody at GCloud just needs to go and fix this. There's no reason it should be so bad...


Deploys take 15 seconds on GKE. I wonder why App Engine is so bad.


With GKE you don't necessarily update your load balancer rules with a deploy correct? The linked thread points the blame for app engine deploys on waiting for GCLB to update.


Correct, you have software based load balancing under the covers.

My Google load balancers never move. It is a single thing that points at each node (physical machine) in the cluster and distributes traffic between them.

Each node knows how to route traffic to each app. So when I deploy that app, the software load balancer at the node level will slowly move traffic over from old app to new app. Entire thing is MAGICAL. And 0 downtime, very very fast deploys.

Edit - But yes, this explains it. Changing the Google load balancers is like a 5-minute ordeal. Total pain. It's nice that with GKE you only need to touch them when your node count changes, which can be very rare (~monthly for me).


Yes, uploads are fast, and you can spin up a VM directly from an image now.


How many minutes/seconds do you mean by fast? Can you point to guidance on how to make that happen because I am seeing 8+ minute re-deploy times.


Funny you mention that, I was just experiencing this yesterday in the flex environment. Re-deploy time for a 3 character text change on one page is 8+ minutes for each re-deploy. This is insane.

Does anyone know how to speed that up?


(Google Cloud support) Unfortunately, this is a pain point that has no easy solution at the moment. It doesn't matter how trivial a change is; it isn't the Docker deployment that is taking all that time. As mentioned above, the bottleneck is updating the GC load balancer with the new routing rules, which takes time to propagate throughout the system. This is a high-priority issue internally, but updating the load balancers is no trivial task and will take a lot of time and testing.

In the meantime, I recommend the following mitigation strategies:

1. Try to get into the habit of carefully reviewing and testing new versions locally before deployment. Client libraries should still work if you have a valid default application credential set up. I say this because I have a hard time remembering to do this as well.

2. Static content and templates for your site should be hosted on GCS, not deployed with your app in a "static/" folder or something. It's easy to fix typos, HTML/CSS, and JS errors by simply using gsutil to copy the fixed file over; it takes only a second (sketched after this list).

3. Always keep a stable version of your app available in case you broke something in a new deployment. It's quicker to route traffic to an older version than it is to track down a bug and wait for the fix to finish being deployed.
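
For point 2, the hotfix really is a one-liner with gsutil, or a few lines with the Python client (bucket and paths invented):

    # shell equivalent: gsutil cp css/site.css gs://my-app-static/css/site.css
    from google.cloud import storage

    bucket = storage.Client().bucket('my-app-static')
    bucket.blob('css/site.css').upload_from_filename('css/site.css')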

Not ideal, but again, Googlers have to suffer through this too, and are very motivated to find a way to fix it.


Great answer, thanks!


OT: How nice to not see a single "avoid all Google services because Reader" comment. Maybe we are finally moving on.


Yet here you are.


Oh, thanks to this article I've seen that AppEngine now supports Java 8 - this is really, really cool.


I was sure this was about some multiplayer game thing, but no, it's a crossword. I'm not entirely sure what they are even scaling here; I was expecting an article about a CDN...


The game allows you to sync your game progress across multiple devices, and it's subscription-based, so a CDN wouldn't be much help there.

Also, realtime multiplayer crosswords are coming! I'll be speaking at GothamGo this year about that exact topic.


It’s a paid service that hundreds of thousands of people use. Not world changing stuff, but considering competitors are still using Java applets and Flash, I for one applaud their efforts.


Does anyone know how much it costs to add a custom domain and SSL to AppEngine (Standard or Flexible)? I have been looking and haven't been able to find out.


SSL is free. Custom domains are also free.


That explains it; it would be nice if they could write it down somewhere as a feature :)


It's free for custom domains, but only using your existing certs; you still have to set up the cert and such with your domain as you normally would. You also get the *.appspot.com domain with automatic SSL.


An SSL cert provided by Google is free and is in private alpha; it's coming to GA soon.


Nice advertisement that Google bought from the NYTimes.


This reads eerily like a press release for GCP...


But check out the OP's posting history.


4 posts about NYT crossword. So they're talking about their work, and their work uses GCP.


Quite


A third of the posts to HN are disguised press releases.


well duh


Would you please stop posting unsubstantive comments to HN?



