MangaDex infrastructure overview (mangadex.dev)
567 points by kokada on Sept 7, 2021 | 229 comments



I run an Alexa top-2000 website. (Mangadex is presently at about 6000.) I spend less than $250 a month.

I have loads and loads of thoughts about what they could be doing differently to reduce their costs but I'll just say that the number one thing Mangadex could be doing right now from a cursory glance is to reduce the number of requests. A fresh load of the home page generates over 100 requests. (Mostly images, then javascript fragments.) Mangadex apparently gets over 100 million requests per day. My site - despite ostensibly having more traffic - gets fewer than half that many in a month. (And yes, it's image-heavy.)

A couple easy wins would be to reduce the number of images loaded on the front page. (Does the "Seasonal" slider really need 30 images, or would 5 and a link to the "seasonal" page be enough? Same thing with "Recently Added" and the numbers of images on pages in general.) The biggest win would probably be reducing the number of javascript requests. Somehow people seem to think there's some merit to loading javascript which subsequently loads additional javascript. This adds a tremendous amount of latency to your page load and generates needless network traffic. Each request has a tremendous amount of overhead - particularly for dynamically-generated javascript. It's much better to load all of the javascript you need in a single request or a small handful of requests. Unfortunately, this is probably a huge lift for a site already designed in this way, but the improved loading time would be a big UX win.

Anyway - best of luck to MangaDex! They've clearly put a lot of thought into this.


Hi, we're trying to lower the requests:pageview ratio in general, but for what it's worth this article essentially:

- ignores the vast majority of "image serving" (most is handled by DDG and our custom CDN)

- the JS fragments thankfully should load only on first visit and then get aggressively cached by DDG/your browser

One of the pain points is that there are a lot of settings letting users decide what they should or shouldn't see (content rating, original language, search tags, etc), and some of these are already specifically denormalized (when querying chapter entities, the ES indices for those contain some manga-level properties, to avoid having to dereference the manga first) -- however, this also makes caching substantially less efficient in many places, alas

Thanks!


Hi, I'm a performance tuning expert, and this thread piqued my interest.

The first thing that I noticed is that even with caching enabled, you're loading "too much data". After loading the main page and then clicking one of the tiles, there are several JSON API calls.

Here's an example, 195 kB transferred (528 kB size): https://api.mangadex.org/manga/bbaa17c4-0f36-4bbb-9861-34fc8...

Oof. Half a megabyte of JSON! Ignore the network traffic for a moment, because GZIP does wonders. The real problem is that generating that much JSON is very "heavy" on servers. Lots and lots of small object allocations, which gives the garbage collector a ton of work to do. It's also expensive to decode on the browser for similar reasons.

On my computer, this took a whopping 455ms to transfer, nearly half a second. That results in a noticeable latency hit to the site.

In my consulting gig I always give developers the same advice: "Displaying 1 kilobyte of data should take roughly 1 kilobyte of traffic".

In other words, there isn't 500 KB of text anywhere on that page! A quick cut & paste shows about 8 KB of user-visible text in the final HTML rendering. That's a 1:60 ratio of content-to-data, which is very poor. I bet that behind the scenes, this took a heck of a lot more back-end network traffic and in-memory processing to generate. Probably tens to hundreds of megabytes of internal traffic, all up.

This is one of the core reasons most sites have difficulty scaling, because for every kilobyte of content output to the screen, they're powering through megabytes or even gigabytes of data behind the scenes.

Can this API query be cut down to match what's displayed on the screen? Can it be cached for all users? Can it be cached precompressed?

Etc...


> The real problem is that generating that much JSON is very "heavy" on servers. Lots and lots of small object allocations, which gives the garbage collector a ton of work to do. It's also expensive to decode on the browser for similar reasons.

For what it's worth, this isn't generated live but a mix of existing entity documents

Most of it is page filenames which indeed could be made optional and fetched only by the reader, but that'd be us actively nulling them out in the returned entity, since they are there in the ES documents for the chapters (a manga feed like this being a list of chapters)


You're basically dumping down a database to the web browser, including all of the internal metadata that's likely irrelevant to rendering the HTML.

For example, user role memberships:

   {
        "id": "c80b68c5-09ae-4a50-a447-df7c5a4a6d01",
        "type": "user",
        "attributes": {
            "username": "kinshiki",
            "roles": [
                "ROLE_MEMBER",
                "ROLE_GROUP_MEMBER",
                "ROLE_POWER_UPLOADER"
            ],
            "version": 1
        }
    }

Also record timestamps like created/updated, along with contact details that may reveal sensitive info:

    "attributes": {
        "name": "SENPAI TEAM",
        "locked": true,
        "website": "https:\/\/discord.gg\/84e3j9b",
        "ircServer": null,
        "ircChannel": null,
        "discord": "84e3j9b",
        "contactEmail": "senpai.info@gmail.com",
        "description": null,
        "official": false,
        "verified": false,
        "createdAt": "2021-04-19T21:45:59+00:00",
        "updatedAt": "2021-04-19T21:45:59+00:00",
        "version": 1
    }

But let's just go back to your response:

> Most of it is page filenames which indeed could be made optional

Do that! If you strip them out, the 529 kB document shrinks to 280 kB, which hardly seems worth the hassle, but when gzipped, this is a minuscule 13 kB! That's because those filename strings are hashes, which barely compress, whereas the rest of the JSON compresses very well.

It's basic stuff like this that can make a website absolutely fly.

Avoid giving computers unnecessary, mandatory work: https://blog.jooq.org/many-sql-performance-problems-stem-fro...
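

If you have a saved copy of that response handy, a quick stdlib-only check makes the compression difference concrete. This is only a sketch: the file path and the nested field name holding the page filenames (assumed here to be "data" under each chapter's attributes) are guesses and would need adjusting to the real response shape.

    import gzip
    import json

    def gzipped_size(obj):
        """Serialize to JSON and return the gzip-compressed size in bytes."""
        return len(gzip.compress(json.dumps(obj).encode("utf-8")))

    # A saved copy of the API response; the path is hypothetical.
    with open("manga_feed.json", encoding="utf-8") as f:
        payload = json.load(f)

    full = gzipped_size(payload)

    # Strip the page-filename arrays before re-measuring. The nested "data"
    # attribute name is a guess; adjust to the real response shape.
    for entry in payload.get("data", []):
        entry.get("attributes", {}).pop("data", None)

    print(f"gzipped with filenames:    {full / 1024:.1f} kB")
    print(f"gzipped without filenames: {gzipped_size(payload) / 1024:.1f} kB")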


As I said, it's not so much that we ask that data to be fetched -- it is there in the first place, and pulled from Elasticsearch, not a SQL database

Because of this model, we also make sure that Elasticsearch merely works as a search cache, not as an authoritative content database (hence everything we add in there is considered public, on purpose, and what isn't meant to be public is simply not indexed in ES)

However the gzip efficiency improvements would be really neat for sure

Fwiw I also don't work on the backend and there might be good reasons not to expressly filter out data (yet, anyway; perhaps it will end up as a separate entity and become an include parameter)


I have to say I'm glad this is being talked about in a public forum. Outsiders rarely get to see brainstorming, troubleshooting & group discussion of technological issues like this.

Someone who is focused on the performance aspect & someone who is focused on stack stability discussing the real world input & output of a business system and showing why performance & UX are not the only metrics that matter is a good thing for us to see.


You can query Elastic for specific fields only: https://www.elastic.co/guide/en/elasticsearch/reference/curr...

Edit: As you said, there may be reasons on the backend not to filter things out of the query. Though it seems likely that the web response could be trimmed down.
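

For illustration, a minimal sketch of source filtering against the raw search API, assuming a hypothetical `chapters` index, made-up field names, and a placeholder ID; the real mappings and query will differ:

    import requests

    # Hypothetical index, field names, and ID; the real mappings differ.
    query = {
        "query": {"term": {"mangaId": "some-manga-uuid"}},
        "size": 100,
        # Return only the fields the page actually renders, instead of the
        # full stored _source document for every chapter.
        "_source": ["chapter", "title", "translatedLanguage", "publishAt"],
    }

    resp = requests.post("http://localhost:9200/chapters/_search", json=query, timeout=10)
    resp.raise_for_status()

    for hit in resp.json()["hits"]["hits"]:
        print(hit["_source"])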


This seems less like a performance problem and more of a security issue. Especially considering that this is a website that hosts unlicensed translations. How much of this information is actually intended to be made public?


> Displaying 1 kilobyte of data should take roughly 1 kilobyte of traffic

Is this to be taken literally? I don't consider myself a performance-tuning expert, but I'm not sure how I can make something useful out of this advice. Of course, "the less you transfer, the better" is an obvious thing to say (a bit too obvious to be useful, in fact), but does it really mean I should aspire to transfer only what I'm actually going to display right now? For example, there is a city autocomplete form on the page (well, a couple of thousand relatively short entries). In that case I would probably consider making 1 request to fetch all these cities (on input focus, most likely), instead of making a request to the server on every couple of characters you type. Is it actually a wrong way of thinking?


It's an aspirational goal, not a hard rule.

In your case, you're optimising for round-trips, which is also important. As long as you only send the city names instead of a huge blob that also includes a bunch of metadata, you're probably fine.

The most common example of my rule is that I often see SELECT statements on unindexed columns. This means that behind the scenes, the database engine is forced to do a table scan to find the row. If the query uses a wildcard selector, then it is also forced to return all columns, whether they are used by the application or not.

I commonly see scans over 100 MB tables returning 100 KB to the web tier, which then converts this to 200 KB of JSON to show 100 bytes of text to the end user. Simply adding an index to the table allows the database engine to reduce the data it has to process to 10-30 KB. Selecting specific columns can reduce that to a few kilobytes, and likely also shrink the JSON to match. Eliminating the JSON and directly generating the HTML on the server, like in the good old days, would also cut the network traffic down to roughly the 100-byte minimum.
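
A toy illustration of that index-plus-explicit-columns point, using SQLite rather than MySQL purely because it runs anywhere; the table and column names are invented:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL, notes TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders (user_id, total, notes) VALUES (?, ?, ?)",
        [(i % 1000, i * 1.5, "x" * 200) for i in range(100_000)],
    )

    # Without an index on user_id this is a full table scan that drags every
    # column along with it.
    print(conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE user_id = 42"
    ).fetchall())

    # Add the index, then ask only for the columns the page actually needs.
    conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
    print(conn.execute(
        "EXPLAIN QUERY PLAN SELECT id, total FROM orders WHERE user_id = 42"
    ).fetchall())

The first plan reports a full scan; the second a search on the index, and the rows that come back carry only the two columns being displayed.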

Similarly, you often see performance monitoring, logging, or graphing programs store data in fantastic detail and precision. Meanwhile, the graph needs only 16 bits of data, because screens are typically at most a few thousand pixels across in size! A case in point is Microsoft System Center Operations Manager (SCOM), which has a metric write amplification of something like 300:1, which is why it can't log metrics at a usefully high frequency. Not because that's impossible, but because it's wasting the available computer power to an absurd degree. Azure has inherited this code, and then layered JSON on top. (I guess when you bill by gigabytes ingested, the incentives are all wrong.)


> This is one of the core reasons most sites have difficulty scaling, because for every kilobyte of content output to the screen, they're powering through megabytes or even gigabytes of data behind the scenes.

> Can this API query be cut down to match what's displayed on the screen? Can it be cached for all users? Can it be cached precompressed?

This is why you want to bypass the JS realm (or whatever language does the serdes) and send clients JSON or XML directly from the database, so the client is only getting the data at rest.


> the JS fragments thankfully should load only on first visit and then get aggressively cached by DDG/your browser

According to Alexa you have a 46.4% bounce rate. [1]

When 46% of your users aren't coming back, how do 31 round trips to your server for 100% of first-page visitors save anyone time or bandwidth? Your pageviews per visitor figure is 6.8, meaning the 53.6% who stick around view an average of 11.8 pages each. Even if there are zero subsequent JS requests on other pages (clicking a random page I see 8), you would be generating 31 requests up front to save 10.8 subsequent requests for about half of your users. (And again - in any scenario where the number of JS fragments transferred on subsequent requests >= 1, even this benefit goes out the window.) How does that save you or your users bandwidth, server load, or other overhead?
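
Working that arithmetic through (taking Alexa's 46.4% and 6.8 figures at face value):

    bounce_rate = 0.464      # share of visitors who view exactly one page
    pages_per_visitor = 6.8  # average across all visitors

    non_bouncers = 1 - bounce_rate
    pages_per_non_bouncer = (pages_per_visitor - bounce_rate * 1) / non_bouncers
    print(f"{pages_per_non_bouncer:.1f}")  # ~11.8 pages per non-bouncing visitor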

The scale is not quite linear, but generally speaking, if you get your number of requests down from > 100 to < 5, you'll be able to handle around 20x the traffic with the same number of web-facing servers. Or alternatively the same amount of traffic with around 1 / 20th the servers.

Would that have a material effect on your costs?

[1] https://www.alexa.com/siteinfo/mangadex.org


Definitely needs optimising for user experience indeed!

However the serving of this JS has nearly no cost to us (as they are cached at the edge by DDoS-Guard and the frontend is otherwise entirely static on our end)


It does have a cost; it's just hidden. The cost is that it increases your bounce rate because of bad UX.


One issue I see is that flipping back and forth between chapters reloads images from different URLs, which means they're uncacheable. I guess that's somehow related to the mangadex@home thing, but if the URLs were generated in a more deterministic manner (keyed on some client ID plus the chapter being loaded), then the browser could avoid redundant traffic.


That's very close to how MD@H works, but it also has a time component and tokens are not generated by our main backends, so it'd require a separate internal http call per chapter
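

To make the parent's suggestion concrete, here is a rough sketch of what a deterministic, time-bucketed token could look like: an HMAC over client + chapter + time bucket that any node holding a shared secret can recompute locally. This is not how MD@H actually works; every name, the bucket length, and the URL shape are assumptions for illustration only.

    import hashlib
    import hmac
    import time

    # All of this is hypothetical: the secret, bucket size, and URL layout
    # are illustration only, not MD@H's real scheme.
    SECRET = b"shared-secret-known-to-image-nodes"
    BUCKET_SECONDS = 15 * 60  # token stays stable (and cacheable) for 15 minutes

    def chapter_token(client_id, chapter_id, now=None):
        """Deterministic token for (client, chapter, time bucket).

        Any node holding the shared secret can recompute and verify this
        locally, with no per-chapter call back to the main backend.
        """
        bucket = int((now if now is not None else time.time()) // BUCKET_SECONDS)
        msg = f"{client_id}:{chapter_id}:{bucket}".encode()
        return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:32]

    def page_url(node_host, client_id, chapter_id, filename):
        token = chapter_token(client_id, chapter_id)
        return f"https://{node_host}/data/{token}/{chapter_id}/{filename}"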


Another thing. For each page that's being loaded there's a report being sent. Instead this could be aggregated (e.g. once a second) and then processed as a batch on the server side which should be faster.

And if your JS assets are hashed then you can add cache-control: immutable so that a browser doesn't have to reload them when the user F5s.


Do you manage to get as many buzz-words and OSS products into your system as they do? :)

In general, the fewer moving parts you have in a system, the more reliable, secure, efficient, and cheap the system becomes.

In their case they run a site that is probably under constant attack by the "hired goons", so they're going to need to have more moving parts than others. Plus they will want to optimise for minimal development time (it's a hobby) so just adding another tried and trusted system into the stack to do something you need makes sense.


lol

> In general, the fewer moving parts you have in a system, the more reliable, secure, efficient, and cheap the system becomes.

100% agreed. This is not my first high-traffic site, nor even the highest. (I built the analytics system for an Alexa top-10 site in 2010, reaching some 30 billion writes / day off of a mere 14 small EC2 instances.) I've never seen a k8s implementation in production that was necessary.

I will note that my Alexa-2k site is also a personal site (no revenue) and under constant attack. In fact we frequently suffer DDoSes that we don't even notice until reviewing the logs later, because the site doesn't suffer any latency under pressure.


Interesting, wouldn't mind having a chat outside of HN if you're interested (see my profile for mail).

I've spent much of my career working on systems with active users from the hundreds to low thousands, but which process a huge number (50k/sec scale) jobs/tasks.

It's a totally different kettle of fish, and if I'm totally honest I'm shocked at how badly "web" scales and how common these naive and super inefficient implementations are (hint: my bare-metal server from 2005 was faster than expensive cloud VMs).

Recently I've worked on two high-usage systems (one of which was "handling" 30k requests/second for the first couple of weeks).


> I've spent much of my career working on systems with active users from the hundreds to low thousands, but which process a huge number (50k/sec scale) jobs/tasks.

MMO games, by any chance?


Would you mind outlining your approach?

Really interested to see how you think about this sort of thing =)...


My approach to what?


(1) Simple beats complex.

(2) You can spend weeks building complex infrastructure or caching systems only to find out that some fixed C in your equation was larger than your overhead savings. In other words: Measure everything. In other other words: Premature optimization is the root of all evil.

(3) Fewer moving parts equals less overhead. (Again: Simple beats complex.) It also makes things simpler to reason about. If you can get by without the fancy frameworks, VMs, containers, ORM, message queues, etc. you'll probably have a more performant system. You need to understand what each of those things does and how and why you're using them. Which brings me to:

(4) Learn your tools. You can push an incredible amount of performance out of MySQL, for instance, if you learn to adjust its settings, benchmark different DB engines for your application, test different approaches to building your schemas, test different queries, and make use of tools like the EXPLAIN statement. Do that and you'll probably never need to do something silly like making half a dozen round-trips to the database in a single page load.

(5) Understand your data. Reason about the data you will need before you build your application. If you're working with an existing application, make sure you are very familiar with your application's database schema. Reason ahead of time about what requirements you have or will have, and which data will be needed simultaneously for different operations. Design your database tables in such a way as to minimize the number of round-trips you will need to make to the database. (My rule of thumb: Try to do everything in a single request per page, if possible. Two is acceptable. Three is the maximum. If I need to make more than three round-trips to the database in a single page request, I'm either doing something too complex or I seriously need to rethink my schema.)

(6) Networking is slow. Minimize network traversal. Avoid relying on third-party APIs where possible when performance counts. Prefer running small databases local to the web server to large databases that require network traversal to reach. This is how I handled 30 billion writes / day: 12 web servers with separate MySQL instances local to each, sharded on primary key IDs. (There's a small sketch of this pattern after the list.) The servers continuously exported data to an "aggregation" server, which was subsequently copied to another server for additional processing. Having the web server and database local to the same VM meant they didn't need to wait for any network traversal to record their data. I could've easily needed several times as many servers if I had gone with a traditional cluster, due to the additional latency. When you need to process 25,000 events in a second, every millisecond counts.

(7) Static files beat the hell out of databases for read-only performance. (Generally.)

(8) Sometimes you can get things moving even faster by storing it in memory instead of on disk.

(9) Reiterating what's in (3): Most web frameworks are garbage when it comes to performance. If your framework isn't in the top half of the Techempower benchmarks (or higher, for performance-critical applications), it's probably going to be better for performance to write your own code, if you understand what you're doing. Link for reference: https://www.techempower.com/benchmarks/ Note that the Techempower benchmarks themselves can be misleading. Many of the best performers are only there because of some built-in caching, obscure language hack, or standards-breaking corner-cutting. But for the frameworks that aren't doing those things, the benchmark is solid. Again, make sure you know your tools and why the benchmark rating is what it is. Note also that some entire languages don't really show up in the top half of the Techempower benchmarks. Take that into consideration if performance is critical to your application.

(10) Most applications don't need great performance. Remember that a million hits a day is really just 12 hits per second. Of course the reality is that the traffic doesn't come in evenly across every second of the day, but the point remains: Most applications just don't need that much optimization. Just stick with (1) and (2) if you're not serving a hundred million hits per day and you'll be fine.
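

Here's the sketch promised under point (6): each web server records events in a database on the same box, and a periodic job ships them off to the aggregation server. SQLite stands in for the per-server MySQL instance, and all paths and names are made up.

    import sqlite3
    import time

    # SQLite stands in for the MySQL instance that lived on each web server;
    # the path is made up.
    local = sqlite3.connect("/var/data/events-local.db")
    local.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            created_at REAL NOT NULL,
            payload    TEXT NOT NULL,
            exported   INTEGER NOT NULL DEFAULT 0
        )
    """)

    def record_event(payload):
        # No network traversal: the write lands on the same machine as the
        # web server handling the request.
        with local:
            local.execute(
                "INSERT INTO events (created_at, payload) VALUES (?, ?)",
                (time.time(), payload),
            )

    def export_batch(ship):
        """Periodically push unexported rows to the aggregation server.

        `ship` is a placeholder for whatever transport you use (a bulk
        insert over the network, rsync of a dump file, etc).
        """
        rows = local.execute(
            "SELECT id, created_at, payload FROM events WHERE exported = 0 LIMIT 10000"
        ).fetchall()
        if rows:
            ship(rows)
            with local:
                local.executemany(
                    "UPDATE events SET exported = 1 WHERE id = ?",
                    [(r[0],) for r in rows],
                )
        return len(rows)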


"Simple beats complex."

In the very first lecture of the Computer Science degree I did in the 1980s the lecturer emphasised KISS, and said that while we almost certainly wouldn't believe it at first eventually we'd realise that this is the most important design principle of all. Probably took me ~15 years... ;-)


Sadly I think this is a lesson that we as an industry consistently keep unlearning.


> Simple beats complex.

> Fewer moving parts equals less overhead.

Took me almost a decade to really comprehend this.

I used to include all sorts of libraries, try out all the fancy patterns/architectures etc...

After countless hours debugging production issues... the best code I've ever written is the code with the fewest moving parts. It's easier to debug and the issues are predictable.


"The best part is no part." is an engineering quote I heard.


Said in a slightly different way: No part is better than no part.

I know I’m not the first to use that phrasing, but I’m not sure where I picked it up. If someone wants to point out the etymology of that type of phrase, I’d be glad to read up on what I’ve forgotten/missed.


I'm sure I've heard something like "engineering is solving problems while doing as little new as possible".


> 12 web servers with separate MySQL instances local to each sharded on primary key IDs.

I don't understand this part. Hopefully you can clarify this to me.

If you're sharding by primary key, doesn't that mean that there's a high chance that the shard in your local DB instance won't have the data the web server is requesting?

I'm not familiar with DB management.


Imagine you have a system which services 50 states. In the vast majority of cases, states only look at or mutate information on their own state.

In that case, you can easily split the data between shards based on ranges of an integer key. It's very easy to code, test, deploy and understand such a design.


Thanks, this is a good list in general of things to think about =)...

I've not really ever applied 9 myself, I've run comparative benchmarks a couple of times, but not thought about using that as a basis for whether to roll my own on critical performance parts.


> But for the frameworks that aren't doing those things, the benchmark is solid.

Any example of such frameworks?


(ASP).NET is solid. Extremely fast, very reliable, and highly productive.

https://dotnet.microsoft.com/apps/aspnet


As long as you know what you're doing. If you're throwing an ORM like Entity Framework at a problem because you don't understand SQL, then you're going to see poor performance.


Can you share how you do logging/monitoring/alerting for your site?


Bash scripts and cron. Automatic alerts go out to devs via OpsGenie when resource availability drops so we can get out ahead of it. 0 seconds of downtime in the past 12 months.
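

For flavour, a minimal example of the kind of check such a cron job might run; the thresholds and the alert hook are placeholders, and the real OpsGenie integration isn't shown:

    #!/usr/bin/env python3
    """Minimal resource check intended to run from cron every minute."""
    import os
    import shutil

    DISK_FREE_MIN = 0.15      # alert when less than 15% of the disk is free
    LOAD_PER_CORE_MAX = 2.0   # alert when 1-minute load exceeds 2x core count

    def alert(message):
        # Placeholder: wire this up to OpsGenie, email, or whatever you use.
        print(f"ALERT: {message}")

    usage = shutil.disk_usage("/")
    if usage.free / usage.total < DISK_FREE_MIN:
        alert(f"low disk: {usage.free / usage.total:.0%} free on /")

    load1, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    if load1 / cores > LOAD_PER_CORE_MAX:
        alert(f"high load: {load1:.1f} across {cores} cores")

Scheduled with something like `* * * * * /usr/local/bin/check_resources.py` in the crontab.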


To architecting a high traffic site =)...


He posted a reply to his own comment.

https://news.ycombinator.com/item?id=28443113


Actually, my reply was to Folcon. HN simply doesn't allow you to reply to comments beyond a certain depth sometimes.

Perhaps mods have the ability to extend this for active discussions and that's why I can reply now?


it's timing based. you can always reply by going to the permalink of the comment you want to reply to.


Couldn't reply to this comment - but sure enough, the permalink gives me the option. Thank you for the info!


Yeah, it's a somewhat well-meaning feature (supposed to slow down flamewars) that is extremely unintuitive


>In their case they run a site that is probably under constant attack by the "hired goons", so they're going to need to have more moving parts than others.

That's taken care of by the DDoS-Guard system they placed in front of their infrastructure. The design of their system has to take this into account, but that is mainly on an IP and DNS level. The design of their stack behind the load balancer is mainly driven by their functional and non-functional requirements, rather than by the need to prevent DDoS attacks.


The layering - defence in depth - is very much a security consideration. Especially if you're building a pure request/response/sync system you need that. Or you decouple with a queue for mutations and avoid a lot of issues.


That may be true in terms of managing general security, especially with regard to the attack surface of the solution, but here we are talking about DDoS, which is mostly a separate topic, handled on the network level (for volumetric attacks), the load-balancer level (for non-volumetric attacks), or a combination of both.


I don't know about you, but they average 42 page views per visit (HN has 3), so Alexa rank is going to be biased.


>A fresh load of the home page generates over 100 requests.

I see 17 requests, all over either h2 or h3. 4 of them JS, and 2 images.


Then you're not doing a fresh load of the page. There are over 30 images visible on the front page, so your measure doesn't pass the smell test, does it?


>Then you're not doing a fresh load of the page

Nope. Different problem.

The article was linked to a page under the domain "mangadex.dev".

Without any other context, I had assumed "home page" meant http://mangadex.dev , or what I got when clicking "Home" on the linked article.

Apparently not.


I've done things at scale (5-10K req/s) on a budget ($1000 USD) and I've done things at much smaller scales that required a much larger budget.

_How_ you hit scale on a budget is one part of the equation. The other part is: what you're doing.

Off the top of my head, the "how" will often involve the following (just to list a few):

1 - Baremetal

2 - Cache

3 - Denormalize

4 - Append-only

5 - Shard

6 - Performance focused clients/api

7 - Async / background everything

These strategies work _really_ well for catalog-type systems: amazon.com, wiki, shopify, spotify, stackoverflow. The list is virtually endless.

But it doesn't take much more complexity for it to become more difficult/expensive.

Twitter's a good example. Forget Twitter scale; just imagine you've outgrown what a single DB server can do. How do you scale? You can't shard on the `author_id` because the hot path isn't "get all my tweets", the hot path is "get all the tweets of the people I follow". If you shard on `author_id`, you now need to visit N shards. To optimize the hot path, you need to duplicate tweets into each "recipient" shard so that you can do: "select tweet from tweets where recipient_id = $1 order by created desc limit 50". But this duplication is never going to be cheap (to compute or store).

(At twitter's scale, even though it's a simple graph, you have the case of people with millions of followers which probably need special handling. I assume this involves a server-side merge of "tweets from normal people" & RAM["tweets from the popular people"].)
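

A toy sketch of that fan-out-on-write idea, with the celebrity case skipped at write time so it can be merged in at read time. In-memory SQLite connections stand in for the shards, and every name here is invented:

    import sqlite3
    import time
    import uuid

    # In-memory SQLite connections stand in for the shards.
    shards = [sqlite3.connect(":memory:") for _ in range(4)]
    for s in shards:
        s.execute("CREATE TABLE tweets (id TEXT, author_id INTEGER, created REAL, text TEXT)")
        s.execute("CREATE TABLE timeline (recipient_id INTEGER, tweet_id TEXT, author_id INTEGER, created REAL)")

    CELEBRITY_FOLLOWER_THRESHOLD = 100_000

    def shard_for(user_id):
        return shards[user_id % len(shards)]

    def post_tweet(author_id, text, followers):
        tweet_id = uuid.uuid4().hex
        created = time.time()

        # The tweet itself is stored once, in the author's shard.
        with shard_for(author_id) as author_shard:
            author_shard.execute(
                "INSERT INTO tweets (id, author_id, created, text) VALUES (?, ?, ?, ?)",
                (tweet_id, author_id, created, text),
            )

        if len(followers) > CELEBRITY_FOLLOWER_THRESHOLD:
            # Too expensive to copy everywhere; readers merge these in separately.
            return

        # Fan-out on write: copy a reference into every follower's shard so the
        # hot path becomes a single-shard, indexed query per reader.
        for follower_id in followers:
            with shard_for(follower_id) as shard:
                shard.execute(
                    "INSERT INTO timeline (recipient_id, tweet_id, author_id, created) VALUES (?, ?, ?, ?)",
                    (follower_id, tweet_id, author_id, created),
                )

    def read_timeline(user_id, limit=50):
        return shard_for(user_id).execute(
            "SELECT tweet_id, author_id, created FROM timeline"
            " WHERE recipient_id = ? ORDER BY created DESC LIMIT ?",
            (user_id, limit),
        ).fetchall()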


> Twitter's a good example.

Mike Cvet's talk about Twitter's fan-in/fan-out problem and its solution makes for a fascinating watch: https://www.youtube-nocookie.com/embed/WEgCjwyXvwc


I appreciate the no-cookie embed.

Learned something new today.


Reads like a small excerpt out of "Designing Data-Intensive Applications" :)


This is an amazing book that improved my effectiveness as an engineer by an undefinable amount. Instead of just randomly picking components for a cloud application, I learned that I could pick the right tools for the job. This book does a really good job communicating the trade-offs between different designs and tools.


I have always wondered "what next" after having read data-intensive. Some suggested looking at research publications by Google, Facebook, and Microsoft. What do others interested in the field read?


I've heard in a few talks how Twitter engineers have accidentally run into OOM problems by loading too big a follower graph into memory in application code. I think it's a nice reminder that at scale even big companies make the easy mistakes, and you have to architect for them.


The 1-7 list you mention definitely deserves its own blog post on how to implement these. I'm currently not using any of these except 1, and probably don't need the rest for a while, but I do want to know what I should do when I need them. For example: what and how should things be cached? When and how to denormalize, and why is it needed? Why append-only, and how? I've never 'sharded' before and have no idea how that works. I've heard some things about doing everything async/in the background, but how would that work practically?


> Never ‘sharded’ before, no idea how that works.

Sharding sucks, but if your database can't fit on a single machine anymore, you do what you've got to do. The basic idea is that instead of keeping everything in one database on one machine (or, well, a redundant group of machines anyway), you have some method to decide, for a given key, which database machine will have the data. Managing the split of data across different machines is, of course, tricky in practice, especially if you need to change the distribution in the future.

OTOH, Supermicro sells dual processor servers that go up to 8 TB of ram now; you can fit a lot of database in 8 TB of ram, and if you don't keep the whole thing in ram, you can index a ton of data with 8 TB of ram, which means sharding can wait. In contrast, eBay had to shard because a Sun e10k, where they ran Oracle, could only go to 64 GB of ram, and they had no choice but to break up into multiple databases.


> you have some method to decide for a given key what database machine will have the data

Super simple example: splitting the phone book into two volumes, A-K and L-Z. (Hmmmm, is a "phonebook" a thing that typical HN readers remember?)

> you can fit a lot of database in 8 TB of ram, and if you don't keep the whole thing in ram, you can index a ton of data with 8 TB of ram, which means sharding can wait.

For almost everyone, sharding can wait until after the business doesn't need it any more. FAANG need to shard. Maybe a few thousand other companies need to shard. I suspect way, way more businesses start sharding when realistically spending more on suitable hardware would easily cover the next two orders of magnitude of growth.

One of these boxes maxed out will give you a few TB of ram, 24 cpu cores, and 24x16TB NVMe drives which gives you 380-ish TB of fairly fast database - for around $135k, and you'd want two for redundancy. So maybe 12 months worth of a senior engineer's time.

https://www.broadberry.com/performance-storage-servers/cyber...


> So maybe 12 months worth of a senior engineer's time.

In America. Where salaries are two to three times lower, people spend more time to use less hardware.


Sharding does take more time, but it doesn't save that much in hardware costs. Maybe you can save money with two 4TB ram servers vs one 8TB ram server, because the highest density ram tends to cost more per byte, but you also had to buy a whole second system. And that second system has follow on costs, now you're using more power, and twice the switch ports, etc.

There's also a price breakpoint for single socket vs dual socket. Or four vs two, if you really want to spend money. My feeling is that currently, single-socket Epyc looks nice if you don't use a ton of ram, but dual socket is still decently affordable if you need more cores or more ram, and probably for Intel servers; quad socket adds a lot of expense and probably isn't worth it.

Of course, if time is cheap and hardware isn't, you can spend more time on reducing data size, profiling to find optimizations, etc.


Fair points. I'm just trying to push back a bit against "optimizing anything is useless since the main cost is engineering and not hardware", since this depends on local salaries, and in low-income countries the opposite can be true.


> what and how should things be cached?

If something is read much more frequently than it changes, store it client-side, or store it temporarily in an in-memory-only, not-persisted-to-disk "persistence" layer like Redis.

For example, if you're running an online store, your product list doesn't change all that often, but it's queried constantly. The single source of truth lives in a relational database, but when your app needs to fetch the list of products, it should first check the caching layer to see if it's available there. If not, fetch it from the database, but then write it into the cache so that it's available more quickly the next time you need it.
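

A minimal cache-aside sketch along those lines, assuming a local Redis and made-up key names and TTL:

    import json

    import redis

    r = redis.Redis(host="localhost", port=6379)
    PRODUCTS_KEY = "cache:product-list"  # key name and TTL are made up
    TTL_SECONDS = 300

    def fetch_products_from_db():
        # Placeholder for the real relational query.
        return [{"id": 1, "name": "Example product", "price": 9.99}]

    def get_products():
        cached = r.get(PRODUCTS_KEY)
        if cached is not None:
            return json.loads(cached)        # cache hit: no database round-trip

        products = fetch_products_from_db()  # cache miss: hit the source of truth
        r.setex(PRODUCTS_KEY, TTL_SECONDS, json.dumps(products))
        return products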

> When and how to denormalize, why is it needed?

When you need to join several tables together in order to retrieve a result set, and especially when you need to do grouping to get the result set, and the retrieval & grouping is presenting a performance problem, then pre-bake that data on a regular basis, flattening it out into a table optimized for read performance.

Again with the online store example, let's say you want to show the 10 most popular products, with the average review score for each product. As your store grows and you have millions of reviews, you don't really want to calculate that data every time the web page renders. You would build a simpler table that just has the top 10 products, names, IDs, average rating, etc. Rendering the page becomes much more simple because you can just fetch that list from the table. If the average review counts are slightly out of date by a day or two, it doesn't really matter.
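

Sketching that pre-baking step, with an invented products/reviews schema; the rebuild runs from a scheduled job, not per page view:

    import sqlite3

    db = sqlite3.connect("store.db")  # assumes products/reviews tables exist

    def rebuild_top_products():
        """Re-bake the flattened top-products table; run from cron, not per page view."""
        with db:
            db.execute("DROP TABLE IF EXISTS top_products")
            db.execute("""
                CREATE TABLE top_products AS
                SELECT p.id, p.name,
                       COUNT(r.id)   AS review_count,
                       AVG(r.rating) AS avg_rating
                FROM products p
                JOIN reviews r ON r.product_id = p.id
                GROUP BY p.id, p.name
                ORDER BY review_count DESC
                LIMIT 10
            """)

    def top_products():
        # The page render is now a trivial single-table read.
        return db.execute(
            "SELECT id, name, review_count, avg_rating FROM top_products"
        ).fetchall()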

> Why append-only and how?

If you have a lot of users fighting over the same row, trying to update it, you can run into blocking problems. Consider just storing new versions of rows.
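

A tiny sketch of the append-only version of that, with an invented schema (the naive version numbering would still need care under heavy concurrency):

    import sqlite3
    import time

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE document_versions (
            doc_id     TEXT NOT NULL,
            version    INTEGER NOT NULL,
            body       TEXT NOT NULL,
            created_at REAL NOT NULL,
            PRIMARY KEY (doc_id, version)
        )
    """)

    def save(doc_id, body):
        # Append a new version instead of UPDATE-ing a contended row.
        with db:
            db.execute("""
                INSERT INTO document_versions (doc_id, version, body, created_at)
                SELECT ?, COALESCE(MAX(version), 0) + 1, ?, ?
                FROM document_versions WHERE doc_id = ?
            """, (doc_id, body, time.time(), doc_id))

    def latest(doc_id):
        return db.execute("""
            SELECT version, body FROM document_versions
            WHERE doc_id = ? ORDER BY version DESC LIMIT 1
        """, (doc_id,)).fetchone()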

But now we're starting to get into the much more challenging things that require big application code changes - that's why the grandparent post listed 'em in this order. If you do the first two things I cover above there, you can go a long, long, long way.


It's hard to answer this in general. Most out-of-the-box scaling solutions have to be generic, so they lean on distribution/clustering (e.g., more than one + coordination) so they're expensive.

Consider something like an Amazon product page. It's mostly static. You can cache the "product", calculate most of the "dynamic" parts in the background periodically (e.g., recommendations, suggestions), and serve them up as static content. For the truly dynamic/personalized parts (e.g., previously purchased items) you can load this separately (either as a separate call from the client, or let the server piece all the parts together for the client). This personalized stuff is user specific, so [very naively]:

   conn = connections[hash(user_id) % number_of_db_servers]
   conn.row("select last_bought from user_purchases where user_id = $1 and product_id = $2", user_id, product_id)

Note that this is also a denormalization compared to:

    select max(o.purchase_date) from order o join order_items oi on o.id = oi.order_id where o.user_id = $1 and oi.product_id = $2

Anyways, I'd start with #7. I'd add RabbitMQ into your stack and start using it as a job queue (e.g., sending forgot-password emails). Then I'd expand it to track changes in your data: write to "v1.user.create" with the user object in the payload (or just the user id; both approaches are popular) when a user is created. It should let you decouple some of the logic you might have that's being executed sequentially on the http request, making it easier to test, change and expand. Though it does add a lot of operational complexity and stuff that can go wrong, so I wouldn't do it unless you need it or want to play with it. If nothing else, you'll get more comfortable with at-least-once delivery, idempotency and poison messages, which are pretty important concepts. (To make the write to the DB transactionally safe with the write to the queue, look up the "transactional outbox pattern".)
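
A rough sketch of that transactional outbox, with SQLite standing in for the application database and the actual RabbitMQ publish left as a placeholder:

    import json
    import sqlite3
    import time
    import uuid

    db = sqlite3.connect("app.db")  # SQLite stands in for the application DB
    db.execute("CREATE TABLE IF NOT EXISTS users (id TEXT PRIMARY KEY, email TEXT)")
    db.execute("""
        CREATE TABLE IF NOT EXISTS outbox (
            id TEXT PRIMARY KEY, topic TEXT, payload TEXT,
            created_at REAL, published INTEGER DEFAULT 0
        )
    """)

    def create_user(email):
        user_id = uuid.uuid4().hex
        with db:  # one transaction: the state change and the outbox row commit together
            db.execute("INSERT INTO users (id, email) VALUES (?, ?)", (user_id, email))
            db.execute(
                "INSERT INTO outbox (id, topic, payload, created_at) VALUES (?, ?, ?, ?)",
                (uuid.uuid4().hex, "v1.user.create",
                 json.dumps({"id": user_id, "email": email}), time.time()),
            )
        return user_id

    def relay(publish):
        """Run separately: drain the outbox to the broker, at least once.

        `publish(topic, payload)` is a placeholder for the real RabbitMQ publish;
        consumers must be idempotent, since a crash here can cause re-delivery.
        """
        rows = db.execute(
            "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY created_at"
        ).fetchall()
        for row_id, topic, payload in rows:
            publish(topic, payload)
            with db:
                db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))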


As a sibling comment mentioned, read DDIA: https://dataintensive.net/


I would like to note that many of these techniques can incur significant costs in developer or sysadmin time.


Try to convert as much content as you can into static content, and serve it via CDN. Then, use your servers only for dynamic stuff.

Also, put the browser to work for you, caching via Cache-Control, ETag, etc. Only then, optimize your server...


Many manga fans have a love/hate relationship with mangadex. On one hand, it's provided hosting for countless hours of entertainment over the years. Their "v3" version of the site was basically perfect from a usability point of view, to the point that the entire community chose to unite itself under its flag.

On the other hand, directly because of the above, their hasty self-inflicted take down earlier this year nearly killed the entire hobby. Many series essentially stopped updating for the ~5 months the site was down, and many more are likely never coming back again.

The decision to suddenly take the site down for a full site rewrite feels completely inexplicable from the outside. (Writeups like the above or the previous one [1], both of which read like they were written by a Google product manager, especially don't help, as they conspicuously avoid any comment on the one question on everyone's mind: "leaving aside the supposed security issues with the backend, why on earth also rewrite and redesign the entire front end from scratch at the same time?")

[1] https://mangadex.dev/why-rebuild/


I didn't get that take at all from the why-rebuild link. It seems reasonable to me - legacy code base, hard to maintain, with security problems that led to the massive data leak a while back. They also don't owe anything to anyone, and as a hobbyist project, they wanted to try something new. I'm impressed as they seem to have managed it - and the new site feels a lot more responsive than the old one.


Yeah. This whole mess pushed me to move everything I had (or could remember, anyway) to Tachiyomi [1], so I can hop between hosting websites freely without losing progress or access to old chapters (as long as I don't run out of local storage).

And while it works fine for reading, it kills any interaction with the hosting sites. No chance for monetization, socialization or anything else that can help sites survive long-term.

[1] https://tachiyomi.org/


Before MangaDex we had Batoto (the old Batoto, before some sketchy company bought the name), which was kind of the same: serving high-quality manga for most scanlators that wanted it (and also avoiding hosting pirated chapters from official sources, so kind of the same as MangaDex nowadays). As far as I remember, Batoto closed because of pressure from companies and also because of the high costs of running the site.

So yeah, considering how fragile maintaining a site like this is, it is always a good idea to sync your progress in a third party so it is easier to migrate if something goes wrong.

> And while it works fine for reading, it kills any interaction with the hosting sites. No chance for monetization, socialization or anything else that can help sites survive long-term.

BTW, MangaDex doesn't have monetization because it is strictly a hobby and also because it is a gray area to monetize this kind of work [1]. Also, their Tachiyomi client is official (the MangaDex v5 API was tested primarily via their Tachiyomi client before they finished the web interface).

[1]: both for the companies (which hold the copyright to the works hosted on those sites) and the scanlators (the fans who do the actual work of translating those chapters). Sites that host those chapters and monetize them are pretty much monetizing other people's work.


It's obviously not at the same scale as MangaDex, as they provide actual hosting for scanlation groups, but if you want to support the scanlator groups that do have their own sites, check out Kenmei, which is my take on tracking the series you read. It's specifically built with a scanlator-first approach, so that you actually go and visit their sites, helping them survive long-term, instead of hogging the traffic like Tachiyomi does.

https://www.kenmei.co/


Totally a self plug, but if you're looking to take it a step further, Kavita is a great program to host your own, Plex-like manga server.

https://kavitareader.com


That actually looks really interesting, thanks!


I never got the sense that the manga community hated MangaDex, and I've been following their development of v5 and the rise of other sites to use in their absence.

It’s seems weird to attribute Mangadex taking their site down for valid security concerns to the end of scandalization of certain series. That seems like entirely a Scan team problem if they decide not to upload via Cubari like other teams have done. And it doesn’t even matter since a series can get sniped at anytime.

It’s makes entire sense that if you’re going to rewrite the backend and API from scratch , you might as well do the front end too since it was a Goal from the beginning.


> their hasty self-inflicted take down earlier this year nearly killed the entire hobby

It won't kill the hobby. Because these scanlators are making mad money from ads, patreon, crypto mining. I'll never get why they don't get more aggressive take down notices from Chinese/Japanese/Korean publishers.


Copyright enforcement is actually quite expensive, both for the litigant and the defendant. The only way for it to be actually profitable to sue someone who is stealing your work is if they immediately settle, which is how copyright trolls operate. Everything else is a massive money pit for everyone involved, even the lawyers. Since this is an international enforcement action, the costs go up more, because now you need multiple legal teams on the bar in each jurisdiction, translators qualified for interpreting laws in foreign languages, knowledge of local copyright quirks, and a lot more coordination than just asking your local counsel to send a takedown notice locally.

(Just as an example of a local copyright quirk that will probably confuse a lot of people in the audience from Europe: copyright registration. America really, really wants you to register your copyright, even though they signed onto Berne/WTO/TRIPS which was supposed to abolish that regime entirely. As a result, America did the bare minimum of compliance. You don't lose your copyright if you don't register, but you can't sue until you do, and if you register after your work was infringed, you don't get statutory damages... which means your costs go way up.)

Furthermore, every enforcement action you take risks PR backlash. The whole fandom surrounding import Japanese comic books basically grew out of a piracy scene. Originally, there were no English translations, and the scene was basically reusing what we'd now call "orphan works". There used to be an unspoken rule among most fansubbers of not translating material that was licensed in the US. All that's changed; most everything gets licensed and many fan translators absolutely are stepping on the toes of licensees. However, every time a licensee or licensor actually takes an enforcement action, they get huge amounts of blowback from their own fans.


They get plenty of takedown notices, but they mostly get to hide behind services like Cloudflare who won't take action regarding these notices anyway. From the publishers/creators side, there is simply no effective way to take scanlators down.


I suspect it's because the international market for print manga (the primary cash cow) is rather anemic, particularly compared to anime.

Publishers see the loss as minimal and creators see piracy as free advertising to drum up enthusiasm for anime adaptations, which actually do drum up decent profits internationally (the committee keeps the streaming licensing fees, not the animation studio).


Publishers definitely don't see it that way; that's mostly an extension of a myth in order to justify the piracy.

Most manga publishers will see relatively little revenue from international anime releases. Even for domestic anime releases of the vast majority of titles, the manga publisher is only a small part of the anime production committee, and the hope is mostly that popularity of the anime can lead to increased sales of the manga, merchandise, or other events. So when the anime is released internationally, they get an even smaller cut of that because the international licensee also has to take their profit.

But other than mega-hit titles where an international anime release may also lead to significant international manga sales, the popularity of an anime adaptation overseas is practically irrelevant to the original manga publisher.


I don't understand. Why is 2k requests/sec supposed to be massive?

Try this yourself: write a simple web server in Go, host it on a cheap VPS provider, let's say at the option that costs $20/mo. Your website will be able to handle more than 1k/s requests with hardly any resource usage.

ok, let's assume you're doing some complicated things.

So what? You can scale vertically, upgrade to the $120/mo server. Your website now should be able to comfortably handle 5k req/s

Looking at the website itself, mangadex.org, it doesn't even host the manga itself. The whole website is just an index that links to manga on external websites. All you are doing is storing metadata and displaying it as a webpage. The easiest problem on the web.

So, I really don't understand the premise behind the whole post.

The problem statement is:

> In practice, we currently see peaks of above 2000 requests every single second during prime time.

This is great in terms of success as a website, but it's underwhelming in terms of describing a technical problem.


> Try this yourself: write a simple web server in Go, host it on a cheap VPS provider, let's say at the option that costs $20/mo. Your website will be able to handle more than 1k/s requests with hardly any resource usage.

These people have never heard of Go, obviously. The likely scenario is not that you haven't fully understood their constraints or requirements, it's that you're just smarter than they are.

> So what? You can scale vertically, upgrade to the $120/mo server. Your website now should be able to comfortably handle 5k req/s

> Looking at the website itself, mangadex.org, it doesn't even host the manga itself. The whole website is just an index that links to manga on external websites. All you are doing is storing metadata and displaying it as a webpage. The easiest problem on the web.

Take that order of magnitude cheaper, single VPS server solution you're proposing and build something with it. Sounds like you'd make a lot of money. There has to be a business idea around "storing metadata and displaying it as a webpage" somewhere? Easiest problem on the web.

The peanut gallery at HN is out of control. People who don't do / build explaining to the people who do how much easier, simpler, and better their solutions would be.


> The likely scenario is not that you haven't fully understood their constraints or requirements, it's that you're just smarter than they are.

I never claimed to be smarter. I just understand some things that I noticed a lot of people in the industry don't understand.

My understanding is not even that great.

But still, this is just one example that I keep running into over and over and over:

People opting for a complicated infrastructure setup because that's what they think you should do.

No one showed them how to make a stable reliable website that just runs on a single machine and handle thousands of concurrent connections.

It's not hard. It's just that they've never seen it and assume it's basically impossible.

There are areas about computing that I feel the same way about. For example, before Casey Muratori demoed his refterm implementation, I had no idea that it was possible to render a terminal at thousands of frames per second. I just assumed such a feat was technically impossible. Partly because no one has done it. But then he did it, and I was blown away.

> Take that order of magnitude cheaper, single VPS server solution you're proposing and build something with it. Sounds like you'd make a lot of money.

Building something and making money out of it are not the same thing. But thanks for the advice. I'm in the process of trying. I know for sure I can build the thing, but I don't know if it will make any money. We will see.

> People who don't do / build explaining to the people who do how easy, simple, better their solutions would be.

I do and have done.

This kind of advice is exactly the kind of thing I know how to do because I have done it in the past using my trivial setup of a single process running on a cheap VPS. And I have also seen other teams struggle to get some feature nearly half-working on a complicated infrastructure setup with AWS and all the other buzzwords: Kibana, Elastic Search, Dynamo DB, Ansible, Terraform, Kubernetes ... what else? I can't even keep track of all this stuff that everyone keeps talking about even though hardly anyone needs at all.

I've seen 4 or 5 companies try to build their service using this kind of setup, with the proposed advantage of "horizontal" and "auto" scaling. And you know what? They ALL struggled with poor performance, ALL THE TIME. It's really sad.


> No one showed them how to make a stable reliable website that just runs on a single machine and handle thousands of concurrent connections.

What is a reliable website? What does this website do?

If given a static constraint, like serve 2000 requests per second with 99.999% uptime, and enough time, I'm sure you can optimize it to be as efficient as you'd like. But that's not our exercise. Bespoke, custom solutions that are not the core of the business are not solutions. Repeat that a dozen times.

MangaDex's business is not to be the most efficient website possible. Their business, I assume, is content and features for their users. They pick off the shelf technologies to do it because it's well documented, proven, and most importantly already built.

They compose these technologies to solve their business problems. Often there's a mismatch or an overlap in functionality that introduces inefficiency (complexity, cost, performance), but that's a trade-off MangaDex and many other businesses make. We can judge how poorly or well they've made some trade-offs based on their business and overall requirements.

You coming here and telling people that you can run it on a VPS ignores all of the above. And Casey has a YouTube channel where he makes a game his own way, I would assume because working to solve uninteresting business problems (and possibly dealing with co-workers who may pull down N projects instead of building things themselves) wasn't a space he was interested in.

There's a difference between a purely technical challenge, and working with complex, interacting systems like ... people and business requirements and laws and regulations and auditing and hiring and security and who the hell maintains this system when Bob quits. Conflating the two is the root of the comments like yours, I'd think.

> It's not hard. It's just that they've never seen it and assume it's basically impossible.

I'm sorry, but who thinks only serving 2000 requests per second is hard? Or do you assume they think it's hard because you misunderstood or are unaware of their 100s of other requirements that they need to solve for in addition to serving 2000 requests per second?

I'm going on and on in this thread mainly because I'm tired of people assuming they know better than the people in the trenches making these decisions. You're assuming you're more knowledgeable and skilled than they are in their own problem space! They're obviously unaware that using Go to make a server would solve their problems, otherwise why wouldn't they have done it?

FYI I work (and have worked) for large tech companies (think silly acronym). I'm not even in this space, as in the type of problems I face are quite different, but I can respect the authors of that article enough to nod along, shrug, and not assume I know better.


> What is a reliable website? What does this website do?

It performs well and doesn't randomly go down when someone posts a link to it on HN or tries to put in bad input.

> I'm sure you can optimize it to be as efficient as you'd like.

Wrong! I've said nothing about optimizing things.

All I'm advocating is simple solutions that are proven to work.

A web server in Go is far from efficient. An optimized server in C/C++ can probably perform 20x better than a Go server. If not more.

However, a web server in Go makes far more reasonable use of system resources to achieve the desired goals. It's also pretty reliable.

> Bespoke, custom solutions that are not the core of the business are not solutions. Repeat that a dozen times.

I don't understand the point of this sentence.

Are you saying that Kubernetes or Elastic Search or AWS or any of the other buzzwords are at the core of their business?

Clearly they are not.

> MangaDex's business is not to be the most efficient website possible. Their business, I assume, is content and features for their users. They pick off the shelf technologies to do it because it's well documented, proven, and most importantly already built.

It's in the interest of their business to lower their cost of operations. Building on a complicated infrastructure when you don't need to incurs a lot of cost: not just the monthly bill ($1500/mo) but the cost of the staff needed to understand and maintain this infrastructure.

It's not the kind of thing that is easy to maintain.

To be completely frank with you, I myself am not capable of understanding or maintaining such a system. And almost every company I've been at had no one who understood how the system really works. Someone set things up at some point by following some tutorials. When things go wrong, people panic and go into firefighting mode. They spend hours trying to make sense of what's going on, usually involving multiple people - because it's not a task that a single individual can handle.

> They compose these technologies to solve their business problems. Often there's a mismatch or an overlap in functionality that introduces inefficiency (complexity, cost, performance), but that's a trade-off MangaDex and many other businesses make. We can judge how poorly or well they've made some trade-offs base on their business and overall requirements.

You are talking as if these off the shelf technologies are reliable and easy to implement or integrate.

From what I've seen, these solutions are a lot more complicated than what I'm proposing.

Every place I've been to that tries to take this approach ends up burning too much money and resources trying to make their thing work.

It's not as if these companies don't have to write code to make their product work. You still have to write code anyway. So, why not, instead of writing tons of glue code and configuration files to hopelessly integrate a hodge podge of tools and frameworks ... why not just write the simple code that just does the thing you want?

> I'm going on and on in this thread mainly because I'm tired of people assuming they know better than the people in the trenches making these decisions. You're assuming you're more knowledgable and skilled than them in their own problem space! They're obviously unaware that using Go to make a server would solve their problems otherwise why wouldn't they have done it?

The first company I've been at that was doing this kind of thing was spending upwards of $10k/mo on the most beefed-up server that AWS provides to host the database server, and they still struggled to serve more than 1000 users concurrently.

According to you, I'm not in a position to give them suggestions or advice about how to fix this problem!!

> I'm sorry, but who thinks only serving 2000 requests per second is hard? Or do you assume they think it's hard because you misunderstood or are unaware of their 100s of other requirements that they need to solve for in addition to serving 2000 requests per second?

What are the other 100 requirements that are not fulfilled by the thing I'm proposing?!


There are plenty of people who build here on HN (more than most other sites) and the requirements are pretty clearly described in the article.

While it's not as simple as a Go program on a VPS, there is certainly a lot of unnecessary overhead here. I think you underestimate just how much poor and wasteful engineering there is out there.


> While it's not as simple as a Go program on a VPS, there is certainly a lot of unnecessary overhead here. I think you underestimate just how much poor and wasteful engineering there is out there.

I don't underestimate poor and wasteful engineering at all, but that's not what I saw in the article.

Serving traffic is a single element of their design. They also designed for security, redundancy, and observability, all with their own solutions, because using a service or a cloud provider would be too costly. With that in mind, it's not a charitable view to think they didn't explore low-hanging fruit like "make the server in Go". If you think you can do better, detail in depth how, and solve all of their requirements rather than just the single piece you're familiar with.

And if you can do the above holistically, for an order of magnitude below their costs, it sounds like I need to get in touch to throw money at you.


My background is in adtech, which is a unique mix of massive scale, dynamic logic, strict latency requirements, and geographical distribution. I've built complete ad platforms by myself for 3 different companies now, so I can confidently say that this is not a difficult scenario. It's a read-heavy content site with very little interactivity or complexity to each page and can be made much simpler, faster and cheaper.

> "detail in depth how "

This thing seems to be little more than a very complex API and SPA sitting on top of Elasticsearch. These split frontend/backend sites are almost always a poor choice compared to a simple server-side framework that just generates pages. ES itself is probably unnecessary depending on the requirements of their search (it doesn't seem to be actual full-text indexing of the content, just the metadata). The security and observability work also tends to be a problem of their own making and a symptom of too much complexity.


> My background is in adtech, which is a unique mix of massive scale, dynamic logic, strict latency requirements, and geographical distribution. I've built complete ad platforms by myself for 3 different companies now, so I can confidently say that this is not a difficult scenario. It's a read-heavy content site with very little interactivity or complexity to each page and can be made much simpler, faster and cheaper.

I don't dispute this or your credentials. You've built critical systems in a space where it was a core of the business. If given time, and resources, I have no doubt you could build a custom solution to their problem that was more efficient.

Unstated in this is the type of business MangaDex is, which I have the following assumptions about. I don't think it's unfair to assume that we're mostly on the same page here:

- Small to mid size, at most

- Small engineering team. Need to develop, deploy, support, and maintain solutions.

- Lacks deep systems expertise, or is unable to attract talent that has that expertise ($)

These characteristics are very common in our space. To solve their technical problems, most of the time, they reach for an open source solution (after examining the alternatives like a service).

Now the question is: given those constraints, and their other business requirements, how do they best optimize for the dimensions they care about? Everything is a trade-off. Everyone who builds knows this. It's unkind to pretend this is a purely technical exercise. And after reading their article, it's obvious they know some of the trade-offs they're making, so it's unkind to suggest a naive solution that does nothing but make you feel smarter. I'm not saying you did the above, but some of these comments are outrageous.


If you have a lot of extra money to throw I'd be happy to oblige.


> People who don't do / build

I can and do frequently advise on certain topics in comments specifically because I do build and can in fact speak of such topics authoritatively. Isn't that what this website is for?

That said, the post you are replying to is perhaps overly dismissive of the criteria that this website operates under. Other comment chains have some really good advice though.


You ignore the weight of the requests and the general situation of this project. This is not your average mommy-blog that doesn't care much about the occasional bit of downtime. This is a website with illegal content, under constant attack, with some pretty dynamic content on top, and likely with satisfying their community as the main goal. So most of their budget will go to security and redundancy, to protect themselves and allow high uptime.

Where you can use 1 server, they will need something around 20 servers. Where you can use a cheap VPS provider, they must use an expensive, shady provider who will take the heat of legal attacks. And so on and on... because of their situation they have a bunch more budget-eating requirements than your average website, leading to a rather heavy, complex and thus expensive architecture.

Surely there is still room for optimization, but it seems this is a rather new redesign from scratch(?), so some of the details will need time.


They also host the manga. It's not just a link farm. Because they host the content themselves... that's why they use Ceph.

Their goal is for scanlators to have a place to post their new translated manga, rather than always linking it off from some Wordpress instance.


>Looking at the website itself, mangadex.org, it doesn't even host the manga itself.

They do seem to. Clicking on a random manga there, the images are hosted on their servers [0]. Also, I guess some of those are much bigger images, which are less trivial to serve at that rate than a 10kb static page.

0. blob:https://mangadex.org/e78bd61a-e761-4a73-a27c-5f58394e7ea4


Blob links are scoped to your browser tab, they're not real internet URLs.


> This is great in terms of success as a website, but it's underwhelming in terms of describing a technical problem.

A bit of an intro punchline, even though I agree it admittedly doesn't say much by itself :)

Fwiw most of the work comes from there being little "static" traffic going on -- images and cacheable responses are not very CPU-intensive to serve -- but what isn't static (which is a good chunk of it) is more problematic; more to come on that.


The only releases that link to external websites are the ones from sites such as MangaPlus and BiliBili (And delayed releases if you count those)


I agree, just had to read the article again, and took it as a fancy way of wasting money really.


Not familiar with the project but it is great to see a counterpart to over-provisioned enterprise infrastructure. $10 in 2021 can do what $100 in 2011 did, what $1000 in 2001 did, and that is not solely due to hardware. Well-designed deployments of K8s, KVM/LXC, Ceph, LBs like this project can handle so much more traffic than poorly configured Wordpress storefronts.

They're using battle-tested tech from Redis and RabbitMQ to Ansible and Grafana. Nothing super fancy, nothing used just for the sake of being modern. Not sure how long it took them to end up with this architecture but it doesn't look like a new dev would have a hard time getting familiar with how everything works.

Would definitely like to hear more about their dev environment, how it is different from prod, and how they handle the differences.


> Would definitely like to hear more about their dev environment, how it is different from prod, and how they handle the differences.

It's honestly quite boringly similar (hence why it's only vaguely alluded to in the article)

Take out DDoS-Guard/External LBs (no need for publicness of it), pick a cheap-o cloud provider to get niceties like quick rebuilding with Terraform etc, slap a VPC-like thing to make it a similar private network (do use a different subnet so copypasting typos across dev and prod are impossible) and scale down everything (ES node has 8 CPUs and 24GB ram in prod? It will have to do with 2vCPUs and 2GB RAM in dev)

One of the annoying things is you do want to test the replicated/distributed nature of things, so you can't just throw everything on a single-instance-single-host because it's dev, otherwise you miss out on a lot of the configuration being properly tested, which ends up a bit costlier than necessary


I think enterprise and larger shops optimize for business flexibility and the ability to A/B test very rapidly over a finely crafted piece of efficiency, for better or worse. The people behind this probably do this for their day job, or are teens that are about to do it for their day job.


I agree with you. I mostly work in enterprise and understand that it has different needs and ROI requirements. However, my personal mindset is that computers and networks are really really fast now and it's a tragedy that most of these gains are nullified due to unoptimized layers of abstraction or over-architecting. So it's a welcome sight to read about well-designed infrastructure like this.


It happens because businesses are optimizing for resources that are ultimately more expensive or scarcer: staffing levels and the ability to respond to the market, so the business can grow or survive longer. Inefficient computing architecture as a side effect is a worthwhile trade-off to them in light of that.

But as a craftsman, it is definitely nice :)


What kills me is that this was a rather pedestrian outcome on a much cheaper 2-core virtual machine back in 2007 or so.

I easily got 3K requests / sec out of my laptop at the same time, and it was not a trivial app!

People's expectations have shifted so much it's absurd. If you look at the TechEmpower benchmarks, ordinary VMs can easily push 100K requests per second, no sweat, even with managed languages.

Trivial stuff like static content being treated as static content (files on the disk!) not as a distributed cache in front of a database can do wonders.

Am I just old and jaded?


Excellent post, good technical content, amazing feat.

That said, I echo that the amazing feat is that they can fit modern inefficient tool choices with poor mechanical sympathy into that budget. The last decade of web-dev tooling has been pushing the TCO of systems through the roof and this post is all about how to struggle against that whilst using those tools.

If they went old-school they'd get another order of magnitude savings. Many veterans know of systems doing 10x that in 10x less cost. Remember C10K was in 1999.


> If they went old-school they'd get another order of magnitude savings. Many veterans know of systems doing 10x that in 10x less cost. Remember C10K was in 1999.

How can I learn more about the old-school way without getting a job related to it? Like, topic or book recommendations.


I'm also curious about this, since in many areas of computing (not only webdev) the old-school guys take some stuff as so obvious that they don't even bother writing about it or explaining it beyond "This is obviously possible, dude". They know how to achieve this level of performance, but for everyone else, we have to cobble together fragmented insights. So if anyone out there reads this and thinks like the GP, please do write about it ;)


What topics specifically are you interested in? And where do all the people like you hang out?

I'm not an old-school guy by any means .. but I might have something to contribute.


> What topics specifically are you interested in?

Well, for example, what's the old-school alternative to mangadex's solution?

> And where do all the people like you hang out?

We are here on HN.


No. I think you have a healthy perspective and we should all be questioning if current trends are beneficial/sustainable.

I haven't read the article, but the headline alone seems alarming to me: $1,500 a month is a lot of money for only 2k rps.


Maybe it's a lot of money just for the web servers, but for the entire infrastructure stack it's pretty reasonable IMHO.


There is more to it than HTTP request/response. MangaDex also needs to store a lot of images and distribute them.


CDNs have already solved this problem and are much cheaper than $1500/month.

I've run far more complex sites with much higher traffic for less.


Mangadex can't use Cloudflare for privacy reasons. They may be facing similar issues with other popular CDNs.

I am sure they must be using some kind of CDN; however, those options are unlikely to be free.


Privacy reasons? It's all static content that is publicly accessible. I don't understand what the privacy reasons could be under this context.

Are they worried about CDNs logging the images their visitors access? Seems like an absurd edge case to worry about in my opinion.

> however, those options are unlikely to be free

I wasn't even talking about free CDNs :)


They're basically hosting illegal content, or at least a good chunk of it is copyright-infringing, so they cannot use Cloudflare or any of the other off-the-shelf offerings.


Not disputing your statement, just made me laugh a bit because almost every single site I visit recently that offers links to copyrighted content [stored in file lockers] sits behind Cloudflare


I wouldn’t say illegal content, since most of that gets removed relatively quickly. But definitely a lot of content in the grey zones of copyright law.


Hm? There are lots of copyright-infringing sites using Cloudflare, and Cloudflare seems pretty content to generally ignore infringement notices.


I see. That does complicate things somewhat.

I wonder if there's merit in them approaching studios with a proper business plan?


Not with the way the manga, manhwa, and webtoon industry is tied up. But I believe that is their end goal eventually.


I think privacy as in the MangaDex team not wanting to get sued. So they avoid popular services that are more willing to share their identity.


Then you should read the article ;)


I think even a distributed cache in front of a database shouldn't have any trouble handling 2000 requests per second.

The issue is not really the number of requests per second, probably, but the number of bytes, which they don't talk about at all in the article; reading manga with no ads is a pretty static kind of application, which could be satisfied amply with a web browser or even a much simpler program loading images from a filesystem directory.

Valgrind claims httpdito runs a few thousand instructions per request, but that's not really accurate; what happens is that the kernel is doing all the work. httpdito on Linux can handle about 4000 requests per second per core, nearly a million clock cycles per request, almost all of which is in the kernel. Of course it doesn't ship its logs off to Grafana. In fact, it doesn't have logs at all. But it would work fine for reading manga.


> The issue is not really the number of requests per second, probably, but the number of bytes, which they don't talk about at all in the article; reading manga with no ads is a pretty static kind of application, which could be satisfied amply with a web browser or even a much simpler program loading images from a filesystem directory.

I assume they are talking about their more dynamic content serving in this post (for things like search, tracking which chapters are read, new chapter listing based on what user follows etc.).

They have a custom CDN that is hosted by volunteers to serve the images for the manga pages. They provide some metrics for that at https://mangadex.network, there are also some older screenshots where they hit 3.2GB/s.


Interesting! Thanks! It still doesn't sound like the kind of thing that would require load balancing, but maybe it was easier to write it in Python or PHP or something, and that made it so heavy that it did.


> distributed cache in front of a database

Already overkill.

Think smaller. Think simpler.

A single machine serving files directly from the file system (yes, from the SSD attached to the machine) will be able to handle a LOT more.


Well, that's what httpdito does: it serves files directly from the filesystem. That's why I mentioned it. But, for some applications, such as the website you're using right now, it's useful to display pages that haven't been previously stored in the filesystem.


Also, those 2000 requests per second have to happen 24/7, not just in a quick demo session.


httpdito is nothing if not consistent in its performance. It doesn't have enough state to have many ways to perform well at first and then more poorly later, or for that matter vice versa. (Not saying it couldn't happen, but it's not that likely.) Linux is pretty good about consistent performance, too, though it has more state.


In 2011 a company I contracted for was testing some new Dell 1U servers with around 1-2TB of RAM. There was a Postgres database at 4000qps that could fit into tmpfs, so I restricted Postgres to 640Kb of memory and we got replication working; it took about 6 hours of babysitting.

We threw the switch and watched as Postgres, with 640Kb of RAM and a tmpfs-backed store, proceeded to handle all of the query traffic. There were some stored procedures or something that were long-querying or whatever - I'm not a DB person at all - so we switched back to the regular production server about 8 minutes later.

Yes, we did it in production.


Postgres handles low memory situations well. It'll kill memory intensive queries before it crashes. I wonder if your application was getting a lot of errors back instead of successful queries :)


The application performed fine, even though we made the switch around 15:00 PST. The DBA was concerned because of the few long queries.

Obviously the tmpfs was doing the heavy lifting there - and if I had to do a postmortem, I'd wager that filling the OS caches was the main reason the long queries took so long. We didn't do any sort of performance tracing.

The main purpose was to show that these $35k servers could essentially replace the older machines if need be, even though the old ones had FusionIO. I just removed the middleman of the PCIe bus between the application and the memory. There was a near-constant argument on the floor about whether or not we could feasibly switch to SSDs in some configuration over spinning rust or even FusionIO; I wanted a third option.

Basically, serve out of registered ECC memory in front, replicate to the FusionIO, and let those handle the spindled backups, which IIRC were a pain point.


For real, 640 kilobits?


K isn't the abbreviation for kilo, so if you're going to rag on the fellow for the 'b', then you should at least be asking what a Kelvin*bit is.


640KiB is very little and I'm wondering if it's a typo, given that the servers had 1-2TiB available. Postgres 9.0 released in 2010 already had 32MiB as the default for shared_buffers (with a minimum of 128KiB): https://www.postgresql.org/docs/9.0/runtime-config-resource.... and 8.1 released in 2005 used 8MB (1000*8KiB): https://www.postgresql.org/docs/8.1/runtime-config-resource....


I interpreted it as "we wanted to turn the shared buffers ~off, but in a hilarious way that would suggest to someone reading the configuration file that something was going on" (Bill Gates, mumble mumble)

but, wtf do I know, I'm the crazy guy who tries to interpret comments generously.


Yes, it was a direct reference to Bill Gates' "640 kilobytes is enough for anyone", and I typed the comment right before I fell asleep.


“The binary meaning of the kilobyte for 1024 bytes typically uses the symbol KB, with an uppercase letter K.” [0]

0. https://en.m.wikipedia.org/wiki/Kilobyte


The question was more about the kilo part, even though I didn't clarify. Seems orders of magnitude too small?


2k sockets on a test bed vs 2k real user requests in production is very different. I doubt you ran a top-1000 Alexa site on your laptop. Today we also need to deal with SSL, which eats into the performance budget.


> SSL which eats from the performance budget.

That was a short-lived thing, and has now become a myth perpetuated by companies like Citrix and F5 that sell "SSL offload" appliances for $$$.

Have you benchmarked the overhead of TLS?

In my experience, a single CPU core can easily put out multiple gigabytes per second of AES-256 (tens of gigabits). This benchmark shows 3 GB/s (24 Gbps) for recent AMD CPUs, and nearly 40 Gbps per core for an Intel CPU: https://calomel.org/aesni_ssl_performance.html

A multi-core server is very unlikely to have more than a 1-5% overhead due to TLS. Even connection setup is a minor overhead with elliptic curve certificates.

This is thanks to the AES offload instructions, which are present in all server CPUs made any time in the last 5-7 years or so. As long as the modern Galois Counter Mode (GCM) is used with AES, performance should be great.
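If you want to sanity-check numbers like that on your own hardware, OpenSSL's built-in benchmark is usually enough (a quick sketch; exact algorithm names vary a bit between OpenSSL versions):

    # single-core bulk AES-256-GCM throughput, userspace only, no network involved
    openssl speed -evp aes-256-gcm
    # rough handshake cost: P-256 signatures and key agreements per second
    openssl speed ecdsap256 ecdhp256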

Meanwhile, Citrix ADC v13 with a hardware "SSL offload card" actually slows down connections! I had a very hard time getting more than 600 Mbps through one. It seems to be the way the ASIC offload chip is architected: it seems to use a large number of slow cores, a bit like a GPU. This means that any one TLS stream will have its bandwidth capped!


The problem with these benchmarks is that they measure the bandwidth you can push through an established TLS connection. Try building 2000 new TLS connections a second (yes, many are still active and don't need to be re-established); that is the really slow part, not sending data over already-established channels.
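A crude way to see that difference against any one endpoint is openssl s_time, comparing full handshakes to resumed sessions (a sketch; s_time is single-threaded, so treat the absolute numbers as a floor, and example.com is just a placeholder):

    # connections per second with a full handshake every time
    openssl s_time -connect example.com:443 -new -time 10
    # same, but reusing the session where possible
    openssl s_time -connect example.com:443 -reuse -time 10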


This exactly. Most CPU time in our Haproxy is spent on crypto/new TLS sessions handling thousands of new connections per second.


To add to that, in 2015 a Cloudflare engineer showed that you can receive 1 million packets per second (https://blog.cloudflare.com/how-to-receive-a-million-packets...). Without processing though.


You're not alone in making comments like this, and as I read them I'm not sure if people are missing the point or being disingenuous.

Much of the work I've done professionally has happened in the webhosting space, and I'm not a new hand at it - the first "professional" website I ever ran was hosted off of a machine running on a Quantum R5000. I have served (and still serve) plenty of static content as files off the disk. My own first impulse when building anything is to use as few moving parts and as simple of a setup as possible.

Requests per second are not a good metric for the amount of work being done by a system, because not all requests are made equal. You say you got 3k requests out of your laptop and that it wasn't a trivial app, but you have only provided a trivial amount of information. Taking a quick look at the features provided by the site, they offer quite a bit of flexibility on both what is and how it's displayed. Filtering options based on original and translated language, adult content filtering, multiple types of search, request routing to low quality images to save on data, category and tag filtering, tracking of what you have read and what your progress is on it, follows and notifications, permissions systems for uploading and updating content, etc. This is all on top of the basic "Display images and metadata about the images" functionality.

They have concerns around privacy and being able to withstand active attackers due to the content they host. They have their own requirements about logging, analytics, etc. Their own concerns about data availability.

You could not meet the design requirements of this website and serve 2k requests per second on a 2 core VM in 2007. I'm not saying there aren't inefficiencies in their architecture and further places where they could save money/increase performance/etc., but acting like two cores of lower-clockrate lower-ipc compute from a decade and a half ago could do all of the work needed to support the features and design requirements they have for this website is pretty disparaging towards the people who built this infrastructure.


Not just you, but if it works for them, that's completely fine.

But there are many ways to achieve 20K RPS without this type of architecture, and especially without k8s, for less than $1,500.


>20k RPS.

If this metric is what you are chasing, there are ways to reliably break 1 million RPS using a single box if you don't play the shiny BS tech game. The moment you involve multiple computers and containers, you are typically removed from this level of performance. Going from 2,000 to 2,000,000 RPS (serialized throughput) requires many ideological sacrifices.

Mechanical sympathy (ring buffers, batching, minimizing latency) can save you unbelievable amounts of margin and time when properly utilized.


I frankly don't see where containers could lower the performance.

Basically a container is a glorified chroot. It has the same networking unless you ask for isolation, in which case packets have to follow a local (inside-the-host) route. It has exactly no CPU or kernel-interface penalty.

Maybe you meant container orchestration like k8s, with its custom network fabric, etc.


> I frankly don't see where containers could lower the performance.

Have you seen most k8s deployments? It's not the containers, it's the thoughtspace that comes with them. Even just using bare containers invites a level of abstraction and generally comes with a type of developer that just isn't desirable.


So it's not due to containers themselves, but nowadays "containers" in prod tend to mean k8s or similar clustering.


Even loopback is significantly slower than a direct method invocation.


They mention a $1,500 budget per month but then omit things critical to understanding how they achieve that cost point.

What is actually more interesting is to understand what portion is spent on servers versus bandwidth - and what hardware configuration they use to host the site. For example, is $1,500/mo just paying for colo costs + bandwidth, with already-owned recycled hardware (think last-gen hardware that you can get at steep discounts from eBay / used hardware resellers)?

That would have been way more interesting to know given the blog title than the choice of infrastructure software they use.


I'm amazed that their architecture doesn't include a CDN. These days I expect nearly all high traffic websites to make use of a CDN for all kinds of content, even content that's not cached.

They cited Cloudflare not being used due to privacy concerns. It'd be interesting to hear more about that, as well as why other CDNs weren't worth evaluating too.


What they are doing is unfortunately not legal. There are precedents of Cloudflare ratting out manga site operators, which have led to arrests [1] (the person who ran Mangamura got a 3-year sentence and a $650k fine [2]). And at some point they were going after MangaDex via the same route too [3].

A lot of their infrastructure design choices should be viewed with OPSEC constraints in mind.

[1] https://torrentfreak.com/japan-pirate-site-traffic-collapsed...

[2] https://torrentfreak.com/mangamura-operator-handed-three-yea...

[3] https://torrentfreak.com/mangadex-targeted-by-dmca-subpoena-...


Which is interesting considering they take no issue with sites like KiwiFarms which harass people to literal death, terrorist groups, criminals (carders, phishers, etc.), racists & other forms of hate speech.

I guess it all depends on how much money you bring in for them really.


It's effectively a warez site. There's a reason why they host in the places they do and can't be too picky about providers.

CF will also pass through things like DMCAs easily.

Based on their sidebar, it's probably hosted at Ecatel or whatever they are called now (cybercrime host) via Epik as a reseller, the provider famous for hosting far-right stuff.


What’s the reason behind where they host and having issues with providers? I haven’t heard this before

Regarding DMCA’s, as an entity doing business where they’re legal, what should they do as a middle man?


> Regarding DMCA’s, as an entity doing business where they’re legal, what should they do as a middle man?

Don't use them and instead have your middleman be in a country that ignores intellectual property rights and copyright?

I'm not saying CF is wrong to pass them through. I'm just saying CF is not the right choice for a warez site for longevity.


They do have a crowdsourced CDN called Mangadex@Home. I participated in it from last year until the site was hacked. The aggregate egress speed was around 10 Gbps.

The NSFW counterpart of MD also has a CDN appropriately named Hentai@Home run by volunteers.

These 2 sites are the only ones I know of rolling their own CDN for free.


I think the usual argument re: Cloudflare on the privacy front is the fact that they pretty aggressively fingerprint users, and will downgrade or block traffic originating from VPNs or some countries. This is a natural side effect of those things often being tied to abusive traffic, and a lot of it is likely configurable (at least on their paid plans) but it often comes up around this.


What's the benefit of a cdn if nothing is cacheable? Slightly lower latency on the tcp/tls handshake? That seems pretty insignificant.


The CDN part is kind of pointless because they can't really have nodes in large parts of the western world since.. it's a warez site. The CDN providers will get takedowns, requests to reveal the backing origin, etc. You can't use a commodity CDN provider for this.


In their case (manga), seems like the vast majority of the content is cacheable.


Latency makes a bigger impact on UX than throughput for general browsing. A TLS handshake can be multiple roundtrips that greatly benefit from lower latency, especially mobile devices.

Modern CDNs also provide lots of functionality from security (firewall, DDOS) to application delivery (image optimization, partial requests).


Properly tuned NGINX on a physical server can handle an incomparably larger load of static content than some of the "cloud" storage offerings around.

The "trick" has really been known for a decade or more: have as many things static as possible, and only use backend logic for the bare minimum.
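For the static side, a handful of stock nginx directives get you most of the way there; something along these lines (a sketch only -- the path and values are plausible starting points, not anyone's actual config):

    # serve pre-generated pages and images straight off the local disk
    location /static/ {
        root /srv/site;
        sendfile on;                              # kernel copies file -> socket
        tcp_nopush on;                            # fill packets before sending
        open_file_cache max=10000 inactive=60s;   # cache fd/metadata lookups
        expires 30d;                              # let browsers and caches keep it
    }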


That's the raison d'être of nginx, so it is performant for this kind of thing. However, the advantage of a CDN is that they have points of presence around the world, so your user in Singapore doesn't have to do a trip around the world to get to your nginx on a physical box in Lisbon.


I am running an app with 10,000 incoming rq/s on average. It's running on eight 8-core Hetzner VMs. Most requests are static data calls like images, JSON and text. About 5% is MySQL and other IO operations. I pay about 300 euros a month for this setup. Quite happy with it.


I have nothing but respect for the whole team. Dedicating their time to build everything from scratch, not to mention that they maintain everything for free... It's a cool project; not sure if there's a way for anyone to contribute.

I'll join the Discord after work to see if they need an extra hand.

Gee, how do these people find other people online to work on all of the cool projects? I would love to join rather than playing games after WFH on the same pc over and over again lol


Okay, but isn't most of their content stolen? Why would you want to contribute to that?


The world of scanlations is always on edge. Usually, when publishers announce official translations of manga titles, fans drop their translations of that title. It's not rare for publishers to hire the fans who were translating a title for free as the official team.

To be more precise, the real reason such sites stay alive is that they delete titles that get licensed in Europe and the USA. Publishers can still measure the popularity of titles and buy the legal rights to publish the ones that are popular enough. It's harder to find manga "raws" than translated versions.

And by that, they're not 100% "illegal" for the Western world, and Asian companies are not so interested in fighting scanlations because they need to combat piracy in their own part of the world.


Heck no. As per the Berne Convention they are 100% illegal even in the Western world and can only survive due to neglect or a lack of legal resources---I have seen multiple cases where artists were well aware of scanlations but couldn't fight against them because of that. A legal way to do scanlation would always be welcomed (and there have been varying degrees of success in other areas), but it is just wrong to claim that they are somehow legitimate at all.


No, intellectual "property" can not be stolen. You are thinking of copyright infringement.


Maybe because they enjoy the interesting domain and challenges of the area, look at a project like Dolphin for example.

Also, some people hold the view that things like information, media, code can not be “stolen” in the traditional sense, so that further reduces any qualms about associating themselves with it.


Just curious if anyone reading this knows the answer: Would it be illegal to contribute man-hours on e.g. implementing features or fixing bugs on a project like this, or does that only apply to whoever actually hosts the content?


MPA tried to get the source code for Nyaa.si removed from GitHub because the "Repository hosts and offers for download the Project, which, when downloaded, provides the downloader everything necessary to launch and host a “clone” infringing website identical to Nyaa.si (and, thus, engage in massive infringement of copyrighted motion pictures and television shows)".

It was a completely retarded play on MPA's part and they only managed to get the repo down for days until GitHub restored it even without hearing from the repo owners. So really they only brought about some minor nuisance alongside a bunch of headlines to advertise Nyaa.si for the rest of the world.

https://torrentfreak.com/mpa-takes-down-nyaa-github-reposito...

https://torrentfreak.com/github-restores-nyaa-repository-as-...


A good lawyer would probably say something like "it depends".

It's entirely possible for a copyright owner to construe some kind of secondary liability based on your conduct, even if the underlying software is legal. This is how they ultimately got Grokster, for example - even if the software was legal, advertising its use for copyright infringement makes you liable for the infringement. I could also see someone alleging contributory liability for, say, implementing features of the software that have no non-infringing uses. Even if that turned out ultimately not to be illegal, that finding would come at the end of a long, expensive, and fruitless legal defense that would drain your finances.

In other words, "chilling effects dominate".


Depends on what you consider "stolen". In most cases, the manga that is available has been translated and edited by fans to make it accessible to English speakers when the IP owners do not see a reason to do it themselves. The amount of manga that actually gets an official English release is very small, and western licensing companies do not have many incentives to start picking up obscure manga that no one without the ability to read Japanese has heard of. They're much better off going after manga that have already been made popular by fan translations, or that have some other property that has caught traction (for example, manga with an anime adaptation that has official or unofficial subtitles).


It's what most of the world considers stolen.

Scanlations are often viewed by fans as the only way to read comics that have not been licensed for release in their area. However, according to international copyright law, such as the Berne Convention, scanlations are illegal. [1]

This is a snippet about the Berne Convention:

The Berne Convention for the Protection of Literary and Artistic Works, usually known as the Berne Convention, is an international agreement governing copyright, which was first accepted in Berne, Switzerland, in 1886. The Berne Convention has 179 contracting parties, most of which are parties to the Paris Act of 1971.

The Berne Convention formally mandated several aspects of modern copyright law; it introduced the concept that a copyright exists the moment a work is "fixed", rather than requiring registration. It also enforces a requirement that countries recognize copyrights held by the citizens of all other parties to the convention. [2]

  [1] https://en.wikipedia.org/wiki/Scanlation#Legal_action
  [2] https://en.wikipedia.org/wiki/Berne_Convention


Yes, of course it is stolen. And people claiming otherwise are the same people who come here and ask "What can I do, some Chinese company ripped off my website?!?!?!"


>And people claiming otherwise are the same people who come here and ask "What can I do, some Chinese company ripped of my website?!?!?!"

Something you made up in your head with literally not a single shred of evidence.


Find cool project. Contribute. :)


I do, on some open-source projects on GitHub. Sorry, what I meant is not just open-source projects but working products like this one, driven by volunteers/teams like theirs.


I loaded the front page of Mangadex and it made 114 web requests including 10 first-party XHR requests, 30(!!!!) Javascript resource requests and somehow 4 font requests, without me interacting with the page. Clicking one of the titles on the front page resulted in nearly 40 additional requests.

Perhaps if you are limited by requests per second you could consider how many requests a single user is making per interaction, and if this is a reasonable number.

The website is impressively fast though, I'll give you that.


The frontend framework they use (Nuxt) uses code splitting [1], which means that:

- the first request is fast, because you only need to download the chunks required for a single page/controller (and you prefetch others in the background)

- changing some parts of the codebase requires re-downloading only the affected chunks, instead of the whole bundle

[1] https://www.telerik.com/blogs/what-you-should-know-code-spli...


They're probably more limited by bandwidth than requests per second, but any way you look at it, the numbers are still impressive considering the budget.

BTW, the site is not just fast: they serve images in high quality (same as the original [1], which can be multiple MBs per page [2]) at a pretty impressive speed too.

[1]: before someone asks why they don't optimize the images, this is by design, since they want to serve high-quality images. There is an optional toggle to reduce the image size, but it is disabled by default.

[2]: for those not familiar, the average number of pages in a manga chapter is something like ~20, and a chapter can be read in ~5 minutes depending on the density of the text. So you can easily consume 50MB+ per chapter.


> The only missing bit would be the ability to replicate production traffic, as some bugs only happen under very high traffic by a large number of concurrent users. This is however at best difficult or nearly impossible to do.

Not sure if I'm missing something here. Surely you could sample some prod traffic and then replay it with one of the many load-test tools out there. You might lose the geographical distribution, but load testing a web server at 2k TPS sounds fairly trivial.
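As a sketch of that, GoReplay is one of the usual tools for it (flags per its basic documented usage; the port and staging hostname here are made up, and capturing raw traffic needs sufficient privileges):

    # on a production web node: record incoming HTTP traffic for a while
    gor --input-raw :8080 --output-file requests.gor
    # later, against a staging environment: replay the recorded requests
    gor --input-file requests.gor --output-http "https://staging.example.org"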


My cheap $20/month VPS serves tens of thousands of users per day without breaking much of a sweat, using a good old LAMP stack (Linux, Apache, MariaDB, PHP).

I don't know how many requests per second it can handle.

Trying a guess via curl:

time curl --insecure --header 'Host: www.mysite.com' https://127.0.0.1 > test

This gives me 0.03s

So it could handle about 30 requests per second? Or 30x the number of CPUs? What do you guys think?


You need to do load testing to determine this - a request's time includes many delays that are not related to the work the server does, and thus it's not as simple as 1/0.03 - it's possible that 0.0001 second of that time is actually server time, or 0.025 - plus you also have to consider if there are multiple cores working, or non-linear algorithms running, or who knows what else.

Best way to figure it out is to use an application like Apache Bench from a powerful computer with a good internet connection, throw a lot of concurrent connections at the site, and see what happens.
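One cheap extra check: ask curl itself for a timing breakdown of that 0.03s request (the --write-out variables below are standard curl), which shows how much is DNS/TCP/TLS setup versus actual server time:

    curl -sk -o /dev/null -H 'Host: www.mysite.com' \
         -w 'dns:%{time_namelookup} tcp:%{time_connect} tls:%{time_appconnect} ttfb:%{time_starttransfer} total:%{time_total}\n' \
         https://127.0.0.1/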


I think it makes sense to test from the server itself, because otherwise I would be testing the network infrastructure. While that is interesting too, I am trying to figure out what the server (VM) can handle first.

I just tried Apache Bench:

ab -n 1000 -c 100 'https://www.mysite.com'

    Concurrency Level:      100
    Time taken for tests:   1.447 seconds
    Complete requests:      1000
    Failed requests:        0
    Requests per second:    691.19 [#/sec] (mean)
    Time per request:       144.679 [ms] (mean)
    Time per request:       1.447 [ms] (mean, across all concurrent requests)
Wow, that is fast. Around 700 requests per second!

Upping it 10x to 10k requests ...

    Requests per second:    844.99 [#/sec] (mean)
Even faster!


I guess you basically run a load test over a randomized or usage-weighted list of API endpoints with an increasing number of synthetic users and see when things start breaking. Many free tools can run these tests from even your laptop.
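As a rough sketch with one such tool (hey here; the endpoints are hypothetical and the 3:1 weighting is arbitrary):

    # hammer two "hot" endpoints in a rough 3:1 ratio for a minute, then compare latencies
    hey -z 60s -c 75 "https://staging.example.org/api/chapter/latest" &
    hey -z 60s -c 25 "https://staging.example.org/api/manga/search?q=test"
    wait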


A day is 16 * 60 * 60 = 57,600 seconds (night time subtracted). So tens of thousands of users per day is something like 1-2 req/s, maybe 50 at peak time.

What is more important is what kind of requests your server has to serve. Nginx can easily serve 50-80k req/s of static content, and in the 100k range if tuned properly.


Does it serve 20-40 hi resolution images and uploads per user?


I am not sure how to interpret this paragraph:

> In practice, we currently see peaks of above 2000 requests every single second during prime time. That is multiple billions of requests per month, or more than 10 million unique monthly visitors. And all of this before actually serving images.

If I am reading that correctly, 2000 r/s does not include images, which makes it unclear whether the $1500/month does.
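For rough scale: a sustained 2000 req/s would be 2000 × 86,400 × 30 ≈ 5.2 billion requests a month, so "multiple billions per month" is consistent with 2000 req/s being a prime-time peak rather than the average.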


I'm pretty sure that includes images; that's why people visit the site. Prime time happens when a very popular manga gets released at around the same time every week.


Hosting static files isn't really that hard. I used to host a website that at its best served around 1000 GB of video content in 24 hours. Of course, it wasn't the fastest without a CDN but it was just 25 €/month.


I wanted to start a discussion about how to estimate the number of requests a given server can handle per second. So when I read "x requests/s" I can put that into perspective.

But it seems you think I wanted to start a dick measuring contest?

If your question is genuine: I would serve images via a CDN. The above timing is for assembling a page by doing an auth check, a bunch of database queries and templating the result.


This is complexity for complexity's sake. Pay no attention to the disclaimer at the start of the article. They threw every buzzword-heavy bit of tech they could find at it, creating a Frankenstein monster.


Completely disagree. How would you do it in a simpler way, while keeping the features like redundancy ( including storage), logs, metrics, etc?


Looking at their diagrams it seems that the k8s cluster exists solely to handle their monitoring and logging needs which would be extreme overkill, especially since 18k metrics/samples and 7k logs per second are nothing. Plus you now suddenly need a S3-compatible storage backend for all your logs and metrics. Good thing Ceph comes 'free' with Proxmox, I guess.

Deploying an instance of Prometheus with *every* host is also unusual, to say the least, and I don't quite understand their comment on that. If you don't like a pull-based architecture (which is a valid point), why use one at all!? There are many more push-based setups out there that are simpler to set up and less complex.


> k8s cluster exists solely to handle their monitoring and logging

Does image processing, runs our analytics, runs our Sentry, runs our gitlab-ci runners, and quite a few other things not mentioned expressly

> which would be extreme overkill

That's an interesting argument against k8s; if anything I find it much easier to work with -- once accustomed to its idioms, ofc -- than alternatives like dedicated VMs, Docker Swarm, etc.

Getting HA and auto-healing for free is possible without it, of course, but does require much more work, especially if you aim for a somewhat minimised amount of statefulness (as in deviation from the template of your system)

Also S3-compatible storage backends are really aplenty, from commercial offering to simpler ones like MinIO. Ceph just happens to be a bit higher of a deployment investment with the benefit of fantastic performance, flexibility and resiliency. Somewhat like k8s itself, it's a bit daunting at first but does actually make things simpler in the long run (imo)

> 18k metrics/samples [...] are nothing

Well, yes and no: the number of metrics isn't relevant per se, but their cardinality is very relevant, and managing that in a single Prometheus instance will quickly require some serious vertical scaling, especially if you want to look at data over longer ranges (which, contrary to logs, we are interested in).

> 7k logs per second [...] are nothing

That's an interesting take. Surely this isn't a world-record-shattering amount indeed, but no one seems to have a great non-SaaS-or-cheap solution for storing, sorting and querying this amount of logs either (at the resource efficiency of Loki, anyway), so maybe we just have a different set of expectations for log management.

> If you don't like a pull-based architecture [...] why use one at all!? There are many more push-based setups out there that are simpler to set up and less complex.

Are there really? That are non-SaaS and have as widespread third-party software support as Prometheus does? i.e. great integration with essentially any database, webserver, runtime, OS, etc.?

Because if we talk only about node metrics like CPU etc then yeah, sure there are plenty of options. But (maybe not so) obviously the diagram showing only node exporter doesn't mean that this is the only integration we use -- we collect prometheus metrics for MySQL, PHP-FPM, Varnish, Nginx, HAProxy, Elasticsearch, Redis, RabbitMQ etc (essentially every single piece of software we use).

Fwiw I found very little in the way of open-source solutions to that problem that ticked as many boxes as Prometheus.

As for "simpler to set up and less complex", both Cortex and Loki would be really annoying to manage outside of Kubernetes, I'll happily give you that. But... being able to easily deploy and manage such systems once you have Kubernetes is precisely one of the reasons to use it. You can't call Kubernetes itself complex to deploy while ignoring the fact that it largely outweighs this by making reliable operation of complex-but-powerful software on top of it easier -- that is precisely one of the upsides of using it in the first place :)


>more than 10 million unique monthly visitors

>our ~$1500/month budget

I understand not wanting to show ads, but is there no way for the users to contribute to hosting costs?


From a quick glance it seems to host obviously copyrighted content for free. In some jurisdictions (like Spain) the companies would have a hard time in court against the website creators, since it's a not-for-profit* website sharing culture.

Now show an ad, or add premium accounts, and it becomes a for-profit endeavour, which is straight jail time. I'm unsure about donations.

(Based on previous rulings I followed ~10 years ago, laws might have changed IANAL yada yada)

*not for profit != non-profit


They might have a play at being affiliates for sellers of the original material. I suppose a link is an ad, but it's also somehow a less dubious way to monetize in my mind.


There is MangaDex@Home, where users can donate some disk space/bandwidth to help serve (mainly old) manga chapters. It does need to be something that is running 24/7 (e.g. not a PC that is shut down frequently), so something like a VPS or a server is recommended.


Virtually every chapter is served via MD@H now. The client doesn't really need much availability, as long as it can do a graceful shutdown. Even in the event of a sudden shutdown, the trust penalties are much lower than H@H's, and in practice they go away after a trickle of traffic raises your score.


Nice, didn't know about this (there isn't much information about MD@H after the rewrite).

BTW, how can I register my VPS on MD@H? Before, we had a dedicated form on the page to register interest; after the rewrite I couldn't find it. Is it only done via something like Discord now?


We had a dedicated page for signing up on the v3 version of the site. Currently, yes, it's via our Discord server's MD@H channels.


They had a bit under $80k in crypto in their list of BTC and ETH addresses "leaked" along with the source code when the site was hacked earlier this year.


A $5/mo premium plan would break even so quickly


A premium plan on content that can be considered dubious in a copyright context? Seems like a quick way to get shut down.


A premium plan for a virtual badge, NFT or whatever crap you want. Content would still be freely available for everybody, but the infra cost burden would be a bit lighter.


Or even a Patreon-style 'you get nothing but a supporter badge' tier, with those kinds of traffic levels.


It's a nice article, I guess, but the site is down (the one discussed in the article, not the blog post itself) for me.


You probably have Verizon - they've started null-routing traffic to sites "like this".

https://old.reddit.com/r/mangadex/comments/nvj7qf/is_verizon...


Ah shucks I thought we had mostly avoided that stuff in the US.

I'm guessing they're using some old spam IP blocklist though; there are a lot more obvious piracy sites than a manga site. For instance, I can access all the major torrent sites.


They also block nyaa, an anime/manga-focused tracker. It's not a very aggressive list though, as you're right that major torrent sites are still accessible.


Huh, right you are, on both counts, it seems.

That's disappointing. If only I had some choice of ISPs, then I could express my disappointment by voting with my wallet…


Just call their support line often enough and tell them the internet isn't working.


It works fine for me: https://mangadex.org/


I wrote a business backend server that calculates various things and returns results as JSON. It serves on average up to 5000 requests/s for about $220 CAD/month. Architecture: a single executable written in C++ running on a rented dedicated server from OVH with 16 cores, 128GB RAM and a couple of SSDs.

It can do much higher requests-per-second numbers on simple requests, but the most common requests are actually heavy iterative calculations, hence the average of 5000 requests/s.


How is the Wireguard VPN set up? Has anyone used Wireguard to set up a VPN into AWS VPCs?


Interesting to know more about news.ycombinator.com !


At some point in the new redesign, they started to load full-size images for thumbnails. The whole site feels slower due to that. They need an automatic re-scaler service.


Not correct, we generate 2 thumbnail sizes for every cover -- if the site loads full-size anywhere by default (rather than when you expand it), it's definitely a bug!


How can this be legal?

It's basically pirating content



