Show HN: Log-Scale Covid-19 Plots (bitbucket.org/kkd)
110 points by 6b6b64 on April 4, 2020 | 70 comments



Why are all the graphs cumulative instead of showing new cases for each day? That makes it harder to notice how it is growing (i.e. more or fewer new cases than on previous days) and harder to see whether growth is exponential, linear or whatever. And, of course, it forces the use of log scales because the accumulated number is already high.

At least for networking graphs, it is more meaningful to see the difference between the current and the previous cumulative total than to watch the running total for a network interface.

Another use of graphs that is misleading, at least for me, is showing the cumulative total of people who ever got sick instead of the current number of sick people (taking out the recovered, and maybe the dead).

I know that the way measurements are done in most places is skewed, as not everyone is tested and a lot of people are asymptomatic, but that affects both the new cases and the cumulative ones. Showing the cumulative numbers doesn't remove that skew.


One reason is that daily increases are very volatile/noisy. The standard way to handle this is to use n-day moving averages, but they have a time lag and are harder to interpret.

The Financial Times has the best graphical trackers I've seen. They used to show cumulative counts, but recently switched to 7-day moving averages of daily increases.

https://www.ft.com/coronavirus-latest

I agree that displaying increases/growth rates is more informative overall, but it does take more time to understand the metrics. (Just compare the descriptions: "Total Deaths" vs "7-day moving average of daily increase in Total Deaths")
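
For anyone who wants to try the diff-then-smooth step, here is a minimal sketch in plain Python. It assumes a list of cumulative totals, one per day; the numbers and the function names are made up for illustration and are not taken from the FT or from the repo.

```python
def daily_increases(cumulative):
    """Differences between consecutive cumulative totals (one fewer value than the input)."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def trailing_average(values, window=7):
    """Trailing moving average: each entry averages the last `window` values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical cumulative counts, for illustration only.
cumulative = [100, 120, 150, 210, 230, 300, 390, 420, 500, 610]
new_cases = daily_increases(cumulative)
smoothed = trailing_average(new_cases, window=7)
print(new_cases)
print(smoothed)
```

A trailing window like this is where the time lag comes from: each smoothed value summarizes the past week, so it sits a few days behind the latest raw number.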


Not really. Data for Spain can be used and fitted perfectly well without averaging the daily increases => https://media-exp1.licdn.com/dms/image/C5622AQH1JmaVMs3mqQ/f...


It's unclear what that graph is supposed to show. It can't be cumulative cases (because there are days with decreases), but the scales are completely off if they're supposed to show daily increases/diffs (Spain certainly did not see daily increases anywhere near 100k in the past few days).


On the other hand I don't understand how bad it is from looking at the graph with the easy to understand metric.


Yeah I came by to say the same thing. I understand that "cumulative" and "dead" are super sensationalist, but what really matters isn't how many people have had it, but how many people have it right now and how that number's changing day by day. It's impossible to figure out and it makes this chart very misleading.

For instance, China's hovering around 1e5 cases on the chart, but they have 2e3 right now. That's nearly two orders of magnitude misleading, especially when 95-100% of people recover completely and are quite unlikely to get it again.

tl;dr: China features prominently on those graphs but their outbreak is contained and finished, they've re-opened internally and pollution (hence factory output) is returning to normal. The graph doesn't reflect that in a meaningful way. [1]

[1] https://www.forbes.com/sites/jeffmcmahon/2020/03/22/video-wa...


I agree! Cumulative charts are less useful. For my charts, I represent new cases or new deaths per day. It shows such a clear picture of the trend of the growth:

https://mobile.twitter.com/zorinaq/status/124578701743622144...


Do you have these charts for the main countries? From your chart it looks like Italy and Spain have reached the peak, US is still growing, Germany may be just about to peak. But it is hard to read.


The FT graphs show the increases/growths/1st-derivative. Italy and Spain are likely over peak-growth, Germany and France nearly, UK and US not yet. These are 7-day rolling averages, so there's roughly 3.5 days delay in the estimated signal.

https://www.ft.com/coronavirus-latest


Daily numbers have problems too, they can get quite noisy. Both have their uses, particularly seen together. I also think linear scales have their place alongside log (which can be hard for laypeople to grasp). Here are daily numbers for France for example:

https://coronavirus.projectpage.app/france?period=28

The jump recently is due to a change in reporting to include deaths outside hospital (something most countries have yet to do), and these are reported deaths not actual, and subject to all kinds of caveats. Reported cases is even worse. Recovered cases is worst of all and not being properly tracked in most areas.

A moving average of daily cases might be best. I think the FT (who have nice graphs of this along with commentary) are moving toward this.

https://www.ft.com/coronavirus-latest

There's a nice commentary on what they're doing here:

https://mobile.twitter.com/jburnmurdoch/status/1246184639540...


Just in case you haven't seen it, I'm a big fan of this [0] chart, which plots new weekly cases against cumulative total, both axes on a log scale. It has a whole video explanation [1] if you're interested in the logic behind it.

[0]: https://aatishb.com/covidtrends/

[1]: https://www.youtube.com/watch?v=54XLXg4fYsc
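
A rough sketch of the quantity that kind of chart plots, in plain Python. The data is synthetic (pure exponential growth at an assumed 25% per day) and `weekly_new_vs_total` is a name made up for illustration, not the site's code.

```python
import math


def weekly_new_vs_total(cumulative, window=7):
    """Pairs of (total so far, new cases over the trailing `window` days)."""
    return [
        (cumulative[i], cumulative[i] - cumulative[i - window])
        for i in range(window, len(cumulative))
    ]


# Synthetic data: pure exponential growth, 25% per day (an assumption).
cumulative = [100 * 1.25 ** day for day in range(30)]
points = weekly_new_vs_total(cumulative)

# Slope between consecutive points when both axes are logarithmic.
slopes = [
    (math.log(n2) - math.log(n1)) / (math.log(t2) - math.log(t1))
    for (t1, n1), (t2, n2) in zip(points, points[1:])
]
print(min(slopes), max(slopes))
```

While growth stays exponential, weekly new cases are a fixed fraction of the total, so the points fall on a straight line of slope 1; a country drops off that line once growth slows.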


I'm not too fond of that chart because all you can really see clearly on it is when the increase stops being exponential. It is pretty hard to see the base of the exponential (in most countries the increase in confirmed infections has been slowing down, relatively speaking, but that is incredibly hard to spot from that graph, which makes the situation seem worse than it is).


Apparently the right way to do it is to show daily total deaths (ALL deaths, not just covid) per local region vs the historical daily total deaths for that region, so some kind of excess-death indicator.

That shows the pandemic's effect rising and falling, and it captures the many cases that don't reach a hospital, get misclassified, etc.

But getting a daily ticker of total deaths (all deaths, not just covid) gets shut down quickly by politicians of all stripes in most parts of the world.


Log cumulative graphs make it hard to see whether the daily rates are increasing or decreasing. Here are non-log, non-cumulative, new daily deaths per 1M people to compare different countries:

https://colab.research.google.com/drive/1dNAVpgRGjEViK-9ULhE...


New cases can be calculated as the difference between the cumulative totals of consecutive days. This is how daily cases are shown here, using the JHU dataset: http://covid-19.seektable.com/report/fe66549d73c64773bd6712e...


For a while there, it made no difference. When cases are growing exponentially, the two graphs are proportional to one another. The cumulative graph made it easy to spot when a particular region or locality had gone exponential, which I took to indicate the onset of community spread.


The FT switched from cumulative to new. While taking that first derivative has some advantages, I for one was clamouring for the original plots, and am grateful that the author has brought them back. Thanks!


I'm not sure if the equation is right but I've done a daily case plot for Indian cases.

https://gist.github.com/theSage21/29608c666a378dcb276a9dd9e4...


Pet peeve - line charts with many lines, yet legend in another place. Even worse, when the order of labels is not the same as the order of values. They take more cognitive power to parse than needed.

Compare and contrast with labels next to the lines, vide https://www.ft.com/coronavirus-latest (this example is already quoted in some other thread, and I find it a gold standard of coronavisualizations).


Also, an interactive chart by the New York Times - it makes it much easier to see a single country's trajectory:

https://www.nytimes.com/interactive/2020/03/21/upshot/corona...


To be fair to the FT, they have much better graphs than most other newspapers.

I especially like the small multiples by country with the gray lines for different countries (UK, Italy, Spain last I checked).

It's such a wonderful chart idea that I'm already planning on stealing it.


The Economist has a good baseline for charts.

The New York Times has some stunning data visualizations (from the UI/UX perspective), see e.g. "You Draw It" series: https://www.nytimes.com/interactive/2017/01/15/us/politics/y...


Especially sucks if you are colourblind! I've pretty much given up on reading charts like this


https://aatishb.com/covidtrends/ is a pretty cool related resource


Here's a quick video explaining those graphs: https://www.youtube.com/watch?v=54XLXg4fYsc

Log-log graphs of new cases vs confirmed cases (which is what that site graphs) are by far the best way to represent the data that I've seen.


Agreed that's much better, and addresses all the concerns in my sister post. Thanks for sharing!


Nice use of matplotlib. I'd like to apply this to US states as well.

Although even between states the variation in test rate is so great that it's hard to gather much from it.

I personally use hospitalizations as a more accurate metric of total infections in the US. For example, Washington is being hailed as doing a great job to slow the virus down, but ~80% of the confirmed tests result in hospitalizations, because they still just don't test you otherwise [0]. Compare that with a more realistic hospitalization rate (many states with a lot of cases are around 10% -- who would have guessed it would be Louisiana and Florida doing the broad testing?)

[0] https://en.wikipedia.org/wiki/Timeline_of_the_2020_coronavir...


These graphs give a bit of an indication, but you cannot really trust them.

Since there is a shortage of tests, at some point countries might decide to only test people coming into the hospital.

Another thing about the deaths is also troubling: In the Netherlands doctors were complaining that deaths with symptoms of Corona were not counted as corona deaths, because they were not tested and found positive (again a problem with the shortage of tests).

So a bending of the curve might just be explained by a new strategy of who to test.

I think the best way to count is to look at total hospitalizations and subtract the average of normal years. The same goes for corona deaths: take total deaths and subtract the average of a normal year.


It also needs per capita figures, which are dramatically more important than absolute figures, unless everyone happens to know the population of each country from memory.

You end up missing critical data points like the per 100k population mortality rates (from Friday morning):

New York 12, Louisiana 6.6, New Jersey 6, Michigan 4.2, Washington 3.5, Connecticut 3.1, Massachusetts 2.2, Colorado 1.7, Georgia 1.67, Nevada 1.27, Illinois 1.23, Delaware 1.2, Pennsylvania 0.7, Ohio 0.7, Florida 0.68, Kentucky 0.68, Alabama 0.65, South Carolina 0.6, Wisconsin 0.53, California 0.5, Oregon 0.5, Maine 0.5, Idaho 0.5, Virginia 0.48, Arizona 0.45, Kansas 0.44, New Hampshire 0.36, Iowa 0.34, New Mexico 0.33, Minnesota 0.32, Nebraska 0.32, Missouri 0.31, Texas 0.24, North Carolina 0.15, Hawaii 0.14

Italy 23, Spain 22, Belgium 8.8, France 8, the Netherlands 7.8, Switzerland 6.2, UK 4.5, Sweden 3, Denmark 2.1, Ireland 2, Portugal 2, Austria 1.8, Germany 1.3, Norway 0.9, Canada 0.37, Finland 0.34, Australia 0.11, New Zealand ~0

Most of the US is seeing very low per capita mortality rates and no surge in cases. You wouldn't know that by the headlines though.


In this video John Burn-Murdoch (the creator of the FT charts) discusses why they decided against showing numbers per capita. https://mobile.twitter.com/janinegibson/status/1244519429825...

There's also this tweet additionally showing how population size of a country has no relationship to pace of disease spread. https://mobile.twitter.com/jburnmurdoch/status/1246185741304...


There's another surprising reason why per capita numbers aren't useful - for exponential growth on the typical type of graph starting at some "initial" number of cases, it makes no difference! For example, if one country was counted as two equal half-sized countries, their graphs would be the same shape but shifted to the right by a few days. However, they would also reach the "initial" number of cases where the graphs start a few days later - shifting them left by the same amount! The result would be the same line as the full-sized country.
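
A small numeric check of that argument, under idealized assumptions: growth is exactly exponential in continuous time, and the half-sized country has exactly half the cases at every moment. The function name and all numbers are hypothetical.

```python
import math


def rebased_curve(initial, rate, threshold, horizon):
    """Cases on each whole day after an exponential outbreak first reaches `threshold`."""
    start = math.log(threshold / initial) / rate  # time at which cases == threshold
    return [initial * math.exp(rate * (start + s)) for s in range(horizon)]


rate = 0.25        # assumed daily exponential growth rate
threshold = 100.0  # the "initial" number of cases where each line starts

full = rebased_curve(initial=10.0, rate=rate, threshold=threshold, horizon=14)
half = rebased_curve(initial=5.0, rate=rate, threshold=threshold, horizon=14)

# The half-sized country crosses the threshold later by this many days...
delay = math.log(2) / rate
# ...but once both are plotted against "days since threshold" the lines coincide.
print(delay, full[:3], half[:3])
```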


Good points, but that could be generalized to say that we might as well look only at the worldwide spread. But we look at countries because policies tend to follow those boundaries, so we can see how different choices affect outcome. In the case of the US, we should ignore the national total and look at individual states, because that is where nearly all the policy decisions are made. This might be true of other nations, as well.


Very true, but even then it remains very difficult to compare countries, because of the % of older generations.


Well done, log scale, rebased to common starting point - the best way of depicting international comparisons so far.

Zeit.de has an interactive version of these (in German), which adds a rectangle for "days since numbers last doubled", a good indicator of how severe the situation in a country is.

Scroll down past the map to the third graph with international data, click on "Todesfälle" (deaths): https://www.zeit.de/wissen/gesundheit/coronavirus-echtzeit-k...
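
One plausible way to compute a "days since numbers last doubled" figure from a cumulative series, sketched in plain Python. This is a guess at the definition, not Zeit's actual method, and the function name and sample numbers are made up.

```python
def days_since_last_doubling(cumulative):
    """Days back to the most recent day whose total was at most half of the latest total."""
    latest = cumulative[-1]
    for days_back in range(1, len(cumulative)):
        if cumulative[-1 - days_back] <= latest / 2:
            return days_back
    return None  # the series has not doubled within the available data


# Hypothetical cumulative totals, for illustration only.
print(days_since_last_doubling([100, 110, 130, 160, 200, 230]))
```

A larger value means slower growth, which is why it works as a quick severity indicator.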


> Well done, log scale, rebased to common starting point - the best way of depicting international comparisons so far.

Yeah, it's what the Financial Times and many others have been using for quite a while: log graphs with a starting point of the Xth death or Yth case. It's a standard way of presenting this kind of data.


But the FT doesn’t display that graph anymore, only its first derivative. I liked the original graph (maybe I am better at deriving than integrating...)


It would be interesting to see graphs of deaths for other reasons to compare numbers. For example deaths from starvation/malnutrition are likely increasing in India due to lockdown (https://www.theguardian.com/world/commentisfree/2020/mar/29/...) and deaths due to cancer may increase due to patients' treatment being missed.


This repo replicates log-scale plots seen in the Financial Times and other news sources, here grouped by world region. Plots are updated nightly and visible on the repo's homepage.


If you look at the last few "confirmed" plots, all the lines are in the bottom left corner but the x-axis still goes to above 70 (I guess for the sake of consistency) - this means that most of the space on the screen isn't used and it makes it very difficult to make out anything from the lines.

And when you need to recreate the same plot with different subsets of the data like you've done here, that's a great use case for an interactive dashboard which allows the user to select the countries/regions and also zoom in and out.


Same, but interactive and more powerful: https://www.datacat.cc/covid/


Alternatively a dashboard around covid-19 can be found at https://covidly.com. Including graphs.


FTR, there is a simple Covid-19 plotting utility for Ukraine data, but it could be used for any other data set. [0]

[0] https://github.com/marianpetruk/covid19_Ukraine


Suggestion: use fixed style for the same label across multiple graphs. If a country is blue+dashed in the cases plot it should be the same style in the deaths plot. Use a style from the label hash or a global table or something.
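
A minimal sketch of the label-hash idea in Python. The `style_for` helper and the palettes are hypothetical, not part of the repo; the color names and linestyle strings are ones matplotlib accepts. It uses `hashlib` rather than the built-in `hash()`, since string hashes from `hash()` change between interpreter runs.

```python
import hashlib

COLORS = ["tab:blue", "tab:orange", "tab:green", "tab:red", "tab:purple", "tab:brown"]
LINESTYLES = ["-", "--", "-.", ":"]


def style_for(label):
    """Map a label to the same color/linestyle pair on every run and in every plot."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return {
        "color": COLORS[digest[0] % len(COLORS)],
        "linestyle": LINESTYLES[digest[1] % len(LINESTYLES)],
    }


# With matplotlib this could be used as, for example:
#   ax.plot(days, cases, label=name, **style_for(name))
print(style_for("Italy"))
```

With only 24 combinations, two labels can collide, so a hand-maintained global table is the safer choice when the set of countries is fixed.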


If you're still working on it, I'd love to see the second derivative from the data. That's what I'm wondering about most these days: Are we at the inflection point or not?


The turning point is when the second derivative is zero, which would indeed be easy to spot with a second derivative graph. But it’s also very easy to spot with the first derivative graph (as published by the FT now): It’s when the first derivative hits its maximum.
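
To make that concrete, here is a sketch on synthetic data. It assumes the cumulative count follows a logistic curve with made-up parameters; real data is far noisier and would need smoothing first.

```python
import math


def logistic(day, cap=10000.0, rate=0.3, midpoint=30.5):
    """Hypothetical cumulative case curve; all parameters are assumptions."""
    return cap / (1.0 + math.exp(-rate * (day - midpoint)))


cumulative = [logistic(d) for d in range(61)]
first = [b - a for a, b in zip(cumulative, cumulative[1:])]  # daily new cases
second = [b - a for a, b in zip(first, first[1:])]           # change in daily new cases

# Index where daily new cases peak.
peak_day = max(range(len(first)), key=first.__getitem__)
# First index where the second difference is no longer positive.
sign_change = next(i for i, v in enumerate(second) if v <= 0)
print(peak_day, sign_change)
```

Both indices land on the same day, which is the point above: the maximum of the first difference and the zero crossing of the second difference mark the same inflection.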


Yes- but I'm talking about the inflection point- where the curve goes from concave up to concave down. These are modeled as gaussians- so if that modeling works we would be a standard deviation from the peak- assuming the crisis is being well managed.


By turning point I mean the inflection point. You turn from “driving left” (convex) to “driving right” (concave).

By the way, the derivative of the logistic function is the logistic distribution, and that’s not the Gaussian bell shape, it has much fatter tails: Gaussian tails drop much faster, with exp(-x^2), while the logistic drops with exp(-|x|). (Makes sense, as the logistic curve grows exponentially at the beginning, and the derivative of the exponential is the exponential.)
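
A quick numeric comparison of the two tails, using the standard (unscaled) forms of both densities as an assumption; matching their variances would change the numbers but not the qualitative picture.

```python
import math


def logistic_density(x):
    """Derivative of the logistic function 1 / (1 + exp(-x))."""
    return math.exp(-x) / (1.0 + math.exp(-x)) ** 2


def gaussian_density(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


for x in (2, 5, 10):
    print(x, logistic_density(x), gaussian_density(x))

# Far out in the tail, one more unit of x shrinks the logistic density by about 1/e.
tail_ratio = logistic_density(11) / logistic_density(10)
print(tail_ratio, math.exp(-1))
```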


haha yes indeed. I had never heard the term "turning point" used like that, but I guess it makes sense if you think about it as the turning point of the first derivative.

But I learned what a logistic function is, so thank you. I guess that is needed for the cumulative count.

You might be interested in this. I posted it a few days ago- it's why I'm talking Gaussians.

https://www.medrxiv.org/content/10.1101/2020.03.27.20043752v...


I like this set of graphs with commentary:

http://nrg.cs.ucl.ac.uk/mjh/covid19/


Very cool. I’d love to see the same plots scaled by population.



I tried, and it does not work that way if you think about it. Or at least your “population” should not be an arbitrary current administrative region or country but the specific area where the virus is spreading, for which data is very difficult to get. You’ll get weird numbers for European microstates (San Marino and Luxembourg come out on top) and very different figures for China, Hubei and other regions there. And if you finally compare them in graphs you’ll get the same graph anyway.


I guess it depends on what you want to do. If you're trying to make sense of a given country's response, (cases) / (population of administrative region) seems like a decent metric. Also since policy tends to follow administrative lines it's a decent (but extremely noisy) metric to see what policies are working.

What do you want to learn from (cases) / (people in infected area)? The denominator seems difficult to define since it's not scale invariant. If you defined an "infected area" as being within 1 km of an infected person, for example, you'd get a very different answer from if you defined infected area as 10 km from an infected person.

In the limiting cases where you define infected area with very fine or very coarse granularity, you get an infection rate of 1.


Agree. I was confused by this at first but if the slope on lines without dividing by population is:

(log y2 - log y1) / (x2 - x1)

Then if we scale y2 and y1 both by c, which is 1 / population:

(log( c * y2 ) - log( c * y1 )) / (x2 - x1) =

(log c + log y2 - log c - log y1) / (x2 - x1) =

(log y2 - log y1) / (x2 - x1)

So scaling by population does not change the slope of the graphs, only the intercept.
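
The same cancellation checked numerically in Python. The case counts, days and population are hypothetical values picked for illustration.

```python
import math


def log_slope(y1, y2, x1, x2):
    """Slope between two points on a log-linear plot (natural log on the y axis)."""
    return (math.log(y2) - math.log(y1)) / (x2 - x1)


y1, y2, x1, x2 = 1200.0, 4800.0, 10, 17  # hypothetical counts on two days
population = 60_000_000                  # hypothetical population
c = 1.0 / population

raw = log_slope(y1, y2, x1, x2)
per_capita = log_slope(c * y1, c * y2, x1, x2)
print(raw, per_capita)
```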


It doesn't even change the intercept because bigger countries reach their threshold number of cases sooner so that shifts them back again. Hence most of the lines are all roughly on top of each other regardless of country size.


Do log scale on both axes, suddenly all the plots will be nearly straight lines and you can start to reason about uncontrolled growth and how well mitigation is working.


If you plot (exponentially growing) cases or deaths against time, you’ll get a straight line on a log-linear graph (as here). If you make both axes logarithmic, you get an exponential curve again (but squished).

Maybe you’re thinking of the double log curve of total cases vs new cases that was on HN recently.
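
A small check of the first point on synthetic data, assuming exactly exponential growth with made-up parameters: the slope between consecutive days is constant on log-linear axes but keeps changing on log-log axes.

```python
import math

days = list(range(1, 41))
cases = [10 * math.exp(0.2 * d) for d in days]  # assumed exponential growth

# Slopes between consecutive points with only the y axis logarithmic.
log_linear = [
    (math.log(c2) - math.log(c1)) / (d2 - d1)
    for (d1, c1), (d2, c2) in zip(zip(days, cases), zip(days[1:], cases[1:]))
]

# Slopes between the same points with both axes logarithmic.
log_log = [
    (math.log(c2) - math.log(c1)) / (math.log(d2) - math.log(d1))
    for (d1, c1), (d2, c2) in zip(zip(days, cases), zip(days[1:], cases[1:]))
]

print(min(log_linear), max(log_linear))  # constant: a straight line
print(log_log[0], log_log[-1])           # steadily increasing: a curve
```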


How does this compare to https://studylib.net/coronavirus-growth ?



Thanks, those log-scale plots are way more interesting than linear ones.

Small mistake: Israel isn't in Europe but in Eastern Mediterranean.


the region titles are ... unconventional. greece, clearly in the eastern mediterranean, is not listed, while iran on the indian ocean is. this group is normally called mena, roughly.

likewise, the category called southeast asia contains countries normally considered to be south asian - like india - and omits the larger part of southeast asia.

the creation of “western pacific”, grouping australia-new zealand, east asia, and the larger part of southeast asia, deserves credit surely since it is far more useful than the neocolonialist category of oceania, which mostly serves to allow europeans to swamp pacific islanders with statistics from australia


From the Readme:

> The countries have been grouped according to regions defined by the World Health Organization.

And, yeah, they’re a bit weird.

https://en.wikipedia.org/wiki/WHO_regions


All of these need to be normalized per capita. Otherwise you don't see the true extent of the problem. Also from looking at the stats recently, here's what I find more useful than the raw number of "cases":

- Number of deaths per capita

- Number of "severe" cases per capita (good indicator of the future number of deaths)

- Number of tests per capita (good indicator for whether or not "number of cases" means anything at all)


> All of these need to be normalized per capita. Otherwise you don't see the true extent of the problem.

Normalizing per capita replaces the true extent of the problem with the true relative local impact of the problem; both are significant.


But relative local impact _is_ the extent. If you live in a village of 100 people and 10 die that's pretty bad. If you live in NYC and 10 die - that's statistical noise that nobody will even notice.


> But relative local impact _is_ the extent.

No, absolute scale is the extent, that's pretty much what “extent” means.

Relative local impact is the...well, relative local impact.

Both are important, though which is more important depends on what you are doing with the measure.


Why not relative to population size? And China and the USA should be split into regions/states.


These numbers are effectively meaningless.

Definition of deaths varies widely, testing policy varies widely, no statistical extrapolation is being done in an attempt to avoid undermining public health policies that might be based on total fantasy.


In outbreak areas, total mortality increases by multiples to orders of magnitude during the peak (despite enormous, historic measures of containment). These are areas with processions of army trucks full of bodies, or the defense department bringing in refrigeration trucks.

"But what about the co-morbidity?". People don't die "of" COVID-19. They have heart failure, kidney failure, or other triggers of death because COVID-19 pushes their body to the limit. Pointing to resources that claim that "only" some small percentage actually died of COVID-19 is pure ignorance: in those cases the declaring doctor was simply being efficient, because if they had truly looked into it there would be another triggered cause of death. Just as no one died of "AIDS", they died of things like Kaposi sarcoma, but if someone said "see, it wasn't AIDS at all" they would be laughed out of the room.

The majority of deaths are people who are health compromised in some other way (not all deaths, and there have been an abundant number of completely healthy people who have perished), but that is known by everyone and is not news, nor does it diminish the tragedy.

"no statistical extrapolation"

This is the most interesting, and ridiculous, claim of all. Enormous statistical measures and extrapolations are being done daily...that's how we are where we are. What is this meaningless claim even trying to say, other than that you, easytiger, know more than every health authority?


Way to have an argument with things I didn't assert.

Categorically, in many countries (UK, Italy), the figures released represent people who tested positive for covid 19 before or after death. Those figures, without any question, do not offer an opinion on how, if at all, it had any effect on the death. Further to that, the UK's official death stats (released monthly) only count mentions of covid19 on a report. That doesn't indicate anything.

In some countries, the only demographic where you are guaranteed to be tested is if you die, and in a hospital.

Are those facts with which you have an issue?



