
Seems like every week there's a new massive-scale DB project or company getting announced on HN.

If they're looking for projects that create public value and demonstrate the power of their products at scale, digitizing this and making it searchable may be a good marketing project that's appealing to certain kinds of customers.




It would appear us SQLite zealots have encountered the final boss.

Petabytes uncompressed would be tricky if you need to slice those columns. SQLite caps out at ~281 terabytes of storage before it can't track any additional pages.
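For reference, the ~281 TB figure falls out of SQLite's documented limits: a maximum page count of 4,294,967,294 and a maximum page size of 65,536 bytes. A quick sketch of the arithmetic follows; the 1 PB target is only an illustrative round number, not a measured size for this dataset.

```python
# SQLite's documented upper bounds (page count limit as of 3.33.0 and later).
MAX_PAGE_COUNT = 4_294_967_294  # 2**32 - 2 pages
MAX_PAGE_SIZE = 65_536          # bytes per page, the largest allowed

max_bytes = MAX_PAGE_COUNT * MAX_PAGE_SIZE
max_terabytes = max_bytes / 10**12  # decimal terabytes

# Illustrative assumption: a 1 PB (10**15 byte) dataset, ignoring any
# per-file overhead or compression.
ONE_PETABYTE = 10**15
files_needed_for_1pb = -(-ONE_PETABYTE // max_bytes)  # ceiling division

print(f"max database size: {max_terabytes:.0f} TB")
print(f"files needed for 1 PB: {files_needed_for_1pb}")
```

So even at the maximum page size, a single petabyte would need at least four database files.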

None of this is to say you couldn't partition the data across a lot of SQLite instances in varying ways. I will probably take a shot at it this weekend. Looking to see just how unlimited my AT&T fiber connection is anyways.
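A minimal sketch of what that partitioning could look like, assuming a hash of the billing code picks the shard. The `rates` schema, the shard count, and every function name here are hypothetical, and in-memory databases stand in for what would be separate files on disk.

```python
import sqlite3
import zlib

NUM_SHARDS = 4  # assumption: picked for illustration only


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard number with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def open_shards(num_shards: int = NUM_SHARDS) -> list:
    """Open one SQLite database per shard (in-memory here, files in practice)."""
    shards = []
    for _ in range(num_shards):
        conn = sqlite3.connect(":memory:")
        # Hypothetical schema, not the real price-transparency layout.
        conn.execute(
            "CREATE TABLE rates (payer TEXT, billing_code TEXT, negotiated_rate REAL)"
        )
        conn.execute("CREATE INDEX idx_rates_code ON rates (billing_code)")
        shards.append(conn)
    return shards


def insert_rate(shards, payer: str, billing_code: str, rate: float) -> None:
    """Route a row to the shard that owns its billing code."""
    conn = shards[shard_for(billing_code, len(shards))]
    conn.execute("INSERT INTO rates VALUES (?, ?, ?)", (payer, billing_code, rate))


def rates_for_code(shards, billing_code: str) -> list:
    """Point lookups only touch the one shard that owns the billing code."""
    conn = shards[shard_for(billing_code, len(shards))]
    return conn.execute(
        "SELECT payer, negotiated_rate FROM rates "
        "WHERE billing_code = ? ORDER BY payer",
        (billing_code,),
    ).fetchall()


def total_rows(shards) -> int:
    """Queries that are not keyed by billing code have to fan out to every shard."""
    return sum(c.execute("SELECT COUNT(*) FROM rates").fetchone()[0] for c in shards)
```

Sharding on billing code keeps per-code lookups cheap, but anything sliced by payer or geography would hit every shard, which is the "varying ways" trade-off.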


> It would appear us SQLite zealots have encountered the final boss.

That's cute. :)

There isn't much value in feeding it all into a conventional RDBMS. OLAP engines and columnar stores are what's needed here. But first it will need a great deal of grooming and ETL work.


Yeah. It would be much easier to copy the data to S3 or any other object storage (ideally after converting it to a columnar format like Parquet) and query it directly with a SQL-on-lake engine such as Dremio or Athena. S3 Select would work too.
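To illustrate why a columnar layout helps for this kind of query, here is a toy pure-Python sketch of pivoting row records into per-column lists. It is not Parquet and uses no lake engine; the field names and values are made up, and it assumes every row has the same keys.

```python
from statistics import mean

# Made-up rows in the "one record per negotiated rate" shape.
rows = [
    {"payer": "Payer A", "billing_code": "99213", "negotiated_rate": 80.0},
    {"payer": "Payer B", "billing_code": "99213", "negotiated_rate": 120.0},
    {"payer": "Payer C", "billing_code": "99213", "negotiated_rate": 100.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field now reads a single contiguous list
# instead of walking every field of every record.
average_rate = mean(columns["negotiated_rate"])
print(average_rate)
```

Real columnar formats add per-column compression and chunk statistics on top of this layout, which is where most of the scan savings come from.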


>It would appear us SQLite zealots have encountered the final boss.

Just wait. It's actually a multi-boss fight, since you have to wrangle the Pharmacy Benefits Management datasets, plus Medispan, plus Medicare, plus all the Medicaid datasets, plus VA.

Are you and all your mightiest boxen bad enough dudes to make sense of the entire U.S. Healthcare industry?

<Actuary Stormrage in the background>

You are not prepared!


Figuring out the size of this data was part of the research phase for doing just that: building out that database. I'm curious to know if other people are already working on it (maybe Turquoise Health?)


Yep, we have built this database at Turquoise Health. I agree, the data is massive - and don't forget that it is all refreshed monthly!


It's cool seeing that Turquoise Health exists. One of my first programming projects back in the day (when I was trying to get a jr role in 2014) involved building a simple version based on data.gov Medicare data. The inputs were terrible and tiny (e.g. chest pain at hospital X costs ~$60k on average across 5 patients), so I was always curious what a real-world version might look like.

edit: As I reflect, I'm amused to recall that this was early enough in my path that I didn't know about DB indexes, so I was very proud that I figured out how to basically roll my own indexes by pre-sorting the columns by lat and lon. I don't remember whether my solution actually prevented a full-table scan, but it felt like a major breakthrough at the time.
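For anyone curious, a guess at what that kind of hand-rolled index could look like: keep the rows sorted by latitude and binary-search the latitude range, then filter longitude linearly. This is a sketch under assumptions, not the original project's code; the hospital tuples and function names are invented.

```python
from bisect import bisect_left, bisect_right


def build_lat_index(hospitals):
    """Sort (lat, lon, name) tuples by latitude and keep a parallel list of keys."""
    rows = sorted(hospitals, key=lambda h: h[0])
    lats = [h[0] for h in rows]
    return lats, rows


def in_box(lats, rows, lat_min, lat_max, lon_min, lon_max):
    """Return names inside a bounding box.

    Binary search narrows the latitude range in O(log n); longitude is
    still checked row by row within that slice.
    """
    lo = bisect_left(lats, lat_min)
    hi = bisect_right(lats, lat_max)
    return [name for lat, lon, name in rows[lo:hi] if lon_min <= lon <= lon_max]
```

This avoids scanning the whole table only for the latitude bound. A wide latitude band still degrades toward a full scan, which is roughly the problem that a database index on both columns (or a spatial index) is meant to solve.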


Is that from the hospital side or the insurer side?


We have built databases for both and can compare between them.


It’s my understanding these prices are negotiated to some degree, so it’s probably both sides at various times.


Very cool. Who do you see as the likely users of that database? Is it primarily for researchers/data journalists, or is there a commercial value to it?

I'd be very curious to read more about the data cleaning phase when you get there. Specifically, how hard it is to combine this data and construct good schemas.


As someone who's worked on the provider side in different capacities, I can tell you that there could be tremendous value on the provider side.

It's entirely possible that two surgeons with offices next to each other could be getting reimbursed at wildly different rates for their most common procedures by the same insurance provider.

If you're that provider, you ABSOLUTELY want to know what the surgeon next door is getting paid the next time your group is negotiating with the insurance provider.


Interesting. I'm kinda surprised this is handled by the doctors themselves. I'd expect there to be professional negotiators who parse this data themselves and then use it to negotiate on their behalf.



