Synthea: Open-source synthetic patient generation (synthetichealth.github.io)
88 points by johncole on May 19, 2023 | 19 comments



I never expected to see MITRE on the front page of HN! We're actively adding more synthetic data sources to Synthea all the time.


Love Synthea, it's an amazing project and you should be very proud of it. My only gripe is how clean the data is compared to what many other EHR providers actually generate, but that's more on them than you guys.


It would be perhaps interesting to add a layer capable of injecting reasonable noise on top of these clean records.


I was thinking about this earlier this month, actually! I generate a lot of synthetic flight data, and we have to reproduce the noisiness of real data as well.


Doesn't MITRE maintain the CVE database?


Yes, MITRE is a non-profit organization that works with a number of US government agencies to cover a pretty large swath of areas: https://www.mitre.org/focus-areas


Played around with this in my soon-to-be previous health-tech job and it's great.

Actually the entire hl7-fhir ( https://www.hl7.org/fhir/ ) standard seems to me quite solid. It would be wonderful if a new cohort of start-ups would leverage it to drastically improve the digital UX of healthcare generally.


That would be great, except that the 8,000 lb gorillas of the medical data industry, at least as of a year or two ago, did next to nothing to really make their EHRs FHIR-compatible. Even some of the very basics on their demo environments were fundamentally broken.


Yeah, so many cards stacked against start-ups that could potentially bring some quality to the industry :/ Curious to see, though, that Google Cloud, AWS, etc. are building FHIR store APIs.


How did you like the health-tech industry? It is something I have been considering but have heard mostly negative opinions.


Does anyone know if there is an equivalent for generating "random" viable products[1] in a PDM/ERP system?

I'm demoing some systems in this field for outside interests, but I can't use any "real" data due to ITAR and data restrictions like TC, NC, etc. Wait, what about the ERP? The ERP I'm developing against has "sample" data that's basically useless. Not much better than lorem ipsum pasted across ten thousand cells. Actually, it's worse than that, because . . ah hell, this is HN, I won't waste your time. People here know what the ERP ecosystem is like. I also don't want to build out from a bespoke, brittle ERP - that's how we got into this mess in the first place.

[1] Like a multi-level BOM that makes sense, or a Service BOM / Logistics Database that's meaningful. Anything for making pseudo-random PLs that follow MIL-STD-100, which is still considered frickin' Holy Ground by these people.


Building synthetic BOMs can be fairly straightforward if you can define the level of coherency you want to see. The only big trick I have found in building structured data like this is to first build dictionaries of randomized data with very little coherence and then build larger structures that include elements of the dictionaries.

As an example, you might want to have a model of users interacting with a web site, ordering products and shipping them to their homes. This can start with building a dictionary of user records and orderable item descriptions. The user records would have an address and some "interest" variables that define what the user is likely to order. The item descriptions can have a lot or a little information but would centrally contain a part number and some information that allows the part to be selected efficiently (a numerical vector may be enough). If you want to be crazy, you can use generative models to generate descriptions from random semantic starting points or use lower level tables to piece together these things.

At this point, you can pretty easily build a user model and run it for each user to generate coherent transactional histories.
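A minimal Python sketch of the dictionary-then-model approach described above. Every name here (the field names, `build_users`, `build_items`, `simulate_orders`, the dot-product weighting) is a hypothetical choice for illustration, not something taken from log-synth.

```python
import random

def build_users(rng, n):
    # Dictionary of user records: an address plus an "interest" vector.
    return [
        {
            "user_id": i,
            "address": f"{rng.randint(1, 999)} Example St",
            "interest": [rng.random() for _ in range(3)],
        }
        for i in range(n)
    ]

def build_items(rng, n):
    # Dictionary of orderable items: a part number plus a feature vector.
    return [
        {"part_number": f"P-{i:05d}", "features": [rng.random() for _ in range(3)]}
        for i in range(n)
    ]

def simulate_orders(rng, user, items, n_orders):
    # User model: weight each item by the dot product of the user's
    # interest vector and the item's feature vector, then sample orders.
    weights = [
        sum(u * f for u, f in zip(user["interest"], item["features"])) + 1e-9
        for item in items
    ]
    chosen = rng.choices(items, weights=weights, k=n_orders)
    return [
        {
            "user_id": user["user_id"],
            "part_number": item["part_number"],
            "ship_to": user["address"],
        }
        for item in chosen
    ]

rng = random.Random(42)  # fixed seed so runs are reproducible
users = build_users(rng, 5)
items = build_items(rng, 20)
history = [order for u in users for order in simulate_orders(rng, u, items, 3)]
```

The dictionaries themselves carry almost no coherence; the coherence comes from the user model that ties each order back to one user's address and interests.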

Several of these ideas are present in a project I worked on called log-synth [1]. For instance, the VIN generator has tables of factories and such for BMW and Ford so it generates kind of coherent VINs that can be traced back with factory location, engine and body type. If you look hard these are nonsense, but if you squint the right way they look fine.

The commuter generator and the DNS query generator are examples of higher-level transaction generators. For the commuter, there is a model of a user with a home location and a work location. These commuters go to work some days and run errands other days, and there is a simple model to pick an activity. Digging in, each activity breaks down into journeys along entirely incoherent road structures, but details like a physical model of the engine and the car's velocity are maintained so you can get realistic diagnostics from the vehicles from somewhat realistic life histories. The DNS query generator is similar but with less physics.

One nice statistical concept in all of this is a distribution over a notionally infinite set. Some things in the set will be much more commonly seen than others and thus we are likely to see those sooner. The generator of these things can maintain an estimate of the frequency of all previously seen things and a probability of seeing something new (see the Chinese Restaurant process [2]). You only need to generate the specifics of a thing in this infinite set when you first see it, which gives you a pretty realistic texture to the fictional transactional world.
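A rough Python sketch of a Chinese Restaurant process sampler along these lines. The class name, the `make_item` callback, and the `alpha` value are assumptions for illustration; this is not log-synth's implementation.

```python
import random

class ChineseRestaurantProcess:
    """Lazily samples from a notionally infinite set of items."""

    def __init__(self, alpha, make_item, rng):
        self.alpha = alpha          # concentration: larger means more new items
        self.make_item = make_item  # called only when a new item is first seen
        self.rng = rng
        self.items = []             # items seen so far
        self.counts = []            # how often each item has been drawn
        self.total = 0              # total number of draws

    def sample(self):
        # Probability of a brand-new item is alpha / (total + alpha);
        # on the first draw this is 1, so the first item is always new.
        p_new = self.alpha / (self.total + self.alpha)
        if self.rng.random() < p_new:
            index = len(self.items)
            self.items.append(self.make_item(index))
            self.counts.append(1)
        else:
            # Otherwise reuse an existing item in proportion to its count.
            index = self.rng.choices(
                range(len(self.items)), weights=self.counts
            )[0]
            self.counts[index] += 1
        self.total += 1
        return self.items[index]

rng = random.Random(7)
crp = ChineseRestaurantProcess(alpha=2.0, make_item=lambda i: f"item-{i}", rng=rng)
draws = [crp.sample() for _ in range(1000)]
```

Items drawn early tend to accumulate large counts, so a few items dominate while new ones keep trickling in at a slowing rate.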

Relative to your problem of multi-level BOMs, you could say that a BOM is a list of items. Pick the desired length from a suitable distribution. Then pick each item from a Chinese Restaurant process. As you generate new items, decide if the item is composite and if so, generate a BOM for it recursively. Constraints like forcing a composite item to not recursively contain anything of the same type can be enforced using a rejection method (sometimes).
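As a sketch of that recursion in Python, under several assumptions: the `BomGenerator` class, the `PN-` part numbers, the uniform BOM length, and all parameter values are made up for illustration, and "same type" is simplified to "same part number". A depth cap is added so generation always terminates.

```python
import random

class BomGenerator:
    def __init__(self, rng, alpha=3.0, p_composite=0.3, max_depth=4, max_tries=10):
        self.rng = rng
        self.alpha = alpha              # CRP concentration parameter
        self.p_composite = p_composite  # chance that a new part is composite
        self.max_depth = max_depth      # no new composites at or below this depth
        self.max_tries = max_tries      # rejection-sampling budget per slot
        self.parts = []                 # part numbers in order of creation
        self.counts = []                # CRP usage counts, parallel to parts
        self.boms = {}                  # composite part -> list of child parts

    def _new_part(self):
        part = f"PN-{len(self.parts):04d}"
        self.parts.append(part)
        self.counts.append(1)
        return part

    def _draw_part(self, ancestors, depth):
        # CRP draw with rejection: a part that is its own ancestor is
        # rejected and redrawn; after max_tries the slot is given up.
        for _ in range(self.max_tries):
            total = sum(self.counts)
            if self.rng.random() < self.alpha / (total + self.alpha):
                part = self._new_part()
                if depth < self.max_depth and self.rng.random() < self.p_composite:
                    self.boms[part] = self._make_bom(ancestors | {part}, depth + 1)
                return part
            index = self.rng.choices(range(len(self.parts)), weights=self.counts)[0]
            if self.parts[index] not in ancestors:
                self.counts[index] += 1
                return self.parts[index]
        return None

    def _make_bom(self, ancestors, depth):
        length = self.rng.randint(2, 5)  # stand-in for "a suitable distribution"
        drawn = (self._draw_part(ancestors, depth) for _ in range(length))
        return [part for part in drawn if part is not None]

    def make_product(self):
        # A top-level product is always a new composite part.
        part = self._new_part()
        self.boms[part] = self._make_bom({part}, 1)
        return part

gen = BomGenerator(random.Random(11))
products = [gen.make_product() for _ in range(3)]
```

Because a part's BOM is fixed when the part is first created, and every part still under construction is in the ancestor set, reused parts cannot introduce a cycle. When rejection runs out of tries the slot is simply dropped, which is one reading of the "(sometimes)" caveat.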

If this seems at all interesting, ping me by filing an issue on the log-synth github repository.

[1] https://github.com/tdunning/log-synth [2] https://en.wikipedia.org/wiki/Chinese_restaurant_process


Synthea is great! We use it a ton at Medplum, and the sample data that conforms to USCDI is especially useful; we recommend it for those who are getting started. https://www.medplum.com/docs/tutorials/importing-sample-data


I actually had this idea when I worked for a local HIE. I just lacked the technical competency to make it real. I think this would be incredibly useful for the adoption of FHIR and also learning more about HL7. For security-minded folks this information could be a good tool for tuning DLP and other tools without using real patient data.


I've done this sort of thing before with home-rolled tools; it easily becomes a time sink. Having a centralized shared effort seems like it could be really valuable.

One thing that is tricky is that you often need signal and/or image data as well.


I recommend the OMOP schema as a go-to standard for EHR data. There's an ETL pipeline for converting Synthea output into OMOP.

https://github.com/OHDSI/ETL-Synthea


Neat! We made a synthetic patient generation prototype a few years ago: https://pau.treenotation.org/synth/

The challenge at the time was generating realistic correlations between the columns. How do you approach this?

I noticed LLMs are a huge breakthrough here with the downside that they currently rely on massive online models. I wonder if someone could train a tiny model that could fit on a local machine specifically to solve the synthetic data problem.


I’ve used Synthea for a whole assortment of small and large projects and it’s been boring in the best possible way: reliable and easy to use.

I’ve also had the pleasure of working directly with the team at MITRE that owns it on a consulting engagement (we needed some improvements to it) and they are a delight to work with.


I’ve worked at the intersection of AI & healthcare for years and this has been an excellent tool I’ve leveraged in the past; synthetic data is particularly helpful in the context of healthcare!



