I am a theoretical Physicist (PhD, semiconductor research) and software developer. I have broad experience in scientific programming, web and application development with Python and the usual front-end stuff, and I've worked a lot with data storage and processing. At the moment I work for a large German car maker in the R&D department, but I'd like to do something more meaningful. My ideal job combines math/natural sciences and professional software development, with enough flexibility to spend time with my family.
I administer a large HPC infrastructure in my day-to-day work and often need to check something on many or all of the nodes: compare directories, files, system settings, and the like. As we have around 7,000 nodes at different geographical sites, all other tools were unsatisfyingly slow when I wanted to run a command on all of them.
cash (with warm caches) takes less than 20 seconds.
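The basic pattern behind such a tool can be sketched in a few lines: fan the command out to all hosts in parallel and bucket hosts by identical output, so differing nodes stand out immediately. This is only an illustrative sketch, not the tool's actual implementation; the host names and the plain `ssh` invocation are assumptions.

```python
# Sketch: run one command on many hosts in parallel and group hosts
# by identical output, to spot nodes that differ from the rest.
import subprocess
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_on_host(host, command):
    """Run `command` on `host` via ssh; return (host, stdout). Illustrative."""
    result = subprocess.run(
        ["ssh", host, command],
        capture_output=True, text=True, timeout=30,
    )
    return host, result.stdout

def group_by_output(hosts, command, runner=run_on_host, workers=64):
    """Fan `command` out to all hosts; bucket hosts by identical output."""
    groups = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for host, out in pool.map(lambda h: runner(h, command), hosts):
            groups[out].append(host)
    return dict(groups)
```

A real tool would add caching (hence the warm-cache speedup), retries, and connection multiplexing, but the fan-out-and-compare core is the same.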
I have a PhD in physics and worked as a post-doc for a few years, until I left for industry a couple of weeks ago. The last project I worked on was developing a massively parallelized image-simulation code for scanning transmission electron microscopy. It is open-sourced here: www.stemsalabim.de
My new job in industry is consulting on HPC systems in the context of computer-aided engineering.
Your quotation marks make it seem like you disagree. There are other reasons not to do screening, for example not wanting to be faced with the decision of whether or not to abort. Here in Germany, at least in my circles, people who avoid such screenings tend to do so not for religious reasons.
This looks very interesting. Currently we are storing our dense simulation (and experimental) data in NetCDF/HDF5. Given correct chunking, this seems to be pretty efficient, both performance- and compression-wise. What would we gain by using TileDB? How does its performance compare with HDF5?
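For readers unfamiliar with why "correct chunking" matters here: in chunked formats like HDF5/NetCDF, a read only decompresses the chunks it intersects, so the number of touched chunks bounds the I/O. A back-of-the-envelope sketch (shapes and chunk sizes below are purely illustrative):

```python
# Sketch: count how many chunks a hyperslab read intersects in a
# chunked dataset. Fewer touched chunks means less decompression and I/O.
from math import prod

def chunks_touched(read_start, read_stop, chunks):
    """Number of chunks the read [start, stop) intersects, per-axis product."""
    counts = []
    for lo, hi, c in zip(read_start, read_stop, chunks):
        counts.append((hi - 1) // c - lo // c + 1)
    return prod(counts)

# A 10000 x 10000 dataset chunked as 1000 x 1000:
chunks = (1000, 1000)
# Reading a single full row crosses all 10 chunks along that row...
row_read = chunks_touched((0, 0), (1, 10000), chunks)       # 10 chunks
# ...while a chunk-aligned 1000 x 1000 block touches exactly one chunk.
block_read = chunks_touched((0, 0), (1000, 1000), chunks)   # 1 chunk
```

Matching the chunk layout to the dominant access pattern is what makes the compression essentially free for reads.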
Stavros from TileDB, Inc. here: HDF5 is great software, and TileDB was heavily inspired by it. HDF5 probably works well for your use case. TileDB matches HDF5's performance in the dense case, but it also addresses some important limitations of HDF5, which may or may not be relevant to you:

- Sparse array support (not relevant to you).
- Multiple readers and multiple writers through thread- and process-safety. HDF5 does not have full thread-safety, and it does not support parallel writes with compression; I am assuming you use MPI with a single writer, though, so HDF5 should still work well for you.
- Efficient writes in a log-structured manner, which enables multi-versioning and fault tolerance. HDF5 may suffer from file corruption upon error and from file fragmentation; you are probably not updating in place, so again not very relevant to you.

Having said that, and echoing Jake's comment, we would love to hear from you how TileDB could be adapted to serve your case better.
A general comment: TileDB’s vision goes beyond that of the HDF5 (or any scientific) format. Considering though the quantities of HDF5 data out there (and the fact that we like the software), we are thinking about building some integration with HDF5 (and NetCDF). For instance, you may be able to create a TileDB array by “pointing” to an HDF5 dataset, without unnecessarily ingesting the HDF5 files but still enjoying the TileDB API and extra features.
Jake from TileDB, Inc. here. Performance-wise, I would look at the paper referenced in this thread, which provides benchmarks for various workloads. What advantages TileDB may offer you is problem-dependent, especially for dense simulation output data, which is the use case HDF5 was designed for. If you have specific suggestions for ways to improve on HDF5 for your use case, we would love to hear about them.
Remote: Yes
Willing to relocate: No
Technologies: Python, C/C++, databases, Django, HPC (MPI, threads), web technologies, gRPC, Docker, scientific computing (NumPy/SciPy/HDF5/FFTW/....)
Résumé/CV: upon request
Email: hn@jomx.net