A look at the Mojo language for bioinformatics (viralinstruction.com)
148 points by blindseer 8 months ago | 120 comments



For what it's worth, I couldn't reproduce the benchmarks cited in the post, which claimed a 50% speedup over Rust on M1. The rust implementation was consistently about two to three times as fast as Mojo with the provided test scripts and datasets. It's possible I was compiling the Mojo program suboptimally, though.

  hyperfine -N --warmup 5 'test/test_fastq_record' \
    'needletail_test/target/release/rust_parser data/fastq_test.fastq'
  Benchmark 1: test/test_fastq_record
    Time (mean ± σ):      1.936 s ±  0.086 s    [User: 0.171 s, System: 1.386 s]
    Range (min … max):    1.836 s …  2.139 s    10 runs
  
  Benchmark 2: needletail_test/target/release/rust_parser data/fastq_test.fastq
    Time (mean ± σ):     838.8 ms ±   4.4 ms    [User: 578.2 ms, System: 254.3 ms]
    Range (min … max):   833.7 ms … 848.2 ms    10 runs
  
  Summary
    needletail_test/target/release/rust_parser data/fastq_test.fastq ran
      2.31 ± 0.10 times faster than test/test_fastq_record
(Edit: I built the Rust version with `cargo build --release` on Rust 1.74, and Mojo with `mojo build` on Mojo 0.7.0.)


It was later noted by someone on Twitter/X that the Rust version was not compiled with `--release`


That’s a fairly big omission for such an attention-grabbing performance comparison.


Hey, the Mojo parser author here. The test folder is just for the unit tests; all the benchmarking code is located in the /benchmark folder. It would be great if you could give it another go on your machine. https://github.com/MoSafi2/MojoFastTrim/tree/restructed/benc...


Thanks for the pointer, I had only checked the 'main' branch, which doesn't have the benchmarking code present.

When running on the commit & code you point to here, here are my new results:

  $ hyperfine -N --warmup 5 './benchmark/fast_parser data/fastq_test.fastq'  './benchmark/needletail_benchmark/target/release/rust_parser data/fastq_test.fastq '
  Benchmark 1: ./benchmark/fast_parser data/fastq_test.fastq
    Time (mean ± σ):     675.0 ms ±   2.4 ms    [User: 399.3 ms, System: 269.4 ms]
    Range (min … max):   670.5 ms … 677.5 ms    10 runs
  
  Benchmark 2: ./benchmark/needletail_benchmark/target/release/rust_parser data/fastq_test.fastq
    Time (mean ± σ):     840.8 ms ±   3.0 ms    [User: 578.0 ms, System: 257.0 ms]
    Range (min … max):   837.0 ms … 847.7 ms    10 runs
  
  Summary
    ./benchmark/fast_parser data/fastq_test.fastq ran
      1.25 ± 0.01 times faster than ./benchmark/needletail_benchmark/target/release/rust_parser data/fastq_test.fastq
Which indeed shows your parser running about 25% faster than the needletail version.


That's great to see, thanks a ton. Care to share your system information? I am trying to understand where the difference is coming from, and that would be really helpful.


Sure, it's the 14-inch 2021 Macbook Pro (Apple M1 Pro chip), 16GB. Connected to power, other programs running, but nothing actively working. Timings were pretty stable when running a few times.


Thanks a lot!


Interesting.


The language is far from stable, but I have had a LOT of fun writing Mojo code. I was surprised by that! The only promising new languages for low-level numerical coding that can dislodge C/C++/Fortran somewhat, in my opinion, have been Julia/Rust. I feel like I can update that last list to be Julia/Rust/Mojo now.

But, for my work, C++/Fortran reign supreme. I really wish Julia had easy AOT compilation and no GC, that would be perfect, but beggars can't be choosers. I am just glad that there are alternatives to C++/Fortran now.

Rust has been great, but I have noticed something: there isn't much of a community of numerical/scientific/ML library writers in Rust. That's not a big problem, BUT, the new libraries being written by the communities in Julia/C++ have made me question the free time I have spent writing Rust code for my domain. When it comes time to get serious about heterogeneous compute, you have to drop Rust and go back to C++/CUDA. When you try to replicate some of that C++/CUDA infrastructure for your own needs in Rust, you really feel alone! I don't like that feeling ... of constantly being "one of the few" interested in scientific/numerical code in Rust community discussions ...

Mojo seems to be betting heavily on a world where deep heterogeneous compute abilities are table stakes. It seems the language is really a frontend for MLIR, and that is very exciting to me as someone who works at the intersection of systems programming and numerical programming.

I don't feel like Mojo will cause any issues for Julia, I think that Mojo provides an alternative that complements Julia. After toiling away for years with C/C++/Fortran, I feel great about a future where I have the option of using Julia, Mojo, or Rust for my projects.


> I really wish Julia had easy AOT compilation and no GC, that would be perfect

I pretty strongly disagree with the no-GC part of this. A well written GC has the same throughput (or higher) as reference counting for most applications, and the Rust approach is very cool, but a significant usability cliff for users who are domain first, CS second. A GC is a pretty good compromise for 99% of users since it is a minor performance cost for a fairly large usability gain.


Too bad Julia doesn't have this theoretical "well written GC". I do not like GCs, so I agree with OP's sentiment. Why solve such a hard problem when you don't have to?

I don't find ownership models that difficult. It's things one should be thinking of anyway. I think this provides a good example of where stricter checking/an ownership model like Rust has makes it easier than languages that do not have it (in this case, C++): https://blog.dureuill.net/articles/too-dangerous-cpp/


On the other hand, trying to represent graph structures in Rust (e.g. phylogenetic trees, pedigrees, assembly graphs) is absolutely horrible. The ownership model breaks down completely, and while it can be worked around, it's just a terrible developer experience where I just wished I had a GC in those cases.

Practically speaking, I rarely find GC pauses to be an issue, neither latency-wise nor speed-wise. Though of course that could be due to:

1. I don't need low latency in research work,

2. I rarely work with massive complex data structures filling all my RAM where the GC has to scan the whole heap every time it runs, and

3. GC may have indirect performance effects that are not measured as part of GC runs, e.g. by fragmenting active memory more.


The arguably idiomatic way to implement such structures in Rust is to use arrays and indices; see crates like petgraph. It's probably faster as well, because there are fewer allocations and memory locality is better.
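For illustration, here's a minimal sketch of that index-based pattern (illustrative names, not petgraph's actual API): all nodes live in one Vec, and edges are plain usize indices into it, so no node ever borrows another and the borrow checker stays out of the way.

```rust
// Minimal index-based tree: nodes are rows in a Vec, edges are indices.
// No Rc, no lifetimes threaded between nodes.
#[derive(Default)]
pub struct Tree {
    pub parents: Vec<Option<usize>>, // parents[i] = index of node i's parent
    pub children: Vec<Vec<usize>>,   // children[i] = indices of node i's children
}

impl Tree {
    /// Append a node, wiring it to its parent (if any), and return its index.
    pub fn add_node(&mut self, parent: Option<usize>) -> usize {
        let id = self.parents.len();
        self.parents.push(parent);
        self.children.push(Vec::new());
        if let Some(p) = parent {
            self.children[p].push(id);
        }
        id
    }

    /// Number of edges on the path from `node` up to the root.
    pub fn depth(&self, mut node: usize) -> usize {
        let mut d = 0;
        while let Some(p) = self.parents[node] {
            node = p;
            d += 1;
        }
        d
    }
}
```

Removal is where this pattern needs care (indices can dangle or need generation counters), but for append-mostly structures like phylogenies it stays simple.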


There's work on porting MMTk to Julia, which will provide some well written GCs: https://github.com/mmtk/mmtk-julia


It's unfortunate indeed if Julia does not have a well-written GC as you imply.

While I feel like I have my head wrapped around ownership well enough to write (dare I say idiomatic) Rust without too much difficulty, I do find myself often in a position where I wish I just had a GC.

I think this speaks to what your parent comment is saying: I think there are many situations where the performance improvement over having fine-grained control of my code's memory management is not worth the extra time I have to spend thinking about it. As it stands, I will sometimes give up and slap a bunch of clones or Rcs on my code so it compiles, then fix it up later. But the performance usually is good enough for my use even with all of these "inefficiencies," which makes me sometimes wish I could instead just have a GC.
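The "slap some Rcs on it" escape hatch mentioned above can be sketched like this (a hypothetical example, not anyone's actual code): instead of threading lifetimes through every struct, share one immutable buffer by bumping a reference count, a tiny manual stand-in for a GC.

```rust
use std::rc::Rc;

// Two results sharing one large input without copying it or
// borrowing it with lifetimes: Rc::clone just bumps a counter.
struct Analysis {
    input: Rc<Vec<u8>>, // shared, immutable input data
}

fn share(data: &Rc<Vec<u8>>) -> Analysis {
    // O(1): only the reference count is incremented, no data is copied.
    Analysis { input: Rc::clone(data) }
}
```

The cost relative to borrowed references is usually negligible, which is exactly the "good enough despite the inefficiencies" trade-off described above.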


I think Julia's GC is quite good now, it can even multithread.


I'd probably describe it as "moderately good". Julia has a pretty major head start over languages like Java because it can avoid putting most things on the heap in the first place. The main pain point for the GC currently is that Julia is missing some escape analysis that would allow it to free mutable objects with short lifetimes (mainly useful for Arrays in loops). The multi-threading definitely helps in a lot of applications though.


Given the effort Swift, Chapel, Haskell, OCaml, and D are going through to add ownership without Rust's approach, not everyone feels it is that easy for most folks.


> A well written GC has the same throughout (or higher) than reference counting for most applications

Reference counting has its own problems. The true comparison should be with code that (mostly) doesn’t do reference counting.

Then, the claim still holds, IF you give your process enough memory. https://cse.buffalo.edu/~mhertz/gcmalloc-oopsla-2005.pdf:

“with five times as much memory, an Appel-style generational collector with a non-copying mature space matches the performance of reachability-based explicit memory management. With only three times as much memory, the collector runs on average 17% slower than explicit memory management. However, with only twice as much memory, garbage collection degrades performance by nearly 70%. When physical memory is scarce, paging causes garbage collection to run an order of magnitude slower than explicit memory management.”

That paper is old and garbage collectors have improved, but I think there typically still is a factor of 2 to 3.

Would love to see a comparison between modern refcounting and modern GC, though. Static code analysis can avoid a lot of refcount updates and creation of garbage.


Well, there are some big DS projects written in Rust that are now very widely used in the Python world - e.g., polars.


I just came from a CERN event, HEP seems to still be all about C++, Fortran, Python, Java, and some Go due to Kubernetes.

No Rust or Julia on their radar.


Great post. I think claims like Mojo's speedup over Rust are a problem, like the 65,000x speedup over Python. How can we differentiate between good new tech and Silicon Valley shenanigans when they make claims like that? They do nice titles and slogans but are shady in substance.


I can't take this language or company seriously after reading stuff like:

"Mojo may be the biggest programming language advance in decades"

https://www.fast.ai/posts/2023-05-03-mojo-launch.html


My answer to the deleted comment and to this.

It is a bunch of incremental improvements to the Python-like language environment.

That's no big programming language advance to me. A biggie would be something like Haskell or even Rust.

That's not to say it won't be wildly more successful as it gives a lot of what people want in a number of areas all in one go.

I'd jump on board except for the vibe around the current licensing. Maybe that will change and I'll be one of those Rust people who comment 'but Rust' on every C and C++ article, except I'll be saying "but Mojo" :)


Hard to remember the last language that felt so obviously sold by something other than an actual community. Even Swift tried its best to exist outside of Xcode and macOS/iOS.

EDIT: perhaps I'm being too harsh—this was literally just announced. I'm just taken aback by the blatant marketing as everyone else is.


Swift isn't trying anything, Apple only cares about their platforms.

Open source Swift is about as relevant as Objective-C was; never expect any big uptake if none of the key frameworks is open source.

Outside Apple platforms it only fulfills two goals: being good enough for Apple and iOS developers to deploy their server code on GNU/Linux, and a bit of goodwill marketing. That is about it.


I meant only that occasionally I run across someone who fervently believes in bringing Swift to other platforms or even just outside the Cocoa/UIKit/whatever blob. It is very reminiscent of Apple's half-assed promotion of Objective-C outside their investments, I agree, but both languages had communities pushing for expanding their usage well outside the corporate sphere. Small communities, but easy to find.


There was V lang in recent memory, which made grand claims.


A little bit of clickbait is what you need to get interest at all. That's just a fact of life.

As for this specific claim, it was coupled with a blog post that actually demonstrated the speedup on a specific problem. Getting several orders of magnitude speedup over plain python is often quite easy. That's why we have numpy and pandas after all!


Probably reasonable to label it a shenanigan if they try to differentiate with an emoji file extension.


Great post, but I think the author missed a few advantages of Mojo:

* Mojo provides first-class support for AoT compilation of standalone binaries [1]. Julia provides second-class support at best.

* Mojo aims to provide first-class support for traits and a modern Rust-like memory ownership model. Julia has second-class support for traits ("Tim Holy trait trick") and uses a garbage collector.

To be clear, I really like Julia and have been gravitating back to it over time. Julia has a very talented community and a massive head start on its package ecosystem. There are plenty of other strengths I could list as well.

But I'm still keeping my eye on Mojo. There's nothing wrong with having two powerful languages learning from each other's innovations.

[1]: https://docs.modular.com/mojo/manual/get-started/hello-world...


True, but the title of the blog is about Bioinformatics, and like another comment said:

> Bioinformatics is like 0.1% dealing with FASTQ files and the rest is using the ecosystem of libraries for statistics and plotting. Many of them in R

Considering that, do you need AOT compilation or memory ownership for doing plotting and statistics? I'd argue not, and that's why R and Python are so popular in bio.


Doesn't it make more sense, then, to have a Python-like language for speed, and Python for all that other stuff? So you learn one-ish language and get it all?


Yeah that's how it ended up for me: large datasets get churned through for speed in Python, but I then usually switch over to R with the summary data because there's just way more biology-specific ecosystem in R than in Python.

R/Bioconductor has packages for human genome-specific analyses, so it's easy to download gene positions etc.; there are packages for read simulation, amplicon sequence variant detection, gene distance simulations, any kind of RNAseq analysis you can think of... none of these packages exist in Python. If you reran it in Python you'd save 10 minutes or hours of running time, but you'd lose days or months re-implementing analyses that exist as R packages (plus those R packages often call C++ code anyway).

plus ggplot2 is miles ahead of any plotting in Python (to me :) ).


To your last point, have you tried plotnine? It's meant to be ggplot2 for python.

https://github.com/has2k1/plotnine


I've looked at it, yes!! But there are heaps of ggplot-based libraries that I also like to use: things like cowplot, ggsignif, ggtree, etc. It would be a cat-and-mouse game for plotnine to keep up with the ggplot-based ecosystem!


Yes. But the big problem is that all the things that make Python Python are also the things that make it slow. People have tried again and again to make a fast Python, and failed. And from my first impressions of Mojo, it's not very much like Python at all.


That seems to be exactly what Mojo is/wants to be. At least that's how I understand their landing page: https://www.modular.com/max/mojo


I feel the same way, I love using Julia, but the features that Mojo provides are exciting. It's great that we have both of them.


Another point of clarification that is of great importance to the results, and is a common Rust newcomer error: the benchmarks for the Rust implementation (in the original post that got all the traction) were run with a /debug/ build of Rust, i.e. not an optimized binary compiled with --release.

So it was comparing something that a) didn't do meaningful parsing against b) the full parsing rust implementation in a non-optimized debug build.


Am I missing something? In the git repository [0] it says:

> needletail_benchmark folder was compiled using the command cargo build --release and ran using the following command ./target/release/<binary> <path/to/file.fq>.

Or are you talking about something else here?

[0] https://github.com/MoSafi2/MojoFastTrim


It was later edited, after it had basically made the rounds.


Ah okay, found the commit that changed the benchmark numbers

https://github.com/MoSafi2/MojoFastTrim/commit/530bffaf21663...


How much does this particular result change when running in release mode?


Depending on the code I've seen performance increases above 100x in some cases. While that's not exactly the norm, benchmarking Rust in debug mode is absolutely pointless even as a rough estimate.


Is there any compiled language that doesn't benefit heavily from release builds? That would be interesting if true.


This can happen in languages that use dynamic constructs that can't be optimized out. For example, there was a PHP-to-native compiler (HipHop/HPHPc) that lost to faster interpreters and JITs.

Apple's Rosetta 2 translates x86-64 to AArch64 code that runs surprisingly fast, despite being mostly a straightforward translation of instructions rather than something clever like a recompiling, optimizing JIT.

And plain old C is relatively fast without optimizations, because it doesn't rely on abstraction layers being optimized out.


Julia, for example, runs by default with -O2 and debug info turned on. It's a good combo of debug-ability and performance.


On my machine, running the debug executable on the medium-size dataset takes ~14.5 seconds, and release mode takes ~0.8 seconds.


Do you know why debug mode for Rust is so slow? Is it also compiling without any optimization by default? Is it the checks for overflow?


The optimisation passes are expensive (not the largest source of compile time duration though).

Debug mode is designed to build as-fast-as-possible while still being correct, so that you can run your binary (with debug symbols) ASAP.

Overflow checks are present even in release mode, and some write-ups seem to indicate they have less overhead than you’d think.

Rust lets you configure your Cargo profiles to apply some optimisation passes even in debug, if you wish. There’s also a config to have your dependencies optimised (even in debug) if you want. The Bevy tutorial walks through doing this, as a concrete example.
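Concretely, those knobs live in Cargo.toml under the standard profile keys (a sketch of the common setup, not taken from any particular project):

```toml
# Lightly optimize your own code even in debug builds...
[profile.dev]
opt-level = 1

# ...and fully optimize dependencies, since you rarely debug into them.
[profile.dev.package."*"]
opt-level = 3
```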


That's not right: in release mode Rust only checks for overflow for numbers whose values are known at compile time. In debug mode all operations are checked for overflow.


Integer overflows can be enabled in release mode by modifying your Cargo.toml with

    [profile.release]
    overflow-checks = true
IMO it should have been the default.


Aahh, my bad. TIL.


Yes, optimization is disabled by default in debug mode, which makes your code more debuggable. Overflow checks are also present in debug mode, but removed in release mode. Bounds checking is present in release mode as well as debug mode, but can sometimes be optimized away.

There's also some debug information that is present in the file in debug mode, which leads to a larger binary size, but shouldn't meaningfully affect performance except in very simple/short programs.
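If you don't want behavior to depend on the build profile at all, the standard library's explicit arithmetic methods behave identically in debug and release, unlike plain `+`, which panics in debug and wraps in release (unless overflow checks are enabled). A small sketch:

```rust
// Explicit overflow handling: same result in every build profile.
fn add_checked(a: u8, b: u8) -> Option<u8> {
    a.checked_add(b) // None on overflow, never panics
}

fn add_wrapping(a: u8, b: u8) -> u8 {
    a.wrapping_add(b) // explicit two's-complement wraparound
}

fn add_saturating(a: u8, b: u8) -> u8 {
    a.saturating_add(b) // clamps at u8::MAX instead of wrapping
}
```

For example, `add_checked(255, 1)` is `None`, `add_wrapping(255, 1)` is `0`, and `add_saturating(255, 1)` is `255`.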


This is not accurate. The blog post used `--release` for its Rust numbers. The confusion comes from the 50% performance win being specific to running on an M2 Mac. On an x86_64 Linux machine, the results are more or less equivalent.


> If I include the time for Julia to start up and compile the script, my implementation takes 354 ms total, on the same level as Mojo's.

I don’t think the article mentions it explicitly, but I suppose the timing is from Julia 1.10: as far as I can remember, this kind of execution time would have been impossible in Julia 1.8 even to run a simple script.

Bravo, Julia devs. Bravo.


For a script like this that doesn't have any dependencies, Julia 1.10 doesn't make a significant difference. That said, for real-world usability, Julia 1.10 is dramatically better than previous versions.


Folks using multiple languages, what is your workflow?

I do most DS/ML work in Python but move to R for stats, and publication-ready plots and tables (gt is really great). I often switch between them frequently, which is a hassle in the EDA and prototyping stages, especially when using notebooks. I enjoy Quarto in RStudio, but the VS Code version is not that great.

How do you make it work?

Also, after so many years using Python and R, I would love to learn a new language, even if only for just a couple of use cases. I considered Elixir for parallel processing and because it has a nice syntax, but ultimately decided against it because it can be a little slow and isn't used much in my area (sadly!). Rust seems to require too much time to get decent at it. Any recommendations? (Prolog?)


I use Python and write my results to a CSV that I quickly import into R to do my fancy stats.

Tbf Python's stats implementations can be garbage; the last time I checked you couldn't do multiple levels for hierarchical regression.


My workflow is similar to yours: python for deep learning and surface reconstruction. R for stats and plots.

I use go extensively for data preprocessing. Sounds weird but it works well for highly repetitive conversion tasks like DICOM parsing, converting EKGs to numpy, etc.


It's hard to learn a language for fun, so I'd pick something that fits your needs to build something (or even just your curiosity). Elixir and Prolog, although both cool, might not fit the bill because they really excel at one particular thing.

Golang is a popular answer, as you can start building stuff with it fairly quickly (especially compared to Rust). Java can also be useful if you haven't learned it and find a use case (although you will hear it bemoaned as the "New COBOL", there is still a lot of work done using it).


I've been thinking to learn Rust for these use cases, but always get frustrated with the complexity.

I find Go is a great middle ground though! And now there are starting to be a few more bio-related tools and toolkits out there, including:

- https://github.com/vertgenlab/gonomics

- https://github.com/biogo/biogo

- https://github.com/pbenner/gonetics

- https://github.com/shenwei356/bio

... apart from there being some really popular bio tools written in Go, like:

- https://github.com/shenwei356/seqkit

I think Go lost a bit of steam in bio after Rust started to take off, but the field keeps growing, and people are also starting to realize Rust isn't the answer to everything. I.e. it is fantastic for fast tools, but for replacing Python for all of the various ad hoc coding in biology ... nah, not so much. That's where I think Go shines.


As someone who practices bioinformatics, it doesn’t seem appealing. Bioinformatics is like 0.1% dealing with FASTQ files and the rest is using the ecosystem of libraries for statistics and plotting. Many of them in R, by the way.


To disagree, I'm a computational biologist and it's my firm belief 99% of the scientifically important stuff happens before the stats and plotting. That's not to say I dismiss those things and haven't done my fair share of stats, but just that the difference between real results and incorrect results most often happens before that step.

I'm a microbiologist though, for stuff like human RNA-Seq I understand that it's often plug and play to get a gene counts table at this point.


Sure, but I think, for example, representation learning, doesn’t involve manipulating an array of strings.


>To disagree, I'm a computational biologist and it's my firm belief 99% of the scientifically important stuff happens before the stats and plotting.

I'm a microbiologist too, but the kind that uses mostly off-the-shelf tools to do taxonomic/functional assignment on metagenomes, and then stats/data science on the features. I kinda don't know what you mean by "99% of the scientifically important stuff happens before the stats and the plotting".

I mean, give me a 500x2.6x10^6 sparse matrix of gene function abundances and tell me that you've done anything scientifically meaningful. Or on the other side, let me hand you a fastq file from sequencing a poorly extracted DNA sample, and you give me the best algorithm in the world, and there's nothing scientifically meaningful that's going to come out of that.


I guess that depends on your exact ecological niche within bioinformatics.

I got my start at a NGS facility, so handling FASTQ was closer to 80% of my time, so any speedups would have been greatly appreciated.


> I guess that depends on your exact ecological niche within bioinformatics.

Agreed. I know people in my department who just ran Galaxy pipelines and R scripts to make pretty plots. I was on the other side of the spectrum and needed fast parsers, so the SAM and VCF specifications were my bible.


As someone who is considering a switch from generic software engineering towards bioinformatics, what would you say the pain points are?

If this is not the way to remove workflow friction, what is?


I had an ok career in software engineering (Android/iOS -> backend -> engineering management) before getting MS in Bioinformatics and starting a PhD in Medicine.

For me, the pain points are often the same as in business. Biologists with no data analysis experience want something done without understanding constraints. Requirements are often not understood and there isn’t a good plan.

Some people do indeed suffer from code being slow, and this can be solved with better tools. I work with large datasets in single-cell genomics (over a million cells) and the model takes ~12 hrs to train on an entry-level GPU. So, most of my time is spent trying to understand the results.


Honestly the major pain point is that the grad student who wrote the package you need is no longer maintaining it because they’ve graduated. Also the code they wrote sucks, but whatever.

I’m wary of software engineers coming over to bioinformatics because they never have the domain expertise required to make meaningful contributions, and yet many think they know everything.


Yeah, I'm wary of being that guy too. My current approach is the slow one: first get a biochemistry degree.


Would like to second this question. I'm very interested in getting into this world, but it feels like there isn't a clear path (especially for someone self-taught like me). Bioinformatics feels pretty inaccessible without a computer science or biology degree, even with substantial R and Python experience.


There's a few camps in bioinformatics, from what I've seen.

1) The fellows writing papers - usually these guys have PhDs, usually science-focused.

2) Analysts - often have a background in mathematics, biology, or big data. Success here can lead to an onramp to camp 1. Much of your time here is spent in interactive programming environments, like Jupyter notebooks.

3) Programmers - writing novel or faster bioinformatic tools, often in low-level languages like C++ or Rust. Sometimes you can get a paper out of these, especially if you have a CS background. There's increasingly room for higher-level tools here too, so it starts to overlap with 2.

4) Pipeline programmers - people gluing analysis workflows together out of the tools written in low-level languages, often with a liberal helping of Unix command-fu. Often a sort of ad-hoc role, containing people from diverse backgrounds, from biology to sysadmin. (This is my current role.)

5) Biology/wetlab - people running experiments in the lab who want to analyze their own work, especially for QC purposes. Wild-west ad-hoc development practices.


I couldn't speak to careers, but my curiosity was enough for me to ask a biochemist to join his bioinformatics class despite lacking a great many prerequisites.

I was quite helpful to him and the other students (who mostly struggled with packaging: conda, pip, apt, etc). In turn, they were quite patient with my lack of biochemistry background. It was nice to get a taste without having to take what would've been 2.5 years worth of prerequisites.


I think there’s a lot of gatekeeping, and having some formal degree is a prerequisite. And be advised that pay isn’t great either.

But bioinformatics is an umbrella term. There are so many different things people do. I started by identifying the field I’m interested in (ageing and immunology) and backtracked from there.


> It does grate me then, when someone else manages to raise 100M dollars on the premise of reinventing the wheel to solve the exact same problem, but from a worse starting point because they start from zero and they want to retain Python compatibility. Think of what money like that could do to Julia!

Python is a juggernaut with total control of the ML space and is a huge part (even if less dominant) in modern scientific computing.

A VC has way better chances of success building solutions compatible with Python rather than replacing it.


I was interested in trying out Mojo. Then I looked at it and booked out quick.

No one will use a language that isn't free and open source.

If Mojo was free and open source (wasn't a company), and didn't just give out binaries with a 'trust me bro' stamp of approval, then I would have worked with it. But it's not, so I will never use it.


I get your viewpoint. However, in terms of numbers, I suspect >90% of the populace (even research populace) will care that it is free-as-in-beer and that's all. So from the VC's point of view...



I’m really excited about Mojo’s potential. But I don’t think it’s ready for real use outside its AI niche yet. Being able to call Mojo functions from Python is the sentinel capability I’m waiting for before considering it for general-purpose code.


I felt like I learned more about the author than Mojo.

- Never actually runs it. Seriously.

- Wants us to know it's definitely not a real parser as compared to Needletail... then 1000 words later, "real parser" means "handles \r\n... and validates 1st & 3rd lines begin with @ and +... seq and qual lines have the same length".

- At the end, "Julia is faster!!!!" off a one-off run on their own machine, comparing it to benchmark times on the Mojo website

It reads as an elaborate way to indicate they don't like that the Mojo website says it's faster, coupled to an entry-level explanation of why it is faster, and disturbingly poor attempts to benchmark without running Mojo code.


I feel like if you believe my conclusion was that "Julia is faster" then you are missing the point.

The point is that the original blog's claim of "Mojo is faster" isn't right - it's comparing different programs. That implementation in Mojo is faster than Needletail - but that doesn't say very much, and I prove it by also beating Needletail in Julia by using the same algorithm Mojo does. So it's the algorithm. Not Mojo. Not Julia.

Also, did you even read my discussion of how much a parser ought to validate? Your summary is completely missing the point.
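For what it's worth, the record-level checks under discussion (first line starts with @, third with +, sequence and quality lines the same length, \r\n normalized away beforehand) amount to something like this sketch - hypothetical function and parameter names, not the post's actual code:

```rust
/// One FASTQ record split into its four lines, with any trailing
/// `\r\n` already stripped. Returns whether the record passes the
/// validation rules discussed above.
fn valid_record(header: &str, seq: &str, plus: &str, qual: &str) -> bool {
    header.starts_with('@')        // line 1 must begin with '@'
        && plus.starts_with('+')   // line 3 must begin with '+'
        && seq.len() == qual.len() // sequence and quality lengths must match
}
```

So `valid_record("@read1", "ACGT", "+", "IIII")` passes, while dropping the `@` or handing over a truncated quality string fails.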


Yeah, I got the joke, and understood the parser.

It's just that the length-to-content ratio is high - all I got out of it was that you don't like the Mojo speed claim & that genomics parsing is text parsing*

Don't take that the wrong way, I feel bad. It's just bad for me - I'm a mobile developer, so I was way out of my domain, I've barely written Python, Julia is a complete abstraction to me outside of HN. An alternative way to think about it is, I shouldn't have expected an in-depth analysis of Mojo.

* i mean, everything is bytes parsing, but it always tickles me when I find out other domains aren't castles in the sky, speaking an alien language


Yeah, I get that. If you were expecting a review of Mojo, then the post falls short. Maybe the title should have emphasized that the benchmark was in question, not Mojo itself.


I'm a data scientist, not a bioinformatician and I really enjoyed the post. I too am sceptical of Mojo though, so maybe it just played to my biases...


It looks like you very dramatically missed the point


Please, explain


How does a software engineer transition into bioinformatics or computational biology? I've taken some online courses on bioinformatics and have some experience with large distributed jobs, but these jobs seem few and far between and generally want M.S./PhDs in bioinformatics. Is it really a field that's not viable to enter without an M.S.?


Doing a Master's and/or PhD in bioinformatics is probably the easiest way. It's a pretty specialized field, and the first couple of years are usually spent learning the basics. You are unlikely to find anyone willing to hire you to a real job to do that.


I think the challenge is learning enough of the biology outside of academia. I think it is fully possible, e.g. from books and videos... but it will take a lot of determination.

For the bioinformatics part, I think something like the "Genomic Data Science" specialization on Coursera should be a pretty good start.


I'm not sure what's the best strategy to get hired, but professionally, you need to learn as much biology as you can. Cell biology, molecular biology, genetics, physiology. My experience has been that there are a bunch of software engineers in bioinfo already who fall short on the biology side. Differentiate yourself from those.


Crystal was never able to find traction as a Ruby clone that could compete with C speeds. Why would a Python clone have any better luck? I don’t think anyone would accuse Python of being dramatically more usable than Ruby.


I think the appeal of Crystal is for users who already know Ruby, so the market was already limited there.

Crystal itself is a gem, but comparing it to Mojo and its relation to Python, while fair, gives the wrong message. Python is far more popular because of all its packages, so the market is much larger there.


Well, for the domains Mojo targets, Python is king. So a faster-Python-like language would have more potential audiences. A fast Ruby-like language, not so much, as Ruby was never that special in those domains, or in most places outside web development, and even for that it kind of lost steam in the past 10 years.

Besides, people opting for closer-to-C speed had Rust, Go, Java, Swift, and other options to go to, all with more momentum and support, before resorting to a yet-unproven Ruby clone.


I used to be quite sceptical given how Swift for TensorFlow went; however, since NVIDIA decided to partner with Modular, alongside their ongoing CUDA JIT bindings for Python, I think Mojo might actually work out.


Chris Lattner has made a few comments here about Mojo the last few months.

https://news.ycombinator.com/threads?id=chrislattner

Here's his comment on swift for tensorflow:

https://news.ycombinator.com/item?id=37330031


In case you missed it, he was replying to me...


Haha, oops, my bad. That's funny, though.


"Swift for Tensorflow" never had any real backing apart from the announcement though.


Apparently it had Google's money backing, for what it is worth.

I never believed in it, because Swift is about as relevant as Objective-C outside NeXT/Apple's platforms, and not the kind of programming language the research community cares about.


>Apparently it had Google's money backing, for what it is worth

You mean they paid to have it created, like they pay for thousands of other things.

But it was never really pushed, the way they push things they want to promote.


It certainly got more love than Dart 1.0.


Crystal is an entirely different language with a similar syntax. Valid Python is valid Mojo


Apparently that is the goal, but not the reality:

> Mojo is still early and not yet a Python superset, so only simple programs can be brought over as-is with no code changes. We will continue investing in this and build migration tools as the language matures.

https://docs.modular.com/mojo/faq.html#how-do-i-convert-pyth...


Crystal didn't have much use in Ruby's sweet spot—being a DSL for some immensely complicated-to-configure framework (e.g. Rails, Chef).


From someone who would love for Crystal to be the answer here, because of its fantastic concurrency features: It is a bit of a non-starter because of excessive compile times for larger projects. Also, they hadn't solved the cross-compilation issue last time I checked.


I think it's less about the language and it's more about Modular's product, their MAX supercomputer thingy.


Because of the people and companies behind the project.


>>> As a bioinformatician who is obsessed with high-performance, high-level programming, that's right in my wheelhouse!... Mojo currently only runs on Ubuntu and MacOS, and I run neither. So, I can't run any Mojo code

1. Back to the Rust vs. Mojo article that kicked this off... this isn't someone who is going to use Rust.

2. Availability, portability, ease of use... these are the reasons Python is winning.

3. I am baffled that this person has to write code as part of their job and does not know what a VM is! Note: this isn't a slight against the author; I doubt they are an isolated case. I think this is my own cognitive dissonance showing.


Author here. I do know about VMs. Is it too lazy of me to write that article without bothering to install a VM with Mojo (and Rust and Julia, to benchmark in the same environment)? Maybe. If this were for my work, I certainly would have felt compelled to.

On the other hand, the fact that Mojo doesn't run on Windows and most Linux distros is a point in itself. And also, would the blog post really be substantially improved if I had gotten the number of milliseconds right for the Mojo implementation on my computer? Of course not. It should be clear that the implementations are incomparable, and that a similar Julia implementation is very fast which implies that the reason the original Mojo implementation allegedly beat Rust is not because Mojo is faster. It's just a different program.


>> Is it too lazy for me to write that article and not bother to install a VM with Mojo

Yes.

Would you talk about a book you didn't read? Or a movie you didn't see? Not on any meaningful level.


Someone knowledgeable enough about movies can read a script and know if it's good or not without needing to see it actually produced.

Here, it's possible to read the code and know what the program does well enough to critique it for what it is.


But he did read the "book" (source code). But ignoring analogies, can you cite a specific benefit to running the benchmark when discussing parsing correctness?


That's not a very good analogy, you can understand code without having to run it.


Got the same general impression. TL;DR: wrote a benchmark article without... running it? Then you conclude with "the language I use is faster!!!" based on a one-off run on your machine, which surely isn't the same machine Mojo used to run benchmarks for their website copy?

It's odd to read something that's pretty well-versed in some relatively complex CS concepts, i.e. it's not just a PhD with a blank text editor, but that simultaneously makes egregiously obvious mistakes I wouldn't expect any college graduate to roll with.

There's a certain type, and I don't know what name to give it, especially because I certainly don't want to give it a condescending name. I call it "data scientist types" when I'm in person with someone who I trust to give me some verbal rope.

Software really feels like it ate everything and everyone. So you end up with insanely bright people who do software engineering as part of their job, but miss some pieces you expect from trad software engineering.


>TL;DR: wrote a benchmark article without...running it?

He benchmarks against the Rust implementation, which, unless benchmarks have zero meaning, should be sufficient to get a general sense of the scale of the difference. The post is obviously not meant as the last word on this benchmark; it's meant to show that the benchmark is kinda meaningless.

>Then you conclude with "the language I use is faster!!!"

If this is your take-home from the post, it's pretty clear you didn't read it, or your reading comprehension needs some work. That sentence was obviously facetious, poking a little fun at the author of the original piece.


> He benchmarks against the rust implementation.

No he doesn't.

The post is Mojo for Bioinformatics.

They ran a completely different library, in a different language, on their machine.

They did not run anything in Mojo.

You are asserting that one data point of a Rust bioinformatics library on a random machine contributes information about Mojo, and berating me about reading comprehension to cling to that.

> If this is...

"If this is your take-home from my post, it's pretty clear you didn't read it, or your reading comprehension needs some work. That sentence was obviously facetious, poking a little fun at the author of the original piece."

^ seriously, right back at you. With a wink, and hopeful understanding I'm saying subtly "relax partner." Your first reaction should be curiosity when you're confused, not name-calling.



