Hobbes – A language and an embedded JIT compiler (lambda-the-ultimate.org)
152 points by ah- on July 16, 2017 | 49 comments



This should be of great interest to anybody working with high-volume sensor data in the IoT field. I will be checking it out very soon (on a few deadlines now, so I can't afford the diversion). Anything that is fast and simple for massive volumes of small record types should always be of great interest to IoT/WoT folks.

I agree with the comments about KDB/Q: I tried to look at it but could never understand it. Maybe it's because I'm not a full-time dev, nor in that field, but KDB always just becomes too hard. At least hobbes looks like something I can work into C++. On a first 'parse' it looks like the right tool for the job of working on high-volume discrete data, which is what one would expect from an investment bank.

Kudos to the bank for releasing this part of their secret-sauce.


Thanks, I hope to hear about your experience if/when you look into this (and I'm happy to help in any way I can). :)


I started the hobbes project at MS and am happy to answer any questions that folks have about it!


Are you able to comment on your implementation of row types?

You give the following type for a field selector:

  > :t .foo
  (a.foo::b) => (a) -> b

These give constraints with arbitrary types embedded in them. Most examples in the literature using qualified types suggest something like the following type:

  (r lacks foo) => { foo :: b | r } -> b

Here r is of kind row. The lacks constraint is a very simple one, and it is clear that this function takes a record; the disadvantage is that predicate lists can get very large. The old Haskell TREX extension built into Hugs worked this way.
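
E.g., roughly in Hugs TRex notation (from memory, so take the exact syntax with a grain of salt):

  -- #foo is the field selector; the lacks predicate r\foo says the rest-row
  -- r has no foo field of its own.
  getFoo :: (r\foo) => Rec (foo :: b | r) -> b
  getFoo r = #foo r

  -- Extension is where "lacks" really pays off: it keeps the result row
  -- unambiguous.
  addFoo :: (r\foo) => b -> Rec r -> Rec (foo :: b | r)
  addFoo v r = (foo = v | r)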

It would be great to learn some background behind your choice. Thanks.


Yes, TREX and Mark Jones's work on qualified types was very influential to this project. I am a big fan of that work.

I added a few details in this post below:

https://news.ycombinator.com/item?id=14787429

I hope that answers your questions although I know that there's a lot more that could be said about this topic.


What motivated you to start building your own language? Did you use much kdb/q before?

How did you manage to get this open sourced?


Well, I have been interested in programming languages for a very long time. I was hired into a group at Morgan Stanley specifically to develop this project. We had some initial use-cases where internal trading systems could be "hot patched" with new logic, but it was very slow (there was basically a simple AST-walking interpreter there). So at first I just had to make a roughly equivalent thing that could be much faster. However, once that immediate need was satisfied, we found many more interesting places to apply this tool. I would say that the storage and live data-analysis features are just as valuable as the efficient machine code produced by hobbes.

Yes I have used kdb/q before. It's pretty popular in the financial domain. I really like what we can do with type systems. :D

Getting this project open sourced was really accomplished by other folks, whom I'm not sure I can name here (I will check with them), but I was very happy to help make it happen. Several folks internally have been working really hard to make it possible for the firm to open source the work that we do, and hobbes is just one part of that effort. Several people in management positions have made open source efforts like this a priority and have even gone out on a limb to make sure that this happens. Morgan Stanley is a really remarkable company in the technology/finance domain, in my experience.


I'd like to know this as well. I know most array languages coupled with a DB aren't as fast as a language with a JIT, but the code is quite expressive. If kdb+ could ever easily be run on an FPGA or GPU cluster... that would be something.


Not kdb+, but a proprietary (internal-only) language that it heavily influenced was designed around execution on GPU clusters.

The FPGAs were used mainly for feed handlers, and there was a different DSL for that (compiling to Verilog).

It was indeed rather something to see :)


Was it received favorably? A secret sauce, or a failed project?


Secret sauce, at least for the team involved, and well-received. Nice combination of extreme perf and decent productivity for them.

I've moved into non-finance stuff for quite a while now, though, so I'm not sure what's become of it. Given the business challenges in that particular sub-field, who knows...


There was a paper a while back from someone working with MS on a Q typechecker in Prolog. Was this you?


Nope, that wasn't me. :)


The LtU blurb says "a variant of Haskell", but the github readme does not mention non-strict. Strict or not?


Strict


Any plans to move beyond "allocate linearly and never free anything" memory management?

The semantics look nice and coherent, but unfortunately, without some kind of memory management (automatic or manual), it's not usable as a general-purpose tool.


I wouldn't summarize it that way. We obviously "free" memory or else our trading systems wouldn't work and we'd be in the news in a bad way. :)

Allocation does happen out of a default region and applications that embed hobbes generally define "transaction boundaries" for their use of compiled expressions which may allocate. It's also possible to manually nest these regions and control them yourself, which we do in some applications (e.g. where we're loading whole days of trading data and indexing them in batches).

Ultimately we might formalize that process with a region inference algorithm.
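
To give the flavor, here's a toy model of that nesting in Haskell (my sketch of the idea only, not the actual hobbes API):

  import Data.IORef

  -- Toy region: a log of allocations that is dropped all at once at the
  -- "transaction boundary" rather than freed object-by-object.
  type Region a = IORef [a]

  withRegion :: (Region a -> IO b) -> IO b
  withRegion body = do
    region <- newIORef []   -- open a fresh region
    result <- body region   -- run the transaction against it
    writeIORef region []    -- boundary: reset the whole region in one shot
    return result

  allocIn :: Region a -> a -> IO a
  allocIn region x = modifyIORef region (x :) >> return x

  -- Regions nest: a batch job can open an inner region and release it
  -- without touching allocations in the enclosing one.
  main :: IO ()
  main = withRegion (\outer -> do
    _ <- allocIn outer "long-lived index"
    withRegion (\inner -> allocIn inner "per-batch scratch" >> return ()))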


Wait, so there's not even manual memory management? I can't even call free()? How would this work then?


All embeddable languages should come with their own header parser/code generator to save us from the hassle of generating a bunch of class wrappers and boilerplate registration code. I think this is high up on the list of concerns in a professional setting.


This looks like a typed take on KDB/Q, something that is long overdue! The key to making this work is structural typing of rows (row polymorphism / extensible records). I'd be interested to see more details on how they tackled this.


Morgan Stanley has a history of languages in this vein. MS was a big APL shop in the 80s. Arthur Whitney[1] worked at MS and developed the languages A and A+. He later wrote J and K.

[1] https://en.wikipedia.org/wiki/Arthur_Whitney_(computer_scien...


I think J was mainly written by Roger Hui (now at Dyalog APL) after Ken Iverson designed it. I think Roger was inspired by Whitney, though... or was Whitney actually involved?


Yes, structural typing of records (and variants and recursives) has been very valuable to us.

There are a couple of type constraints (à la type class constraints) that make most of this work.

There's a constraint that looks like 'r={lbl:h * t}', which decides if/how a type "r" can be broken into a head field of type "h" (with a field named "lbl"; this type variable gets the field name and can be used to e.g. relate one record type to another, as when deciding a structural subtyping conversion) and a successor record type "t" (this bottoms out in the unit type).

There's also a constraint "r.f::x" (or we write "r/f::x" if "f" is a type variable) which decides if/how the type "r" can be indexed by the field name "f" to determine a type "x".

(Field access can also be overloaded, and we have some pretty awesome use-cases for that, hopefully I can put up some examples of that shortly.)

So e.g. we can use the destructuring constraint to make a generic print function for all record types, like:

  instance (r={lbl:h*t}, Print h, Print t) => Print r where
    print x = do{putStr(lbl++":");println(recordHead(x));print(recordTail(x));}

The field access constraint makes basic function definitions trivially "duck typed". So e.g. if I write ".jimmy" (shorthand for "\x.x.jimmy", or if you prefer Haskell syntax this would be like "\x -> x.jimmy") then it will select the "jimmy" field of whatever type it is given, however that is done for that particular type.
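
If you want to experiment with something similar in GHC, the stock HasField class is a rough analogue of this field-access constraint (my mapping, not the hobbes machinery):

  {-# LANGUAGE DataKinds, TypeApplications #-}
  import GHC.Records (HasField (..))

  -- HasField "jimmy" r b plays roughly the role of hobbes' r.jimmy::b:
  -- this selector works on any record type that has a jimmy field.
  selectJimmy :: HasField "jimmy" r b => r -> b
  selectJimmy = getField @"jimmy"

  data T = T { jimmy :: Int, other :: Bool }

  demo :: Int
  demo = selectJimmy (T { jimmy = 42, other = True })  -- 42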

There's a lot more to say on this point but maybe it should be turned into its own doc on the site...


Ah, overloaded field access; that explains the type signature. I implemented a very similar extensible record/variant system for a product-definition language at Barclays. I hope we see a lot more structural typing in industry going forward; it is an especially good fit for typing queries.


I'd love to see more about this; it seems like a very interesting language. Something to explain how/where this is or could be used, and maybe a few more complete examples to show off the language, would help. Also, some notes about performance would be great. I guess it is reasonably fast by the looks of it, but that is pretty vague.


Yes the whole project started because we needed dynamically-determined logic (in various places) that could fit in very tight latency budgets. And the output of this compiler is put in a critical trading path that sees a very large portion of all daily US equity trades.

So there is a lot to say about performance.


I wish there were a clearer language tutorial, instead of one that mixes the embedding tutorial with the language itself.


Yes, that's a good point. I do hope to build out more targeted documentation and example programs over the coming weeks.


Is this a "pure" language like Haskell? Or it's more like OCaml?

And, even if it's not pure, is there any syntactic sugar for monads?

Anyway, this looks awesome, especially with the easy C++ interfacing that seems to be there: a system where you can have machine learning code (anything from Bayesian models to deep NNs) in C++ and business logic in something Haskell-like is the stuff of wet dreams...


It's not pure currently, so definitely more like OCaml.

We make frequent use of overloaded array comprehensions, which is one form that monad syntax can take. The "do" notation introduced by Haskell isn't supported right now, but maybe soon. I'd like to refactor the parser code a bit anyway.
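
For flavor, the correspondence in Haskell notation (a comprehension and its do-notation equivalent produce the same list):

  pairs :: [(Int, Char)]
  pairs = [ (x, y) | x <- [1, 2], y <- "ab" ]

  pairs' :: [(Int, Char)]
  pairs' = do { x <- [1, 2]; y <- "ab"; return (x, y) }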


IMHO, non-pure is probably OK. Haskell's purity is probably what destroyed any chance of it growing popular enough to have a healthy ecosystem of libraries and community.

Anyway, since this has no memory management, neither Rust-like nor GC, it's not in the niche I care about right now, so moving on...

But again, great work. "Strict Haskell" with good C++ interop and decent records is awesome.


Can someone explain variants vs sums vs tuples vs records? This sentence from the README confuses me: "We can combine types with variants or sums (the "nameless" form of variants, as tuples are to records)."


Sure. I guess "record type" makes sense? Basically, a record type is a "struct": a series of values with particular names (the "field names"). A tuple is the same thing, but by convention the "names" are positional (so you have the "0th field", the "1th field", etc.).

A variant can be _one of_ a series of values (where a record is _all of_ a series of values), and each of the things can have a particular name (the "constructor names"). As with tuples to records, a sum type is just a variant where you don't care to name the constructors (so you have the "0th constructor", the "1th constructor", etc).

A common example of a variant is an "enum" type, where here the payload types are trivial (just the unit type) and only the constructor names matter. In hobbes we might write the type "|red:(),green:(),blue:()|" or the shorter syntax "|red,green,blue|" for an enum that we might write in C as "enum { red, green, blue }". You could also represent this same type as the sum "()+()+()" but that looks a little weird/profane and maybe not clear to other programmers.

Tuples are sometimes called "product types" and that connects well with sum types and the general logical/combinatorial interpretation of types. If you see the type "bool * bool" (or perhaps "2 * 2"?) then it means "a bool AND a bool" and there are four such values. Similarly if you see the type "bool + bool" then it means "a bool OR a bool" and it's either the first or the second one.
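
In Haskell notation, for comparison (hobbes spells these differently), the four corners look something like:

  data R = R { flag :: Bool, mask :: Bool }  -- record:  all-of, named fields
  type T = (Bool, Bool)                      -- tuple:   all-of, positional
  data V = Red | Green | Blue                -- variant: one-of, named constructors
  type S = Either Bool Bool                  -- sum:     one-of, positional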

If you're interested in the general area of type theory, you might like the book "Types and Programming Languages" by Ben Pierce.


> you will need LLVM 3.3 or later

Though it doesn't build with 4.0. (And is now in the AUR.)

Nice to see first-class parsing in a language.


OK, there's now a PR in the queue to allow hobbes to build with LLVM 4.0+. Pending review, it will likely be merged to master soon.


D'oh, you're right. We will need to look into LLVM 4.0+ soon.


Examples do not look very readable.


TL;DR: They do not look readable to you because they are written in a "foreign language" you are unfamiliar with, much like Japanese is to most Westerners.

The vast majority of languages are essentially minor syntax "skins" applied to the same underlying Algol ideas; this describes C, Pascal, Python, Ruby, Perl, Java, and most other languages, which is why being proficient in one lets you read the others. Haskell and OCaml are close enough to read, but usually not write, unless you are specifically proficient in one of them.

APL (and J and K, and likely Hobbes) are very different, both in concept and in syntax, to the point that you actually have to know one of them to be able to read any of them. They look intimidating at first, and they often take longer (per character / per line) to read even when you do know them, but they are very orthogonal; e.g. in K, the expression |/0(0|+)\x efficiently computes the maximum subarray sum[0] of the array x. This is not a predefined program; rather, each character here carries a meaning, and combined they solve the mss problem.
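
For reference, my reading of that K expression, transliterated to Haskell:

  -- (0|+)\x is a running scan r -> max 0 (r + v), and |/0 is a
  -- max-reduction seeded with 0 (Kadane's algorithm).
  mss :: [Int] -> Int
  mss = foldl max 0 . scanl (\r v -> max 0 (r + v)) 0

  -- mss [-2, 1, -3, 4, -1, 2, 1, -5, 4] == 6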

If you looked at Japanese[1], it would look unreadable to you, but that's not because there's a problem with Japanese.

[0] https://en.wikipedia.org/wiki/Maximum_subarray_problem

[1] Pick any language you do not know with a character set you don't know - Japanese, Arabic, Farsi, Hebrew, Thai ...


Thanks, I do have it in the queue to make more targeted examples. I realize that what's there is a little scattershot and each of the use-cases mentioned could use a lot more detail.

For example, inside MS we have these reactive hobbes processes that watch log data and incrementally build B-trees to deliver future query responses very quickly. I hope to put some examples like that up in the coming weeks.


I may be way off base with this thought, but I can't help but ask the question anyway.

Back at last year's Hot Chips conference, Oracle spent a fair bit of time talking up new hardware capabilities targeting databases and the new ISA extensions that support them. It seems to me that this might be a good fit for some of the things Hobbes is well suited for. So I wonder if you guys spent any time thinking about that (or about any ISA beyond x86).


I am not aware of that work from Oracle, but it would be interesting to talk with those folks!

Our main focus has been x86, although there are interesting alternative platforms.


Hey, thanks for the response.

I am not a SPARC developer, so I can't really point you in the right direction with any certainty. I did a quick web search for the buzzwords they were using in the Hot Chips presentation ("Software in Silicon") and found this link, which has the slides from the talk among a whole lot of other stuff. My expectation is that you guys might find the database/analytics acceleration material of interest.

https://community.oracle.com/community/server_%26_storage_sy...

Anyway, my question was motivated by a desire to inform my nebulous thesis that getting developers, even really smart devs working for major corporations with deep pockets, to target specialized hardware, or to optimize beyond getting code to compile with a favorite compiler, is really difficult. My follow-on theory is that this is going to create some interesting tension as hardware developers are increasingly motivated to do these sorts of things while Moore's law runs out of steam.


They look readable to me, certainly much more so than K or Q.


> much more so than K or Q.

For most people that is not a very high bar.


> > much more so than K or Q.

> For most people that is not a very high bar.

hehehe


Out of curiosity, how much experience do you have with APL, J, Q, or K? They aren't like C, so none of your previous programming skills transfer, but that doesn't mean they are hard to read.


I have programmed some Q professionally. It provides a lot of syntactic sugar, does not support operator precedence, and there is a culture of avoiding whitespace. Together, these can sometimes make it hard to read.


That I can agree with.




