The environment around Jd has changed a bit since it was young! Jsoftware[0] announced it in 2012, and this particular page has been effectively the same since it was created in 2017 (I suspect this was a page move, and the content is somewhat older). In these early days the column-oriented database was quickly gaining popularity but still obscure, which is why there's this "Columnar" section that goes to so much trouble to explain the concept. Now the idea is well known among database users and there are lots of other options[1].
The history goes back further, because column-oriented is the natural way to build a database in an array language (making a performant row-oriented DBMS would be basically impossible). This is because a column can be seen as a vector where every element has the same type. A row groups values of different types, and array languages don't have anything like C structs to handle this. In J, Jd comes from Chris Burke's JDB proof-of-concept (announced[2] 2008, looks like), and the linked page mentions kdb+ (K) and Vstar (APL). KDB, first released in 1993, is somewhat famous and gets a mention on Wikipedia's history of column-oriented databases[3].
I did some work to compare Jd to data.table and found that Jd was more performant in some instances, such as on derived columns, and approximately equally performant on aggregations and queries. Jd is currently single-threaded, whereas multiple threads are important for some types of queries. I tried to compare with JuliaDB at the same time (maybe a year ago) and found that it was incorrectly benchmarked by the authors and far slower than both; that might be different now. Jd is more equivalent to data.table on disk; Clickhouse is far better at being a large-scale database.
Rules of thumb on memory usage:
Python/Pandas (not memory-mapped): "In Pandas, the rule of thumb is needing 5x-10x the memory for the size of your data."
R (not memory-mapped): "A rough rule of thumb is that your RAM should be three times the size of your data set."
Jd: "In general, performance will be good if available ram is more than 2 times the space required by the cols typically used in a query."
Re CSV reading, Jd has a fast CSV reader whereas J itself does not. I have written an Arrow integration to enable J to get to that fast CSV reader and read Parquet.
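For anyone curious what that route looks like from the Python side (this is just the generic Arrow path for illustration, not the J integration mentioned above; the file names are made up):

  import pyarrow.csv as pacsv
  import pyarrow.parquet as pq

  table = pacsv.read_csv("trades.csv")          # Arrow's multithreaded CSV reader -> in-memory Table
  pq.write_table(table, "trades.parquet")       # persist in the columnar Parquet format
  prices = pq.read_table("trades.parquet", columns=["price", "size"])  # read back only two columns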
I've programmed in J professionally (admittedly not for all that long) as a data scientist and, coincidentally, have just completed a small analysis using J as part of an internal workshop about data analysis I am planning. I typically work in R and Python and I have to say that at this stage there is almost no reason I would pick up J to do any work. Unless code-golf level conciseness is your only goal, these other platforms offer superior performance, clarity, ease of use, access to libraries and are, as programming languages, substantially better designed.
I say this as a great lover of function-level programming and as a J enthusiast. I would say I am quite familiar with J's programming paradigm and conceptual widgets and doodads (I know the verbs, nouns, adverbs and conjunctions and can use them appropriately). I even remembered a pretty good portion of the NuVoc. But doing even the simplest analysis in J was _excruciatingly_ slow and inconvenient compared to using R and the tidyverse (in particular, I missed dplyr and ggplot). The tidyverse CSV readers are, for example, much faster and smarter and more convenient and informative than anything you'll get from the J universe.
I love vector languages but at this point J can't compete with the major platforms for data analysis. It's less convenient, often _slower_, much more low-level, strange, and its library situation is anemic at best. I recommend learning J because it will expand your mind, but I can't imagine picking it up for real work.
I get enamored with apl/k/j every time I see it and was looking for excuses to use it despite everything.
I understand that due to the much smaller community the tooling and ecosystem are much weaker, but there must be a reason why some people keep reaching for it, especially the guys in finance. I don't get the Cobol vibes from it, like it's some sort of legacy burden. While the use case is narrow, there must be an edge.
This is HN after all. You wouldn't tell people not to mess with lisp and just reach for python now would you? *puppy eyes stare*
J feels a lot like Smalltalk and Lisp to me. If you got on board early, you could do all sorts of stuff other languages struggled to make easy and performant. Hence the set of dedicated users. And there are some genuinely interesting conceptual things going on in array languages which have real appeal. But in the end I think J reflects a previous era and hasn't caught up to really useful ideas in more contemporary languages, probably because its user base is too conservative.
I wouldn't recommend people use XLisp or run Genera in a VM to solve real problems. Recommending J feels like that to me.
Not true, at all! Since 2010 or so, the APL family has only improved its reputation and grown in popularity. I listed some developments of the past two years at https://news.ycombinator.com/item?id=28930064. Now, it's not much relative to the huge growth of array frameworks like TensorFlow with more mainstream language design, but it is definitely not losing appeal.
Oh, thanks for clarifying, since it occurred to me that you might mean just the appeal to you, but not that you meant the field of programming! I'm no NN expert, but tinygrad looks very approachable in BQN. You might be interested in some other initial work along those lines: https://github.com/loovjo/BQN-autograd with automatic differentiation, and the smaller https://github.com/bddean/BQNprop using backprop.
TBF, that guy isn't doing it for the fun of it (OK, partly for the fun of it) but because Mr. Chickadee is a content creator. Sure, it's a lifestyle choice, but he also makes his living doing it. I love his channel, but his lifestyle is as much a product of our modern world as the Java programming language is.
When I want to just "Get stuff done" TM, I reach for Python. Except that I've stopped doing that because setting up package versioning and venvs is a nightmare that gets more frustrating every time I try to do it.
Now I'm looking for a "better" TM way to get my scripting needs met. I'm looking at Nim, specifically. I may also try to lean on a Scheme or a Lisp. My problem with the latter is lack of decent docs for getting stuff done. Maybe I'm missing something, but being productive in those languages for me is like a high jump when I can't even step up on a curb.
Nim is excellent, I use it daily, but you'll have to define what you mean by "scripting" language :) It is very much in the school of "build a binary, run that binary", which is ideal for my usages. There is of course support for running scripts (".nims", Nimble tasks, and so on), and one can just tell the compiler to compile and run your binary from nimcache anyway, but I still use other actual scripting languages for some tasks. Mainly bash, to be perfectly honest.
That said, my view is biased, as most of what I do with it is for the "make a binary, then run the binary somewhere" school of usage patterns! It's quite possible it is a good fit for what you're after.
I've been doing some research in this area: for day-to-day, quick and dirty, scripting purposes, if not Python, then what?
I want something expressive: Python can be a bit verbose by default and isn't very good at one-liners. I want something that starts up fast, but can run a bit slower. It would be great if it had a REPL, but if compilation and/or startup are fast enough its lack would also be acceptable. The amount of available libraries is not that important, but at least an HTTPS client and JSON and CSV parsers are a must.
I considered: V, Nim, Clojure, Scala 2, Scala 3, GNU Smalltalk, Rust, Zig, Racket, Emacs Lisp, Raku, Ruby, OCaml, and LiveScript.
I don't have the time right now for a full write-up, but in terms of expressiveness and for scripting on the command line, LiveScript[1] won; in terms of general scripting (10-20 loc) Emacs Lisp won (I use Emacs, and the IDE-level support for Elisp there is outstanding). Raku and Ruby were runners-up in both categories.
The expressiveness of Elisp - with some extensions I made to the interpreter and a bunch of libraries I already have installed (dash, s, f, ht) - was the biggest surprise. It was on par with LiveScript, which has a lot of specialized syntax. The problem with LiveScript is that it's not very actively maintained, but it's still one of the most expressive languages I know of, and running on Node makes it quite fast and gives it access to all of NPM.
I didn't consider Haskell, because I don't know it well enough. The rest of the languages were either significantly more verbose, had unacceptable startup time, required a lot of extensions and libraries to get the expressiveness level I wanted, or had a bit too few libraries available.
You may consider Raku AND Sparrow - http://sparrowhub.io . You'll get all the benefits of Raku for high-level scenarios, and you can write _underlying plugins_ in many languages (Bash/Perl/Ruby/Python/Powershell), so that will give you a certain level of freedom while transitioning from one language to another.
Nimpy makes it possible to move from Python to Nim gradually. It's magical, and while it doesn't solve Python's own venv problems, it only needs the DLL from Python - whether it was 2.5 or 3.4 or 3.8, it would just work. They've probably removed the Python 2 support by now, but it was just magic.
Some Scheme/Lisp implementations are capable enough to accomplish daily work. Common Lisp is one option, and I've used Chicken Scheme effectively for some projects.
You're right though, there's a significant learning curve with any language in a different paradigm. Forth-like languages are an example, and yeah, J/K and cousins are hard to grasp. I've dabbled in these but never quite got there.
IMO Lisp-like languages aren't quite as "foreign" since the syntax is a variation on 'function parameters body' used in "normal" (Algol-like) languages.
I guess it comes down to what we get used to, and really for many purposes choice of language isn't all that critical, assuming of course it supports the task at hand.
Not without quite a few libraries. You need at least rackjure[1] (for threading macros if nothing else) and data/collections[2] (for unified sequence/collection handling and convenient helpers). You might want data/functional[3] for the `do` notation (to replace all the for/... forms; only if you know what you're doing). You will probably need a few srfis - 26 for cut, for example - depending on your preferred coding style and domain.
You won't need external libraries for streams, generators, objects and classes, contracts, graphics and plotting, documentation generation, concurrency, limited parallelism (futures) and multiprocessing (places) - as these are all built-in.
Racket is a great language. I learned it around 2008 and get back to it regularly. However, due to its focus on pedagogy, the defaults tend to be verbose for practical work. The incredible power of `syntax-parse` makes it trivial to build DSLs, but that backfires, because everyone and their cat makes DSLs instead of settling on a single set of helper libraries. Packages that require a lot of effort to make, like a numpy/pandas/data-frame implementation, tend to lag behind their counterparts in other languages due to the relatively small community. Without easy access to a larger ecosystem like the one Clojure enjoys, it's pretty much impossible for the situation to drastically improve on that front.
Contracts and Typed Racket are still, to my knowledge, the most advanced in their respective fields.
I love Racket. It's my first Lisp, and it'll have a place in my heart until the day it stops beating. However, for practical, day-to-day coding that doesn't fit precisely into Racket's strengths, it's not as good as any of the top 10 on TIOBE.
>I get enamored with apl/k/j every time I see it and was looking for excuses to use it despite everything.
You should do it. Nothing in my programming career has changed the way I thought so much as learning J to the point of real fluency. Though you could swap out APL, k, or BQN for the same effect.
The ecosystem problems are genuine. Though I do not think they are so great as you make them out to be. But with respect to semantics, numpy et al are but pale imitations. With respect to syntax, too (https://www.jsoftware.com/papers/tot.htm).
> I do not think they are so great as you make them out to be
There's a dynamic with ecosystem problems that I believe applies to all languages. You only need one missing or bad library that's critical to your project to make the whole language useless.
An anecdotal example: I remember many years ago trying to give Python a go and within 15 minutes ran into a problem parsing XML. A search revealed this was a known issue that was being worked on with the foremost tool in Python for this job. You couldn't have credibly argued that Python had an ecosystem problem even at the time, but for me in that particular scenario Python had a show-stopping ecosystem problem. There were ways around this, but the most convenient way around it at the time was switching back to a more familiar language.
My greater point is that we can definitely make generalizations about a language's ecosystem health, but keep in mind there is a very context-sensitive, practical dimension to that type of language assessment.
> You only need one missing or bad library that's critical to your project to make the whole language useless
...no? If there is functionality I need, and no library implements it, I will implement it myself. That goes for any language. Otherwise, the job of a programmer would simply be to string together existing libraries, not to write anything meaningful.
It's not ideal, but I've done this in BQN and it took about 15 lines. I didn't need to handle comments or escapes, which would add a little complexity. See functions ParseXml and ParseAttr here: https://github.com/mlochbaum/Singeli/blob/master/data/iintri...
XML is particularly simple though, dealing with something like JPEG would be an entirely different experience.
# https://www.intel.com/content/dam/develop/public/us/en/include/intrinsics-guide/data-3-6-1.xml
xml ← •FChars •wdpath •file.At ⊑•args
#⌜
# An xml parser good enough for our use case
# Accept xml; return (as three lists):
# - Parent index of each tag
# - Contents of open tag
# - Text after last child tag
E ← +`⊸×⟜¬-⊢
ParseXml ← {
text‿tags ← ('?'=·⊑·⊑1⊸⊑)⊸(↓¨) <˘⍉⌊‿2⥊((+⟜»E∨)˝"<>"=⌜⊢)⊸⊔𝕩
d←+`tt←1-(+˜⊸+´'/'=0‿¯1⊸⊏)¨tags # Tag type: ¯1 close, 0 void, 1 open
tp←(⍋⊏⟜d)⊸⊏∘/˘ 1‿¯1=⌜tt # Tag pairs
! (∧`' '⊸≠)⊸/¨⊸≡⟜(1⊸↓¨)˝tp⊏tags # Tag matching
oi←(0<tt)(⌈`↕∘≠⊸×)⊸⊏⌾((⍋d)⊸⊏)↕≠tt # Open index, for closed and void tags
ci←⍋⊸⊏○(∾⟜(/0=tt))˝tp
pi←(/0≤tt)(1-˜⍋)¯1⌾⊑ci⊏oi # Parent index
⟨pi,(0≤tt)/tags,ci⊏text⟩
}
May or may not be worth your time, but I think if you can learn Haskell it's pretty likely you could learn to read this code. It does a lot of stuff, and has to be read one bit at a time. Here's a dissection of the second line:
d←+`tt←1-(+˜⊸+´'/'=0‿¯1⊸⊏)¨tags # Tag type: ¯1 close, 0 void, 1 open
(I'll handwave the parsing a bit, but the rules are simple, with only 5 or so levels of precedence. Modifiers bind tighter than functions: superscripts like ´ are 1-modifiers and apply to one operand on the left, and glyphs with an unbroken circle like ⊸ are 2-modifiers and apply to an operand on each side, left-associative. The ligature ‿ is one way to write lists and has higher precedence than modifiers.)
BQN evaluates functions right to left, so the first thing that's done to evaluate this code is (+˜⊸+´'/'=0‿¯1⊸⊏)¨tags. And ¨ is each, so we apply +˜⊸+´'/'=0‿¯1⊸⊏ to each tag, which is a string of whatever's between < and >. This is a 4-component train (+˜⊸+´)('/')(=)(0‿¯1⊸⊏), and the rules of train evaluation say that every other component starting from the end ('/' and 0‿¯1⊸⊏) apply to the argument and then the rest apply to those results. '/' is the slash character and applies as a constant function returning itself, and 0‿¯1⊸⊏ is the array 0‿¯1 bound to the function ⊏, which performs selection. Applied to a tag t, it gives what might be written t[[0,-1]] in more conventional notation, taking the first and last element. After that, '/'= compares to slash. This automatically maps over arrays, so now we have a two-element array where the first entry is 1 (BQN booleans are numbers) if there's a slash after the < and the second is 1 if there's a slash after the >.
Finally the train applies its first function +˜⊸+´ to the result. ´ is a fold, so we apply +˜⊸+ as an infix function between the two booleans. We can write (a +˜⊸+ b), which is (+˜a) + b by the definition of ⊸, which I think of as adding +˜ as a preprocessing step on the left argument before the +. Then +˜a is a+a, so we have twice a plus b. That is, 2 if the tag starts with / and 1 if it ends with /. The case with both doesn't occur in my xml.
The rest of the expression is d←+`tt←1-numbers. The 1- transforms 0, 1, and 2 to 1, 0, and ¯1, giving the tag types mentioned in the comment. This is given the name tt. Then +` is a prefix sum, since ` is scan. Taking this sum gives a list that goes up on each open tag and down on each close tag, that is, the depth of tag nesting.
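For comparison, here's roughly the same computation spelled out in Python (my paraphrase of that one line, not code from the linked file; the sample tags are made up):

  from itertools import accumulate

  def tag_type(t):                      # t is the text between < and >
      a = t[0] == '/'                   # slash right after '<'  -> close tag
      b = t[-1] == '/'                  # slash right before '>' -> void tag
      return 1 - (2 * a + b)            # 1 open, 0 void, -1 close

  tags = ["a", "b/", "/a"]              # hypothetical input
  tt = [tag_type(t) for t in tags]      # [1, 0, -1]
  d = list(accumulate(tt))              # running nesting depth: [1, 1, 0]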
That went through a lot of concepts, but if you can parse that line then you can parse nearly every other one, since they're all made up of trains, modifiers, functions, and constants. The only other syntax in ParseXml is ⟨,⟩, which is just a normal array literal using angle brackets, and the braces {} for a function, with implicit argument 𝕩 in this case. The other knowledge required is what all these functions do. That's the hard part, but at this point it's no different from learning any other language. The functionality is just indicated with symbols instead of names.
The job of a programmer is to glue together existing libraries in the most convoluted manner possible and collect rent on maintenance. Perhaps even graduate to consulting. Grow a pointy haircut.
Who the hell wants to be a programmer, dismal profession.
I sort of agree with you, especially about numpy. Nothing in the data science space in Python feels right to me. But you can't beat the network effects. It's still easier to actually do data analysis in Python than in J.
I haven’t used R recently (10 years or so), but when I did, the speed with which K/kdb+ could scan through and summarize terabytes of data was orders of magnitude faster than R or any other system. Once the data was summarized into (say) a gigabyte or so, analyzing it with R or even Python was much easier thanks to the ecosystem and reasonable time (probably 10-100 times slower, but the time saved by using well tested stat code is more than worth it)
One of the nicest things about J is the notion of verb rank. For non-J programmers: you can apply a rank to a verb, and this affects how the verb operates on its array operands. A rank of 0 means the verb applies to the individual atoms (0-cells) of the operands, a rank of 1 means it applies to their vectors (1-cells), and so on up to infinite rank, which means "operate on the entire operand." The rank you choose changes the meaning of what counts as "an element."
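If you haven't seen rank before, a loose analogue in NumPy terms (my illustration of the idea, not how J implements it) is choosing which cells of the array a function gets applied to:

  import numpy as np

  a = np.arange(24).reshape(2, 3, 4)   # a rank-3 array

  # J's  +/"1  (sum at rank 1): fold each 1-cell, i.e. each row of length 4
  row_sums = a.sum(axis=-1)            # shape (2, 3)

  # J's  +/"2  (sum at rank 2): apply +/ to each 3x4 table (folds its leading axis)
  table_sums = a.sum(axis=1)           # shape (2, 4)

  # J's  +/  at infinite rank: fold over the leading axis of the whole array
  totals = a.sum(axis=0)               # shape (3, 4)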
However, like most things in J, support for this excellent idea (which eliminates the need for most looping constructs and can be very performant) is irregular: it is limited to monadic and dyadic verbs. Nothing about verb rank forbids functions which accept more than two arguments, but the idea of a function which accepts more than 2 arguments is poorly supported in J (the idiom is to pass a boxed array to a monad, but the boxing of the items to be passed makes supporting rank behavior for the "arguments" impossible or absurdly complicated).
Other beefs with J: J doesn't have first class functions as such. While you can represent functions as "nouns" in a few ways, you cannot have (for example) an anonymous reference to a function as a thing unto itself (you may denote a verb tacitly in a context where you need a verb, however, but this is not the same thing). If you want to pass around verbs in a way familiar to you as a contemporary programmer you have to use "adverbs" and "conjunctions" which are just higher order functions which (more or less) return verbs. But adverbs and conjunctions have their own peculiarities and restrictions (not the least of which is that they are not themselves verbs or nouns and thus cannot be passed around either). In contemporary programming languages the verb/adverb/conjunction space would just be represented by "functions" and to great effect. As a functional programmer and Lisp guy, I find the limitations on "verbs" very frustrating in J.
J's error messages are also bad, never more than a few words.
There are some great ideas in the language, but it feels very old-fashioned and out of touch.
What I would like to see is an "array Scheme": a lexically scoped Scheme-like language where every object is an array and function argument slots can be independently "ranked" to support the elimination of loops over array arguments. I'm too busy to put this together, but it would be great to have if you wanted to fiddle with arrays for some reason but could do without any library support for actually doing data analysis.
How do you feel about the J/APL syntax in live coding sessions? Does it help you iterate a bit faster than R/Python, or was it a totally irrelevant aspect?
Most RDBMS systems are row oriented. Ages ago they fell into the trap of thinking of tables as rows (records). You can see how this happened. The end user wants the record that has a first name, last name, license, make, model, color, and date. So a row was the unit of information and rows were stored sequentially on disk. Row orientation works for small amounts of data. But think about what happens when there are lots of rows and the user wants all rows where the license starts with 123 and the color is blue or black. In a naive system the application has to read every single byte of data from the disk. There are lots of bytes and reading from disk is, by orders of magnitude, the slowest part of the performance equation. To answer this simple question all the data had to be read from disk. This is a performance disaster and that is where decades of adding bandages and kludges started.
Jd is columnar so the data is 'fully inverted'. This means all of the license numbers are stored together and sequentially on disk. The same for all the other columns. Think about the earlier query for license and color. Jd gets the license numbers from disk (a tiny fraction of the database) and generates a boolean mask of rows that match. It then gets the color column from disk (another small fraction of the data) and generates a boolean mask of matches and ANDS that with the other mask. It can now directly read just the rows from just the columns that are required in the result. Only a small fraction of the data is read. In J, columns used in queries are likely already in memory and the query runs at ram speed, not the sad and slow disk speed.
Both scenarios above are simplified, but the point is strong and valid. The end user thinks in records, but the work to get those records is best organized by columns.
Row oriented is slavishly tied to the design ideas of filing cabinets and manila folders. Column oriented embraces computers.
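A toy sketch of the mask-and-select idea described above, using hypothetical NumPy column arrays (this is not Jd's code, just the shape of the approach):

  import numpy as np

  # Each column lives in its own contiguous array (made-up data).
  license_ = np.array(["123AB", "98765", "123ZZ", "55555"])
  color    = np.array(["blue",  "red",   "black", "blue"])
  make     = np.array(["ford",  "kia",   "bmw",   "fiat"])

  # Touch only the two columns the predicate needs and AND the boolean masks.
  mask = np.char.startswith(license_, "123") & np.isin(color, ["blue", "black"])

  # Then read just the requested columns for just the matching rows.
  result = {"license": license_[mask], "make": make[mask]}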
The general consensus as I understand it is: column-oriented indices/storage options are good for OLAP, large scale analytics, bulk data analysis. Row-oriented indices are suited more for OLTP, individual "record processing."
Both are just techniques and there's nothing stopping a single db product from offering both.
Again, it comes back to usage patterns. Yes, if you're doing aggregation operations on a small number of columns then I expect locality of reference could be better with a column store, rather than thrashing through row retrievals one after another (and then just throwing them away after aggregating).
But if you're frequently doing "look up this customer and others like them" and then using the bulk of the information there? I'd expect better cache behaviour out of row oriented storage.
But these days it's so unclear what's happening inside the actual "black box" that is our hardware that it's hard to make generalizations.
It all depends on access pattern. Do you tend to select entire rows? Use a row-oriented DB. Do you tend to select entire columns? A column-oriented database might be for you. That's it, really. None of the designs are superior, afaik.
All of that syntax is awful. Why not just x.sortBy(y) ? Did all of the advances in software legibility fail to make their way to the modern scientific computing world?
Hyperbolically, because you don't write math with variables in camel case.
J traces its roots to a notation for math, used on whiteboards. That awful syntax you see - it's the same as in some formulas in, say, general relativity, only J is Turing complete and not a Turing tarpit. When you work on a formula, in the case of J you have the ability to execute it, and if you see it's wrong you can update the formula and try again. This could also be done in other languages, but in J (I mean, the APL family of languages) it's more focused.
In defense of J, I had a professional example of a problem which wasn't clearly specified and needed some experimentation - that took, if I remember correctly, some 45 minutes of attempts in J, and then the prototype was re-written in C# once it was already producing the desired outcomes. Rewriting took somewhat longer.
You might be selecting entire rows, but you are probably not selecting all of the rows, and your selection criteria probably do not depend on all of the columns.
In many OLTP systems, almost all work operates on multiple attributes of a single record. E.g., when logging a user in, an authentication system cares about multiple attributes of a single user record, not facts about the aggregate pool of users.
Column oriented stores are extremely efficient for aggregate queries, but they make writes and single-row reads more expensive and are thus not suitable for every workload. There's an excellent overview in Martin Kleppmann's Designing Data-Intensive Applications.
A lot of modern, data-oriented ECS frameworks for game dev follow a similar philosophy, wherein components are stored in linear collections that optimize memory layout for caches and parallelism. Given how rarely you need 'SELECT *' this makes sense for a relational DB as well, though modern SQL DBs have a lot of sweat put into their performance.
If you're using J, you are probably doing analytics and stats. That means you are looking for patterns in a handful of attributes across a large population - i.e. columnar.
As others have said, row-based makes sense for most OLTP / app databases. You're probably not writing those products in J.
That tends to include a large proportion of the large, expensive reporting queries your business people want to do. Whether or not those kinds of queries dominate for your system will depend greatly on your system.
You also need to reach a certain scale before the choice (either way) will affect you enough to matter.
But when you reach that scale it can be the difference between reporting queries taking seconds vs. hours in some cases.
For some systems you'll end up wanting both, and stream updates from the transaction focused db (row oriented) into a separate reporting database that uses a column store.
This section was interesting! Somehow I've never realized that row oriented storage is orthogonal to how disks work...
Jd is a columnar (column oriented) RDBMS.
Most RDBMS systems are row oriented. Ages ago they fell into the trap of thinking of tables as rows (records). You can see how this happened. The end user wants the record that has a first name, last name, license, make, model, color, and date. So a row was the unit of information and rows were stored sequentially on disk. Row orientation works for small amounts of data. But think about what happens when there are lots of rows and the user wants all rows where the license starts with 123 and the color is blue or black. In a naive system the application has to read every single byte of data from the disk. There are lots of bytes and reading from disk is, by orders of magnitude, the slowest part of the performance equation. To answer this simple question all the data had to be read from disk. This is a performance disaster and that is where decades of adding bandages and kludges started.
Jd is columnar so the data is 'fully inverted'. This means all of the license numbers are stored together and sequentially on disk. The same for all the other columns. Think about the earlier query for license and color. Jd gets the license numbers from disk (a tiny fraction of the database) and generates a boolean mask of rows that match. It then gets the color column from disk (another small fraction of the data) and generates a boolean mask of matches and ANDS that with the other mask. It can now directly read just the rows from just the columns that are required in the result. Only a small fraction of the data is read. In J, columns used in queries are likely already in memory and the query runs at ram speed, not the sad and slow disk speed.
Both scenarios above are simplified, but the point is strong and valid. The end user thinks in records, but the work to get those records is best organized by columns.
Row oriented is slavishly tied to the design ideas of filing cabinets and manila folders. Column oriented embraces computers.
A table column is a mapped file.
> Somehow I've never realized that row oriented storage is orthogonal to how disks work...
The section you posted is very misleading. Storage is arranged in blocks. The secret to database performance is how you lay out data in those blocks and how well your access patterns to the blocks match the capabilities of the device. This choice is the fundamental key to database performance.
If your database stores shopping baskets for an eCommerce site, you want each basket in the smallest number of blocks, ideally 1. It makes inserting, updating, and reading single baskets very fast on most modern storage devices.
If your database stores data for analytic queries, it's better (in general) to store each column as an array of values. That makes compression far better, and also makes scanning single columns very efficient.
To say, as the article does, that "row oriented is slavishly tied to design ideas of filing cabinets and manila folders" is nonsense. Plus there are many other choices about how to access data that include parallelization, alignment with processor caches, trading off memory vs. storage, whether you have a cost-based query optimizer, etc. Even within column stores there are big differences in performance because of these.
(Disclaimer: I work on ClickHouse and love analytic systems. They are great but not for everything.)
I would note that this query behavior (sorted data columns bitmasked together) is further orthogonal to primary-data storage representation. For example, Postgres can give you this same behavior if you declare a multi-column GIN index across the columns you want to be searchable.
If you're interested in this thought, check out Martin Kleppmann's book DDIA, where he explains storage concepts like this and many more. One of the best architecture books out there!
Ironically, he didn't even mention indexes in his description (which he admitted was simplified) - a good query optimizer will do wonders for not only coming up with the appropriate hints for the query plan, but will also dynamically adjust those hints based on the underlying data patterns.
The example he provided,
"So a row was the unit of information and rows were stored sequentially on disk. Row orientation works for small amounts of data. But think about what happens when there are lots of rows and the user wants all rows where the license starts with 123 and the color is blue or black. In a naive system the application has to read every single byte of data from the disk."
Is something no modern database would ever do. The real challenge is not to read only the records starting with 123, or having blue/black - that part is trivially handled by every database engine I'm familiar with. The query challenge is: do you filter on license # or color first? (If there are 1k records starting with 123 and 5mm blue/black vehicles, the order is pretty critical for performance.) That's one of the features that distinguishes query optimizers.
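A back-of-the-envelope version of that ordering question, using the made-up numbers above (not any real optimizer's cost model):

  # Rows matching each predicate (hypothetical counts from the example).
  matching_license = 1_000        # license starts with "123"
  matching_color   = 5_000_000    # blue or black

  # Very crude cost: rows the second predicate still has to be checked on.
  cost_license_first = matching_license    # re-check color on ~1k rows
  cost_color_first   = matching_color      # re-check license on ~5M rows

  print(cost_color_first / cost_license_first)   # ~5000x more rows touched in the wrong order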
Columnar databases are awesome when you have columnar data to work with - I've seen 20-30x reductions in disk storage in the wild (and you can obviously create synthetic examples that go way north of that), but a well-indexed SQL database backed by a solid query optimizer/planner can probably hold its own against a columnar database in terms of lookup performance, particularly if your data is row-oriented to begin with.
Not that I claim anyone in particular can read it of course. Jd uses a hierarchy of folder, database, table, column that's handled with an object system to share code between them. A folder is just a place to put databases and hardly needs to add anything, while the other levels have a lot of extra functionality. As an inverted database, Jd stores each column in a file, and accesses it using memory mapping.
J isn't really made to be pretty. It's made to be terse and simple to read once given enough learning effort, and it's made to be a consistent keyboard typable notation.
I would say it is not knowledge-leaking; most languages leak knowledge, so that if you are not familiar with the language but you do know some other programming languages you can sort of figure out what they do.
But some languages do not leak knowledge in this way.
There is the concept of beauty in programming languages: that the expression of an idea should be succinct. This J code might be beautiful, but I'm unsure.
It's extremely common: license it under GPL/AGPL or some other very copyleft license; get contributors to sign a CLA, then offer the library with hefty license fees for non-FOSS projects.
Yeah, KDB is... Not super fun. I've been looking at TimeScaleDB recently, because it's just a PostgreSQL plug-in it seems nice and simple, but I haven't actually compared them directly yet.
If you want some intro info – and you may have found it already – the TimescaleDB YouTube channel is a great place to start: youtube.com/TimescaleDB
(for transparency: I work for Timescale...)
Can you give some tips on what you mean by "less than rudimentary analysis"? Considering adopting Clickhouse and wondering whether we will encounter problems down the road.
I know nothing about J and JSoftware, but this reads like an April Fools' joke. Is it?
> In a naive system the application has to read every single byte of data from the disk. ... To answer this simple question all the data had to be read from disk. This is a performance disaster and that is where decades of adding bandages and kludges started. .... Think about the earlier query for license and color. Jd gets the license numbers from disk (a tiny fraction of the database)
Of course that data has to be read from disk. Well, for simple or aggregate queries he may gain performance. Moreover, as another commenter has noted, you can organize data in columns in MSSQL too for aggregations: https://docs.microsoft.com/en-us/sql/relational-databases/in...
> columns used in queries are likely already in memory and the query runs at ram speed, not the sad and slow disk speed. ... Jd performance is affected primarily by ram. Lots of ram allows lots of rows
Any other RDBMS can have sensible indexes that satisfy your queries. And, surprise, your data also lives in RAM once you read it.
> You can backup a database or a table with standard host shell scripts, file copy commands, and tools such as tar/gzip/zip.... If you understand backing up file folders, then you pretty much understand backing up Jd databases.
And... throw data consistency out of the window?
I'm reading and I'm "not getting" the selling point - why is this better?
Okay, I read that things are files. SQLite is also a file if physical format is a concern.
> Of course that data has to be read from disk. Well, for simple or aggregate queries he may gain performance.
Let's say you want to access 2 columns out of 100 in a particular table. In a row-oriented database, you have to read the full rows off the disk, which means that you have to read 98 pieces of data off the disk that you have no use for - a total waste of I/O. In a columnar database, you don't have to do that; you just read off the relevant columns. This is VERY similar to the "array of structs"/"struct of arrays" argument in gamedev (and related high-performance fields); it's the same kind of tradeoff: slightly more complicated data layout traded in for much more efficient reads.
In addition: if you have a columnar database, you can employ compression in a much more efficient manner. If you have 10 million rows with the same (or very similar) data in a column, you can compress that to a fraction of the size. This messes with indexes, but it's often worth it because it VASTLY speeds up aggregate calculations.
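A toy illustration of that compression point (simple run-length encoding of a repetitive column; real engines use fancier schemes, and the data here is made up and scaled down):

  from itertools import groupby

  # A column with long runs of identical values (imagine millions of rows).
  status = ["active"] * 9_000 + ["churned"] * 1_000

  # Run-length encode: store (value, run length) pairs instead of every cell.
  rle = [(value, sum(1 for _ in run)) for value, run in groupby(status)]
  # -> [('active', 9000), ('churned', 1000)]: two pairs standing in for 10,000 cells

  # Aggregates like a count by value can run on the compressed form directly.
  counts = {}
  for value, length in rle:
      counts[value] = counts.get(value, 0) + length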
Row-based and column-based databases have different tradeoffs and advantages, and it's not quite as clear-cut as the article makes it seem. But it's certainly no April fools joke: columnar databases (for many tasks, particularly aggregates) can vastly outperform row-oriented databases. This is why Google BigQuery is columnar, for instance. Another good example is kdb+ (which this is clearly based off of), which is widely used in places which value quick time-series aggregates (Wall Street, being the obvious example).
The article is a bit over the top and one-sided, but it doesn't say anything that is particularly controversial. You might wanna read up on columnar database systems: https://en.wikipedia.org/wiki/Column-oriented_DBMS
> slightly more complicated data layout traded in for much more efficient reads.
Depending on read patterns. The classic example is addresses. Sure, you can store an address as columns: name here, city there, street 1 there, street 2 there. How useful is 1/5th of an address, and how often are you pulling it like that? For something like an address that you generally read all or none of, you are generally better served by a row-oriented database.
You also have FKs to kind of do this in a row oriented database. If some part of the data is not read nearly as much as another, it can be a foreign key sitting in another table.
Yeah, exactly: there are tradeoffs to both models, neither is strictly superior. You would never want to do aggregates on addresses anyway, so that advantage is out the door. You do, however, want to very easily index a table of addresses, so you could quickly look them up for a particular user, which a columnar database is (arguably) worse at. BigQuery, in particular, does not use indexes at all.
(EDIT: I guess you might want to do aggregates on addresses after all. "How many customers do we have in NYC?", that kinda thing.)
Hybrids are straightforward enough. A "simple" way of achieving that is to support using the indexes to directly answer queries, as quite a few databases do. Now an index on a single column is also a columnar store of the contents of that column, yet you still have the full row to query if you need lots of data from individual rows. A more sophisticated option would be to reduce duplication of column data.
(EDIT: How well a usually row-oriented database optimises this is another question, and will differ by database)
We should have a secret handshake or some type of insignia to better signal to our peers. I've tried draping a J-colored kerchief out of my back pocket but the results so far are not great; it appears there is more anti-J sentiment than I'd imagined, as I get harassed unduly in certain areas of town. May have to switch to maybe a hand-gesture-based signaling that can be done on the fly to signal allegiance.
Meta comment: I tried to click the link but since the title is so short and my mousing not very precise, I accidentally clicked the upvote arrow. I then clicked "unvote" and tried again, but the same thing happened. The third time round, I managed to click the link.
Takeaway: Very short titles might get you some upvotes from clumsy users :)
Yes! I came here to post the same thing - the link didn't work for me, and I inadvertently upvoted as well. I assume the article has merits of its own, but I do notice a huge ratio of votes to comments; I guess some are inadvertent. It would be interesting to search through other very short titles and look at the ratio of comments to votes vs. others above some vote threshold... (I'm on my phone or I'd try). Edit: is there a regex search for HN anywhere? It looks like Algolia doesn't support them.
J is an APL language. APL is the coolest language you’ve never heard of. It’s mind blowing in the same way people talk about Lisp, but more so since the concepts are so alien to most programmers.
[0] Company history: https://aplwiki.com/wiki/Jsoftware
[1] https://en.wikipedia.org/wiki/List_of_column-oriented_DBMSes
[2] https://code.jsoftware.com/wiki/JDB/Announcement
[3] https://en.wikipedia.org/wiki/Column-oriented_DBMS#History