Scaling our spreadsheet engine from thousands to billions of cells (causal.app)
323 points by Lukas1994 on July 6, 2022 | 109 comments



This brings me back to when I worked on Excel. The formatting structs (and most cell-related ones) are so tightly packed, and the bytecode VM for the formulas is pretty neat.

I wish I could have met duanec, that guy wrote probably 10% of the code for Excel. Sadly he had retired 2 years before I started working on Excel.

Fun fact: when they were about to ship Excel 1.0, they didn’t know what to name it then so they had an internal poll. The name that garnered the most votes was “Mr Spreadsheet”. Needless to say, marketing vetoed it.


♫ Mr. Spreadsheet, spread me a sheet ♫

♫ Make it a big one, with columns all neat ♫

♫ Give it two tabs, and pivot tables ♫

♫ And column headers I can use as chart labels ♫


> The name that garnered the most votes was “Mr Spreadsheet”. Needless to say, marketing vetoed it.

The older I get the more I hate people who work in marketing :(



If anything, it should have been Ms Spreadsheet.


They should have named it "NOT A DATABASE!!!"


I really like Notion for databases


!DB


I also worked on Excel and in the calc code. XOpers, LBLs… it’s all coming back. We never figured out what the Hungarian prefixes kml and ptg stood for. There was a rumor kml might be the initials of duanec’s girlfriend

Recalc or die



I have heard it called that before by teammates but thought it was just a guess. Seems it’s official

Thanks for sharing :)


> Needless to say, marketing vetoed it.

Like HR, they take the fun out of everything.


Spready Freddy A:O


What did duanec do after Excel?


I was able to find this without much googling:

https://www.geekwire.com/2013/duane-campbell/


Spreadsheet McSpreadsheetface


“That name again, is Mr. P̶l̶o̶w̶ Spreadsheet.”


If you're interested in multi-dimensional calculation engines I can also highly recommend this documentary: https://tm1.film/

Companies like Amazon are still using TM1 for their forecasting. We're trying to build a 21st century version at https://causal.app


TM1 is, for all its virtues, slow and memory-hungry, with unfortunate implementation requirements that make optimizing it via rule files anywhere from complex to impossible. My experience is with IBM TM1.


I work as a consultant and spend five out of eight hours a day in Excel, and I want the features in this app so bad.

Sadly, the only way I’ll be able to use these features is if Microsoft bought them out and integrated them into Excel. It’s ubiquitous at almost every workplace you go to, but the only innovation of the past 20 years seems to be PQ (which is awesome), and xlookup.


We're mainly working with high-growth startups, but people seem to be quite open to trying out new tools. There will always be spreadsheets, but once a company reaches a certain size it moves its financial budgeting to Anaplan/Adaptive/TM1/Causal :)


I’ve been in this type of role and the problem is your clients send you XLSX and expect it in return, as well as other 3rd parties exchanging data as part of the engagement/transaction. Changing one company is hard but changing an industry is entirely different.


Definitely agree — that’s why we have an Excel export. Importing is much harder so we haven’t worked much on that.


I seriously don't recommend this as a solution for production, but Access is quite capable of sucking data out of Excel and squeezing it back into a database. It is horrible, but I am still used to businesses doing electronic data interchange through a text file on an FTP server, so my standards are low.


Basically what one needs is a db that treats excel as a basic unit of computing.


PowerQuery is actually pretty capable, but it’s effectively a modern plug-in bolted into Excel. They’ve kind of left Excel behind.


Except the xlsx has to be human friendly


Excel added lambdas recently which is nice for defining custom functions without having to expect anything special on the client side (well, other than a newer copy of Excel at the moment).


Conceptually, I agree that it's cool, but oh man are there going to be some indecipherable formulas showing up in spreadsheets once the power users start using it.


They should really add the concept of arrays. Now you have to type "B2:B11" to refer to a range of cells, but why not just "B2", and let the ":B11" part be implicit from the structure of the sheet?


I think excel has named ranges, which is a little bit like what you're describing I guess.


There are array formulas:

https://support.microsoft.com/en-us/office/guidelines-and-ex...

There have actually been a couple of tries at them. The modern “dynamic” array formulas are a little less fiddly.


I might be missing something obvious, but wouldn't B2 be ambiguous, as it could refer to either B2:B11 or say B2:H2?


The obvious thing you are missing is that you'd have to define it in some other way, e.g. by selecting the cell and clicking "Define array" or something along these lines.


You either make a table and refer to a column, or define the array in the cell and then refer to it with a hash (i.e. B2# refers to the array of cells stored/referenced in B2).


You can, but you need to make a table then refer to the table column.


Google Sheets has this, Excel doesn't?


It does - and you can do it multiple ways.

The obvious way to do it is as follows:

* Set up an Excel sheet with "Name" and "Age" in A1 and B1, then put some sample data in.
* Highlight the data, go Insert -> Table, and hit 'OK'.
* Where it says Table Name, put in "Users".

If we want to refer to the array of names in that table, you would now say =Users[Name] in a formula.

As another example, if we wanted to find the average age, we could simply say =Average(Users[Age]) passing in the ages as an array.

If we made another two columns - MathsScore and EnglishScore, we can either automatically create a column adding these up ([@MathsScore]+[@EnglishScore]), or alternatively these ranges can then be used in spill formulas.


I hadn't seen this before, but when I went to try it I couldn't get it to work in Google Sheets; am I using the wrong syntax? Typing "=SUM(A1)" just gave me the sum of A1, not A1 through the rest of the data in the column, while "=SUM(A1:)" resulted in an error. "=SUM(A1:A)" gives me the whole column as expected, not stopping when the data breaks after A11.


This is the correct behaviour: =SUM(A1) should give you A1 in both Excel and Google Sheets. You are passing a single cell into SUM, so you're only asking it to add that cell up (not the cells below).

You must refer to a range of cells for it to work - in Excel this would be for instance =SUM(A:A).

In Google Sheets: generally you write a formula and hit Ctrl+Shift+Enter which will turn it into an array formula.

In Excel there are many more options in how to handle this, one of which is to define a table in Excel (Insert -> Table) and then refer to the table and column (=TableName[Column]).


Talk to the LibreOffice people. You might even be able to sponsor work on the feature(s) you're interested in.


Excel is ubiquitous in almost every workplace, so I can use the same workflows and templates, and don’t have to raise a request to install some new app, or ask permission to load client data into a cloud service operated somewhere.

LibreOffice is so, so far behind the curve already, and I’d be laughed out of the building.


> Excel is ubiquitous in almost every workplace

Only in workplaces which use Windows. But granted, that's the large majority of them.

> so I can use the same workflows and templates

LibreOffice Calc opens Excel files well enough.

> and don’t have to raise a request to install some new app

It's not "some" new app. It has an estimated 200 Million users worldwide, and is probably the most popular FOSS application after browsers.


Most Apple-centric enterprise workplaces use Office. With the exception of some features tied to specific O365 licenses, it has feature parity with the Windows version.

LibreOffice does not do PowerQuery, and while it ‘opens Excel files well enough’, it probably only has 70% feature parity at most.

They themselves claim that it has ‘tens of millions of users’…

I’m not saying it’s a bad - or new - product. It’s just yet another request to the customer to install something. Very, very few (exactly zero) workplaces I have visited in Oceania and South East Asia are using it.


This can never compare to the sheer power that is excel. I’m glad they don’t move fast and break things here.



To be honest, apps of this kind don't seem too difficult to clone as an open-source project.

Could you name the exact killer features you see in this one, compared to, say, etherpad or https://pad.envs.net ?


This is definitely not true - there's a huge depth of complexity in these multi-dimensional databases. It's relatively easy to make a proof of concept (it's "just" an array of arrays, after all), but building something production-quality is a multi-year endeavour.


Interesting. A few q's as an Excel performance nerd who's fooled around in Pandas:

* Why the design choice to tie the calc engine directly to a DB vs computing dependencies and returning values post-facto?
* How does vectorization scale when using "volatiles" like offset-style formulas that may dynamically change the calc graph?

Every time I tried to recreate stuff from Excel in Pandas/Numpy, 1:1 multiplication wasn't the issue; the weird dependencies that prevented using their vectorized capabilities were.


Curious why they opted for Go and not C++ for the core engine, where there are lots of batched mathematical calculations.

If you want to eventually leverage SIMD and GPGPU computation C++ is almost mandatory, languages like Rust/Nim/Zig are getting there but still behind in many areas.


Great question. The biggest reason is that Go was what the team was most familiar with when the engine moved from Javascript. The reason today is that the engine is 40K LOC of Go, so a full-blown rewrite would be a big deal. As the post covers in depth, we have a lot of rope left before we need to get to SIMD/GPGPU. Once we hit those computational bottlenecks in Go, we're very open to changing our mind!


Now that you have decent data structures, why don't you use a library for arithmetic calculations on vectors? I would use MKL for this, at least if you're running on Intel machines, the vdMul function for your particular example. Even on AMD machines the MKL implementation will probably be faster than hand-written code.


Go supports custom assembly, have you considered just polyfilling the SIMD functions manually?
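
Something like this for the wiring, with the actual AVX kernel living in a separate mul_amd64.s file behind a build tag (just a sketch of the usual pattern with hypothetical names, not their code):

    package vecmul

    // MulVec multiplies a and b element-wise into dst.
    // It defaults to the pure-Go fallback below; an assembly-backed
    // version (e.g. AVX2 in mul_amd64.s) could overwrite it in an
    // init() after a runtime check like golang.org/x/sys/cpu.X86.HasAVX2.
    var MulVec = mulVecGeneric

    func mulVecGeneric(dst, a, b []float64) {
        for i := range dst {
            dst[i] = a[i] * b[i]
        }
    }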


Yes, I didn't mean to imply we wouldn't go far to squeeze Go further! At some point the complexity of dropping down from Go to intrinsics, bypassing the GC, etc. might outweigh the benefits of staying with Go. We've spent a while writing simulations like the ones in the post to understand Go's GC's behaviour under lots of pressure, and it won't be a bottleneck for a long time. I also think that doing all the prerequisite work to make SIMD remotely possible will yield _far_ more performance than SIMD itself.


GC pressure is the cause of many of your competitors' problems. I'd be planning a move to an alternative sooner rather than later otherwise you'll run into the exact same problems that they have - after a certain point in the company's life it becomes practically impossible to fix


> If you want to eventually leverage SIMD and GPGPU computation C++ is almost mandatory, languages like Rust/Nim/Zig are getting there but still behind in many areas.

Is this mostly about GPGPU? I think you can just import and use SIMD intrinsics in Rust. Zig lets you do that and additionally features native vector types whose operations generate the appropriate SIMD instructions.

There's nothing like Highway yet, but for many use cases it seems like a lot more trouble to use Highway than to write several implementations that use intrinsics directly.


Yeah, mainly GPGPU. C++ still has an advantage in that there are quite a few convenient SIMD libraries available (like Agner Fog’s vectorclass, and Google’s Highway that you mentioned), but you can still write raw intrinsics in other languages (actually Zig has vector types baked in, so you can use those instead).

Though on second thought: for this particular company’s use case, since they only need this for the server backend, it would be best to stick to a certain hardware configuration and optimize the hell out of it. For example, they could just stick to Xeon servers and use raw AVX-512 intrinsics, which can be particularly fast for number crunching. And since GPUs are incredibly expensive to host (and also have latency costs associated with them), it might be best to just stick with CPU SIMD.


The JVM is quite good at emitting SIMD instructions if you write straightforward loops with primitives. The forthcoming Vector API will open up more possibilities.


I am less excited by auto vectorization than I once was after being repeatedly disappointed by LLVM's output, especially for string parsers. I hope it works well though.


Can you elaborate on some of the use cases for spreadsheets having billions of rows? Databases of phone calls and credit card transactions reach this size, but I would never imagine someone wanting to dump any significant amount of transaction data into a spreadsheet without quite a bit of SQL/R/Py modification ahead of time.


I run a VA business with my brother so we deal with a lot of spreadsheets. We have had a few real estate firms as our customers. They put EVERYTHING in Google Sheets and Excel.

Everything from regulatory documents to CRM data to checklists. Having billions of rows might sound problematic, but when a Google Sheet gets capped or slow they will either create a new sheet in the same file or a new file. Over its 10 years of operations, an RE firm accumulates hundreds and hundreds of spreadsheets. They even have spreadsheets to locate spreadsheets.....

Having everything in one spot is a slightly better thing to do. You can't pry spreadsheets from small businesses no matter what justification you have. From a practical standpoint upping the limits is kind of a good thing.

We usually clean everything up manually and create a proper database that they can query easily. Or they can just tell us what they need through email.


Veterans Affairs? Virginia?


Virtual Assistance.


Maybe virtual assistant?


Fascinating. We’re seeing the same thing. Seeing a few companies opt for Coda as a way to convert some sheets to processes. Here’s what we’ve done with it: https://supsync.com/squared-away/


E-commerce companies have 1000s of products, modelled over 100s of weeks, broken down by 10s of locations, ... SaaS companies running cohort models over multiple products, geographies, ...

You pretty quickly end up with very large numbers.


What’s the best ETL process for transferring the data?


What is your strategy for storage? It seems that because you're offering this as SaaS, your customers set a high bar for durability of their data, and your COGS would be very high if you just kept a sheet this big in memory all the time.

Do you have any plans for models which exceed the size that can be reasonably processed on one machine?


Right now we keep our paid customers' models in memory. Models in free accounts are just evaluated every time (these models are usually much smaller). For now we want to avoid the complexity of distributed systems and just run this on very powerful machines with lots of RAM. For reference, one of our enterprise competitors has a calculation engine that can do 40B cells on a single host (using Java!).


How are you ensuring the durability of the data?


We store the model representation (formulas, data) in Postgres. We just keep it in memory for computing updates faster.


that'd be Anaplan I assume?


Correct


Really interesting write up of the performance improvements. I love reading about such huge improvements by paying attention to algorithms and data structures.

You have inspired me to try replacing one of the hash tables in an algorithm I'm working on with an array to see what happens. It won't be exactly the same algorithm afterwards, but possibly the benefits will outweigh the disadvantages.
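
Roughly the swap I have in mind (hypothetical names, and assuming the keys are small dense integers, which is what makes it work):

    package main

    import "fmt"

    func main() {
        const numCells = 1_000_000
        cellID := uint32(42)

        // Before: arbitrary keys, hashed on every lookup.
        byMap := map[uint32]float64{}
        byMap[cellID] = 1.5

        // After: if IDs are dense integers in [0, numCells), a slice
        // lookup is just a bounds-checked index -- no hashing, better
        // cache locality, and far less GC pressure.
        bySlice := make([]float64, numCells)
        bySlice[cellID] = 1.5

        fmt.Println(byMap[cellID], bySlice[cellID])
    }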


> However, we can create serious bottlenecks when it comes to moving memory back and forth between the CPU and GPU. We’d have to learn what kinds of models would take advantage of this, we have little experience with GPUs as a team—but we may be able to utilize lots of existing ND-array implementations used for training neural nets.

Indeed, moving memory back and forth between CPUs and GPUs has an overhead. There are ways to mitigate this though! I vaguely remember that one of the patterns for reducing this movement was to keep the data on the GPU as much as possible. I haven't kept up with the latest GPU tech of late. When I first played around with CUDA, ArrayFire (arrayfire.com) (no affiliation) was a promising library, and might be a good fit for your GPU prototypes?


Wait, where is the actual spreadsheet? Everything is hidden behind demo booking.

It'd be nice to actually see billions of cells in action. Otherwise it's just marketing for their product disguised as a technical blog post. For all we know it doesn't actually work that well in practice.


Man, this is exactly what I did at a previous job. We had a massive query engine as part of our BI product. I recoded the entire sorting algorithm to use variadic templates to get a sorter for any set of column types in one go, rather than dereferencing a pointer for each column. We observed massive improvements in speed (more like 50%, rather than the 100x mentioned here; however, that codebase has been constantly optimized over the last 25 years).


OMG, does anyone else recall a series of comedy videos about a sysadmin dealing with employees? My favorite was a call from someone who "ran out of rows" in Excel. The sysadmin is losing his mind trying to explain why that wasn't possible.

EDIT: Found it! https://youtu.be/1SNxaJlicEU (slightly NSFW, skip to about 3:30)


This needs to be runnable locally/natively. Most companies would not want to run this with their proprietary data in a third-party SaaS.


We'll probably offer an on-premise solution soon but most companies are quite open to using SaaS these days (banks and financial institutions might be an exception).


I’m curious how incremental updates fit into this. I guess there are relatively structured updates like adding a row, but one can imagine unstructured updates like changing some parameter that many formulas depend on, or (not sure how much Causal allows this) much more varied structures/dependencies, such that one update may require recalculating some small but hard-to-predict subset of cells. Maybe there’s no need if you can recalculate everything fast.

I’m also interested in how causal fits into the taxonomy here: https://www.scattered-thoughts.net/writing/an-opinionated-ma...


That's what I wonder about too. I imagine that a slow calculation would be fine if done once when the data is first loaded into the system. Once the results are in memory, any new data should only affect smallish subsets of cells.

This is the approach our product Grist (https://www.getgrist.com) takes for its calculation engine.

Keeping track of dependencies slows it down for the initial load, but beyond that it only needs to calculate what's needed, so most updates are fast, even though the engine literally evaluates user-defined Python code for every cell.
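
A toy sketch of the general idea (in Go for brevity; Grist's actual engine is Python and much more involved):

    package main

    import "fmt"

    // Toy dependency-driven recalc: each cell lists its dependents, and an
    // edit recomputes only the cells downstream of the change.
    type Cell struct {
        Value      float64
        Compute    func(get func(string) float64) float64
        Dependents []string
    }

    func recalc(cells map[string]*Cell, changed string) {
        get := func(name string) float64 { return cells[name].Value }
        // Naive DFS; a real engine would topologically order and
        // deduplicate the dirty set.
        var visit func(name string)
        visit = func(name string) {
            for _, dep := range cells[name].Dependents {
                cells[dep].Value = cells[dep].Compute(get)
                visit(dep)
            }
        }
        visit(changed)
    }

    func main() {
        cells := map[string]*Cell{
            "A1": {Value: 2},
            "B1": {Value: 3},
            "C1": {Compute: func(get func(string) float64) float64 {
                return get("A1") * get("B1")
            }},
        }
        cells["A1"].Dependents = []string{"C1"}
        cells["B1"].Dependents = []string{"C1"}

        cells["A1"].Value = 10         // edit one input cell...
        recalc(cells, "A1")            // ...and only C1 is recomputed
        fmt.Println(cells["C1"].Value) // 30
    }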


Suggestions for further improvement:

Cap the number of cells per partition and essentially shard. Using 2 machines could conceivably double performance once scaling up stops working. So a 1bn-row sum could conceivably ask for 5 separate 200m-row sums, then sum those.
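
Sketched on one machine with goroutines (the same shape works across machines with one RPC per partition):

    package main

    import (
        "fmt"
        "sync"
    )

    // Split a large sum into fixed-size partitions, sum each partition
    // concurrently, then sum the partial results.
    func partitionedSum(xs []float64, parts int) float64 {
        partials := make([]float64, parts)
        chunk := (len(xs) + parts - 1) / parts
        var wg sync.WaitGroup
        for p := 0; p < parts; p++ {
            lo, hi := p*chunk, (p+1)*chunk
            if lo > len(xs) {
                lo = len(xs)
            }
            if hi > len(xs) {
                hi = len(xs)
            }
            wg.Add(1)
            go func(p, lo, hi int) {
                defer wg.Done()
                for _, x := range xs[lo:hi] {
                    partials[p] += x // each goroutine owns its own slot
                }
            }(p, lo, hi)
        }
        wg.Wait()
        total := 0.0
        for _, s := range partials {
            total += s
        }
        return total
    }

    func main() {
        xs := make([]float64, 1_000_000)
        for i := range xs {
            xs[i] = 1
        }
        fmt.Println(partitionedSum(xs, 5)) // 1e+06
    }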

Maintain a float32 and a float64 version. Compute the float32 one first and return it to the user, then update the result.

You could conceivably run some kind of ML model, or even regular logic, to predict the API call your formula engine will need to make. E.g. if the customer starts typing a SUM over a column range, you could optimistically fetch that part as soon as you see that element.

Optimistic caching is obviously high-variance and resource-draining.

The 10000-IQ version is coordination-free horizontal scaling with a ScyllaDB-style algorithm.


Is there some kind of standard / open source specification for "formulas"?

I see this concept in a few applications and am wondering if I'd need to re-invent my own.


I was just looking at a GIS inside a spreadsheet idea. Have you guys looked into spatial awareness yet?


Most of those billions of values are not shown... why not compute only the values that are on screen?


That's exactly what we're doing :) The issue is that most of the time people look at aggregates and to compute those you have to compute all the leaf cells.


Can I ask — why do you need all the leaf cells? If we want to know total gross profit for the year — why can't we compute that as "total revenue - total COGS", rather than aggregating every net profit split by product / region / month?


Are you suggesting that this is a two-step aggregation (first by groups, then aggregating those groups), in which case the first step was unnecessary?

In any case, if you sum a list of numbers you have to look at each value. There’s no way around it. You can parallelize, like MapReduce, but the number of items is unchanged. You can preaggregate, but at some point you need partial sums.


this looks really cool! I've been reading quite a bit about spreadsheet/calculation engines as I've been building my own as a hobby project recently, and the whole space is pretty interesting.


Have you seen any C++ or C spreadsheet engines that look reasonable? I need to embed a spreadsheet engine in a C++ system and have looked at using either sc-im or this obscure project called teapot. I need spreadsheet-style runtime formulas and cells, not a compile-time calculation engine.


Any reading pointers?

I've built my own cheapo one in js to support my financial modeling, but it's not quite general purpose so I get away with a lot. Would be curious how to take it to the next level without reinventing the wheel.


This post is pretty good: https://lord.io/spreadsheets/


this, and the documents it links to, are very helpful! thank you for sharing this.


This is one of the best sources [1]. Spreadsheet implementation in C# + corresponding book.

[1] https://www.itu.dk/people/sestoft/funcalc/


excellent! thank you so much—these things are difficult to find on search engines.


Not a spreadsheet, but the concept is the same: https://www.microsoft.com/en-us/research/uploads/prod/2018/0...

You need to solve recursive dirty updates.


Nice! Any open source code I can check out?


Check FunCalc [1]. Spreadsheet implementation in C# + corresponding book.

[1] https://www.itu.dk/people/sestoft/funcalc/


Possibly at some point in the future; it's a very early prototype right now. The objective is actually to make a spreadsheet-based video game.


Sounds sick! Would love to play it once it's ready :D


How do you manage memory on the backend? If you have single spreadsheets taking up GBs it can add up pretty fast. What are your thoughts on scaling this as the business grows?


Can someone please point to a similar display/UI/grid engine used in online word processors/spreadsheets? It'd be cool to play with it.


Not sure what you mean, but you can just sign up here and use it for free: https://my.causal.app/register


I meant from an engineering perspective, not an end-user one.


Looks very similar to ECS-style data storage used in video games. Have you looked at using opaque index-based IDs instead of references?
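
Roughly what I mean (a rough sketch, not tied to Causal's actual types):

    package main

    import "fmt"

    // ECS-style layout: cells live in flat contiguous slices and refer to
    // each other by small integer IDs instead of pointers, so the GC scans
    // a handful of slices rather than chasing millions of references.
    type CellID uint32

    type Cells struct {
        Values  []float64
        Parents [][]CellID // dependencies stored as indices, not pointers
    }

    func (c *Cells) Add(value float64, parents ...CellID) CellID {
        c.Values = append(c.Values, value)
        c.Parents = append(c.Parents, parents)
        return CellID(len(c.Values) - 1)
    }

    func main() {
        var c Cells
        a := c.Add(2)
        b := c.Add(3)
        sum := c.Add(c.Values[a]+c.Values[b], a, b)
        fmt.Println(c.Values[sum], c.Parents[sum]) // 5 [0 1]
    }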


I've implemented ECS many times but I still have a hard time seeing what you're referring to. What are the similarities?


I'm pretty sure this is better addressed with SQL.


We spent a fair amount of time trying to get this executing fast on SQL. However, with spreadsheets you recursively compute a lot of values. You can do this in SQL with recursive CTEs, but it’s slow, they’re not optimized for this. They also of course won’t do smart caching and dependency management for the cache. Fundamentally, it’s possible, but then we’d need to start hacking on a DB engine to make it work as a sheet. We concluded it’s better to continue on our existing engine.



