What if OpenDocument used SQLite? (2014)

p4bl0 · 2023-09-18T09:02:56.000000Z

I'm currently working on an application where I use SQLite as the file format. I want to keep a usual workflow for users where you can make edit to your document and it only changes the file when you save it.

So to open a file I copy it into the :memory: database [1], then the user can do whatever manipulation they want and I can directly make the change in the database I don't need to have a model of the document other than its database format. And to save the document I VACUUM [2] it back to the database file. It works quite well, at least for reasonably sized file (which is always the case for my app) :).

[1] https://www.sqlite.org/inmemorydb.html

[2] https://www.sqlite.org/lang_vacuum.html

rakoo · 2023-09-18T11:50:50.000000Z

Why do you use a secondary, volatile database ? Performance-wise you won't gain a lot more (we're talking about a user editing a file, so not even 1 write per second).

A proposal: write directly, and automatically in the database. No more Save button. There are multiple advantages:

- the system is crash-resistant. I like taking the approach of CouchDB where the only correct way to close the system is to crash it. That way a crash is an expected situation that you actually account for, not a special case that you might forget

- there is only one database. Less code, fewer bugs.

- it is safe. A write to SQLite works or doesn't work, there is no in-between. As said in the VACUUM doc you point to: "However, if the VACUUM INTO command is interrupted by an unplanned shutdown or power lose, then the generated output database might be incomplete and corrupt"

- it is how SQLite was intended to work. And because of that, you won't have to think about it for the lifetime of SQLite

fluidcruft · 2023-09-18T12:34:57.000000Z

There is nothing I hate more than an app that modifies files secretly when I open them. Then I have to get all defensive to copy files before I open them to keep them intact. You may not see the problem with changing the checksum or hash of a file, but silently tampering with files is a nightmare in many domains. If you open a file and accidentally change something trivial (some apps like to store things like presentation state i. e. window positions, last page viewed, zoom level, ...)

For example in many regulated domains such as human subjects research files must be approved and only approved files may be used. "Is this version of the consent document the version that the IRB approved?" Well let's see... (1) file modification date is after the approval date and (2) checksums do not match.

Not to mention that writing a single byte of content to a filesystems marks the entire blob as needing backup.

The fact is the filesystem is the user's database, save is commit, and it should be under the users control because application developers do not have the faintest idea about user context.

rakoo · 2023-09-18T13:16:59.000000Z

You are right, this is a use case I absolutely did not take into account, but I want to separate user-defined actions and app-defined actions. A level of zoom is something a user does to read a document, but I wouldn't consider that as data to be persisted automatically, unlike characters typed or a font chosen. I value the idea of persisting it, but that would be a user-specific action ("Save view" or something like that)

In the case of checksums in a database, that is why read-only modes should be used and I don't see what automatically saving would change. If anything, when the user zooms on a document in read-only mode, either it shouldn't be stored or storing it should trigger the same flow as modifying the document

fluidcruft · 2023-09-18T15:26:21.000000Z

I see what your are saying, but in other use cases the presentation state needs to be considered as part of the document. This is one of the reasons zip/jar containers work somewhat well. You can audit different chunks of data separately and cryptographically sign them. sqlite actually has an archive format[1] that is interesting to think about and I have pondered using it for some applications (store the files and also store tables of metadata/analysis)

[1] https://sqlite.org/sqlar.html

atoav · 2023-09-18T18:09:04.000000Z

Why not do it like Blender: just autosave into some software directory, have the possibility to restore on crash, have the possibility to restore the last n autosaves from disk and add a setting for how many to save etc in the options.

wongarsu · 2023-09-18T15:20:18.000000Z

Word has a fairly simple solution to this: there's a big slider labeled "Autosave" in the title bar, right next to the save button, allowing you to turn this behavior on and off at any time.

95% of the time I want changes persisted immediately, but it's nice to be able to turn it off when I don't.

prewett · 2023-09-18T22:37:46.000000Z

Depends a little on the type of file. A prose document, sure, probably want autosave by default. A vector graphics file? I want autosave when I'm creating it, but I do NOT want autosave when I'm copying out a piece buried several groups in and behind some things I need to delete/move out of the way. I also don't want to have to think about whether I need it or not.

But generally the way autosave works is to save a copy that can be recovered on a crash, and only overwrite the original if directed by the user. That works for both use cases. (Haven't used Word in years, so I'm not sure if they have a different behavior now.)

afiori · 2023-09-18T14:50:12.000000Z

For an application working with reasonably sized files sqlite files it would be reasonable to

1. on opening a file clone it to a temporary folder

2. edit the temporary file there on disk

3. on save mv/cp the temporary file over the destination

I am probably missing a lot of use cases, but it migth be a good idea for a game like Factorio where you are expected to have multiple on disk saves of the same run at different times.

fluidcruft · 2023-09-18T15:05:07.000000Z

In the sqlite case, I think it actually can save uncommited edits to a separate journal file until committed. At least, one of the systems I am familiar with that uses sqlite as a container format (MRI scanner) seems to do this, so I suspect sqlite supports that mode natively.

I'm just pushing back against the idea that its a good or helpful idea to "help the users" by taking the deliberate "save" action away from them.

As an aside, one of the things that has been learned from this class of MRI scanners is that users need to feel "in control" of the machines they're using. The "look how smart this machine is by doing all these magical things you used to do yourself!" attitude works well in sales but really does not go over well in the field because users encounter the fuckups and are held responsible for them. So they quickly start to distrust the machines.

didntcheck · 2023-09-18T14:50:51.000000Z

> Not to mention that writing a single byte of content to a filesystems marks the entire blob as needing backup.

If the size, mtime, and inode number stay the same (i.e. it writes into the file directly instead of replacing it), then most backup software will skip it. AFAIK to do otherwise you either need to read the whole file every time, or be live monitoring audit events to see what files have been opened for write, or be the filesystem (e.g. ZFS snapshots, which can be maximally efficient since it knows exactly which blocks it's modified)

Of course this has its own downsides. While those writes may have been "unimportant", the fact is that your backups are now flawed. And if the application has had the foresight to distinguish unimportant writes, and preserve the mtime, I'd rather they just not make those writes in the first place

fluidcruft · 2023-09-18T15:19:17.000000Z

I am under the impression that modifying a file's content updates the modification time. Is this incorrect? Modifying a file without updating the mtime or allowing mtimes to be edited in userspace sounds like a security nightmare.

hddqsb · 2023-09-21T07:39:30.000000Z

On Linux, the mtime ("modification time") is indeed automatically updated when the content is changed, but it can also be set arbitrarily from userspace (e.g. using `touch`) without special permissions. This is very useful, e.g. when you mirror a directory you'll typically want to preserve the mtimes, to help identify changes later.

Linux also has a ctime ("status change time"), which is automatically updated when the content or metadata (inode) are changed. It is not possible to change from userspace (you have to use tricks such as changing the computer clock or directly modifying the disk). This gives you the security benefits, but it is not commonly used e.g. in backup tools, precisely because they want to be able to set "fake" timestamps.

mixmastamyk · 2023-09-18T17:48:59.000000Z

Yes it does. mtimes can definitely be edited, if you have permission, but it is rare. I have a photo script that pulls the taken time from Exif and writes it to the file mtime.

hddqsb · 2023-09-21T07:45:50.000000Z

> mtimes can definitely be edited, if you have permission

It just requires you to be the owner of the file (https://serverfault.com/a/337810), no special permissions. On the other hand, editing the ctime is not easy -- it requires tricks (e.g. change the computer clock, which typically requires root privileges).

eternityforest · 2023-09-19T06:45:56.000000Z

There's also SSD wear issues. There shouldn't be, because SSDs are durable, but some applications find dumb reasons to write multiple GB in a minute.

And by some applications I pretty much just mean browsers, but still.

p4bl0 · 2023-09-18T14:01:32.000000Z

> Why do you use a secondary, volatile database ?

For the exact reason I gave in the comment you are replying to: I want to keep a usual workflow for users. Principle of least surprise.

Users are okay with change being autosaved when there is a single "thing" that can be edited to the point that you don't even have to open it, it's just there, it can be seen as a property (as in ownership) of the application more than of the user. For example, your music library in your jukebox application.

On the contrary, when the user have to open the "thing" with your application and can choose between many of their files that can be edited with your application, users do not expect their files to be automatically modified at all. For example users may start doing some heavy editing and then at the moment of saving their work, they might make a backup of the previous state file before saving, or choose to "save as…" in order to keep the old version just in case.

Crashes are not something that happen that often. It can become an actual problem when you have tens of thousands of users and rare-events do happen, but in the particular case of the application I working on, I do not actually have to worry about that (on the contrary, any solution would have downsides that are worst in the particular case of this application than having to do some work again because of a crash if it ever happens).

infogulch · 2023-09-18T21:25:03.000000Z

Another option is to explicitly start in read-only mode (modification buttons hidden / grayed out, some distinct mode indicator "Viewing Document" next to a button to "Start Editing", etc), and when the user chooses then switch into autosave mode. At this point many users are used to autosave and don't pay due attention to the document saved state. With Microsoft Windows' habit of rebooting your system overnight without your explicit permission, I'm concerned that this might lead to a lot of lost work.

rakoo · 2023-09-18T14:30:52.000000Z

You can keep a distinction between old and new version inside the same database, by having a pointer to the "current" version, and updating the pointer when clicking on "save". You could store all changes in the database in a "staging area", such that when you reopen the app you can load the changes and you don't need a recovery phase, but with the "save" button active meaning that something changed since last save.

p4bl0 · 2023-09-18T19:09:53.000000Z

I could, but as you said in your previous comment (emphase is mine):

> Less code, fewer bugs.

knome · 2023-09-18T12:16:35.000000Z

a save button is still good, as it allows you to keep specific checkpoints.

but the save button could simply tag specific save points in a larger table.

if the format can roll up changes to compress them, they also indicate where which variants need to be kept indefinitely.

theamk · 2023-09-18T14:02:30.000000Z

Auto-save is nice, but I think it works much better when it's a separate file.

This way "main" document file (which might be checked in to git, or shared via dropbox or read by some document) only contains nice, clean, saved version.

And yet if your computer crashes for whatever reason, the data is still not lost and can be trivially recovered.

rakoo · 2023-09-18T12:59:11.000000Z

In this view the save action is more like a commit, where the user manually checkpoints and also offers a simple description of why this is an important point. But in my view all intermediary points also need to be saved, because the user might have forgotten to explicitly checkpoint, and might still want some undo/redo capability that is more granular than just checkpoints.

asalahli · 2023-09-18T16:49:51.000000Z

> I like taking the approach of CouchDB where the only correct way to close the system is to crash it.

The term you're looking for is (aptly named) crash-only software.[0]

0. https://en.m.wikipedia.org/wiki/Crash-only_software

nextaccountic · 2023-09-18T09:52:37.000000Z

This means that like a regular app, you lose data if the app crashes or there is a power loss.

It's much better to save after each operation in a temporary place (probably in ~/.local/share/application/yourapp, using XDG directories), and when the user clicks save, just copy the file into the desired location. That way, if there is a power loss and you reopen the app, it opens right back where it was doing (losing maybe he last few seconds of changes, but not all unsaved data)

scherlock · 2023-09-18T11:13:18.000000Z

If you have a db, why not just model it as unsaved data? I.e. all changes get stored to the db, but have a flag of unsaved. If you open up a file and there are unsaved changes, you can prompt the user to either make them saved or discard them.

mort96 · 2023-09-18T11:36:37.000000Z

That feels like it requires the data model to be very different? The file format would essentially need to be a list of changes, with a "committed" flag.

Like, if someone changes some text in a paragraph, you can't just model that as "this paragraph now contains this new text". You have to model it as "this paragraph used to contain this text, but an uncommitted change changed it to this other text". User deletes an image? You have to still store the image and all the references to it, but with an uncommitted change to delete the image and remove the references to it.

And maybe that's a good thing, maybe a git-like system where the history of every change is tracked is what you want. But it certainly doesn't feel like it'd be appropriate for every application and file format.

julesnp · 2023-09-18T12:53:33.000000Z

You wouldn't necessarily need to track every change, you could just have 2 tables, one which contains the last "saved" version of the document, and one which contains the last modified version of the document. Upon opening after a crash, if there is a more recent modified version, the program will ask if you want to load that version.

mixmastamyk · 2023-09-18T20:50:55.000000Z

Databases handle all this natively with transactions and WALs. i.e. Don't need to build a Flintstones version yourself.

Also binary documents are a lousy fit with git, smashing square peg into round hole makes little sense.

Someone · 2023-09-18T10:47:39.000000Z

> and when the user clicks save, just copy the file into the desired location.

To be perfectly safe, you want to rename it, not copy it. If there’s a power loss during copying, you may endcup with corrupted data.

Renaming is, to coin a phrase, “more atomic” than copying (on Linux, the OS says it is atomic. ISO C says it, too, but POSIX doesn’t (https://pubs.opengroup.org/onlinepubs/000095399/functions/re...: “This rename() function is equivalent for regular files to that defined by the ISO C standard. Its inclusion here expands that definition to include actions on directories and specifies behavior when the new parameter names a file that already exists. That specification requires that the action of the function be atomic”)

Also, filesystems may have bugs, hardware may lie about syncing to disk, and network shares can be finicky.

Doing this properly isn’t as easy as one would think. You’ve to make sure to sync the file to be written and you’ll have to handle the case where the save location is on a different file system than your temporary file. If so, you’ll have to create a copy on that file system first.

I think many tools do not check whether they need to work cross filesystem and just write their scratch files to the save directory with a different name and then rename them.

Of course, that means you always need twice the disk space on the target disk to do a save. That used to be a problem almost everywhere, but nowadays mostly is restricted to embedded systems and USB sticks.

In this case, however, SQLite will do a lot for you, and probably better than you would do it. It claims (https://www.sqlite.org/atomiccommit.html#_multi_file_commit):

“SQLite allows a single database connection to talk to two or more database files simultaneously through the use of the ATTACH DATABASE command. When multiple database files are modified within a single transaction, all files are updated atomically. In other words, either all of the database files are updated or else none of them are. Achieving an atomic commit across multiple database files is more complex that doing so for a single file. This section describes how SQLite works that bit of magic.”

However, about VACUUM INTO, it says (https://www.sqlite.org/lang_vacuum.html):

“The VACUUM INTO command is transactional in the sense that the generated output database is a consistent snapshot of the original database. However, if the VACUUM INTO command is interrupted by an unplanned shutdown or power lose, then the generated output database might be incomplete and corrupt. Also, SQLite does not invoke fsync() or FlushFileBuffers() on the generated database to ensure that it has reached non-volatile storage before completing.”

So, I don’t think doing “VACUUM INTO” is sufficient to guarantee that you get a good copy of your data on disk.

cduzz · 2023-09-18T11:32:04.000000Z

Well, copying is simply not atomic in linux; directory entry operations (rename, link, unlink) are atomic. There is a definition somewhere that says how many bytes may be written atomically; that's it -- past that writes are not atomic.

The prior comment of "model user interactions in the database" seems spot on -- just keep track of what the user's doing as unsaved data in the database and commit it (in the appropriate way) to the DB as it happens; save is just another user action.

Presumably sqlite has figured out how to write to the filesystem without corrupting itself even in a variety of adverse scenarios?

If not, commit to a temporary copy of the DB that gets renamed to the "main" name periodically or when the app closes. There's an xzzzbit meme there somewhere, yo.

naniwaduni · 2023-09-18T14:10:56.000000Z

> The prior comment of "model user interactions in the database" seems spot on -- just keep track of what the user's doing as unsaved data in the database and commit it (in the appropriate way) to the DB as it happens; save is just another user action.

The trouble is that often in the wild, the content of the file on the filesystem is a user-facing interface. Users will copy it around and attach the whole document onto emails. When they do, they do not expect the file to contain data that they didn't want to save.

(Yes, they sometimes also expect the file to contain data that they wanted to but didn't explicitly save. This is not a contradiction.)

cduzz · 2023-09-18T14:36:06.000000Z

That's certainly true, and also the case in many modern file formats.

For example there's the story of the academic studies with falsified data [1] where forensics on the included excel documents showed they'd gone through and replaced data with "randomized" data (if I remember correctly); there are tons of examples of "redacted" data in pdf docs being visible under the blacked-out rectangles.

I'm not disagreeing with you, btw, such actions are certainly problematic, but hopefully kids will grow up knowing they need to run an export to sanitize their data if they don't want to show the whole world their transaction logs...

[1] https://www.npr.org/transcripts/1190568472

naniwaduni · 2023-09-18T14:43:04.000000Z

While "knowing they need to run an export to sanitize their data" is (unfortunately!) a thing user just Have to Know, a wider issue imo is that we also shouldn't really be encouraging developers to casually assume that users will only interact with their software through their own ordained interfaces.

jrockway · 2023-09-18T15:56:14.000000Z

> they do not expect the file to contain data that they didn't want to save.

I think that's fine. Just remove "save" from the UI, and save after every keystroke. This may sound crazy in 1970, but it's how nearly everything works today. It's really only us weirdos that started using computers before "the cloud" that think "save" is an operation that does something, and we're dwindling in numbers!

NavinF · 2023-09-18T12:52:07.000000Z

> Linux, the OS says it is atomic

nitpick: At least on some filesystems if you rename a.txt to overwrite b.txt and the machine crashes, you might end up both a.txt and b.txt hardlinked so they contain the same data.

Of course this is no big deal since b.txt is still updated atomically so it contains the new data (assuming a.txt was fsynced) or the old data. I assume nobody depends on a.txt being deleted simultaneously.

avereveard · 2023-09-18T11:09:07.000000Z

Good ole .filename.swp

p4bl0 · 2023-09-18T14:08:17.000000Z

Yes, I am aware of that, and you are right about this in general. In my particular case however, it is preferable to loose some work in the rare cases were a crash occurs than to have a copy of the file in some place that the users are not aware of. Of course if crashes were frequent the trade-off would be different.

SanderNL · 2023-09-18T11:25:00.000000Z

You are right and like you explained this is trivially easily fixed by autosaving regularly.

What I have trouble imagining is people working with documents on computers for more than a few years yet somehow failing to develop the Always Save Instinct. I regularly catch myself saving unreasonably often.

josephg · 2023-09-18T11:34:06.000000Z

Users shouldn’t ever need to adapt to computer crashes like this. Software should always auto save or have recovery files or something. As a principle, software should hold anything a user inputs with reverence.

SanderNL · 2023-09-18T11:47:59.000000Z

I agree. Maybe as a dev I've become cynical and don't trust anything. Least of all some app holding my document.

Makes me think of the "Voting software" xkcd: https://xkcd.com/2030

"I don't quite know how to put this, but our entire field is bad at what we do, and if you rely on us, everyone will die."

belenos46 · 2023-09-18T14:48:14.000000Z

You think that's cynical? I must be some avatar curmudgeonliness then.

I'm pretty sure that what's actually going on is that everything we think of as a 'profession' is that way, and that people are in fact dying because of it.

The difference is that devs are honest about it.

mixmastamyk · 2023-09-18T20:59:57.000000Z

Yes, however you have to be sure every single program across decades is written with those rules in mind. Hard on any OS, but a lost cause on */Linux.

chrisshroba · 2023-09-18T11:47:05.000000Z

I used to have that instinct but lost it in the age of auto save. The applications (web or native) I use most often all do it for me: Google docs, Dropbox paper, notion, vscode. I don’t think I’m alone in this!

dsego · 2023-09-18T11:58:43.000000Z

I actually tried using open/libre docs a few years ago just because of it being open source. I was trying to make a point of using locally installed software and avoid google products. Then the thing crashed and I lost an hour of work because it didn't save a temporary version. That's when I gave up on it for good.

olddustytrail · 2023-09-18T14:18:53.000000Z

Libre office does keep a temporary version that allows you to recover, so you're talking crap.

dsego · 2023-09-18T19:26:08.000000Z

It might very well keep it, but either that behavior wasn't turned on by default or it crashed in way where it wasn't recoverable. I know I lost work.

mixmastamyk · 2023-09-18T20:58:48.000000Z

Doesn't need to be turned on. However it's always possible you mistakenly pressed Esc/Close, didn't read the dialog, or hit a very obscure bug?

However this has worked well for twenty years, so PEBKAC is a reasonable conclusion.

dsego · 2023-09-18T11:56:46.000000Z

That may have been true years ago in win 95 or xp days. The modern paradigm starting with google docs is that things are automatically saved and even always sharable through the cloud, making manual saving actually an atavistic leftover of a bygone era.

iggldiggl · 2023-09-18T12:16:20.000000Z

What if I actually don't want any changes saved because I've only opened a document for reference purposes?

ghkbrew · 2023-09-18T12:33:51.000000Z

Then you proactively prevent changes. Either "open a copy" or "open in readonly mode".

If you make saving the default you have to manually not save. It's a trade off versus default no saves with manual saves

ReactiveJelly · 2023-09-18T12:24:15.000000Z

Have a "Read-only" checkbox. For the love of God, have a "read-only" checkbox.

NavinF · 2023-09-18T12:59:18.000000Z

Modern apps like gdocs have a toggle for switching between read-only, suggest changes, and editing. If you forgot to toggle, you can just open version history and revert. MS Office also had version history for nearly a decade.

liuliu · 2023-09-18T13:53:34.000000Z

Maybe simpler? When open the DB, change it to WAL mode, turn off the automatic checkpoint https://www.sqlite.org/pragma.html#pragma_wal_autocheckpoint

When user saves, you just checkpointing the file, merging it back into the main database.

nyanpasu64 · 2023-09-18T09:50:02.000000Z

> The VACUUM command works by copying the contents of the database into a temporary database file and then overwriting the original with the contents of the temporary file. When overwriting the original, a rollback journal or write-ahead log WAL file is used just as it would be for any other database transaction. This means that when VACUUMing a database, as much as twice the size of the original database file is required in free disk space.

> The VACUUM INTO command works the same way except that it uses the file named on the INTO clause in place of the temporary database and omits the step of copying the vacuumed database back over top of the original database.

Do you use VACUUM (uses a write-ahead log to survive power-off) or VACUUM INTO (as far as I can tell, it doesn't survive power-off during writing, and might corrupt the existing file contents if the filename already exists)?

justsomehnguy · 2023-09-18T10:05:23.000000Z

>> The file named by the INTO clause must not previously exist, or else it must be an empty file, or the VACUUM INTO command will fail with an error.

EDIT: there is no difference between VACUUM/VACUUM INTO - they both write to a new file (COW) it's just VACUUM [NOT INTO] does mv temp.sqlite originalfile.sqlite after that, while VACUUM INTO does not.

ilyt · 2023-09-18T13:07:17.000000Z

We used something similar (DB doing caching run in memory but saved periodically on disk) but with backup API

https://www.sqlite.org/backup.html

remram · 2023-09-18T11:28:36.000000Z

Why not use a transaction?

p4bl0 · 2023-09-18T19:12:05.000000Z

A single transaction for the whole user session? That seems a bad idea. Also I'm not sure you can do transactions during another transaction, and I need them for other purpose, i.e., for what they were designed to do (doing changes in multiple tables that need to stay consistent).

subleq · 2023-09-18T21:09:39.000000Z

It’s exactly what transactions are for. A nested transaction is called a savepoint, which sqlite does support.

rewmie · 2023-09-18T10:48:00.000000Z

> I can directly make the change in the database I don't need to have a model of the document other than its database format.

I don't get your point. Are you saying that you don't need to have a model of the document other than the model of the document? What's the nuance I'm missing?

torstenvl · 2023-09-18T12:23:16.000000Z

An in-memory data model often differs from the serialized data as it exists on disk. For example, emacs uses a gap buffer for text files; but it outputs plain linear text to disk.

Programmers often have to make software design decisions around how to represent a file in memory in order to manipulate it. For example, if I'm writing an HTML editor, should I mostly treat it like a text file (maybe a gap buffer) with syntax highlighting and auto indentation as an afterthought? Or should I maybe load the whole thing into a tree? What are the robustness and performance characteristics of each?

The commenter above was saying that using SQLite made that decision easy. He could keep traditional (or "atavistic" per the commenter upthread, depending on your perspective) load/save semantics while also making the data model easy to work with.

rewmie · 2023-09-20T10:17:13.000000Z

> An in-memory data model often differs from the serialized data as it exists on disk. For example, emacs uses a gap buffer for text files; but it outputs plain linear text to disk.

The whole point of my remark is that the domain model and the export document format are two entirely separate things.

socksy · 2023-09-18T10:55:33.000000Z

I suppose this is in the context where you will be syncing up the changes to a backend server which will also be storing the document in an SQL database. Normally, you might expect that data format on the client to be JSON/XML/something else, and you'd need to maintain logic that marshalls the document representation

    SQL <-> In-memory representation <-> Disk format.

With SQL on the client, in theory you only now need to maintain

    SQL <-> In-memory representation

Obviously I'm skirting over the format you would use to send either entire documents or partial updates of documents over the wire.

p4bl0 · 2023-09-18T14:28:24.000000Z

When an application loads a document, for example if the document is formally a list of things (imagine a very simple TODO app), the usual approach is to have this data represented (modeled) as an actual list in your program, like a Python list of objects, because it's what is easy to manipulate programmatically.

Then, saving your document means serializing the data in some format (which could be JSON, XML, CSV, an SQLite database, …) and writing that to disk, and opening a document means reading the file from disk and unserializing it to your internal model.

What I'm saying is that my approach is to use an in-memory SQLite database as the internal model of the data in the applications. I presented an upside (opening and saving are easy), but is also has downsides: I have to do SQL queries to manipulate the data rather than manipulating objects directly (which could be mitigated using an ORM but that's outside my point). In Python-like pseudo-code you can imagining something like:

    self.todos[42].status = 'DONE'

vs

    self._db.query("UPDATE todos SET status='DONE' WHERE id=42")

(Of course there is the possibility of using ORMs or other approach in between the two.)

hot_gril · 2023-09-19T17:09:11.000000Z

Btw, Apple's CoreData, commonly used by iPhone and Mac apps, uses SQLite by default. That part works fine, so you can study it if you'd like and ignore all the bad parts built on top (ORM, MVC framework, etc).

miki123211 · 2023-09-18T13:29:01.000000Z

The problem with SQLite is that it's not a standardized file format. It's well-documented and pretty well understood for sure, but there's no ISO standard defining how to interpret an SQLite file in excruciating detail. Same goes for competing implementations, Zip and XML have a much smaller API surface than SQLite, whose API, apart from a bunch of C functions, is the SQL language itself. Writing an XML parser is not a trivial task, but it's still simpler than writing an SQL parser, query optimizer, compiler, bytecode VM, full-text search engine, and whatever else Sqlite offers, without any data corruption in the process. If Open Office used SQLite, its programmers would inevitably start using its more esoteric features and writing queries that a less-capable engine wouldn't be able to optimize too well.

This isn't a concern for most software. If you're writing a domain-specific, closed-source application where interoperability with other apps or ISO standardization isn't a concern, SQLite is a perfectly fine file format, but as far as I understand the situation, those concerns did exist for Open Office.

bane · 2023-09-18T14:11:28.000000Z

I'm not sure if the problem you are pointing out has to do with:

a) SQLite the file format - which is Public Domain and so well documented that parsers for it exist in numerous other languages even though it's almost pointless because...

b) SQLite, the Public Domain (and thus entirely source available) C implementation of the library that can operate on the file format -- and is documented to a level well above what most ISO standards shoot for. It's designed to be used in other software and has bindings for pretty much every major language.

c) Some notional OpenDocument stored in a SQLite file that's really just waiting for somebody to make and document.

ISO standards are great, but if we had to wait for ISO to define a file format we'd have pitifully little to work with.

btilly · 2023-09-18T18:14:54.000000Z

It is possible that the C implementation of SQLite is the single most commonly deployed software library ever. If not, then it is probably the second, after zlib.

https://www.sqlite.org/mostdeployed.html

Therefore I consider it a better supported format than most standardized formats.

lucideer · 2023-09-18T22:51:51.000000Z

That page makes the argument for zlib & sqlite, but Daniel Stenberg makes some good points here[0].

My guess would be zlib is still number 1 though, even accounting for Daniel's considerations.

[0] https://daniel.haxx.se/blog/2021/10/21/the-most-used-softwar...

pl4nty · 2023-09-18T23:16:37.000000Z

for one, it's bundled with consumer versions of Windows as winsqlite3.dll. not sure when this started though

sdeframond · 2023-09-18T19:18:21.000000Z

I think this has been discussed before about WebSQL.

> The [WebSQL] specification reached an impasse: all interested implementors have used the same SQL backend (Sqlite), but we need multiple independent implementations to proceed along a standardisation path.

https://www.w3.org/TR/webdatabase/

sdeframond · 2023-09-18T19:53:26.000000Z

The Chrome blog post about deprecating sqlite-based WebSQL makes an interesing point. I believe it applies to OpenDocument as well.

> The Web SQL specification cannot be implemented sustainably, which limits innovation and new functionality. The last version of the standard literally states "User agents must implement the SQL dialect supported by Sqlite 3.6.19". SQLite was not initially designed to run malicious SQL statements, yet implementing Web SQL means browsers have to do exactly this. The need to keep up with security and stability fixes dictates updating SQLite in Chromium. This comes in direct conflict with Web SQL's requirement of behaving exactly as SQLite 3.6.19.

https://developer.chrome.com/blog/deprecating-web-sql/

bane · 2023-09-18T23:28:04.000000Z

In other words, "all the implementors chose a standard, but we're the standard deciders so we're killing the whole idea".

dragonwriter · 2023-09-18T23:40:11.000000Z

One problem was the standard was bug-for-bug replication of a particular version of SQLite.

There’s very good reason for that not to be a standard. (Now, assuming the SQLite documentation is licensed in a way which supports this, copying the documentation of SQLite’s supported SQL as of that version into the standard might have been viable, but no one interested in having WebSQL proposed that or any other resolution.

That relates to the cited issue of absence of independent implementations, which would have been a problem even with a spec that supported independent implementations and verification of their compliance independent of a particular reference implementation. though I personally think the spec problem is a bigger real problem (even if not the decisive policy problem) than the “everyone is using the same underlying software to implement the spec” problem is in this case, where the shared implementation is a permissively licensed open source implementation sponsored by several of the browser vendors, among others.

bane · 2023-09-19T00:27:16.000000Z

hmmm...I appreciate the thoughtful reply. You bring up an interesting point. What is the SQLite documentation licensed as? I would assume PD like the rest of it, but I don't know that for sure.

anticensor · 2023-09-19T19:46:07.000000Z

SQLite itself is in the public domain.

sdeframond · 2023-09-19T06:23:43.000000Z

The standard deciders are the implementors. There is no point in opposing them. The W3C is actually the representatives of Google, Mozilla, Microsoft, Opera and so on.

zie · 2023-09-18T21:38:25.000000Z

100% agree and the Library of Congress loves it: https://www.loc.gov/preservation/digital/formats/fdd/fdd0004... and https://sqlite.org/locrsf.html

joshspankit · 2023-09-18T14:24:08.000000Z

Sounds like a solution is to use the C implementation to define the standard and have it canonized in to an ISO.

cornstalks · 2023-09-18T16:25:00.000000Z

That's what Opus did. The RFC[1] has a base-64 encoded libopus.tar.gz appendix (Appendix A), which is the "primary normative part of this [Opus] specification." If the prose and source code disagree, the source code takes priority and "wins" when it comes to which is normative.

I have a love-hate relationship with this approach.

[1]: https://datatracker.ietf.org/doc/html/rfc6716

dfox · 2023-09-18T16:31:09.000000Z

That is common for codec standards, the normative part of many MPEG specifications is the parser/decoder in C-like pseudo-code. What is somewhat unique for Xiph is that their normative reference decoders are actually usable.

Eduard · 2023-09-18T19:15:17.000000Z

funny, the RFC even includes a shell command pipeline to extract the base64 out of the awkward RFC formatting.

Using the C source code still leaves room for ambiguities / under-specification, no? After all, the semantics rely on the particular gcc release used for compiling the code.

cornstalks · 2023-09-18T20:09:50.000000Z

There is still the possibility of a bug or under-specification, but that's always the case in any spec. At least with Opus they document what implementation-defined behavior they require, so assuming there aren't any hidden bugs then you should get consistent output across compilers.

Eduard · 2023-09-19T19:20:46.000000Z

but the semantics change depending on the build tool version and other factors.

belenos46 · 2023-09-18T14:27:12.000000Z

Yeah, a solution in search of a problem.

gwd · 2023-09-18T13:49:34.000000Z

> Writing an XML parser is not a trivial task, but it's still simpler than writing an SQL parser, query optimizer, compiler, bytecode VM, full-text search engine, and whatever else Sqlite offers, without any data corruption in the process.

Just to clarify: You don't actually need to implement all that for it to be a standardized file format, any more than you need to implement all the spreadsheet functionality to be able to read a LibreOffice spreadsheet. All you need to do is to be able to reconstruct the tables. There's no reason, having reconstructed the tables, you couldn't write your own imperative code in the language of your choice to go over them and get whatever information you wanted.

justin66 · 2023-09-18T13:59:55.000000Z

> This isn't a concern for most software.

It's not even a concern for the US Library of Congress, which defined SQLite as a recommended storage format for datasets alongside CSV, XML, and JSON.

nelgaard · 2023-09-18T16:35:18.000000Z

But those are completely different uses of a storage format.

Library of congress considers if someone a 100 years from now could write a new importer in whatever langauge/AI they might use by then.

Office documents are something you send in email attachments to people you often barely know, and expect them to read it in whatever office system they have. And if the recipient uses e.g., Microsoft Word, OFD/Sqlite might not work.

justin66 · 2023-09-18T16:54:05.000000Z

It is true that it requires effort for the developers of a software program to support a given file format. Beyond that I'm not sure what your point is.

galangalalgol · 2023-09-18T19:25:20.000000Z

Not the op, but one point would be, why did we even pick xml, when we had latex and html? Why is a relational database the right tool for a document format?

cxr · 2023-09-21T13:03:55.000000Z

They're constrained by different requirements. The comment was clear enough:

"those are completely different uses"

It's not a hard concept to grasp. There is no riddle to decipher.

toast0 · 2023-09-18T18:44:12.000000Z

> Office documents are something you send in email attachments to people you often barely know, and expect them to read it in whatever office system they have.

Eh, if they're not running the same office system, down to patches, you can't really expect much.

sethev · 2023-09-18T16:07:02.000000Z

You seem to be mixing up the file format with how it's used. An application that uses SQLite's file format would use SQLite's library as part of the application. Yes, it would be quite a lot of work to replicate that library but in the same way that replicating the code that uses OpenDocument's file format would be.

The file format itself is pretty straightforward.

coliveira · 2023-09-18T13:45:12.000000Z

But you don't need a standard, because all interaction between applications and the document is made through SQL. And SQL is standardized (at least the parts that matter). If you have concerns about compatibility, make sure that the document can also be accessed through other databases (like mysql).

orra · 2023-09-18T13:48:02.000000Z

But other databases cannot access sqlite databases, because the file format is internal...

nojvek · 2023-09-18T14:08:06.000000Z

SQL file format is very well documented. In some universities it is an assignment to directly read and write sqllite files from disk and understand the paged and blocks structure.

You don’t need sql for any of it.

https://www.sqlite.org/fileformat.html

dunham · 2023-09-18T18:10:23.000000Z

It's interesting that this is a classroom assignment, like the sibling comment, I'd curious which university / class this was. I did the read part (+ query planning) on my own as an exercise, but I haven't gotten around to implementing writing yet.

You do need to parse DDL to get the column names, they're stored as a "CREATE TABLE" string. But you don't have to if you want to dump the file without names.

https://github.com/dunhamsteve/sqljs

avinassh · 2023-09-18T17:22:56.000000Z

> In some universities it is an assignment to directly read and write sqllite files from disk and understand the paged and blocks structure.

do you have any links?

bane · 2023-09-18T13:58:30.000000Z

https://github.com/pgspider/sqlite_fdw

orra · 2023-09-18T18:04:09.000000Z

I'll admit, that's a fantastic third party effort. But there definitely isn't the same level of first party support as there is for zip files.

hot_gril · 2023-09-18T21:20:58.000000Z

They can if they want to, using the standard SQLite lib or their own implementation.

marcinzm · 2023-09-18T17:57:36.000000Z

>at least the parts that matter

In my experience every part matters in non-trivial use cases since someone somewhere will use that part.

patapong · 2023-09-18T18:45:26.000000Z

This sounds exactly like the argument that killed WebSQL in 2010: https://en.wikipedia.org/wiki/Web_SQL_Database

I am still salty about this, as WebSQL would have made it much easier to build a certain class of web apps.

bb010g · 2023-09-18T22:25:35.000000Z

You can still use <https://github.com/jlongster/absurd-sql>. <https://jlongster.com/future-sql-web>

eternityforest · 2023-09-19T06:41:21.000000Z

It almost seems worth giving up ISO for SQLite, but I understand there are real concerns when you get into enterprisy stuff.

SQLite is kind of its own standard. It's public domain and they don't do breaking changes all day, and it's in C. As long as C is still viable, SQLite is usable on basically all non embedded platforms, and nobody really needs to reimplement it, unless they want to port it to Rust or something.

Not that you'd need to, since it's already very reliable.

paulddraper · 2023-09-18T15:05:38.000000Z

> less capable engine

There wouldn't be another engine.

It would be SQLite. Period.

cryptonector · 2023-09-18T15:53:01.000000Z

This could have been used as an opportunity to standardize the SQLite3 DB file format.

hot_gril · 2023-09-18T21:14:28.000000Z

I've never seen this as a problem, since plenty of random things are distributed as sqlite files. All the remaining questions for ODF would be about the schema design.

jimbokun · 2023-09-18T18:15:23.000000Z

Just define the schema and the semantics of each column for each table.

nyanpasu64 · 2023-09-18T08:55:29.000000Z

I was optimistic that Audacity adopting SQLite would be a substantial improvement in its file saving capabilities. In practice I encountered many gotchas:

- On Linux, saving into a new file onto a root-owned but world-writable NTFS mount created in /etc/fstab, fails due to permission errors or something. Saving into an existing file works as usual.

- Files are modified on disk when you edit the project in the program, creating spurious Git diffs if you check Audacity projects into Git as binary blobs. And when you save the file, old and deleted data is left in the SQLite file until you close the project's window (unlike saving a file in a text editor), and you can accidentally commit that into a Git repo if you don't close the window before committing. (I recall at one point that you had to manually vacuum the .aup3 file, but now closing the window is sufficient.) I'm getting Word 2003 Fast Save vibes.

chrismorgan · 2023-09-18T10:01:29.000000Z

It’s also a bit of a bother if Audacity crashes (or is otherwise terminated abnormally), as the cleanup just doesn’t happen at all then, whereas in the past the recovery process would mention the presence of orphaned blocks and allow you to choose to keep or delete them. But when I had a several-gigabyte project that should have only been a couple of hundred megabytes, and needed to save disk space, I finally found a solution suitable for my simple single-track stuff: Mix and Render. Doesn’t change the audio, but allowed it to clean up the detritus on save and exit. But all up, this is clearly an application-level problem, not something inherent to SQLite.

Hmm… I think I vaguely recall that Audacity 2 had the concept of a temporary working space, whereas it seems that Audacity 3 just uses the .aup3 file as its working space? Some advantages, some disadvantages.

Mildly less on-topic: I looked into Audacity 3’s format, and was utterly baffled by what they’ve done with the project data (what used to be the .aup file). They still encode it as XML, storing it in a single-row table, but instead of just writing it as text, they use a simplistic dictionary coder on it. Just… why? Why did someone go to all the trouble of writing that code? It makes interoperability and inspection much harder, surely harms performance (even if by a trivial amount), and the space saving will be rounding error in every plausible case (like, maybe as much as a few kilobytes out of hundreds of megabytes of audio files).

rini17 · 2023-09-18T09:22:18.000000Z

Yes it should replicate the functionality user expects - save everything into temporary file and overwrite the original file only on explicit save action.

As for Git, it would benefit from using text format specifically aimed for easy diffing/merging. No idea how easy the sqlite dump is in this regard.

gwd · 2023-09-18T10:02:20.000000Z

> As for Git, it would benefit from using text format specifically aimed for easy diffing/merging. No idea how easy the sqlite dump is in this regard.

The problem I'd predict here is that then people would expect to be able to do three-way merges. It might even work correctly a lot of the time, depending on the exact pattern of changes. But my gut feel is that unless the schema were designed just right, there would be possible merges that would result in a database that was valid from SQLite's point of view but insane from the application's point of view (broke expected variants, etc).

bawolff · 2023-09-18T13:06:17.000000Z

If you want to use flat files you should just use flat files. There are plenty of unix tools to treat them like DB.

You're not going to have a sensible text version of a btree that is reasonably editable by a text editor.

massysett · 2023-09-18T11:22:41.000000Z

I set up my Git to use the SQLite dump on SQLite files when using “git diff”. This at least shows me the changes row-by-row, or shows nothing if no changes.

I don’t expect to be able to merge though.

mcpackieh · 2023-09-18T18:46:10.000000Z

I have been told that the new generation of users does not expect, want or appreciate applications that use explicit saves.

I've also been told that they don't understand or even want to understand folders...

rini17 · 2023-09-19T06:39:43.000000Z

The context here is someone using git. Who presumably understands folders.

bawolff · 2023-09-18T10:01:51.000000Z

> Yes it should replicate the functionality user expects

Do users really expect this now a days? Most users use cloud apps, and almost all of those save after every operation automatically.

raxxorraxor · 2023-09-18T12:18:34.000000Z

Which is a compromise for using browsers really. It isn't a good solution and no user really understands this and I believe it is the most hated feature of the new cloud world. Yes, leaving the page open for multiple hours might now allow you to save because your access token expired. No, communication in the background is unreliable too. Autosave is a bad band aid for a bad solution.

Doing periodic and automatic saves is good. Doing so on a document "in production" is majorly stupid. Not that I want to accidentally validate the busy work dev ops puts us through.

bawolff · 2023-09-18T13:02:56.000000Z

Its pretty easy to make a cloud app that emulates the traditional working draft/save workflow. Browsers all have pretty reliable local storage technology now a days if your network is unstable. I don't think this design choice is a compromise of the medium. If anything it seems like if you were going to have to compromise for web you would do it in the other direction so apps are more usable during poor network conditions.

I would say the traditional model is a compromise from back when disks were unacceptably slow to be saving constantly.

anonzzzies · 2023-09-18T09:30:08.000000Z

My wife uses Audacity all day and every few days there is a corrupt sqlite file (duplicate key) which cannot be (as far as we know) repaired/reimported etc from Audacity. I can fix it manually if it's important, but usually just throw the file away and things work again.

justinclift · 2023-09-18T10:24:20.000000Z

Duplicate keys in a SQLite file sounds like an audacity bug. :(

sgarland · 2023-09-18T11:55:27.000000Z

Maybe. The SQLite list of gotchas [0] is quite something. NULLs in the PK? Sure. FKs don’t actually do anything unless you pass a PRAGMA? Why not? Etc. I could easily see someone not fully grasping just how much SQLite lets pass by default, and thus not having a test catch it.

[0]: https://www.sqlite.org/quirks.html

anonzzzies · 2023-09-18T10:34:07.000000Z

Yeah, it definitely is. And it's fixable manually. Kind of the advantage to an open file format with nice tooling.

regularfry · 2023-09-18T10:34:27.000000Z

It also sounds like something that could be manually prevented ahead of time. If you can crack open the file on first save and add the right uniqueness constraint, that should make Audacity crash when it tries to corrupt the data.

justinclift · 2023-09-18T14:37:49.000000Z

> ... that should make Audacity crash when it tries to corrupt the data.

That'd be fairly non-optimal behaviour. ;)

When the application tries to add wrong data (eg duplicate key violating uniqueness constraint), SQLite will return an error.

The application should handle things better than by crashing. In theory anyway. :)

regularfry · 2023-09-18T20:41:44.000000Z

If your choices are a crash or corruption, choose the crash.

justinclift · 2023-09-19T00:49:30.000000Z

Sure, data integrity is important.

But hopefully the choices are a bit better than just those two. :)

Freak_NL · 2023-09-18T10:41:54.000000Z

Good article. Although one thing I do like about OpenDocument being just a bunch of XML files in a ZIP archive is that it is fairly easy to generate documents like spreadsheets without using a (potentially hefty) library which knows about the document format.

I have a use case where users of a web service want to use data exported as a bunch of rows in a table in a variety of tools. Now, CSV with UTF-8 encoding is of course, totally open, conventional, and workable, but anyone who has ever offered CSV files to end users will know the pain of these users getting stuck when they want to use these files in a spreadsheet application¹. So I saved a sample spreadsheet in OpenDocument's ODS and another in that Microsoft XML abomination called OOXML as XLSX, and just figured out the basics of those XML formats. I trimmed the ZIP archives down to the essentials, marked the places where content goes, and just build a new spreadsheet file whenever data is requested in that format. Now I can output CSV, ODS, and XLSX (and JSON thrown in for good measure) of the same data.

Doing this with SQLite would be possible of course, just a tad more complex and with a lower development speed. Being able to fire up the office suite, create a template document, and just dig into its XML files in the saved file is a nice feature (although admittedly of niche interest).

1: More specifically, users who use Excel in a locale like nl_NL, where CSV files are, hardcoded, assumed to have their columns separated by semicolons, because Microsoft once notoriously decided that the Dutch did not use comma's in a comma separated values file.

dfox · 2023-09-18T16:46:53.000000Z

As for [1], it is not really hardcoded, but depends on what is the value of localeconv()->decimal_point, if it is “,”, excel uses semicolons both in CSV files and formula expression language.

This used to be configurable when opening CSV/TXT file in excel (and still is in LibreOffice) but as a part of the overall UI dumbification was moved somewhere under the “Data” menu/ribbon tab (so you have to open new workbook and find the right option, or well, use LibreOffice if you value your time).

Freak_NL · 2023-09-18T19:45:21.000000Z

> decimal_point

Are you sure that affects it? The decimal point parameter sounds like it decides how to write out 5½ (i.e., 5.5 (English style) or 5,5 (Dutch style)) surely? Although on the topic of this particular bête noire I would not be surprised.

stareatgoats · 2023-09-18T12:42:52.000000Z

As an aside, this blew me away. I can hardly believe it. No nested query required?

> SELECT manifest, versionId, max(checkinTime) FROM version;

> "Aside: Yes, that second query above that uses "max(checkinTime)" really does work and really does return a well-defined answer in SQLite. Such a query either returns an undefined answer or generates an error in many other SQL database engines, but in SQLite it does what you would expect: it returns the manifest and versionId of the entry that has the maximum checkinTime.)"

mrighele · 2023-09-18T13:08:46.000000Z

> Such a query either returns an undefined answer or generates an error in many other SQL database engines, but in SQLite it does what you would expect:

It may be a useful functionality, but it is NOT what I would expect such a query to return, to be frank.

Also you don't need a nested query in this specific, you can order by checkinTime and limit the result to one.

> select manifest, versionId, checkinTime from version order by checkinTime desc limit 1

or something like that. This should work in SQlite and Postgresql at the minimum. I think to remember that in Oracle you have to use "where rownum=1" so indeed you have to use a nested query. I don't know about other databases.

tannhaeuser · 2023-09-18T15:43:26.000000Z

I agree, that doesn't make sense to me either. What about select versionId, max(checkinTime), min(checkinTime)? Can as well query SqlGPT. And above all, it's not what the SQL standard says when that's the entire point of using a standard in the first place.

OskarS · 2023-09-18T17:14:43.000000Z

Well, it doesn't error out! In this example, it seems like it picks the result from whatever matches the last column, but not sure if this is determinstic:

    sqlite> create table x(c1, c2);
    sqlite> insert into x values ("a", 1);
    sqlite> insert into x values ("b", 2);
    sqlite> insert into x values ("c", 3);
    sqlite> select c1, max(c2) from x;
    c|3
    sqlite> select c1, max(c2), min(c2) from x;
    a|3|1
    sqlite> select c1, min(c2), max(c2) from x;
    c|1|3

(note: since SQLite is dynamically typed, no need to specify column types for simple examples like this).

globular-toast · 2023-09-18T13:46:21.000000Z

The interesting thing is if you want more than one record, like you want the latest version number for each document ID. In SQLite you could do: `SELECT documentId, versionId, max(checkInTime) FROM version GROUP BY documentId`. In Postgres you can do `SELECT DISTINCT ON (documentId) documentId, versionId, checkInTime FROM version ORDER BY versionId, checkInTime DESC`.

See: https://www.sqlite.org/lang_select.html#bare_columns_in_an_a...

paulddraper · 2023-09-18T15:07:55.000000Z

MySQL allows the query, but the non aggregate fields are selected randomly

wongarsu · 2023-09-18T15:27:02.000000Z

Following MySQL's longstanding tradition of just doing whatever instead of showing an error message, no matter how unreasonable the result.

paulddraper · 2023-09-18T15:47:29.000000Z

MySQL + PHP, name a more iconic match

sgarland · 2023-09-18T17:34:58.000000Z

It only allows that if you’ve set it to do so. The default SQL_MODE variable includes ONLY_FULL_GROUP_BY.

However, in their brilliance, AWS RDS defaults to only NO_ENGINE_SUBSTITUTION for SQL_MODE, thus merrily allowing partial aggregates with non-deterministic results. Wheee!

https://github.com/awsdocs/amazon-rds-user-guide/issues/160

paulddraper · 2023-09-18T18:24:50.000000Z

Prior to 5.7, MySQL always accepted non-aggregated fields.

Version 5.7 introduced ONLY_FULL_GROUP_BY, but since that change broke lots of code that depended on this historical behavior, many people disabled it.

asddubs · 2023-09-18T18:51:15.000000Z

randomly, but after filtering by the criteria in the WHERE part of the query. This can actually be useful sometimes if all non-aggregate fields contain the same value (though I wouldn't actually rely on it, since whether this is allowed depends on how the database is configured, and it makes it easy to introduce errors by changing the query)

deathanatos · 2023-09-18T22:43:05.000000Z

> or something like that

That query isn't guaranteed to produce a well-defined result in most SQL engines. (For pretty much the same reason the original doesn't/can't/shouldn't…) In the simple case of two rows with the same `checkinTime`, many engines permit the results to be ordered arbitrarily.

cryptonector · 2023-09-18T15:51:56.000000Z

That's basically short-hand for

  SELECT manifest, versionId, max(checkinTime)
  FROM version
  GROUP BY manifest, versionId
  ORDER BY 3 DESC LIMIT 1;

or

  WITH m AS (SELECT max(checkinTime) AS checkinTime FROM version)
  SELECT v.manifest, v.versionId, v.checkinTime
  FROM version v
  JOIN m m USING (checkinTime)
  LIMIT 1;

It's a bit of a footgun though because there is some randomness here if multiple rows have the same max checkinTime, so I try not to use this SQLite3-ism. You want to also do something to deterministically pick a "best" row, but for that you need to do something like the above.

mwexler · 2023-09-18T13:02:43.000000Z

It's not really what one would expect in SQL, but SQLite often defies expectation. In this case, handy, but non-standard.

globular-toast · 2023-09-18T13:36:03.000000Z

I think this is a convenient side-effect of the implementation which was later turned into official behaviour. A bit like Python dictionary key ordering.

In Postgres you can do similar things with a DISTINCT ON query.

I always found this one of the hardest simple things to do in SQL.

ealexhudson · 2023-09-18T13:52:17.000000Z

I think if you have a set that you want the latest value from, in all engines you can do something explicit like:

> SELECT manifest, versionId, checkinTime FROM version ORDER BY checkinTime DESC LIMIT 1

The problem with putting aggregation functions in the select output is that you're being unclear about what is aggregating; that pattern begins to break down once you have e.g. multiple documents in the same schema. Or if versionId somehow wasn't linear (e.g. branches of changes).

deathanatos · 2023-09-18T22:39:10.000000Z

That's um… quite the aside. How can it possibly claim that to be well-defined, given that `manifest` and `versionId` are not functionally dependent¹ on `max(checkinTime)`?

¹e.g., there could be two rows with the same checkinTime, whose value happens to then be the max such.

kevincox · 2023-09-18T18:26:40.000000Z

The docs don't make it clear that this works as stated. The first docs I found don't say that they come from the matching row:

From https://www.sqlite.org/lang_select.html#generation_of_the_se...

> Each non-aggregate expression in the result-set is evaluated once for an arbitrarily selected row of the dataset. The same arbitrarily selected row is used for each non-aggregate expression.

Does `max` somehow only affect the selected rows? Or is this relying on a side affect of the query planner sorting the table to optimize max?

However then I found https://www.sqlite.org/lang_select.html#bare_columns_in_an_a...

> If there is exactly one min() or max() aggregate in the query, then all bare columns in the result set take values from an input row which also contains the minimum or maximum.

In all of the nearby examples contain an explicit "GROUP BY" clause but I don't think that this section says that one is required for this behaviour. So I guess this is the behaviour that is being described.

However I found this rule as well:

> If the same minimum or maximum value occurs on two or more rows, then bare values might be selected from any of those rows. [...] The choice might be different for different bare columns within the same query.

Which is in conflict with the earlier rule which says that the row is consistent. Or is this consistent row rule only provided for the implicit grouping. AKA is a and b guaranteed to be from the same row for the first query but not the second? That would be very surprising, maybe the docs just promise too little?

    SELECT a, b, MAX(c)
    FROM t

    SELECT 1 as group, a, b, MAX(c)
    FROM t
    GROUP BY 1

There are also more not-well sepcified results if multiple of MIN or MAX are used or if these functions or customized. So overall it is probably best to avoid this in "production" use. But can be convenient for some quick exportation if you are careful.

gwbas1c · 2023-09-18T13:21:50.000000Z

I shipped a product that used both SQLite and XML files.

One of the improvements that I made was moving a few tables that contained small amounts of data to xml files. Because these files were small and rarely written; it simplified the data access layer, and simplified diagnostics. (I made sure the files were multi-line tabbed xml.)

For "technical" people who needed to diagnose the product, asking them to crack open a SQLite database was a huge ask; but for the major part of the product that used SQLite, it was hands-down better than XML files. (An older version of the product used XML files. It had scalability problems because there's no good way to make an incremental update to an XML file.)

The advantages of XML, specifically, a human-readable format; really only work for small files when the design of the schema is optimized for readable XML. Unfortunately, the need to always rewrite the entire XML file, and the "complexities" that come with lots and lots of features will quickly erode XML's biggest advantages.

IMO: A "lay" person needing to muck around with the internals of an office document is fringe enough that learning to use a SQLite reader is an acceptable speed bump. The limitations of XML + Zip, when it comes to random writes in the middle of a file, just can't be overcome by Moore's law.

Tempest1981 · 2023-09-18T17:09:17.000000Z

I'm unclear on how SQLite (native format, no zip) is achieving sizes similar to XML + Zip. Are SQLite TEXT or BLOB fields compressed? Or are they assuming the caller is compressing BLOBs before writing?

gwbas1c · 2023-09-19T03:46:45.000000Z

SQLite does not compress, as far as I know.

Engineering is all about tradeoffs: SQLite is optimized for quick incremental updates where you don't need to rewrite the whole file. Zip & xml aren't. (IE, if you decide to add a letter to a word at the beginning of a document, with zip & XML you have to rewrite the whole document. SQLite can make a minor change without the whole rewrite.)

In our case, file size was not a factor in choosing between SQLite and XML.

But, remember that file size is deceptive: Disks are block devices; the 30 byte and 1k file take up the same space if you block size is 2k. (I've shipped a filesystem driver.) HTTP servers gzip on download. It's more important to know your needs than to get hung up on a single metric like file size.

> I'm unclear on how SQLite (native format, no zip) is achieving sizes similar to XML + Zip. Are SQLite TEXT or BLOB fields compressed? Or are they assuming the caller is compressing BLOBs before writing?

Remember, XML writes each tag name 1 time if there's no content and twice if there is. Each attribute has it's name written every time. I doubt SQLite writes all the metadata in each row.

hddqsb · 2023-09-21T08:22:29.000000Z

The article assumes the caller compresses the blobs.

ealexhudson · 2023-09-18T09:05:52.000000Z

ODT was designed to be standardised: while the predecessor format was very similar too, it relies very heavily on XHTML, SVG, and CSS, to name but three (there's a lot more).

Without being able to call out to existing standards, the ODT spec itself would suddenly become massive. The effort to update the standards appears to be significant and hasn't progressed much in recent years already :/

I think realistically, an Sqlite format could be offered as an option, but the office doc ship has really sailed.

Good argument to formalise the spec of Sqlite as a standard though...

dfox · 2023-09-18T16:57:30.000000Z

The specification is massive (840 pages) even though it is written in very terse way that does not really specify the effects and behavior, only the syntax.

On the other hand if one ignores few warts (explosion of local styles and text spans due to ooo:rsid attribute, non-sparse spreedsheets and weird mechanism for styling tables as a few examples) it is really well designed markup for this kind of document data that strikes right balance between it being semantic markup and representing the kinds of stuff users want to do. Compare that with Office OpenXML with stateful formatting empty tags (yes, really, in DOCX <b/> _TOGGLES_ whether following text is bold).

orf · 2023-09-18T09:28:47.000000Z

Coupling a file format to SQLite smells wrong.

SQLite is good, but it is also fairly unique in this space. Why? Because it’s hard to replicate everything it does, because it does a lot.

But… for this case, do we need it do a lot? No, not really. We don’t need the full SQL standard, a query optimiser, etc etc for basic (+ safe) transaction semantics and the ability to store data in a basic table structure.

Perhaps there is a better file format we can use, but it would be better if it was decoupled from SQLite.

jve · 2023-09-18T09:57:56.000000Z

- Why not? https://www.sqlite.org/appfileformat.html

- Its size is less than a megabyte: https://sqlite.org/footprint.html

- 750KB if all features are enabled: https://www.sqlite.org/about.html

- Looks like fair amount of functionality can be left out when compiling sqlite and with options to influence/strip down query planner: https://www.sqlite.org/compile.html

- And "SQLite does not compete with client/server databases. SQLite competes with fopen()": https://www.sqlite.org/whentouse.html

In the end, you don't need a database, but a library that gives you database API and behavior.

orf · 2023-09-18T10:19:31.000000Z

> In the end, you don't need a database, but a library that gives you database API and behavior.

Why do you need a single library that gives you a database API and behaviour?

Wouldn't it be better to decouple those: provide an open, standard format that enables compact, fast, structured storage that is built to allow transaction/atomic updates.

If that exists then you can plug sqlite on top of that, or something else. Because you don't _need_ any of SQL, or really sqlite to improve the OpenDocument format. You need the storage format.

OpenDocument is very different from the pretty scientific/niche/highly-vendor-locked examples given in replies by others here. Locking this into a format developed by essentially a single person with a single implementation is absolutely mad.

But... it's less mad if the file format wasn't coupled to sqlite.

nurbl · 2023-09-18T11:37:42.000000Z

A great thing about just using sqlite as the format is that you get lots of potential features. Sure most applications don't need full SQL power just to save and load data. But then at some point you might want more advanced functionality, or migrate to a new structure. And both you and your users get tools for free, e.g. to extract data or fix problems, or just look around. Other applications can quite easily read your files, without you needing to write various language libraries. Very few projects get around to building that kind of tools for their made up format.

I could agree about the single implementation, but if the alternative is making something new up I am not sure in what way that would be better.

colonwqbang · 2023-09-18T11:56:56.000000Z

The Sqlite format is open and the spec is here: https://www.sqlite.org/fileformat2.html

I haven't studied the spec in detail but it seems comprehensive.

The fact that there also exists a high-quality, stable, public domain reference implementation can't really be counted against the format, can it?

hot_gril · 2023-09-18T21:38:10.000000Z

> Wouldn't it be better to decouple those: provide an open, standard format that enables compact, fast, structured storage that is built to allow transaction/atomic updates.

The high-level software abstraction approach doesn't hold up when it comes to databases. This is such a wide and performance-critical interface that any abstractions are gonna leak badly. Even the SQL standard has all these impl-specific flavors. Many have tried to build layers on top that'll work with multiple DBMSes, and it's never worth. Anyone writing an app backend is just gonna marry a particular DBMS for the performance benefits (puns intended).

If for some reason an alternative implementation really needs to exist, SQLite is simple and open enough that someone can do it.

orf · 2023-09-18T22:35:31.000000Z

You’re totally right, but there are a few things missing: this isn’t a DBMS, really, and the files are not going to be huge.

You need fast listing/pagination, key value get/set, and transactional updates. Basically DynamoDB, but for a single file. Build a query layer on top of that, sure. Use those primitive to build persistent indexes if you want.

Or just iterate through the keys in a for loop. It fits in memory anyway.

You don’t need a fully fledged DBMS for a word document. And if you’re shuffling around lots of data in a structured format with no updates needed, you probably want arrow/parquet rather than sqlite because the read performance is going to crush SQLite.

hot_gril · 2023-09-18T22:52:15.000000Z

I don't know, probably a lot of us have dealt with large docs that become noticeably slow to edit and scary to save, mostly spreadsheets.

orf · 2023-09-18T23:02:11.000000Z

Ok cool: so adding SQL to that is going to magically speed it up?

No. It’s the on disk format that matters. Because it would be just as slow and scary if it used a sqlite file that was embedded in a zip file or something equally as mad.

It’s not the SQL, it’s the file format.

If you decouple the file format from the SQL engine, it becomes simpler to reimplement, more agnostic and less vendor locked.

hot_gril · 2023-09-18T23:18:14.000000Z

SQLite (not just the language SQL) would make it much easier to reimplement in a way that's fast and safe, yes.

> If you decouple the file format from the SQL engine

That alone would be a difficult project. If you really want to break things down, easier to say your document standard relies on SQLite's rather simple query language (https://www.sqlite.org/lang.html), and there, it's independent of SQLite's query planner and file format. Wouldn't be hard to make it work with Postgres or MySQL, for instance.

orf · 2023-09-18T23:24:32.000000Z

I’d love to understand your thinking behind the idea that a document standard should rely on a query language and not a file format…

Document standards are file formats…

Or are you saying a document format should just be some DDL statements? What? How is that interoperable? It’s coupled to the database that is storing the data as an implementation detail, which is exactly the problem with using SQLite.

> That alone would be a difficult project

I’m not suggesting using the SQLite file format, I’m suggesting the pretty basic idea that the storage for a general purpose widely used and interoperable document format should be logically decoupled from anything else, and definitely not be tied to the implantation details of a single library or even a single version of that library.

The file format is the most important part. It’s the only part. Nothing else matters because there is nothing else.

It’s not rocket science.

hot_gril · 2023-09-19T01:39:52.000000Z

> Or are you saying a document format should just be some DDL statements? What? How is that interoperable?

Yes. How is it interoperable, because it's quite easy to make DDL for SQLite that also works for many other DBMSes, given that SQLite is kinda the lowest common denominator of those.

Maybe not as interoperable as ODF since it's easier to implement an ODF parser/writer than a SQLite clone, but probably more interoperable than some kind of advanced ODF designed for efficient updates. Just because you define a standard doesn't mean there are good portable implementations out there.

orf · 2023-09-19T08:19:14.000000Z

> Maybe not as interoperable as ODF since it's easier to implement an ODF parser/writer than a SQLite clone

Ladies and gentlemen: he’s so close, he’s nearly there, but he just can’t make the final connection!

Ensorceled · 2023-09-18T10:12:10.000000Z

The complaint is not “it isn’t good” but rather “it is not replaceable”. Since SQLite is so powerful, once you specify it as a format, you are stuck with SQLite forever.

ncruces · 2023-09-18T10:16:15.000000Z

Which is also "not a big issue", since it's a recommended Library of Congress storage format, and supported long term:

https://www.sqlite.org/locrsf.html

https://www.sqlite.org/lts.html

5e92cb50239222b · 2023-09-18T10:52:28.000000Z

It is somewhat of a problem: the development team is very small, they don't take outside contributions (so nobody outside the core team really builds up expertise over time), and the vast majority of tests are proprietary. I hope they have a contingency plan just in case (some sort of a dead man's switch that publishes the test suite under a permissible license) as it would probably be quite difficult for others to maintain the same quality without those tests, or re-implement them in a reasonable time frame.

fweimer · 2023-09-18T12:00:35.000000Z

But that equally applies to getting critical bug fixes for your particular usage scenario of SQLite. It's not just about the viability as a storage format.

For the latter, because the stored data has such a simple format and the implementation has so few dependencies, I expect it will be very easy to get your data out for a long time to come. It's going to be tougher if you have business logic in views or other SQL expressions, of course, and if you rely on SQLite's particular approach to data types (as in “values have types“, but not much more).

floppydiscen · 2023-09-18T11:57:37.000000Z

I'm pretty sure this is why libsql was created https://github.com/libsql/libsql

Ensorceled · 2023-09-18T10:22:06.000000Z

Which is why I was clarifying the original complaint and not supporting the original complaint.

hot_gril · 2023-09-18T21:52:48.000000Z

Once you pick ODF as a format, you're stuck with it forever... except I wouldn't categorize it as powerful.

fweimer · 2023-09-18T11:05:14.000000Z

I found the transactional aspect surprisingly difficult, especially with concurrent file access. SQLITE_BUSY handling was quite hard at the time. I know that serialization failures are expected in transaction processing, but for SQLite it was very difficult to tell persistent failures (say, due to self-deadlock) apart from transient concurrent update problems. For transient failure, you can re-execute the closure defining the transactional operation, but for persistent failure, that's of course pointless.

Part of the problem is that sqlite3_stmt combines aspects of both prepared statements and result sets. There is a tendency to keep them around to cache the compiled bytecode (prepared statement), but your might code might stop mid-iteration (result set), maybe holding a lock at this point. This can lead to surprising lock-upgrade failures. In the end, I wrote extensive error reporting using sqlite3_next_stmt, sqlite3_stmt_busy, sqlite3_sql, just to weed out those issues. The entire transaction retry code I wrote is full of optional logging and many comments, even though it was just for my own personal use. Before that, I wrote transaction retry logic for PostgreSQL, and that was so much easier (but it was before fully SERIALIZABLE transactions arrived).

The other surprise is that “ A transaction committed in WAL mode with synchronous=NORMAL might roll back following a power loss or system crash.” (https://sqlite.org/pragma.html#pragma_synchronous), but that wasn't relevant to my application.

HelloNurse · 2023-09-18T12:26:06.000000Z

If you retry your write several times and it doesn't succeed you can tell the user it is a persistent failure without agonizing too much over the diagnosis: it is persistent enough to be a significant problem, even without proof that it is an application bug.

Who would attempt to make concurrent writes to an application document format? And how wouldn't such an attempt be a user mistake? Failing to write is the solution, not the problem.

fweimer · 2023-09-18T14:06:43.000000Z

These concurrency failures in transaction processing can be quite rare, but you have to fix them if you want 24/7 unattended operation. SQLite only has timeout-based conflict detection, so you basically have to decide whether you want to wait 60 seconds (or so, depends on how glitchy your storage is) before reporting a potential self-deadlock, which isn't great for development, or risk failing unnecessarily when actually running the job. I had no idea what the right timeout was, which is why I wrote some of the custom self-deadlock detection logic. I think I got to run it completely reliable in the end (no false aborts even under load), but as I said, it was surprisingly hard.

By the way, concurrent read/writes on locally stored documents happen, even on single-user machines. If the reading process uses the SQLite structure (say a document indexer that knows about the format), it has to take some locks and may also need to flush data from the WAL log (depending on implementation details). At this point you have to deal with concurrency issues in the application, too. Unless you rewrite the entire document from scratch on every save and put it in place with an atomic rename (which I ended up doing for a different application, not the transaction-processing one). But that loses some of the advantages of SQLite.

HelloNurse · 2023-09-18T15:01:54.000000Z

You are clearly discussing a shared database for concurrent transaction processing ("you want 24/7 unattended operation"), not people editing application document files.

Setting aside technological details, multiple clients operating on independent rows of the same table can only, at worst, waste time by retrying a transaction, while multiple concurrent users attempting to modify the same document are asking for trouble, and if they succeed they probably succeed at corrupting the document.

Even without lock contention the aggregate document state can be incoherent (for example, Alice and Bruno edit a text, but they accidentally modify the same section and the latest save prevails and nobody notices).

severak_cz · 2023-09-18T09:42:59.000000Z

SQLite is already used for exactly this purpose. It's used as OGC GeoPackage and Mabox/Maptiler datasets use this.

amiga386 · 2023-09-18T11:51:45.000000Z

Exactly. Some formats are designed, first and foremost, for interchange. SQLite is pitching that you, as an "app" owner, force the SQLite format upon your users to make it a de-facto standard, without putting the work in to make it a de-jure standard.

Show me a formalised ISO / IEC / ANSI / ETSI SQLite standard that the Richard Hipp and his company never deviates from, and the full legal search to ensure there are no patents that might affect it, and show me the multiple compatible implementations of SQLite that _all_ have these touted advantages, and _then_ we can talk about prosletizing it as a file format. If they don't, they're saying "take a hard dependency on a single-source implementation, and make all your users take it too".

XML is a formal standard. ASN.1 is a formal standard. JFIF is a formal standard. Even ZIP is a formal standard (adopted as part of standardising OpenDocument: ISO/IEC 21320-1:2015)

The most important thing about a document is that everyone _else_ can read it. Saving time on writing updates to disk is an irrelevant sideshow. Did we learn nothing from Microsoft perverting the standards bodies to try and keep its lock-in?

https://arstechnica.com/uncategorized/2008/10/norwegian-stan...

> A letter of resignation written by the departing members and made public by The Inquirer accuses the standards body of folding to pressure from Microsoft, violating its own procedural rules, and ignoring the analysis of the technical committee tasked with evaluating OOXML.

tl · 2023-09-18T14:44:55.000000Z

Performance matters and is sufficiently captured via working incremental updates. The single largest upside of a proposal like this is captured by using SQLAR over ZIP. That's what the Library of Congress does when SQLite claims them as a proponent. It's what Fossil does as others in this thread have pointed out. It's suggested as "first improvement" in the linked article. It's also the only part that should actually be considered for implementation.

You are right to point out the folly of deeper implementations like having and needing to understand table structures for things like slides. However, the current status quo involves Microsoft implementing a fairly esoteric "update the XML file's bytes as they would be encoded in a ZIP file" in their proprietary tool (where they have enough money to invest the engineering time) and all other tools use the slower "whole file in memory" approach.

User visible features like incremental fast saves (and shared editing) keep people on closed systems and give Microsoft the leverage to do the things you warn against. SQLite as a container format could have prevented that by giving everyone a shot at a lower cost but still fast implementation.

numeromancer · 2023-09-18T14:26:35.000000Z

How much work does it take to go from an engine that can read standard XML to one that can read an ODT document's XML and do something useful with it? At what point of complexity does that engine create a de facto standard?

indymike · 2023-09-18T19:13:57.000000Z

> SQLite is pitching that you, as an "app" owner, force the SQLite format upon your users to make it a de-facto standard, without putting the work in to make it a de-jure standard.

From TFA:

Note that this is only a thought experiment. We are not suggesting that OpenDocument be changed.

hot_gril · 2023-09-18T22:23:56.000000Z

I think SQL, a formal standard, has shown that formal standards fail to define a good way to interact with a database. The only real implementations all broke the standard. And an editable document isn't far from a database.

punnerud · 2023-09-18T09:35:12.000000Z

Have you checked the Apple apps? Most of them use SQLite as storage format. iMovie, iPhoto, Voice recording…

Same with Docker. Can’t be that wrong?

tuyiown · 2023-09-18T09:51:58.000000Z

App using a format specific to their own and unique implementation, that ends up kind of proprietary is perfectly ok.

Using it for a open specification which target is cross implementation compatibility makes the move way more hazardous. Meaning, every implementation has to run on environment targetable and compatible wit sqlite or has to re-implement a compatibility layer on something complex enough that you only reliable definitive source of truth is the very famous sqlite test suite.

It the same reason why Web SQL has being abandoned: if sqlite is the sole api implementor, it takes precedence on any others specs, and you have no control on your standard.

I would be 100% for a specification on how to map open docs files to an relational structure, though, with a well know sqlite-backed implementation.

bawolff · 2023-09-18T10:04:30.000000Z

> Meaning, every implementation has to run on environment targetable and compatible wit sqlite

Well i get what you are saying, sqlite has been ported all over the place. It probably wouldn't be the limiting factor portability wise.

tuyiown · 2023-09-18T11:59:39.000000Z

Yes, right now, no problem, and there little foreseeable future where a sqlite ported to anything would pose a problem. Still, decisions with no way back like this one requires extra cautions.

JimDabell · 2023-09-18T10:22:17.000000Z

The Apple apps are using Core Data, which uses SQLite as its persistent store by default. So Apple could in theory migrate away from SQLite by changing Core Data’s behaviour without any application-level impact. So in a way, these applications are already decoupled from SQLite in the way the parent comment suggests.

rakoo · 2023-09-18T12:01:38.000000Z

To add to your point, fossil, the versioning system designed by the people of SQLite, and using SQLite, doesn't even use SQLite as a file format. It's all a bunch of blobs, each with its own format, that happen to be stored on SQLite. SQLite offers safe storage and a bunch of helpful indexes and views, but is not necessary for fossil-the-data to work.

livrem · 2023-09-18T13:32:25.000000Z

Looking in sqlite.fossil there are 27 tables in it and most are not used for storing blobs. I know when looking up how to do things in the past the answer has sometimes been "run this SQL query". The event table for instance looks like a list of all commits with dates and comments etc. There is a config table that looks like the kind of stuff git stores in .git/config (URL to upstream repo etc) and so on. Well, yes there are some blobs in it too.

rakoo · 2023-09-18T14:27:34.000000Z

As described in https://fossil-scm.org/home/doc/trunk/www/tech_overview.wiki, all the commits are stored as artifacts, and then fossil creates metadata tables for quick access to useful information.

Configuration of a repo indeed isn't defined as an artifact but as a SQLite table. One may wonder if this should be part of a repo, and I would say it should, so it actually is surprising that it's not also stored as artifacts

hot_gril · 2023-09-18T21:32:01.000000Z

You do need all these things for these applications. Efficiently and safely querying and writing data is central to any document format; you'll leverage both the file structure and in-memory structs to do this. SQL would probably work for this, in fact it's especially natural for spreadsheets (rows x cols).

orf · 2023-09-18T22:09:59.000000Z

I mean, clearly you don’t: OpenDocument works just fine without it.

You need key/value lookup, a way to list/paginate, and transaction semantics for updates.

AKA: a zip file with entries as keys, and XML documents or attachment blobs for values. What’s lacking and causes issues is the update semantics.

You can wack a SQL query language over those 3 operations if you’d like. Or don’t. Up to you, because the format is defined and can be reimplemented rather than the large, complex library api.

hot_gril · 2023-09-18T22:17:45.000000Z

OpenDocument also leverages the file/dir structure for efficient querying like we're both saying. If you mean that you don't need SQLite over ODF, well yeah, ODF works too. I just wouldn't prefer it.

What's the issue with the update semantics, though?

orf · 2023-09-18T22:23:18.000000Z

The issue is incremental updates, and this is where things get complex. If you have a file embedded in the middle of a zip file that is 100 bytes, and you want to resize it to 150 bytes, how do you do that?

You can’t squeeze it in without moving everything else about, which disrupts other readers. You could append it to the end maybe, but you need to handle concurrent writers. Compression also is an issue here - I expect zip compression is applied to multiple files at once, rather than per file? So now you might need to update multiple seemingly unrelated files.

You need to step down from the concept of a whole file as a unit and move towards pages of data that can be incrementally updated/reused/freed, where each page might contain one, many or even only a part of a “unit” (file/row/whatever)

This makes things more complex for sure

hot_gril · 2023-09-18T22:32:01.000000Z

So basically ODF loads everything into memory, relies heavily on in-memory structs for quick unsaved updates, and is crash-safe by writing the whole zip to a temp location during saves. Kinda similar to MS Office. The file structure also helps a little. This is good enough for small docs.

Many times have I encountered large docs, often spreadsheets, that push the limits here and become noticeably slow. If you want to get more sophisticated with the indexing and paging, SQLite is a very natural path. Anything else would be reinventing the same wheels SQLite has spent decades refining.

orf · 2023-09-18T22:41:31.000000Z

> Anything else would be reinventing the same wheels SQLite has spent decades refining.

Which is exactly the problem. They (I.e one dude?), and they alone have spent decades refining a single implementation.

Before we go and lock the entirety of the worlds documents into what’s essentially a proprietary format specific to a single implementation of a single library written by a single dude… we should double check if that’s a good idea or not, and if we can, collectively, solve some of these issues without reimplementing the whole of SQLite.

Because that’s complex. Perhaps more complex than it needs to be for most applications, which would benefit from the storage part more than the query part. And then we are back at the start of our discussion?