File Attachments: Databases can now store files and images (xata.io)
130 points by todsacerdoti on Aug 30, 2023 | 102 comments



I think for smaller projects just storing images as BLOBs in e.g. PostgreSQL works quite well.

I know this is controversial. People will claim it's bad for performance.

But that's only bad if performance is something you're having problems with, or going to have problems with. If the system you're working on is small, and going to stay small, then that doesn't matter. Not all systems are like this, but some are.

Storing the images directly in the database has a lot of advantages. They share a transactional context with the other updates you're doing. They get backed up at the same time, and if you do a "point in time" restore they're consistent with the rest of your data. No additional infrastructure to manage to store the files.
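
To make that concrete, here's a minimal sketch of the transactional advantage, assuming psycopg2 and hypothetical posts/post_images tables with a bytea column: the row and its image either both commit or both roll back.

    # Sketch only: table and column names are made up for illustration.
    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical connection string
    try:
        with conn, conn.cursor() as cur:  # commits on success, rolls back on error
            cur.execute(
                "INSERT INTO posts (title, body) VALUES (%s, %s) RETURNING id",
                ("Hello", "First post"),
            )
            post_id = cur.fetchone()[0]
            with open("header.jpg", "rb") as f:
                cur.execute(
                    "INSERT INTO post_images (post_id, data) VALUES (%s, %s)",
                    (post_id, psycopg2.Binary(f.read())),  # stored as bytea
                )
    finally:
        conn.close()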


I think the one big problem with BLOBs, especially if you have a heavily read-biased DB, is you're going to run up against bandwidth/throughput as a bottleneck. One of the DBs I help maintain has some very large JSON columns and we frequently see this problem when traffic is at its peak: simply pulling the data down from Postgres is the problem.

If the data is frequently accessed, it also means there are extra hops the data has to take before getting to the user. It's a lot faster to pull static files from S3 or a CDN (or even just a dumb static file server) than it is to round trip through your application to the DB and back. For one, it's almost impossible to stream the response, so the whole BLOB needs to be copied in memory in each system it passes through.

It's rare that any request for, say, user data would also return the user avatar, so you ultimately end up with one endpoint for structured data and one to serve binary BLOB data, which have very little overlap except for ACL checks. Signed S3 URLs will get you the same security properties with much better performance overall.
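
As a hedged sketch of that signed-URL approach (the bucket and key names are assumptions, not from the comment): the application does its own ACL check and then hands the client a short-lived URL, so the bytes never pass through the app server or the database.

    import boto3

    s3 = boto3.client("s3")

    def avatar_url(user_id: str) -> str:
        # The application's own ACL check for user_id would go here.
        return s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": "my-avatars", "Key": f"avatars/{user_id}.png"},
            ExpiresIn=300,  # URL stays valid for 5 minutes
        )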


I’m sure you’re right, but it’s unbelievable to me how cost effective people claim S3 is. I’ve just never been able to get the pricing calculator to show me “cheap”. And I guess I’ve never really gotten comfortable with the access patterns, as it’s not a file system.

Where could I explore situations where people have used S3 with extremely favorable conditions relative to local storage(?), in terms of: price, access latency, and any other relevant column?

I want to believe, it’s just hard for me to go all in on an Amazon API.


> relative to local storage

That's the catch. S3 is ideal for when the sum total of your blobs can't easily fit on local storage - unless you want to use a NAS, SAN or something else with a load of spinning rust.

Storing your data on a single HDD you got off NewEgg will always win if you only use the one metric of $/GB.

S3's main draw isn't $/GB. It's actually more like ($/GB) * features

E.g. 11 9s of durability, Lambda events, bucket policies, object tags, object versioning, object lock, cross region replication

Doing that with anything over 100TB starts to get very expensive very quickly. Especially if you need that data for, you know, your business to survive...


> Storing your data on a single HDD you got off NewEgg will always win if you only use the one metric of $/GB.

That right there is the mistake you're making. Storage is not the only way that AWS charges you for S3. You're also billed for stuff like each HTTP request, each metadata tag, and data transferred out once you drop off the free tier. You're basically charged every time you look at data you put in an S3 bucket the wrong way.

I strongly recommend you look at S3's pricing. You might argue that you feel S3 is convenient, but you pay through the nose for it.

https://aws.amazon.com/s3/pricing/


The pricing calculator has given me the perspective you've shared, not the one of your comment's parent. It looks outrageously expensive.

Cloudflare's R2 pricing seems a lot more competitive, but it's still lock in: https://r2-calculator.cloudflare.com

Another pain is the testing story. I just want to be able to write to a FS. There are S3 FUSE bindings, though. Maybe I'm just a dinosaur these days.


Well in fairness, if you're comparing Postgres to S3, there's no universe where S3 doesn't win on every way of pricing things out (unless you're only ever using the data from the same machine running Postgres perhaps).


Also, cost is a factor: databases are usually attached to expensive disks, their storage layer is tuned for IOPS, and blobs will be using gigs of that just for sitting there being sequentially scanned.


I haven't worked with BLOBs in DB fields in almost a decade, but when I worked with them, one really annoying problem I ran into was that the DB wanted to morph the blob in transit by default. So JPEGs and PDFs etc. would get corrupted because SQL Server wanted to add its own little byte signature to them during read/write operations.

Not sure if that was an endemic problem or a SQL Server-specific quirk.


Ignoring ACL stuff for the moment, if you put images for a user's blog post inside a database, then front it with a CDN, things might work out fine.

But if the usage pattern was such that the blog post and images were being transformed based on the time or something, then nothing would get cached, and back to the issue you describe.

That said... you could set up read replicas into groups and direct the blob stuff to a pool that is separate from the higher-priority data requests for normal DB stuff.


Do you have any thoughts on when a JSON column is too large? I've been wondering about the tradeoffs between a jsonb column in postgres that may have values, at the extreme, as large as 10 MB, usually just 100 KB, versus using S3.


Wouldn't the same reasoning for BLOBs apply to JSON columns? Unless you're frequently querying for data within those columns (eg, filtering by one of the JSON fields), then you probably don't need to store all the JSON data in the DB. And even if that is the case, you could probably work out a schema where the JSON data is stored elsewhere and only the relevant fields are stored in the DB.

At the same time, I'm working with systems where we often store MBs of data in JSON columns and it's working fine so it's really up to you to make the tradeoff.


If you're only querying the JSON data and not returning it in full often, it's almost certainly fine. It's the cost of transit (and parsing/serialization) that's a problem.


It depends on how much you're getting back total. 100 rows returning a kb of data is the same as one row with 100kb. I get worried when the total expected data returned by a query is more than 200kb or so.


We had this problem as well. For us, a big part of the latency was just the bandwidth required for Postgres's verbose text mode. It's a shame there's no way to compress the data on the wire.


You can cache them on disk. The DB blob is just the source of truth.
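
One way to read that (a sketch with made-up table and column names, using psycopg2): serve from a local file cache keyed by row id and fall back to the blob in the database only on a miss.

    from pathlib import Path
    import psycopg2

    CACHE = Path("/var/cache/blobs")  # hypothetical cache location
    conn = psycopg2.connect("dbname=app")

    def get_image(image_id: int) -> bytes:
        cached = CACHE / str(image_id)
        if cached.exists():
            return cached.read_bytes()
        with conn, conn.cursor() as cur:
            cur.execute("SELECT data FROM images WHERE id = %s", (image_id,))
            data = bytes(cur.fetchone()[0])
        CACHE.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(data)  # the DB blob remains the source of truth
        return data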


We're using PostgreSQL as a document database, so everything is stored as JSONB. It's only around 800 GB, but it's still lightning fast.


BLOBs in Postgres are awesome, but there's also a full-featured file API called Large Objects for when the use case requires seeking/streaming.

Wrote a small Go library to interface with it: https://github.com/mohamedattahri/pgfs

Large Objects: https://www.postgresql.org/docs/current/largeobjects.html
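
A small sketch of the seek/stream use case through psycopg2's large-object wrapper (the file names and sizes are made up); the table only needs to store the returned OID:

    import psycopg2

    conn = psycopg2.connect("dbname=app")
    with conn:  # large-object calls must run inside a transaction
        lo = conn.lobject(0, "wb")  # oid=0 means "create a new large object"
        with open("video.mp4", "rb") as f:
            lo.write(f.read())
        oid = lo.oid
        lo.close()

        # Later: seek into the middle and read a chunk without loading it all.
        lo = conn.lobject(oid, "rb")
        lo.seek(1024 * 1024)        # skip the first megabyte
        chunk = lo.read(64 * 1024)  # read 64 KB from that offset
        lo.close()
    conn.close()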


Large Objects work great for images, stylesheets, and other file assets in my small PostgREST projects.


TIL thanks!


I think there is too much emphasis on 'the right way' sometimes, which leads people to skip over an analysis for the problem they are solving right now.

For example, if you are building something that will have thousands or millions of records, and storing small binaries in postgresql lets you avoid integrating with S3 at all, then you should seriously consider doing it. The simplification gain almost certainly pays for any 'badness' and then some.

If you are working on something that you hope to scale to millions of users, then you should just bite the bullet and integrate with S3 or something, because you will use too much DB capacity storing binaries (assuming they aren't like 50 bytes each or something remarkably small), and that will force you to scale out the database far earlier than you would otherwise be forced to do so.


Brilliant point! It always depends on the problem you're trying to solve. What Xata did was integrate files as a database column but store the binary content in S3, offering both transactional file management and point-in-time recovery as well as high-performance direct (S3) file access through a CDN. Basically, Xata implements the orchestration and takes away the pain of managing two services.


The last thing I would do is use a proprietary solution from a startup to store data. Sorry, but that is just absolutely insane.


Going on a tangent here, but why S3? Why not avoid vendor lock-in and just serve static files?

My setup is to post blobs to a small REST API, which balances storage load over all the file servers using nginx. It responds with the serverSubdomainId, which in turn gets saved in the database, so the URL for file access can be generated as https://{serverSubdomainId}.domain.tld/{resourceId}. The only costly part is writing, which isn't bad if you cache the available storage of all the servers. The whole system is horizontally and vertically scalable. What am I missing that makes S3 still the de facto choice even with vendor lock-in and high egress costs?
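
If it helps to make that concrete, here's a rough sketch of the scheme as I understand it (the balancing logic and everything except serverSubdomainId/resourceId are assumptions):

    # servers maps serverSubdomainId -> cached free bytes, refreshed periodically.
    def pick_server(servers: dict[str, int]) -> str:
        return max(servers, key=servers.get)  # write to the least-full server

    def file_url(server_subdomain_id: str, resource_id: str) -> str:
        return f"https://{server_subdomain_id}.domain.tld/{resource_id}"

    servers = {"fs01": 500_000_000_000, "fs02": 120_000_000_000}
    chosen = pick_server(servers)     # store this id next to the row
    url = file_url(chosen, "a1b2c3")  # https://fs01.domain.tld/a1b2c3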


There are plenty of OSS solutions that talk "S3", like Swift, Ceph, and SeaweedFS.

Why object storage? It's easier to back up, version, and scale horizontally. Most solutions will also provide encryption capabilities that can be commanded either via external key management systems or from the client application. Also, custom access policies are great for private documents and file uploads.

Using static files is a good solution in some cases, nothing against it. But in many scenarios, there are huge benefits to using object storage instead, even if it is quite a bit slower.


"S3" is often used to refer to any S3 compatible object storage system. You can even self-host it with something like MinIO.


I can't speak for everyone but, in my case, it comes down to the ease of setup and having someone else manage it.


> and that will force you to scale-out the database far before you would otherwise be forced to do so

Disk space is cheap, even when it's on a DB server and not S3. Why worry that much about it?


> Disk space is cheap

Disk I/O less so. An average RDBMS writing a 10MB BLOB is actually writing at minimum 20-30MB to disk - once to journal/WAL, once to table storage, and for updates/deletes a third copy in the redo/undo log.

You also get a less efficient buffer manager with a higher eviction rate, which can further exacerbate disk I/O.


But if the files are important enough that you need transactional updates or point-in-time recovery for them, you're stuck with the equivalents of those things anyway, wherever those files end up, plus the extra overhead of coordinating two systems.


But there is a hard limit to db scale up and you don't want to hit it. Also, who said you need transactional updates or pit recovery in all cases?


The issue I've run into re: storing files as BLOBs in a database has been the "impedance mismatch" coming from others wanting to use tools that act on filesystem objects against the files stored in the database.

That aside I've had good experiences for some applications. It's certainly a lot easier than keeping a filesystem hierarchy in sync w/ the database, particularly if you're trying to replicate the database's native access control semantics to the filesystem. (So many applications get this wrong and leave their entire BLOB store, sitting out on a filesystem, completely exposed to filesystem-level actors with excessive permission.)

Microsoft has an interesting feature in SQL Server to expose BLOBs in a FUSE-like manner: https://learn.microsoft.com/en-us/sql/relational-databases/b...

I see postgresqlfs[0] on Github, but it's unclear exactly what it does and it looks like it has been idle for at least 6 years.

[0] https://github.com/petere/postgresqlfs


MinIO is easy enough to spin up and is S3-compatible. That seems like my default path to persistence going forward. More and more deployment options seem like they'll benefit from not using the disk directly but instead using the dedicated storage service path, so might as well use a tool designed for that.

S3 can be a bit of a mismatch for people who want to work with FS objects as well, but there are a couple options that are a lot easier than dealing with blob files in PGsql. S3cmd, S3fs; perhaps SSHfs to the backing directory of Minio or direct access on the host (direct routes untested, unsure if it maps 1:1).


It makes backups a PITA. I migrated blobs to S3 and managed to improve backups from once a month to once a day. The database is now very slim. HDD space is no longer an issue. Can delete the code that serves files. Lots of improvements with no downsides so far.


Of course, that also means that now you don't have backups of your files.


They do have backups if versioning is enabled on S3.


Yes, S3 is extremely reliable, and versioning protects against “oopsies.” I do always recommend disallowing s3:DeleteObjectVersion for all IAM roles, and/or as a bucket policy; manage old versions with a lifecycle policy instead. This will protect against programmatic deletion by anybody except the root account.
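
For the bucket-policy half of that advice, a hedged sketch via boto3 (the bucket name is hypothetical; check the policy against your own account setup before relying on it):

    import json
    import boto3

    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "NoVersionDeletes",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:DeleteObjectVersion",
            "Resource": "arn:aws:s3:::my-backups/*",
        }],
    }
    boto3.client("s3").put_bucket_policy(Bucket="my-backups", Policy=json.dumps(policy))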


I second these sensible protections. I would also recommend replicating your bucket as a DR measure, ideally to another (infrequently used and separately secured) account. Even better if you do it to another region, but I think another account is more important, since the chances of your account getting owned are higher than Amazon suffering a total region failure.


How big does the database need to be for large BLOBs to become a problem? "Big" and "small" are quite subjective terms.

How many BLOBs does one need to have, and how often do we need to touch them, for this solution to become untenable?


One datapoint on BLOB performance: SQLite: 35% Faster Than The Filesystem

https://www.sqlite.org/fasterthanfs.html


The Postgres TEXT type is limited to 65,535 bytes, to give a concrete number. "Big" usually means, big enough that you have to stream rather than sending all at once.


The PostgreSQL manual states:

> the longest possible character string that can be stored is about 1 GB.

· https://www.postgresql.org/docs/current/datatype-character.h...

See also: https://www.postgresql.org/docs/current/limits.html

Where did you get the 64KB number from?

A small test:

    test=# create table test_table (test_field text);
    CREATE TABLE
    test=# insert into test_table select string_agg('x', '') from generate_series(1, 128*1024);
    INSERT 0 1
    test=# select length(test_field) from test_table;
     length 
    --------
     131072
    (1 row)


You're right. Bad Google suggestion for "max string length postgres" that pointed to https://hevodata.com/learn/postgresql-varchar/ saying varchar max is 64KB, which is also wrong. I gotta stick to the official docs. Anyway, TEXT is the one we care about.

In MySQL, the TEXT (and BLOB) limit is 65KB, but you can get 4GB using LONGTEXT. According to their docs: https://dev.mysql.com/doc/refman/8.0/en/string-type-syntax.h...


In storage, you measure things in blocks. Historically, blocks were meant to be 512 bytes big, but today the tendency is to make them bigger; 4K would be the typical size in a server setting.

So, the idea here is this: databases that store structured information, i.e. integers, booleans, and short strings, are typically something like relational databases, e.g. PostgreSQL.

Filesystems (eg. Ext4) usually think about whole blocks, but are designed with the eye for smaller files, i.e. files aren't expected to be more than some ten or hundred blocks in size for optimal performance.

Object stores (eg. S3) are the kinds of storage systems that are supposed to work well for anything larger than typical files.

This gives the answer to your question: blobs in a relational database are probably OK if they are under one block in size. Databases will probably be able to handle bigger ones too, but you will start seeing serious drops in performance when it comes to indexing, filtering, searching etc., because such systems optimize internal memory buffers in such a way that they can fit a "perfect" number of elements of the "perfect" size.

Another concern here is that with stored elements larger than a single block you need a different approach to parallelism. Ultimately, the number of blocks used by an I/O operation determines its performance. If you are reading/writing sub-block-sized elements, you try to make it so that they come from the same block to minimize the number of requests made to the physical storage. If you work with multi-block elements, your approach to performance optimization is different -- you try to pre-fetch the "neighbor" blocks because you expect you might need them soon. Modern storage hardware has a decent degree of parallelism that allows you to queue multiple I/O requests without awaiting completion. This latter mechanism is a lot less relevant to something like an RDBMS, but is at the heart of an object store.

In other words: the problem is not a function of the size of the database. In principle, nothing stops e.g. PostgreSQL from special-casing blobs and dealing with them differently than it would normally do with "small" objects... but they probably aren't interested in doing so, because you already have appropriate storage for that kind of stuff, and PostgreSQL, like most other RDBMSs, sits on top of the storage for larger objects (the filesystem), so it has no hope of doing it better than the layer below it.


Most of what you wrote there is simply not true for a modern DBMS. Specifically, PostgreSQL has a mechanism called TOAST (https://www.enterprisedb.com/postgres-tutorials/postgresql-t...) that does exactly what you claim they "probably aren't interested in doing" and completely eliminates any performance penalty of large objects in a table when they are not used.
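
A quick way to see TOAST at work, reusing the test_table from the psql session above (psycopg2 and the dbname are assumptions): large values are compressed and, if still too big, moved out-of-line into a separate TOAST relation, so the heap itself stays small.

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT pg_size_pretty(pg_relation_size('test_table')),       -- heap only
                   pg_size_pretty(pg_total_relation_size('test_table'))  -- heap + TOAST + indexes
        """)
        print(cur.fetchone())
    conn.close()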


sigh... lol, an "expert".

PostgreSQL, just like most RDBMSs, uses the filesystem as a backend. It cannot be faster than the filesystem it uses to store its data. At best, you may be able to configure it to use more caching, and then it will be "faster" when you have enough memory for caching...


What does that have to do with my comment? I didn't say that an RDBMS is faster than a filesystem, just that your statements about them "seeing serious drops in performance when it comes to indexing, filtering, searching etc." are clearly wrong.

It seems very clear that it is you who knows some vaguely relevant bit of trivia, but actually have no clue about the subject in general, and try to make up for that in arrogance. Which is, honestly, pretty embarrassing.


"Filesystems (eg. Ext4) usually think about whole blocks, but are designed with the eye for smaller files, i.e. files aren't expected to be more than some ten or hundred blocks in size for optimal performance."

Sorry what?

I mean, ext4, as a special case, has some performance issues around multiple writers to a single file when doing direct I/O, but I can't think of a single other place where your statement is true... and plenty where it's just not (XFS, JFS, ZFS, NTFS, ReFS).


God, how do you come up with this nonsense? Can you read what you reply to, or do you just write this to show off because you happened to know some vaguely relevant bit of trivia, but actually have no clue about the subject in general?

Yes, the goal of a filesystem is to perform best when files are greater than one block and smaller than some couple hundred blocks. This is what it's optimized for. Can it deal with smaller or bigger files? -- Yes, but this is beside the point. Filesystems by design are meant for the sizes I mentioned. It's stupid to store information in files much smaller than a single block because filesystems store a lot of file metadata per file. If your files are too small, then you start paying an exorbitant price for metadata (filesystems don't optimize for storing metadata in bulk for multiple files because such an optimization would mean a lot of synchronization when dealing with multiple files in parallel).

Similarly, filesystems, by and large, aren't good at dealing with large chunks of data, like, e.g., database backups, VM images, etc. Typically, a storage system for large objects will try to optimize its performance with larger-than-block compression and deduplication. Neither makes sense when your target chunk size is in the single to double digits of blocks -- there won't be enough overlap between different data chunks, and you will pay more for decompression of data that you don't want to read if the granularity of compressed chunks is too big.

And this is not a secret at all... talk to anyone who works on either database or filesystem or object store -- they will tell you exactly this. (I worked on a filesystem, for the record.) This is why these things all co-exist and are filling their respective niches...


Wow, this comment is very rude and completely out of place.


Several years ago Walmart dramatically sped up their online store's performance by storing images as blobs in their distributed Cassandra cluster.

https://medium.com/walmartglobaltech/building-object-store-s...


I've seen BLOBs in an Oracle DB used to store Oracle install ISOs, which I think is ironic on some level.

Let's attach what we're using to the ticket. All of it. Why is the ticketing DB huge? Well, you attached all of it.


I once helped migrate data from a 3+ TB Oracle database for a ticketing system. It was supposed to be shut down more than a year prior but the task kept getting passed around like a hot potato.

I can't imagine how much money we were paying in licensing fees and storage costs.


I'm on board with using temporary solutions for small projects, but I feel like BLOB isn't all that convenient either unless you have some particular reason you only want Postgres as a dependency. I can either set up something like S3 where I upload via some simple API and get URLs to send back to clients that will never break, or I can build my own mini version of that with BLOBs. The BLOB way is a little more manual if anything. Also sometimes gets annoying with certain DB drivers.

Similar story with caches. Sometimes I've used Postgres as a cache, but writing the little logic for that is already more hassle than just plopping in memcached or Redis and using a basic client.


I've seen companies store pdf invoices in the database too, for the same reasons you spoke of.


The main issue with large blobs in a DB is moving and/or deleting them. I'm not sure if PG has this issue (I would like to confirm), but in, say, SQL Server, if you delete a blob or try to move it to another filegroup, you get transaction logging equal to the size of the blobs.

You would think that, at least with a delete, especially in PG with the way it handles MVCC, there would just be an entry saying which blob was deleted or something, and if you need to roll back you just undelete the blob.

So just being able to move and delete the data after it is in there becomes a real problem with a lot of it.


We ran into issues doing this straightaway, because you also need to fetch the file out of the database in your application server, then send it to clients through the load balancer. That introduces bottlenecks.

Generally I agree: don't prematurely optimize. But for us, without even having huge files (5-25 MB on average), storing file data in the database slowed our application down unacceptably, and we ended up moving file data to S3 with a streaming API in front of it.


> If the system you're working on is small, and going to stay small, then that doesn't matter.

Having a good default solution saves a lot of migration problems. Storing files outside of the database today really isn't that much more complicated, while the benefits even for small files are significant: you can use a CDN, you don't want traffic spikes to affect database performance, you will want different availability guarantees for your media than for the database, and so on.


I would absolutely agree with this, were it not for the fact that your table appears to grow without bound - and is very very hard to clean up.

I know this, because this is exactly what I did. Put the images and the logfiles from an automated test suite directly into the DB. Brilliant, I thought! No more tracking separate files, performance was great, everything was rosy.

And then I tried to prune some old data.


> But that's only bad if performance is something you're having problems with, or going to have problems with.

The first part is easy enough, but how do you predict when you're going to hit a knee in the system's performance?


"Databases can now store files and images"

Just with SQL Server alone, for decades now you can store BLOB data.

Need more than 2GB?

There's FILESTREAM, where enormous files can sit on the disk, outside the DB, but be referenced as if they were in the database via a varbinary(max) column.

"Think of it as having a new database column type where you can store files of any size"

We have that. It might not be AWS S3 and cached on a CDN, but that is just adding additional cost, complexity, and lock-in.


> Just with SQL Server alone, for decades now you can store BLOB data.

Yes, and for decades, it's been a bad idea for SQL Servers that grow over time.

When you store files in the database, you massively inflate the size of the data file and the transaction log. It makes all your maintenance tasks take longer: backups, restores, high availability, disaster recovery, corruption checking.

To make matters worse, when a new version of the file arrives, people never overwrite the existing file - they store the new version as a new row or object, and keep the old version around too. Space requirements explode exponentially.

If you're dealing with a 10-100GB database, sure, it seems fine at first. But after you start storing files in there, you won't have a 10-100GB database - you'll have a terabyte-sized one, and that's when your maintenance task heartaches begin.


> Yes, and for decades, it's been a bad idea for SQL Servers that grow over time.

I never said storing BLOBs was a great idea, but for smaller databases, like you say, and with small files, it is an option, and it won't wreck your performance like you think in many cases. The profiler knows how to handle this, and if your query doesn't need the BLOBs, they aren't going to impact it like you are saying. As always: test, test, test.

For large databases, that's why I mentioned FILESTREAM. Those are not in the database; they are essentially pointers to files on disk.

> To make matters worse, when a new version of the file arrives, people never overwrite the existing file - they store the new version as a new row or object

This may be a requirement? And if it is, you have to store it somewhere. Using FILESTREAM you are free to put the actual files wherever you want. The database column says where to find it.


Yes, I remember: 10-20 years ago, Microsoft made something similar to Google Earth to showcase how SQL Server could store and serve massive amounts of binary files.


hey HN, and thank you todsacerdoti for posting it!

This is a feature that we wanted for a long time, but we also wanted to get it right. It's essentially the equivalent of storing files and images in an S3 bucket and then putting URLs to them in the database, but then you have to take care of keeping things in sync, take care of security permissions, delete the files when the database row is deleted, etc. We automate all of that for you and even more: we cache the files in a CDN, and if they are images we support transformations out of the box.
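
For contrast, here is a rough sketch of the manual bookkeeping this replaces (not Xata's implementation; the table, column, and bucket names are made up):

    import boto3
    import psycopg2

    s3 = boto3.client("s3")
    conn = psycopg2.connect("dbname=app")

    def delete_attachment(row_id: int) -> None:
        with conn, conn.cursor() as cur:
            cur.execute(
                "DELETE FROM attachments WHERE id = %s RETURNING s3_key", (row_id,)
            )
            row = cur.fetchone()
        # The object is removed only after the DB delete commits; if this call
        # fails you're left with an orphan in S3 -- exactly the sync problem above.
        if row:
            s3.delete_object(Bucket="app-files", Key=row[0])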


There are plenty of frameworks and libraries that link files in buckets with rows in a database. Django is one that is pretty simple.


I have an ancient website I have been maintaining since 2001 that has all its static assets and images as blobs in MySQL.

Cloudflare and some good headers make it not a terrible option.

The site itself is basically a historic artifact at this point. I probably would never build a site that way again, but with foreign keys it has some major upsides when it comes to maintaining data consistency between posts and associated images and files.


Why would you never build a site like that again? If you were to build something similar today, how would you build it?


I'd almost certainly have all my static content in S3 or an S3-like store and just put the address in the database. Treat the data as largely immutable. Don't delete anything unless we have good reason to.

This is how we generally do things where I work now.


You may not like Mongo, but it's had this feature for a long time. https://www.mongodb.com/docs/manual/core/gridfs/ Maybe the title needs to say "Xata can now store..."?


With built in replication, horizontal scalability, sharding, detailed monitoring, built in LRU memory cache, reading your own writes, automatic file expiration, etc.


It's hard to find when CouchDB and MongoDB got attachments, but I suspect CouchDB had them first, so it may have been an obvious choice for MongoDB.


I've worked on a project where MongoDB GridFS was used quite successfully as the backing store for a CMS system:

https://news.ycombinator.com/item?id=37331037


Being able to demote files from "this is an object on disk" to "this is just another record, albeit a very big one" is a huge win for data architecture. I've seen this implemented ad-hoc in dozens of different instances. Having it work out-of-the-box with a plugin and custom column type would be really nifty.


Yet it uses S3. This is bucketless like AWS Lambda is serverless.


Skimmed the article, grepped for 'hash', didn't find it.

> they are stored in AWS S3

Not sure why anyone would feel the need to couple this sort of thing to S3. Just store file hashes in the database, and have a blobstore mapping hashes to content. I like bitprint URNs for this purpose[1] but any strong cryptographic hash function will do. Then you can use any old datastore you want, be it a plain old filesystem, S3, a Git repo, or something else. It's trivial to bolt on alternate backends because you can always guarantee that what you asked for is what you got. If not, try the next one.

[1] http://www.nuke24.net/docs/2015/HashURNs.html
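
A minimal sketch of that hash-addressed layout, using SHA-256 and a plain directory as the backing store (the store path is hypothetical; swap in S3, a Git repo, etc.):

    import hashlib
    from pathlib import Path

    STORE = Path("/var/blobs")  # hypothetical blobstore root

    def put(data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = STORE / digest[:2] / digest
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return digest  # this is what goes in the database

    def get(digest: str) -> bytes:
        data = (STORE / digest[:2] / digest).read_bytes()
        # Verify: what you asked for is what you got, so backends are interchangeable.
        assert hashlib.sha256(data).hexdigest() == digest
        return data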


I agree that this is the "better way to do it," but as someone who's done all that, it's a much better DX for whatever system to just take care of this for me. I do think it'd be nice to decouple from S3 (hopefully they're S3-compatible, meaning you can switch out MinIO or R2 or whatever in the back).


I think that was a feature Oracle was pimping in the late nineties already. These days most databases can store blobs, and if they can't, you can just store things in base64-encoded form in a CLOB.

So, this is nothing new. Nice of course if you need it but probably a bad idea to use at scale as there are cheaper ways to do that that don't involve hammering your database with requests. It's not particularly hard to make that perform well of course but why bother? I suppose transactional behavior could be nice for mutations. But beyond that, I don't see a good reason to do that.


I've worked on a project where MongoDB GridFS was used quite successfully as the backing store for a CMS system. If I recall correctly, GridFS gives you versioning out of the box, which the CMS leveraged to let users roll back files/images/videos to older versions.

That was a production system and used to get hammered with requests. It didn't even have a CDN in front of it for the first few years.

And that was MongoDB 2.4 so I'm assuming GridFS has matured even more since then.


Just a note: if you are reading this post and then click the logo in the top right, you cannot navigate back to the post - it immediately jumps to the homepage again.

Safari on iOS.


Works fine for me with Safari on iOS.


Works fine on chrome/android


This sounds useful for malware persistence after a backup recovery


>Think of it as having a new database column type where you can store files of any size, and behind the scenes they are stored in AWS S3 and cached through a global CDN.

So the files are not stored in the database, but somewhere else. This belongs in a web framework, not the database.


Coming from storage, I really hate it when people (unknowingly or deliberately) misuse terminology in this field.

Whoever Xata is, they aren't storing files in their database. They are storing blobs, or objects... depends on how you look at it, but definitely not files.

A simple way to see that that's not true: can they store write-only files in their database (e.g. /dev/null)? Or can they store UNIX socket files? What about device files? Can they recognize that the file has the setuid bit? And so on... practically no useful attributes of files are stored in their database.

They obviously didn't even want to store files... It just feels like some kind of marketing trick (especially since the article is excessively peppered with self-congratulatory quotes from people who have allegedly found the product useful). Just say it as it is: Blob attachments.


I'm no expert in this field. I presume you are technically correct. But also probably too focused on technical details.

Send a file or path to a DB server in an INSERT query (INSERT INTO users (email, pwhash, avatar) VALUES (?,?,?)), and have it handle the actual storage, all abstracted away. For the user (programmer) of this system, this is "storing the file in the DB". If I can then do a "SELECT email, variation('tiny', avatar)::cdn_url FROM users", that is "getting the image from the DB" - well, technically getting the email and a URL where this file can be retrieved.

To me, as a user of such a database, it matters not if some engine stores it as a blob, base64, pointer, or just a URI to an S3 block. What matters is that it's in the same layer of abstraction.


> To me, as a user of such a database, it matters not if some engine stores it as a blob, base64, pointer, or just a URI to an S3 block. What matters is that it's in the same layer of abstraction.

By "user of such a database" you mean the end-user of an application, right? As opposed to the devs, DBAs, SREs, et cetera? Because those users absolutely care about how this feature works and absolutely do not want anything to be abstracted away, because that's the dif

Also, data: URIs need to die. It started off with lazy JS devs using <canvas>'s to-data-uri (instead of toBlob) and now it's infecting the entire dev ecosystem: just go to StackOverflow and there'll be a post every few hours where some asker genuinely believes we're supposed to store large, gigabyte-sized files in a Postgres or MSSQL table as a Base64 string in a varchar column with a text collation (aiiieeeeee).


No. I meant a developer.

I did not mean SREs and DBAs, as those typically need to know far too much of the innards of the systems they are maintaining.

But a developer? I honestly don't really care that "(BIG) TEXT" is stored using something called TOAST[1] in postgres. All I care about, is that I can store large texts of varying lengths.

The original argument, to me, sounded very much like someone with a lot of knowledge coming in and explaining "no, VARCHAR and TEXT are very different!". Sure, for someone hacking away on the storage-engine or even someone tuning pg, they are. But for "the average developer"? They are the same.

Sure, I am aware that all abstractions are leaky. But e.g. TEXT in pg is a good example, because I really didn't have to care; it just worked the same as storing ints, strings, etc. Until it didn't and I had to go down that rabbit hole, hence why I do know about it, as a software dev - as a user of Postgres.

And about URIs: I didn't mean you'd get a data:uri. I didn't even imply that. I really meant a URI to where the asset can be found online: some S3 URI or so.

[1] https://blog.rustprooflabs.com/2020/07/postgres-storing-larg...


Thank you for the clarification.

(I didn’t notice that my post was unfinished until a few hours later but the Edit button was disabled, grumble)


"practically no useful attributes of files are stored in their database."

Probably because you work in the field, you miss what the layman wants to see. The most useful attribute for me is the data. The other use cases you cited are legitimate; it's just that I, as a layperson, don't think about them when we talk about files.


That's part of the reason why experts exist: to tell laypeople how to correctly use the language.

In other words: why should laypeople (who have no clue about the technology) decide what to call things? And why should someone who dedicated substantial effort to understanding, cataloguing, and organizing the terminology give any thought to how non-experts, who haven't spent as much time and effort, decide to go about it?

Imagine going to a medical doctor and demanding that they switch terminology they use in medical reporting or pharmaceutical nomenclature to work by the "rules" established by people who have no knowledge of, say, anatomy or pharmaceutics? Like, say, you decide that it's convenient for you to call all pills "paracetamol" -- do you think a doctor would humor such an "initiative"?

So, why should anyone in the filesystem making business humor laypeople concepts of files?


Early operating systems did not support directory trees, and I'm sure some didn't even support timestamps and such. Would you say that those system do not store files at all? What even is the defining characteristic of a "file"?


> What even is the defining characteristic of a "file"?

This is where you have to start when you read b/s articles like the one in OP.

But, to answer your previous question, let's first look at a different example: the word "car". Back in the day, before we had horseless carriages, "car" was basically a contraction of "carriage". I.e., two hundred years ago it would be completely natural to picture cars as being pulled by horses, and cars that weren't pulled by horses or other beasts of burden would be the stuff you find in fairy tales.

Were English speakers stupid two hundred years ago to so grandly misuse the word "car"? -- I don't think so. In their context it was perfectly fine. The context is gone now, so whoever calls a buggy with a horse a car today is not using the language correctly.

And such is the case with early filesystems. By today's standards they wouldn't have qualified to be called that. Probably, the early filesystems more closely resembled the key-value stores we use today. The discrepancy happened due to semantic drift caused by changes in technology.

Similarly to how you wouldn't use the word "car" to describe a buggy today, you shouldn't use the word "filesystem" to describe a key-value store today. Well, unless you make sure the readers understand that you are talking about what happened some 50-60 years ago.

And people who get to decide what to call a filesystem and what not to call filesystem are the people who make filesystems (and I'm one of them).


If the people designing the storage system get to decide what is and what isn't a file, then the database does store files, because the developers of the database system are telling you that it does. That those files have properties unlike the files on the file system you have worked on is entirely irrelevant.

See what appeal to authority gets you?


At my job, there's a thing where owners of internal systems always say "X is not a database" or "X is not a filesystem" in reference to things that anyone else would call databases or filesystems, to the point of it becoming an inside joke. I don't mind if people misuse terminology from my field (databases). Like I'm not going to stand up every time someone says "Postgres is a type of database" instead of saying "DBMS."


Can you send a write only file over the network? UNIX socket file? Device file? Setuid bit? What does setuid even mean when you push the file onto a Windows box?

Most people are only concerned with blob attachments when they say file.


> Can you send a write only file over the network?

Yes. NFS can do that, for example.

> UNIX socket file?

I don't know a system off the top of my head that does this, but rsync can possibly preserve the file type given some flags.

> Device file?

Not immediately, but sort of. Eg. device files under /dev are created by udev based on the information it reads from a socket. In principle, you could have the user-space side of udev running on a machine other than the one physically attached to the block device thus exposed.

Ultimately, however, you can do all of the above over the network if you use iSCSI, or have a driver (similar to Ceph, Lustre, BeeGFS, etc.) that exposes a filesystem over the network.

Most importantly, however, I don't see how your question is relevant to anything I wrote. Suppose you couldn't do any of the things you naively thought you cannot do... then what? Why is sending something over the network an indication of anything?

> Most people are only concerned with blob attachments when they say file.

So what? Most "people" (by which, I think, you mean "programmers") are irredeemably dumb. It doesn't matter what a majority of a group of idiots think. But even if they were smart, again, who cares what the majority of smart people thinks? -- majority isn't what establishes the truth.


You're stuck in the UNIX-ism of everything is a file. So things are files in Unix that in other operating systems are not. COM1 is a file in Windows. It doesn't make sense to transfer COM1 over the network. You can't store it. You can't send /dev/null over the network. You can send a representation of it via NFS, but anything you write to it via NFS isn't writing to the file, you're writing it back to the original server that you mapped NFS from. The representation of it isn't the file, just like a map isn't the terrain. If I map a filesystem over the network, I'm still not sending the file over the network. I'm just sending a hook for the server that's getting mapped to know what's going on.

When I say most people, I mean most users. Most users aren't concerned with socket files, or write only files, or device files, because they aren't things they're concerned with. They think of a file as a blob of data, with maybe metadata for who can access it (et al). So, for the purposes of storing files, a file is a blob of data, not something being exposed across the network through special drivers.


I think you're conflating files with file metadata.


For many people the meta data is just as important as the file itself so storing only a blob isn't exactly what they want.


That's what the other cols are for. Or put the metadata in the blob if you really want.


No, I am not.



