Hacker News: dmpetrov's comments

Good point - it does sound a bit like marketing bs. We’ll rephrase it.


Please share your feedback!


Lance is just a data format. LanceDB might be more comparable to DataChain.

DataChain focuses on data transformation and versioning, whereas LanceDB appears to be more about retrieving and serving data. Both are designed for multimodal use cases.

On the technical side: Lance has its own data format and DB engine, while DataChain utilizes existing DB engines (SQLite in open source and ClickHouse/BigQuery in SaaS).

In SaaS, DataChain has analytics features including data lineage tracking and visualization for PDFs, videos, and annotated images (e.g., bounding boxes, poses). I'm curious to understand the unique value of LanceDB's SaaS — insight would be helpful!

You could think of it as OLTP (Lance) versus OLAP (DataChain) for multimodal data, though this analogy may not be perfect.


How about Daft (https://github.com/Eventual-Inc/Daft)? It also looks like a new multimodal dataframe framework.


One of the maintainers of Daft here.

Just dug through the datachain codebase to understand a little more. I think while both projects have a Dataframe interface, they're very different projects!

Datachain seems to operate more on the orchestration layer, running Python libraries such as PIL and requests (for making API calls) and relying on an external database engine (SQLite or BigQuery/Clickhouse) for the actual compute.

Daft is an actual data engine. Essentially, it's "multimodal BigQuery/Clickhouse". We've built out a lot of our own data system functionality such as custom Rust-defined multimodal data structures, kernels to work on multimodal types, a query optimizer, distributed joins etc.

In non-technical terms, I think this means that Datachain really is more of a "DBT" which orchestrates compute over an existing engine, whereas Daft is the actual compute/data engine that runs the workload. A project such as Datachain could actually run on top of Daft, which can handle the compute and I/O operations necessary to execute the requested workload.


Good question! I’m not so familiar with it.

It looks like Daft is closer to Lance, with its own data format and engine. But I'd appreciate more insights from users or the creators.


I guess it involves splitting a file into smaller document snippets, getting page numbers and such, and calculating embeddings for each snippet—that's the usual approach. Specific signals vary by use case.

Hopefully, @jerednel can add more details.
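A minimal sketch of that usual approach, in plain Python. The `embed` function here is a trivial stand-in for a real embedding model (e.g. sentence-transformers), so the example is self-contained:

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    page: int
    text: str
    embedding: list  # vector from the embedding model

def embed(text):
    # Stand-in for a real embedding model; returns a tiny fake
    # 2-dim "vector" so the sketch runs without dependencies.
    return [len(text), sum(map(ord, text)) % 97]

def chunk_pages(pages, max_chars=200):
    """Split each page's text into fixed-size snippets and embed each one,
    keeping the page number alongside the snippet."""
    snippets = []
    for page_no, text in enumerate(pages, start=1):
        for i in range(0, len(text), max_chars):
            chunk = text[i:i + max_chars]
            snippets.append(Snippet(page_no, chunk, embed(chunk)))
    return snippets
```

Real pipelines would swap in a PDF parser for the page text and a proper embedding model, but the split-then-embed-with-page-numbers shape stays the same.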


For HTML it's markup tags: h1s, page title, meta keywords, meta descriptions.

My retriever functions will typically use metadata in combination with the similarity search to impart some sort of influence or for reranking.
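One way that combination can look, as a hedged sketch: boost the base similarity score when query terms appear in the HTML metadata. The `hits` structure and `rerank` name are illustrative, not a real retriever API:

```python
def rerank(hits, boost_terms, weight=0.2):
    """Re-score similarity hits using HTML metadata (title, keywords).

    `hits` is a list of dicts: {"score": float, "meta": {...}}.
    Each boost term found in the metadata adds `weight` to the score.
    """
    def boosted(hit):
        meta_text = " ".join(
            [hit["meta"].get("title", "")] + hit["meta"].get("keywords", [])
        ).lower()
        bonus = sum(weight for term in boost_terms if term.lower() in meta_text)
        return hit["score"] + bonus

    return sorted(hits, key=boosted, reverse=True)

hits = [
    {"score": 0.9, "meta": {"title": "Unrelated page", "keywords": []}},
    {"score": 0.8, "meta": {"title": "Vector search guide", "keywords": ["search"]}},
]
reranked = rerank(hits, ["search"])
```

Here the lower-similarity hit wins because its title and meta keywords match the query term.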


Exactly! DataChain does lazy compute. It will read metadata/JSON while applying filters and only download a sample of data files (e.g. JPGs) based on the filter.

This way, you might end up downloading just 1% of your data, as defined by the metadata filter.
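The idea in plain Python terms (this is a conceptual sketch, not DataChain's actual API): scan the cheap metadata first, and fetch the heavy file only for records that pass the filter. Because it's a generator, nothing is downloaded until the caller iterates:

```python
def lazy_download(records, predicate, download):
    """Filter on cheap metadata; fetch the heavy file (e.g. a .jpg)
    only for records that match."""
    for meta in records:
        if predicate(meta):
            yield meta, download(meta["path"])

# Toy usage: pretend `download` is an S3 GET; track what gets fetched.
fetched = []
def fake_download(path):
    fetched.append(path)
    return b"bytes of " + path.encode()

records = [{"path": f"img{i}.jpg", "label": "cat" if i % 100 == 0 else "dog"}
           for i in range(200)]
matches = list(lazy_download(records, lambda m: m["label"] == "cat", fake_download))
```

Out of 200 records, only the 2 matching files are ever fetched — the 1% scenario from above.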


DataChain has no assumptions about metadata format. However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.

Extract metadata as usual, then return the result as JSON or a Pydantic object. DataChain will automatically serialize it to an internal dataset structure (SQLite), which can be exported to CSV/Parquet.

In the case of PDF/HTML, you will likely produce multiple documents per file, which is also supported - just `yield my_result` multiple times from map().
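In plain-Python terms, that multi-document pattern is just a generator: one input file yields several records. A sketch (the `Doc` class and `split_pdf` mapper are illustrative, not DataChain's real API, and the fake page list stands in for actual PDF parsing):

```python
from dataclasses import dataclass

@dataclass
class Doc:
    source: str
    page: int
    text: str

def split_pdf(path):
    """A map()-style function: one input file yields several output records."""
    pages = ["intro text", "methods text", "results text"]  # stand-in for parsing
    for page_no, text in enumerate(pages, start=1):
        yield Doc(source=path, page=page_no, text=text)

docs = list(split_pdf("report.pdf"))
```

Each yielded `Doc` becomes its own row in the resulting dataset.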

Check out the video: https://www.youtube.com/watch?v=yjzcPCSYKEo and the blog post: https://datachain.ai/blog/datachain-unstructured-pdf-process...


> However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.

Forgive my ignorance, but what is "json-pair"?


It's not a format :)

It's simply about linking metadata from a JSON file to a corresponding image or video file, like pairing data003.png & data003.json into a single, virtual record. Some formats use this approach: OpenImages or LAION datasets.
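The pairing itself is simple: group files by their shared stem. A sketch (function name and record shape are illustrative):

```python
from pathlib import Path

def pair_records(paths):
    """Group files by stem so data003.png and data003.json become one
    virtual record, as in OpenImages/LAION-style layouts."""
    records = {}
    for p in map(Path, paths):
        records.setdefault(p.stem, {})[p.suffix.lstrip(".")] = str(p)
    return records

pairs = pair_records(["data003.png", "data003.json",
                      "data004.png", "data004.json"])
```

Each entry then maps an extension to its file, giving one logical record per stem.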


Thanks for the explanation!


> DataChain has no assumptions about metadata format.

Could your metadata come from something like a Postgres SQL statement? Or an Iceberg view?


Absolutely, that's a common scenario!

Just connect to the DB from your Python code (like the lambda in the example) and extract the necessary data.


Yes, it's not meant to replace data engineering tools like Prefect or Temporal. Instead, it serves as a transformation engine and ad-hoc analytics for image/video/text data. It's pretty much the DBT use case for text and images in S3/GCS, though every analogy has its limits.

Try it out - looking forward to your feedback!


Yay! Excited to see DataChain on the front page :)

Maintainer and author here. Happy to answer any questions.

We built DataChain because our DVC couldn't fully handle data transformations and versioning directly in S3/GCS/Azure without data copying.

The "DBT for unstructured data" analogy applies very well to DataChain, since it transforms data (using Python, not SQL) directly in storage (S3, not a DB). Happy to talk more!


Can this work statistically? For a given number of attempts, you can require a certain number of successes to make sure the result is statistically meaningful.

In theory, this approach could help address the non-determinism of LLMs.
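As a sketch of what "statistically meaningful" could mean here: given n repeated attempts and a chance-level success probability, find the smallest success count whose binomial tail probability drops below a significance level. Function name and defaults are my own assumptions:

```python
from math import comb

def min_successes(n, p_chance=0.5, alpha=0.01):
    """Smallest k such that P(X >= k) <= alpha for X ~ Binomial(n, p_chance):
    the number of successes needed before the result is unlikely to be luck."""
    for k in range(n + 1):
        tail = sum(comb(n, i) * p_chance**i * (1 - p_chance)**(n - i)
                   for i in range(k, n + 1))
        if tail <= alpha:
            return k
    return None
```

For example, with 20 attempts against a 50% chance baseline, you'd need 16 successes before the outcome is significant at the 1% level.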


There are a few examples of repeated testing being used by alignment groups to either test how aligned a model is, or to aggregate results to get something that is more aligned. For instance this is one related discussion: https://artium.ai/insights/taming-the-unpredictable-how-cont...

The non-determinism is a feature, and it can be disabled. This article also mentions doing that to get more deterministic alignment tests.

Theoretically, if you aggregate enough results, it might become improbable to ever see an unaligned output. However, from a practical standpoint, we clearly prefer much smarter models over running dumber models in parallel to get alignment that way - it's inefficient. The other thing is that, given the number of possible ways to jailbreak a model, you can probably find something that would still bypass ensemble-based protections.

One other concept is relativism - there is a large grey area here. What is okay for someone is not okay for someone else, so even getting consensus among people about what is okay is just not going to happen.


Right, DVC caches data for consistency and reproducibility.

If caching is not needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.

WebDataset example: https://github.com/iterative/datachain/blob/main/examples/mu...
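Streaming WebDataset-style samples out of a tar archive can be done with the standard library alone. A hedged sketch (not DataChain's implementation): read the tar in streaming mode, pair each `.json` member's metadata with the image bytes sharing the same stem, and yield complete samples without extracting to disk:

```python
import io
import json
import tarfile

def stream_pairs(fileobj):
    """Yield (stem, {"jpg": bytes, "json": dict}) samples from a tar stream."""
    samples = {}
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:  # streaming mode
        for member in tar:
            stem, _, ext = member.name.rpartition(".")
            data = tar.extractfile(member).read()
            entry = samples.setdefault(stem, {})
            entry[ext] = json.loads(data) if ext == "json" else data
            if "json" in entry and "jpg" in entry:
                yield stem, samples.pop(stem)

# Demo: build a tiny WebDataset-style tar in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("a.jpg", b"fakejpeg"), ("a.json", b'{"label": "cat"}')]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buf.seek(0)
samples = dict(stream_pairs(buf))
```

The `mode="r|*"` stream mode means the archive is read sequentially, so this works on a non-seekable source like an S3 object stream.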


Thank you! That's news to me. I will absolutely give it a try.

