Hacker News | jerednel's comments

Cool! Does this assume the unstructured data already has a corresponding metadata file?

My most common use cases involve getting PDFs or HTML files and I have to parse the metadata to store along with the embedding.

Would I have to run a process to extract file metadata into JSONs for every embedding/chunk? Would the keys, created based on the document, be title+chunk_no?

Very interested in this because documents from clients are subject to random changes and I don’t have very robust systems in place.


DataChain has no assumptions about metadata format. However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.

Extract metadata as usual, then return the result as JSON or a Pydantic object. DataChain will automatically serialize it to its internal dataset structure (SQLite), which can be exported to CSV/Parquet.

In the case of PDF/HTML, you will likely produce multiple documents per file, which is also supported: just `yield my_result` multiple times from map().

Check out video: https://www.youtube.com/watch?v=yjzcPCSYKEo Blog post: https://datachain.ai/blog/datachain-unstructured-pdf-process...
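To make the "multiple documents per file" idea concrete, here is a minimal sketch in plain Python. It does not use the DataChain API; `ChunkRecord`, `split_into_chunks`, and the `title#chunk_no` key scheme are hypothetical names chosen for illustration, and a dataclass stands in for a Pydantic model so the example runs with the standard library only.

```python
from dataclasses import dataclass, asdict
from typing import Iterator


@dataclass
class ChunkRecord:
    # Hypothetical record shape; a Pydantic model could carry the same fields.
    key: str        # assumed key scheme: "<title>#<chunk_no>"
    title: str
    chunk_no: int
    text: str


def split_into_chunks(title: str, text: str, max_words: int = 50) -> Iterator[ChunkRecord]:
    """Yield one record per chunk of at most `max_words` words."""
    words = text.split()
    for chunk_no, start in enumerate(range(0, len(words), max_words)):
        yield ChunkRecord(
            key=f"{title}#{chunk_no}",
            title=title,
            chunk_no=chunk_no,
            text=" ".join(words[start:start + max_words]),
        )


if __name__ == "__main__":
    # One input document produces several output records.
    for record in split_into_chunks("client-report", "word " * 120):
        print(asdict(record)["key"])
```

Because the key is derived from the title and chunk number, re-processing a changed document overwrites the same keys instead of creating new ones, as long as the title stays stable.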


> However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.

Forgive my ignorance, but what is "json-pair"?


It's not a format :)

It's simply about linking metadata from a JSON file to a corresponding image or video file, like pairing data003.png & data003.json into a single, virtual record. Some formats use this approach: open-image or laion datasets.
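The pairing described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how DataChain implements it; `pair_by_stem` is a hypothetical helper that treats any `.json` file as metadata and any other file with the same path stem as the data file.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def pair_by_stem(filenames):
    """Map each shared path stem to a (data_file, json_file) tuple.

    Files without a partner are dropped from the result.
    """
    groups = defaultdict(dict)
    for name in filenames:
        path = PurePosixPath(name)
        kind = "meta" if path.suffix.lower() == ".json" else "data"
        # Key on the full path minus the suffix so files in different
        # directories do not collide.
        groups[str(path.with_suffix(""))][kind] = name
    return {
        stem: (files["data"], files["meta"])
        for stem, files in groups.items()
        if "data" in files and "meta" in files
    }


if __name__ == "__main__":
    print(pair_by_stem(["data003.png", "data003.json", "data004.png"]))
```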


Thanks for the explanation!


> DataChain has no assumptions about metadata format.

Could your metadata come from something like a Postgres SQL statement? Or an Iceberg view?


Absolutely, that's a common scenario!

Just connect from your Python code (like the lambda in the example) to the DB and extract the necessary data.
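As a rough sketch of that pattern, the snippet below looks up metadata for a file from a database inside an ordinary Python function. The standard-library `sqlite3` module stands in for Postgres here so the example is self-contained; the `doc_meta` table, its columns, and the function names are all assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical doc_meta table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doc_meta (file_name TEXT PRIMARY KEY, client TEXT, doc_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO doc_meta VALUES (?, ?, ?)",
        [("a.pdf", "acme", "contract"), ("b.html", "globex", "report")],
    )
    return conn


def lookup_meta(conn, file_name):
    """Return metadata for one file as a dict, or None if the file is unknown."""
    row = conn.execute(
        "SELECT client, doc_type FROM doc_meta WHERE file_name = ?",
        (file_name,),
    ).fetchone()
    if row is None:
        return None
    return {"file_name": file_name, "client": row[0], "doc_type": row[1]}


if __name__ == "__main__":
    conn = build_demo_db()
    print(lookup_meta(conn, "a.pdf"))
```

With a real Postgres instance the query would go through a Postgres driver instead, and it would usually be worth opening the connection once rather than once per file.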


What relevant metadata is there in an HTML file?


I guess it involves splitting a file into smaller document snippets, getting page numbers and such, and calculating embeddings for each snippet; that’s the usual approach. Specific signals vary by use case.

Hopefully, @jerednel can add more details.


For HTML it's markup tags...h1's, page title, meta keywords, meta descriptions.
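For illustration, those fields can be pulled out with the standard-library `html.parser` module. This is a minimal sketch, not a robust scraper; `MetaExtractor` and `extract_html_metadata` are hypothetical names, and only the title, h1 headings, and the keywords/description meta tags are collected.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the page title, h1 text, and keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1 = []
        self.meta = {}
        self._current = None  # "title" or "h1" while inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
            if tag == "h1":
                self.h1.append("")
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.h1[-1] += data


def extract_html_metadata(html):
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "h1": [heading.strip() for heading in parser.h1],
        **parser.meta,
    }


if __name__ == "__main__":
    page = "<html><head><title>Pricing</title></head><body><h1>Plans</h1></body></html>"
    print(extract_html_metadata(page))
```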

My retriever functions will typically use metadata in combination with the similarity search, either to influence the results or for reranking.
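One simple way to combine the two signals is to add a small metadata bonus to the similarity score before sorting. The sketch below is an assumption about what such a reranker could look like, not the commenter's actual retriever; `rerank`, the `boost` weight, and the document dict layout are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query_vec, query_terms, docs, boost=0.2):
    """Order document ids by cosine similarity plus a metadata bonus.

    The bonus is `boost` times the fraction of query terms that appear
    in the document's title or keywords.
    """
    terms = {term.lower() for term in query_terms}
    scored = []
    for doc in docs:
        meta_text = f"{doc.get('title', '')} {doc.get('keywords', '')}".lower()
        hits = sum(1 for term in terms if term in meta_text)
        meta_score = hits / len(terms) if terms else 0.0
        score = cosine(query_vec, doc["embedding"]) + boost * meta_score
        scored.append((score, doc["id"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc_id for _, doc_id in scored]


if __name__ == "__main__":
    docs = [
        {"id": "a", "embedding": [1.0, 0.0], "title": "Intro"},
        {"id": "b", "embedding": [0.9, 0.1], "title": "Pricing plans"},
    ]
    print(rerank([1.0, 0.0], ["pricing"], docs))
```

The additive boost is the simplest option; a weighted sum of normalized scores or a hard metadata filter before the similarity search are common alternatives.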


It's not super clear to me how this interacts with data. If I am using ADLS to store Delta tables and I cannot pull prod to my local machine, can I still use this? Is there a point if I can just look at the Delta log to switch between past versions?


DVC is (at least as I use it) pretty much just git LFS with multiple backends (I guess it's actually a simpler git-annex). It also has some rather MLOps-specific features. It's handy if you do versioned model training with changing data on S3.


There’s another thread from October 2022 on that topic.

https://news.ycombinator.com/item?id=33047634

What makes DVC especially useful for MLOps? Aren’t MLFlow or W&B solving that in a way that’s open source (the former) or just increases the speed and scale massively (the latter)?

Disclaimer: I work at W&B.


DVC is much more basic (feels more unix style), integrates really well with any simple CI/CD scripting with git versioning without the need to set up any additional servers.

And it is not either/or. People actually combine MLFlow and DVC [0]

[0] https://data-ai.theodo.com/blog-technique/dvc-pipeline-runs-...


Speaking of git-annex, there is another project called DataLad (https://www.datalad.org/), which has some overlap with DVC. It uses git-annex under the hood and is domain-agnostic, compared to the ML focus that DVC has.


I've used it for storing rasters alongside georeferencing data in small GIS projects, as an alternative to git LFS. It not only works like git but can integrate with git repos through commit and push/pull hooks, storing DVC pointers and managing .gitignore files while retaining directory structure of the DVC-managed files. It's neat, even if the initial learning curve was a little steep.

We used Google Drive as a storage backend and had to grow out of it to a WebDAV backend, and it was nearly trivial to swap them out and migrate.


Same. I just bought a t480 off eBay a week ago while I already have a m2 pro MBP with 32gb RAM. I don’t know what compelled me but it’s just so cozy. By all measures the Mac is superior. Thinkpads are like the vinyl records of laptops.


I like that. I feel similar. I picked it up locally, and while I was driving over, I thought to myself: why am I even doing this, I have a perfect MacBook. But I'm happy I did it. It feels better in a way. Somehow like vinyl maybe, but not as fragile ;-) And my excuse is: can't run Arch on an M1 MacBook.


The temptation is strong to drop 3.75k on bl.ing


Existence is a funny term in Buddhism. The end goal of development stage practice is the recognition that the deity is inseparable from one’s self because the nature of any given deity is equal to our own in an absolute sense. After all, everything is empty of inherent existence, and all things are self-arisen manifestations from a primordial ground of being, hence no you and no I as separate.

So some people do see them as external things to ask things of but others see them as mind created and used as supports for meditation to realize our own true nature. Depends on lineage and teacher.


Science consists of predictions based on measurements of our perception as mediated by our senses or tools that augment them. Dualism (or idealism for that matter) doesn't require religious beliefs. You can arrive at them analytically. Science is great for explaining the conventional reality of things as they appear though. But until I read a convincing explanation for consciousness and subjective experience that doesn't reduce to "neural correlates" I can't get on board with physicalism.

The fact that we're not all unconscious automatons that just evolved from one long cause-effect chain stemming from the big bang to now continues to bewilder me. "Mind" may play an "of the gaps" game here but it's the only reasonable explanation that currently makes some semblance of sense to me.

So currently I agree with the religious on your last point. Machines or AI cannot be conscious. Maybe if they were infused with some sort of biological material though I would say maybe. If our subjective experience is a localization of a thing called "consciousness" and certain forms (humans, animals) localize it to produce a sense of self then why not.


Funnily enough I actually had a conversation with ChatGPT about this and we concluded that conscious decision making / free will is basically a higher order Markov process.


I've been on a non-duality kick lately.

Losing Ourselves: Learning to Live without a Self by Jay Garfield


Because life is intrinsically frustrating and everything is impermanent including the satisfaction you derive from material gains. The only way to escape is to stop seeking happiness externally.


Seems like a failure on behalf of management to get so dependent on a single resource vs distributing work more evenly. All eggs in one basket and all that.

The bandaid had to be ripped off at some point. They should have added resources over time, though, to take over pieces bit by bit.

