DataChain makes no assumptions about metadata format. However, some formats are supported out of the box: WebDataset, json-pair, openimage, etc.
Extract metadata as usual, then return the result as JSON or a Pydantic object. DataChain will automatically serialize it to its internal dataset structure (SQLite), which can be exported to CSV/Parquet.
In the case of PDF/HTML, you will likely produce multiple documents per file, which is also supported: just `yield my_result` multiple times from map().
It's simply about linking metadata from a JSON file to a corresponding image or video file, like pairing data003.png & data003.json into a single, virtual record. Some formats use this approach: open-image or laion datasets.
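To make the pairing idea concrete, here is a minimal sketch in plain Python. It does not use DataChain's API; `pair_files` and the record keys are hypothetical names for illustration, and it assumes the sidecar always shares the media file's stem.

```python
from pathlib import Path


def pair_files(names):
    """Pair each media file with the .json sidecar that shares its stem.

    Hypothetical helper for illustration only, not part of DataChain.
    """
    by_stem = {}
    for name in names:
        p = Path(name)
        # Group files by stem, e.g. "data003" -> {".png": ..., ".json": ...}
        by_stem.setdefault(p.stem, {})[p.suffix.lower()] = name

    pairs = []
    for stem, files in sorted(by_stem.items()):
        meta = files.get(".json")
        media = sorted(f for ext, f in files.items() if ext != ".json")
        # Only emit a record when both halves of the pair are present.
        if meta and media:
            pairs.append({"key": stem, "file": media[0], "meta": meta})
    return pairs


records = pair_files(["data003.png", "data003.json", "data004.png"])
```

Files without a matching sidecar (data004.png here) are simply skipped in this sketch; a real pipeline might want to log them instead.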
I guess it involves splitting a file into smaller document snippets, getting page numbers and such, and calculating embeddings for each snippet. That's the usual approach. Specific signals vary by use case.
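A rough sketch of that splitting step, in plain Python, might look like the following. It assumes the page texts have already been extracted from the PDF/HTML; `split_pages`, `max_chars`, and the record fields are hypothetical names, and the embedding step is left out because it depends on whichever model you use.

```python
def split_pages(pages, max_chars=200):
    """Yield one record per snippet, with its page number and chunk index.

    `pages` is a list of page texts, assumed to be extracted elsewhere.
    Fixed-size character windows are the simplest possible chunking rule.
    """
    for page_no, text in enumerate(pages, start=1):
        for chunk_no, start in enumerate(range(0, len(text), max_chars)):
            yield {
                "page": page_no,
                "chunk_no": chunk_no,
                "text": text[start:start + max_chars],
                # An embedding for "text" would be computed and attached here.
            }


snippets = list(split_pages(["a" * 250, "b" * 10], max_chars=200))
```

Because the function is a generator, one input file naturally produces many output records, which matches the "multiple documents per file" case mentioned above.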
It's not super clear to me how this interacts with data. If I am using ADLS to store delta tables, and I cannot pull prod to my local machine, can I still use this? Is there a point if I can just look at the delta log to switch between past versions?
DVC is (at least as I use it) pretty much just git LFS with multiple backends (I guess actually a simpler git-annex). It also has some rather MLOps-specific stuff. It is handy if you do versioned model training with changing data on S3.
What makes DVC especially useful for MLOps? Aren’t MLFlow or W&B solving that in a way that’s open source (the former) or just increases the speed and scale massively (the latter)?
DVC is much more basic (feels more unix style), integrates really well with any simple CI/CD scripting with git versioning without the need to set up any additional servers.
And it is not either/or. People actually combine MLFlow and DVC [0]
Speaking of git-annex, there is another project called DataLad (https://www.datalad.org/), which has some overlap with DVC. It uses git-annex under the hood and is domain-agnostic, compared to the ML focus that DVC has.
I've used it for storing rasters alongside georeferencing data in small GIS projects, as an alternative to git LFS. It not only works like git but can integrate with git repos through commit and push/pull hooks, storing DVC pointers and managing .gitignore files while retaining directory structure of the DVC-managed files. It's neat, even if the initial learning curve was a little steep.
We used Google Drive as a storage backend and had to grow out of it to a WebDAV backend, and it was nearly trivial to swap them out and migrate.
Same. I just bought a t480 off eBay a week ago while I already have a m2 pro MBP with 32gb RAM. I don’t know what compelled me but it’s just so cozy. By all measures the Mac is superior. Thinkpads are like the vinyl records of laptops.
I like that. I feel similar. I picked it up locally, and while I was driving over, I thought to myself: why am I even doing this, I have a perfect MacBook. But I'm happy I did it. It feels better in a way. Somehow like vinyl maybe, but not as fragile ;-) And my excuse is: can't run Arch on an M1 MacBook.
Existence is a funny term in Buddhism. The end goal of development stage practice is the recognition that the deity is inseparable from one’s self because the nature of any given deity is equal to our own in an absolute sense. After all, all things are empty of inherent existence and are self-arisen manifestations from a primordial ground of being, hence no you and no I as separate.
So some people do see them as external things to ask things of but others see them as mind created and used as supports for meditation to realize our own true nature. Depends on lineage and teacher.
Science consists of predictions based on measurements of our perception as mediated by our senses or tools that augment them. Dualism (or idealism for that matter) doesn't require religious beliefs. You can arrive at them analytically. Science is great for explaining the conventional reality of things as they appear though. But until I read a convincing explanation for consciousness and subjective experience that doesn't reduce to "neural correlates" I can't get on board with physicalism.
The fact that we're not all unconscious automatons that just evolved from one long cause-effect chain stemming from the big bang to now continues to bewilder me. "Mind" may play an "of the gaps" game here but it's the only reasonable explanation that currently makes some semblance of sense to me.
So currently I agree with the religious on your last point. Machines or AI cannot be conscious. Maybe if they were infused with some sort of biological material though I would say maybe. If our subjective experience is a localization of a thing called "consciousness" and certain forms (humans, animals) localize it to produce a sense of self then why not.
Funnily enough I actually had a conversation with ChatGPT about this and we concluded that conscious decision making / free will is basically a higher order Markov process.
Because life is intrinsically frustrating and everything is impermanent including the satisfaction you derive from material gains. The only way to escape is to stop seeking happiness externally.
Seems like a failure on behalf of management to get so dependent on a single resource vs distributing work more evenly. All eggs in one basket and all that.
The bandaid had to be ripped off at some point. They should have added resources over time, though, to take over pieces bit by bit.
My most common use cases involve getting PDFs or HTML files and I have to parse the metadata to store along with the embedding.
Would I have to run a process to extract file metadata into JSONs for every embedding/chunk? Would the keys be created from the document, as title+chunk_no?
Very interested in this because documents from clients are subject to random changes and I don’t have very robust systems in place.