Hacker News | doppenhe's comments

Could HyParquet's approach be extended to other data formats beyond Parquet?


I definitely think that UX is an underappreciated area for machine learning data. I want to make a set of libraries and tools that make it easier for people to work with ML data in the browser. The first step of good data science is to become one with your data.

I started with Parquet because most datasets for modern LLMs are in Parquet format, but there are other common formats too, like JSONL.
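
hyparquet itself is a JavaScript library aimed at the browser, so purely as an illustration of the format point, here is a minimal Python sketch of that "become one with your data" step for the two formats mentioned above (file names are hypothetical):

    # Illustration only: hyparquet is a browser-side JavaScript library; this
    # Python sketch just shows the same "inspect your data first" step for the
    # two formats mentioned above. File names are hypothetical.
    import pandas as pd  # pandas needs pyarrow or fastparquet for Parquet I/O

    # Columnar Parquet file, e.g. an LLM training shard
    df_parquet = pd.read_parquet("dataset.parquet")

    # Line-delimited JSON (JSONL), one record per line
    df_jsonl = pd.read_json("dataset.jsonl", lines=True)

    for name, df in [("parquet", df_parquet), ("jsonl", df_jsonl)]:
        print(name, df.shape)
        print(df.head())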


Nice, thanks for sharing.


The rapid advancement of large language models (LLMs) like ChatGPT has captured headlines and imaginations. AI systems can now generate remarkably human-like text on any topic with just a few prompts. These behemoths, with their unparalleled capabilities, have forced a reevaluation of governance models. As organizations explore integrating LLMs into business operations, it's crucial to implement governance measures that enable innovation while managing risk. For executives, understanding the transition from traditional machine learning governance to LLM-centric AI governance is essential.


This makes me happy, I was the PM for that feature :)


This is great, thanks for sharing. A key component in evolving FM-based applications is making them feel as deterministic as possible rather than probabilistic. A framework like this would help build trust in the outputs of these FMs. Exciting.


Author here, would love to discuss with the community.


Hi all, creator here. We built this version of our product for dynamic data science teams that just want to be able to deploy, scale, and run their models without worrying about ops. Some more details:

https://algorithmia.com/developers/teams
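
For context, this is roughly what calling a model deployed this way looks like from the Algorithmia Python client; the API key, algorithm path/version, and input below are placeholders, not a real endpoint:

    # Rough sketch using the Algorithmia Python client ("pip install algorithmia").
    # The API key, algorithm path/version, and input are placeholders.
    import Algorithmia

    client = Algorithmia.client("YOUR_API_KEY")

    # Address a deployed model by owner/name/version, then pipe input to it
    algo = client.algo("your_team/your_model/1.0.0")
    response = algo.pipe({"text": "hello world"})

    print(response.result)    # model output
    print(response.metadata)  # timing / content-type info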


Deployment, inference and management can participate in this as well!

Here is the missing part for a total e2e solution: https://github.com/marketplace/actions/algorithmia-ci-cd

(Disclaimer: we built this GitHub Action.)


Hi doppenhe, we have that part already implemented using cml-send-github-check and dvc metrics diff. You can compare the metric you prefer with dvc and then just set the status of the GitHub check, uploading your full report. Of course, you can also fail the workflow as your GitHub Action does, but I think it's more useful to see it as a report in the check.

Disclaimer: I work with CML.
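
A tool-agnostic sketch of the metric gate both comments describe, assuming a simple metrics.json layout and an accuracy threshold (both are placeholders); in practice dvc metrics diff produces the comparison and CML posts it to the GitHub check:

    # Tool-agnostic sketch of the metric gate described above: compare the
    # metric from this run against the baseline and fail the CI job on a
    # regression. The metrics.json layout and threshold are assumptions; in
    # practice "dvc metrics diff" produces the comparison and CML posts the
    # report to the GitHub check.
    import json
    import sys

    THRESHOLD = 0.01  # allowed drop in accuracy before the job fails

    with open("baseline_metrics.json") as f:
        baseline = json.load(f)["accuracy"]
    with open("metrics.json") as f:
        current = json.load(f)["accuracy"]

    print(f"baseline={baseline:.4f} current={current:.4f}")

    if current < baseline - THRESHOLD:
        print("Metric regression detected, failing the workflow.")
        sys.exit(1)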


coooool! going to try this out :)


Algorithmia here. What are you concerned about license-wise? You own all IP, always. There are some restrictions if you choose to commercialize on our service (mostly to guarantee you won't take it down from under users). The system was built for this. Happy to answer questions.


Oh hi! I was looking at the terms on this page (https://algorithmia.com/api_dev_terms)

The Software License section states:

> You do not transfer ownership of the Software to Algorithmia, but you do hereby grant Algorithmia, in its capacity as the provider of the Services, a worldwide, non-exclusive, perpetual, irrevocable, fully paid-up and royalty free license to use and permit others to use the Software (including the source code if made viewable) in any manner and without restriction of any kind or accounting to you, including, without limitation, the right to make, have made, sell, offer for sale, use, rent, lease, import, copy, prepare derivative works, publicly display, publicly perform, and distribute all or any part of the Software and any modifications, derivatives and combinations thereof and to sublicense (directly or indirectly through multiple tiers) or transfer any and all such rights; <and then some stuff about FOSS>

I'm no lawyer but my reading of this is that I own the IP but by using the platform Algorithmia receives a perpetual and irrevocable license to do as they please with the models, even if they're intended to be private.

Please correct me if I'm mistaken! I've been playing around with Algorithmia and quite like it, but that specific part is a bit off-putting and makes me hesitate to put the most important parts of our product on Algorithmia.


It's not used for private models. It's meant to make sure that if someone builds an application based on your models, you won't yank that version out from under them. If you never expose your model (for profit), you can delete it at will and we have no rights.


Curious what kind of infrastructure you have built for deploying, serving, and managing models?


We use TensorFlow Serving (https://www.tensorflow.org/serving) to serve the trained models. We also run Flask to transform the incoming JSON to match the way the data was transformed at training time.
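
A minimal sketch of what that Flask layer might look like, assuming TensorFlow Serving's REST predict endpoint on its default port 8501 and a placeholder model name and feature transform:

    # Minimal sketch of the Flask layer described above: apply the same
    # transformation used at training time, then forward the request to
    # TensorFlow Serving's REST predict endpoint. The model name, port, and
    # transform are placeholders.
    import requests
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    TF_SERVING_URL = "http://localhost:8501/v1/models/my_model:predict"

    def transform(record):
        # Placeholder for the training-time feature transformation
        return [record["feature"] / 255.0]

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()
        instances = [transform(r) for r in payload["records"]]
        resp = requests.post(TF_SERVING_URL, json={"instances": instances})
        resp.raise_for_status()
        return jsonify(resp.json()["predictions"])

    if __name__ == "__main__":
        app.run(port=5000)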

