Show HN: Chat with your data using LangChain, Pinecone, and Airbyte (airbyte.com)
220 points by mtricot on Aug 8, 2023 | 59 comments
Hi HN,

A few of our team members at Airbyte (and Joe, who killed it!) recently played with building our own internal support chatbot, using Airbyte, LangChain, Pinecone, and OpenAI, that answers any questions we have when developing a new connector on Airbyte.

As we prototyped it, we realized it could be applied to many other use cases and sources of data, so... we created a tutorial that other community members can leverage [http://airbyte.com/tutorials/chat-with-your-data-using-opena...] and a GitHub repo to run it [https://github.com/airbytehq/tutorial-connector-dev-bot].

The tutorial shows:

- How to extract unstructured data from a variety of sources using Airbyte Open Source

- How to load the data into a vector database (here Pinecone), preparing it for LLM usage along the way

- How to integrate a vector database into ChatGPT to ask questions about your proprietary data
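
If you want a feel for the query side before diving in, here is a minimal sketch (the index name and question are made up; the tutorial and repo have the real pipeline):

    import pinecone
    from langchain.chat_models import ChatOpenAI
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Pinecone
    from langchain.chains import RetrievalQA

    # Assumes an Airbyte sync already populated a Pinecone index;
    # "connector-dev" is just a placeholder name.
    pinecone.init(api_key="YOUR_KEY", environment="YOUR_ENV")
    vectorstore = Pinecone.from_existing_index("connector-dev", OpenAIEmbeddings())

    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
        chain_type="stuff",
        retriever=vectorstore.as_retriever(),
    )
    print(qa.run("How do I add a new stream to a connector?"))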

I hope some of it is useful, and would love your feedback!




LangChain supports local LLMs like Llama 2 with Ollama (https://github.com/jmorganca/ollama) as of this morning, in both their Python and JavaScript versions:

https://python.langchain.com/docs/integrations/llms/ollama

This can be a great option if you'd like to keep your data local versus submitting it to a cloud LLM, with the added benefit of saving costs if you're submitting many questions in a row (e.g. in batches).
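
In Python it's just a couple of lines, assuming you've already pulled the model and the Ollama server is running locally:

    from langchain.llms import Ollama

    # Talks to the local Ollama server (default: http://localhost:11434).
    llm = Ollama(model="llama2")
    print(llm("Why is the sky blue?"))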


I am sure we can build something around that. Going to take a look at it. Thanks for mentioning it.


Thanks. How would this differ from running Llama 2 through the Hugging Face LangChain integration? I haven't tried it, but it looked like the way to go until you shared this.


This one is for Mac


Hugging Face launched Swift/Mac support today or in the last few days.


Can llama 2 also be used to create the embeddings?


How well does it work?


Always so happy to see a tutorial with actual substance.

So much of what's written on LLMs lately is blog spam for SEO, but this is actually information-dense and practical. Definitely bookmarking this for tonight.

Also really happy to see a bonus section on pulling in data from third-party websites. I think this is where LLMs get really interesting. Not only is data much easier to query with these new models, it's also orders of magnitude easier to ingest from traditionally malformatted sources.


It's a little weird writing a comment worded like this if you work at Airbyte, isn't it?


You know what, reading it back, this is a very fair callout. I'm sorry.

Last week I saw a draft of this and thought all of these same things to myself.

It was a well-written tutorial, exactly the type of format I would've hoped for as an engineer. To the point, with a lot of examples.

It really is the same style I like to write in. So when I saw we were starting to share this, I wanted to support what I thought was great work and writing.

But it was really unfair not to add a disclaimer that I do work here.

Honestly, really sorry.

Also, I owe Michel an apology, because I don't think he realized either.


That's disappointing to see, and posted within 10 minutes no less.


Hey Kaveet, I don't know who else has seen the original post, but I know you have, so I wanted to say I'm sorry. This was my fault. I wanted to support good work as a member of the engineering community, but I should've added a disclaimer about my employment.


You're right, LLMs are really good at extracting and structuring data from third party sources. I've been working on a "Zapier for data extraction" for this reason: https://kadoa.com


Looks great! Does it work behind a login?

In any case, just signed up for the waitlist. Would love to get bumped up if possible!


A late but still very applicable disclaimer: I do work here!

My original comment was intended as my own personal opinion, as someone who reads and writes tutorials for fun.

But I owe an apology to the readers because I did not add any disclosure, and honestly I should've.


Let me know how that works out for you and if you would add anything to this tutorial!


When there are so many awesome FOSS vector databases available, I wonder what motivated the Airbyte team to use Pinecone, the one database that is anti-FOSS?


I don't know what datasets you guys are working with that have no issues being shared in plain text across three separate proprietary paid services, but this is a nonstarter for me.


In the tutorial we describe one stack to build a specific app. But the stack is made of building blocks that you can replace with others if you need to (rough sketch after the list):

- Airbyte has two self-hosted options: OSS & Enterprise

- Langchain: OSS

- OpenAI: you can host an OSS model if you want to

- Pinecone: there are OSS/self-hosted alternatives
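
For example, an all-OSS swap might look roughly like this (Chroma, a sentence-transformers model, and Ollama are just illustrative picks, not the tutorial's stack):

    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import Chroma
    from langchain.llms import Ollama
    from langchain.chains import RetrievalQA

    # Local embeddings instead of OpenAI, Chroma instead of Pinecone,
    # and a local Llama 2 via Ollama instead of the OpenAI API.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    vectorstore = Chroma(persist_directory="./chroma", embedding_function=embeddings)

    qa = RetrievalQA.from_chain_type(
        llm=Ollama(model="llama2"),
        retriever=vectorstore.as_retriever(),
    )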


> - OpenAI: you can host an OSS model if you want to

Just to confirm: you mean models like Facebook's Llama 2 and variants right? Since OpenAI hasn't released any OSS models.


correct


What about the embedding?


A fantastic starting point for beginners! Personally, I believe this tutorial provides a solid foundation, but there's so much more to explore. Building something truly effective involves tackling various nuanced situations and special cases.

While querying records in Pinecone can sometimes give you the right results, it can also be a bit unpredictable, depending on what and how you query. You might want to check out options like Weaviate, or even delve into the world of sparse indexes for an added layer of complexity. The models themselves have their own quirks too. For example, GPT-3.5 Turbo tends to respond well when given clear instructions at the beginning of the context, while GPT-4, although more flexible, still comes with its own set of challenges.

Despite this, I'm genuinely excited about the push to highlight the potential of LLM applications (more of that, please!). Just remember, while tutorials like this are a great step, achieving seamless results might require some hands-on experience and learning along the way.
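
To make the instruction-placement point concrete, something like this (pre-1.0 openai client; the prompt wording and variables are just examples):

    import openai

    question = "How do I add a new stream to a connector?"
    context = "..."  # whatever your vector search returned

    # GPT-3.5 Turbo tends to follow instructions placed up front,
    # so they go in the system message rather than at the end.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer only from the provided context. "
                "If the answer isn't there, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    print(response["choices"][0]["message"]["content"])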


Thanks! I agree with your point. There is a lot of tuning that needs to happen, including context-aware splitting and other kinds of transformation before the unstructured data gets indexed. This is one of the big challenges of productionizing LLM apps with external data. So far we are using it internally, since the team has experience building these connectors, and it makes a great co-pilot.
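
For instance, a context-aware splitting step could look like this (file name and chunk sizes are illustrative, not what we use internally):

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    raw_text = open("connector_docs.md").read()  # placeholder input

    # Split on paragraph/sentence boundaries first, so each chunk stays
    # coherent enough to embed and retrieve on its own.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.create_documents([raw_text])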

The great thing we get by plugging this whole stack together is that the data stays refreshed as more issues/connectors get created.


I'm curious: did you have ChatGPT lightly edit this comment before posting? A few things about the style (like the final sentence) sound similar to GPT-4 output.


We are reaching peak HN. Animated discussions on everything by chatbots, while the humans are the lurkers.


Not the person you're asking, but yeah... that looks like ChatGPT.


Just remember ChatGPT always likes to end with an unsolicited warning and a smile. :-)


Hi folks, when will you have pgvector as a destination? We (https://github.com/arakoodev/edgechains) work with a lot of enterprises, and they won't move away from using Redis or pgvector as their vector store. Is there a way we can leverage that?

Second, a LOT of enterprises want to use non-OpenAI embedding models (MiniLM, GTE, BGE). Will you support that? E.g., in Edgechains we natively support BGE and MiniLM (roughly the LangChain equivalent sketched below). Would you be able to support that?
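
For reference, the LangChain-side equivalent of what we mean would be something like this (connection string and texts are placeholders):

    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores.pgvector import PGVector

    # MiniLM embeddings stored in Postgres via pgvector,
    # instead of OpenAI embeddings in Pinecone.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    store = PGVector.from_texts(
        texts=["some chunk of text"],  # placeholder documents
        embedding=embeddings,
        connection_string="postgresql+psycopg2://user:pass@localhost:5432/vectors",
    )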


This is cool. I would like access to the code contents as well, not just the issues. Is that possible with Airbyte? If so, how?


I feel there are too many moving pieces here, especially for prototyping. There was a much simpler app I took a look at in a recent Hacker News post: https://news.ycombinator.com/item?id=36894142

They still have work to do on different connectors (e.g. PDF, etc.), but the simple realtime document pipeline helps a lot.


Very well written and illustrated, thank you.

When using a local vector db, what is the security model between my data and Airbyte? For example, do I need to permit Airbyte IPs into my environment, and is there a VPN-type option for private connectivity?


It depends.

Airbyte comes in 3 flavors: OSS, Cloud, Enterprise.

For OSS & Enterprise, data doesn't leave your infra, since Airbyte runs in your infrastructure. For Cloud, you would have to allowlist some IPs so we can access your local db.


How are you thinking about preventing customer PII from making it to OpenAI?


For the purpose of the tutorial that we built, it really comes down to the type of data that you're using.

If you have data with PII:

One option would be to use Airbyte to bring the data into files/a local db rather than directly into the vector store, add an extra step that strips all PII from the data, and then configure Airbyte to move the clean files/records to the vector store.
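
As a toy illustration of that extra step (a real pipeline would use a proper PII detector like Presidio rather than regexes):

    import re

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def scrub(text):
        # Redact anything that looks like an email or phone number
        # before the record moves on to the vector store.
        text = EMAIL.sub("[EMAIL]", text)
        return PHONE.sub("[PHONE]", text)

    records = ["Contact jane@example.com or +1 555 123 4567"]  # local Airbyte output
    clean_records = [scrub(r) for r in records]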

The option that jmorgan mentioned is relevant here: using a "self-hosted" model.


This is always the first good question to ask about any chat bot IMO


congrats team!

What was the thinking behind choosing to support "Vector Database (powered by LangChain)" instead of supporting Pinecone, Chroma, et al. directly, as you do with the other destinations? When is direct integration the right approach, and when is it better to have a (possibly brittle, but faster time to market) integration of an integration?


Great question :) We want to get to value as fast as possible. I am certain that at some point we will need to go deeper with those integrations, and they will likely need to be separate destinations. It will also depend on how they differentiate from each other; we will need more granularity with configurations.


I am playing around with LangChain these last few days as well, and if I understood it right, all LangChain is really doing for you is giving you a guideline about recommended steps for a vector-assisted LLM. In your example it actually just adds some text to the prompt, like: "Answer the following question with the context provided here. If you don't find the right info, don't make something up." Something along those lines.
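
For reference, the default QA prompt it stuffs the retrieved context into looks roughly like this:

    from langchain.prompts import PromptTemplate

    # Approximately LangChain's default "stuff" QA prompt.
    template = """Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    {context}

    Question: {question}
    Helpful Answer:"""

    prompt = PromptTemplate(template=template, input_variables=["context", "question"])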


have you considered supporting pgvector? I'd imagine that'd be easier since you already have pg as a destination.


On the roadmap! We want to get more clarity on how the embedding step fits into the ELT model. Once we figure that out, we will add it to PG.


A version supporting Pinecone directly is coming soon!


I like to keep my tools simple so just give me a single AI that can do everything, browse through my data, generate pictures and give me suggestions in my code editor, etc. etc., instead of a different AI for every tool out there.


Isn't it the dream? Today there is a lot of stack that needs to be built to enable what you're describing. That is actually what we are doing with this post: figuring out what foundations we need to build so that the end user gets the UX you're describing. It will take some time to get there :)


The next great debate. MonolithicAI vs Micro-serviceAI.


How large of a dataset can I submit? I have hundreds of thousands of words of text.


Shouldn't have any limits here. Can you let us know how it goes?


hmm, as a person of low technical savvy, do you expect there will be a point at which I can upload a large text file and have you do all the work to let me chat with it? I'd pay for that today if it exists, but can't put a ton of effort into building/implementing something myself.


chatpdf..?


chatpdf doesn't support my volume -- files are too big.


So I guess after all those discussions we are still stuck with LangChain for everything to do with LLMs.


I spend all day talking to people shipping AI products and approximately zero of them actually use LangChain.

LangChain doesn't make sense for a ton of reasons, but the top few are the code quality being horrid, the scope being ill-defined, and the fact that most of the tasks it does are better done with a prompt designed for your exact use case.


We're actually shipping "AI products" and love LangChain.

Your criticism doesn't line up with anything in my experience. Nearly all of the prompts you use are your own, and you can customize any of the prompts used under the hood for chains like routing.


We're using it in production for several products, and are quite happy with it.


Nice to see a tutorial that recognizes the case where the underlying data can change and the embeddings need to be updated.

Any plans to write a tutorial for fine-tuning local models?


Not at the moment but let me bring that to the team so we can brainstorm what it could look like.


Why is the "OpenAI" from the article title missing?


No good reason. Does "it made the post's title too long" work?


Works for me!



