Show HN: Chat with your data using LangChain, Pinecone, and Airbyte

jmorgan · on Aug 8, 2023

LangChain supports local LLMs like Llama 2 with Ollama (https://github.com/jmorganca/ollama) as of this morning, in both their Python and Javascript versions:

https://python.langchain.com/docs/integrations/llms/ollama

This can be a great option if you'd like to keep your data local versus submitting it to a cloud LLM, with the added benefit of saving costs if you're submitting many questions in a row (e.g. in batches)

mtricot · on Aug 8, 2023

I am sure we can build something around that. Going to take a look at it. Thanks for mentioning it.

hnhg · on Aug 8, 2023

Thanks. How would this differ from running Llama2 through the Huggingface-Langchain integration? I haven't tried it but it looked like the way to go until you shared this.

zwaps · on Aug 8, 2023

This one is for Mac

wahnfrieden · on Aug 9, 2023

Huggingface launched swift/mac today or in recent days

zarazas · on Aug 10, 2023

Can llama 2 also be used to create the embeddings?

bestcoder69 · on Aug 9, 2023

How well does it work?

bnchrch · on Aug 8, 2023

Always so happy to see a tutorial with actual substance.

So much on LLMs lately is mostly blog spam for SEO but this actually is information dense and practical. Definitely bookmarking this for tonight.

Also really happy to see a bonus section on pulling in data from third party websites. I think this is where LLMs get really interesting. Not only is data much easier to query with these new models, its also orders of magnitude easier to ingest from traditionally malformated sources.

BoorishBears · on Aug 8, 2023

It's a little weird writing a comment worded like this if you work at Airbyte isn't it?

bnchrch · on Aug 8, 2023

You know what, reading it, this is a very fair callout. Im sorry.

Last week I had seen a draft of this and thought to myself all these same things.

It was a well written tutorial, and that it was exactly the type of format I wouldve hopped for as a engineer. To the point, with a lot of examples.

It really is the same style I like to write in. So when I saw we were starting to share this I wanted to support what I thought was great work and writing.

But it was really unfair not to put a disclaimer on that I do work here.

Honestly, really sorry.

Also I owe Michel an apology because I dont think he realized either.

kaveet · on Aug 8, 2023

That's disappointing to see, and posted within 10 minutes no less.

bnchrch · on Aug 8, 2023

Hey Kaveet, I dont know who have all seen the original post but I know you have so I wanted to say I'm sorry. This was my fault. I wanted to support good work as a member of the engineering community but I shouldve added a disclaimer about employment.

hubraumhugo · on Aug 9, 2023

You're right, LLMs are really good at extracting and structuring data from third party sources. I've been working on a "Zapier for data extraction" for this reason: https://kadoa.com

omarfarooq · on Aug 9, 2023

Looks great! Does it work behind a login?

In any case, just signed up for the waitlist. Would love to get bumped up if possible!

bnchrch · on Aug 8, 2023

A late but still very applicable disclaimer: I do work here!

My original comment had the intent of my own personal opinion as someone who reads/writes tutorials as fun.

But I owe an apology to the readers because I did not add any disclosure, and honestly I shouldve

mtricot · on Aug 8, 2023

Let me know how that works out for you and if you would add anything to this tutorial!

wanderingmind · on Aug 9, 2023

When there are so many awesome FOSS vector databases available, I wonder what motivated the airbyte team to use Pinecone, the one database that is anti-FOSS?

ramesh31 · on Aug 8, 2023

I don't know what datasets you guys are working with that have no issues being shared in plain text across three separate proprietary paid services, but this is a nonstarter for me.

mtricot · on Aug 8, 2023

When reading the tutorial, we are describing one stack to build a specific app. But the stack is made of building blocks that you can replace with others if you need to.

- Airbyte has two self-hosted options: OSS & Enterprise

- Langchain: OSS

- OpenAI: you can host an OSS model if you want to

- Pinecone: there are OSS/self-hosted alternatives

samspenc · on Aug 9, 2023

> - OpenAI: you can host an OSS model if you want to

Just to confirm: you mean models like Facebook's Llama 2 and variants right? Since OpenAI hasn't released any OSS models.

mtricot · on Aug 9, 2023

correct

zarazas · on Aug 10, 2023

What about the embedding?

_pdp_ · on Aug 8, 2023

A fantastic starting point for beginners! Personally, I believe this tutorial provides a solid foundation, but there's so much more to explore. Building something truly effective involves tackling various nuanced situations and special cases. While querying records in Pinecone can sometimes give you the right results, it can also be a bit unpredictable, depending on what and how you query. You might want to check out options like Weaviate, or even delve into the world of sparse indexes for an added layer of complexity. The models themselves have their own quirks too. For example, GPT3.5 Turbo tends to respond well when given clear instructions at the beginning of the context, while GPT4, although more flexible, still comes with its own set of challenges. Despite this, I'm genuinely excited about the push to highlight the potential of LLM applications (more of that, please!). Just remember, while tutorials like this are a great step, achieving seamless results might require some hands-on experience and learning along the way.

mtricot · on Aug 8, 2023

Thanks! I agree with your point. There is a lot of tuning that needs to happen, including context aware splitting and any other kind of transformation before the unstructured data gets indexed. This is one of the big challenge of productionizing LLM apps with external data. So far we are using internally since the team as experience dealing with building these connectors and that becomes a great co-pilot.

The great thing we get by plugging this whole stack together is that we get all the refreshed data as more issues/connectors get created.

rahimnathwani · on Aug 9, 2023

I'm curious: did you have ChatGPT lightly edit this comment before posting? A few things about the style (like the final sentence) sound similar to GPT-4 output.

fakedang · on Aug 9, 2023

We are reaching peak HN. Animated discussions on everything by chatbots, while the humans are the lurkers.

replwoacause · on Aug 9, 2023

Not the person you’re asking but yeah…that looks likes ChatGPT

oefnak · on Aug 9, 2023

Just remember ChatGPT always likes to end with an unsolicited warning and a smile. :-)

sandGorgon · on Aug 9, 2023

hi folks, when will you have pgvector as a destination ? we (https://github.com/arakoodev/edgechains) work with a lot of enterprises and they would not move away from using redis or pgvector even as their vector store. Is there a way where we can leverage that ?

Second, for a LOT of enterprises, they want to use non-openai embedding models (minilm, GTE, BGE), will you support that. For e.g. in Edgechains we natively support BGE and minilm. Would you be able to support that ?

amanivan · on Aug 20, 2023

This is cool, I would like access to the code contents as well, not just the issue. Is that possible with airbyte? If so, how?

anupsurendran · on Aug 9, 2023

I feel that there are too many moving pieces here especially for prototyping. There was a much more simpler app recently I took a look at on a recent hackernews post : https://news.ycombinator.com/item?id=36894142

They still have work to do with different connectors (e.g. PDF etc) but the realtime simple document pipeline is what helps a lot.

gz5 · on Aug 8, 2023

Very well written and illustrated, thank you.

When using a local vector db, what is the security model between my data and Airbyte? For example, do I need to permit Airbyte IPs into my enviro, and is there a VPN type option for private connectivity?

mtricot · on Aug 8, 2023

It depends.

Airbyte comes in 3 flavors: OSS, Cloud, Enterprise.

For OSS & Enterprise, data doesn't leave your infra since Airbyte is running in your infrastructure. For Cloud, you would have to allow some IPs to allow us to access your local db.

r_thambapillai · on Aug 8, 2023

How are you thinking about preventing customer PII making it to OpenAI?

mtricot · on Aug 8, 2023

For the purpose of the tutorial that we built, it really comes down to the type of data that you're using.

If you have data with PII:

One option would be to use Airbyte and bring the data into files/local db rather than directly to the vector store, add an extra step that strips the data from all PII and then configure Airbyte to move the clean file/record to the vector store.

The option that jmorgan mention is relevant here, using a "self-hosted" model.

frankfrank13 · on Aug 8, 2023

This is always the first good question to ask about any chat bot IMO

swyx · on Aug 8, 2023

congrats team!

what was the thinking behind choosing to support "Vector Database (powered by LangChain)" instead of directly supporting Pinecone Chroma et al directly as you do in the other destinations? when is direct integration the right approach vs when is it better to have an (possibly brittle, but faster time to market) integration of an integration?

mtricot · on Aug 8, 2023

Great question :) We want to get to value as fast as possible. I am certain that at some point we will need to go deeper with those integrations and they will likely require to be separate destinations. It will also depend on how they differentiate from each others, we will need more granularity with configurations.

zarazas · on Aug 10, 2023

I ak playing around with langchain the last days as well and when I checked right all langchain is really doing for you is giving you a guideline about recommended steps for a vector assisted LLM. In your example it actually just adds some text to the prompt like: "Answer the following question with the context provided here, If you dont find the right info dont make something up" sth. Along those lines

mritchie712 · on Aug 8, 2023

have you considered supporting pgvector? I'd imagine that'd be easier since you already have pg as a destination.

mtricot · on Aug 8, 2023

On the roadmap! We want to get more clarity on how to fit the Embedding part in the ELT model. Once we figure it out we will add it to PG.

rschwabco · on Aug 8, 2023

A version supporting Pinecone directly is coming soon!

amelius · on Aug 8, 2023

I like to keep my tools simple so just give me a single AI that can do everything, browse through my data, generate pictures and give me suggestions in my code editor, etc. etc., instead of a different AI for every tool out there.

mtricot · on Aug 8, 2023

Isn't it the dream? Today there is a lot of stack that needs to be built to enable what you're describing. This is actually what we are doing with that post. What foundations do we need to build so that the UX for the end user is what you're describing. Will take some time to get there :)

kingforaday · on Aug 8, 2023

The next great debate. MonolithicAI vs Micro-serviceAI.

johndhi · on Aug 8, 2023

How large of a dataset can I submit? I have hundreds of thousands of words of text.

mtricot · on Aug 8, 2023

Shouldn't have any limits here. Can you let us know how it goes?

johndhi · on Aug 8, 2023

hmm, as a person of low technical savvy, do you expect there will be a point at which I can upload a large text file and have you do all the work to let me chat with it? I'd pay for that today if it exists, but can't put a ton of effort into building/implementing something myself.

tomr75 · on Aug 9, 2023

chatpdf..?

johndhi · on Aug 17, 2023

chatpdf doesn't support my volume -- files are too big.

zby · on Aug 8, 2023

So I guess after all those discussions we are still stuck with LangChain for everything to do with LLMs.

BoorishBears · on Aug 8, 2023

I spend all day talking to people shipping AI products and approximately zero of them actually use LangChain.

LangChain doesn't make sense for a ton of reasons, but the top few are the code quality being horrid, the scope being ill defined, and the fact that most of the tasks it does are better done with a prompt that was designed for your exact use case.

electrondood · on Aug 9, 2023

We're actually shipping "AI products" and love LangChain.

Your criticism doesn't line up with anything in my experience. Nearly all of the prompts you use are your own, and you can customize any of the prompts used under the hood for chains like routing.

electrondood · on Aug 9, 2023

We're using it in production for several products, and are quite happy with it.

everythingmeta · on Aug 8, 2023

nice to see a tutorial that recognizes the case where the underlying data can change and the embedding needs to be updated.

Any plans to write a tutorial for fine-tuning local models?

mtricot · on Aug 8, 2023

Not at the moment but let me bring that to the team so we can brainstorm what it could look like.

croes · on Aug 8, 2023

Why is the OpenAI from the article title missing?

mtricot · on Aug 8, 2023

No good reason. Does "it made the post's title too long" work?

replwoacause · on Aug 9, 2023

Works for me!