Could you please elaborate on how you utilize both of them together, and for which specific use case? I'm attempting to gain a better understanding of the hybrid approach.
The key thing is to make ElasticSearch scores "comparable" to Milvus scores. There are lots of ways to do this, but no single good solution. For example, you could calculate the BM25 score offline, or use the TF-IDF score to do some kind of filtering. Again, there's no single perfect answer. You'd have to run a lot of experiments on your own use case and your own data to get the best results.
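To make that concrete, one simple approach is to rescale both result lists to a common range before blending them. A minimal sketch, assuming results come back as {doc_id: raw_score} dicts; the weight alpha and those shapes are just illustrative, not from any particular client library:

def min_max_normalize(scores):
    # Rescale raw scores to [0, 1] so BM25 and vector scores live on the same scale.
    # (If Milvus returns distances where lower is better, flip them first.)
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_merge(bm25_hits, vector_hits, alpha=0.5):
    # bm25_hits / vector_hits: {doc_id: raw_score} from ElasticSearch and Milvus.
    bm25 = dict(zip(bm25_hits, min_max_normalize(list(bm25_hits.values()))))
    vec = dict(zip(vector_hits, min_max_normalize(list(vector_hits.values()))))
    merged = {}
    for doc_id in set(bm25) | set(vec):
        merged[doc_id] = alpha * bm25.get(doc_id, 0.0) + (1 - alpha) * vec.get(doc_id, 0.0)
    # Highest blended score first.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

Even something this small needs the weight and the normalization tuned per dataset.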
Also a lot of tuning needs to be done during all phases:
1) query pre-processing
2) query tokenizing
3) retrieval
4) ranking and reranking
I personally would not trust any universal "hybrid-search" solutions. They're all toy demos.
It usually takes 5-10 good engineers to build a decent search engine/system for any real use case. It also requires a lot of tuning, tricks, and hand-written rules to make things work.
To keep the "memory", do you pass the embeddings along with the new text prompt in an API call? How do you combine embeddings and text prompts? I don't know much about this, sorry if the question sounds silly.
The code below takes a list of questions from an Excel file and answers each one based on the directory I pass in. I use this as a first pass for answering Statements of Work for proposals I write. Usually, I will have a number of different directories that I pass in to 'talk' to different intelligences and get a couple of different answers for each prompt. One is trained on the entire corpus of my past performance, one has a simple document discussing tone and other information, and one is trained on only the SOW itself.
import os

import pandas as pd
# Older llama_index API that exposes GPTSimpleVectorIndex at the top level
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

def excelGPT(dir, excel_file, sheet):
    # My OpenAI key
    os.environ['OPENAI_API_KEY'] = 'sk-~Your open AI Key Here'
    # Working directory for training (the folder of documents to index)
    root_folder = ''
    documents = SimpleDirectoryReader(root_folder).load_data()
    index = GPTSimpleVectorIndex(documents)
    # The questions live in the first column of the given sheet
    file_name = dir + excel_file
    df = pd.read_excel(file_name, sheet_name=sheet)
    answer_array = []
    df_series = df.iloc[:, 0]
    for i, x in enumerate(df_series):
        print("This is the index ", i)
        print(x)
        # Answer each question against the vector index
        response = index.query(x)
        answer_array.append(str(response))
    # Pair each question with its answer and write them out to a document
    zip_to_doc(df_series, answer_array, dir)
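For reference, a call looks something like this (the directory, file name, and sheet name are placeholders, not my real paths):

# Hypothetical usage: answer every question in the first column of 'Sheet1'
excelGPT('/path/to/sow/', 'questions.xlsx', 'Sheet1')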
Hey, is it alright if you explain this in a bit more detail? I've been playing around with llama-index myself. Do you have multiple indices? Or do you run each question through and get multiple responses? Isn't that quite expensive?
Also, how do you deal with the formatting of the various Excel files? I'd love to see the source code for this if you are willing to share.
I've been waiting for a product like this for some time now; I think there is a huge (not yet served) market for this. I've tried to implement something using Cloudflare Workers, but failed, and also tried to use Apollo Cloud through an Apollo Federation server in front of my (non-Apollo Server) API, which failed too.
Some questions:
How does it compare with Apollo Cloud in terms of feature set?
My GraphQL server load is about 20 requests/s on average. At first the pricing looks a little bit intimidating to me, but running the numbers it looks like $500/month, is that right? Hopefully it will offset some of my origin server costs.
What counts as a request? Just requests coming from the "outside", or also calls to purge, for example?
Thanks @hcentelles, that's great to hear and gives us validation that there is a need!
Compared to Apollo Cloud: We're mostly focused on the caching part right now and have a different architecture in terms of where we sit in your stack. Apollo runs a sidecar next to your application; we are a proxy in front of your API.
When it comes to the analytics part - which Apollo rather calls metrics, I think Apollo gives you field-level information, while we for now just have query-level information. However, we are fully server agnostic - you don't need to use Apollo Server. Any GraphQL API works. You just need to switch the URL in your clients. We even have customers just using the analytics part for now and disabling the caching in the beginning.
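To illustrate the "switch the URL" part: a client only changes its endpoint, and the GraphQL request itself stays exactly the same. The URLs below are made up for illustration, not real service endpoints:

import requests

# Before: queries go straight to your GraphQL origin.
# ORIGIN_URL = "https://api.example.com/graphql"
# After: queries go through the caching proxy (hypothetical endpoint).
PROXY_URL = "https://my-service.example-proxy.com/graphql"

query = "{ posts { id title } }"
resp = requests.post(PROXY_URL, json={"query": query})
print(resp.json())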
For the pricing: That is correct - you'd have about 50 million requests a month, so $500. However, the pricing there is not set in stone and we're happy to give you an early discount. Just contact us at support@graphcdn.io.
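To spell out the arithmetic behind that figure: 20 requests/s × 86,400 seconds/day × 30 days ≈ 51.8 million requests/month, so at $500/month that works out to roughly $10 per million requests at that tier.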
Right now only outside requests count as a request, no matter if cached or not. Purging calls might also count in the future.
We "support" batched and persisted queries in the sense that we don't break them, we pass them through to your origin, but we don't currently analyse them. Caching / analytics support for both of them is on our roadmap[0] in the near-term for sure!
Interesting to know. For enterprises we even lower the price per million requests, as the volume is much higher and the enterprise therefore already pays enough overall.
It's been a while now since "El Paquete" became the main distribution channel for online content in Cuba; a lot has been written about this before.
A lesser-known aspect of this topic is the net neutrality issues that this kind of distribution implies. At the end of the day, all the content comes from one mighty anonymous source that downloads and distributes it for a profit, presumably a huge profit. This source is god; he or she has the last word on what gets in and what is left out.
So, since the beginning of "El Paquete" my website revolico.com was included. Revolico's content (classified ads) is like a basic need in a market with almost 100% government control over the retail space (price fixing, availability, etc.). But about a month ago our content was left out, with a note saying it would no longer be available because it had been used for the purposes of "personal and political defamation against the country and its citizens." I was like, WTF? Is this the government infiltrating "El Paquete"? Is it a nasty move by our competitors? Who knows. The problem is that one guy has the power to decide what gets distributed and what does not. This is not good by any means.
Two weeks later, revolico came back to "El Paquete". Everything points to the customers having asked for it, so the producers were forced to include it again.
"El Paquete" is one of the best things that is happen in Cuba digital space right now, but a not centralized version is mandatory to make it less vulnerable to goverment control or other kind of arbitrariness.
Where is revolico hosted? What are your analytics like in a country with so little internet penetration? Does being in El Paquete mean that people are seeing ads that are now weeks or even months old?
The app is hosted in a typical cloud computing environment. The traffic from Cuba is 4M page views monthly. El Paquete gets updated every week, so people are seeing ads that were active the week before. We sell premium listings; our clients ask us about the right timing so their ads get into El Paquete in the first positions.
An accurate snapshot of the state of the internet in Cuba, well written from an American point of view.
As the cofounder of one of the most popular Cuban websites, revolico.com, I've been suffering this since 2007. We launched revolico in December '07; in March '08 the government blocked our IPs, and then, when we circumvented that censorship, they carried out nationwide DNS spoofing.
Nevertheless, revolico is still the #1 classified ads site on the island, way ahead of the government offering; our users do a lot of crazy and creative stuff to get access to the site.
So Cuba, besides having an internet penetration of less than 5%, strongly censors the link, which is even sadder. I predict that access will increase in the near/medium term, but unfortunately the censorship will grow proportionally.
The Cuban government is reluctant to open internet access to the people, despite already having the needed bandwidth through a submarine cable from Venezuela. It is really fascinating how Cubans have developed a highly optimized offline distribution channel to share downloaded content like websites, software, video games, TV shows, and movies, with almost the same consumption patterns as the connected world.
"Telecommunications providers will be allowed to establish the necessary mechanisms, including infrastructure, in Cuba to provide commercial telecommunications and internet services, which will improve telecommunications between the United States and Cuba."
If the Cuban government allows these kinds of companies to do business in or with Cuba, that could be huge. But if it happens, it will probably be very slow, sadly.
Disclosure: I'm the cofounder of some Cuba-related startups: a classified ads site censored by the Cuban government (https://www.youtube.com/watch?v=GUmPkb44n_w) - they block us by IP and DNS, yet despite the censorship revolico is one of the most visited sites in the country, taking into account that Cuba has 5% internet penetration; an atypical remittances platform, https://www.fonoma.com; and a crowdfunding site for Cuban artists, http://www.yagruma.org, shut down by the US government because of the kind of restrictions they are softening today.
You started revolico!? That's awesome, man. I think that project "opened" the minds of a lot of Cuban entrepreneurs. I know a few cool projects over there, and I also know a lot of visual artists trying to start projects that connect the "exile" with the people of the island. As you might know, even among those opposed to the regime, there is a lot of bias against Cubans from America (unless they are family/friends). I'm also Cuban, living in NY, and I'd like to help out with whatever I can. You'll find my email in my profile.
A book that any founder should read. It's a great set of interviews with founders telling relatively unsanitized versions of their startup stories. It serves as a great antidote to the business press's "all winners are perfect geniuses" school of reporting.