tinfoil hat theory: they've implanted watermarks already, so that AI-generated text can be flagged for future training runs or offered as a detection service, with some phrases coaxed into becoming statistical beacons.
That's not really a tinfoil hat theory. It's been possible for some years, and OpenAI reportedly does watermark their outputs and can detect it. They just haven't released it as a service because it'd annoy all the users who are using it for cheating :)
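OpenAI hasn't published how their detector would work, but to make the "statistical beacon" idea concrete, here's a toy sketch of the green-list watermarking scheme from the academic literature. All names and parameters here are mine, and a real generator would only softly bias the logits toward green tokens rather than sample exclusively from them:

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(1000)]  # toy vocabulary

def green_list(prev_token, fraction=0.5):
    # Seed an RNG with a hash of the previous token, then mark a
    # pseudo-random half of the vocabulary as "green".
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(VOCAB, int(len(VOCAB) * fraction)))

def generate(length=200):
    # Toy watermarked generator: always picks from the green list
    # implied by the previous token.
    out = ["w0"]
    rng = random.Random(42)
    for _ in range(length):
        out.append(rng.choice(sorted(green_list(out[-1]))))
    return out

def z_score(tokens, fraction=0.5):
    # Detection: count how often each token falls in the green list
    # implied by its predecessor, and compare against the chance rate.
    hits = sum(t in green_list(p) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - fraction * n) / math.sqrt(fraction * (1 - fraction) * n)
```

On watermarked text the z-score is enormous; on ordinary text it hovers near zero, which is why detection works without storing the generated text anywhere.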
Cheap for now. One day, once market shares balance out, the cloud spend will increase. Local LLMs may be worth prioritizing for code that will still be running multiple subscription cycles into the future.
Edit: oh, you wrote best closed-source model, whoops
Here is their safety and warnings section disclosing that. It's really interesting: they're presumably required by law to produce a CVS-receipt-length FDA medicine warning, yet all the dangers listed are the dangers of playing a video game. I think it's pretty cool to see how effectively the FDA's procedures capture your concerns by forcing them to be transparent.
# Indications:
> EndeavorRx is a digital therapeutic indicated to improve attention function as measured by computer-based testing in children ages 8-17 years old with primarily inattentive or combined-type ADHD, who have a demonstrated attention issue. Patients who engage with EndeavorRx demonstrate improvements in a digitally assessed measure, Test of Variables of Attention (TOVA®), of sustained and selective attention and may not display benefits in typical behavioral symptoms, such as hyperactivity. EndeavorRx should be considered for use as part of a therapeutic program that may include clinician-directed therapy, medication, and/or educational programs, which further address symptoms of the disorder.
# Safety:
> No serious adverse events were reported. Of 342 participants who received AKL-T01 in the two clinical trials supporting EndeavorRx authorization for age ranges 8-17, 17 participants (4.97%) experienced treatment-related adverse events (TE-ADE) (possible, probable, likely). TE-ADEs reported at greater than 1% across the studies include: frustration tolerance decreased (2.34%) and headache (1.17%). Other adverse events occurred less than 1% and included dizziness, emotional disorder, nausea, and aggression. All adverse events were transient and no events led to device discontinuation. Across other studies in children and adolescents with ADHD, rates of adverse events were similarly low (<10%) and no Serious Adverse Events have been reported. All reported adverse events across all clinical trials resolved at the end of treatment. Users should consider the totality of evidence presented along with their health care provider when considering incorporating AKL-T01 into their treatment plan.
# Cautions:
> Rx only: Federal law restricts this device to sale by or on the order of a licensed health care provider. EndeavorRx should only be used by the patient for whom the prescription was written. For medical questions, please contact your child’s healthcare provider. If you are experiencing a medical emergency, please dial 911. EndeavorRx is not intended to be used as a stand-alone therapeutic and is not a substitution for your child’s medication.
> If your child experiences frustration, emotional reaction, dizziness, nausea, headache, eye-strain, or joint pain while playing EndeavorRx pause the treatment. If the problem persists contact your child’s healthcare provider. If your child experiences a seizure stop the treatment and contact your child’s healthcare provider.
> EndeavorRx may not be appropriate for patients with photo-sensitive epilepsy, color blindness, or physical limitations that restrict use of a mobile device; parents should consult with their child’s healthcare provider.
> Please follow all of your mobile device manufacturer’s instructions for the safe operation of your mobile device. For example, this may include appropriate volume settings, proper battery charging, not operating the device if damaged, and proper device disposal. Contact your mobile device manufacturer for any questions or concerns that pertain to your device.
When the Trump assassination attempt happened last week and every single post on here was still about computers, that's when I realized this place is different.
It’s no secret that HN is not a site for general news. That’s the first item in the guidelines:
> What to Submit
> (…)
> Off-Topic: Most stories about politics, or crime, or sports, or celebrities, unless they're evidence of some interesting new phenomenon. Videos of pratfalls or disasters, or cute animal pictures. If they'd cover it on TV news, it's probably off-topic.
It is because you’d hear about that anywhere and everywhere else that it doesn’t belong here. Would you complain that a forum about cooking or sharing wallpapers didn’t cover the news as well?
Though it was submitted and discussed anyway, which always happens. You can confirm that with a search for your own keywords.
Isn’t that kind of on purpose though? I think you will get flagged if you just post general news articles. It looks like political posts are only accepted if they have some relation to technology.
I’m glad that my attempts at removing USA politics from my content feeds have been so successful that this is the first time I hear about the Trump assassination attempt.
Now this is truly the programming language we should be using to benchmark LLM code gen with a private held-out set. There are no substantial datasets on the internet or GitHub, and no documentation except the one provided. And that's all the model should need.
I asked GPT-4 to write a matmul function, but that was too ambitious and it spat out outrageous nonsense.
To be fairer, I gave it in-context access to the documentation in the prompt, along with the fibonacci example function; i.e., everything humans have access to. I then asked it for the simpler task of converting a base-10 integer to binary. It was unable to write something error-free even after 4 rounds of being fed the error messages.
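For scale, the task itself is tiny in a mainstream language. A Python sketch of what I was asking for (function name mine):

```python
def to_binary(n):
    # Convert a non-negative base-10 integer to its binary string
    # by repeatedly peeling off the low bit and halving.
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        bits.append(str(n % 2))
        n //= 2
    return "".join(reversed(bits))
```

So the difficulty wasn't the algorithm; it was expressing ten lines of logic in an unseen syntax.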
I repeated this 5 times in case it generated something grammatical within the top-k of 5 attempts.
I suspected there was some confusion about string manipulation that it couldn't surmount. So I changed the question to something still challenging, yet using only function calls, conditional logic, basic math ops, and numbers. First, I asked for an nth-root approximator using Newton's method. Didn't work. Then just the square root. Didn't work. Finally, a function that prints a student's grade given their integer percentage. Not even that.
GPT-4 also persistently hallucinated the keyword BREAKING NEWS, which I think sounds like a pretty good keyword if Tabloid were to ever get error handling.
The spooky part is that almost all the solutions would, at face value, get partial credit. They had the right abstract approach, being familiar with reams of example solutions in natural language and in programming languages. Yet in each case, GPT-4, 4o, and Claude all failed to produce something without syntax errors.
I suspect this is because transformers do subgraph matching: on one end there are rich internal connections for all the problems I requested, but on the other end there is nothing similar enough for the model to get a foothold, hence the biggest struggle being syntax. If the only barrier to executing Tabloid code (or other unseen languages) is more basic syntax training, then it excitingly suggests a model only needs to learn the abstract concepts from leetcode scrapes once, for every syntax it knows. Prior research has shown that grammar is easy for language models: when GPT-2 was made large enough, it went from babbling to grammatical sentences very early in its training, and at that moment its loss plummeted.
All tests conducted in temporary data mode so that this eval stays dark.
DISCOVER HOW TO square_root WITH x, iterations
RUMOR HAS IT
    EXPERTS CLAIM guess TO BE x DIVIDED BY 2

    DISCOVER HOW TO improve_guess WITH current_guess
    RUMOR HAS IT
        SHOCKING DEVELOPMENT
        (current_guess PLUS (x DIVIDED BY current_guess)) DIVIDED BY 2
    END OF STORY

    DISCOVER HOW TO iterate WITH current_guess, remaining_iterations
    RUMOR HAS IT
        WHAT IF remaining_iterations SMALLER THAN 1
            SHOCKING DEVELOPMENT current_guess
        LIES! RUMOR HAS IT
            EXPERTS CLAIM new_guess TO BE improve_guess OF current_guess
            SHOCKING DEVELOPMENT
            iterate OF new_guess, remaining_iterations MINUS 1
        END OF STORY
    END OF STORY

    SHOCKING DEVELOPMENT iterate OF guess, iterations
END OF STORY

EXPERTS CLAIM number TO BE 16
EXPERTS CLAIM num_iterations TO BE 5
YOU WON'T WANT TO MISS 'The square root of'
YOU WON'T WANT TO MISS number
YOU WON'T WANT TO MISS 'is approximately'
YOU WON'T WANT TO MISS square_root OF number, num_iterations
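For reference, here's what that program is doing, hand-translated into Python: the abstract structure (closure over x, recursive iteration, Newton's averaging step) is exactly what the models kept getting right even while fumbling the Tabloid syntax:

```python
def square_root(x, iterations):
    # One Newton step for sqrt(x): average the guess with x/guess.
    def improve_guess(current_guess):
        return (current_guess + x / current_guess) / 2

    # Recurse instead of looping, mirroring the Tabloid version.
    def iterate(current_guess, remaining_iterations):
        if remaining_iterations < 1:
            return current_guess
        return iterate(improve_guess(current_guess),
                       remaining_iterations - 1)

    return iterate(x / 2, iterations)

number = 16
num_iterations = 5
print("The square root of", number, "is approximately",
      square_root(number, num_iterations))
```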
Same, I've been pretty impressed as well and typically give Claude a shot. Sometimes I even pass their results back and forth in an LLM collab so they generate more diverse perspectives. However, this paper from 4 days ago shows that Claude can fall apart quickly on out-of-distribution tasks. If you ask opposite-day questions, GPT-4 is weirdly strong at them (figure 2).
Right, they never claimed to have found a roadmap to AGI, they just found a cool geometric tool to describe how LLMs reason through approximation. Sounds like a handy tool if you want to discover things about approximation or generalization.
hilarious