Hacker News new | past | comments | ask | show | jobs | submit | clay_the_ripper's comments login

This is totally unsurprising, but an interesting example of how people are expected to conform to “normal” behaviors. People that fall outside this narrow definition of what’s acceptable are still labeled “deviants” which is basically what the university is saying.

Also, the first amendment applies to the government restricting speech. A university is not the government so they are free to say that to be employed by the university you must not also do porn.

Bold moves by this guy, way to go in not being ashamed of who you are.


I think to a university administrator, money is paramount. So they are worried about donors, they are worried about alumni, they are worried about state and federal funds, and they are worried about the attitudes of the parents of prospective students.

Free speech on campus is a very distant concern, to be taken out and dusted off when there’s no real money at stake.


>Bold moves by this guy, way to go in not being ashamed of who you are.

Is porn star an identity? It's a job. Some people sort trash, others fill Jira tickets, and others record themselves nude. Like drug dealing, it's highly profitable because the risk is built into the compensation. There ain't no such thing as a free lunch. If our society were A-OK with this, there would be no money in it.


It can be both right? Some people get paid as software developers because they love to build things. Some people post themselves having sex because they love exhibition or because they consider it art. Either way, what should it matter? It’s not illegal and nobody is forcing you to watch it.


I don't see why he would be ashamed? It's a calculated risk is all I'm saying. He knew maybe in the back of his mind it could blow up.

>what should it matter?

University donors don't like it. That's all. They have the right to feel that way.


I don’t find that to be a compelling argument. Donors, investors, employers, and governments should not have a right to tell us how to behave in our personal lives. If he was recording his videos with university resources, on university time, or while claiming to represent the university then sure. Otherwise it’s just people trying to enforce their version of morality on others, and we should all reject that.


Do you think the Catholic Church doesn’t have the right to force their version of morality on priests? Grow up.


> Also, the first amendment applies to the government restricting speech. A university is not the government so they are free to say that to be employed by the university you must not also do porn.

That's only true for private universities. Public universities are bound by the first amendment, and there are many examples of it. One example from my university happened in 2020, when a student posted a disgusting tweet about George Floyd, but the university knew it could not do anything to the student that would be considered punishment: https://fox4kc.com/news/kansas-state-will-not-expel-student-...


This is a shame that many gay people in professional spaces have to deal with every day. Sure sexual orientation is a protected class. But who on the HR team, the C-suite, your senior engineering team, etc. secretly thinks you’re a disgusting sexual deviant? That your voice is “unprofessional”? There are activities that are completely normal in gay circles that would get you put on HR’s shortlist if you ever mentioned them. Makes the whole “bring your whole self to work” business laughable. Only if your whole self happens to be completely white-bread WASPy Lexapro-laden cold soup.


Is that set of activities specifically related to sex/love, or are they more of a circle/subculture thing? Heterosexuals can engage in a variety of loving and erotic activities and some of them would be looked on with disapproval in a vanilla work world, even if they're not super unusual. Gays might have activities that are only possible because they're with the same sex, or they might have activities that heteros could also do but don't. (I'm going to self-censor on asking whether activity X or activity Y would count in your assessment.)


Your point was lost when you condemned individuals on Lexapro. Are you suggesting that individuals on Lexapro are lesser? Worse?


China's strategy has now changed: they used exports to build massive growth in their economy, but they are now using their exports as a kind of economic weapon that can be used in much the same way that Amazon crushed their competitors by artificially lowering prices to put others out of business...then raising the prices once they were the only game in town. China is doing something similar by making their currency artificially cheap to boost exports and by subsidizing production. Developed nations are taking note and building up their own industries again that were largely decimated by offshoring. Without protectionism, Chinese companies will threaten domestic businesses that are critical to the functioning of our economy. China is trying to save its ailing economy by dangling ever-cheaper goods to the world, but the world should not fall for this ploy.


Netlify for the front end, Google Cloud for extended storage, and Akamai for hosting. Akamai is great as they are less customizable than AWS (which is a good thing if you don’t need extra config) and much less expensive.


I can relate to this, especially on a side project where there is no one looking over your shoulder or holding you to a deadline, it’s very easy to spend too much time on the wrong things.

The stress you feel is the feeling of not having enough time to actually do the thing you want to do. But it’s important to recognize that the reason you don’t have enough time is that the thing you are working on will not move the needle on the side project.

Deep down you know this, but it doesn’t matter because it’s hard to stop when you get stressed and emotionally dysregulated. This is a very common trait in ADHD.

When I start to feel that stress, I typically will try to force myself to step away from whatever I am doing and take a five minute break. After I take a break if I still want to work on whatever it is, I force myself to put an end time on it, and also to relate the task back to a larger goal that will move the needle on my project. If I can’t do either of those things, then I stop the task. Hope it can help, it’s not easy.

See if you can get yourself obsessed with working on things that will move the needle like a new feature instead of rewriting old code.


This is something I would use - not to steal but so I could listen to certain podcasts for sleep. The host intros and ads for this certain podcast are terribly distracting from what is otherwise incredible content.


Another use case is relistening to old podcasts - about half the stuff I listen to is historical, which lends itself well to relistening 10 years later. The content may well still be relevant, but the ads surely are not.

The other podcasts I listen to are current and mostly ad supported - this isn't great for them.


Right now someone, somewhere, is thinking of a way to use this to identify the ad portions in old podcasts and overlay new, perpetually relevant, dynamic ads.


This is already being done. Same as geo targeting dynamically inserted ads based on the IP of the downloading client.


They use the IP! That explains why I was getting Spanish ads for a few days after returning home.


I lived in CA and got advertisements for Ralphs. I spent some time in Sedona and all of a sudden I got the same advertisements, same jingle etc, but instead of Ralphs it was Fry's. That definitely threw me off.


My podcast downloader is always on an IP from another country, I don't mind the ads nearly as much when I can't understand a word they're saying!


I'm fairly sure that most podcast platforms now include dynamic ad segments - I listen to a few, and whilst some (usually where the hosts record ads themselves) seem to have static ads, others definitely have ads which are updated automatically.


It's a shame - other countries' ads often strike me as quaint, hilariously awful, at best amusingly weird, and I'm willing to tolerate them to a point. But when an ad breaks in that's obviously targeted based on my location it feels ingratiating, dirty, offensive.


Yeah for sure. But is it being done for the huge back catalog of much older podcasts? Maybe the big players already have teams of people hunting for old ad breaks, but this could alleviate the burden.


Not listening to ads is never stealing


Right? Every player comes with those "skip forward 30s" buttons, and surely the creators and advertisers are aware that many folks tap that button anyway.


I'm hoping that now that Google Podcasts has moved to YouTube Music, where episodes can be added to YouTube playlists, podcast listeners are going to figure out that SponsorBlock applies. I wonder if that's also why a bunch of my podcasts haven't migrated to YouTube.


My experience as a bootstrapped saas founder:

-at the early stages, usage models didn’t work for us because it attracted customers who were uncommitted to using the product

-the advantage of a subscription is that it filters out people who just want to use something occasionally (which is fine once you’re huge, but at the beginning you want people who are committed)

-I found it impossible to make enough money on the usage model. It’s maybe great for users but if you go out of business your users won’t be able to use it anyway

-we do automated real estate reports. When we charged per report, people ran far fewer reports and tended to be stingy about running them. We wanted to encourage usage, so an unlimited subscription made users happier, even if it meant their per-report cost was higher. Also, we got tired of people mentally computing the cost of each report. Since the margins per report are good, an unlimited subscription works much better.

-once you reach a certain size having free and lower usage tiers can be a good thing but first you need a bedrock of recurring revenue that you can count on. Unless you have boatloads of VC money to burn.

-this will vary greatly depending on the specifics of your product. For example, developer tools that are easy to test out, but require a credit card to go to production make a lot of sense, but that was not us.

-if you can sign up very large customers that you know will use it a lot, usage models can make sense. But this requires a really really good product that already has PMF and requires that you can sell to mid market or enterprise customers and the sales cycles can be long. You can easily go out of business waiting to sign those customers.


> the advantage of a subscription is that it filters out people who just want to use something occasionally

One way around this is to require a subscription but the user gets the amount paid for the subscription as "free" usage and then they pay the normal rate for usage above that amount.
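That hybrid rule is easy to express. Here's a minimal sketch of the billing logic, where the $50 fee and usage amounts are purely hypothetical numbers:

```python
def monthly_bill(subscription_fee: float, metered_usage: float) -> float:
    """Hybrid pricing: the subscription fee is always charged, usage up
    to that amount is treated as included, and only the excess is
    billed on top at the normal usage rate."""
    overage = max(0.0, metered_usage - subscription_fee)
    return subscription_fee + overage

# A $50/mo plan with $30 of usage bills $50 (usage fully covered);
# the same plan with $120 of usage bills $50 + $70 overage = $120.
print(monthly_bill(50.0, 30.0))   # 50.0
print(monthly_bill(50.0, 120.0))  # 120.0
```

This keeps the filtering effect of a committed monthly payment while light users still feel they got what they paid for.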


It sounds like you didn't limit the number of report runs in your subscription model, is that correct?


Tinfoil hat time. The recent gpt2 chatbot that everyone thought was a new OpenAI product - could it be?

“ You start with the gpt2.c pure CPU implementation, and see how fast you can make it by the end of the course on GPU, with kernels only and no dependencies.”

Remarkably similar nomenclature. I give it a 1% chance this is related. I did play with that chatbot and it was smarter than GPT-4, whatever it was.


It does seem like this is a possible method to test if an LLM has your data in it.

People have found other ways to do that of course, but this is pretty clever.


Not necessarily. This also uncovers the weakness of the NYT lawsuit.

Imagine in your corpus of training data is the following:

- bloga.com: "I read in the NYT that 'it rains cats and dogs twice per year'"

- blogb.com: "according to the NYT, 'cats and dogs level rain occurs 2 times per year."

- newssite.com: "cats and dogs rain events happen twice per year, according to the New York Times"

Now, you chat with an LLM trained on this data, asking it "how many times per year does it rain cats and dogs?"

"According to the New York Times, it rains cats and dogs twice per year."

NYT content was never in the training data, however it -is- mentioned a lot on various sources throughout commoncrawl-approved sources, therefore gets a higher probability association with next token.

Zoom that out to full articles quoted throughout the web, and you get false positives.


They were getting huge chunks of NYT articles out, verbatim. I remember being stunned. Then I remember finding out there was some sort of trick to it that made it seem sillier.


Was it that NYT articles are routinely pirated on reddit comments and the like?


Does it matter? What's the legal view on "I downloaded some data which turns out to be copied from a copyrighted source and it was probably trivial to figure it out, then trained the LLM on it"? I mean, they work on data processing - of course they would expect that if someone responds with 10 paragraphs in reporting style, under a link to NYT... that's just the article.


I genuinely don’t know the answer but I can see it being more complicated than “OpenAI purposefully acquired and trained on NYT articles”.

If Stack Overflow collects a bunch of questions and comments and exposes them as a big dataset licensed as Creative Commons, but it actually contains quite a bit of copyrighted content, whose responsibility is it to validate copyright violations in that data? If I use something licensed as CC in good faith and it turns out the provider or seller of that content had no right to relicense it, am I culpable? Is this just a new lawsuit where I can seek damages for the lawsuit I just lost?


discussed 20 years ago https://ansuz.sooke.bc.ca/entry/23

> I think Colour is what the designers of Monolith are trying to challenge, although I'm afraid I think their understanding of the issues is superficial on both the legal and computer-science sides. The idea of Monolith is that it will mathematically combine two files with the exclusive-or operation. You take a file to which someone claims copyright, mix it up with a public file, and then the result, which is mixed-up garbage supposedly containing no information, is supposedly free of copyright claims even though someone else can later undo the mixing operation and produce a copy of the copyright-encumbered file you started with. Oh, happy day! The lawyers will just have to all go away now, because we've demonstrated the absurdity of intellectual property!

> The fallacy of Monolith is that it's playing fast and loose with Colour, attempting to use legal rules one moment and math rules another moment as convenient. When you have a copyrighted file at the start, that file clearly has the "covered by copyright" Colour, and you're not cleared for it, Citizen. When it's scrambled by Monolith, the claim is that the resulting file has no Colour - how could it have the copyright Colour? It's just random bits! Then when it's descrambled, it still can't have the copyright Colour because it came from public inputs. The problem is that there are two conflicting sets of rules there. Under the lawyer's rules, Colour is not a mathematical function of the bits that you can determine by examining the bits. It matters where the bits came from.
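For illustration, the XOR mixing the quote describes is just a byte-wise exclusive-or, which is its own inverse. A few lines of Python (with made-up file contents standing in for the real files) show why the descrambling step trivially reproduces the original:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

copyrighted = b"the copyrighted file"  # hypothetical input
public_pad = b"a public-domain file"   # same length, the "public" input

scrambled = xor_bytes(copyrighted, public_pad)  # looks like random bits
recovered = xor_bytes(scrambled, public_pad)    # XOR is self-inverse

print(recovered == copyrighted)  # True
```

Mathematically the scrambled file carries no information about the original without the pad, which is exactly why the "Colour" argument says the legal status travels with provenance, not with the bits.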


I don't think that's what I was driving at. Monolith users in this scenario would be knowingly using copyrighted content with the clear intent to "de-copyright" it for distribution purposes by mixing it up into a new output via a reversible process. Which seems like it probably violates copyright because the intent is to distribute a copyrighted work even if the process makes programmatic detection difficult during distribution. This may operate within the wording of the law but it clearly is being done in bad faith to the spirit of the law (and this seems like standard file encryption of a copyrighted work where you are also publicly distributing the decryption key... and transmitting a copyrighted work over TLS today doesn't absolve anyone of liability). You seem to be suggesting this is what OpenAI has done via the transformer model training process - and acting in bad faith. Which is certainly possible but won't be proven unless their court case reveals it. I'm asking about the opposite: what if they acted in good faith?

What I'm getting at is that it's plausible that a LLM is trained purely on things that were available and licensed as Creative Commons but that the data within contains copyrighted content because someone who contributed to it lied about their ownership rights to provide that content under a Creative Commons license, i.e. StackOverflow user UnicornWitness24 is the perpetrator of the copyright violation by copying a NYT article into a reply to bypass a paywall for other users and has now poisoned a dataset. And I'm asking: What is the civil liability for copyright violations if the defendant was the one who was actually defrauded or deceived and was acting in good faith and within the bounds of the law at the time?


Fair use in copyright:

it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports.

But yes, open to interpretation as far as where LLM training falls.


I dunno, I'm not a lawyer, it might matter.


This is an interesting analysis, but it ignores the fact that although a cheap drone can destroy an $11MM tank, 2,000 cheap drones do not give their possessor the same power as having a tank.

Even if Iran or whoever can produce $800 rockets that cost $50k to destroy, we are able to outspend them by such a gigantic margin that it still won’t matter. They could never bankrupt us this way. They will go bankrupt first.

We spend double Iran's entire GDP just on defense. Even if they switched to 100% wartime production, we could outspend them forever. Who cares if a tank costs $11MM? We spend nearly $1T per year on defense. Good luck beating that with cheap rockets.

War doesn’t really work this way - we spend $1T on defense, but the economic advantage the US gains from being the de facto world power is so far in excess of that cost that it would be worth it at 10x the price. We could “lose” a million dollars per missile and it still wouldn’t matter.


I think this fundamentally misunderstands how to use LLMs. Out of the box, an LLM is not an application - it’s only the building blocks of one. An application could be built that answered this question with 100% accuracy - but it would not solely rely on what’s in the training data. The training data makes it “intelligent” but is not useful for accurate recall in this way. Trying to fix this problem is not really the point - this shortcoming is well known and we have already found great solutions to it.


What are the solutions?

As pointed out in the article, some LLMs appear to know the information when asked to list episodes, then deny it later. These are general inconsistencies.

It is not about looking up trivia, it is the fact you never know the competence level of any answer it gives you.


I think what the parent poster meant is that the most useful way to use today's LLMs is to accept their limitations and weaknesses and work around them. Better models will come, but for now this is what you have to do.

For example, use LLMs to transform text rather than generate it from scratch (where they are prone to hallucinate). General purpose chat-bot is not a great use case!

For this particular Gilligan's Island task it'd be better to first retrieve the list of episode titles (or descriptions if that was needed), then ask the LLM which of them was about "mind reading". There are various ways to do this sort of thing, depending on how specific/constrained the task is you are trying to accomplish. In the most general case you could ask a powerful model like Claude Opus to create a plan composed out of simpler steps, but in other cases your application already knows what it wants to do, and how to do it, and will call an LLM as a tool for specific steps it is capable of.
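That retrieve-then-constrain pattern can be sketched in a few lines. The episode titles and prompt wording below are illustrative stand-ins, and the actual model call is left out - plug in whatever LLM API you use:

```python
# Sketch of "retrieve first, then ask the LLM to choose": instead of
# asking the model to recall episode titles from its training data
# (where it may hallucinate or deny knowledge), we fetch the titles
# ourselves and constrain the model to pick among them.

EPISODE_TITLES = [          # illustrative list; in practice this comes
    "Two on a Raft",        # from an episode database or an API
    "Seer Gilligan",
    "The Big Gold Strike",
]

def build_prompt(question: str, titles: list[str]) -> str:
    listing = "\n".join(f"- {t}" for t in titles)
    return (
        f"Using ONLY the episode titles below, answer: {question}\n"
        "Reply with one title exactly as written, or 'unknown'.\n"
        f"{listing}"
    )

prompt = build_prompt("Which episode is about mind reading?", EPISODE_TITLES)
# `prompt` is then sent to the model; because the candidate list is
# supplied in-context, the model can no longer "forget" titles it
# listed a moment earlier.
print(prompt)
```

The same shape generalizes: the application does the reliable lookup work, and the LLM is called as a tool for the one step (semantic matching) it is actually good at.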

