Hacker News
DarkBERT: A Language Model for the Dark Side of the Internet (arxiv.org)
142 points by rajtilakjee on May 21, 2023 | 59 comments



> While crawling the Dark Web, we take caution not to expose ourselves to content that should not be accessed. For example, illicit pornographic content (such as child pornography) are easily found on the Dark Web. However, our automated web crawler takes the approach of removing any non-text media and only stores raw text data.

> By doing so, we do not expose ourselves to any sensitive media that is potentially illegal.

This is the part I was looking for. Although I wonder if some of the text parsed might also be questionable in terms of legality.

I also was amazed at it cleanly categorizing the data into Pornography, Violence, Gambling, Hacking, etc., since to my naive mind, some of these categories likely have huge overlaps, especially so on the Dark Web.
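The text-only approach quoted from the paper can be sketched roughly like this. It is only an illustration of the idea, not the authors' crawler: the names `TextOnlyExtractor`, `extract_text` and the `SKIPPED_TAGS` set are made up here, and it assumes the crawler only ever parses the HTML it fetched and never follows media URLs.

```python
from html.parser import HTMLParser

# Hypothetical list of elements whose contents are dropped entirely.
# Void elements such as <img> or <source> carry no text, so ignoring
# their tags (and never fetching their src URLs) is enough.
SKIPPED_TAGS = {"script", "style", "video", "audio", "picture", "object", "svg", "canvas"}


class TextOnlyExtractor(HTMLParser):
    """Collects visible text and ignores everything inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments kept so far

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return only the raw text of a page; no media is downloaded or stored."""
    parser = TextOnlyExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Note that this says nothing about whether the stored text itself is legally questionable, which is the open question above.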


You might think that naïvely, and it makes sense, but I think it's actually the opposite way; the dark web is highly compartmentalised, it seems to me. Most people who use the dark web genuinely (i.e. not just exploring out of curiosity) need it for one thing, and only one thing, whether it's drugs, illicit porn of various genres, political forums, or hacker forums. Illicit porn especially seems to be banned from any site not specifically set up for it (just like regular porn on the regular internet, I guess). I have no idea what the sites that do allow it look like, but probably they're highly specialised like everything else I've seen on the dark web.


At what point does ASCII art of an illegal image become illegal?


No content on the Internet should be illegal to access, period.


I'm with you.

Last week, my favourite German lawyer Udo Vetter made a blog post [1] about parents reporting questionable media in group chats to the police.

He urgently warns people never to do this.

If you're in a large Telegram group, Discord, WhatsApp or whatever, and your phone downloads the questionable media automatically, you're already committing a crime (the mere possession of such media is illegal). In Germany, the minimum punishment for the possession of child porn is one year; a monetary fine is not possible and a dismissal of the case due to insignificance isn't possible either.

According to Udo Vetter there have been cases where "concerned parents have brought printouts of the chat messages to the police so that the things are not distributed further. Criminal proceedings for possession of child pornography were the result."

Further down, the blog post describes funny cases like the police seizing the mobile phone of a drug dealer and then finding group chats with thousands of users where questionable images were shared. That then starts a mass investigation of thousands of innocent people who weren't even looking at that stuff. No one looks at group chats 24/7, especially in fast-moving chats with like 10 messages/minute.

The issue isn't even new. I remember many times when IRC networks were flooded with such media by troll collectives like myg0t, 4chan and whatnot.

[1] https://www.lawblog.de/archives/2023/05/12/lustiger-gruppenc...


There is a possibility LE is doing it too, not just trolls. There seems to be no better way to shut down an internet location than posting that, and the public is vastly on the side of LE here. But this stuff can cause any chat, app, whatever to be immediately shut down and everyone involved to be criminally investigated. As an example, see how blasphemy laws in Pakistan are weaponized for political ends: https://web.archive.org/web/20230211002446/https://www.latim...


My greatest concern is the planting of this material on political activists' computers. It's by far the easiest way for the state to take their political opponents down. The general public will believe the accusations without questioning them. Russia is known to do this. Live in the UK, speak out about Vladimir Putin, go to jail...

https://en.wikipedia.org/wiki/Vladimir_Bukovsky#"Prohibited_...

These laws over image and text possession[1] are complete and utter madness. It is like Islamic fundamentalism. No way this should be happening in the UK or the USA either, especially in this day and age. We are supposed to be past all of this, just as with blasphemy and heresy laws. Something's gone really wrong here.

1. 'Simple' possession, with no intent to distribute.


I used to run a web site that focused on British TV shows. A competitor used to call up our hosts and PayPal (which we used for donations) and tell them the site was dealing in CSAM. There was nothing of the sort on the site, but it would get us cut off for days at a time until investigations were completed. This would happen every couple of months. There is no protection against such an attack.


And GitHub has added a new "Report Repository" link on every repo now, and it gave me the creeps that there's a very easy way to report CSAM through it. It's listed right there out of the several types of prohibited content. So 4chan, trolls, etc. could quite easily mass report someone's repository. And it frightens me just to contemplate what might happen to someone if they got 100 or 1000 false reports on their repository. So everyone is at risk from being labelled as the modern day version of a witch or heretic, just at the push of a button.

My God, what has the world come to nowadays. All done in the name of protecting us from a so-called widespread threat to our safety. Straight out of a fascist's playbook there. History repeating itself again?


I could easily see some web sites just forwarding complaints of this nature to one of the national CSAM organizations (that are privately held and therefore unregulated and not open) who then add you to their database for future reference, even if the allegation is totally unfounded.


There might be a mandatory reporting requirement, so even if it's unfounded, it still ends up in an NCMEC database. Or law enforcement is still notified, even if no arrest happens.

Anyway it's all completely overwhelming to comprehend, emotionally. Time to take a break, to pull the network cable out of the back of my computer, and be back to normal, instead of living in Orwell's 1984 world, come true. Thankfully there's so much more to life than the Internet.


There's the soft power LE uses also as a result of this issue. This covers things like abolishing encryption and privacy, scanning, liaisons, etc. I believe people self-censor when they know they are being monitored and LE knows that too.


Yes, this and the mass surveillance going on, is enough for me to say confidently that we're now living in a police state. That we have severely traded our liberty for supposed safety here. And it's happened slowly over the decades. Is this how totalitarianism begins?

I still can't believe this is going on and that people are being prosecuted for the act of viewing or reading certain things on the Internet. That belongs in dictatorships such as China or Russia, not the free world. My God, how the frog has boiled very slowly indeed.

I think it's the metaphorical canary in the coal mine for much worse things down the line...


Doesn't "possession" refer to "holding the copyright"? If someone illegally downloads (pirates) the copyrighted content, how do they become the possessor?


In this case it means "the possession of a copy". In Germany that means that you have the file on your storage device and are able to retrieve it (put it on a screen, make a copy or send it to someone else; 'deleted' files are also retrievable until overwritten).


I suspect you’d change your tune pretty fast if there were some horrifying content involving a family member.


No, absolutely not, the distribution of it should be prosecuted, not people looking at it. Criminalizing looking at things is a throwback from medieval times. And is a hallmark of totalitarianism, a police state.

No other person, or government has the right to dictate to you what you can read or not. It's a fundamental, natural right in my opinion. And nobody can take that right away from you.

They can say anything they like, make any excuses for trying to do so, they still don't have the right to control your reading habits.


Suppliers are motivated by demand. Consumption of CP causes child exploitation.


I never quite bought this. It is known that this is even done live, on request. And in those cases, yes, the viewer who made requests or payments to make it happen is definitely complicit.

Where the argument loses me is that I simply don't believe that the demand in any way causes people to become rapists just to produce content. They were gonna become rapists anyway.

The better argument that seeking it out at all should be illegal is that it's a (very severe) privacy violation. And I would agree with that.

But it is hard to get away from the fact that part of this really is thought crime. In some countries, even things like loli manga are illegal even though no children were harmed making them.

And speaking of demand and exploitation of children, should we prosecute people who bought clothes made with child labour? Their demand also caused child exploitation in some sense. Serious question, because I think prosecuting based on vague economic theories of "motivating suppliers" sets an interesting precedent for all sorts of different behaviours normally thought of as perfectly legal.


Yes, as a thought experiment, maybe we should prosecute those who purchase eggs, even free range ones, from so called 'ethical' farms, for animal abuse? Each hen lays 600 eggs in its lifespan [1]. For every hen there was a male chick that was shredded alive in a macerator in the hatchery [2,3].

The link between abuse and consumption is concrete in this case. Buying more eggs requires the farm to raise more hens, which causes more male chicks to be shredded. There is a direct supplier and customer here, and money is clearly transferred.

So the average person consumes 277 eggs per year [4]. At 600 eggs per hen, that means approximately every 2 years we all shred a male chick alive, by proxy. And by using the same argument as for the illegal pornography, we should be going to jail for animal abuse?
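For anyone who wants to check the "every 2 years" figure, here is the arithmetic as a small sketch. The two inputs are the numbers quoted above; the one-male-chick-per-hen ratio is the comment's own assumption, and the function name is just for illustration.

```python
EGGS_PER_HEN_LIFETIME = 600      # figure quoted from [1]
EGGS_PER_PERSON_PER_YEAR = 277   # figure quoted from [4]
MALE_CHICKS_PER_HEN = 1          # assumption: one culled male chick per laying hen


def years_per_culled_chick(eggs_per_hen, eggs_per_year, chicks_per_hen=1):
    """Years of average egg consumption that correspond to one culled chick."""
    hens_per_year = eggs_per_year / eggs_per_hen      # hen-lifetimes consumed per year
    chicks_per_year = hens_per_year * chicks_per_hen  # culled chicks implied per year
    return 1 / chicks_per_year


years = years_per_culled_chick(EGGS_PER_HEN_LIFETIME, EGGS_PER_PERSON_PER_YEAR, MALE_CHICKS_PER_HEN)
print(round(years, 2))  # 2.17
```

So "approximately every 2 years" holds up, given those inputs.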

1. https://backyardpoultry.iamcountryside.com/chickens-101/how-...

2. https://thehumaneleague.org.uk/article/what-happens-to-male-...

3. https://demeter.net/chick-culling/

4. https://www.statista.com/statistics/183678/per-capita-consum...

With illegal pornography, which is information, and can be copied effortlessly at zero cost, and when no money is transferred, the link between consumption and abuse is much less clear. In fact most of those images are distributed for free, so I guess it's pirated illegal pornography that people are viewing then? That hardly makes a direct case for increasing the production of such images?


I think in general some externalities should be banned at the source, i.e. macerating male chicks en masse should just be illegal (I'm not a huge animal rights person in general, but this one seems clear cut to me). Whereas others that can't just stop overnight (say, carbon emissions) should be taxed at the source to phase them out naturally over time. I think prosecuting the consumer directly is too much of a slippery slope to go down. Because, like it or not, almost everything we do in modern society has some pretty nasty externalities attached that are also so far removed from the consumer that they can't reasonably be expected to factor them into their decisions.


> Where the argument loses me is that I simply don't believe that the demand in any way causes people to become rapists just to produce content.

Consumers trade material with each other, and a high value is placed on original or custom content - which in turn leads them into finding victims so they can become producers.

It is absolutely vile and consumption does lead directly to abuse.


I don't buy your implication that the production of content is somehow the prime motivator for abusers, and not simply the primal desire for sex along with either a complete lack of compassion or a deeply disturbed understanding of human relationships and complete lack of impulse control. I think the main motivation for filming it is probably just the same as most people who make their own sex tapes. To watch again later.

In any case, I already pointed out that there are gradations here, and you didn't address any of those points.


In the case where someone deliberately seeks such material, purchases it on a darknet market or distributes it to others, I totally agree.

But how does prosecuting someone for accidentally viewing it, because some sicko posted it on a public forum, help exploited children?


It doesn't. But it does help LE and our elected officials to stand behind these practices and legislation to prove that people are prosecuted for the heinous crime of being part of collateral damage.

I've emailed my senators on the EARN IT Act here in the US. None of them have engaged in the conversation even after I reached out multiple times. What's really annoying to me is that they are elected and paid with my tax dollars, yet the best they can do is send me some precanned non-response, and all of them hide behind contact forms these days. There's no way to email them directly anymore because they don't want to have to deal with that open line of communication. I understand that a lot of horrible things probably get sent to them, but that's the job. And there are plenty of ways to filter abusers in 2023.


They only serve us on paper; in reality they end up serving the oligarchs in big business and powerful lobbyists and pressure groups. Some of those pressure groups are front organizations for radical feminists and evangelical Christians with hidden agendas, who are big proponents of censorship in the name of supposedly protecting our morals. This has been going on for decades, since the 1970s at least. Some of these organizations have even renamed themselves to hide their religious origins, operating under the name of preventing human trafficking and sexual abuse. There's even a name for it: "femi-servatives", a portmanteau of feminism and conservatives. Yes, an unusual alliance.

Citation below, from a reliable source:

https://blogs.lse.ac.uk/gender/2013/11/21/rescuers-redeemers...

" The focus on sex trafficking, therefore, becomes yet another channel through which Christian institutions are able to carry out their sex-negative agenda, particularly by casting themselves as Rescuer, Redeemer, and even the familiar role of Western male hero saving the passive and helpless female. "

Another one below, less reliable though:

https://www.thedailybeast.com/inside-exodus-cry-the-shady-ev...

Finally, it's disturbing how the two-way (transmit/receive) nature of the Internet as a medium has virtually let all these "morality police" organizations into our homes, to watch over us, through their lobbying efforts to change the law. That wasn't possible with radio and satellite television, which were receive-only: you could watch or listen to anything you wanted and explore freely, without ever having to worry about the thought police breaking your door down should you happen to stumble over something deemed "inappropriate" by them. The difference between that and the Internet is stark and highlights the sheer madness and insanity of what the Internet has made possible.

The Internet became like the telescreen in George Orwell's 1984 novel. You can never be sure if it's watching you or not.

In the past, this was the stuff of nightmares, but it's here for real now in the 21st Century, because of widespread Internet usage in society.

And your children are at risk from it, from sexting, from a teenage son coming across this material and viewing it, maybe even intentionally. Perhaps the risk of your children being prosecuted for breaking these laws is greater than the risk of them being physically abused by a pedophile?

https://www.teenvogue.com/story/how-sex-offender-registries-...

https://jlc.org/news/most-states-require-some-youth-be-sex-o...

https://magazine.jhsph.edu/2022/harms-placing-kids-sex-offen...

https://www.americanbar.org/groups/litigation/committees/chi...


We should be scrutinizing these aforementioned pressure groups and if any of their activities are unlawful in any way, they should be reported to the authorities. Many here in the UK are registered charities. It is about time we get together and put a stop to this, using all possible avenues within the law.


I would hope most jurisdictions allow for accidental viewing to not be a criminal act.

In my jurisdiction, firstly, it requires knowing possession (i.e. if it was just in your web cache because it was downloaded in a hidden img tag and you didn't know it was there, then you would technically not be guilty).

My jurisdiction also has a safety valve that allows you to immediately delete the image after viewing it and realizing the nature of the content. As long as the deletion happens within a reasonable time, there is no crime. The problem with this is that it is an affirmative defense, which means you have to prove your innocence; you are assumed guilty. (There are many crimes where you are legally assumed guilty rather than innocent.)


Also, attacking "forbidden topic" consumers is always an easy win.


I respect your stance on this. What's your opinion on using BitTorrent to "look at it", while (without intention) leaving the default settings of BitTorrent where it usually also uploads for a bit while downloading, but then remembering to manually stop the torrent seeding sometime "shortly" after downloading when the user gets around to it (1 hour to 1 week).


It would come under distribution, because your computer is physically uploading this information to others. So it's being sent in the upstream channel of your Internet connection. And it was an event initiated by the user knowingly browsing such material; it did not happen on behalf of another, as would be the case if he/she ran an anonymizing proxy service for others to use, e.g. a Tor exit node. So that qualifies as distribution to me.

It depends if the quantity of data uploaded qualifies as an image of illegal abuse. If it contains fragments of the image, that do not represent such acts, then it's not illegal. The user was being negligent or reckless in not disabling the upload facility completely, it's his/her duty to do so in that case, to prevent further harm to the victim from occurring.

And I think that is where we can draw the line and it becomes a crime? And sentencing will take into account the intent? Might not result in jail time if it was just one file?


No virtual act in general. Force is only justified in response to force.


Absolutely, the taking of the photographs is where the actual abuse occurs. The people doing that should be the number one target for prosecution.

Although I think distribution causes further distress to the victims, it's much less severe than the initial abuse.

The images, they are artifacts of abuse, which is different to the abuse itself. The actual abuse could have occurred decades ago, in some cases....


> Although I think distribution causes further distress to the victims, it's much less severe than the initial abuse.

A bold claim with no evidence provided


Murder and rape pictures and videos should be illegal


Should the viewing of a murder be illegal or the murder itself or both?

Should the reason we view change what should be viewed? If someone is viewing for pleasure is that a crime, if someone is viewing for healing does that change how we treat them?

Is someone's death a privacy issue? Should deaths be personal, and should viewing such an act break privacy laws?

What about fake murder and rape pictures? What about newsworthy pictures/video?

Should we shield society from its negative elements or normalize them?


Only the murder itself should be illegal. If it's some kind of weird pornography, for example, then the filming and distribution of it could be illegal.

If we try to prosecute whether it's viewed for pleasure or not, then we are trying to prosecute thought crimes. We are trying to prosecute someone's state of mind there.

No, just viewing someone's death does not violate privacy laws, in my opinion. But if you share the material with others, then it might be right to make that illegal.

We should take such material down, but nobody should ever see jail time or any form of punishment for viewing or reading anything, in a free society. Any situation otherwise is a hallmark of a police state, and it shows how far the Overton window has shifted. How much the frog has boiled over the years.

And seeing the harm caused by the War on Drugs and countless other cases of overcriminalization, it may well be we are living in a de-facto police state, and unaware of it. We've internalized and accepted loss of our fundamental liberties over the years. And this trend started roughly around 1980 or so.


So the person who posted George Floyd's murder should be punished under your system?


No, it only applies to material produced as a type of pornography, for example. I've reworded the post to make it clearer.


Exactly. Cartel and ISIS style murder videos and live streams like the Christchurch and Buffalo massacres should be treated the same as child porn.

The creation of those videos is a crime, and the distribution and possession should be too. Videos like George Floyd and JFK are recorded by bystanders and are evidence of crimes.


…so if someone recorded the Buffalo massacre as a bystander then it’s evidence and ok?

Slippery slope


How is that a slippery slope? It's dry, level ground. Did the person filming participate in the murder? Nope. They are documenting evidence of a crime.


Yeah, this is a terrible idea. I think of the Zapruder film.


Yes, to distribute, if produced for malicious reasons, but not to view. I personally believe that no part of the internet, if viewed or visited, should be illegal to browse, period. Everyone should be free to explore the entire Internet without any fear of prosecution, and encouraged to report illegal material so that it can be taken down promptly. As long as you don't hack into anything, all of it should be legal to access.

I ended my career over this. I do not want to work in any industry that supports directly or indirectly the prosecution of such "thought crimes". Such prosecution has no place in a free society.

I will not work on any software (or hardware) which provides a traceability or logging facility that can be used to prosecute people for viewing or reading anything. Therefore participation in development of networking equipment is completely out of the question for me. To me it's no different from developing weapons systems, another thing I will not do.


Just by requiring an account to view a forum (without validating any of the information at all) this entire project would be foiled. All the 'bad guys' have to do is put up a login page with a 'create account' link and it would be ignored, according to the criteria they gave in the appendix (login pages were discarded). I can only imagine how much data was hidden in this way.
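To make the point concrete, here is a guess at what a "discard login pages" filter might look like. The paper's actual criteria are not reproduced here; the heuristic (any page with a password input counts as a login page) and the names `looks_like_login_page` and `filter_crawlable` are assumptions for illustration only.

```python
from html.parser import HTMLParser


class _PasswordFieldFinder(HTMLParser):
    """Sets .found to True if the page contains an <input type="password">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr_map = dict(attrs)
            # Attribute values may be None for valueless attributes.
            if (attr_map.get("type") or "").lower() == "password":
                self.found = True


def looks_like_login_page(html):
    """Hypothetical heuristic: a password field means a login/registration wall."""
    finder = _PasswordFieldFinder()
    finder.feed(html)
    finder.close()
    return finder.found


def filter_crawlable(pages):
    """Keep only pages (url -> html) that do not look like login pages."""
    return {url: html for url, html in pages.items() if not looks_like_login_page(html)}
```

With a filter along these lines, everything behind even a trivial "create account" wall never enters the corpus, since the crawler only ever sees the login form.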


While true, I'm not sure that matters. Unless your intuition is that the data behind such login pages is quantitatively different from what they crawled, the training, model, and findings should still hold, right?


The paper claims there are massive differences between language used on the surface web and the dark web. That is, people talk differently when they're in a different environment that takes special effort to get into.

If we have to guess one way or the other, it seems like we should assume the same is true of the crawlable dark web versus the non-crawlable dark web.


Not my intuition, but experience. All of Dread is behind a login, for example.


Awww, I wish they’d called it DarkoBERT. Would’ve been fun especially for those who grew up with German Disney translations where Scrooge McDuck is called Dagobert.


On the bright side, if it turns evil we can enlist the help of LightERNIE to convince it to be less of a grumpy guy.


Cloak and DaggerBERT would apply quite well with the darker web theme.


Hasn't this been posted 4x this week? Feel like I've seen this many times...


This is the third time. You can use the search field at the bottom of the page.

https://hn.algolia.com/?q=https%3A%2F%2Farxiv.org%2Fabs%2F23...


I don't think it ever reached the first 2 pages on HN? I haven't seen it at least.


Speaking of the "dark web", I do wonder and worry about the seemingly growing pattern of censorship and chaos-sowing from certain segments on both the political Left and the Right.

I'm trying to extrapolate from these trends, taking them into the future, and then parody them, in a game I've been building. (see my bio if curious)


Why? What purpose does this serve? Is this a fun project? What is funny about it? What problem does it solve? I just don't get it!


Do they have plans to release it on Hugging Face?


Is this truly necessary? I feel like it's not that hard to teach a regular model what jargon means.


The torment nexus comes to mind.



