Journalism of Courage

An AI trained on dark web? Researchers may have unlocked a new weapon against hackers

Meet DarkBERT, a large language model developed with the goal of boosting cybersecurity.

Written by Zohaib Ahmed
New Delhi | Updated: May 24, 2023 04:27 PM IST

Large language models are all the rage these days and new ones are popping up every other day. Most of these linguistic behemoths, including OpenAI’s ChatGPT and Google’s Bard, are trained on text data from all over the internet – websites, articles, books, you name it. This means that the quality of their output is a mixed bag.

But what if, instead of the open web, LLMs were trained on the dark web? Researchers have done just that with DarkBERT, with some surprising results. Let’s take a look.

What is DarkBERT?

A team of South Korean researchers has released a paper detailing how they built an LLM on a large-scale dark web corpus collected by crawling the Tor network. The data came from a host of shady sites in categories including cryptocurrency, pornography, hacking, and weaponry. However, due to ethical concerns, the team did not use the data as is. To keep sensitive information out of the model, where bad actors might later extract it, the researchers filtered the pre-training corpus before feeding it to DarkBERT.
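Filtering of this kind can be pictured with a small sketch. The Python snippet below masks a few types of sensitive identifiers in raw page text before it would reach a training corpus. The patterns, the placeholder tokens and the function name `mask_sensitive` are assumptions made for illustration, not the researchers' actual pipeline.

```python
import re

# Hypothetical patterns and placeholder tokens, chosen for illustration only.
# They are not taken from the DarkBERT paper.
PATTERNS = [
    # Email addresses such as user@example.com
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    # Tor v3 onion hostnames: 56 base32 characters followed by ".onion"
    (re.compile(r"\b[a-z2-7]{56}\.onion\b"), "[ONION]"),
    # Dotted-quad IPv4 addresses (no range check on each octet)
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
]


def mask_sensitive(text: str) -> str:
    """Replace every match of each pattern with its placeholder token."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text


if __name__ == "__main__":
    sample = "Contact admin@example.com or 192.168.0.1 for the dump."
    print(mask_sensitive(sample))
```

Masking rather than deleting keeps the sentence structure intact, so a model could still learn that an email address or a hostname belongs in that position without ever seeing the real value.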

If you are wondering where the name DarkBERT comes from, the LLM is based on the RoBERTa architecture, a transformer-based model developed back in 2019 by researchers at Facebook (now Meta).

The company had described RoBERTa as a “robustly optimized method for pretraining natural language processing (NLP) systems” that improves upon BERT, which was released by Google back in 2018. After Google made BERT open-source, Facebook's researchers were able to improve its performance.

Cut to the present: the Korean researchers have taken the original model further by training it on data from the dark web over the course of 15 days, eventually arriving at DarkBERT. The research paper notes that a machine with an Intel Xeon Gold 6348 CPU and 4 NVIDIA A100 80GB GPUs was used for the purpose.

What is the purpose of DarkBERT?

Despite its sinister-sounding name, DarkBERT is intended for security and law enforcement applications and not for any nefarious schemes.

Because the model was trained on the dark web, the home of shady sites where huge datasets of stolen passwords are often found, the researchers report that DarkBERT is more effective in cybersecurity and cyber threat intelligence (CTI) applications than existing language models. They have demonstrated its use for detecting ransomware leak sites.

Hackers and ransomware groups often upload leaked sensitive data like passwords and financial information to the dark web with the purpose of selling it. The research paper suggests that DarkBERT can be useful for security researchers to automatically identify such websites. It can also be used to crawl through the slew of dark web forums and monitor them for any exchange of unlawful information.
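At its core, identifying such websites is a text classification task: given the text of a page, decide whether it looks like a leak site. The toy Python sketch below shows the shape of that task. DarkBERT itself is a neural language model; the keyword scorer here is only a stand-in, and the cue terms, the threshold and the function names are all hypothetical.

```python
import re

# Hypothetical cue terms picked for this example. A real system would learn
# its signals from labeled pages instead of relying on a fixed word list.
LEAK_SITE_TERMS = {
    "ransom", "leak", "leaked", "decrypt", "victims", "deadline", "published",
}


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def leak_score(page_text: str) -> float:
    """Return the fraction of tokens that are cue terms (0.0 for empty text)."""
    tokens = tokenize(page_text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in LEAK_SITE_TERMS)
    return hits / len(tokens)


def looks_like_leak_site(page_text: str, threshold: float = 0.15) -> bool:
    """Flag a page when its cue-term density reaches the (assumed) threshold."""
    return leak_score(page_text) >= threshold


if __name__ == "__main__":
    page = "Victims who miss the deadline will have files leaked and published"
    print(looks_like_leak_site(page))
```

A fixed word list like this is easy to evade and prone to false alarms, which is the gap a model trained on dark web language is meant to close.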

But while DarkBERT is better suited for “dark web domain-specific tasks” than other models, the researchers acknowledge that, because publicly available task-specific data from the dark web is scarce, some tasks may require further fine-tuning.

Is DarkBERT available for the general public?

As of now, DarkBERT isn’t available to the public. The researchers say they plan to release the preprocessed version of DarkBERT – the version that wasn’t trained on sensitive data – but they haven’t specified when.

Regardless, DarkBERT represents a future where AI models are tailored to specific tasks by training on highly specific data. Unlike ChatGPT and Google Bard, which are more like multipurpose Swiss Army knives, DarkBERT is a specialised weapon for thwarting hackers.

Zohaib Ahmed

Zohaib is a tech enthusiast and a journalist who covers the latest trends and innovations at The Indian Express's Tech Desk. A graduate in Computer Applications, he firmly believes that technology exists to serve us and not the other way around. He is fascinated by artificial intelligence and all kinds of gizmos, and enjoys writing about how they impact our lives and society. After a day's work, he winds down by putting on the latest sci-fi flick. • Experience: 3 years • Education: Bachelor in Computer Applications • Previous experience: Android Police, Gizmochina

Tags:

artificial intelligence cybersecurity
