An AI trained on dark web? Researchers may have unlocked a new weapon against hackers

Meet DarkBERT, a large language model developed with the goal of boosting cybersecurity.

Large language models (LLMs) are all the rage these days, and new ones are popping up every other day. Most of these linguistic behemoths, including the models behind OpenAI’s ChatGPT and Google’s Bard, are trained on text data from all over the internet – websites, articles, books, you name it. This means that their output is a mixed bag: sometimes genius, sometimes not.

But what if, instead of the web, LLMs were trained on the dark web? Researchers have done just that with DarkBERT, with some surprising results. Let’s take a look.

What is DarkBERT?

A team of South Korean researchers has released a paper detailing how they built an LLM on a large-scale dark web corpus collected by crawling the Tor network. The data covered a host of shady sites from categories including cryptocurrency, pornography, hacking, and weaponry. However, due to ethical concerns, the team did not use the data as is. To keep sensitive information out of the model, so that bad actors cannot later extract it, the researchers filtered the pre-training corpus before feeding it to DarkBERT.
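To give a feel for what such a filtering pass might involve, here is a minimal Python sketch. It is not the researchers' pipeline: the article does not say what was filtered or how, so the patterns, the placeholder tokens (`[URL]`, `[EMAIL]`, `[IP]`), the `min_words` threshold, and the function names `mask_sensitive` and `filter_corpus` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns and placeholder tokens, chosen for illustration only.
# URLs are masked first so that addresses embedded in them are not split up.
PATTERNS = [
    (re.compile(r"\bhttps?://\S+", re.IGNORECASE), "[URL]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
]


def mask_sensitive(text: str) -> str:
    """Replace URLs, email addresses, and IPv4-like strings with placeholders."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text


def filter_corpus(pages, min_words=5):
    """Mask identifiers, then drop near-empty pages and exact duplicates."""
    seen = set()
    cleaned = []
    for page in pages:
        masked = mask_sensitive(page.strip())
        if len(masked.split()) < min_words:
            continue  # too short to be useful training text
        if masked in seen:
            continue  # exact duplicate of a page already kept
        seen.add(masked)
        cleaned.append(masked)
    return cleaned


if __name__ == "__main__":
    sample = [
        "Contact admin@example.com or visit http://example.onion/leaks from 10.0.0.1 today",
        "hi",
    ]
    for page in filter_corpus(sample):
        print(page)
```

Masking identifiers rather than deleting whole pages keeps the surrounding language available for training while removing the specific details someone might try to pull back out of the model.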

If you are wondering about the name, DarkBERT is based on the RoBERTa architecture, a transformer-based model developed in 2019 by researchers at Facebook (now Meta).

The company described RoBERTa as a “robustly optimized method for pretraining natural language processing (NLP) systems” that improves upon BERT, which Google released in 2018. After Google open-sourced BERT, Facebook’s researchers were able to improve on its performance.

Cut to the present: the Korean researchers have taken the model further by training it on dark web data over the course of 15 days, eventually arriving at DarkBERT. According to the research paper, the training ran on a machine with an Intel Xeon Gold 6348 CPU and four NVIDIA A100 80GB GPUs.

What is the purpose of DarkBERT?

Despite its sinister-sounding name, DarkBERT is intended for security and law enforcement applications and not for any nefarious schemes.

Because the model was trained on dark web data, where huge datasets of stolen passwords are often found, the researchers report that DarkBERT is more effective in cybersecurity and cyber threat intelligence (CTI) applications than existing language models. They have demonstrated its use for detecting ransomware leak sites.

Hackers and ransomware groups often upload leaked sensitive data, such as passwords and financial information, to the dark web in order to sell it. The research paper suggests that DarkBERT can help security researchers automatically identify such websites. It can also be used to sift through the many dark web forums and monitor them for any exchange of unlawful information.

But while DarkBERT is better suited for “dark web domain-specific tasks” than other models, the researchers acknowledge that, because publicly available task-specific dark web data is scarce, some tasks may still require fine-tuning.

Is DarkBERT available for the general public?

As of now, DarkBERT isn’t available to the public. The researchers say they plan to release the preprocessed version of DarkBERT – the version that wasn’t trained on sensitive data – but they haven’t specified when.

Regardless, DarkBERT points to a future where AI models are tailored to specific tasks by training on highly specific data. Unlike ChatGPT and Google Bard, which are more like multipurpose Swiss Army knives, DarkBERT is a specialised weapon for thwarting hackers.

Zohaib is a tech enthusiast and a journalist who covers the latest trends and innovations at The Indian Express's Tech Desk. A graduate in Computer Applications, he firmly believes that technology exists to serve us and not the other way around. He is fascinated by artificial intelligence and all kinds of gizmos, and enjoys writing about how they impact our lives and society. After a day's work, he winds down by putting on the latest sci-fi flick. • Experience: 3 years • Education: Bachelor in Computer Applications • Previous experience: Android Police, Gizmochina
