NLP technologies against online crime

Dr Nikos Nikolaou

Project Manager

The use of Big Data in Artificial Intelligence (AI) and Machine Learning (ML) technologies offers new opportunities. As a rule, ML models need an extensive amount of data to give their best results, and crime fighting is no exception. In the ROXANNE project, technical development aims to significantly enhance criminal network analysis based on text, speech, language, and video technologies.

The Big Data landscape is ever expanding. The numerous technologies behind the term are establishing their presence and strive to solve everyday but chronic problems in areas such as administration, education, healthcare, and security. In the last of these, the focus nowadays is primarily on cybersecurity. Criminal offenders commonly use web services to organize and commit crimes. More than ever, methods and approaches based on Big Data are needed to prevent, predict, and investigate criminal cases. This can be achieved through extensive quantitative and qualitative analysis of the available crime-related information, while trying to establish strong relations between cause and effect.

It can be said that a police officer with access to Big Data technologies and their algorithms is well equipped for crime prevention. These tools maximize the output of an investigation while minimizing the effort required from the investigator. In some cases, law enforcement agencies succeed in detecting a crime before it happens. In other cases, when a huge amount of data has to be analysed, an algorithmic approach can be extremely beneficial for identifying and investigating a crime that has already been committed.

The use of Big Data in Artificial Intelligence (AI) and Machine Learning (ML) technologies offers new opportunities. The principle, which holds well beyond the fight against crime, is that ML models need an extensive amount of data to give their best results. In the ROXANNE project, technical development aims to significantly enhance criminal network analysis based on text, speech, language, and video technologies.

For instance, some of the most prominent crime-related social problems are the detection of predatory communications, online offenders, child abuse, and cyber grooming in online conversations. Natural Language Processing (NLP) is currently one of the dominant AI techniques; it deals with natural language as the link between humans and computers. The goal of NLP is to read, understand, and decipher human language as input and to deliver insights as output. Web and social media applications and platforms form a complex, multidimensional data landscape where NLP can be applied.

What follows is a brief description of the technical background of the NLP techniques applied to offender identification and detection or, in other words, to the never-ending fight against cybercrime. Several NLP techniques have been developed for broad text analysis. The ones NLP experts use most often are: (i) Bag-of-Words (BoW), (ii) Word2Vec (W2V) and Word Embeddings (WE), (iii) Term Frequency-Inverse Document Frequency (TF-IDF), and (iv) Rule-Based (RB) approaches.

Each one has different characteristics and suits different text analysis tasks. BoW is a method that weights words by counting their occurrences in a text dataset. It is used to extract features based on word frequency, which allows texts with similar content to be compared. In the W2V and WE techniques, the high-level approach is to replace words with encoded vectors. The vectors that represent the encoded document can then be used for classification. TF-IDF extends the BoW method by also taking into account how frequently a word occurs across all the texts in the collection, not only in the text being examined. RB approaches, among the oldest in NLP, focus on patterns that match or parse text, and are often used to fill in the blanks. In addition to the NLP techniques above, Machine Learning (ML) classifiers can support the analysis needed to predict, identify, or solve a criminal case. Algorithmic approaches that support such classification tasks include (i) Logistic Regression, (ii) Ridge, (iii) Naive Bayes, (iv) Support Vector Machine (SVM), and (v) Neural Networks (NNs).
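To make the difference between BoW and TF-IDF more concrete, here is a minimal sketch in plain Python. It is only an illustration and not code from the ROXANNE project: the helper names (`tokenize`, `bag_of_words`, `tf_idf`), the simple whitespace tokenizer, the unsmoothed `log(N / df)` form of the inverse document frequency, and the three made-up sample messages are all assumptions chosen for readability.

```python
import math
from collections import Counter

PUNCTUATION = ".,!?;:\"'()"

def tokenize(text):
    """Lowercase the text, split on whitespace, strip surrounding punctuation."""
    stripped = (word.strip(PUNCTUATION) for word in text.lower().split())
    return [word for word in stripped if word]

def bag_of_words(text):
    """BoW: map each word to the number of times it occurs in the text."""
    return Counter(tokenize(text))

def tf_idf(docs):
    """Return one {word: weight} dict per document.

    tf  = occurrences of the word / number of words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    bows = [bag_of_words(doc) for doc in docs]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct word once, so this counts
    # in how many documents every word appears.
    doc_freq = Counter(word for bow in bows for word in bow)
    weights = []
    for bow in bows:
        total = sum(bow.values())
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bow.items()
        })
    return weights

# Made-up sample messages, for illustration only.
docs = [
    "send the money tonight",
    "the meeting is tonight",
    "send the report to the office",
]
bows = [bag_of_words(doc) for doc in docs]
weights = tf_idf(docs)
```

In this sketch, "the" has the highest raw count in the third message, yet its TF-IDF weight is zero in every message because it appears in all three, while a word such as "money", which occurs in only one message, receives a higher weight than "send", which occurs in two. The resulting weight dictionaries could then serve as feature vectors for one of the classifiers listed above.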

In the ROXANNE project, the whole consortium is carefully exploring its contribution to the extremely sensitive task of crime prevention, prediction, and identification, since one of the project’s objectives is to develop a ROXANNE analytics platform that enhances investigation capabilities, especially for large criminal cases. During the first 2 years of the project, technical partners worked on several NLP sub-tasks using the ROXANNE simulated dataset (ROXSD) and the CSI dataset. Both datasets, the first produced by simulation with volunteers and a screenplay and the second based on several episodes of the famous TV series, are supposed to be close enough to real-world data. Several NLP experiments on topic detection, named entity recognition (NER), authorship attribution, semantic keyword extraction, and relation analysis based on the extracted entities were carried out throughout the project. At the same time, the connection between the NLP sub-tasks and network analysis is continuously examined, in order to explore the identification of hidden criminal networks in addition to individual offenders.
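As a rough illustration of how extracted entities can feed a network analysis, the sketch below links people who are mentioned in the same message. It does not reproduce the ROXANNE pipeline or its data: the name list `KNOWN_NAMES`, the functions `extract_names` and `co_occurrence_edges`, and the sample messages are all hypothetical, and a simple rule-based lookup stands in for a trained NER model.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical list of names of interest; a real system would rely on a
# trained NER model rather than a fixed lookup.
KNOWN_NAMES = {"Alex", "Maria", "Viktor"}

def extract_names(message):
    """Rule-based lookup: return the known names mentioned in a message, sorted."""
    tokens = re.findall(r"[A-Za-z]+", message)
    return sorted({token for token in tokens if token in KNOWN_NAMES})

def co_occurrence_edges(messages):
    """Count how often each pair of names is mentioned in the same message."""
    edges = Counter()
    for message in messages:
        # Names are sorted, so each pair always appears in the same order.
        for first, second in combinations(extract_names(message), 2):
            edges[(first, second)] += 1
    return edges

# Made-up messages, for illustration only.
messages = [
    "Alex will meet Maria at the station",
    "Maria told Viktor to wait",
    "Alex and Maria are leaving tomorrow",
]
edges = co_occurrence_edges(messages)
```

Each key in `edges` is a pair of names and each value is the number of messages mentioning both, so the result can be read as a weighted edge list for a graph. Here Alex and Maria are linked twice, Maria and Viktor once, and Alex and Viktor are connected only indirectly through Maria.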

The ROXANNE project started in September 2019, runs for 40 months, and is coordinated by the Idiap Research Institute. To find out more about ROXANNE and its 25 partners (LEAs, SMEs, industry, and academia), follow us on Twitter and connect with us on LinkedIn.

The ROXANNE project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 833635.

The contents of this publication do not necessarily reflect the opinion of the European Union. The article reflects only the author’s view and the sole responsibility of this publication lies with the author. The Research Executive Agency (REA) is not responsible for any use that may be made of the information contained therein.