Conference on Foundations and Advances of Machine Learning in Official Statistics, 3rd to 5th April, 2024

Session 2.2 Streamlining Processes and Data Integration

Choosing the Right Tool for the Job: Machine Learning vs. Regular Expressions in the Analysis of Text Data From the German Register of Driver Fitness

Daniel Kopper* 1

Abstract

The Federal Motor Transport Authority (Kraftfahrt-Bundesamt, KBA) administers the German Register of Driver Fitness (Fahreignungsregister, FAER) and is part of the European Statistical System (ESS) as an Other National Authority (ONA). The FAER holds information about driving-related misdemeanors and criminal offences. Some of this information is stored as free-form text with no formal requirements. This data is highly unstructured and therefore difficult to analyze. Previous attempts to tap into it using regular expressions have shown promising results; however, they are believed to be somewhat inaccurate, at least in some cases.

In light of the constant advances in artificial intelligence, it should therefore be examined whether classification performed by text-based machine learning algorithms is sufficiently accurate to be incorporated into the production of official statistics on traffic violations. One possible application is the classification of substances in drug-related violations, where each substance has a multitude of associated names and spellings.
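For the drug-related example, a regular-expression baseline might look roughly like the following sketch. The patterns, class names, and sample strings are assumptions made up for illustration and are not the rules used by the KBA.

```python
import re

# Hypothetical patterns: each substance class maps to one regular expression
# covering a few assumed names and spellings (illustrative only).
SUBSTANCE_PATTERNS = {
    "cannabis": re.compile(r"\b(cannabis|marihuana|marijuana|haschisch|thc)\b", re.IGNORECASE),
    "amphetamine": re.compile(r"\b(amph?etamin\w*|speed)\b", re.IGNORECASE),
    "cocaine": re.compile(r"\b(kokain|cocain\w*)\b", re.IGNORECASE),
}


def classify_with_regex(text):
    """Return the sorted list of substance classes whose pattern matches the text."""
    return sorted(
        label
        for label, pattern in SUBSTANCE_PATTERNS.items()
        if pattern.search(text)
    )


# A spelling that is not anticipated by the patterns is silently missed.
matched = classify_with_regex("Fahrt unter Einfluss von Marihuana und Kokain")
missed = classify_with_regex("Canabis")
```

In this sketch the misspelling "Canabis" matches no pattern, which illustrates one way in which a purely rule-based approach can become inaccurate: every variant has to be anticipated by hand.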

To this end, several machine learning algorithms for analyzing text data, such as Naive Bayes and support vector machines, were identified. The research also covered a series of pre-processing steps that might be applied to prepare the data for machine learning (e.g. n-grams, word stemming, lemmatization, or the removal of stop words).
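As a rough illustration of how pre-processing and one of the named algorithms fit together, the sketch below combines stop word removal and character n-grams with a small multinomial Naive Bayes classifier written in plain Python. The stop word list, the choice of character trigrams, the class names, and the training strings are all assumptions for illustration. Stemming and lemmatization are left out because they need language-specific resources.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed, very small stop word list (illustrative only).
STOP_WORDS = {"der", "die", "das", "und", "von", "unter", "mit"}


def preprocess(text, n=3):
    """Lowercase, tokenize, drop stop words, and return character n-grams."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]
    grams = []
    for token in tokens:
        padded = f"_{token}_"  # underscores mark the word boundaries
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            grams = preprocess(text)
            self.feature_counts[label].update(grams)
            self.vocabulary.update(grams)
        return self

    def predict(self, text):
        # n-grams never seen during training carry no information and are skipped.
        grams = [g for g in preprocess(text) if g in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for gram in grams:
                score += math.log((counts[gram] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training examples; real training data would be classified manually.
model = NaiveBayes().fit(
    ["Cannabis", "Marihuana", "Kokain", "Cocain"],
    ["cannabis", "cannabis", "cocaine", "cocaine"],
)
```

Because the features are character n-grams rather than whole words, a spelling such as "Canabis" still shares most of its trigrams with "Cannabis", so a model of this kind can tolerate variants that were never listed explicitly.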

The next step is to fit machine learning models to the pre-processed data, tune them, and evaluate how accurate they are compared to previous attempts at classifying text data with regular expressions. The primary goal is to identify cases and text characteristics within the context of register data where text-based machine learning achieves significantly better results than regular expressions, so that the extra effort of manually classifying training and test data is justified.
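The planned comparison could be sketched as follows, assuming that both approaches are applied to the same manually classified test set. The function names are hypothetical, and the texts, labels, and predictions are made-up placeholders rather than results from the study.

```python
def accuracy(predictions, labels):
    """Share of test cases where the prediction equals the manual label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def only_second_correct(texts, labels, first_preds, second_preds):
    """Return the texts that only the second classifier labels correctly."""
    return [
        text
        for text, y, a, b in zip(texts, labels, first_preds, second_preds)
        if a != y and b == y
    ]


# Placeholder data, invented purely to show how the functions are called.
texts = ["Cannabis", "Canabis", "Kokain", "Kokian"]
labels = ["cannabis", "cannabis", "cocaine", "cocaine"]
regex_preds = ["cannabis", None, "cocaine", None]
ml_preds = ["cannabis", "cannabis", "cocaine", "cannabis"]

regex_accuracy = accuracy(regex_preds, labels)
ml_accuracy = accuracy(ml_preds, labels)
ml_gains = only_second_correct(texts, labels, regex_preds, ml_preds)
```

Listing the cases that only one approach gets right, in addition to the overall accuracy, is one way to look for the text characteristics where machine learning pays off. Whether a difference counts as significant would need a proper statistical test on a sufficiently large test set, which this sketch does not attempt.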

*: Speaker

1: Federal Motor Transport Authority - Germany