AI as a Module – A New Kind of Service

Motivation and Concept

To meet customers’ need for effective solutions in the field of Artificial Intelligence, various approaches – and combinations of them – are possible:

  • Consulting: the complete development takes place on site at the customer, who also receives all rights
  • Software Development (“Software”): the customer acquires licences of a product and operates it on its own infrastructure
  • Software as a Service (SaaS): the customer acquires licences of a product that is operated on the remote infrastructure of the SaaS provider

For the customers, each of these approaches has specific advantages and disadvantages.

Today’s technological possibilities make a further prototypical approach – AI as a Module (AIaaM) – possible:

  • the customer acquires a pre-trained model (such as a neural network) for specific tasks (such as the translation of texts) and operates it independently
  • the model is not software in the narrow sense, but requires e.g. a Python environment; the procedures and data for the training remain with the supplier

As shown below, this approach combines several advantages of consulting, software development, and SaaS for customers.

Advantages and Disadvantages from the Customer Point of View

For the customers, each of the mentioned service types has specific pros and cons:

  • Consulting
    • Pros
      • Very high flexibility: Consulting projects are extremely tailored to customer needs
      • Know-how build-up: Customers can acquire knowledge from the consultants and use it for further tasks
      • Contact persons: Contacts are available for every requirement, at least for the duration of the project
    • Cons
      • Time consumption: Consulting projects are often extremely time-demanding and can take years
      • High costs: Correspondingly, the costs are also very high – and can even continue to rise
  • Software
    • Pros
      • Low price: In general, software is relatively cheap – in some cases even free
      • Standardization: Software for specific tasks is often standardized and allows the customer to use best practice approaches
    • Cons
      • “Black box”: Often, customers have no way to examine how the procedures used in the software really work. (This is not the case for open source software, of course.)
      • Lengthy approval processes: Companies in regulated industries – like banks – often have lengthy approval processes in place. The effort can even reach the proportions of a full project.
      • Security risks: Complex software may have unknown security vulnerabilities that e.g. open the door for hackers
  • SaaS
    • Pros
      • Low price: In general, SaaS is relatively cheap
      • Resource saving: Customers need only very limited resources for operating the service
    • Cons
      • “Black box”: As with software, customers often have no way to examine how the procedures used in the service really work
      • Supplier dependency: Customers depend on the service of third parties and are directly affected, e.g. in the case of an insolvency
      • Hurdles due to outsourcing: SaaS is often seen as a form of outsourcing. Depending on industry and legislation, legal hurdles can arise
  • AIaaM
    • Pros
      • High flexibility: Pre-trained models can be used extremely flexibly inside an organization’s processes. E.g., a translator module can easily be connected upstream of a classification routine
      • Build-up of relevant know-how: Customers can learn how to apply the modules with own, customized procedures
      • Lower regulatory hurdles: Since AIaaM is not full software in the narrow sense, regulatory hurdles should be significantly lower
      • Transparent tools: Customers can integrate the modules into their own tools and maintain overall transparency
      • Low price: The price of a module is generally even lower than that of commercial software since there is no superstructure
      • Efficiency: Customers are not forced to purchase unnecessary features that are already covered by other tools
    • Cons
      • No access to training procedures: Customers acquire only the trained model, not the training data or training procedures. This is not necessarily a disadvantage, however, if only limited resources are available and the core business is a different one
      • Medium implementation effort: The installation requires some minimal programming, e.g. in Python. The necessary knowledge should, however, be available in any medium-sized or large organization

In summary, AIaaM combines several advantages of consulting, software, and SaaS – and avoids most disadvantages.

We are happy to support our customers with related issues.


Dr. Dimitrios Geromichalos
Founder / CEO
RiskDataScience UG (haftungsbeschränkt)
Theresienhöhe 28, 80339 München
Phone: +4989244407277, Fax: +4989244407001
Twitter: @riskdatascience

Association Rules Analyzer

Our free association rules analyzer can be accessed via this link

General information

The main task of the association analysis is the identification and validation of rules for the common occurrence of variables on the basis of past observation histories (“item lists”).
The variables can be of a variety of types, such as jointly purchased products in (online) commerce (“market basket analysis”). Accordingly, the determined rules can be used in a variety of ways, such as for book purchase recommendations or shelf arrangements in supermarkets.

Association analysis has become established in recent years, especially in online and retail trade. In addition, however, it can also be applied to countless other areas ranging from the analysis of co-occurring characters in television series to the identification of cause-and-effect relationships of operational loss events.

The basis of popular association analysis methods are powerful algorithms for rule determination, such as the “Apriori” algorithm.
In addition, some helpful metrics have been established to further investigate the rules found. The most common are:

  • Support. This is the frequency of common occurrence of variables. For example, 15% support for the combination (milk, eggs) means that milk and eggs were purchased in 15% of all observed purchases. The support does not depend on the order of the variables and can be between 0% and 100%.
  • Confidence. This means the security of a determined rule. An 80% confidence for the rule “Milk -> Eggs” would mean, for example, that in 80% of the cases where milk was bought, eggs were also purchased (for example, to bake cakes). Confidence is directional and can range from 0% to 100%.
  • Lift. This refers to the factor by which the common occurrence of variables is more frequent than would be expected if they were independent of each other. A lift of e.g. 3 for the combination (milk, eggs) would mean that this combination is three times more common than would be expected by chance. The lift can in principle assume any value greater than or equal to zero. A value greater than one implies that variables tend to coexist, a value less than one implies that they are more likely to be mutually exclusive.

Since the number of determined rules is often very large, it is usually essential to limit the rules in advance to a reasonable number. In particular, this can be done by setting lower limits for support and confidence and thus determining only the most important rules. The remaining rules can then be tabulated or graphically examined.
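The three measures above can be computed directly from a list of transactions. As a minimal pure-Python sketch (the shopping-basket sample is hypothetical; real analyses would use a dedicated library and an Apriori-style search over all candidate rules):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = frozenset(antecedent), frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)         # baskets containing the antecedent
    n_c = sum(1 for t in transactions if c <= t)         # baskets containing the consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # baskets containing both sides
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift

baskets = [frozenset(b) for b in (
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
    {"butter"},
)]

support, confidence, lift = rule_metrics(baskets, {"milk"}, {"eggs"})
# milk appears in 3 of 5 baskets, always together with eggs:
# support 0.6, confidence 1.0
```

Setting lower limits for support and confidence then simply means discarding all rules whose metrics fall below these thresholds.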

Our App


The goal of this free web app is to enable simple association analysis using uploadable item lists and adjustable measure thresholds.
From this, the corresponding rules are automatically determined and provided as a downloadable table and graphically.

We offer all of the methods available here also offline and in various extensions to B2B customers, for example for the analysis of operational loss events. We are happy to assist you with related questions.


The use of our association rules website is straightforward.

After the reCAPTCHA test, a custom CSV file (a comma-separated UTF-8 file with no special characters, no header, and no index) can be uploaded, and the minimum values for support (in %) and confidence (in %) as well as the lift boundaries for common and uncommon co-occurrences (in absolute numbers) can be set. If no CSV file is uploaded, a sample file is used instead.
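For illustration, an uploadable item list in this format (one observation per line, items separated by commas, no header or index; the products are a hypothetical example) could look like this:

```
milk,eggs,bread
milk,eggs
bread,butter
milk,bread,eggs
butter
```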

After clicking on “RUN” the calculation is started and the association rules are obtained and displayed.

Please note that the values of the support, confidence and lift boundaries have a significant effect on the number of determined rules. Values that are too low may result in a time-out due to too many obtained rules – values that are too high may result in no rules at all. It is therefore recommended to test different parameter combinations for new datasets until the required results are obtained.

As an output of the calculation, the top rules – i.e. those with the highest support and confidence – are obtained (see below). The rule components “antecedents” and “consequents” are shown in separate columns, together with the corresponding support, confidence and lift.

In order to obtain more “interesting” rules, it is also possible to retrieve only the rules with a minimum lift (“Lift 1”). These rules are shown in a separate table (see below) and indicate items that occur together much more often than would be expected by chance.

The rules for common co-occurrences are also displayed graphically (see below) to enable a more efficient analysis.

Similarly, rules for rare co-occurrences (i.e., elements that normally exclude each other) are also calculated and displayed. These rules are filtered by setting an upper limit for “Lift 2”.

The rules can also be downloaded as an XLS file.



Data Science-based identification of co-occurring operational damage events

Overview Challenge and Offer

Operational risk is as great a threat as it is hard to analyze, both for financial services and for industrial companies.
In spite of complex models, connections between different OpRisk events can hardly be identified in practice, and underlying causes often remain unrecognized.
On the other hand, data science methods have already been established for similar questions and allow the analysis of large amounts of different data in order to identify interdependencies, e.g. in the buying behavior of customers in online trading.

RiskDataScience has adapted existing data science methods to the requirements of operational risk management and has developed algorithms to identify interdependencies between operational losses.
With these, companies are able to identify causal relationships between damages and spend less time searching for common causes. The entire accumulated knowledge can be used efficiently in order to prevent future damage as far as possible or to anticipate it at an early stage.

Operational Risks


Operational risks can be assigned to the following categories, depending on the cause:

  • People: e.g. fraud, lack of knowledge, employee turnover
  • Processes: e.g. transaction errors, project risks, reporting errors, valuation errors
  • Systems: e.g. programming errors, crashes
  • External events: e.g. lawsuits, theft, fire, flooding


Usually, operational risks are categorized according to extent of damage and probability. Accordingly, suitable management strategies are:

  • Avoidance: for big, unnecessary risks
  • Insurance: for big, necessary risks
  • Mitigation: esp. for smaller risks with a high probability of occurrence
  • Acceptance: for risks that are part of the business model

Methods and Problem

The handling of operational risks is strictly regulated, especially in the financial services sector. For example, under Basel II / III, banks must underpin operational risks with equity capital. There are compulsory calculation schemes such as the Standardized Approach (SA) based on flat-rate factors and the Advanced Measurement Approach (AMA). The latter is based on distribution assumptions and will in future be replaced by the SA.

In terms of methodology, the following approaches to the treatment of operational risks can be distinguished, among others:

  • Questionnaires and self-assessment: probabilities and extents are determined in a rather qualitative way
  • Actuarial procedures: these rely on distribution assumptions derived from past damage
  • Key risk indicator procedures: easily observable measures are identified that serve for early warning
  • Causal networks: interdependencies are mapped using Bayesian statistics

With these approaches, interdependencies between operational risks and their causes can either not be determined at all or only in a very complex and error-prone manner.

Detecting relationships using data science techniques

Association analysis

For the analysis of connections between several different events (“items”), methods from the field of association analysis are recommended.
The respective “market basket analysis” methods have already been established for several years and are used in particular in online commerce (for example, book recommendations), for search engine suggestions, and in retail (product placement on shelves).
Using association analysis, the common occurrence of different events can be identified directly and without distributional assumptions.
The enormous number of possible rules can be efficiently limited by means of specially developed measures such as support, confidence and lift.
The analyses require programs based on appropriate analysis tools, e.g. Python, R or RapidMiner.

In addition, we offer a free web app for simple association analysis based on CSV files.

Analysis preparation

First, the damage data must be brought into a format usable for the analysis.
Depending on the type of damage, temporal aggregations (for example on a daily or weekly basis) must also be carried out.
Damage types that occur too frequently or are already explained must be removed on the basis of expert assessments.
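These preparation steps can be sketched in Python (all dates and damage types below are hypothetical): a loss-event log is aggregated into daily item lists, while damage types flagged by experts are dropped.

```python
from collections import defaultdict
from datetime import date

# Hypothetical loss-event log: (event date, damage type)
events = [
    (date(2023, 1, 10), "system crash"),
    (date(2023, 1, 10), "wrong valuation"),
    (date(2023, 1, 10), "bad transaction"),
    (date(2023, 1, 11), "late report"),
    (date(2023, 1, 11), "snowstorm"),
    (date(2023, 1, 11), "late report"),  # duplicates collapse into one item per day
]

def to_item_lists(events, excluded=frozenset()):
    """Aggregate damage types per day, dropping types excluded by expert assessment."""
    by_day = defaultdict(set)
    for day, damage_type in events:
        if damage_type not in excluded:
            by_day[day].add(damage_type)
    return [sorted(items) for _, items in sorted(by_day.items())]

item_lists = to_item_lists(events, excluded={"bad transaction"})
# [['system crash', 'wrong valuation'], ['late report', 'snowstorm']]
```

For weekly or monthly aggregations, only the grouping key (here the calendar day) would change.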

Conducting the analysis

Before the start of the analysis, the criteria for the relevant inference rules should be set in terms of support and confidence. The determination of the criteria can be supported by graphics.
Subsequently, the determined rules must be checked for plausibility by experts.
The steps should be repeated for all relevant time aggregations.

Use Case: analysis of a fictitious damage database

As an application example, a fictitious loss database of a bank was constructed for an entire year.
There were a total of 23 possible types of damage, including e.g. a flu epidemic, late reports, wrong valuations, and complaints about wrong advice. The following assumptions underlie the test example:

  • Bad transactions are very common
  • Deficiencies in the outsourcer hotline become apparent through requests due to PC head crashes
  • Reporting staff usually drive by car and are affected by a snowstorm
  • After a valuation system crashes, wrong valuations occur
  • Thefts occur during work after fire in the meeting room
  • Staff shortages at suppliers lead to failed projects
  • Massive customer complaints occur after experienced employees leave customer service

Because the bad transactions were very frequent and incoherent, they were removed first:

Damage frequency

First of all, all determined rules were graphically displayed to find the relevant support and confidence measurements.

Display of the rules on a daily basis

The restriction of the confidence to a minimum of 0.6 gives the list shown below.

Identified interdependencies on a daily basis

Of the found coincidences, the green ones turn out to be valid after a plausibility check.

On a weekly and monthly basis, the procedure was analogous:

Display of the rules on a weekly basis


Identified interdependencies on a weekly basis


Possible interdependencies on a monthly basis

After a plausibility check of possible causal relationships, all assumptions used in the preparation could be identified in the data.

Offer levels for using association analysis in OpRisk

RiskDataScience enables customers to use and develop the described processes efficiently and company-specifically. According to the respective requirements, the following three expansion stages are proposed.

Stage 1: Methodology

  • Introduction to the methodology of association analysis
  • Handover and installation of existing solutions based on Python, R and RapidMiner – or, depending on customer requirements, support of the on-site implementation
  • Transfer and documentation of the visualization and evaluation techniques

The customer is able to independently use and develop the methodology.

Stage 2: Customizing

  • Stage 1 and additionally
  • Adaptation and possibly creation of criteria for rule selection according to circumstances of the respective customer
  • Analysis of specific risks, processes and systems to identify optimal applications
  • Development of a process description for an efficient use
  • Communication and documentation of results to all stakeholders

The customer has customized procedures and operational risk analysis processes.

Stage 3: IT Solution

  • Stage 1, Stage 2, and additionally
  • Specification of all requirements for an automated IT solution
  • Suggestion and contacting of potential providers
  • Support in provider and tool selection
  • Assistance in planning the implementation
  • Professional and coordinative support of the implementation project
  • Technical support after implementation of the IT solution

The customer has an automated IT solution for efficient association analysis of operational risks.

Depending on customer requirements, a flexible design is possible. We are happy to explain our approach as part of a preliminary workshop.



Machine Learning-Based Credit Rating Early Warning

Overview Challenge and Offer

As an important type of risk, credit risks are quantified using sophisticated rating procedures. Due to the time-consuming preparation and the lack of up-to-date balance sheet data, ratings are available only with a delay. Banks have therefore already introduced market data-based early-warning systems for current credit risk signals, but these cannot provide any indications in the absence of market data.
On the other hand, corporate news and press articles often provide important information about problems and imbalances.
RiskDataScience has developed algorithms for the automatic detection and classification of news texts with regard to bankruptcy relevance (News-Based Early Warning).
This allows banks to extract valuable additional information about imminent insolvencies from news sources. An early recognition of credit risks is thus also possible for non-listed companies without direct market data.

Credit Risk Measurement


Credit risk is the risk of credit events such as default, late payment, credit downgrade or currency freeze.
Another distinction is the classification into issuer risk (for bonds), counterparty risk (for derivative transactions) and – considered in the following – the credit default risk of borrowers.
Credit risks are often the biggest bank risk and, in addition to market and operational risks, must be backed by equity under Basel II / III.

A frequently used indicator for quantifying credit risks is the expected loss of a loan. In the simplest case, this results as the product of:

  • PD: Probability of Default
  • LGD: Loss Given Default
  • EaD: Exposure at Default

External and internal credit ratings mainly measure the PD (and in part the LGD) and are determined using complex procedures.
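In this simplest case, the expected loss is just the product of the three factors. A short sketch with purely illustrative figures:

```python
def expected_loss(pd, lgd, ead):
    """Expected loss as the product EL = PD * LGD * EaD (simplest case)."""
    return pd * lgd * ead

# Illustrative figures: 2% probability of default, 45% loss given default,
# EUR 1,000,000 exposure at default
el = expected_loss(pd=0.02, lgd=0.45, ead=1_000_000)  # about EUR 9,000
```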

Determination and Early Detection

The methods for determining the PD require well-founded statistical analyses based on

  • quantitative balance sheet ratios such as debt ratio, equity ratio and EBIT
  • qualitative analyst key figures such as quality of management, future prospects and market position
  • general market data such as interest rates, inflation and exchange rates.

The rating models must be regularly validated against actual credit events and adjusted if necessary.
Credit ratings are therefore usually updated only with a delay – often merely annually.
To address this issue, market data-based early-warning systems have been introduced that provide signals based on significant changes in stock prices, credit spreads or other correlated market data. In general, however, only systematic risks or the risks of listed companies can be identified this way.

Information from News Texts


The reasons for bankruptcies are often company-specific (idiosyncratic) and cannot be derived from general market developments. Examples for this are:

  • Fraud cases by management
  • Bankruptcy of an important customer or supplier
  • Appearance of a new competitor

Negative events such as plant closures, short-time work, investigations and indictments are sometimes several months ahead of the actual bankruptcy.

In the case of non-listed companies, however, no market-data-based early warning is possible. On the other hand, news also provides up-to-date and often insolvency-relevant information in these cases.
News articles, blogs, social media and in particular local newspapers inform online and offline about problems of companies.
The efficient use of online texts makes it possible to extend the early warning to non-listed companies.

Efficient News Analysis

Methods for the efficient analysis of texts are a prerequisite for identifying the relevant news and, based on this, anticipating possible bankruptcies. This requires:

  • a timely identification of hundreds of data sources (newspapers, RSS feeds, etc.) taking into account the legal aspects
  • an automatic reading of the relevant messages about all customers based on given mandatory and exclusion criteria
  • a timely classification of the relevant texts on the basis of possible insolvency risks
  • an immediate analysis and visualization of the risk identification results

Already implemented machine learning algorithms serve as a basis for this seemingly impossible task.

Knowledge use through machine learning procedures

Automated Reading

As a first step, all relevant news sources (e.g., newspaper articles from specialized providers) must be identified on the basis of a sufficiently large sample of companies to be examined and irrelevant sources must be excluded wherever possible.

The messages are to be filtered according to relevance. In order to avoid confusion due to similar names or erroneous parts of the text (for example regarding equities), word filters and possibly complex text analyses are necessary.


For the classification of the extracted message texts, different text mining methods from the field of data science / machine learning can be considered. Supervised learning proceeds as follows:

  • first, the words that are irrelevant for the classification are determined manually (“stop words”)
  • the algorithms are then “trained” with known data records to associate texts with categories
  • new texts can then be assigned to known categories with specific confidences

Methodically, the following steps are to be carried out:

  • from the filtered texts, significant word stems / word stem combinations (“n-grams“) are determined
  • the texts are mapped as points in a high-dimensional space (with the n-grams as dimensions)
  • machine learning procedures identify laws for separating points into categories. For this purpose, dedicated algorithms such as naive Bayes, W-Logistic or Support Vector Machine are available

The analyses require programs based on appropriate analysis tools, e.g. R or Python.
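The training and classification steps above can be illustrated with a deliberately simplified, pure-Python multinomial naive Bayes (unigrams only; all snippets and labels are hypothetical – a real implementation would add stop-word removal, n-grams, and a proper machine learning toolkit):

```python
import math
from collections import Counter, defaultdict

def tokens(text):
    return text.lower().split()

class NaiveBayes:
    """Minimal multinomial naive Bayes with Laplace smoothing (illustration only)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokens(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label in self.class_counts:
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / n_docs)
            total = sum(self.word_counts[label].values())
            for w in tokens(text):
                score += math.log((self.word_counts[label][w] + 1) / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical training snippets
train_texts = [
    "insolvency filing plant closure short-time work",
    "investigation indictment fraud losses",
    "record profit dividend increase expansion",
    "new contract strong growth hiring",
]
train_labels = ["distressed", "distressed", "healthy", "healthy"]

clf = NaiveBayes().fit(train_texts, train_labels)
label = clf.predict("plant closure and short-time work announced")  # -> "distressed"
```

The geometric picture from the bullet list – texts as points in a high-dimensional n-gram space, separated by class – corresponds here to the per-word log-likelihood sums; support vector machines or W-Logistic would draw the separating boundary differently on the same representation.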

Sample Case

For about 50 insolvent companies and 50 non-insolvent reference companies, (German) message snippets were collected for a multi-month time horizon (3M-3W) before the respective bankruptcy.
The illustrated tag clouds provide an exemplary overview of the content of the texts.
With a RapidMiner prototype, the message texts were classified with regard to possible bankruptcies and the results were examined with in- and out-of-sample tests.

Tagcloud news for companies gone bankrupt
Tagcloud news for companies not gone bankrupt

Already on the basis of the tag clouds, a clear difference between the news about insolvent and non-insolvent companies can be seen.

The RapidMiner solution was trained with a training sample (70% of the texts) and applied to a test sample (30% of the texts).
Both the training sample (in-sample) and the test sample yielded accuracy rates of about 80%. The Area Under the Curve (AUC) was 90% in the in-sample case.
Based on the model confidences and the actual insolvencies, a PD calibration could also be performed.

Even with the relatively small training sample, a significant early detection of insolvencies could be achieved. Further improvements are to be expected with an extension of the training data.

Cost-Effective Implementation

Starting Position

Since there is not yet a uniform market for the delivery of Internet news, prices are often inconsistent. Different requirements for the cleaning routines and different technical approaches lead to large price ranges.
On the other hand, high-quality analysis tools such as R or RapidMiner (version 5.3) are currently available – some even free of charge.
In addition, about half of all online newspapers offer their headlines in the form of standardized RSS feeds.

Cost Drivers

The implementation and ongoing costs of message-based early warning systems can increase significantly, in particular for the following reasons:

  • An evaluation of news texts requires royalties to collecting societies (e.g. VG Wort in Germany) or a direct purchase
  • Automated reading is technically complicated
  • Maintaining advanced NLP (Natural Language Processing) algorithms to identify relevant text is costly

It is therefore necessary to examine to what extent the points mentioned are actually necessary, at least for a basic implementation.

Cost-Efficient Basic Solution

The cost-efficient RiskDataScience basic solution that has already been developed is based on the following assumptions:

  • information contained in headings and short snippets is sufficient for bankruptcy warnings
  • there are enough free RSS feeds that provide a sufficiently good overview of the situation of (medium-sized) companies
  • the relevance of the news snippets can be determined by simple text searches

Hundreds of news sources can be searched and bankruptcy signals identified for potentially thousands of companies within minutes.
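Such a search over collected snippets can, under these assumptions, be as simple as keyword matching (all company names, headlines, and term lists below are hypothetical):

```python
# Hypothetical snippets as (company, headline) pairs collected from RSS feeds
snippets = [
    ("Example AG", "Example AG files for insolvency after supplier default"),
    ("Example AG", "Example AG shares: analysts see a buying opportunity"),
    ("Muster GmbH", "Muster GmbH opens new plant"),
]

WARNING_TERMS = {"insolvency", "bankruptcy", "short-time work", "investigation"}
EXCLUSION_TERMS = {"shares", "stock"}  # avoid equity-related noise

def flag_snippets(snippets):
    """Keep snippets containing a warning term and no exclusion term."""
    flagged = []
    for company, headline in snippets:
        text = headline.lower()
        if any(t in text for t in WARNING_TERMS) and not any(t in text for t in EXCLUSION_TERMS):
            flagged.append((company, headline))
    return flagged

hits = flag_snippets(snippets)
# only the first snippet is flagged as a potential bankruptcy signal
```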

Copyright Issues

When implementing message-based early-warning systems, it is imperative to comply with the legal requirements that arise, in particular, from copyright law (e.g. UrhG in Germany).

This places narrow limits on the duplication and processing of news texts.
In particular, in the case of databases and further publications problems may occur in some jurisdictions.

On the other hand, there are many exceptions, especially with regard to temporary acts of reproduction, newspaper articles and radio commentary.

Although the processing of message snippets should generally be safe, legal advice is recommended due to the high complexity of the relevant laws.

Offer levels for using machine learning techniques for credit risk detection

RiskDataScience enables banks to use and develop the described procedures efficiently and institution-specifically. According to the respective requirements, the following three expansion stages are proposed.

Stage 1: Methodology

  • Briefing in the text classification methodology
  • Transfer and installation of the existing solution for tag cloud generation
  • Handover and installation of the existing solution – or, depending on customer requirements, support of the on-site implementation
  • Transfer and documentation of the visualization and evaluation techniques

The bank is able to independently use and develop the methodology.

Stage 2: Customizing

  • Stage 1 and additionally
  • Adjustment and possibly creation of reference groups according to the portfolios of the respective bank
  • Performing analyses and method optimization based on the portfolios and customer history of the bank
  • Adaptation of RSS sources
  • Development of a process description for an efficient use
  • Communication and documentation of results to all stakeholders

The customer has customized procedures and processes for analyzing message texts.

Stage 3: IT Solution

  • Stage 1, Stage 2, and additionally
  • Specification of all requirements for an automated, possibly web-based IT solution
  • Suggestion and contacting of potential providers
  • Support in provider and tool selection
  • Assistance in planning the implementation
  • Professional and coordinative support of the implementation project
  • Technical support after implementation of the IT solution

The bank has an automated IT solution for message-based early detection of insolvency signals.

Depending on customer requirements, a flexible design is possible. We are happy to explain our approach as part of a preliminary workshop.



Forex Risk Calculator

Our free forex risk calculator can be accessed via this link

General information

Especially in today’s internationally connected business world, it is often unavoidable even for smaller companies to engage in a wide variety of transactions in foreign currencies. Many orders are, for example, only possible or significantly cheaper in other currencies.

However, for each position entered into in a foreign currency – for assets as well as for liabilities – there are considerable risks from fluctuations in exchange rates which are characteristic for the foreign exchange or forex market.

Several procedures from Financial Risk Management have proven their worth in quantifying these so-called foreign exchange rate risks (FX risks) and have been established with financial service providers and larger companies for years.

Two key indicators are sensitivity and Value at Risk (VaR).

The sensitivity indicates how much the value of a foreign currency account changes if the exchange rate of the corresponding foreign currency increases by (here) 1%.

The VaR, on the other hand, indicates the extent of the loss that will not be exceeded at a certain confidence level (e.g. 95%) over a certain time horizon (“holding period”; e.g. 10 days). A key challenge in determining the VaR is taking account of correlations between exchange rates; exchange rates that are not fully correlated tend to lead to diversification and thus to a reduction of the overall risk compared with pure addition. The following two common methods are used here to determine the VaR:

  • Delta Normal Approach: Variances and correlations are determined on the basis of the history and calculated using the normal distribution assumption of VaR. This approach is easy to implement, but underestimates unlikely events.
  • Historical simulations: The historically observed changes are used as simulation scenarios. This method implicitly takes correlations and possible shocks into account, but its quality depends strongly on the underlying history.

For our app we use the histories of the last 1000 days for both methods.
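For a single foreign-currency position, the two methods can be sketched as follows (the daily returns are hypothetical, and the actual app additionally handles multi-currency portfolios and their correlations):

```python
import math
import statistics

# Hypothetical daily FX returns of one foreign-currency position (as fractions)
returns = [0.004, -0.012, 0.007, -0.003, 0.010, -0.008, 0.002, -0.015, 0.006, -0.001]
position_value = 100_000  # present value in local currency
z_95 = 1.645              # 95% quantile of the standard normal distribution

def delta_normal_var(returns, value, z, horizon_days=1):
    """VaR under the normal-distribution assumption, scaled by sqrt(horizon)."""
    sigma = statistics.stdev(returns)
    return value * sigma * z * math.sqrt(horizon_days)

def historical_var(returns, value, confidence=0.95):
    """VaR as the loss quantile of historically simulated P&L scenarios."""
    pnl = sorted(r * value for r in returns)
    idx = int((1 - confidence) * len(pnl))  # index of the loss quantile
    return -pnl[idx]

var_dn = delta_normal_var(returns, position_value, z_95, horizon_days=10)
var_hist = historical_var(returns, position_value)  # worst historical loss here
```

With a real 1000-day history, the historical VaR would take the 50th-worst scenario at the 95% level rather than the single worst one as in this tiny sample.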

Our App

Our FX Risk calculator enables the determination of exchange rate risks for portfolios of up to 19 foreign currency positions from the perspective of 5 local currencies — thus considering cross-currency dependencies and correlations for long as well as short positions.

With our app, even smaller companies without sophisticated financial risk procedures can obtain an indication of the possible FX risks of planned or actual deals or transactions.

A batch run determines the currency pair exchange rates of the latest day for which data were obtained (value date); please note that all the calculations refer to this day. After selecting the local currency and entering the foreign currency positions (each in foreign currency units), the holding period (in days), and the confidence level (in %) for the value at risk, the calculation can be started.

Forex risk calculation results: PV, sensitivity, VaR, simulation scenarios
Exemplary calculation results

The results are as shown above and include

  • the total cash value (money amount) in local currency
  • the delta normal VaR and the historical VaR

In addition,

  • the present values of the foreign currency position in local currency
  • the sensitivities of the foreign currency position (1% increase in foreign currency) in local currency
  • and the total NPV scenarios from the historical simulation

are graphically displayed.

We support our customers with these as well as further financial risk methods, e.g. for calculating interest rate or credit risks, and with questions regarding current data sources.



Dr. Dimitrios Geromichalos
Founder / CEO
RiskDataScience UG (haftungsbeschränkt)
Theresienhöhe 28, 80339 München
Telefon: +4989244407277, Fax: +4989244407001
Twitter: @riskdatascience

Regulytics® Demo: Building Codes ("Bauordnungen")


The Regulytics® application from RiskDataScience enables the automated analysis of legal and internal texts in terms of content.
The app provided on the "Bauordnungen" website is a freely usable version of Regulytics with a reduced scope. As an example, its focus is on the building codes of the German federal states.
These can thus be examined objectively for similarities. Similar paragraphs in the different texts can be identified, as can differences.

Further explanations of the background and motivation of semantic analyses can be found here.

B2B customers can request a free trial account for the extended online version of Regulytics, which focuses on regulations relevant to the financial services sector.
The offline version of Regulytics additionally allows arbitrary further texts to be included.

Furthermore, we gladly support our customers in consulting projects on introducing semantic analyses of complex and extensive texts.


The solution is simple and straightforward to use. The languages as well as the source and target text can be selected via dropdown menus.
A further input is the number of topics for the natural language processing model (we recommend values between 100 and 500).
Based on these inputs, the overall similarities are computed after a few seconds of processing time and displayed graphically as a radar chart (see diagram below); the measure used for this is cosine similarity.

Building codes most similar to the Bavarian building code

After selecting a "start paragraph" in the source text, the complete similarity matrix for all paragraphs is computed. Bright areas correspond to similar paragraphs between the source (x-coordinate) and target text (y-coordinate; see image below).

Paragraph-level comparison of the building codes of Bavaria (x-axis) and Baden-Württemberg (y-axis)

In addition, excerpts of the top 10 paragraphs (of the target text) most similar to the start paragraph are displayed, as are excerpts of the paragraphs of the source text with the highest and lowest similarities to the target text. Overlaps as well as differences between regulations can thus be identified easily.

Important note: when determining similar and differing paragraphs, the quantitative cosine similarity measure should always be kept in mind.
If, for example, the paragraph most similar to a given paragraph has a similarity of less than 0.5, it can in most cases be assumed that there is no actual similarity, i.e. that no similar paragraph exists.
The same applies to the determination of differences.
Determining lower (for similarities) and upper (for differences) cosine similarity bounds is possible and sensible, but depends on the question at hand.
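This note can be expressed as a simple decision rule. The sketch below is illustrative only: it scores plain bag-of-words vectors, whereas the app compares LSA topic vectors, and the 0.5 floor is the heuristic mentioned above, not a universal constant.

```python
import math
from collections import Counter

SIMILARITY_FLOOR = 0.5  # heuristic: below this, the best hit is treated as no real match

def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts over simple word-count vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(start_paragraph, target_paragraphs):
    """Most similar target paragraph as (index, score), or None below the floor."""
    scored = [(i, cosine_similarity(start_paragraph, p))
              for i, p in enumerate(target_paragraphs)]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= SIMILARITY_FLOOR else None
```

In practice the floor should be calibrated per question, as stated above; the constant here merely makes the rule explicit.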

Further analyses building on this procedure are easily possible.

We support our customers with related questions and are happy to explain our approach in a preliminary workshop as well.



Regulytics® is a registered trademark at the German Patent and Trade Mark Office.


An Automated Comparison of the Police Laws of Different German Federal States


Police law in Germany is regulated differently in each federal state. Nevertheless, large similarities are to be expected due to the analogous range of tasks.

The goal of the present analysis was to identify semantic similarities and differences between the various laws (as of 11 May 2018) using machine learning and natural language processing (NLP) methods, thereby exploiting the possibilities of digitalization.
The starting point was the Bavarian police tasks act (BayPAG), which was automatically compared with the corresponding laws of the other federal states.
In addition to overall similarities to the other laws, a paragraph-specific comparison was performed with the police laws of Thuringia (PAG), Baden-Württemberg (PolG), and Hamburg (SOG).

Methods and Tools

The NLP methods in the sense used here enable semantic analyses of texts based on the topics occurring in them, identifying similarities at any desired granularity.
In the method used, Latent Semantic Analysis, the considered terms are reduced to a given number of topics, thereby mapping texts onto a "semantic space".
New texts and text components can then be examined for semantic similarities.
The analyses require programs based on appropriate analysis tools, such as Python or R.


First, the units by which the texts are to be examined are determined (sentences, paragraphs, etc.).
Using a "training text", a mapping onto a given number of topics is determined (the "model").
The texts to be examined are also mapped with the model and then quantitatively examined for similarities.
The procedure can be automated and applied to a large number of texts.
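These steps can be sketched in a few lines of Python. This is a toy illustration with a hypothetical mini-corpus, not the production code: raw term counts stand in for the real preprocessing, and a truncated SVD provides the topic mapping.

```python
import numpy as np

def build_lsi_model(train_docs, n_topics):
    """Toy LSA/LSI: term-document matrix -> truncated SVD (the "model").
    Returns a function mapping any text onto the semantic space."""
    vocab = sorted({w for d in train_docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}

    def to_vec(doc):
        v = np.zeros(len(vocab))
        for w in doc.lower().split():
            if w in index:              # words unseen in training are ignored
                v[index[w]] += 1.0
        return v

    A = np.column_stack([to_vec(d) for d in train_docs])  # terms x documents
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :n_topics]                                  # topic directions

    def embed(doc):
        return Uk.T @ to_vec(doc)
    return embed

def similarity(x, y):
    """Cosine similarity of two embedded texts."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    return float(x @ y / (nx * ny)) if nx and ny else 0.0

# hypothetical mini-corpus standing in for the units of a training text
train = ["use of firearms general provisions",
         "transfer of data to external bodies",
         "search of dwellings procedure"]
embed = build_lsi_model(train, n_topics=2)
print(similarity(embed("provisions on the use of firearms"),
                 embed("procedure for the search of dwellings")))
```

In a real run, the paragraphs of a law would replace the mini-corpus, and the number of topics would be a tuning parameter rather than 2.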

Advantages of an Automated Analysis

Automated analyses can, by definition, be performed very quickly and in a standardized way.
Even with ordinary laptops, semantic similarities in dozens of complex laws can be analyzed within minutes.

The human effort for usage and possible further development is extremely low and largely independent of the number of laws considered.

The similarities between the laws at overall and paragraph level are quantitatively available and reproducible at any time.
Discrepancies caused by subjective preferences are thus practically ruled out.
Analyses can be documented in a comprehensible way.

Results of the Analysis

At the overall level, the analysis yields varying degrees of similarity between the Bavarian police law and those of the other federal states (see radar plot below).
The greatest similarity is to the police law of Thuringia, while the corresponding law of Hamburg differs most strongly from Bavaria's.

Similarity of the Bavarian police law to those of the other federal states

A paragraph-level comparison between the laws of Bavaria and Thuringia strikingly demonstrates the large similarity.
The bright diagonal of the similarity matrix (columns: Bavarian police law; rows: Thuringian police law) indicates strong similarities for most of the laws as well as an almost identical overall structure.
For example, BayPAG Art. 66 (general provisions on the use of firearms) and PAG § 64 (general rules on the use of firearms) are almost identical.
A relative exception, however, is BayPAG Art. 73 (legal remedies), for which no directly semantically retrievable paragraph exists in the PAG.
Download: list of similar paragraphs BayPAG – PAG

Similarity matrix of the police laws of Bavaria and Thuringia

As expected, the similarity matrix between the BayPAG and the Baden-Württemberg PolG shows stronger differences. The main diagonal is still recognizable, but interrupted and partly offset, which suggests a different overall structure.
The most similar paragraphs identified here were BayPAG Art. 41 (transfer of data to persons or bodies outside the public sector) and PolG § 44 (transfer of data to persons or bodies outside the public sector).
No counterpart was found for, among others, BayPAG Art. 57 (coercive detention in lieu of enforcement).
Download: list of similar paragraphs BayPAG – PolG

Similarity matrix of the police laws of Bavaria and Baden-Württemberg

The large differences between the BayPAG and the Hamburg police law examined here (SOG) are also clearly visible in the similarity matrix. The main diagonal is only fragmentarily preserved, with large areas without notable correspondence.
Similar paragraphs here are, in particular, BayPAG Art. 24 (procedure for searching dwellings) and SOG § 16a (procedure for searching dwellings), as well as BayPAG Art. 64 (threat of direct coercion) and SOG § 22 (threat of direct coercion).
Download: list of similar paragraphs BayPAG – SOG

Similarity matrix of the police laws of Bavaria and Hamburg

In conclusion, the machine learning and NLP methods available to us made it easy to identify similarities between the state-specific police laws at both overall and paragraph level.
The overall structure of selected laws could be compared graphically, and similar paragraphs as well as differences could be determined efficiently.

Further analyses building on this procedure are easily possible.

We support our customers with related questions and are happy to explain our approach in a preliminary workshop as well.

In addition, with Regulytics® we offer a web application for the automated analysis of regulations at overall and paragraph level.



Automated Semantic Analysis of Regulatory Texts


The extent and complexity of regulatory requirements pose significant challenges to banks even at the analysis stage. In addition, banks must handle inquiries from regulators (often ad hoc) and comment on consultation papers.
Procedures from the fields of NLP and machine learning, on the other hand, enable the effective and efficient use of available knowledge resources.

Our application Regulytics® enables the automated analysis of regulatory and internal texts in terms of content.
The app does not provide cumbersome general reports but concise, tailored, and immediate information about content-related similarities.
Regulatorily relevant texts can thus be placed in the overall context. Similar paragraphs in regulations and internal documents can be identified, as can differences between document versions.
Financial service providers can request a free trial account for the online version of Regulytics.

Regulatory Challenges

Regulations like IFRS 9, BCBS 239, FRTB, or IRRBB require fundamental changes to the banks' methods, processes, and/or systems.
Many regulations have far-reaching impacts on the risks, the equity capital, and thereby the business model of the affected banks.
The large number of final and consultative regulations also makes it difficult to monitor the requirements and their effects.

Regulations can affect different, interconnected areas of banks, like risk, treasury, finance, or IT.
In addition, there are also connections between the regulations; gaps with respect to one requirement generally correspond to additional gaps with respect to further requirements.
The diverse legislation in different jurisdictions increases the complexity once again.

Inside banks, impact analyses and prestudies are conducted in order to classify the relevance of regulations.
Several consulting firms conduct prestudies as well as the actual implementation projects, which are often characterized by long durations and high resource requirements.
Projects bind considerable internal resources and exacerbate bottlenecks.
External support is expensive and increases the coordination effort, especially in the case of several suppliers.
Errors in prestudies and early project phases can hardly be corrected later.
Due to the high complexity, there is a risk that impacts and interdependencies are not recognized in time.

Available Knowledge Resources

Original texts of the regulations and the consultation papers are normally freely available on the Internet and, in the case of EU directives, exist in several languages.
Regulators and committees provide further information, e.g. in the form of circulars and consultation papers.
Several institutes, portals, and consulting firms supply banks with partially free articles, white papers, and newsletters.

In addition, banks have collected extensive experience from already finalized or ongoing projects (project documentation, lessons learned).
Banks also have documentation of the methods, processes, and systems used, as well as of the responsibilities and organizational circumstances.
Internal blogs and similar channels bundle the expertise of the employees.

Advantages of an Automated Analysis

Speed Increase

Automated analyses can, by definition, be performed in a very fast and standardized way.
Even with ordinary laptops, semantic similarities of dozens of regulations can be analyzed within minutes.
Responsibilities and impacts can thus be recognized in time, e.g. in the case of consultation papers, and incorporated into statements.

Resource Conservation

Our solution runs without expensive hardware or software requirements.
The human effort for usage and possible enhancements is extremely low and practically independent of the number of regulations considered.
Bottlenecks are reduced, and experts can focus on the demanding tasks.
Project costs can thus be minimized.


The similarities between regulations at overall and paragraph level are quantitatively available and reproducible at any time.
Discrepancies caused by subjective preferences can practically be ruled out.
Analyses can be documented in a comprehensible way.
Prestudy results and statements of external suppliers can be checked without bias.

Error Reduction

Automated analyses provide an efficient additional control.
Non-trivial, and potentially overlooked, interdependencies between regulations can be identified and considered.
In particular, clerical errors and the overlooking of potentially important paragraphs can be minimized.
Potentially unnoticed gaps and impacts can also be detected.

Knowledge Usage via Topic Analysis

Methods and Tools

The methods of Natural Language Processing (NLP) enable a semantic analysis of texts on the basis of the topics contained therein, identifying similarities at any required granularity.
In the method used here, Latent Semantic Analysis (LSA, also called "Latent Semantic Indexing", LSI), the considered terms are mapped onto a given number of topics; accordingly, texts are mapped onto a "semantic space".
The topic determination is equivalent to an unsupervised learning process on the basis of the available documents.
New texts and text components can then be analyzed in terms of semantic similarities.
The analyses require programs written in appropriate languages, e.g. Python or R.


At first, the levels at which the texts are to be analyzed are determined (sentences, paragraphs, etc.).
Via a training text, a mapping onto a given number of topics is determined (the "model").
The texts to be analyzed are also mapped with the model and then quantitatively analyzed in terms of similarities.
As shown in the sketch on the right, the process can be automated and efficiently applied to a large number of texts.
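The final step of this process, the quantitative similarity comparison, can itself be sketched in code. For brevity the sketch below scores raw word-count vectors; in the described approach the paragraphs would first be mapped into the LSI semantic space, but the matrix and ranking logic is the same. All paragraph texts are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_matrix(source_pars, target_pars):
    """Pairwise similarities of all source/target paragraph pairs."""
    sv = [Counter(p.lower().split()) for p in source_pars]
    tv = [Counter(p.lower().split()) for p in target_pars]
    return [[cosine(s, t) for t in tv] for s in sv]

def top_matches(matrix, k=1):
    """Indices of the k most similar target paragraphs per source paragraph."""
    return [sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
            for row in matrix]

# hypothetical paragraphs from two document versions
source = ["capital requirements for credit risk", "treatment of market risk"]
target = ["credit risk capital requirements", "market risk treatment", "disclosure rules"]
m = similarity_matrix(source, target)
print(top_matches(m))
```

A bright diagonal in such a matrix corresponds to paragraph pairs with high scores, which is exactly what the matrix plots in the use cases visualize.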

Identification of similar paragraphs

Approach for an Analysis of Regulation-Related Texts

The approach for the analysis at total and paragraph level is determined by the bank's goals. We support you with detailed questions and with the development of specific solutions.

Analysis of regulation-related texts


In the following, three possible analyses of regulatory texts are outlined, which differ in their objectives. The analyses can easily be conducted with internal texts as well.

Use Case 1: Identification of Similarities

In this analysis, the regulation Basel II and the regulation Basel III: Finalising post-crisis reforms (often called "Basel IV") were considered.
The general comparison already indicates a strong cosine similarity between the two texts (see radar plot).
The matrix comparison over all paragraphs yields high similarities over wide areas (bright diagonal, see matrix plot).
The analysis at paragraph level yields numerous nearly identical sections concerning credit risks (see table).

Radar plot at total level
Similarity plot at paragraph level
Similar paragraphs

Use Case 2: Determination of Differences

A comparison between the German MaRisk regulations of the years 2017 and 2012 was conducted.
As already seen at the general level (see radar plot) and in the matrix plot over all paragraphs (bright diagonal), the texts are nearly identical.
However, disruptions in the main diagonal (red arrow, matrix plot) indicate some changes.
A corresponding analysis over all paragraphs yields the section "AT 4.3.4" (stemming from BCBS 239) as the biggest novelty.

Radar diagram at total level
Similarity matrix at paragraph level
“Novel” paragraphs

Use Case 3: Finding Similar Paragraphs

The regulations Basel III: Finalising post-crisis reforms ("Basel IV") and Basel III (BCBS 189) were considered.
Despite differences, an area of relatively high similarity can be recognized at paragraph level (red arrow, matrix plot).
For an analysis of this area, a corresponding paragraph from "Basel IV" was selected, and the most similar paragraphs from Basel III were determined.
As shown in the table, the corresponding paragraphs from both texts refer to Credit Value Adjustments (CVA).

Similarity matrix at paragraph level
Similar target paragraphs

Our Offer – On-Site Implementation

RiskDataScience enables banks to use and enhance the described procedures in an efficient and institute-specific way. Depending on the requirements, we propose the following three configuration levels.

Level 1: Methodology

  • Introduction to the Latent Semantic Indexing methodology with a focus on regulatory texts
  • Handover and installation of the existing Python solution for the automated loading and splitting of documents as well as the semantic analysis via LSI – or, depending on customer requirements, support of the on-site implementation
  • Handover and documentation of the visualization and analysis methods

The bank thus has enhanceable processes for the analysis of regulatory requirements at its disposal.

Level 2: Customization

  • Level 1, plus:
  • Adaptation of analysis entities (e.g. document groups) according to the analysis goals of the bank
  • Analysis of the concrete regulations, projects, methods, processes, and systems to identify the optimal use possibilities
  • Development of processes for achieving the analysis goals
  • Documentation and communication of the results to all stakeholders

The bank thus has customized processes for the analysis of regulatory requirements at its disposal, e.g. in terms of responsibilities or methods.

Level 3: IT Solution

  • Levels 1 and 2, plus:
  • Specification of all requirements for a comprehensive IT solution
  • Proposal and contacting of possible suppliers
  • Support in the supplier and tool selection
  • Support in the planning and implementation
  • Methodological and coordinative support during the implementation
  • Contact for methodological questions after the implementation

The bank thus has an automated IT solution for the efficient semantic comparison of regulatorily relevant text components at its disposal.

A flexible arrangement according to the customers' requirements is possible.

In addition, with our web app Regulytics® we offer a solution for an automated analysis of regulatory texts on total and paragraph level.



Automated Semantic Analysis of Regulatory Texts


The extent and complexity of regulatory requirements pose significant challenges to banks even at the analysis stage. In addition, banks must handle inquiries from regulators (often ad hoc) and comment promptly on consultation papers.

Procedures from the fields of NLP and machine learning enable the effective and efficient use of available knowledge resources.

Our application Regulytics® enables the automated analysis of regulatory and internal texts in terms of content.
The app does not deliver cumbersome general reports but concise, tailored, and immediate information on content-related similarities.
Regulatorily relevant texts can thus be objectively placed in the overall context. Similar paragraphs in regulations and internal documents can be identified, as can differences between document versions.
Financial service providers can request a free trial account for the online version of Regulytics.
In addition, we offer a free Regulytics demo version for analyzing the building codes of the German federal states: Link

Regulatory Challenges

Regulations such as IFRS 9, BCBS 239, FRTB, IRRBB, or the 2017 MaRisk amendment require fundamental changes to banks' methods, processes, and/or systems.
Many regulations have far-reaching impacts on the risks, the equity capital, and thereby the business model of the affected banks.
The large number of final and consultative regulations makes it difficult to monitor the requirements and their effects.

Regulations can affect different, interconnected areas of banks, such as risk, trading, finance, or IT.
In addition, there are also connections between the regulations; gaps with respect to one requirement generally correspond to gaps with respect to further requirements.
The generally differing legislation across jurisdictions increases the complexity considerably once again.

Within banks, impact analyses and prestudies are conducted to classify the relevance of the projects.
Numerous consulting firms conduct prestudies as well as the actual implementation projects, which are often characterized by long durations and high resource requirements.
Projects bind internal resources and exacerbate staff bottlenecks.
External support is expensive and increases the coordination effort, especially with several service providers.
Errors in prestudies and early project phases are hard to correct.
Due to the high complexity, there is a risk that impacts and interdependencies are not recognized in time.

Knowledge Resources for Coping with the Tasks

As external resources, banks first have access to the original texts of the regulations and consultations, which are usually freely available. Numerous relevant online portals regularly publish articles on the content and impact of regulatory requirements. Various consulting firms, in particular the Big 4, additionally provide banks with free articles, white papers, and newsletters. The Internet can thus, to a certain extent, already be regarded as a medium for preliminary analyses.

Internally, banks have already gathered extensive experience from finalized or currently ongoing projects, such as project documentation or lessons learned. Banks additionally have extensive documentation of the methods, processes, and systems used, as well as of responsibilities and organizational circumstances. Internal blogs and the like also bundle the expertise of the employees. Partial analyses are thus already available to a considerable extent.

Advantages of an Automated Analysis


Automated analyses can, by definition, be performed very quickly and in a standardized way.
Even with ordinary laptops, semantic similarities in dozens of complex regulations can be analyzed within minutes.
Responsibilities and impacts can thus be recognized in time (e.g. for consultation papers) and incorporated into statements.


Depending on the configuration, our solution runs without special hardware or software requirements.
The human effort for usage and possible further development is extremely low and largely independent of the number of regulations considered.
Bottlenecks are reduced, and experts can focus on demanding tasks.
Project costs can be reduced accordingly.


The similarities between the regulations at overall and paragraph level are quantitatively available and reproducible at any time.
Discrepancies caused by subjective preferences are thus practically ruled out.
Analyses can be documented in a comprehensible way.
Results of prestudies and statements of external service providers can be checked without bias.


Automated analyses provide an efficient additional control.
Non-trivial, and possibly overlooked, interdependencies between regulations can be identified and considered.
In particular, clerical errors and the overlooking of potentially important passages can be reduced.
Possibly unnoticed gaps and impacts can also be detected.

Knowledge Usage via Topic Analysis

Methods and Tools

The Natural Language Processing (NLP) methods in the sense used here enable semantic analyses of texts based on the topics occurring in them, identifying similarities at any desired granularity.
In the method used, Latent Semantic Analysis (LSA, also Latent Semantic Indexing, LSI), the considered terms are reduced to a given number of topics, thereby mapping texts onto a "semantic space".
The topic determination corresponds to an unsupervised learning process based on the given documents.
New texts and text components can then be examined for semantic similarities.
The analyses require programs based on appropriate analysis tools, such as Python or R.


First, the units by which the texts are to be examined are determined (sentences, paragraphs, etc.).
Using a "training text", a mapping onto a given number of topics is determined (the "model").
The texts to be examined are also mapped with the model and then quantitatively examined for similarities.
As sketched on the right, the procedure can be automated and efficiently applied to a large number of texts.

Identification of similar paragraphs

Approach for the Analysis of Regulatorily Relevant Texts

The approach for the analysis at overall and paragraph level depends on the respective business goals of the bank. We support you with detailed questions and with the development of specific solutions.

Analysis of regulatorily relevant texts


In the following, three possible analyses of regulatory texts are outlined, which differ in their objectives. The analyses can easily be extended to internal texts.

Use Case 1: Identification of Similarities

In this analysis, the regulation Basel II and the regulation Basel III: Finalising post-crisis reforms, often referred to as "Basel IV", were considered.
The general comparison already indicates a strong cosine similarity between these two texts (see radar chart).
The matrix comparison over all paragraphs yields correspondence over wide areas (bright diagonal, see matrix chart).
A comparison at paragraph level yields numerous nearly identical sections concerning credit risks (see table).

Radar chart at overall level
Similarity matrix at paragraph level
Similar paragraphs

Use Case 2: Determination of Differences

A comparison of the German-language MaRisk of the years 2017 and 2012 was conducted.
As the general comparison (radar plot) and the matrix comparison over all paragraphs (bright diagonal) show, the texts are largely identical.
However, interruptions of the main diagonal (red arrow) also indicate some novelties.
A similarity comparison over all sections yields the item "AT 4.3.4" as the biggest change compared with MaRisk 2012.

Radar chart at overall level
Similarity matrix at paragraph level
"Novel" paragraphs

Use Case 3: Finding Similar Paragraphs

The regulations Basel III: Finalising post-crisis reforms ("Basel IV") and Basel III (BCBS 189) were considered.
Despite differences, an area of relatively large similarity is recognizable at paragraph level (red arrow, below).
For a closer analysis of this area, a corresponding paragraph from "Basel IV" was selected and the most similar paragraphs from Basel III were determined.
As listed in the table below, the corresponding paragraphs from Basel III and IV refer to the CVA.

Similarity matrix at paragraph level
Similar target paragraphs







Angebotsstufen für einen regulatorischen Einsatz von Machine Learning-Verfahren

RiskDataScience ermöglicht Banken die beschriebenen Verfahren effizient und institutsspezifisch einzusetzen und weiterzuentwickeln. Entsprechend den jeweiligen Anforderungen werden dazu folgende drei Ausbaustufen vorgeschlagen.

Stufe 1: Methodik

  • Einweisung in Methodik zum Latent Semantic Indexing regulatorischer Texte
  • Übergabe und Installation der vorhandenen Python-Lösung zum automatisierten Einlesen und Zerlegen von Dokumenten sowie zur semantischen Analyse per LSI – bzw. je nach Kundenanforderung Unterstützung der Implementierung vor Ort
  • Übergabe und Dokumentation der Visualisierungs- und Auswertetechniken

Bank ist in der Lage Methodik zur Analyse regulatorischer Anforderungen eigenständig zu ver-wenden und weiterzuentwickeln

Level 2: Customizing

  • Level 1, plus
  • Adaptation of the analysis units (document groups) according to the analysis goals of the respective bank
  • Analysis of the concrete regulations, projects, methods, processes, and systems to identify optimal application areas
  • Development of processes for achieving the analysis goals
  • Communication and documentation of the results to all stakeholders

The bank has customized procedures and processes for the analysis of regulatory requirements, e.g. regarding responsibilities or methods.

Level 3: IT Solution

  • Levels 1 and 2, plus
  • Specification of all requirements for an automated, possibly web-based IT solution
  • Proposal and contacting of potential vendors
  • Support in the vendor and tool selection
  • Support in planning the implementation
  • Subject-matter and coordination support during the implementation project
  • Subject-matter support after the implementation of the IT solution

The bank has an automated IT solution for the efficient semantic comparison of regulatorily relevant text components.

Depending on customer needs, flexible arrangements are possible. We are also happy to explain our approach in a preliminary workshop.

In addition, with Regulytics® we offer a web application for the automated analysis of regulations at the overall and paragraph level.


Dr. Dimitrios Geromichalos
Founder / CEO
RiskDataScience UG (haftungsbeschränkt)
Theresienhöhe 28, 80339 München
Telefon: +4989244407277, Fax: +4989244407001
Twitter: @riskdatascience

Machine Learning-Based Classification of Market Phases


The experience of recent years, as well as research results and regulatory requirements, suggests the consideration of market regimes. Nevertheless, most of today's financial risk management is still based on the assumption of constant market conditions.
Currently, neither "stressed" market phases nor potential bubbles are determined in an objective way.
Machine learning procedures, however, enable a grouping according to risk aspects and a classification of the current market situation.
RiskDataScience has already developed procedures to identify market phases.
Market regimes can be determined on the basis of flexible criteria for historical time series, and the current market conditions can be assigned to the respective phases. Thus, it is possible to determine whether the current situation corresponds to past stress or bubble phases. In addition, historical stress scenarios can be detected in a systematic way.

Market Phases

In contrast to the efficient market theory, markets are characterized by exaggerations and panic situations (new economy, real estate bubbles, ...).
Crises exhibit their own rules – like increased correlations – and behave differently from "normal" phases. In the course of the crises since 2007/2008, the situation has changed dramatically several times (negative interest rates, quantitative easing, ...).

Regulators have realized that market situations can differ significantly and require the consideration of stressed market phases, e.g. in the

  • determination of “stressed VaR” periods
  • definition of relevant stress scenarios

In the conventional market risk management of financial institutions, however, only uniform market conditions are still considered (e.g. in conventional Monte Carlo simulations).
Historical simulations implicitly consider market phases, but they do not indicate which phase applies to a specific situation.
Finally, models like GARCH or ARIMA could not establish themselves outside academic research.

The neglect of market phases implies several problems and risks.
First, a non-objective determination of stressed market phases for regulatory purposes can lead to remarks and findings by internal and external auditors. As a consequence, possibly sensible capital relief can be denied because a less conservative approach cannot be justified in an objective way.
Also, ignoring potentially dangerous current market situations increases the risk of losses from market price fluctuations. In addition, bubbles are not detected in a timely manner, and the "rules" of crises (like increased correlations) are not considered in an appropriate way.
On the other hand, an overly cautious approach may result in missed opportunities.

Machine Learning Approaches

For the analysis of the relevant market data, several data science / machine learning algorithms can be considered and implemented with tools like Python, R, Weka, or RapidMiner. Here, the following groups of algorithms can be distinguished:

  • Unsupervised learning algorithms: These can be used to determine "natural" clusters and to group market data according to predefined similarity criteria. This requires appropriate algorithms like k-means or DBSCAN as well as economic and financial domain expertise. Outlier-detection algorithms can additionally be used to detect anomalous market situations, e.g. as a basis for stress-test scenarios.
  • Supervised learning algorithms: These (e.g. Naive Bayes) are "trained" with known data sets to classify market situations. New data – and especially the current situation – can then be assigned to the market phases.
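As an illustration of the unsupervised step, the following minimal sketch clusters synthetic daily returns into a large "normal" and a small "crisis"-like regime with k-means; it is not the original implementation, and the data is invented:

```python
# Illustrative sketch: k-means clustering of synthetic daily returns
# into a "normal" and a "crisis" regime (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
calm = rng.normal(0.0, 0.01, size=(950, 3))    # low-volatility days
stress = rng.normal(0.0, 0.05, size=(50, 3))   # high-volatility days
returns = np.vstack([calm, stress])
features = np.abs(returns)  # volatility proxy so k-means can separate regimes

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# The larger cluster is interpreted as the "normal" phase
normal_label = np.bincount(labels).argmax()
print((labels == normal_label).mean())  # share of days in the normal cluster
```

In a real application, the features would be the windowed returns and differences of the actual market data, and the number and interpretation of clusters would require economic validation.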

For a risk-oriented analysis, market data differences (e.g. for interest rates) or returns (e.g. for stock prices) must be calculated from the market data time series as a basis for the further analysis. Furthermore, a "windowing" must be conducted, i.e. the relevant values of the preceding days must be included as additional variables.
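This preprocessing (returns and differences plus windowing) can be sketched as follows with pandas; the column names, sample values, and the window length are illustrative assumptions:

```python
# Sketch of the preprocessing step: returns/differences plus windowing.
import pandas as pd

prices = pd.DataFrame({
    "DAX30":     [10000, 10100, 9990, 10050, 10200],  # index levels
    "EURIBOR3M": [-0.30, -0.31, -0.29, -0.30, -0.32], # interest rates
})
features = pd.DataFrame({
    "DAX30_ret":      prices["DAX30"].pct_change(),   # returns for prices
    "EURIBOR3M_diff": prices["EURIBOR3M"].diff(),     # differences for rates
}).fillna(0.0)  # missing return values replaced by zeros

# Windowing: append the values of the previous n days as extra columns
n = 2
windowed = pd.concat(
    {f"lag{k}": features.shift(k).fillna(0.0) for k in range(n + 1)}, axis=1
)
print(windowed.shape)  # (5, 6): 2 base features x (n + 1) lags
```

With the 20-day window used in the use case below, each observation would carry 21 lagged copies of every base feature instead of the 3 shown here.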

Use Case: Analysis of Illustrative Market Data

The analysis described below was based on a market data set consisting of the DAX 30 index, the EURIBOR 3M interest rate, and the EURUSD FX rate. The time period ranged from the end of 2000 to the end of 2016. For the calculations, daily closing prices were used consistently as the basis for the return (DAX 30, EURUSD) and difference calculations (EURIBOR 3M). Structural breaks were adjusted where necessary, and missing return values were replaced by zeros. The windowing covered the preceding 20 days.

Time series of analyzed market data

The data set was analyzed with the clustering algorithms k-means and DBSCAN. As a result, most points in time could be assigned to a large "normal" cluster; the rest of the data points fell into a smaller "crisis" cluster.
Since crisis phases were observed to often precede "real" crashes, the procedure could be helpful as a "bubble detector".

Identified market phases

The main identified outliers were:

  • spring of 2001: burst of the dotcom bubble
  • autumn of 2001: September 11
  • autumn of 2008: Lehman insolvency

The current time period is not classified as a crisis; however, the extraordinary situation of negative interest rates counsels caution.

Based on a training set of 3,000 points in time, the classification algorithms were trained and then applied to a test set of 1,000 points.
A suitable simple algorithm was Naive Bayes; with this algorithm, accuracies of over 90% were reached in both in-sample and out-of-sample tests.
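The supervised step can be sketched along these lines with a Gaussian Naive Bayes classifier; the data here is synthetic, so the resulting accuracy only illustrates the setup and is not the figure reported above:

```python
# Hedged sketch of the classification step: Gaussian Naive Bayes trained
# on labelled market-phase features (synthetic volatility-like data).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X_normal = np.abs(rng.normal(0.0, 0.01, size=(2000, 3)))  # calm days
X_crisis = np.abs(rng.normal(0.0, 0.05, size=(1000, 3)))  # stressed days
X = np.vstack([X_normal, X_crisis])
y = np.array([0] * 2000 + [1] * 1000)  # 0 = normal, 1 = crisis

# Train on 2,000 points, test on the remaining 1,000 (shuffled split)
idx = rng.permutation(len(y))
train, test = idx[:2000], idx[2000:]
clf = GaussianNB().fit(X[train], y[train])
acc = accuracy_score(y[test], clf.predict(X[test]))
print(round(acc, 3))
```

Naive Bayes is attractive here because it is fast, needs little tuning, and yields class probabilities that can serve as a graded "crisis score" for the current market situation.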

Hence, an efficient distinction of market phases is already feasible, and a use as a bubble detector is possible after economically and financially sound validation.


The methods can be enhanced to capture more complex cases and issues, e.g. for specialized markets like the electricity market, as well as patterns and rules characteristic of high-frequency trading (HFT).

We are developing respective methods and tools and support our customers in obtaining an overall perspective of the data in use.

