The most important part of correcting data errors is having a clean data model, which was one of the first things we built at Castellum.AI. Our data categorization was designed by sanctions experts and data science Ph.D.s, and is automated for maximum efficiency. When we first load a new list, we perform extensive QA; after that, the process is almost entirely automated and involves three key steps:
1. Extraction: We extract valuable data from unstructured text through proprietary algorithms. Some list sources use categories, but many simply use a single field, usually “name” or “additional information,” to hold items like place of birth, date of birth and ID numbers, obscuring useful information in a blob of text. Our algorithms pull passport numbers, citizenship information, vessel identification numbers (IMO) and more, categorizing the data and making it searchable (a simplified extraction sketch follows this list).
2. Standardization: Watchlist data from different issuers comes in different file formats, including CSV, PDF, XML, JSON and sometimes even images. Every list issuer has different standards and formats: some organizations label a person’s passport their Citizenship, others a Nationality; different spellings and abbreviations exist for the same countries and locations; dates are ordered differently; and much more. Accurate data standardization requires powerful matching algorithms and leveraging additional data sources and standards to do things like eliminate noise words and convert all country information to the ISO 3166 standard (see the standardization sketch after this list).
3. Enrichment: We analyze an entry’s data and then enhance it by adding additional information. For example, whenever a watchlist entry does not have a type (individual, entity, vessel, aircraft, location), we use a set of algorithms to assign it one (a simplified type-assignment sketch follows this list). This is a key data category that helps determine how entries are searched and screened, and a major driver of false positive reduction.
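
To make the extraction step concrete, here is a minimal Python sketch of pulling structured fields out of a free-text “additional information” blob. The regular expressions and field names are illustrative assumptions, not Castellum.AI’s proprietary extraction logic, which covers far more identifier types and formats.

```python
import re

# Hypothetical patterns for illustration only; production extraction covers many more fields.
PATTERNS = {
    "passport_number": re.compile(r"\bPassport(?:\s+No\.?)?[:\s]+([A-Z0-9]{6,9})", re.IGNORECASE),
    "date_of_birth": re.compile(r"\bDOB[:\s]+(\d{1,2}\s\w+\s\d{4}|\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "imo_number": re.compile(r"\bIMO[:\s]+(\d{7})\b", re.IGNORECASE),
}

def extract_fields(blob: str) -> dict:
    """Pull structured, searchable fields out of a free-text watchlist entry."""
    extracted = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(blob)
        if match:
            extracted[field] = match.group(1)
    return extracted

# A single unstructured blob holding several distinct data points.
blob = "DOB: 1975-03-14; Passport No: K1234567; IMO: 9074729"
print(extract_fields(blob))
# {'passport_number': 'K1234567', 'date_of_birth': '1975-03-14', 'imo_number': '9074729'}
```
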
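The standardization step can be illustrated the same way: mapping free-form country strings to ISO 3166 codes and reordering dates into one canonical format. The alias table and date layouts below are small, assumed examples; a real pipeline would cover every ISO 3166 entry, known aliases and many more date conventions.

```python
from datetime import datetime

# Illustrative alias table; a real pipeline covers all ISO 3166 entries and known variants.
COUNTRY_ALIASES = {
    "russia": "RU",
    "russian federation": "RU",
    "dprk": "KP",
    "north korea": "KP",
    "united states": "US",
    "u.s.a.": "US",
}

# Date layouts commonly seen across list issuers (day-first, month-name, ISO, etc.).
DATE_FORMATS = ["%d %b %Y", "%d %B %Y", "%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y"]

def normalize_country(raw: str) -> str | None:
    """Map a free-form country string to an ISO 3166-1 alpha-2 code."""
    return COUNTRY_ALIASES.get(raw.strip().lower())

def normalize_date(raw: str) -> str | None:
    """Convert a date in any known issuer format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable dates for manual review

print(normalize_country("Russian Federation"))  # RU
print(normalize_date("14 Mar 1975"))            # 1975-03-14
```

One design note: when day-first and month-first layouts are both possible, the first matching format wins, so ambiguous dates are exactly the kind of case that benefits from issuer-specific rules and QA.
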
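Finally, a simplified sketch of the enrichment step’s type assignment. These rules are assumptions chosen to illustrate the idea, not Castellum.AI’s actual algorithms; the field and suffix names are hypothetical.

```python
# Rule-based type assignment for illustration; the production system uses a larger
# set of algorithms and signals (this is not Castellum.AI's actual logic).
ENTITY_SUFFIXES = ("llc", "ltd", "inc", "corp", "gmbh", "s.a.")

def assign_entry_type(entry: dict) -> str:
    """Infer whether an entry is an individual, entity, vessel, aircraft, or unknown."""
    fields = entry.get("fields", {})
    name = entry.get("name", "").lower()

    if "imo_number" in fields:              # IMO numbers only apply to vessels
        return "vessel"
    if "aircraft_tail_number" in fields:    # hypothetical field name for illustration
        return "aircraft"
    if "date_of_birth" in fields or "passport_number" in fields:
        return "individual"                 # biographic identifiers imply a person
    if name.endswith(ENTITY_SUFFIXES):      # corporate suffixes imply a legal entity
        return "entity"
    return "unknown"                        # flag for analyst review

entry = {"name": "Example Shipping Ltd", "fields": {}}
print(assign_entry_type(entry))  # entity
```
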