Data collection

The data collection framework for the Comparative Legislators Database (CLD) relies on the open-collaboration platforms Wikipedia and Wikidata.

The data collection is divided into four steps:

  • Entity identification
  • Data collection
  • Cleaning and verification
  • Database integration


1. Entity identification

The process starts from legislatures' Wikipedia index pages. During the first step of the data collection, you need to identify and scrape parliamentary members from the index sites to extract the member-specific Wikipedia and Wikidata IDs.


2. Data collection

With the Wikipedia identifiers, you can collect most of the information about the individual legislators from the properties in their Wikidata entry. In some instances, you can tackle incomplete data by integrating external data sources.


3. Cleaning and verification

During the third step, you will clean and verify our data. This normally means removing noise in values, harmonizing variables within and across legislatures, and performing checks against alternative data sources.


4. Database integration

The final step of the data collection process entails integrating the newly compiled data with the database interface. The database is built around the Wikipedia and Wikidata IDs and provide access via the legislatoR package in R.


Start the tutorial