Reproducible Pipeline for Cross-Country Roma Survey

Client

Year

2025

Location

Technologies

GitHub, ODK, R, Stata
Reproducible Pipeline for Cross-Country Roma Survey

Generating reliable, comparable data across three countries with separate field teams and survey instruments is one of the harder challenges in international development research. For a landmark baseline survey on the socio-economic vulnerabilities of Roma populations in Georgia, Moldova, and Ukraine, rowsquared led the full data validation and harmonization process — turning complex, inconsistent field data into a clean, reproducible, analysis-ready evidence base.

The survey was commissioned by UNDP in partnership with the World Bank and the EU’s Directorate-General for Enlargement and the Eastern Neighbourhood. It was the first of its kind in these three countries, designed to inform inclusive policy development and serve as the foundation for UNDP’s Digital Social Vulnerability Index and World Bank technical country reports. Data was collected by two separate survey companies, which made cross-country consistency both essential and technically demanding.

rowsquared covered the full scope of the validation work — from senior methodological oversight to hands-on technical data cleaning. We began by reviewing the survey methodology, sampling frameworks, and questionnaire documentation, then systematically audited the datasets against those standards.

The core of our work was a rigorous data cleaning and validation exercise. We flagged hundreds of individual issues: missing values, invalid codes, outlier responses, skip logic violations, and internal inconsistencies. Structured feedback was compiled and sent back to the survey firms in each country, triggering multiple rounds of review and correction that required close coordination across teams working in different languages and institutional contexts.

Alongside the cleaning work, we built a cross-country variable mapping matrix documenting all structural discrepancies between the Georgia–Moldova and Ukraine datasets. We then defined and applied harmonization rules to align variable names, coding schemes, value labels, and metadata formats across both sources. We also reviewed the weighting methodology and recalculated survey weights to ensure they were methodologically sound.

The centrepiece of our contribution was a push-button reproducible data pipeline. We structured the entire workflow — from raw field data through intermediate cleaning stages to final analysis-ready datasets — in a standalone repository. The repository includes clearly organized data folders covering raw, intermediate, and final outputs, a comprehensive README, and a data validation report that documents every issue found, every decision made, and every transformation applied.

UNDP and the World Bank now have clean, harmonized, and fully reproducible datasets for three countries, together with the documentation needed to use them with confidence — from field collection all the way through to analysis.