Economy Data Observatory

How We Add Value to Public Data With Imputation and Forecasting

Mon, 08 Nov 2021 10:00:00 +0100

Public data sources are often plagued by missing values. Naively you may think that you can ignore them, but think twice: in most cases, missing data in a table is not missing information, but rather malformatted information. Ignoring or dropping missing values is neither feasible nor robust when you want to make a beautiful visualization, or use data in a business forecasting model, a machine learning (AI) application, or a more complex scientific model. All of the above require complete datasets, and naively discarding missing data points amounts to an excessive waste of information. In this post we continue the example of a not-so-easy-to-find public dataset.

In the previous blogpost we explained how we added value by documenting data following the FAIR principles, and with the professional curatorial work of placing the data in context and linking it to other information sources, such as other datasets, books, and publications, regardless of their natural language (i.e., whether these sources are described in English, German, Portuguese or Croatian). Photo: Jack Sloop.

Completing missing datapoints requires statistical production information (why might the data be missing?) and data science know-how (how to impute the missing value). If you do not have a good statistician or data scientist in your team, you will need high-quality, complete datasets. This is what our automated data observatories provide.

Why is data missing?

International organizations offer many statistical products, but usually they are on an ‘as-is’ basis. For example, Eurostat is the world’s premier statistical agency, but it has no right to overrule whatever data the member states of the European Union, and some other cooperating European countries, give to it. Nor can it force these countries to hand over data if they fail to do so. As a result, there will be many data points that are missing, and often data points that have wrong (obsolete) descriptions or geographical dimensions. We will show the geographical aspect of the problem in a separate blogpost; for now, we only focus on missing data.

Some countries have only recently started providing data to the Eurostat umbrella organization, and it is likely that you will find few datapoints for North Macedonia or Bosnia-Herzegovina. Other countries provide data with some delay, and the last one or two years are missing. And there are gaps in some countries’ data, too.

See the authoritative copy of the dataset.

This is a headache if you want to use the data in some machine learning application or in a multiple or panel regression model. You can, of course, discard countries or years where you do not have full data coverage, but this approach usually wastes too much information: if you work with 12 years, and only one data point is missing, you would be discarding an entire country’s 11 years’ worth of data. Another option is to estimate the values, or otherwise impute the missing data, when this is possible with reasonable precision. This is where things get tricky, and you will likely need a statistician or a data scientist on board.
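The cost of discarding incomplete cases can be sketched in a few lines of Python. This is only an illustration with invented numbers: the country codes AA and BB, the years and the values are hypothetical and are not taken from the Eurostat table.

```python
# Hypothetical panel: 12 years (2008-2019) for two countries; all values are invented.
years = list(range(2008, 2020))
panel = {
    "AA": {year: 100.0 + i for i, year in enumerate(years)},
    "BB": {year: 50.0 + i for i, year in enumerate(years)},
}
panel["BB"][2015] = None  # a single missing observation


def complete_countries(data):
    """Keep only countries with no missing year (discarding incomplete cases)."""
    return {
        country: series
        for country, series in data.items()
        if all(value is not None for value in series.values())
    }


kept = complete_countries(panel)
known_before = sum(v is not None for s in panel.values() for v in s.values())
known_after = sum(len(s) for s in kept.values())
discarded_known = known_before - known_after
print(f"Known observations thrown away: {discarded_known}")
```

One missing cell removes the whole country BB, together with its eleven known values.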

What can we improve?

Suppose that the data for a particular country is missing for only one year, 2015. The naive solution would be to omit 2015 or the country at hand from the dataset. This is pretty destructive, because we know a lot about the radio market turnover in this country and in this year! But leaving 2015 blank will not look good on a chart, and will make your machine learning application or your regression model stop.

A statistician or a radio market expert will tell you that you more or less know the missing information: the total turnover was certainly not zero in that year. With some statistical or radio domain-specific knowledge you will use the 2014 or the 2016 value, or a combination of the two, and keep the country and year in the dataset.

Our improved dataset adds backcasted data (using the best time series model fitting the country’s actually present data), forecasted data (again, using the best time series model), and approximated data (using linear approximation). In a few cases, we add the last or next known value. To give a few quantitative indicators about our work:

Increased number of observations: 65%
Reduced missing values: -48.1%
Increased non-missing subset for regression or AI: +66.67%
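Two of the simpler techniques mentioned above, linear approximation between known values and carrying the last or next known value to the edges, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name fill_gaps, the flag names and the numbers are hypothetical, and the sketch does not cover backcasting or forecasting with a fitted time series model.

```python
def fill_gaps(series):
    """Fill None gaps: linear interpolation inside the series,
    nearest known value at the edges. Returns (values, flags)."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("at least one observed value is needed")
    values, flags = [], []
    for i, v in enumerate(series):
        if v is not None:
            values.append(v)
            flags.append("observed")
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            weight = (i - lo) / (hi - lo)
            values.append(series[lo] + weight * (series[hi] - series[lo]))
            flags.append("interpolated")
        elif before:
            values.append(series[before[-1]])
            flags.append("last_known")
        else:
            values.append(series[after[0]])
            flags.append("next_known")
    return values, flags


turnover = [None, 10.0, 12.0, None, 16.0, None]  # invented numbers
filled, status = fill_gaps(turnover)
print(list(zip(filled, status)))
```

Returning a flag next to every value keeps the observed and the estimated data points distinguishable after the gaps are filled.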

If your organization is working with panel (longitudinal multiple) regressions or various machine learning applications, then your team knows that not having the +66.67% gain would be a deal-breaker in the choice of models and in the precision of estimates, KPIs or other quantitative products. They also know that they would spend about 90% of their data resources on achieving this +66.67% gain in usability.

If you happen to work in an NGO, a business unit or a research institute that does not employ data scientists, then it is likely that you can never achieve this improvement, and you have to give up on a number of quantitative tools or visualizations. If you have a data scientist onboard, that professional can use our work as a starting point.

Can you trust our data?

We believe that you can trust our data more than the original public source. We use statistical expertise to find out why data may be missing. Often, it is present in a wrong location (for example, because the name of a region changed).

If you are reluctant to use estimates, think about discarding known actual data from your forecast or visualization, because one data point is missing. How do you provide more accurate information? By hiding known actual data, because one point is missing, or by using all known data and an estimate?

Our codebooks and our API use the Statistical Data and Metadata eXchange (SDMX) documentation standards to clearly indicate which data is observed, which is missing, which is estimated, and of course, also how it is estimated. This highlights another important aspect of data trustworthiness: if you have a better idea, you can replace our estimates with better ones.
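A minimal Python sketch shows how such status flags can be used. It assumes a small subset of the SDMX observation status code list (A for a normal value, E for an estimated value, M for a missing value); the record layout, the field names and the numbers are hypothetical and do not describe our actual API schema.

```python
# Assumed subset of the SDMX observation status code list.
OBS_STATUS = {"A": "normal (observed) value", "E": "estimated value", "M": "missing value"}

# Hypothetical records; field names and numbers are illustrative only.
records = [
    {"geo": "AA", "time": 2014, "value": 10.0, "obs_status": "A", "method": "actual"},
    {"geo": "AA", "time": 2015, "value": 11.0, "obs_status": "E", "method": "linear approximation"},
    {"geo": "AA", "time": 2016, "value": 12.0, "obs_status": "A", "method": "actual"},
    {"geo": "BB", "time": 2015, "value": None, "obs_status": "M", "method": "missing"},
]


def only_observed(rows):
    """Keep the rows that carry an actually observed value."""
    return [r for r in rows if r["obs_status"] == "A"]


def replace_estimate(rows, geo, time, new_value, method):
    """Return a copy where one estimated value is swapped for the user's own estimate."""
    out = []
    for r in rows:
        if r["geo"] == geo and r["time"] == time and r["obs_status"] == "E":
            r = {**r, "value": new_value, "method": method}
        out.append(r)
    return out


replaced = replace_estimate(records, "AA", 2015, 10.8, "own model")
```

Only rows flagged as estimated can be overwritten in this sketch, so observed values stay untouched and the original records are not modified.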

Our indicators come with standardized codebooks that contain not only the descriptive metadata, but also administrative metadata about the history of the indicator values. You will find very important information about the statistical method we used to fill in the data gaps, and even links to the reliable, peer-reviewed scientific statistical software that made the calculations. For data scientists, we record plenty of information about the computing environment, too. This can come in handy if your estimates need external authentication, or if you suspect a bug.

Avoid the data Sisyphus

If you work in an academic institution, in an NGO or a consultancy, you can never be sure who downloaded the Annual detailed enterprise statistics for services (NACE Rev. 2 H-N and S95) Eurostat folder from Eurostat. Did they modify the dataset? Did they already make corrections with the missing data? What method did they use? To prevent many potential problems, you will likely download it again, and again, and again…

See our The Data Sisyphus blogpost.

We have a better solution. You can always rely on our API to import directly the latest, best data, but if you want to be sure, you can use our regular backups on Zenodo. Zenodo is an open science repository managed by CERN and supported by the European Union. On Zenodo, you can find an authoritative copy of our indicator (and its previous versions) with a digital object identifier, in this case, 10.5281/zenodo.5652118. These datasets will be preserved for decades, and nobody can manipulate them. You cannot accidentally overwrite them, and we have no backdoor access to modify them.

Are you a data user? Give us some feedback! Shall we do some further automatic data enhancements with our datasets? Document with different metadata? Link more information for business, policy, or academic use? Please give us any feedback!

How We Add Value to Public Data With Better Curation And Documentation?

Mon, 08 Nov 2021 09:00:00 +0100

In this example, we show a simple indicator: the Turnover in Radio Broadcasting Enterprises in many European countries. This is an important demand driver in the Music economy pillar of our Digital Music Observatory, and an important indicator in our more general Cultural & Creative Sectors and Industries Observatory. Of course, if you work with competition policy or antitrust, then any industry may be interesting to you, but not all of them are well served with data.

This dataset comes from a public datasource, the data warehouse of the European statistical agency, Eurostat. Yet it is not trivial to use: unless you are familiar with national accounts, you will not find this dataset on the Eurostat website.

The data can be retrieved from the Annual detailed enterprise statistics for services NACE Rev.2 H-N and S95 Eurostat folder.

Our version of this statistical indicator is documented following the FAIR principles: our data assets are findable, accessible, interoperable, and reusable. While the Eurostat data warehouse partly fulfills these important data quality expectations, we can improve them significantly. And we can improve the dataset itself, too, as we will show in the next blogpost.

Findable Data

Our data observatories add value by curating the data: we bring this indicator to light with a more descriptive name, and we place it in a domain-specific context with our Digital Music Observatory and Cultural & Creative Sectors and Industries Observatory and a policy-specific context with our Competition Data Observatory and Green Deal Data Observatory. While many people may need this dataset in the creative sectors, or among cultural policy designers, most of them have no training in working with national accounts, which implies deciphering national account data codes in records that measure economic activity at a national level. Our curated data observatories bring together many available data around important domains. Our Digital Music Observatory, for example, aims to form an ecosystem of music data users and producers.

We added descriptive metadata that help you find our data and match it with other relevant data sources. For example, we add keywords and standardized metadata identifiers from the Library of Congress Linked Data Services, probably the world’s largest standardized knowledge library description. This ensures that you can find relevant data around the same key term (radio broadcasting) in addition to our turnover data. This allows connecting our dataset unambiguously with other information sources that use the same concept, but may be listed under different keywords, such as Radio–Broadcasting, or Radio industry and trade, or maybe Hörfunkveranstalter in German, or Emitiranje radijskog programa in Croatian or Actividades de radiodifusão in Portuguese.

Accessible Data

Our data is accessible in two forms: in csv tabular format (which can be read with Excel, OpenOffice, Numbers, SPSS and many similar spreadsheet or statistical applications) and in JSON for automated importing into your databases. We can also provide our users with SQLite databases, which are fully functional, single user relational databases.

Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This makes the data easier to clean, and far easier to use in a much wider range of applications than the original data we used. In theory, this is a simple objective, yet we find that even governmental statistical agencies, and even scientific publications, often publish untidy data. This poses a significant problem that implies productivity losses: tidying data requires long hours of investment, and if a reproducible workflow is not used, data integrity can also be compromised: chances are that the process of tidying will overwrite, delete, or omit a data point or a label.
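The tidy structure described above can be illustrated with a short Python sketch that reshapes a wide table (one column per year) into one row per observation. The table, the column names geo, time and value, and the numbers are hypothetical; this is not the code we use in production.

```python
# Hypothetical untidy (wide) table: one row per country, one column per year.
wide = [
    {"geo": "AA", "2014": 10.0, "2015": None, "2016": 12.0},
    {"geo": "BB", "2014": 5.0, "2015": 5.5, "2016": 6.0},
]


def to_tidy(rows, id_column="geo"):
    """Reshape so that each row is one observation: (geo, time, value)."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy_rows.append({id_column: row[id_column], "time": int(column), "value": value})
    return tidy_rows


tidy = to_tidy(wide)
for observation in tidy:
    print(observation)
```

In the tidy form, the missing 2015 value of AA is an explicit row rather than an empty cell hidden in a wide table, which makes it easier to flag and to impute.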

While the original data source, the Eurostat data warehouse, is accessible, too, we added value by bringing the data into a tidy format. Tidy data can immediately be imported into a statistical application like SPSS or Stata, or into your own database. It is immediately available for plotting in Excel, OpenOffice or Numbers.

Interoperability

Our data can be easily imported together with, or joined with, data from other internal or external sources.

All our indicators come with standardized descriptive metadata, and statistical (processing) metadata. See our API

All our indicators come with standardized descriptive metadata, following two important standards, the Dublin Core and DataCite–implementing not only the mandatory, but the recommended descriptions, too. This will make it far easier to connect the data with other data sources, e.g. turnover with the number of radio broadcasting enterprises or radio stations within specific territories.
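Connecting turnover with the number of enterprises comes down to joining two tidy tables on shared keys. The following Python sketch is a hedged illustration: the key columns geo and time, the variable names and all numbers are assumptions made for the example.

```python
# Hypothetical tidy tables sharing the keys (geo, time); all numbers are invented.
turnover = [
    {"geo": "AA", "time": 2015, "turnover_meur": 120.0},
    {"geo": "BB", "time": 2015, "turnover_meur": 45.0},
]
enterprises = [
    {"geo": "AA", "time": 2015, "enterprises": 60},
    {"geo": "BB", "time": 2015, "enterprises": 30},
]


def join_on_keys(left, right, keys=("geo", "time")):
    """Inner join of two lists of records on shared key columns."""
    index = {tuple(r[k] for k in keys): r for r in right}
    joined_rows = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            joined_rows.append({**row, **match})
    return joined_rows


joined = join_on_keys(turnover, enterprises)
for row in joined:
    # Derived indicator: average turnover per enterprise, in million euros.
    row["turnover_per_enterprise"] = row["turnover_meur"] / row["enterprises"]
```

The join only works this smoothly when both tables use the same territorial and time coding, which is what standardized metadata is meant to guarantee.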

Our passion for documentation standards and best practices goes much further: our data uses Statistical Data and Metadata eXchange standardized codebooks, unit descriptions and other statistical and administrative metadata.

Reuse

All our datasets come with standardized information about reusability. We add citation, attribution data, and licensing terms. Most of our datasets can be used without commercial restriction after acknowledging the source, but we sometimes work with less permissive data licenses.

In the case presented here, we added further value to encourage re-use. In addition to tidying, we significantly increased the usability of public data by handling missing cases. This is the subject of our next blogpost.

Metadata

Wed, 07 Jul 2021 00:00:00 +0000

Adding metadata exponentially increases the value of data. Did your region add a new town to its boundaries? How do you adjust old data to conform to constantly changing geographic boundaries? What are some practical ways of combining satellite sensory data with my organization’s records? And do I have the right to do so? Metadata logs the history of data, providing instructions on how to reuse it, also setting the terms of use. We automate this labor-intensive process applying the FAIR data concept.

In our observatory we apply the concept of FAIR (findable, accessible, interoperable, and reusable digital assets) in our APIs and in our open-source statistical software packages.

The hidden cost item

Metadata gets less attention than data, because it is never acquired separately and it is not on the invoice. It therefore remains a hidden cost, even though it is more important from a budgeting and a usability point of view than the data itself. Metadata is responsible for non-billable hours in industry and uncredited working hours in academia. Poor data documentation, a lack of reproducible processing and testing logs, inconsistent use of currencies and keywords, and messy data storage make reusability, interoperability, and integration with other information impossible.

In FAIR Data and the Added Value of Rich Metadata, we introduce how we apply the concept of FAIR (findable, accessible, interoperable, and reusable digital assets) in our APIs.

Organizations pay many times for the same, repeated work, because these boring tasks, which often consist of tens of thousands of microtasks, are neglected. Our solution creates automatic documentation and metadata for your own historical internal data or for acquisitions from data vendors. We apply the more general Dublin Core and the more specific, mandatory and recommended values of DataCite for datasets; these are new requirements in EU-funded research from 2021. But they are just the minimal steps, and there is a lot more to do to create a diamond ring from an uncut gem.

Map your data: bibliographies, catalogues, codebooks, versioning

Updating descriptive metadata, such as bibliographic citation files or descriptions and sources of data files downloaded from the internet, and versioning spreadsheet documents and presentations is usually a hated and often neglected task within organizations, and rightly so: these boring and error-prone tasks are best left to computers.

Already adjusted spreadsheets are re-adjusted and re-checked. Hours are spent on looking for the right document with the right version. Duplicates multiply. Already downloaded data is downloaded again, and miscategorized, again. Finding the data without a map is a treasure hunt. Photo: © N.

The lack of time and resources spent on documentation reduces reusability over time and significantly increases data processing and supervision or auditing costs.

Our observatory metadata is compliant with the Dublin Core Cross-Domain Attribute Set metadata standard, but we use different formatting. We offer simple re-formatting from the richer DataCite to Dublin Core for interoperability with a wider set of data sources.
We use all mandatory DataCite metadata fields, and all the recommended and optional ones.
It complies with the tidy data principles.

In other words: our data is very easy to import into your databases or to join with other databases, and the information is easy to find. Corrections and updates can be managed automatically.
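The re-formatting from the richer DataCite description to Dublin Core mentioned above can be sketched as a simple crosswalk. This Python example is a simplified assumption, not our actual implementation: it covers only the six mandatory DataCite properties, the JSON-style field names are illustrative, and the record values other than the DOI are placeholders.

```python
# Simplified, assumed crosswalk from DataCite mandatory properties to Dublin Core elements.
DATACITE_TO_DC = {
    "identifier": "identifier",
    "creators": "creator",
    "titles": "title",
    "publisher": "publisher",
    "publicationYear": "date",
    "resourceType": "type",
}


def to_dublin_core(datacite_record):
    """Keep only the fields covered by the crosswalk and rename them."""
    return {
        dc_name: datacite_record[datacite_name]
        for datacite_name, dc_name in DATACITE_TO_DC.items()
        if datacite_name in datacite_record
    }


# Illustrative record: every value except the DOI is a placeholder.
record = {
    "identifier": "10.5281/zenodo.5652118",
    "creators": ["Example Creator"],
    "titles": ["Example indicator title"],
    "publisher": "Example Publisher",
    "publicationYear": 2021,
    "resourceType": "Dataset",
    "relatedIdentifiers": ["placeholder"],  # richer field, not covered by this small crosswalk
}

dublin_core = to_dublin_core(record)
```

Going from the richer standard to the simpler one loses detail, which is why the sketch drops any field that the crosswalk does not cover instead of guessing a target element.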

What happened with the data before?

We are creating codebooks that follow the SDMX statistical metadata codelists, and resemble the SDMX concepts used by international statistical agencies. (See more technical information here.)

Small organizations often cannot afford to have data engineers and data scientists on staff, and they employ analysts who work with Excel, OpenOffice, PowerBI, SPSS or Stata. The problem with these applications is that they often require the user to manually adjust the data, with keyboard entries or mouse clicks. Furthermore, they do not provide precise logging of the data processing and manipulation history. Manual data processing and manipulation is very error-prone and makes complex and high-value resources, such as harmonized surveys or symmetric input-output tables, to name two important sources we deal with, impossible to use. The use of these high-value data sources often requires tens of thousands of data processing steps: no human can do it faultlessly.

What is even more problematic is that simple applications for analysis do not provide a log of these manipulation steps: pulling over a column with the mouse, renaming a row, adding a zero to an empty cell. This makes senior supervisory oversight and external audit very costly.

Our data comes with full history: all changes are visible, and we even open the code or algorithm that processed the raw data. Your analysts can still use their favourite spreadsheet or statistical software application, but they can start from a clean, tidy dataset, with all data wrangling, currency and unit conversion, imputation and other low-priority but important tasks done and logged.
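The idea of a logged processing history can be sketched in Python as follows. The helper apply_logged, the step names and the numbers are hypothetical; the sketch only shows the principle that every transformation is named and leaves a trace.

```python
def apply_logged(data, steps):
    """Apply named transformation steps in order and record what each one did."""
    log = []
    for name, func in steps:
        before = list(data)
        data = func(data)
        changed = sum(1 for old, new in zip(before, data) if old != new)
        log.append({"step": name, "rows": len(data), "values_changed": changed})
    return data, log


# Hypothetical steps: a unit conversion followed by rounding.
steps = [
    ("thousand EUR to million EUR", lambda xs: [x / 1000 for x in xs]),
    ("round to 2 decimals", lambda xs: [round(x, 2) for x in xs]),
]
values, log = apply_logged([1234.0, 2500.0, 999.0], steps)
for entry in log:
    print(entry)
```

A supervisor or an auditor can read such a log instead of re-doing the work, which is not possible after manual edits in a spreadsheet.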

Survey Harmonization

Mon, 05 Jul 2021 08:00:00 +0000

We provide retrospective (ex post) and ex ante survey harmonization to our partners.

The aim of retrospective survey harmonization is to pool data from pre-existing surveys made with a similar methodology at different points in time and in different countries or territories. Ex post survey harmonization is in a way a passive form of pooling research funding, because you can utilize information from surveys that were carried out at somebody else’s expense.

The Arab Barometer surveys do not have a consolidated codebook, but our retroharmonize software created one, and put together data from three years, collected in many countries, about various public policy issues.

The aim of ex ante survey harmonization is to maximize the value from future retrospective harmonization; in a way, it is an active form of pooling research funding, because you benefit from money spent on related open governmental and open science survey programs.

In this example we designed a survey that is representative of music professionals so that it can be compared with large-sample, national surveys on living conditions and attitudes, and with occupational groups. Nationally representative surveys do not question enough musicians to allow such specific use; musician-only surveys do not allow comparison.

retroharmonize is peer-reviewed, scientific statistical software that allows the programmatic retrospective harmonization of surveys, such as the last 35 years of all Eurobarometer microdata, or all Afrobarometer microdata. Eurobarometer grew out of certain CEE member states’ need for comparable data about their music and audiovisual sectors. We commissioned surveys following ESSNet-Culture guidelines and combined our survey data with open access European microdata-level surveys.

regions solves the problems caused by Europe’s shifting regional boundaries, which have undergone changes in several thousand places over the last twenty years, meaning member states’ and Eurostat’s regional statistics are not comparable over more than two to three years. This software validates and, where possible, changes the regional coding from NUTS1999 until the not yet used NUTS2021, opening up vast, valuable, untapped data sources that can be used for longitudinal analysis or for panel analysis far more precise than what national data alone would allow. It was originally designed in a research project at IVIR in the University of Amsterdam to understand the geographical dynamics of book piracy. Because of the needs this software fills, it had 700 users in the first month after publication. It is particularly useful to re-code old surveys, as regional boundaries are changing in each decade several hundred times in Europe.
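The validate-and-recode logic described above can be sketched in Python. This is not the interface of the regions package: the function recode and the tiny correspondence table are hypothetical, and the two entries are simplified examples of a NUTS2013 to NUTS2016 change that should be checked against the official correspondence tables.

```python
# Illustrative correspondence entries between NUTS2013 and NUTS2016 codes.
# Assumption: simplified examples; the official correspondence tables are the authority.
NUTS2013_TO_2016 = {
    "FR24": ["FRB0"],          # relabelled region: a one-to-one recode is possible
    "HU10": ["HU11", "HU12"],  # region split in two: no one-to-one recode
}


def recode(code, correspondence):
    """Return (new_code, status); new_code is None when recoding is not possible."""
    targets = correspondence.get(code)
    if targets is None:
        return code, "not in table, assumed unchanged"
    if len(targets) == 1:
        return targets[0], "recoded"
    return None, "split, needs manual or statistical treatment"


for old_code in ["FR24", "HU10", "AT13"]:
    print(old_code, recode(old_code, NUTS2013_TO_2016))
```

The split case shows why recoding is only done where possible: when one region becomes two, the old value cannot simply be relabelled.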

Including Indicators from Arab Barometer in Our Observatory

Mon, 28 Jun 2021 09:00:00 +0200

A new version of the retroharmonize R package, which works with retrospective, ex post harmonization of survey data, was released yesterday on CRAN after peer review. It allows us to compare opinion polling data from the Arab Barometer with the Eurobarometer and Afrobarometer. This is the first version that is released in the rOpenGov community, a community of R package developers on open government data analytics and related topics.

Surveys are the most important data sources in social and economic statistics – they ask people about their lives, their attitudes and self-reported actions, or record data from companies and NGOs. Survey harmonization makes survey data comparable across time and countries. It is very important, because often we do not know without comparison if an indicator value is low or high. If 40% of the people think that climate change is a very serious problem, it does not really tell us much without knowing what percentage of the people answered this question similarly a year ago, or in other parts of the world.

With the help of Ahmed Shaibani and Yousef Ibrahim, we created a third case study, after the Eurobarometer and the Afrobarometer, about working with the Arab Barometer harmonized survey data files.

Ex ante survey harmonization means that researchers design questionnaires that ask the same questions with the same survey methodology at repeated, distinct times (waves), or across different countries with carefully harmonized question translations. Ex post harmonization means that the resulting data has the same variable names, the same variable coding, and can be joined into a tidy data frame for joint statistical analysis. While seemingly a simple task, it involves plenty of metadata adjustments, because established survey programs like Eurobarometer, Afrobarometer or Arab Barometer have several decades of history, and several decades of coding practices and file formatting legacy.

Variable harmonization means that if the same question is called Q108 in one microdata source and eval-parl-elections in the other, then we make sure that they get a harmonized, machine-readable name without spaces and special characters.
Variable label harmonization means that the same questionnaire items get the same numeric coding and same categorical labels.
Missing case harmonization means that various forms of missingness are treated the same way.
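The first of these steps, variable name harmonization, can be sketched in Python. This is an illustration of the principle and not the retroharmonize interface: the function names and the small crosswalk are hypothetical, and only the two source names Q108 and eval-parl-elections come from the example above.

```python
import re


def harmonize_name(raw):
    """Lower-case, replace runs of non-alphanumeric characters with '_', trim underscores."""
    name = re.sub(r"[^0-9a-z]+", "_", raw.lower())
    return name.strip("_")


# Hypothetical crosswalk: source-specific names mapped to one shared concept name.
CROSSWALK = {"Q108": "eval-parl-elections", "eval-parl-elections": "eval-parl-elections"}


def harmonized_variable(source_name):
    """Look up the shared concept name, then make it machine readable."""
    return harmonize_name(CROSSWALK.get(source_name, source_name))


print(harmonized_variable("Q108"))
print(harmonized_variable("eval-parl-elections"))
```

Both source names end up under the same machine-readable name, so the two microdata files can be stacked into one table.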

For the evaluation of the economic situation dataset, get the country averages and aggregates from Zenodo, and the plot in jpg or png from figshare.

In our new Arab Barometer case study, the evaluation of parliamentary elections has the following labels. We code them consistently 1 = free_and_fair, 2 = some_minor_problems, 3 = some_major_problems and 4 = not_free.

“0. missing”	“1. they were completely free and fair”
“2. they were free and fair, with some minor problems”	“3. they were free and fair, with some major problems”
“4. they were not free and fair”	“8. i don’t know”
“9. declined to answer”	“Missing”
“They were completely free and fair”	“They were free and fair, with some minor breaches”
“They were free and fair, with some major breaches”	“They were not free and fair”
“Don’t know”	“Refuse”
“Completely free and fair”	“Free and fair, but with minor problems”
“Free and fair, with major problems”	“Not free or fair”
“Don’t know (Do not read)”	“Decline to answer (Do not read)”
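A rule-based mapping of the label variants listed above can be sketched in Python. This is a simplified illustration and not the retroharmonize implementation: the function harmonize_label and the names used for the missing categories (missing, do_not_know, declined) are assumptions, while the codes 1 to 4 follow the coding described above.

```python
import re


def harmonize_label(label):
    """Map a raw answer label to (numeric code, harmonized label)."""
    text = re.sub(r"^\d+\.\s*", "", label.strip().lower())  # drop prefixes like "1. "
    if text == "missing":
        return None, "missing"
    if "know" in text:
        return None, "do_not_know"
    if "decline" in text or "refuse" in text:
        return None, "declined"
    if "not free" in text:
        return 4, "not_free"
    if "major" in text:
        return 3, "some_major_problems"
    if "minor" in text:
        return 2, "some_minor_problems"
    if "free and fair" in text:
        return 1, "free_and_fair"
    return None, "unmatched"


raw_labels = [
    "1. they were completely free and fair",
    "They were free and fair, with some minor breaches",
    "Free and fair, with major problems",
    "Not free or fair",
    "Decline to answer (Do not read)",
]
for raw in raw_labels:
    print(raw, "->", harmonize_label(raw))
```

The order of the rules matters: the missing categories and the "not free" case are checked before the generic "free and fair" match. Any label that falls through is returned as unmatched, so that it is reviewed rather than silently coded.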

Of course, this harmonization is essential to get clean results like this:

For evaluation or reuse of parliamentary elections dataset get the replication data and the code from the Zenodo open repository.

In our case study, we had three forms of missingness: the respondent did not know the answer, the respondent did not want to answer, and lastly, in some cases the respondent was not asked, because the country held no parliamentary elections. While in numerical processing all these answers must be left out when calculating averages, for example, in a more detailed, categorical analysis they represent very different cases. A high level of refusal to answer may in itself be an indicator of the suppression of democratic opinion forming.
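The two uses of missing answers described above can be sketched in Python. The answer list and the category names do_not_know, declined and not_asked are hypothetical; the sketch only shows that the same harmonized data supports both a numeric and a categorical reading.

```python
# Harmonized answers: codes 1-4 are substantive, strings mark the type of missingness.
answers = [1, 2, 2, 4, "do_not_know", "declined", "declined", "not_asked"]


def mean_of_valid(values):
    """Average the numeric codes only; every form of missingness is left out."""
    valid = [v for v in values if isinstance(v, int)]
    return sum(valid) / len(valid) if valid else None


def share(values, category):
    """Share of a category among respondents who were actually asked."""
    asked = [v for v in values if v != "not_asked"]
    return sum(v == category for v in asked) / len(asked) if asked else None


print("mean of valid answers:", mean_of_valid(answers))
print("share declining to answer:", share(answers, "declined"))
```

Keeping the type of missingness as a label, instead of collapsing everything into one empty value, is what makes the refusal rate measurable at all.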

Survey harmonization with many countries entails tens of thousands of small data management tasks, which, unless automatically documented, logged, and created with reproducible code, make for a hopelessly error-prone process. We believe that our open-source software will bring to light much new statistical information, which, while legally open, was never processed due to the large investment needed.

We also started building experimental APIs whose data is produced by running retroharmonize regularly. We will place cultural access and participation data in the Digital Music Observatory, climate awareness, policy support and self-reported mitigation strategies into the Green Deal Data Observatory, and economy and well-being data into our Economy Data Observatory.

Further plans

Retrospective survey harmonization is a far more complex task than this blogpost suggests, because established survey programs have gathered decades of legacy data in legacy coding schemes and legacy file formats. Putting the data right, and especially putting the invaluable descriptive and administrative (processing) metadata right, is a huge undertaking. We are releasing example codes, datasets and charts for researchers to compare our harmonized results with theirs, and improve our software.

Use our software

The retroharmonize R package can be freely used, modified and distributed under the GPL-3 license. For the main developer and contributors, see the package homepage. If you use it for your work, please kindly cite it as:

Daniel Antal (2021). retroharmonize: Ex Post Survey Data Harmonization. R package version 0.1.17. https://doi.org/10.5281/zenodo.5034752

Download the BibLaTeX entry.

Tutorial to work with the Arab Barometer survey data

Daniel Antal, & Ahmed Shaibani. (2021, June 26). Case Study: Working With Arab Barometer Surveys for the retroharmonize R package (Version 0.1.6). Zenodo. https://doi.org/10.5281/zenodo.5034759

For the replication data to report potential issues and improvement suggestions with the code:

Daniel Antal, & Ahmed Shaibani. (2021). Replication Data for the retroharmonize R Package Case Study: Working With Arab Barometer Surveys (Version 0.1.6) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.5034741

Experimental API

We are also experimenting with the automated placement of authoritative and citeable figures and datasets in open repositories. For the climate awareness dataset get the country averages and aggregates from Zenodo, and the plot in jpg or png from figshare. Our plan is to release open data in a modern API with rich descriptive metadata meeting the Dublin Core and DataCite standards, and further administrative metadata for correct coding, joining and further manipulation of the data, or for easy import into your database.

Join our open source effort

Want to help us improve our open data service? Include Latinobarómetro and the Caucasus Barometer in our offering? Join the rOpenGov community of R package developers, and our open collaboration to create the automated data observatories. We are not only looking for developers, but data curators and service design associates, too.

Open Data - The New Gold Without the Rush

Fri, 18 Jun 2021 17:00:00 +0200

If open data is the new gold, why do even those who release it fail to reuse it? We created an open collaboration of data curators and open-source developers to dig into novel open data sources and/or increase the usability of existing ones. We transform reproducible research software into research-as-service.

Every year, the EU announces that billions and billions of data are now “open” again, but this is not gold. At least not in the form of nicely minted gold coins, but in gold dust and nuggets found in the muddy banks of chilly rivers. There is no rush for it, because panning out its value requires a lot of hours of hard work. Our goal is to automate this work to make open data usable at scale, even in trustworthy AI solutions.

Most open data is not public, and it is not downloadable from the Internet: in the EU parlance, “open” only means a legal entitlement to get access to it. And even in the rare cases when data is open and public, it is often marred by data quality issues. We are working on the prototypes of a data-as-service and research-as-service built with open-source statistical software that taps into various and often neglected open data sources.

We are in the prototype phase in June and our intention is to have a well-functioning service by the time of the conference. Because we are working only with open-source software elements, our technological readiness level is already very high. The novelty of our process is that we are trying to further develop and integrate a few open-source technology items into technologically and financially sustainable data-as-service and even research-as-service solutions.

Our review of about 80 EU, UN and OECD data observatories reveals that most of them do not use these organizations’ open data; instead they use various, and often not well processed, proprietary sources.

We are taking a new and modern approach to the data observatory concept, and modernizing it with the application of 21st century data and metadata standards, the new results of reproducible research and data science. Various UN and OECD bodies, and particularly the European Union, support or maintain more than 60 data observatories, or permanent data collection and dissemination points, but even these do not use the open data of these organizations and their members. We are building open-source data observatories, which run open-source statistical software that automatically processes and documents reusable public sector data (from public transport, meteorology, tax offices, taxpayer funded satellite systems, etc.) and reusable scientific data (from EU taxpayer funded research) into new, high quality statistical indicators.

We are building various open-source data collection tools in R and Python to bring up data from big data APIs and legally open, but not public, and not well served data sources. For example, we are working on capturing representative data from the Spotify API or creating harmonized datasets from the Eurobarometer and Afrobarometer survey programs.
Open data is usually not public; whatever is legally accessible is usually not ready to use for commercial or scientific purposes. In Europe, almost all taxpayer funded data is legally open for reuse, but it is usually stored in heterogeneous formats, processed for an original governmental or scientific need, and documented to varying and often low standards. Our expert data curators are looking for new data sources that should be (re-)processed and re-documented to be usable for a wider community. We would like to introduce our service flow, which touches upon many important aspects of data scientist, data engineer and data curatorial work.
We believe that even such generally trusted data sources as Eurostat often need to be reprocessed, because various legal and political constraints do not allow the common European statistical services to provide optimal quality data – for example, on the regional and city levels.
With rOpenGov and other partners, we are creating open-source statistical software in R to re-process these heterogeneous and low-quality data into tidy statistical indicators, and to automatically validate and document them.
We are carefully documenting and releasing administrative, processing, and descriptive metadata, following international metadata standards, to make our data easy to find and easy to use for data analysts.
We are automatically creating depositions and authoritative copies marked with an individual digital object identifier (DOI) to maintain data integrity.
We are building simple databases and supporting APIs that release the data without restrictions, in a tidy format that is easy to join with other data, or easy to join into databases, together with standardized metadata.
We maintain observatory websites (see: Digital Music Observatory, Green Deal Data Observatory, Economy Data Observatory) where not only the data is available, but we provide tutorials and use cases to make it easier to use them. Our mission is to show a modern, 21st century reimagination of the data observatory concept developed and supported by the UN, EU and OECD, and we want to show that modern reproducible research and open data could make the existing 60 data observatories and the planned new ones grow faster into data ecosystems.

We are working around the open collaboration concept, which is well-known in open source software development and reproducible science, but we try to make this agile project management methodology more inclusive, and include data curators and various institutional partners in this approach. Based around our early-stage startup, Reprex, and the open-source developer community rOpenGov, we are working together with other developers, data scientists, and domain specific data experts in climate change and mitigation, antitrust and innovation policies, and various aspects of the music and film industry.

Our open collaboration is truly open: new data curators, developers and service designers, even volunteers and citizen scientists are welcome to join.

Our open collaboration is truly open: new data curators, data scientists and data engineers are welcome to join. We develop open-source software in an agile way, so you can join in with an intermediate programming skill to build unit tests or add new functionality, and if you are a beginner, you can start with documentation and testing our tutorials. For business, policy, and scientific data analysts, we provide unexploited, exciting new datasets. Advanced developers can join our development team: the statistical data creation is mainly made in the R language, and the service infrastructure in Python and Go components.

There are Numerous Advantages of Switching from a National Level of the Analysis to a Sub National Level

Wed, 16 Jun 2021 12:00:00 +0200

The new version of our rOpenGov R package regions was released today on CRAN. This package is one of the engines of our experimental open data-as-service Green Deal Data Observatory, Economy Data Observatory, and Digital Music Observatory prototypes, which aim to place open data packages into open-source applications.

In international comparison, the use of nationally aggregated indicators often has many disadvantages: they exhibit very different levels of homogeneity, and the data is often very limited in the number of observations available for a cross-sectional analysis. When comparing European countries, a few missing cases can limit the cross-section of countries to around 20 cases, which rules out many analytical methods. Working with sub-national statistics has many advantages: the similarity of the aggregation level and the high number of observations allow more precise control of model parameters and errors, and the number of observations grows from 20 to 200-300.

The change from national to sub-national level comes with a huge data processing price: internal administrative boundaries, their names, and their codes change very frequently.

Yet the change from national to sub-national level comes with a huge data processing price. National boundaries are relatively stable, with only a handful of changes in each recent decade; changing them requires a more-or-less global consensus. But states are free to change their internal administrative boundaries, and they do so with great frequency. This means that the names, identification codes and boundary definitions of sub-national regions change very frequently. Joining data from different sources and different years can be very difficult.
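To illustrate why changing codes make joins difficult, here is a minimal sketch in plain Python. It is not the regions package or its API: the region codes, the correspondence table, and the helper names (RECODE, harmonize, join_on_code) are all hypothetical and only show the general idea of translating old codes to current ones before joining.

```python
# Hypothetical correspondence table: old regional code -> current code.
# Real tables are published by statistical agencies; these codes are made up.
RECODE = {"XX11": "XY11", "XX12": "XY12"}


def harmonize(rows, recode):
    """Translate (code, value) pairs to the current coding and flag changes."""
    out = []
    for code, value in rows:
        new_code = recode.get(code, code)  # unchanged codes pass through
        out.append({"code": new_code, "value": value, "recoded": new_code != code})
    return out


def join_on_code(left, right):
    """Left join on the region code; unmatched rows get None."""
    right_by_code = {row["code"]: row["value"] for row in right}
    return [(row["code"], row["value"], right_by_code.get(row["code"])) for row in left]


# Two sources describing the same three regions, coded in different years.
old_source = [("XX11", 10.0), ("XX12", 12.5), ("ZZ20", 7.0)]
new_source = [("XY11", 1.1), ("XY12", 1.3), ("ZZ20", 0.9)]

# Joining without recoding silently loses two of the three regions.
naive = join_on_code(harmonize(old_source, {}), harmonize(new_source, {}))

# Recoding first restores the full match.
fixed = join_on_code(harmonize(old_source, RECODE), harmonize(new_source, RECODE))
```

In this toy case the naive join matches only the region whose code never changed, while the harmonized join matches all three. Real boundary changes are harder, because regions can also be split or merged, which a one-to-one table like this cannot express.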

Our regions R package helps the data processing, validation and imputation of sub-national, regional datasets and their coding.

Switching from a national level of analysis to a sub-national level has numerous advantages, but it comes with a huge price in data processing, validation and imputation, and the regions package aims to help with this process.

You can review the problem, and the code that created the two map comparisons, in the Maping Regional Data, Maping Metadata Problems vignette article of the package. A more detailed problem description can be found in Working With Regional, Sub-National Statistical Products.

This package is an offspring of the eurostat package on rOpenGov. It started as a tool to validate and re-code regional Eurostat statistics, but it aims to be a general solution for all sub-national statistics. It will be developed parallel with other rOpenGov packages.

Get the Package

You can install the development version from GitHub with:

devtools::install_github("rOpenGov/regions")

or the released version from CRAN:

install.packages("regions")

You can review the complete package documentation on regions.dataobservatory.eu. If you find any problems with the code, please raise an issue on GitHub. Pull requests are welcome if you agree with the Contributor Code of Conduct.

If you use regions in your work, please cite the package as: Daniel Antal, Kasia Kulma, Istvan Zsoldos, & Leo Lahti. (2021, June 16). regions (Version 0.1.7). CRAN. http://doi.org/10.5281/zenodo.4965909

Join us

Join our open collaboration Economy Data Observatory team as a data curator, developer or business developer. More interested in environmental impact analysis? Try our Green Deal Data Observatory team! Or does your interest lie more in data governance, trustworthy AI and other digital market problems? Check out our Digital Music Observatory team!

Open Data is Like Gold in the Mud Below the Chilly Waves of Mountain Rivers

Thu, 10 Jun 2021 07:00:00 +0200

Open data is like gold in the mud below the chilly waves of mountain rivers. Panning it out requires a lot of patience, or a good machine.

As the founder of the automated data observatories that are part of Reprex’s core activities, what type of data do you usually use in your day-to-day work?

The automated data observatories are results of syndicated research, data pooling, and other creative solutions to the problem of missing or hard-to-find data. The music industry is a very fragmented industry, where market research budgets and data are scattered across tens of thousands of small organizations in Europe. Working for the music and film industry as a data analyst and economist was always a pain because most of the effort went into trying to find any data that could be analyzed. I spent most of the last 7-8 years trying to find any sort of information (from satellites to government archives) that could be formed into actionable data. I see three big sources of information: textual, numeric, and continuous recordings from on-site, offsite, and satellite sensors. I am much better with numbers than with natural language processing, and I am improving with sensory sources. But technically, I can mint any systematic information (the text of an old book, a satellite image, or an opinion poll) into datasets.

For you, what would be the ultimate dataset, or datasets that you would like to see in the Economy Data Observatory?

I am a data scientist now, but I used to be a regulatory economist, and I have worked a lot with competition policy and monopoly regulation issues. Our observatories can automatically monitor market and environmental processes, which would allow us to get into computational antitrust. Peter Ormosi, our competition curator, is particularly interested in killer acquisitions: approved mergers of big companies that end up piling up patents that are not used. I am more interested in describing systematically which markets are getting more concentrated and more competitive, in real time. Does data concentration coincide with market concentration?

To bring an example from the realm of our Digital Music Observatory, which was a prototype for this one, I have been working for some time on creating streaming volume and price indexes, like the Dow Jones Industrial Average or the various bond market indexes, that say more about price, demand, and potential revenue in music streaming markets all over the world. We did a first take on this in the Central European Music Industry Report, and recently we iterated on the model for the UK Intellectual Property Office and the UK Music Creators’ Earnings project. We want to take this further to create a pan-European streaming market index, and we will probably be the first to actually be able to report on music market concentrations, more or less in real time.

We would like to further develop our 20-country streaming indexes into a global music market index.

Is there a number or piece of information that recently surprised you? If so, what was it?

There were a few numbers that surprised me, and some of them were brought up by our observatory teams. Karel is talking about the fact that not all green energy is green at all: many hydropower stations contribute to the greenhouse effect rather than reducing it. Annette brought up the growing interest in the Dalmatian breed after Disney’s 101 Dalmatians movies, and it reminded me of the astonishing growth in interest in chess sets, chess tutorials, and platform subscriptions after the success of Netflix’s The Queen’s Gambit.

‘The Queen’s Gambit’ Chess Boom Moves Online, by Rachael Dottle on bloomberg.com

Annette is talking about the importance of cultural influencers, and on that theme, what could be more exciting than the fact that Netflix’s biggest success so far is not a detective series or a soap opera, but a coming-of-age story of a female chess prodigy? Intelligence is sexy, and we are in the intelligence business.

But to tell a more serious and more sobering number, I recently read with surprise that there are more people smoking cigarettes on Earth in 2021 than in 1990. Population growth in developing countries replaced the shrinking number of developed country smokers. While I live in Europe, where smoking is strongly declining, it reminds me that Europe’s population is a small part of the world. We cannot take for granted that our home-grown experiences about the world are globally valid.

Do you have a good example of really good, or really bad use of data?

FiveThirtyEight.com had a wonderful podcast series, produced by Jody Avirgan, called What’s the Point. It is exactly about good and bad uses of data, and each episode is super interesting. Maybe the most memorable is Why the Bronx Really Burned. New York City tried to measure fire response times, identify redundancies in service, and close or re-allocate fire stations accordingly. What resulted, though, was a perfect storm of bad data: The methodology was flawed, the analysis was rife with biases, and the results were interpreted in a way that stacked the deck against poorer neighborhoods. It is similar to many stories told in a very compelling argument by Catherine D’Ignazio and Lauren F. Klein in their much celebrated book, Data Feminism. Usually, the bad use of data starts with a bad data collection practice. Data analysts in corporations, NGOs, public policy organizations and even in science usually analyze the data that is available.

You can find these examples, together with many more that our contributors recommend, in the motivating examples of the Create New Datasets and the Remain Critical parts of our onboarding material. We hope that more and more professionals and citizen scientists will help us to create high-quality and open data.

The real power lies in designing a data collection program. A consistent data collection program usually requires an investment that only powerful organizations, such as government agencies, very large corporations, or the richest universities can afford. You cannot really analyze the data that is not collected and recorded; and usually what is not recorded is more interesting than what is. Our observatories want to democratize the data collection process and make it more available, more shared with research automation and pooling.

From your perspective, what do you see being the greatest problem with open data in 2021?

I have been involved with open data policies since 2004. The problem has not changed much: more and more data are available from governmental and scientific sources, but in a form that makes them useless. Data without clear description and clear processing information is useless for analytical purposes: it cannot be integrated with other data, and it cannot be trusted and verified. If researchers or government entities that fall under the Open Data Directive release data for reuse without descriptive or processing metadata, it is almost as if they did not release anything. You need this additional information to make valid analyses of the data, and reverse-engineering it may cost more than recollecting the data in a properly documented process. Our developers, particularly Leo and Pyry, are talking eloquently about why you have to be careful even with governmental statistical products, and constantly be on the lookout for data quality problems.

Our API not only publishes descriptive and processing metadata alongside our data, but we also make all critical elements of our processing code available for peer-review on rOpenGov.

What do you think the Economy Data Observatory, and our other automated observatories do, to make open data more credible in the European economic policy community and be accepted as verified information?

Most of our work is in research automation, and a very large part of our efforts aims to reverse engineer missing descriptive and processing metadata. In a way, I like to compare ourselves to the working method of the open-source intelligence platform Bellingcat. They were able to use publicly available, scattered information from satellites and social media to identify each member of the Russian military company that illegally entered the territory of Ukraine and shot down Malaysia Airlines flight MH17 with 298 people, mainly Dutch civilians, on board.

How we create value for research-oriented consultancies, public policy institutes, university research teams, journalists or NGOs.

We do not do such investigations, but we work very similarly in how we filter through many data sources and attempt to verify them when their descriptions and processing history are unknown. In the last years, we were able to restore the metadata of many European and African open data surveys, economic impact and environmental impact datasets, and much other open data that had been lying around for many years without users.

Open data is like gold in the mud below the chilly waves of mountain rivers. Panning it out requires a lot of patience, or a good machine. I think we will come to as surprising and strong findings as Bellingcat, but we are not focusing on individual events and stories, but on social and environmental processes and changes.

Educate and Train Data Admirers that Data is not Scary

Wed, 09 Jun 2021 12:00:00 +0200

Annette Wong is helping our service development from a digital strategy and marketing point of view.

Why is data important to the work that you do as a digital strategist at an agency?

As a marketing and digital agency, we work with clients to produce and develop marketing campaigns that impact the bottom line. One of the ways to determine the Return-On-Investment (ROI) is through data. By analyzing the data, our team is able to help our clients predict audience behavior and ideally convert them into taking action ($$$).

Currently, I’m working on a music livestreaming platform, and every day we’re looking at how our campaigns are performing (and measuring their effectiveness). For example, if we’re running a paid campaign through Facebook and it’s not converting at the % that we expect, it indicates to us that we need to change our approach. Data gives us the power and freedom to experiment (with minimal risk) and empowers us to make informed decisions quickly.

Why are you excited about the Digital Music Observatory and is there a reason you decided to participate in this initiative?

Seeing how the pandemic decimated the music industry, specifically in-person events, made me feel a lot of empathy for musicians and the economics of their situation, especially with how musicians generate a living income through their music. Open access to data promotes transparency and fairer wages (ideally), and levels the playing field for musicians of all sizes and popularity.

Our retroharmonization software helps the creation of objective and comparable indicators about how musicians make a living, or how people think about climate challenges.

I decided to participate in this challenge because I love how data is a secret weapon that anyone can use to re-balance the interests of creators, distributors, and consumers.

Is there a number that recently surprised you? What was it?

This is a little silly, but very recently I watched the 101 Dalmatians movie. After watching the movie, I was curious to see if there was a correlation between the release of the movie and the number of Dalmatians adopted afterwards. 101 Dalmatians was re-released in 1985 and 1991, which made thousands of families (in the U.S.) want to adopt one. The American Kennel Club reported that the annual number of Dalmatian puppies registered skyrocketed from 8,170 animals to 42,816.

Photo: John O' Groats, Unsplash license.

This information is interesting because it validates the idea that culture influences consumer behavior. I think it’s really cool that we can measure cultural collisions and how they impact the way we act, think, and respond.

What can our automated data observatories do to make open data more credible in the European economic policy community, or more accepted in the music business community?

I believe that people, in general, appreciate and understand the importance of data. But, it can be overwhelming, sometimes scary, and intimidating to deal with (esp. in large quantities).

However, I feel more people are open to the idea of using data and understand the value of leveraging data to share objective truths. Something that our automated data observatories can do is to provide more opportunities to educate and train data admirers that data is not scary, that it is accessible, and it is here to help uncover insights that can’t be immediately seen.

Credibility is Enhanced Through Cross Links Between Different Data from Different Domains

Tue, 08 Jun 2021 18:50:00 +0200

As a consultant, what type of data do you usually work with?

I work at the intersection between strategy, finance and organisation. My usual dataset is quite broad - and sometimes unstructured. Oftentimes, the most decisive data are ones that cross domains: economic data coupled with environmental measurements, sociodemographic characteristics linked with online analytics.

If you were able to pick, what would be the ultimate dataset, or datasets that you would like to see in the Green Deal Data Observatory? And the Economy Data Observatory?

If I may venture that far, the interesting point is where these two data observatories meet. But high on my wishlist would be anything related to geospatial dispersion of environmental and climate data: land erosion, aerosols, solar incidence. From an economic perspective, my interest would go especially to - again - dispersion across regions or other geographical domains of, say, number of new enterprises, disposable income, tax incidence…

See our case study on connecting local tax revenues, climate awareness poll data and drought data in Belgium.

Why did you decide to join the challenge and why do you think that this would be a game changer for policymakers and for business leaders?

There is, both from an ecological and a societal point of view, an urgent need for open-access, real-time, trustworthy data to base decisions on. Ever since Kydland & Prescott’s analyses of “rules rather than discretion” and even earlier analyses of investment under uncertainty, the dynamic rules for optimal decision-making (including investment) require fast-response reliable data.

Do you have a favorite, or most used open governmental or open science data source? What do you think about it? Could it be improved?

Let me give one example: the AMECO annual macro-economic database is great for long-term historical analyses but its components ought to be real-time available. As an anecdote, as a fund manager in emerging markets we needed to anticipate macro-economic evolutions and in particular the manner in which capital markets anticipate these evolutions by adjusting foreign exchange rates or positioning themselves along yield curves. To some extent, we needed to predict what AMECO would tell us one year later by means of any real-time trustworthy assessments of the financial or economic situation. The latter data is what we would ideally have in an observatory.

Is there a piece of information that recently surprised you? What was it?

I am currently working on water-related issues and came across a result reported in Nature Energy earlier this year that in more than one in ten hydropower stations, the extra warming from the dark surface of the water reservoir was enough to outbalance its “green” electricity generation potential, leading to no net climate benefits.

The researchers found that almost half of the reservoirs they surveyed took just four years to reach a net climate benefit. Unfortunately, they also found that 19% of those surveyed took more than 40 years to do so, and approximately 12% of them took 80 years, the average lifetime of a hydroelectric plant. Source: Calculating the albedo-climate penalty of hydropower dammed reservoirs.

Again: spatial distribution matters…

Photo: Kees Streefkerk, Unsplash License

From your experience, what do you think the greatest problem with open data in 2021 will be?

Trust. In a society where “value” and even “truth” is determined more by the amount of (web) links to a particular “fact” than by its intrinsic characteristics, we need to be able to trust data — open data because it’s open and “closed” data because it’s closed.

What can our automated data observatories do to make open data more credible in the European economic policy and climate change or mitigation community and be more accepted as verified information?

If I may refer to the previous answer: credibility is enhanced through cross-links between different data from different domains that “do not disprove” one another or that are internally consistent. If, say, data on taxable income goes in one direction and taxes in another, it is the reasoned reconciliation of the (alleged or real) inconsistency that will validate the comprehensive data set. So I am a great believer in broad, real-time observatories where not only the data capture, but also the data reconciliation is automated, sometimes by means of a simple comparative statics analysis, in other cases maybe through quite elaborate artificial intelligence.
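The simplest form of such an automated cross-check could look like the sketch below. It is only an illustration in plain Python with made-up numbers: the function names (direction, flag_inconsistent) and the region labels are hypothetical, and the rule (compare the sign of the change in two related series) is a deliberately crude stand-in for a real comparative statics analysis.

```python
def direction(series):
    """Return +1, -1 or 0 for the sign of the change from first to last value."""
    change = series[-1] - series[0]
    return (change > 0) - (change < 0)


def flag_inconsistent(taxable_income, taxes):
    """List regions where taxable income and taxes move in opposite directions."""
    flagged = []
    for region in sorted(taxable_income.keys() & taxes.keys()):
        if direction(taxable_income[region]) * direction(taxes[region]) < 0:
            flagged.append(region)  # candidate for reasoned reconciliation
    return flagged


# Made-up yearly figures for two hypothetical regions.
income = {"A": [100.0, 104.0, 109.0], "B": [80.0, 82.0, 85.0]}
tax = {"A": [20.0, 21.0, 22.5], "B": [16.0, 15.0, 13.5]}

flagged = flag_inconsistent(income, tax)
```

Here region B is flagged because its taxable income rises while its tax revenue falls. A flag is not an error verdict: a tax cut could explain the divergence, and documenting that explanation is the reconciliation step described above.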

Our Datasets Should be Retrieved Cleaned and Assessed in Order to Deliver Efficient Relevant and Credible Information

Mon, 07 Jun 2021 20:00:00 +0200

As a consultant, what type of data do you usually use in your work at ECORYS?

We work with a great variety of data, both from qualitative and quantitative sources, that we retrieve from publicly available sources or get through our clients. Since we are a public policy consultancy, most of the datasets are related to government reports, policies, statistics or surveys that we analyse and assess within a specific timeframe. Oftentimes, we gather open data that is non-textual or numeric, such as maps and satellite images; so-called “raw data,” like weather, geospatial and environmental data; or data generated in research, such as genomes, medical data, and mathematical and scientific formulas.

If you were able to pick, what would be the ultimate dataset, or datasets that you would like to see in the Green Deal Data Observatory?

I would like to see more data on the consequences and impact of increasing drought and urban heat in our cities in the Green Deal Data Observatory. Because of the complexity of rapidly developing metropolitan regions and the uncertainty associated with climate change, we need to explore more climate change adaptation and mitigation activities, or disaster risk reduction, not only climate change itself.

See our drought case study on how we combine very different data in our observatory

We need more reliable datasets on the effect of global warming on urban resilience and more indicators to inform stakeholders on disaster risk reduction. The Green Deal Data Observatory could build indexes for public and private entities once we have all the relevant data at hand. With this project, we could explore many possibilities to actually utilise open data for a common and societal good, working towards a great social cause.

Why did you decide to join the challenge and why do you think that this would be a game changer for policymakers and for business?

As a consultant for many socially relevant projects, every day I see the importance of high quality and diverse datasets. I joined the challenge to contribute to significant causes enabled through the Green Deal Data Observatory and Economy Data Observatory. We can all benefit from the usage of open data, which is, in my opinion, a prerequisite for open government partnerships.

I believe that through our work and through open data collaborations, we show a good example for a cultural change in the relationship between citizens and the state, which can contribute to more transparency, more participation and more intensive cooperation.

Access to open data, and its analysis by the general public, would make political action more transparent and more comprehensible. This can lead to greater accountability and a sense of duty on the part of public officials to the general public, which in turn can lead to greater acceptance of government action and strengthen the public’s trust in their government and administration.

Is there a number that recently surprised you? What was it?

Climate change is increasing people’s exposure to heat. Extreme temperature events have been documented to be rising in frequency, duration, and magnitude all over the world. The number of persons exposed to heatwaves grew by roughly 125 million between 2000 and 2016.

Photo: Sydney by Marek Piwnicki, Unsplash License

From your experience, what do you think the greatest problem with open data in 2021 will be?

I see two great problems with the use of open data. The first one is the low level of exploitation. The other is the lack of transparency in data processing.

The use of open data should be transparent and meet high quality standards. If we want to enable communities to use it for solving local problems, we must do two things. First, data must be made easy to use (or actionable), and second, we have to increase public awareness and offer training for use. Furthermore, governments should release data in usable formats that follow open data guidelines. Currently, there is very little effort made at the community level to encourage the reuse of public data for the public good.

What can our automated data observatories do to make open data more credible in the European economic policy community and be more accepted as verified information?

Almost nothing is being done to help communities build the capability to analyze and implement open data without relying on technology.

Our API contains rich processing and descriptive metadata besides our high-quality indicators.

This is a critical task that our fledgling data observatories, the Digital Music Observatory, Green Deal Data Observatory and Economy Data Observatory, may be able to help with. Facilitating private-public partnerships is one step to encourage the data community to work with valuable open data. However, transparency and a high-level quality assurance step must be ensured. In a joint collaboration with data curators, developers, technical specialists and academics, the datasets should be retrieved, cleaned and assessed in order to deliver efficient, relevant and credible information. Constant monitoring and regulation, as well as compliance with data security guidelines, are indispensable.

Comparing Data to Oil is a Cliché: Crude Oil Has to Go Through a Number of Steps and Pipes Before it Becomes Useful

Mon, 07 Jun 2021 10:00:00 +0200

As a developer at rOpenGov, and as an economic sociologist, what type of data do you usually use in your work?

Generally speaking, what interests me is people’s access to (or inequalities in accessing) different types of resources and their ability to transform these resources into other types of resources. The data I usually work with is the kind of data that is actually nicely covered by existing rOpenGov tools: data about population demographics and administrative units from Statistics Finland, statistical information on welfare and health from Sotkanet, and also data from Eurostat. Aside from these, a lot of my information of course comes from surveys and texts scraped from the internet.

We are placing the growing number of rOpenGov tools in a modern application with a user-friendly service and a modern data API.

In your ideal data world, what would be the ultimate dataset, or datasets that you would like to see in the Music Data Observatory?

Late spring and early summer time is, at least for me, defined by the Eurovision Song Contest. Every year watching the contest makes me ponder the state of the music industry in my home country Finland as well as in Europe. Was the song produced by homegrown talent or was it imported? Was it better received by the professional jury or the public? How well does the domestic appeal of an artist translate to the international stage? Many interesting phenomena are difficult to quantify in a meaningful way and writing a catchy song with international appeal is probably more an art than a science. Nevertheless that should not deter us from trying as music, too, is bound by certain rules and regularities that can be researched.

Music, too, is bound by certain rules and regularities that can be researched. Our Digital Music Observatory and its Listen Local experimental App does this exactly, and we would love to create Eurovision musicology datasets. Photo: Eurovision Song Contest 2021 press photo by Jordy Brada

Why did you decide to join the EU Datathon challenge team and why do you think that this would be a game changer for researchers and policymakers?

The challenge has, in my opinion, great potential in leading by example when it comes to open data access and reproducible research. Comparing data to oil is a common phrase but fitting in the sense that crude oil has to go through a number of steps and pipes before it becomes useful. Most users and especially policymakers appreciate ease-of-use of the finished product, but the quality of the product and the process must also be guaranteed somehow. Openness and peer-review practices are the best guarantors in the field of data, just as industrial standards and regulations are in the oil industry.

We provide many layers of fully transparent quality control for the data that we place in our data APIs and provide to our end-users.

Join us

Creating Algorithmic Tools to Interpret and Communicate Open Data Efficiently

Fri, 04 Jun 2021 10:00:00 +0200

As a developer at rOpenGov, what type of data do you usually use in your work?

As an academic data scientist whose research focuses on the development of general-purpose algorithmic methods, I work with a range of applications from life sciences to humanities. Population studies play a big role in our research, and often the information that we can draw from public sources - geospatial, demographic, environmental - provides invaluable support. We typically use open data in combination with sensitive research data but some of the research questions can be readily addressed based on open data from statistical authorities such as Statistics Finland or Eurostat.

In your ideal data world, what would be the ultimate dataset, or datasets that you would like to see in the Music Data Observatory?

One line of our research analyses the historical trends and spread of knowledge production, in particular book printing based on large-scale metadata collections. It would be interesting to extend this research to music, to understand the contemporary trends as well as the broader historical developments. Gaining access to a large systematic collection of music and composition data from different countries across long periods of time would make this possible.

Why did you decide to join the challenge and why do you think that this would be a game changer for researchers and policymakers?

Joining the challenge was a natural development based on our overall activities in this area; the rOpenGov project has been around for a decade now, since the early days of the broader open data movement. This has also created an active international developer network and we felt well equipped for picking up the challenge. The game changer for researchers is that the project highlights the importance of data quality, even when dealing with official statistics, and provides new methods to solve these issues efficiently through the open collaboration model. For policymakers, this provides access to new high-quality curated data and case studies that can support evidence-based decision-making.

Do you have a favorite, or most used open governmental or open science data source? What do you think about it? Could it be improved?

Regarding open government data, one of my favorites is not a single data source but a data representation standard. The px format is widely used by statistical authorities in various countries, and this has allowed us to create R tools that allow the retrieval and analysis of official statistics from many countries across Europe, spanning dozens of statistical institutions. Standardization of open data formats allows us to build robust algorithmic tools for downstream data analysis and visualization. Open government data is still too often shared in obscure, non-standard or closed-source file formats and this is creating significant bottlenecks for the development of scalable and interoperable AI and machine learning methods that can harness the full potential of open data.

Regarding open government data, one of my favorites is not a single data source but a data representation standard, the Px format.

From your perspective, what do you see being the greatest problem with open data in 2021?

Although there are a variety of open data sources available (and the numbers continue to increase), the availability of open algorithmic tools to interpret and communicate open data efficiently is lagging behind. One of the greatest challenges for open data in 2021 is to demonstrate how we can maximize the potential of open data by designing smart tools for open data analytics.

What can our automated data observatories do to make open data more credible in the European economic policy community and be accepted as verified information?

The role of the professional network backing up the project, and the possibility of getting critical feedback and later adoption by the academic communities will support the efforts. Transparency of the data harmonization operations is the key to credibility, and will be further supported by concrete benchmarks that highlight the critical differences in drawing conclusions based on original sources versus the harmonized high-quality data sets.

We need to get critical feedback and later adoption by the academic communities.

How can we ensure the long-term sustainability of the efforts?

The extent of open data space is such that no single individual or institution can address all the emerging needs in this area. The open developer networks play a huge role in the development of algorithmic methods, and strong communities have developed around specific open data analytical environments such as R, Python, and Julia. These communities support networked collaboration and provide services such as software peer review. The long-term sustainability will depend on the support that such developer communities can receive, both from individual contributors as well as from institutions and governments.

Join us

Economic and Environment Impact Analysis, Automated for Data-as-Service

Thu, 03 Jun 2021 16:00:00 +0200

We have released a new version of iotables as part of the rOpenGov project. The package, as the name suggests, works with European symmetric input-output tables (SIOTs). SIOTs are among the most complex governmental statistical products. They show how each country’s 64 agricultural, industrial, service, and sometimes household sectors relate to each other. They are estimated from various components of the GDP and from tax collection data, at least every five years.

SIOTs offer great value to policy-makers and analysts to make more than educated guesses on how a million euros, pounds or Czech korunas spent on a certain sector will impact other sectors of the economy, employment or GDP. What happens when a bank starts to give new loans and advertise them? How is an increase in economic activity going to affect the amount of wages paid, and where will consumers most likely spend their wages? As the national economies begin to reopen after the COVID-19 pandemic lockdowns, SIOTs can be used to calculate the direct and indirect employment effects or value added effects of government grant programs to sectors such as the cultural and creative industries, or to actors such as venues for performing arts, movie theaters, bars and restaurants.

Making such calculations requires a bit of matrix algebra, and an understanding of input-output economics, direct and indirect effects, and multipliers. Economists, grant designers and policy makers have those skills, but until now such calculations were made either in cumbersome Excel sheets or in proprietary software. The key to these calculations is to keep vectors and matrices, which have at least one dimension of 64, perfectly aligned. We made this process reproducible with iotables and eurostat on rOpenGov.

Our iotables package creates direct and indirect effects and multipliers programmatically. Our observatory will make those indicators available for all European countries.

Accessing and tidying the data programmatically

The iotables package is in a way an extension to the eurostat R package, which provides programmatic access to the Eurostat data warehouse. The reason for releasing a new package is that working with SIOTs requires plenty of meticulous data wrangling based on various metadata sources, apart from actually accessing the data itself. When working with matrix equations, the bar is higher than with tidy data. Not only must your rows and columns match, but their ordering must strictly conform to the quadrants of a matrix system, including the connecting trade or tax matrices.

When you download a country’s SIOT table, you receive a long form data frame, a very-very long one, which contains the matrix values and their labels like this:

## Table naio_10_cp1700 cached at C:\Users\...\Temp\RtmpGQF4gr/eurostat/naio_10_cp1700_date_code_FF.rds

# we save it for further reference here 
saveRDS(naio_10_cp1700, "not_included/naio_10_cp1700_date_code_FF.rds")

# should you need to retrieve the large tempfiles, they are in 
dir (file.path(tempdir(), "eurostat"))

dplyr::slice_head(naio_10_cp1700, n = 5)

## # A tibble: 5 x 7
##   unit    stk_flow induse  prod_na geo       time        values
##   <chr>   <chr>    <chr>   <chr>   <chr>     <date>       <dbl>
## 1 MIO_EUR DOM      CPA_A01 B1G     EA19      2019-01-01 141873.
## 2 MIO_EUR DOM      CPA_A01 B1G     EU27_2020 2019-01-01 174976.
## 3 MIO_EUR DOM      CPA_A01 B1G     EU28      2019-01-01 187814.
## 4 MIO_EUR DOM      CPA_A01 B2A3G   EA19      2019-01-01      0 
## 5 MIO_EUR DOM      CPA_A01 B2A3G   EU27_2020 2019-01-01      0

The metadata reads like this: the units are in millions of euros, we are analyzing domestic flows, and the national account items B1-B2 for the industry A01. The information of a 64x64 matrix (the SIOT) and its connecting matrices, such as taxes, or employment, or CO₂ emissions, must be placed in exactly one correct ordering of columns and rows. Every single data wrangling error will usually lead to an error (the matrix equation has no solution), or, what is worse, to an algebraic error that is very difficult to trace. Our package not only labels this data meaningfully, but creates very tidy data frames that contain each necessary matrix or vector with a key column.
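The alignment problem can be illustrated with a minimal base R sketch. It is not how iotables works internally; the three product codes, the values and the object names are invented for illustration. The idea is to fix one explicit sector ordering and to fill the matrix by label, never by position.

```r
# Toy long-form data in the shape of the Eurostat extract above.
# The product codes and all values are invented; the rows arrive shuffled.
siot_long <- data.frame(
  prod_na = c("CPA_C", "CPA_A", "CPA_B", "CPA_A", "CPA_C", "CPA_B",
              "CPA_B", "CPA_C", "CPA_A"),
  induse  = c("CPA_A", "CPA_A", "CPA_A", "CPA_B", "CPA_B", "CPA_B",
              "CPA_C", "CPA_C", "CPA_C"),
  values  = c(30, 10, 20, 40, 60, 50, 80, 90, 70),
  stringsAsFactors = FALSE
)

# One explicit ordering, used for both the rows and the columns.
sector_order <- c("CPA_A", "CPA_B", "CPA_C")

# Start from an empty, fully labelled matrix ...
flow_matrix <- matrix(
  NA_real_,
  nrow = length(sector_order), ncol = length(sector_order),
  dimnames = list(prod_na = sector_order, induse = sector_order)
)

# ... and fill it by row and column label, so the input order does not matter.
flow_matrix[cbind(siot_long$prod_na, siot_long$induse)] <- siot_long$values

# Any cell that is still NA points to a missing or mislabelled record.
stopifnot(!anyNA(flow_matrix))
print(flow_matrix)
```

Filling by label means that a reshuffled or incomplete download fails loudly instead of silently producing a misaligned matrix.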

The iotables package contains the vocabularies (abbreviations and human-readable labels) of three statistical classifications: the so-called COICOP product codes, the NACE industry codes, and the vocabulary of the ESA2010 definition of national accounts (which is the government equivalent of corporate accounting).

Our package currently solves all equations for direct, indirect effects, multipliers and inter-industry linkages. Backward linkages show what happens with the suppliers of an industry, such as catering or advertising in the case of music festivals, if the festivals reopen. The forward linkages show how much extra demand this creates for connecting services that treat festivals as a ‘supplier’, such as cultural tourism.
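A minimal sketch of the two linkage measures in base R, on an invented two-sector economy. It assumes the textbook definitions (backward linkages as column sums of the Leontief inverse, forward linkages as row sums of the Ghosh inverse); the linkage functions in iotables may differ in detail, and all names and numbers below are illustrative.

```r
# Invented inter-industry flows (rows supply, columns use) and total output.
sectors <- c("festivals", "catering")
X <- matrix(c(20, 60,
              40, 20),
            nrow = 2, byrow = TRUE, dimnames = list(sectors, sectors))
x <- c(festivals = 100, catering = 200)
I <- diag(length(x))

# Input coefficients: each column divided by the output of the using sector.
A <- sweep(X, MARGIN = 2, STATS = x, FUN = "/")
# Output coefficients: each row divided by the output of the supplying sector.
B <- sweep(X, MARGIN = 1, STATS = x, FUN = "/")

# Backward linkages: what one extra unit of demand for a sector pulls
# from all of its suppliers (column sums of the Leontief inverse).
backward_linkages <- colSums(solve(I - A))

# Forward linkages: what one extra unit of a sector's primary inputs pushes
# towards all of its users (row sums of the Ghosh inverse).
forward_linkages <- rowSums(solve(I - B))

print(round(rbind(backward_linkages, forward_linkages), 3))
```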

Let’s see an example

## Downloading employment data from the Eurostat database.

## Table lfsq_egan22d cached at C:\Users\...\Temp\RtmpGQF4gr/eurostat/lfsq_egan22d_date_code_FF.rds

We match the employment data with the latest structural information from the Symmetric input-output table at basic prices (product by product) Eurostat product. A quick look at the Eurostat website already shows that there is a lot of work ahead to make the data look like an actual symmetric input-output table. Download it with iotable_get(), which does basic labelling and preprocessing on the raw Eurostat files. Because of the size of the unfiltered dataset on Eurostat, the following code may take several minutes to run.

sk_io <-  iotable_get ( labelled_io_data = NULL, 
                        source = "naio_10_cp1700", geo = "SK", 
                        year = 2015, unit = "MIO_EUR", 
                        stk_flow = "TOTAL",
                        labelling = "iotables" )

## Reading cache file C:\Users\..\Temp\RtmpGQF4gr/eurostat/naio_10_cp1700_date_code_FF.rds

## Table  naio_10_cp1700  read from cache file:  C:\Users\..\Temp\RtmpGQF4gr/eurostat/naio_10_cp1700_date_code_FF.rds

## Saving 808 input-output tables into the temporary directory
## C:\Users\...\Temp\RtmpGQF4gr

## Saved the raw data of this table type in temporary directory C:\Users\...\Temp\RtmpGQF4gr/naio_10_cp1700.rds.

The input_coefficient_matrix_create() creates the input coefficient matrix, which is used for most of the analytical functions.

$a_{ij} = X_{ij} / x_j$

It checks the correct ordering of columns, and furthermore it replaces 0 values with 0.000001 to avoid division by zero.

input_coeff_matrix_sk <- input_coefficient_matrix_create(
  data_table = sk_io
)

## Columns and rows of real_estate_imputed_a, extraterriorial_organizations are all zeros and will be removed.
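The input coefficient formula can be sketched in base R on an invented three-sector table. This is only an illustration of the arithmetic, not the package's implementation; the sector names and numbers are made up, and replacing the zeros in the flow matrix is one assumed reading of the zero handling described above.

```r
# Toy inter-industry flow matrix X (rows supply, columns use) and the
# total output x of each sector; all numbers are invented.
sectors <- c("agriculture", "manufacturing", "services")
X <- matrix(c(10, 20,  0,
              30, 40, 50,
              20, 10, 60),
            nrow = 3, byrow = TRUE, dimnames = list(sectors, sectors))
x <- c(agriculture = 100, manufacturing = 200, services = 400)

# The columns of X and the elements of x must refer to the same sectors,
# in the same order, before anything is divided.
stopifnot(identical(colnames(X), names(x)))

# Replace exact zeros with a very small number (assumed zero handling).
X[X == 0] <- 0.000001

# a_ij = X_ij / x_j: divide every column j by the output of sector j.
A <- sweep(X, MARGIN = 2, STATS = x, FUN = "/")
print(round(A, 3))
```

Each column of A shows how much input from every sector is needed to produce one unit of output in the column's sector.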

Then you can create the Leontief inverse, which contains all the structural information about the relationships among the 64x64 sectors of the chosen country, in this case Slovakia, ready for the main equations of input-output economics.

I_sk <- leontieff_inverse_create(input_coeff_matrix_sk)
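For readers who want to see the algebra, here is a minimal base R sketch of a Leontief inverse, assuming the textbook definition L = (I - A)^-1. The two-sector coefficient matrix is invented and the object names are illustrative; the package function additionally takes care of labels and ordering.

```r
# Invented 2x2 input coefficient matrix A.
sectors <- c("sector_1", "sector_2")
A <- matrix(c(0.2, 0.3,
              0.4, 0.1),
            nrow = 2, byrow = TRUE, dimnames = list(sectors, sectors))

# Leontief inverse: the inverse of (I - A).
identity_matrix  <- diag(nrow(A))
leontief_inverse <- solve(identity_matrix - A)

# Sanity check: (I - A) multiplied by its inverse gives the identity matrix.
stopifnot(isTRUE(all.equal(
  unname((identity_matrix - A) %*% leontief_inverse), identity_matrix
)))

# Total output needed across the economy to satisfy a given final demand.
final_demand <- c(sector_1 = 100, sector_2 = 50)
total_output <- leontief_inverse %*% final_demand
print(round(total_output, 2))
```

In this toy economy, 100 units of final demand for the first sector and 50 for the second require 175 and about 133.33 units of total output, because the sectors also supply each other.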

And take out the primary inputs:

primary_inputs_sk <- coefficient_matrix_create(
  data_table = sk_io, 
  total = 'output', 
  return = 'primary_inputs')

## Columns and rows of real_estate_imputed_a, extraterriorial_organizations are all zeros and will be removed.

Now let’s see what happens if the government tries to stimulate the economy in three sectors, agriculture, car manufacturing, and R&D, with a billion euros. Direct effects measure the initial, direct impact of the change in demand and supply for a product. When production goes up, it will create demand in all supply industries (backward linkages) and create opportunities in the industries that use the product themselves (forward linkages).

library(dplyr)  # provides %>%, select(), filter(), all_of() and .data

direct_effects_create( primary_inputs_sk, I_sk ) %>%
  select ( all_of(c("iotables_row", "agriculture",
                    "motor_vechicles", "research_development"))) %>%
  filter (.data$iotables_row %in% c("gva_effect", "wages_salaries_effect", 
                                    "imports_effect", "output_effect"))

##            iotables_row agriculture motor_vechicles research_development
## 1        imports_effect   1.3684350       2.3028203            0.9764921
## 2 wages_salaries_effect   0.2713804       0.3183523            0.3828014
## 3            gva_effect   0.9669621       0.9790771            0.9669467
## 4         output_effect   2.2876287       3.9840251            2.2579634

Car manufacturing requires many imported components, so each extra unit of demand will create a large importing activity. R&D will create the most local wages (and support the most jobs) because research is job-intensive. As we can see, the effects on imports, wages, gross value added (which will end up in the GDP) and output are very different in these three sectors.
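The kind of matrix product behind such an effects table can be sketched in base R. The sketch assumes that the effects are the primary input coefficients multiplied by the Leontief inverse; whether this matches direct_effects_create() in every detail is an assumption, and the two-sector numbers are invented.

```r
# Invented input coefficients and primary input coefficients
# (per unit of output) for a two-sector economy without imports.
sectors <- c("sector_1", "sector_2")
A <- matrix(c(0.2, 0.3,
              0.4, 0.1),
            nrow = 2, byrow = TRUE, dimnames = list(sectors, sectors))
primary_input_coefficients <- rbind(
  gva            = c(sector_1 = 0.4, sector_2 = 0.6),
  wages_salaries = c(sector_1 = 0.1, sector_2 = 0.3)
)

leontief_inverse <- solve(diag(nrow(A)) - A)

# Each row shows how much GVA or wages one extra unit of final demand
# for the column's sector generates across the whole economy.
effects <- primary_input_coefficients %*% leontief_inverse
print(round(effects, 3))
```

Because this toy economy has no imports or product taxes, one unit of final demand ends up as exactly one unit of gross value added in both sectors. In the Slovak table above the GVA effects stay slightly below one, because part of the demand leaks into imports.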

This is not the total effect, because some of the increased production will translate into income, which in turn will be used to create further demand in all parts of the domestic economy. The total effect is characterized by multipliers.

Then solve for the multipliers:

multipliers_sk <- input_multipliers_create( 
  primary_inputs_sk %>%
    filter (.data$iotables_row == "gva"), I_sk )

And select a few industries:

set.seed(12)
multipliers_sk %>% 
  tidyr::pivot_longer ( -all_of("iotables_row"), 
                        names_to = "industry", 
                        values_to = "GVA_multiplier") %>%
  select (-all_of("iotables_row")) %>%
  arrange( -.data$GVA_multiplier) %>%
  dplyr::sample_n(8)

## # A tibble: 8 x 2
##   industry               GVA_multiplier
##   <chr>                           <dbl>
## 1 motor_vechicles                  7.81
## 2 wood_products                    2.27
## 3 mineral_products                 2.83
## 4 human_health                     1.53
## 5 post_courier                     2.23
## 6 sewage                           1.82
## 7 basic_metals                     4.16
## 8 real_estate_services_b           1.48
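A multiplier of this type can be reproduced by hand on a toy example. The sketch below assumes that an input multiplier is the economy-wide effect divided by the sector's own primary input coefficient, which is the usual textbook normalization; the exact internals of input_multipliers_create() are not shown in this post, and the numbers are invented.

```r
# Invented two-sector economy: input coefficients and GVA per unit of output.
sectors <- c("sector_1", "sector_2")
A <- matrix(c(0.2, 0.3,
              0.4, 0.1),
            nrow = 2, byrow = TRUE, dimnames = list(sectors, sectors))
gva_coefficients <- c(sector_1 = 0.4, sector_2 = 0.6)

leontief_inverse <- solve(diag(nrow(A)) - A)

# Economy-wide GVA created by one unit of final demand in each sector.
gva_effects <- as.numeric(gva_coefficients %*% leontief_inverse)

# Multiplier: economy-wide effect relative to the sector's own direct GVA.
gva_multipliers <- gva_effects / gva_coefficients
print(round(gva_multipliers, 3))
```

A sector with a small own GVA share but strong supplier links gets a large multiplier, which is consistent with car manufacturing standing out in the Slovak results above.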

Vignettes

The Germany 1990 vignette provides an introduction to input-output economics and re-creates the examples of the Eurostat Manual of Supply, Use and Input-Output Tables, by Jörg Beutel (Eurostat Manual).

The United Kingdom Input-Output Analytical Tables vignette by Daniel Antal, based on the work edited by Richard Wild, is a use case on how to correctly import data from outside Eurostat (i.e. not with eurostat::get_eurostat()) and join it properly to a SIOT. We also used this example to create unit tests of our functions from a published, official government statistical release.

Finally, Working With Eurostat Data is a detailed use case of working with all the current functionalities of the package by comparing two economies, Czechia and Slovakia, and it guides you through a lot more examples than this short blogpost.

Our package was originally developed to calculate GVA and employment effects for the Slovak music industry, and similar calculations for the Hungarian film tax shelter. We can now programmatically create reproducible multipliers for all European economies in the Digital Music Observatory, and create further indicators for economic policy making in the Economy Data Observatory.

Environmental Impact Analysis

Our package allows the calculation of various economic policy scenarios, such as changing the VAT on meat or the effects of re-opening music festivals on aggregate demand, GDP, tax revenues, or employment. But what about the CO₂, methane and other greenhouse gas effects of reopening festivals, or of increasing meat prices?

Technically our package can already calculate such effects, but to do so, you have to carefully match further statistical vocabulary items used by the European Environment Agency about air pollutants and greenhouse gases.

The last released version of iotables is Importing and Manipulating Symmetric Input-Output Tables (Version 0.4.4). Zenodo. https://doi.org/10.5281/zenodo.4897472, but we are already working on a new major release. In that release, we are planning to build the necessary vocabulary into the metadata functions to increase the functionality of the package, and create new indicators for our Green Deal Data Observatory. This experimental data observatory is creating new, high quality statistical indicators from open governmental and open science data sources that have not seen the daylight yet.

rOpenGov and the EU Datathon Challenges

rOpenGov, Reprex, and other open collaboration partners teamed up to build further on our expertise in open source statistical software development: we want to create a technologically and financially feasible data-as-service to put our reproducible research products into wider use for the business analyst, scientific researcher and evidence-based policy design communities.

rOpenGov is a community of open governmental data and statistics developers with many packages that make programmatic access to, and work with, open data possible in the R language. Reprex is a Dutch startup that teamed up with rOpenGov and other open collaboration partners to create a technologically and financially feasible service to exploit reproducible research products for the wider business, scientific and evidence-based policy design community. Open data is a legal concept: it means that you have the right to reuse the data, but often the reuse requires significant programming and statistical know-how. We entered the annual EU Datathon competition in all three challenges with our applications to provide not only open-source software, but also daily updated, validated, documented, high-quality statistical indicators as open data in an open database. Our iotables package is one of our many open-source building blocks to make open data more accessible to all.

New Indicators for Computational Antitrust

Wed, 02 Jun 2021 17:00:00 +0200

As someone who’s worked in data for almost 20 years, what type of data do you usually use in your research?

In my field (industrial organisation, competition policy), company level financial data, and product price and sales data have been the conventional building blocks of research papers. Ideally this has been the sort of data that I would seek out for my work. Of course as academic researchers we often get knocked back by the reality of data access and availability. I would think that industrial organisation is one of those fields where researchers have to be quite innovative in terms of answering interesting and relevant policy questions, whilst having to operate in an environment where most relevant data is proprietary and very expensive. Against this backdrop, I have worked with neatly organised proprietary datasets, self-assembled data collections, and also textual data.

From your experience working with various data sets, models, and frameworks, what would be the ultimate dataset, or datasets that you would like to see from the Economy Data Observatory?

There seems to be an emerging consensus that market concentration and markups have been continuously increasing across the economy. But most of these works use industry classification to define markets. One of the things I’d really like to see coming out of the Economy Data Observatory is a mapping of what we call antitrust markets.

Mapping NACE to Antitrust Markets.

Available datasets use standard industry classification (such as NACE in the EU), which is often very different from what we call a product market in microeconomics. Product markets are defined by demand-side and supply-side substitutability, which is a dynamically evolving feature and difficult to capture systematically on a wider scale. But with the recent proliferation of data and the growth (and fall in price) of computing power, I am positive that we could attempt to map out the European economy along these product market boundaries. Of course this is not without challenges. For example in digital markets, traditional ways to define markets have caused serious difficulties for competition authorities around the world.

I believe that there is an immensely rich, and largely unexplored source of information in unstructured textual data that would be hugely useful for applied microeconomic works, including my own area of IO and competition policy. This includes a large corpus of administrative and court decisions that relate to businesses, such as merger control decisions of the European Commission. To give two examples from my experience, we’ve used a large corpus of news reports related to various firms to gauge the reputational impact of European Commission cartel investigations, or we’ve trained an algorithm to be able to classify US legislative bills and predict whether they have been lobbied or not. Finding a way to collect and convert this unstructured data into a format that is relevant and useful for users is not a trivial challenge, but is one of the most exciting parts of our Economy Data Observatory plans (see related project plan).

Finding a way to collect and convert this unstructured data into a format that is relevant and useful for users is not a trivial challenge, but is one of the most exciting parts of our Economy Data Observatory plans.

What is an idea that you consider will be a game changer for researchers and/or policymakers?

Partly talking in the past tense, the use of data driven approaches, automation in research, and machine learning have been increasingly influential and I think this trend will continue to all areas of social science. 10 years ago, to do machine learning, you had to build your models from scratch, typically requiring a solid understanding of programming and linear algebra. Today, there are readily available deep learning frameworks like TensorFlow, Keras, PyTorch, to design a neural network for your own application. 10 years ago, natural language processing would have only been relevant for a small group of computational linguists. Today we have massive word embedding models trained on an enormous corpus of texts, at the fingertip of any researcher. 10 years ago, the cost of computing power would have made it prohibitive for most researchers to run even relatively shallow neural networks. Today, I can run complex deep learning models on my laptop using cloud computing servers. As a result of these developments, whereas 10 years ago one would have needed a small (or large) research team to explore certain research questions, much of this can now be automated and be done by a single researcher. For researchers without access to large research grants and without the ability to hire a research team, this has truly been an amazing victory for the democratisation of research.

You can already try out our API.

Do you have a favorite, or most used open governmental or open science data source? What do you think about it? Could it be improved?

As a competition economist, I tend to need very specific data for each research question I’m working on, which has to be collected from scratch. On the other hand, most works do require us to use data that has already been collected and made available. For example, access to census data has been immensely useful in ensuring that we can control for local demographic features, in papers where local competition plays a role. Census data is made readily available by most governments, but I particularly liked the Australian data, partly because they run a census every 5 years, but also because they have made the data available through a great table making tool.

Is there a number that recently surprised you? What was it?

I have these moments of surprise fairly frequently. To give one example from something I’m currently working on, looking at the distributional impact of increasing market concentration, we’ve found that low income households experience a larger increase in the petrol retail margin when market concentration increases than high income households. This fits nicely with theoretical works on search in homogeneous costs, i.e. low income households are less good at engaging with the market, and, as a result, if suppliers can price discriminate, they will charge a higher margin to these households.

The figure below shows our raw data (18 years of petrol station level daily price data from Western Australia) for low and high income areas, and the increase in the margin following an increase in market concentration (vertical dotted line). The left hand side (low income areas) displays a large increase in the margin when compared to a control group, whereas the right hand side (high income areas) shows no change. In our paper, of course, we build a fairly data intensive quasi-experiment to identify the treatment effect of changing market concentration on the price margin applied to various demographic groups.

Surprising findings: market concentration and margin changes for petrol stations.

Do you have a good example of really good, or really bad use of data science /data curation?

Out of professional courtesy I really wouldn’t like to mention names from academic research as examples of bad use of data. But there are ample examples from newspaper coverage of data related work, or simply the misuse of data by newspapers. This may be intentional but is often a result of journalists not having the necessary training in using and analysing data.

When the press finds a piece of academic research interesting, often bad things come out of it. This is often because not all journalists are well equipped to interpret scientific findings. As a result, sometimes conclusions are drawn from a misinterpretation of good data analysis. Correlation interpreted as causation is a frequently recurring example. Equally bad is when press coverage changes the incentive system of producing good research: scientists work too hard for their work to be noticed by the press, and sacrifice scientific rigour in data analysis for the sake of media attention. There can also be less discernible but equally damaging errors.

Requiring researchers to pre-disclose the tests that they are going to run on the data helps maintain credibility in many instances. Moreover, I am always a bit suspicious if the authors do not give access to their data for reproduction.

Our Economy Data Observatory places all new indicators on Zenodo with a DOI, and asks future individual contributors to deposit their data for replication there.

What do you see as the greatest challenge with open data in 2021?

The things I mentioned above about the democratisation of research driven by automation and access to big data do raise serious challenges as well. The obvious one has to do with the fact that there are enormous economies of scale in the use of data. As such, larger players will always be better positioned to outdo their smaller competitors, simply as a result of their superior data and infrastructure (for example, having more granular consumer data allows them to offer a better designed, customised experience for the consumer). Like many others, I see this as the biggest challenge for open data: to level the playing field for smaller players. This is not a trivial task at all; and even if, miraculously, small businesses could access the same data as the biggest players, they still would not have the capacity or the ability to use this data. So allowing access to data alone is unlikely to solve any of these problems. I would say that fostering engagement with open data is probably as big a challenge as creating the open data in the first place.

How do you envision the Economy Data Observatory making open data more credible in the European economic policy community and accepted as verified information?

I think starting with a focused agenda is a good idea. For example, linking up with the Centre for Competition Policy means that we have an initial focus on economic data relevant to competition policy. This is still a large domain, but it is one where we have ample expertise. Starting with specific research questions, such as linking competition enforcement and merger decisions to related information on innovation and ownership data, puts the Economy Data Observatory at the heart of some of the most topical policy questions, such as the role of killer acquisitions (acquisitions with the intent to kill off sources of rival innovation), or common ownership, both of which are increasingly discussed in policy and practitioner circles. Once we have established ourselves as a credible source of data in the competition policy community, we can look into joining this up with other policy areas, and also with our other Data Observatories (Music and Green Deal).

Join us

Data API

Tue, 01 Jun 2021 11:00:00 +0000

Our observatory has a new data API which allows access to our daily refreshing open data. You can access the API via api.economy.dataobservatory.eu (apologies for the ugly, temporary subdomain masking!).

All the data and the metadata are available as open data, without database use restrictions, under the ODbL license. However, the metadata contents are not finalized yet. We are currently working on a solution that applies the FAIR Guiding Principles for scientific data management and stewardship, and fulfills the mandatory requirements of the Dublin Core metadata standards as well as the mandatory requirements and most of the recommended requirements of DataCite. These changes will be effective before 1 July 2021.

The Competition Data Observatory temporarily shares an API with the Economy Data Observatory, which serves as an incubator for similar economy-oriented reproducible research resources.

Indicator table

The indicator table contains the actual values, and the various estimated/imputed values of the indicator, clearly marking missing values, too.

api.economy.dataobservatory.eu: indicator retrieval

You can get the data in CSV or JSON format, or write SQL queries. (Tutorials in SQL, R, and Python will be posted shortly.)
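As a minimal sketch of what working with a downloaded indicator table might look like: the column names (geo, time, value, obs_status), the table name indicator, and the sample rows below are hypothetical, not the documented schema of the API. The example loads CSV text into an in-memory SQLite database so that it can be queried with SQL, and serialises the same rows as JSON.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample of an indicator table in CSV form; the real column
# names and values served by the API may differ.
SAMPLE_CSV = """geo,time,value,obs_status
NL,2018,10.0,actual
NL,2019,,missing
NL,2020,14.0,actual
HU,2018,7.5,actual
HU,2019,8.0,imputed
"""


def load_rows(csv_text):
    """Parse CSV text into dicts, turning empty values into None."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        rec["time"] = int(rec["time"])
        rec["value"] = float(rec["value"]) if rec["value"] else None
        rows.append(rec)
    return rows


def to_sqlite(rows):
    """Load the rows into an in-memory SQLite table named 'indicator'."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE indicator "
        "(geo TEXT, time INTEGER, value REAL, obs_status TEXT)"
    )
    con.executemany(
        "INSERT INTO indicator VALUES (:geo, :time, :value, :obs_status)", rows
    )
    return con


rows = load_rows(SAMPLE_CSV)
con = to_sqlite(rows)

# SQL query: keep only the actual (neither imputed nor missing) observations.
actual = con.execute(
    "SELECT geo, time, value FROM indicator "
    "WHERE obs_status = 'actual' ORDER BY geo, time"
).fetchall()

# The same rows as a JSON document; missing values become null.
as_json = json.dumps(rows)
```

Keeping the observation status in its own column means that a user can decide in the query whether estimated values are acceptable for a given analysis.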

Description metadata table

Processing Metadata table

The metadata table contains various data processing information, such as the first and last actual observation of the indicator, the number of approximated, forecasted, backcasted values, last update at source and in our system, and so on.
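To illustrate how such processing information can be derived from an indicator table, here is a minimal sketch. The status labels, field names, and sample rows are assumptions made for the example; they are not the output format of the observatory's software.

```python
from collections import Counter

# Hypothetical indicator rows: (time, value, status).
ROWS = [
    (2015, 4.0, "backcasted"),
    (2016, 5.0, "actual"),
    (2017, 5.5, "approximated"),
    (2018, 6.0, "actual"),
    (2019, 6.4, "forecasted"),
    (2020, 6.8, "forecasted"),
]


def processing_metadata(rows):
    """Summarise the first/last actual observation and the estimated values."""
    actual_times = [t for t, _, status in rows if status == "actual"]
    counts = Counter(status for _, _, status in rows)
    return {
        "first_actual_observation": min(actual_times) if actual_times else None,
        "last_actual_observation": max(actual_times) if actual_times else None,
        "approximated": counts["approximated"],
        "forecasted": counts["forecasted"],
        "backcasted": counts["backcasted"],
    }


meta = processing_metadata(ROWS)
```

Only the rows marked as actual determine the first and last observation, so backcasts and forecasts never extend the reported observation period.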

api.economy.dataobservatory.eu: processing metadata

Authoritative Copies

Greendeal Data Observatory on Zenodo

Metadata

Tue, 01 Jun 2021 11:00:00 +0000

The Competition Data Observatory temporarily shares an API with the Economy Data Observatory, which serves as an incubator for similar economy-oriented reproducible research resources.

api.economy.dataobservatory.eu: processing metadata

Descriptive Metadata


Identifier	An unambiguous reference to the resource within a given context. (Dublin Core item, but several identifiers are allowed, and we will use several of them.)
Creator	The main researchers involved in producing the data, or the authors of the publication, in priority order. To supply multiple creators, repeat this property. (Extends the Dublin Core with multiple authors, and legal persons, and adds affiliation data.)
Title	A name given to the resource. Extends Dublin Core with alternative title, subtitle, translated Title, and other title(s).
Publisher	The name of the entity that holds, archives, publishes, prints, distributes, releases, issues, or produces the resource. This property will be used to formulate the citation, so consider the prominence of the role. For software, use Publisher for the code repository. (Dublin Core item.)
Publication Year	The year when the data was or will be made publicly available.
Resource Type	We publish Datasets, Images, Report, and Data Papers. (Dublin Core item with controlled vocabulary.)

Recommended for discovery

The Recommended (R) properties are optional, but strongly recommended for interoperability.


Subject	The topic of the resource. (Dublin Core item.)
Contributor	The institution or person responsible for collecting, managing, distributing, or otherwise contributing to the development of the resource. (Extends the Dublin Core with multiple authors, and legal persons, and adds affiliation data.) When applicable, we add Distributor (of the datasets and images), Contact Person, Data Collector, Data Curator, Data Manager, Hosting Institution, Producer (for images), Project Manager, Researcher, Research Group, Rightsholder, Sponsor, Supervisor
Date	A point or period of time associated with an event in the lifecycle of the resource, besides the Dublin Core minimum we add Collected, Created, Issued, Updated, and if necessary, Withdrawn dates to our datasets.
Related Identifier	An identifier or identifiers other than the primary Identifier applied to the resource being registered.
Rights	We give SPDX License List standards rights description with URLs to the actual license. (Dublin Core item: Rights Management)
Description	Recommended for discovery. (Dublin Core item.)
GeoLocation	Similar to Dublin Core item Coverage

The Subject property: we need to set standard coding schemas for each observatory.
Contributor property:
- DataCurator the curator of the dataset, who sets the mandatory properties.
- DataManager the person who keeps the dataset up-to-date.
- ContactPerson the person who can be contacted for reuse requests or bug reports.
The Date property contains the following dates, which are set automatically by the dataobservatory R package:
- Updated when the dataset was updated;
- EarliestObservation, which is the earliest observation that is not backcasted, estimated or imputed.
- LatestObservation, which is the latest observation that is not forecasted, estimated or imputed.
- UpdatedatSource, when the raw data source was last updated.
The GeoLocation is automatically created by the dataobservatory R package.
The Description property has optional elements, and we adopted them as follows for the observatories:
- The Abstract is a short, textual description; we try to automate its creation as much as possible, but some curatorial input is necessary.
- In the TechnicalInfo sub-field, we automatically record the utils::sessionInfo() output for computational reproducibility. This is automatically created by the dataobservatory R package.
- In the Other sub-field, we record the keywords for structuring the observatory.

Optional

The Optional (O) properties are optional and provide richer description. For findability they are not so important, but to create a web service, they are essential. In the mandatory and recommended fields, we are following other metadata standards and codelists, but in the optional fields we have to build up our own system for the observatories.


Language	A language of the resource. (Dublin Core item.)
Alternative Identifier	An identifier or identifiers other than the primary Identifier applied to the resource being registered.
Size	We give the CSV, downloadable dataset size in bytes.
Format	We give file format information. We mainly use CSV and JSON, and occasionally rds and SPSS types. (Dublin Core item.)
Version	The version number of the resource.
Rights	We give SPDX License List standards rights description with URLs to the actual license. (Dublin Core item: Rights Management)
Funding Reference	We provide the funding reference information when applicable. This is usually mandatory with public funds.
Related Item	We give information about our observatory partners' related research products, awards, grants (also Dublin Core item as Relation.) We particularly include source information when the dataset is derived from another resource (which is a Dublin Core item.)

In the Language we only use English (eng) at the moment.
By default, we do not use the Alternative Identifier property. We will use it when the same dataset is used in several observatories.
The Size property is measured in bytes for the CSV representation of the dataset. During creation, the software writes a temporary CSV file to check that the dataset can be written without problems, and measures the dataset size.
The Version property needs further work. For a daily refreshing API we need to find an applicable versioning system.
The Funding reference will contain information for donors, sponsors, and co-financing partners.
Our default setting for Rights is the CC-BY-NC-SA-4.0 license, and we provide a URI for the license document.
In the RelatedItem we give information about:
- The original (raw) data source.
- Methodological bibliography reference, when needed.
- The open-source statistical software code that processed the data.
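The Size check described above can be sketched as follows. This is an illustration in Python, not the implementation in the dataobservatory R package; the function name csv_size_in_bytes and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def csv_size_in_bytes(header, rows):
    """Write the rows to a temporary CSV, read them back, and return the size."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(rows)
        # Read the file back as a basic check that writing worked.
        with open(path, newline="", encoding="utf-8") as fh:
            read_back = list(csv.reader(fh))
        if len(read_back) != len(rows) + 1:
            raise ValueError("CSV round trip lost or added rows")
        return os.path.getsize(path)
    finally:
        # The temporary file is only needed for the check and the measurement.
        os.remove(path)


size = csv_size_in_bytes(
    ["geo", "time", "value"],
    [["NL", 2019, 1.5], ["HU", 2019, 2.0]],
)
```

Measuring the written file rather than the in-memory object gives the size a user will actually download.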

Administrative (Processing) Metadata

Administrative Metadata

As with diamonds, it is better to know the history of a dataset, too. Our administrative metadata contains codelists that follow the SDMX statistical metadata standards, and similarly structured information about the processing history of the dataset.

For further reference, see The codebook Class.


Observation Status	SDMX Code list for Observation Status 2.2 (CL_OBS_STATUS), such as actual, missing, imputed, etc. values.
Method	If the value is estimated, we provide modelling information.
Unit	We provide the measurement unit of the data (when applicable.)
Frequency	SDMX Code list for Frequency 2.1 (CL_FREQ) frequency values
Codelist	Euro-SDMX Codelist entries for the observational units, such as sex, etc.
Imputation	SDMX Code list for Imputation Methods (CL_IMPUT_METH) imputation values
Estimation	The estimation methodology of data that we calculated, together with citation information and URI to the actual processing code
Related Item	We give information about the software code that processed the data (both Dublin Core and DataCite compliant.)

See an example in The codebook Class article of the dataobservatory R package.

The Economy Data Observatory is Contesting the EU Datathon 2021 Prize

Fri, 21 May 2021 20:00:00 +0200

Reprex, a Dutch start-up enterprise formed to utilize open source software and open data, is looking for partners in an agile, open collaboration to win at least one of the three EU Datathon Prizes. We are looking for policy partners, academic partners and a consultancy partner. Our project is based on agile, open collaboration with three types of contributors.

With our competing prototypes we want to show that we have a research automation technology that can find open data, process it and validate it into high-quality business, policy or scientific indicators, and release it with daily refreshments in a modern API.

We are looking for institutions to challenge us with their data problems, and sponsors to increase our capacity. Over the next 5 months, we need to find a sustainable business model for a high-quality and open alternative to other public data programs.

The EU Datathon 2021 Challenge

“To take part, you should propose the development of an application that links and uses open datasets.” - our data curator team
“Your application … is also expected to find suitable new approaches and solutions to help Europe achieve important goals set by the European Commission through the use of open data.” - this application is developed by our technology contributors
“Your application should showcase opportunities for concrete business models or social enterprises.” - our service development team is working to make this happen!
We use open source software and open data. The applications are hosted on the cloud resources of Reprex, an early-stage technology startup currently building a viable, open-source, open-data business model to create reproducible research products.
We are working together with experts in the domain as curators (check out our guidelines if you want to join: Data Curators: Get Inspired!).
Our development team works on an open collaboration basis. Our indicator R packages, and our services are developed together with rOpenGov.

Mission statement

We want to win an EU Datathon prize by processing the vast, already-available governmental and scientific open data and making it usable for policy-makers, scientific researchers, and business researchers.

“To take part, you should propose the development of an application that links and uses open datasets. Your application should showcase opportunities for concrete business models or social enterprises. It is also expected to find suitable new approaches and solutions to help Europe achieve important goals set by the European Commission through the use of open data.”

We aim to win at least one first prize in the EU Datathon 2021. We are contesting all three challenges, which are related to the EU’s official strategic policies for the coming decade.

Challenge 2: An economy that works for people

Our Economy Data Observatory will focus on competition, small and medium-sized enterprises and robotization.

Challenge 2: An economy that works for people, with a particular focus on the Single market strategy, and particular attention to the strategy’s goals of 1. Modernising our standards system, 2. Consolidating Europe’s intellectual property framework, and 3. Enabling the balanced development of the collaborative economy.

Big data and automation create new inequalities and injustices and have the potential to create a jobless growth economy. Our Economy Data Observatory is a fully automated, open source, open data observatory that produces new indicators from open data sources and experimental big data sources, with authoritative copies and a modern API.

Our observatory monitors the European economy to protect consumers and small companies from unfair competition, both from data and knowledge monopolization and robotization. We take a critical Small and Medium-Sized Enterprises (SME)-, intellectual property, and competition policy point of view of automation, robotization, and the AI revolution on the service-oriented European social market economy.

We would like to create early-warning, risk, economic effect, and impact indicators that can be used in scientific, business, and policy contexts for professionals who are working on re-setting the European economy after a devastating pandemic in the age of AI. We are particularly interested in designing indicators that can be early warnings for killer acquisitions, algorithmic and offline discrimination against consumers based on nationality or place of residence, and signs of undermining key economic and competition policy goals. Our goal is to help small and medium-sized enterprises and start-ups to grow, and to furnish data that encourages the financial sector to provide loans and equity funds for their growth.

Other Challenges

Challenge 1: A European Green Deal, with a particular focus on the European Climate Pact, the Organic Action Plan, and the New European Bauhaus, i.e., mitigation strategies. Our Green Deal Data Observatory is a modern reimagination of existing ‘data observatories’; currently, there are over 70 permanent international data collection and dissemination points. One of our objectives is to understand why the dozens of the EU’s observatories do not use open data and reproducible research. We want to show that open governmental data, open science, and reproducible research can lead to a higher quality and faster data ecosystem that fosters growth for policy, business, and academic data users.
Challenge 3: A Europe fit for the digital age, with a particular focus on Artificial Intelligence, the European Data Strategy, the Digital Services Act, Digital Skills and Connectivity. The Digital Music Observatory (DMO) is a fully automated, open source, open data observatory that creates public datasets to provide a comprehensive view of the European music industry. It provides high-quality and timely indicators in all four pillars of the planned official European Music Observatory as a modern, open source and largely open data-based, automated, API-supported alternative solution for this planned observatory. The insight and methodologies we are refining in the DMO are applicable and transferable to about 60 other data observatories funded by the EU which do not currently employ governmental or scientific open data.

Our Product/Market Fit was validated in the world’s 2nd ranked university-backed incubator program, the Yes!Delft AI Validation Lab. We are currently developing this project with the help of the JUMP European Music Market Accelerator program.

Problem Statement

The EU has an 18-year-old open data regime, and it makes public taxpayer-funded data worth tens of billions of euros per year; the Eurostat program alone handles 20,000 international data products, including at least 5,000 pan-European environmental indicators.

As open science principles gain increased acceptance, scientific researchers are making hundreds of thousands of valuable datasets public and available for replication every year.

The EU, the OECD, and UN institutions run around 100 data collection programs, so-called ‘data observatories’ that more or less avoid touching this data, and buy proprietary data instead. Annually, each observatory spends between 50 thousand and 3 million EUR on collecting untidy and proprietary data of inconsistent quality, while never even considering open data.

Our automated data observatories are modern reimaginations of the existing observatories that do not use open data and research automation.

The problem with the current EU data strategy is that while it produces enormous quantities of valuable open data, in the absence of common basic data science and documentation principles, it often seems cheaper to create new data than to put the existing open data into shape.

This is an absolute waste of resources and efforts. With a few R packages and our deep understanding of advanced data science techniques, we can create valuable datasets from unprocessed open data. In most domains, we are able to repurpose data originally created for other purposes at a historical cost of several billions of euros, converting these unused data assets into valuable datasets that can replace tens of millions’ worth of proprietary data.

What we want to achieve with this project (and we believe such an accomplishment would merit one of the first prizes) is to add value to a significant portion of pre-existing EU open data (for example, available on data.europa.eu/data) by re-processing and integrating it into a modern, tidy database with API access, and to find a business model that emphasises a triangular use of data in 1. business, 2. science and 3. policy-making. Our mission is to modernize the concept of data observatories.

Competition

Sun, 16 May 2021 00:00:00 +0000

Big data and AI create new inequalities in the world. Data monopolization can potentially lead to product and service market monopolization, too. We are designing new policy indicators that can help competition policy practitioners and researchers to measure and understand these trends.

We are planning to programmatically access the European Commission’s database of merger cases, which we believe falls under the scope of the Open Data Directive.
We would like to connect to the EUIPO databases to create indicators of potential killer acquisitions: acquisitions by large companies that may thwart innovation by buying up and killing start-up companies to take hold of competitive patents.

Mapping antitrust markets

Most market-based analysis uses standardised industry classifications such as the NACE system in the European Union. Whilst these systems are useful for a standardised view of the component parts of the economy, these markets are frequently very different from product markets that are defined by demand and supply conditions (the literature refers to these as antitrust markets). One of the cornerstones of our Data Observatory is automating the construction and updating of a system of antitrust markets. This can be done from available antitrust market decisions. For example, Affeldt et al. (2021) manually identified 20,000 product/geographic antitrust markets affected by over 2,000 mergers from DG COMP merger decisions. Automating this process will create a continuously updated source of market definitions that can directly benefit a number of users, such as small businesses caught up in costly merger litigation, academics, and policy organisations. This new market classification system could also contribute to the topical debate on increasing market concentration and markups (where most current analysis uses standard industry classification systems). Under the umbrella of our Antitrust Market Classification (AMC) and under a standard industry classification (NACE) we will create three company-level tracker datasets:

Computational Antitrust Project Plan

Merger tracker

We compile a merger database while constructing our antitrust markets. This will be complemented by continuously monitoring markets for M&A transactions that are below the radar of DG COMP. As a first step, we will only have this option for selected markets, until we develop a tool to systematically collect this data on a wider scope. Tracking mergers in both NACE and AMC markets helps trace how concentration evolves over time. Moreover, this is a key building block of our Competition and Innovation Data Observatory.
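The text does not say which concentration measure the tracker will use. As an illustration only, the Herfindahl-Hirschman Index (HHI), a standard measure defined as the sum of squared market shares, could be computed per market and year. The market label, firm names, and revenues below are made up.

```python
from collections import defaultdict

# Hypothetical records: (market, year, firm, revenue).
RECORDS = [
    ("market_A", 2019, "firm_1", 50.0),
    ("market_A", 2019, "firm_2", 30.0),
    ("market_A", 2019, "firm_3", 20.0),
    # In 2020 firm_1 has acquired firm_3 (a made-up merger).
    ("market_A", 2020, "firm_1", 70.0),
    ("market_A", 2020, "firm_2", 30.0),
]


def hhi_by_market_year(records):
    """Return {(market, year): HHI}, with shares in percent (scale 0..10000)."""
    revenue = defaultdict(dict)
    for market, year, firm, value in records:
        firms = revenue[(market, year)]
        firms[firm] = firms.get(firm, 0.0) + value
    result = {}
    for key, firms in revenue.items():
        total = sum(firms.values())
        result[key] = sum((100.0 * v / total) ** 2 for v in firms.values())
    return result


hhi = hhi_by_market_year(RECORDS)
```

Computing the same index once with NACE industry codes and once with antitrust market labels as the market key would show how much the choice of market definition changes the measured concentration.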

Innovation tracker

Part of this project is to link the firms that appear in our antitrust market definitions to their historical and current innovation data. For this we will draw from patent data APIs, such as that of the European Patent Office. This would enable the tracking of how innovation develops as markets become more or less concentrated. This is important for several reasons. First of all, innovation is an important component of economic productivity, and therefore our tracker will provide vital information to competition authorities and sectoral regulators. Second, there are increasingly vocal concerns about the innovation impact of acquisitions between companies that are either on separate antitrust markets, or include an acquired business that is below the size threshold of antitrust scrutiny (these are frequently referred to as killer acquisitions in the literature). One side of the ongoing debate argues that these acquisitions can hamper innovation, whereas others claim the opposite (it is the prospect of these acquisitions that drives innovation in many small start-up businesses). As such, our tool will be a key indicator of how healthy the environment for innovation is for small businesses.

Ownership tracker

Finally, we will construct an ownership tracker from the shareholder reports of listed companies (from the set of companies identified in our market definition process). Once again, this is key information for gauging how healthy competition is in European markets. It has been repeatedly highlighted that overlaps in the ownership structure (in institutional investors) of the largest businesses can hinder competition. Mapping the ownership structure of European listed companies means that we are aware not only of market concentration concerns in antitrust markets, but also of concerns that arise when overlap across corporate governance structures is considered.

Photo credit: Matt Ridley, Unsplash licence.

Data Sharing

Sun, 16 May 2021 00:00:00 +0000

We would like to actively encourage the sharing of data assets.

Open Data

Sun, 16 May 2021 00:00:00 +0000

Many countries in the world allow access to a vast array of information, such as documents under freedom of information requests, statistics, and datasets. In the European Union, most taxpayer-financed data in government administration, transport, or meteorology, for example, can usually be re-used. More and more scientific output is expected to be reviewable and reproducible, which implies open access.

What’s the Problem with Open Data?

How We Add Value?

Is There Value in It?
If it’s money on the street, why is nobody picking it up?

Datasets Should Work Together to Give Information
Data is only potential information, raw and unprocessed.

What’s the Problem with Open Data?

“Data is stuff. It is raw, unprocessed, possibly even untouched by human hands, unviewed by human eyes, un-thought-about by human minds.” [1]

Most open data cannot be just “downloaded.”
Often, you need to put more than $100 worth of work into processing, validating, and documenting a dataset that is worth $100. But you can share this investment with our data observatories.
Open data almost always lacks documentation and clear references for validating whether the data is reliable and uncorrupted. This is why we always start with reprocessing and redocumenting.

How We Add Value?

We believe that even such generally trusted data sources as Eurostat often need to be reprocessed, because various legal and political constraints do not allow the common European statistical services to provide optimal quality data – for example, on the regional and city levels.
With rOpenGov and other partners, we are creating open-source statistical software in R to re-process these heterogeneous and low-quality data into tidy statistical indicators, and to validate and document them automatically.
Metadata is a potentially informative data record about a potentially informative dataset. We are carefully documenting and releasing administrative, processing, and descriptive metadata, following international metadata standards, to make our data easy to find and easy to use for data analysts.
We are automatically creating depositions and authoritative copies marked with an individual digital object identifier (DOI) to maintain data integrity.

Is There Value in Open Data?

A well-known story tells of a finance professor and a student who come across a $100 bill lying on the ground. As the student stops to pick it up, the professor says, “Don’t bother—if it were really a $100 bill, it wouldn’t be there.”

But this is not the case with open data. Often, you need to put more than $100 into processing, validating, and documenting a dataset that is worth $100.

In the EU, open data is governed by the Directive on open data and the re-use of public sector information, in short the Open Data Directive (EU) 2019/1024. It entered into force on 16 July 2019. It replaces the Public Sector Information Directive, also known as the PSI Directive, which dated from 2003 and was subsequently amended in 2013.

Open Data is potentially useful data that can potentially replace costlier or hard to get data sources to build information. It is analogous to potential energy: work is required to release it. We build automated systems that reduce this work and increase the likelihood that open data will offer the best value for money.

Most open data is not publicly accessible, and is available only upon request. Our real curatorial advantage is that we know where it is and how to get this request processed.
Most European open data comes from tax authorities, meteorological offices, managers of transport infrastructure, and other governmental bodies whose data needs are very different from yours. Their data must be carefully evaluated, re-processed, and if necessary, imputed to be usable for your scientific, business or policy goals.
The use of open science data is problematic in different ways: usually understanding the data documentation requires domain-specific specialist knowledge. Open science data is even more scattered and difficult to access than technically open, but not public governmental data.

From Datasets to Data Integration, Data to Information

“Data is only potential information, raw and unprocessed, prior to anyone actually being informed by it.” [2]

We are building simple databases and supporting APIs that release the data without restrictions, in a tidy format that is easy to join with other data, or easy to join into databases, together with standardized metadata.

Our service flow and value chain

FAQ

Why Downloading Does Not Work?

Most open data is not available on the internet.
If it is available, it is not in a form that you can easily import into a spreadsheet application like Excel or OpenOffice, or into a statistical application like SPSS or STATA.
Even the data quality of trusted web sources, like the Eurostat website, can be very low. Eurostat just publishes what it gets from governments, and often has no mandate to fix errors. The data is full of missing information and, in the case of regional statistics, faulty region codes and region names that make matching your data or placing them on a map impossible.
Reconciling values reported in euros with values reported in millions of euros, or correctly converting dollars to euros and pounds to kilograms, requires plenty of work. This is a very error-prone process when done by humans.
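A minimal sketch of how such unit harmonisation can be automated is shown below. The unit codes and the function name to_base_unit are assumptions for the example, and the exchange rate must be supplied by the caller because it varies over time.

```python
# Illustrative unit codes mapped to (base unit, multiplier). The codes are
# assumptions for this sketch, not a documented code list.
UNIT_TO_BASE = {
    "EUR": ("EUR", 1.0),
    "THS_EUR": ("EUR", 1_000.0),
    "MIO_EUR": ("EUR", 1_000_000.0),
    "KG": ("KG", 1.0),
    "LB": ("KG", 0.45359237),  # international avoirdupois pound
}


def to_base_unit(value, unit, usd_per_eur=None):
    """Convert a value to its base unit (EUR or KG); return (value, unit)."""
    if unit == "USD":
        # No default rate: the caller must provide the rate for the period.
        if usd_per_eur is None:
            raise ValueError("an exchange rate is required to convert USD to EUR")
        return value / usd_per_eur, "EUR"
    if unit not in UNIT_TO_BASE:
        raise ValueError(f"unknown unit: {unit}")
    base, multiplier = UNIT_TO_BASE[unit]
    return value * multiplier, base


observations = [
    (2.5, "MIO_EUR"),
    (750_000.0, "EUR"),
    (10.0, "LB"),
]
harmonised = [to_base_unit(v, u) for v, u in observations]
```

Raising an error for unknown units, instead of passing the value through unchanged, is a deliberate choice: a silently unconverted value is exactly the kind of mistake that manual processing tends to produce.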

Can Open Data be Used in Machine Learning and AI?

Most public and open data sources have many missing observations; machine learning models usually cannot handle missingness. These points must be carefully imputed with approximations, which can be very challenging when the data has a geographical dimension.
Removing missing values makes samples extremely biased and your model will learn from omissions, not information.
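As a simple illustration of imputation, the sketch below fills interior gaps in a single time series by linear interpolation and marks every value with its status. Linear interpolation is chosen here only because it is easy to follow; it is not necessarily the method used by the observatories, and it ignores the geographical dimension mentioned above. The status labels and sample values are hypothetical.

```python
def interpolate_gaps(series):
    """Fill interior gaps by linear interpolation.

    series: list of (time, value or None), sorted by time.
    Returns a list of (time, value, status) tuples.
    """
    known = [(t, v) for t, v in series if v is not None]
    out = []
    for t, v in series:
        if v is not None:
            out.append((t, v, "actual"))
            continue
        before = [(tk, vk) for tk, vk in known if tk < t]
        after = [(tk, vk) for tk, vk in known if tk > t]
        if before and after:
            (t0, v0), (t1, v1) = before[-1], after[0]
            estimate = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
            out.append((t, estimate, "imputed"))
        else:
            # Leading and trailing gaps would need backcasting or
            # forecasting, so they are left marked as missing here.
            out.append((t, None, "missing"))
    return out


SERIES = [(2015, None), (2016, 10.0), (2017, None), (2018, None), (2019, 16.0)]
filled = interpolate_gaps(SERIES)
```

Because each value carries its status, a model can be trained on the completed series while an analyst can still separate actual observations from estimates.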

Photo Credits

What’s the Problem with Open Data? illustration is a photo by Cristina Gottardi. How We Add Value? illustration is a photo by Nana Smirnova. Is There Value Left in It? is a photo by Imelda. Datasets Should Work Together to Give Information is a photo by Lucas Santos.

Footnote References

[1] Pomerantz, Jeffrey. 2015. “Metadata.” MIT Press Essential Knowledge series. Cambridge, Massachusetts; London, England: The MIT Press.

[2] Pomerantz, Jeffrey. 2015. “Metadata.” MIT Press Essential Knowledge series. Cambridge, Massachusetts; London, England: The MIT Press.

Small and Medium-Sized Enterprises

Wed, 12 May 2021 00:00:00 +0000

Photo credit: Tim Mossholder, Unsplash licence.

EU Datathon 2021

Tue, 16 Mar 2021 00:00:00 +0000

The EU Datathon 2021 Challenge

“To take part, you should propose the development of an application that links and uses open datasets.” - our data curator team
“Your application … is also expected to find suitable new approaches and solutions to help Europe achieve important goals set by the European Commission through the use of open data.” - this application is developed by our technology contributors
“Your application should showcase opportunities for concrete business models or social enterprises.” - our service development team is working to make this happen!
We use open source software and open data. The applications are hosted on the cloud resources of Reprex, an early-stage technology startup currently building a viable, open-source, open-data business model to create reproducible research products.
We are working together with experts in the domain as curators (check out our guidelines if you want to join: Data Curators: Get Inspired!).
Our development team works on an open collaboration basis. Our indicator R packages, and our services are developed together with rOpenGov.

Mission statement

We aim to win at least one first prize in the EU Datathon 2021. We are contesting all three challenges, which are related to the EU’s official strategic policies for the coming decade.

Challenge 1: A European Green Deal

Our Green Deal Data Observatory connects socio-economic and environmental data to help understand and combat climate change.

Challenge 1: A European Green Deal, with a particular focus on the European Climate Pact, the Organic Action Plan, and the New European Bauhaus, i.e., mitigation strategies.

Climate change and environmental degradation are an existential threat to Europe and the world. To overcome these challenges, the European Union created the European Green Deal strategic plan, which aims to make the EU’s economy sustainable by turning climate and environmental challenges into opportunities and making the transition just and inclusive for all.

Our Green Deal Data Observatory is a modern reimagination of existing ‘data observatories’; currently, there are over 70 permanent international data collection and dissemination points. One of our objectives is to understand why the dozens of the EU’s observatories do not use open data and reproducible research. We want to show that open governmental data, open science, and reproducible research can lead to a higher quality and faster data ecosystem that fosters growth for policy, business, and academic data users.

We provide high-quality, tidy data through a modern API which enables data flows between public and proprietary databases. We believe that introducing Open Policy Analysis standards with open data, open-source software, and research automation can help the Green Deal policymaking process. Our collaboration is open to individuals, citizen scientists, research institutes, NGOs, and companies.

Challenge 2: An economy that works for people

Our Economy Data Observatory will focus on competition, small and medium-sized enterprises, and robotization.

Challenge 3: A Europe fit for the digital age

Our Digital Music Observatory is not only a demo of the European Music Observatory, but a testing ground for data governance, Digital Services Act, and trustworthy AI problems.

Challenge 3: A Europe fit for the digital age, with a particular focus on Artificial Intelligence, the European Data Strategy, the Digital Services Act, Digital Skills and Connectivity.

The Digital Music Observatory (DMO) is a fully automated, open source, open data observatory that creates public datasets to provide a comprehensive view of the European music industry. It provides high-quality and timely indicators in all four pillars of the planned official European Music Observatory as a modern, open source and largely open data-based, automated, API-supported alternative solution for this planned observatory. The insight and methodologies we are refining in the DMO are applicable and transferable to about 60 other data observatories funded by the EU which do not currently employ governmental or scientific open data.

Music is one of the most data-driven service industries where most sales are currently executed by AI-driven autonomous systems that influence market shares and intellectual property remuneration. We provide a template that enables making these AI-driven systems accountable and trustworthy, with the goal of re-balancing the legitimate interests of creators, distributors, and consumers. Within Europe, this new balance will be an important use case of the European Data Strategy and the Digital Services Act.

The DMO is a fully functional service that can serve as a testing ground of the European Data Strategy. It can showcase the ways in which the music industry is affected by the problems that the Digital Services Act and European Trustworthy AI initiatives attempt to regulate. It is being built in open collaboration with national music stakeholders, NGOs, academic institutions, and industry groups.

Problem Statement

As open science principles gain increased acceptance, scientific researchers are making hundreds of thousands of valuable datasets public and available for replication every year.

Our automated data observatories are modern reimaginations of the existing observatories that do not use open data and research automation.

Our solution

We are empowering data curators with reproducible research solutions to create high-quality, rigorously tested original datasets from low-quality, unvalidated, untidy open data. We help them design meaningful business, policy or scientific indicators and provide them with software and an API to keep the data up-to-date. We help them deposit a copy of the authoritative, uncompromised dataset on Zenodo, the EU’s data repository, with a DOI or new DOI version.
We create a research workflow that periodically (daily, weekly, monthly, quarterly or annually) collects, corrects and re-processes the data. We use peer-reviewed statistical software and unit-tests to make sure that the data is sound.

Panning out gold from muddy open sources - with automation technology.

We add value by correcting open (and proprietary!) data problems that make open data hard to use, and proprietary, in-house data hard to re-use.
- regions corrects inconsistent geographical coding. Eurostat has no mandate to correct geographical coding, and member states do not historically adjust their data. With many thousands of parish, county, region, province, and state boundary changes within states, regional and metropolitan area datasets are not usable without our software.
- iotables puts extremely complex national accounts data into actually useful environmental and economic impact indicators. Instead of working with each country separately, our standardized system can calculate direct and indirect effects, as well as multipliers for every European country that works in the European statistical framework (EU member states, EEA, UK, member candidates.)
- retroharmonize connects cross-sectional surveys with non-European countries, puts pan-European surveys into time series, and corrects regional subsamples. We are creating new indicators from Eurobarometer, Afrobarometer, Arab Barometer, and standardized CAP surveys, as well as other harmonized surveys. We help design surveys that can utilize data from already existing, openly available surveys.
We place the authoritative copy in a data repository (Zenodo or Dataverse), automatically document the data, and make it available in a modern API for SQL queries or CSV downloads.
We present the data with commentary and blog posts from our curators (see: Is Drought Risk Uninsurable? - solidarity and climate change in Belgium) and contributors on a semi-automatically refreshed, open source web portal.
We are perfecting the agile open collaboration model in a triangular setting, where corporate users, scientific researchers, public and non-governmental policy makers, and even citizen scientists can work around a single data ecosystem.
We are validating a business model that allows the commercial, scientific, and policy use of re-processed, high quality data products made from open and shared data.
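
The automated checks mentioned in the workflow above can be pictured with a minimal base R sketch. It is only an illustration under stated assumptions: the toy table, its column names (geo, year, value) and the validate_indicator() helper are hypothetical and are not taken from our packages.

```r
# Toy indicator table; the column names are hypothetical.
indicator <- data.frame(
  geo   = c("BE10", "BE21", "BE22"),
  year  = c(2019L, 2019L, 2019L),
  value = c(12.5, NA, 9.8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a named logical vector of soundness checks.
validate_indicator <- function(df) {
  c(
    has_required_columns = all(c("geo", "year", "value") %in% names(df)),
    unique_keys          = !any(duplicated(df[, c("geo", "year")])),
    no_missing_values    = !anyNA(df$value),
    values_non_negative  = all(df$value >= 0, na.rm = TRUE)
  )
}

checks <- validate_indicator(indicator)
print(checks)
```

In a scheduled workflow, a failed check (here the missing value for BE21) would stop the release or trigger an imputation step before the dataset is published.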

Identifying Roadblocks to Net Zero Legislation

Tue, 16 Mar 2021 00:00:00 +0000

In our use case we are merging data about Europe’s coal regions, harmonized surveys about the acceptance of climate policies, and socio-economic data. While the work starts out from existing European research, our retroharmonize survey harmonization solution, our regions sub-national boundary harmonization solution and iotables allow us to connect open data and open knowledge from other coal regions of the world, for example, from the Appalachian economy.

Policy Context

The Just Transition Platform aims to assist EU countries and regions to unlock the support available through the Just Transition Mechanism. It builds on and expands the work of the existing Initiative for Coal Regions in Transition, which already supports fossil fuel producing regions across the EU in achieving a just transition through tailored, needs-oriented assistance and capacity-building.

The Initiative has a secretariat that is co-run by Ecorys, Climate Strategies, ICLEI Europe, and the Wuppertal Institute for Climate. While the initiative is an EU project, it cooperates with other similar initiatives, for example, with the Coalfield Development social enterprise in the Appalachian economy.

Data Sources

Coal regions: Our starting point is the EU coal regions: opportunities and challenges ahead publication of the Joint Research Centre (JRC), the European Commission’s science and knowledge service. This publication maps Europe’s coal-dependent energy and transport infrastructure, and the regions that depend on coal-related jobs.
Harmonized Survey Data: The dataset of the Eurobarometer 91.3 (April 2019) harmonized survey. Our transition policy variable is the four-level agreement with the statement More public financial support should be given to the transition to clean energies even if it means subsidies to fossil fuels should be reduced (EN) and Davantage de soutien financier public devrait être donné à la transition vers les énergies propres même si cela signifie que les subventions aux énergies fossiles devraient être réduites (FR), which is then translated into the languages of all participating countries.
Environmental Variables: We used data on particulate matter (PM) and SO2 pollution measured by participating stations in the European Environment Agency’s monitoring program. The station locations were mapped by Milos to the NUTS sub-national regions.

Exploratory Data Analysis

Our coal-dependency dummy variable is based on the policy document Coal regions in transition.

readRDS(file.path("data", "coal_regions.rds"))

## # A tibble: 253 x 5
##    country_code_is~ region_nuts_nam~ region_nuts_cod~ coal_region is_coal_region
##    <chr>            <fct>            <chr>            <chr>                <dbl>
##  1 BE               Brussels hoofds~ BE10             <NA>                     0
##  2 BE               Liege            BE33             <NA>                     0
##  3 BE               Brabant Wallon   BE31             <NA>                     0
##  4 BE               Antwerpen        BE21             <NA>                     0
##  5 BE               Limburg [BE]     BE22             <NA>                     0
##  6 BE               Oost-Vlaanderen  BE23             <NA>                     0
##  7 BE               Vlaams Brabant   BE24             <NA>                     0
##  8 BE               West-Vlaanderen  BE25             <NA>                     0
##  9 BE               Hainaut          BE32             <NA>                     0
## 10 BE               Namur            BE35             <NA>                     0
## # ... with 243 more rows

Our exploratory data analysis shows that in 2019, respondents’ agreement with the policy measure differed significantly among EU member states and regions.

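library(dplyr)   # mutate(), case_when(), left_join() and the %>% pipe
library(tibble)  # rowid_to_column()

# normalize_text() is a helper that is not shown in this post
transition_policy <- eb19_raw %>%
  rowid_to_column() %>%
  mutate ( transition_policy = normalize_text(transition_policy)) %>%
  # create one 0/1 dummy column for each answer category
  fastDummies::dummy_cols(select_columns = 'transition_policy') %>%
  # collapse the two agreeing categories into a single indicator
  mutate ( transition_policy_agree = case_when(
    transition_policy_totally_agree + transition_policy_tend_to_agree > 0 ~ 1, 
    TRUE ~ 0
  )) %>%
  # collapse the two disagreeing categories into a single indicator
  mutate ( transition_policy_disagree = case_when(
    transition_policy_totally_disagree + transition_policy_tend_to_disagree > 0 ~ 1, 
    TRUE ~ 0
  )) 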

eb19_df  <- transition_policy %>% 
  left_join ( air_pollutants, by = 'region_nuts_codes' ) %>%
  mutate ( is_poland = ifelse ( country_code == "PL", 1, 0))
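
The recoding step above can be reproduced on a toy vector with base R alone. This is a minimal sketch, not the pipeline itself: the answer labels are assumed to follow the four-level scale described in the data sources, and the lower-casing with underscores only imitates what a text-normalizing helper might do.

```r
# Toy answers on the four-level agreement scale, with one missing response.
answers <- c("Totally agree", "Tend to agree", "Tend to disagree",
             "Totally disagree", NA, "Tend to agree")

# Assumed normalization: lower case, spaces replaced by underscores.
normalized <- gsub(" ", "_", tolower(answers))

# 1 if the respondent agrees at either level, 0 otherwise.
# Missing answers become 0 here, mirroring a `TRUE ~ 0` fallback.
transition_policy_agree <- ifelse(
  normalized %in% c("totally_agree", "tend_to_agree"), 1, 0
)

print(data.frame(answers, transition_policy_agree))
```

Coding missing answers as 0 treats non-response as non-agreement. An analyst who prefers to keep them as NA would need an explicit branch for the missing case.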

Preliminary Results

Significantly more people agree

where there are more pollutants
among younger respondents
where people are more educated

Significantly fewer people agree

in rural areas
where more people are older
where more people are less educated
in less polluted areas
in coal regions

A simple model run:

c("transition_policy_totally_agree" , "pm10", "so2", "age_exact", "is_highly_educated" , "is_rural")

## [1] "transition_policy_totally_agree" "pm10"                           
## [3] "so2"                             "age_exact"                      
## [5] "is_highly_educated"              "is_rural"

summary( glm ( transition_policy_totally_agree ~ pm10 + so2 + 
                 age_exact +
                 is_highly_educated + is_rural + is_coal_region +
                 country_code, 
               data = eb19_df, 
               family = binomial ))

## 
## Call:
## glm(formula = transition_policy_totally_agree ~ pm10 + so2 + 
##     age_exact + is_highly_educated + is_rural + is_coal_region + 
##     country_code, family = binomial, data = eb19_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7690  -1.0253  -0.8165   1.2264   1.9085  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.1975096  0.0921551  -2.143 0.032095 *  
## pm10                0.0068505  0.0017445   3.927 8.60e-05 ***
## so2                 0.1381994  0.0405867   3.405 0.000662 ***
## age_exact          -0.0075018  0.0007873  -9.529  < 2e-16 ***
## is_highly_educated  0.2953905  0.0311127   9.494  < 2e-16 ***
## is_rural           -0.1277983  0.0313321  -4.079 4.53e-05 ***
## is_coal_region     -0.2624005  0.0640233  -4.099 4.16e-05 ***
## country_codeBE     -0.3290891  0.0916117  -3.592 0.000328 ***
## country_codeBG     -0.6470116  0.1125114  -5.751 8.89e-09 ***
## country_codeCY      0.8471483  0.1273306   6.653 2.87e-11 ***
## country_codeCZ     -0.5754008  0.0965974  -5.957 2.57e-09 ***
## country_codeDE      0.0106430  0.0856322   0.124 0.901088    
## country_codeDK      0.0577724  0.0925391   0.624 0.532429    
## country_codeEE     -0.8041188  0.0989047  -8.130 4.28e-16 ***
## country_codeES      1.1266903  0.0941495  11.967  < 2e-16 ***
## country_codeFI     -0.2617501  0.0946837  -2.764 0.005702 ** 
## country_codeFR      0.0130239  0.1639339   0.079 0.936678    
## country_codeGB      0.2454631  0.0891845   2.752 0.005918 ** 
## country_codeGR      0.2169278  0.1209199   1.794 0.072816 .  
## country_codeHR     -0.1632727  0.1001563  -1.630 0.103064    
## country_codeHU      0.5779928  0.1020987   5.661 1.50e-08 ***
## country_codeIT     -0.1427249  0.0940144  -1.518 0.128985    
## country_codeLU     -0.3111627  0.1140426  -2.728 0.006363 ** 
## country_codeLV     -0.6246590  0.0963526  -6.483 8.99e-11 ***
## country_codeMT      0.3303363  0.1228611   2.689 0.007173 ** 
## country_codeNL      0.1707080  0.0902189   1.892 0.058470 .  
## country_codePL     -0.2843198  0.1228657  -2.314 0.020664 *  
## country_codePT      0.1447295  0.0899079   1.610 0.107452    
## country_codeRO     -0.0479674  0.0930433  -0.516 0.606177    
## country_codeSE      0.4865939  0.0922486   5.275 1.33e-07 ***
## country_codeSK     -0.2427307  0.0964652  -2.516 0.011861 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 30568  on 22401  degrees of freedom
## Residual deviance: 29313  on 22371  degrees of freedom
##   (5253 observations deleted due to missingness)
## AIC: 29375
## 
## Number of Fisher Scoring iterations: 4

summary( glm ( transition_policy_agree ~ pm10 + so2 + age_exact +
                 is_highly_educated + is_rural, 
               data = eb19_df, 
               family = binomial ))

## 
## Call:
## glm(formula = transition_policy_agree ~ pm10 + so2 + age_exact + 
##     is_highly_educated + is_rural, family = binomial, data = eb19_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1970   0.5035   0.5803   0.6495   0.8465  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.807823   0.079297  22.798  < 2e-16 ***
## pm10                0.005092   0.001239   4.108 3.99e-05 ***
## so2                 0.003274   0.051410   0.064  0.94922    
## age_exact          -0.009781   0.000988  -9.900  < 2e-16 ***
## is_highly_educated  0.396743   0.039735   9.985  < 2e-16 ***
## is_rural           -0.107448   0.037953  -2.831  0.00464 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20488  on 22401  degrees of freedom
## Residual deviance: 20250  on 22396  degrees of freedom
##   (5253 observations deleted due to missingness)
## AIC: 20262
## 
## Number of Fisher Scoring iterations: 4
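
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies a few estimates and the deviances from the first model's printed output above. It does not refit the model, and the likelihood-ratio test against the intercept-only model is our own addition for illustration.

```r
# Estimates copied from the printed output of the first model.
log_odds <- c(
  pm10               =  0.0068505,
  so2                =  0.1381994,
  age_exact          = -0.0075018,
  is_highly_educated =  0.2953905,
  is_rural           = -0.1277983,
  is_coal_region     = -0.2624005
)

# Odds ratio: multiplicative change in the odds for a one-unit increase.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))

# Likelihood-ratio test of the fitted model against the null model,
# using the deviances and degrees of freedom reported above.
lr_stat <- 30568 - 29313   # null deviance minus residual deviance
lr_df   <- 22401 - 22371   # difference in degrees of freedom
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

Read this way, living in a coal region multiplies the odds of totally agreeing by roughly 0.77, holding the other covariates fixed, while being highly educated multiplies them by roughly 1.34.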

Next Steps

After careful documentation, we will very soon publish all the processed, clean datasets on the EU Zenodo repository with clear digital object identification and versioning.
We will seek contact with the Secretariat of the Initiative for Coal Regions in Transition to process all the data annexes in the EU coal regions: opportunities and challenges ahead report.
With our volunteers we want to include coal regions from the United States, Latin America, Australia, Africa first – because we have harmonized survey results – and gradually add the rest of the world.
We will ask political scientists and policy researchers to interpret our findings.

Reprex Open Data Day 2021

Sat, 06 Mar 2021 15:30:00 +0200

Open Data Day is an annual celebration of open data all over the world. It is an opportunity to show the benefits of open data and encourage the adoption of open data policies in government, business, and civil society. Reprex is a start-up that utilizes open data with open-source reproducible research: please challenge us with your data requests and participate in our web events.

The Reprex Open Data Day 2021 will be two informal conversations based on a series of run-up introductory blogposts centered around two themes. Because important guests became ill in the last few days, we are going to consolidate the two talks into one with less structure. We want to create an informal, inclusive, collaborative online event on International Open Data Day 2021. Please, grab a tea, coffee, or even a beer, and join us for an informal conversation. We hope that we will finish the afternoon with ideas on new, open-data driven collaborations.

9.30 EST / 15.30 CET: Open collaboration in business, policy and science. Creating evidence-based policy, business strategy or scientific research with small contributions with independent components with incentives. Short introduction with examples: joining environmental sensory data and public opinion data on maps; creating harmonized datasets across the Arab world. Survey harmonization, mapping, data products. Scaling up open collaboration: making small organizations competitive with big tech in the big data era. Data sharing, data pooling, data altruism and observatories. The new European trustworthy AI and data governance agenda.

You can click through a short presentation to familiarize yourself with our topics.

See you here.

Case studies:

We are connecting raw survey data about Climate Awareness in Eurobarometer surveys. Here is the reproduction code (intermediate to advanced R needed.) You should use the development version of our retroharmonize package at github.com/antaldaniel/retroharmonize
We are tracking changes in the boundaries of provinces, states, counties, parishes with our regions open source software – reproduction code here. You will need our regions package which is available on CRAN or in the rOpenGov GitHub repo.
We will talk about how to join this with air pollution data and put it on the map with Milos Popovic, who prepared this nice choropleth animation.

We will discuss data observatories (permanent data collection programs), open collaboration (open-source inspired way of cooperation among small and large independent actors) and data altruism.

Any questions: send Daniel a message on Keybase, WhatsApp or email.

Hello on International #OpenDataDay2021 from🌷 the Hague!
- We have brought some new data to the light about 🌡climate change awareness
- We created some tutorials how to harmonize survey and geographical data
- Join us at 9.30 EST/15.30 CET 👇https://t.co/7J7pvi3sPC #ODD2021 pic.twitter.com/DwkGQaDhW1
— dataandlyrics (@dataandlyrics) March 6, 2021

Where Are People More Likely To Treat Climate Change as the Most Serious Global Problem?

Sat, 06 Mar 2021 00:00:00 +0000

library(regions)
library(lubridate)
library(dplyr)

if ( dir.exists('data-raw') ) {
  data_raw_dir <- "data-raw"
} else {
  data_raw_dir <- file.path("..", "..", "data-raw")
}

The first results of our longitudinal table were difficult to map, because the surveys used an obsolete regional coding. We will adjust the wrong coding, when possible, and join the data with the European Environment Agency’s (EEA) Air Quality e-Reporting (AQ e-Reporting) data on environmental pollution. We recoded the annual level for every available reporting station [not shown here] and all values are in μg/m3. The period under observation is 2014-2016. Data file: https://www.eea.europa.eu/data-and-maps/data/aqereporting-8 (European Environment Agency 2021).

Recoding the Regions

Recoding means that the boundaries are unchanged, but the country changed the names and codes of regions because there were other boundary changes which did not affect our observation unit. We explain the problem and the solution in greater detail in our tutorial that aggregates the data on regional levels.

panel <- readRDS(file.path(data_raw_dir, "climate-panel.rds"))

climate_data_geocode <-  panel %>%
  mutate ( year = lubridate::year(date_of_interview)) %>%
  recode_nuts()
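
Conceptually, a recoding is a lookup from obsolete codes to current ones. The base R sketch below shows the idea with a hypothetical correspondence table. The codes are invented for illustration and this is not how the regions package is implemented.

```r
# Hypothetical correspondence table: obsolete code -> current code.
recode_table <- data.frame(
  code_old = c("XX11", "XX12"),
  code_new = c("XXA1", "XXA2"),
  stringsAsFactors = FALSE
)

# Toy survey data mixing obsolete, current and unknown regional codes.
survey <- data.frame(
  respondent = 1:4,
  region     = c("XX11", "XXA2", "XX12", "ZZ99"),
  stringsAsFactors = FALSE
)

# Look up each code; codes without a match are kept unchanged.
idx <- match(survey$region, recode_table$code_old)
survey$code_current <- ifelse(is.na(idx), survey$region,
                              recode_table$code_new[idx])
survey$was_recoded  <- !is.na(idx)

print(survey)
```

A production tool also has to decide what to do with codes such as ZZ99 that appear in no known typology. This sketch simply passes them through.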

Let’s load the air pollution data and join it by the corrected geocodes:

load(file.path("data", "air_pollutants.rda")) ## good practice to use system-independent file.path

climate_awareness_air <- climate_data_geocode %>%
  rename ( region_nuts_codes  = code_2016) %>%
  left_join ( air_pollutants, by = "region_nuts_codes" ) %>%
  select ( -all_of(c("w1", "wex", "date_of_interview", 
                     "typology", "typology_change", "geo", "region"))) %>%
  mutate (
    # remove special labels and create NA_numeric_ 
    age_education = retroharmonize::as_numeric(age_education)) %>%
  mutate_if ( is.character, as.factor) %>%
  mutate ( 
    # we only have responses from 4 years, and this should be treated as a categorical variable
    year = as.factor(year) 
    ) %>%
  filter ( complete.cases(.) )

The climate_awareness_air data frame contains the answers of 75086 individual respondents. 17.07% thought that climate change was the most serious world problem and 33.6% mentioned climate change as one of the three most important global problems.
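
The two mechanics behind these figures, dropping incomplete rows and reading the mean of a 0/1 variable as a share, can be checked on a toy table. The values below are invented, and only the column names echo the real data.

```r
# Toy data: a 0/1 answer and a pollution measure with missing values.
toy <- data.frame(
  serious_world_problems_first = c(1, 0, 0, 1, 0, 0),
  pm10                         = c(30.1, NA, 28.4, 41.0, 33.3, NA)
)

# Keep only the rows without any missing value.
complete  <- toy[complete.cases(toy), ]
n_dropped <- nrow(toy) - nrow(complete)

# The mean of a 0/1 indicator is the share of respondents coded 1.
share_first <- mean(complete$serious_world_problems_first)

print(c(rows_kept = nrow(complete), rows_dropped = n_dropped,
        share_first = share_first))
```

Because every respondent in a region without pollution data is dropped, complete-case filtering can shift the reported shares, which is one reason to compare the sample before and after the filter.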

summary ( climate_awareness_air  )

##                  rowid       serious_world_problems_first
##  ZA5877_v2-0-0_1    :    1   Min.   :0.0000              
##  ZA5877_v2-0-0_10   :    1   1st Qu.:0.0000              
##  ZA5877_v2-0-0_100  :    1   Median :0.0000              
##  ZA5877_v2-0-0_1000 :    1   Mean   :0.1707              
##  ZA5877_v2-0-0_10000:    1   3rd Qu.:0.0000              
##  ZA5877_v2-0-0_10001:    1   Max.   :1.0000              
##  (Other)            :75080                               
##  serious_world_problems_climate_change    isocntry    
##  Min.   :0.000                         BE     : 3028  
##  1st Qu.:0.000                         CZ     : 3023  
##  Median :0.000                         NL     : 3019  
##  Mean   :0.336                         SK     : 3000  
##  3rd Qu.:1.000                         SE     : 2980  
##  Max.   :1.000                         DE-W   : 2978  
##                                        (Other):57058  
##                                    marital_status         age_education  
##  (Re-)Married: without children           :13242   18            :15485  
##  (Re-)Married: children this marriage     :12696   19            : 7728  
##  Single: without children                 : 7650   16            : 5840  
##  (Re-)Married: w children of this marriage: 6520   still studying: 5098  
##  (Re-)Married: living without children    : 6225   17            : 5092  
##  Single: living without children          : 4102   15            : 4528  
##  (Other)                                  :24651   (Other)       :31315  
##    age_exact                      occupation_of_respondent
##  Min.   :15.0   Retired, unable to work       :22911      
##  1st Qu.:36.0   Skilled manual worker         : 6774      
##  Median :51.0   Employed position, at desk    : 6716      
##  Mean   :50.1   Employed position, service job: 5624      
##  3rd Qu.:65.0   Middle management, etc.       : 5252      
##  Max.   :99.0   Student                       : 5098      
##                 (Other)                       :22711      
##             occupation_of_respondent_recoded
##  Employed (10-18 in d15a)   :32763          
##  Not working (1-4 in d15a)  :37125          
##  Self-employed (5-9 in d15a): 5198          
##                                             
##                                             
##                                             
##                                             
##                        respondent_occupation_scale_c_14
##  Retired (4 in d15a)                   :22911          
##  Manual workers (15 to 18 in d15a)     :15269          
##  Other white collars (13 or 14 in d15a): 9203          
##  Managers (10 to 12 in d15a)           : 8291          
##  Self-employed (5 to 9 in d15a)        : 5198          
##  Students (2 in d15a)                  : 5098          
##  (Other)                               : 9116          
##                   type_of_community   is_student      no_education     
##  DK                        :   34   Min.   :0.0000   Min.   :0.000000  
##  Large town                :20939   1st Qu.:0.0000   1st Qu.:0.000000  
##  Rural area or village     :24686   Median :0.0000   Median :0.000000  
##  Small or middle sized town: 9850   Mean   :0.0679   Mean   :0.008151  
##  Small/middle town         :19577   3rd Qu.:0.0000   3rd Qu.:0.000000  
##                                     Max.   :1.0000   Max.   :1.000000  
##                                                                        
##    education       year       region_nuts_codes  country_code  
##  Min.   :14.00   2013:25103   LU     : 1432     DE     : 4531  
##  1st Qu.:17.00   2015:    0   MT     : 1398     GB     : 3538  
##  Median :18.00   2017:25053   CY     : 1192     BE     : 3028  
##  Mean   :19.61   2019:24930   SK02   : 1053     CZ     : 3023  
##  3rd Qu.:22.00                EL30   :  974     NL     : 3019  
##  Max.   :30.00                EE     :  973     SK     : 3000  
##                               (Other):68064     (Other):54947  
##      pm2_5             pm10               o3              BaP        
##  Min.   : 2.109   Min.   :  5.883   Min.   : 66.37   Min.   :0.0102  
##  1st Qu.: 9.374   1st Qu.: 28.326   1st Qu.: 90.89   1st Qu.:0.1779  
##  Median :11.866   Median : 33.673   Median :102.81   Median :0.4105  
##  Mean   :12.954   Mean   : 38.637   Mean   :101.49   Mean   :0.8759  
##  3rd Qu.:15.890   3rd Qu.: 49.488   3rd Qu.:110.73   3rd Qu.:1.0692  
##  Max.   :41.293   Max.   :123.239   Max.   :141.04   Max.   :7.8050  
##                                                                      
##       so2              ap_pc1            ap_pc2             ap_pc3       
##  Min.   : 0.0000   Min.   :-4.6669   Min.   :-2.21851   Min.   :-2.1007  
##  1st Qu.: 0.0000   1st Qu.:-0.4624   1st Qu.:-0.49130   1st Qu.:-0.5695  
##  Median : 0.0000   Median : 0.4263   Median : 0.02902   Median :-0.1113  
##  Mean   : 0.1032   Mean   : 0.1031   Mean   : 0.04166   Mean   :-0.1746  
##  3rd Qu.: 0.0000   3rd Qu.: 0.9748   3rd Qu.: 0.57416   3rd Qu.: 0.3309  
##  Max.   :42.5325   Max.   : 2.0344   Max.   : 3.25841   Max.   : 4.1615  
##                                                                          
##      ap_pc4            ap_pc5        
##  Min.   :-1.7387   Min.   :-2.75079  
##  1st Qu.:-0.1669   1st Qu.:-0.18748  
##  Median : 0.0371   Median : 0.01811  
##  Mean   : 0.1154   Mean   : 0.06797  
##  3rd Qu.: 0.3050   3rd Qu.: 0.34937  
##  Max.   : 3.2476   Max.   : 1.42816  
##

Let’s see a simple CART tree! We remove the regional codes, because there are very large regional differences in climate awareness. These differences, together with education level and the survey year, are the most important predictors of thinking about climate change as the most important global problem in Europe.

# Classification Tree with rpart
library(rpart)

# grow tree
fit <- rpart(as.factor(serious_world_problems_first) ~ .,
   method="class", data=climate_awareness_air %>%
     select ( - all_of(c("rowid", "region_nuts_codes"))), 
   control = rpart.control(cp = 0.005))

printcp(fit) # display the results

## 
## Classification tree:
## rpart(formula = as.factor(serious_world_problems_first) ~ ., 
##     data = climate_awareness_air %>% select(-all_of(c("rowid", 
##         "region_nuts_codes"))), method = "class", control = rpart.control(cp = 0.005))
## 
## Variables actually used in tree construction:
## [1] age_education                         isocntry                             
## [3] serious_world_problems_climate_change year                                 
## 
## Root node error: 12817/75086 = 0.1707
## 
## n= 75086 
## 
##          CP nsplit rel error  xerror      xstd
## 1 0.0240566      0   1.00000 1.00000 0.0080438
## 2 0.0082703      3   0.92783 0.92783 0.0078055
## 3 0.0050000      5   0.91129 0.91425 0.0077588

plotcp(fit) # visualize cross-validation results
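
The rel error and xerror columns of the complexity table are scaled by the root node error, so multiplying them back gives misclassification rates. The sketch below uses only numbers copied from the printed output. The column names of the helper table are our own.

```r
# Root node error as reported: misclassified cases / all cases.
root_node_error <- 12817 / 75086

# Complexity table values copied from the printed output.
cp_table <- data.frame(
  nsplit    = c(0, 3, 5),
  rel_error = c(1.0000000, 0.9278302, 0.9112897),
  xerror    = c(1.0000000, 0.9278302, 0.9142545)
)

# Convert relative errors to absolute misclassification rates.
cp_table$train_error_rate <- cp_table$rel_error * root_node_error
cp_table$cv_error_rate    <- cp_table$xerror    * root_node_error

print(round(cp_table, 4))
```

On this scale the tree with five splits lowers the cross-validated misclassification rate from about 17.1% to roughly 15.6%, a modest gain over always predicting the majority class.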

summary(fit) # detailed summary of splits

## Call:
## rpart(formula = as.factor(serious_world_problems_first) ~ ., 
##     data = climate_awareness_air %>% select(-all_of(c("rowid", 
##         "region_nuts_codes"))), method = "class", control = rpart.control(cp = 0.005))
##   n= 75086 
## 
##            CP nsplit rel error    xerror        xstd
## 1 0.024056592      0 1.0000000 1.0000000 0.008043837
## 2 0.008270266      3 0.9278302 0.9278302 0.007805478
## 3 0.005000000      5 0.9112897 0.9142545 0.007758824
## 
## Variable importance
## serious_world_problems_climate_change                              isocntry 
##                                    31                                    26 
##                          country_code                                   BaP 
##                                    20                                     8 
##                                 pm2_5                                ap_pc1 
##                                     4                                     3 
##                         age_education                                  pm10 
##                                     2                                     2 
##                             education                                ap_pc2 
##                                     2                                     1 
##                                  year 
##                                     1 
## 
## Node number 1: 75086 observations,    complexity param=0.02405659
##   predicted class=0  expected loss=0.1706976  P(node) =1
##     class counts: 62269 12817
##    probabilities: 0.829 0.171 
##   left son=2 (25229 obs) right son=3 (49857 obs)
##   Primary splits:
##       serious_world_problems_climate_change < 0.5          to the right, improve=2214.2040, (0 missing)
##       isocntry                              splits as  RRLLLRRRLLRLRLLLLLLLLLLRRLLLRLL, improve= 728.0160, (0 missing)
##       country_code                          splits as  RRLLLRRLLRLLLLLLLLLLRRLLLRLL, improve= 673.3656, (0 missing)
##       BaP                                   < 0.4300347    to the right, improve= 310.6229, (0 missing)
##       pm2_5                                 < 13.38264     to the right, improve= 296.4013, (0 missing)
##   Surrogate splits:
##       age_education splits as  ----RRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRL-RRR-RRRRRRRRR--RRRLLR--R-R, agree=0.664, adj=0, (0 split)
##       pm10          < 7.491315     to the left,  agree=0.664, adj=0, (0 split)
## 
## Node number 2: 25229 observations
##   predicted class=0  expected loss=0  P(node) =0.3360014
##     class counts: 25229     0
##    probabilities: 1.000 0.000 
## 
## Node number 3: 49857 observations,    complexity param=0.02405659
##   predicted class=0  expected loss=0.2570752  P(node) =0.6639986
##     class counts: 37040 12817
##    probabilities: 0.743 0.257 
##   left son=6 (34631 obs) right son=7 (15226 obs)
##   Primary splits:
##       isocntry     splits as  RRLLLRRRLLRLRLLLLLLLLLLRRLLLRLL, improve=1454.9460, (0 missing)
##       country_code splits as  RRLLLRRLLRLLLLLLLLLLRRLLLRLL, improve=1359.7210, (0 missing)
##       BaP          < 0.4300347    to the right, improve= 629.8844, (0 missing)
##       pm2_5        < 13.38264     to the right, improve= 555.7484, (0 missing)
##       ap_pc1       < -0.005459537 to the left,  improve= 533.3579, (0 missing)
##   Surrogate splits:
##       country_code splits as  RRLLLRRLLRLLLLLLLLLLRRLLLRLL, agree=0.987, adj=0.957, (0 split)
##       BaP          < 0.1749425    to the right, agree=0.775, adj=0.264, (0 split)
##       pm2_5        < 5.206993     to the right, agree=0.737, adj=0.140, (0 split)
##       ap_pc1       < 1.405527     to the left,  agree=0.733, adj=0.126, (0 split)
##       pm10         < 25.31211     to the right, agree=0.718, adj=0.076, (0 split)
## 
## Node number 6: 34631 observations
##   predicted class=0  expected loss=0.1769802  P(node) =0.4612178
##     class counts: 28502  6129
##    probabilities: 0.823 0.177 
## 
## Node number 7: 15226 observations,    complexity param=0.02405659
##   predicted class=0  expected loss=0.4392487  P(node) =0.2027808
##     class counts:  8538  6688
##    probabilities: 0.561 0.439 
##   left son=14 (11607 obs) right son=15 (3619 obs)
##   Primary splits:
##       isocntry      splits as  LL---LLR--L-L----------LL---R--, improve=337.5462, (0 missing)
##       country_code  splits as  LL---LR--L-L--------LL---R--, improve=337.5462, (0 missing)
##       age_education splits as  ----LLLLLL-LLLRRRRRRR-RRRRRRRRRL-RRRRRRLLRR-RRRRLLRLRL-RRLRRR-RRR-LLLLRRR-----LR-----L-R, improve=294.0807, (0 missing)
##       education     < 22.5         to the left,  improve=262.3747, (0 missing)
##       BaP           < 0.053328     to the right, improve=232.7043, (0 missing)
##   Surrogate splits:
##       BaP           < 0.053328     to the right, agree=0.878, adj=0.485, (0 split)
##       pm2_5         < 4.810361     to the right, agree=0.827, adj=0.271, (0 split)
##       ap_pc2        < 0.8746175    to the left,  agree=0.792, adj=0.124, (0 split)
##       so2           < 0.3302972    to the left,  agree=0.781, adj=0.078, (0 split)
##       age_education splits as  ----LLLLLL-LLLLLLLRLR-LRRLRRRRRR-RRRRLLLLLR-LRLRLLRRLL-LLRLLR-LLR-RRLLLLL-----RR-----R-L, agree=0.779, adj=0.071, (0 split)
## 
## Node number 14: 11607 observations,    complexity param=0.008270266
##   predicted class=0  expected loss=0.3804601  P(node) =0.1545827
##     class counts:  7191  4416
##    probabilities: 0.620 0.380 
##   left son=28 (7462 obs) right son=29 (4145 obs)
##   Primary splits:
##       age_education                    splits as  ----LLLLLL-LRRRRRRRRR-RRLRRLRRLL-RRRRLRLLRR-RLRLLLRLRL-RR-RR--RRL-L-LLRRR------------L-R, improve=123.71070, (0 missing)
##       year                             splits as  R-LR, improve=107.79460, (0 missing)
##       education                        < 20.5         to the left,  improve= 90.28724, (0 missing)
##       occupation_of_respondent         splits as  LRRLRRRRRLRLLLRLLL, improve= 84.62865, (0 missing)
##       respondent_occupation_scale_c_14 splits as  LRLLLRRL, improve= 68.88653, (0 missing)
##   Surrogate splits:
##       education                        < 20.5         to the left,  agree=0.950, adj=0.861, (0 split)
##       occupation_of_respondent         splits as  LLLLRLLRRLRLLLRLLL, agree=0.738, adj=0.267, (0 split)
##       respondent_occupation_scale_c_14 splits as  LRLLLLRL, agree=0.733, adj=0.251, (0 split)
##       is_student                       < 0.5          to the left,  agree=0.709, adj=0.186, (0 split)
##       age_exact                        < 23.5         to the right, agree=0.676, adj=0.094, (0 split)
## 
## Node number 15: 3619 observations
##   predicted class=1  expected loss=0.3722023  P(node) =0.04819807
##     class counts:  1347  2272
##    probabilities: 0.372 0.628 
## 
## Node number 28: 7462 observations
##   predicted class=0  expected loss=0.326052  P(node) =0.09937938
##     class counts:  5029  2433
##    probabilities: 0.674 0.326 
## 
## Node number 29: 4145 observations,    complexity param=0.008270266
##   predicted class=0  expected loss=0.4784077  P(node) =0.05520337
##     class counts:  2162  1983
##    probabilities: 0.522 0.478 
##   left son=58 (2573 obs) right son=59 (1572 obs)
##   Primary splits:
##       year                     splits as  L-LR, improve=40.13885, (0 missing)
##       occupation_of_respondent splits as  LRLLRRRRRLRLLLRLLL, improve=18.33254, (0 missing)
##       marital_status           splits as  LRRRLRRRLRRLRLLRRRRRRLRLRLLRR, improve=17.86888, (0 missing)
##       type_of_community        splits as  LRLRL, improve=17.55254, (0 missing)
##       age_education            splits as  ------------LLRRRRRRR-RR-RL-RR---LRRR-R--LR-R-R---R-R--RR-RR--RR------RRR--------------R, improve=14.66121, (0 missing)
##   Surrogate splits:
##       type_of_community splits as  LLLRL, agree=0.777, adj=0.412, (0 split)
##       marital_status    splits as  RRLLLLLRLLLLLLLRRRLLLLLLRLRLL, agree=0.680, adj=0.155, (0 split)
##       isocntry          splits as  LL---LL---L-R----------LL------, agree=0.669, adj=0.127, (0 split)
##       country_code      splits as  LL---L---L-R--------LL------, agree=0.669, adj=0.127, (0 split)
##       o3                < 83.06345     to the right, agree=0.650, adj=0.076, (0 split)
## 
## Node number 58: 2573 observations
##   predicted class=0  expected loss=0.4240187  P(node) =0.03426737
##     class counts:  1482  1091
##    probabilities: 0.576 0.424 
## 
## Node number 59: 1572 observations
##   predicted class=1  expected loss=0.43257  P(node) =0.02093599
##     class counts:   680   892
##    probabilities: 0.433 0.567
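
The node summaries above can be checked by hand. The following is a minimal sketch, assuming the default loss matrix (so that "expected loss" is simply the share of the minority class in a node) and assuming the root node size equals the sum of nodes 2 and 3; the helper name expected_loss is ours, not part of rpart.

```r
# Class counts copied from the printout above: (class 0, class 1)
node_2 <- c(25229, 0)
node_3 <- c(37040, 12817)

# With the default loss matrix, the expected loss of a node is the
# share of observations that do not belong to the predicted class
expected_loss <- function(counts) 1 - max(counts) / sum(counts)

# Assumption: the root holds exactly the observations of its two children
n_root <- sum(node_2) + sum(node_3)

loss_3   <- expected_loss(node_3)
p_node_3 <- sum(node_3) / n_root

round(c(expected_loss = loss_3, p_node = p_node_3), 7)
```

Both numbers should agree with the values printed for node number 3.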

# plot tree
plot(fit, uniform=TRUE,
   main="Classification Tree: Climate Change Is The Most Serious Threat")
text(fit, use.n=TRUE, all=TRUE, cex=.8)

## Warning in labels.rpart(x, minlength = minlength): more than 52 levels in a
## predicting factor, truncated for printout

saveRDS(climate_awareness_air, file = file.path(tempdir(), "climate_panel_recoded.rds"), version = 2)

# not evaluated
saveRDS(climate_awareness_air, file = file.path("data-raw", "climate-panel_recoded.rds"))

What is Retrospective Survey Harmonization?

Thu, 04 Mar 2021 00:00:00 +0000

Reproducible ex post harmonization of survey microdata

Retrospective survey harmonization allows the comparison of opinion polls conducted in different countries or at different times. In this example we are working with data from surveys that were ex ante harmonized to a certain degree – in our tutorials we are choosing questions that were asked in the same way in many natural languages. For example, you can compare what percentage of people in various European countries, provinces and regions thought climate change was a serious world problem back in 2013, 2015, 2017 and 2019.

We developed the retroharmonize R package to help this process. We have tested the package extensively with about 80 Eurobarometer and 5 Afrobarometer survey files, and to a lesser extent with Arab Barometer files. This allows the comparison of various survey answers in about 70 countries. These policy-oriented survey programs were designed to be harmonized to a certain degree, but their ex post harmonization is still necessary, challenging and error-prone. Retrospective harmonization includes harmonizing the different codings used for questions and answer options, the post-stratification weights, and the different file formats.

Eurobarometer, Afrobarometer, Arab Barometer and Latinobarómetro make survey files that are harmonized across countries available for research under various terms. Our retroharmonize package is not affiliated with them, and to run our examples, you must visit their websites, carefully read their terms, agree to them, and download their data yourself. The value we add is that we help connect their files across time (from different years) or across these programs.

The survey programs mentioned above publish their data in the proprietary SPSS format. This file format can be imported and translated to R objects with the haven package; however, we needed to re-design haven’s labelled_spss class to maintain far more metadata. This class is, in turn, a modification of the labelled class. The haven package was designed and tested with data stored in individual SPSS files.

The author of labelled, Joseph Larmarange, describes two main approaches to working with labelled data, such as SPSS’s method of storing categorical data, in the Introduction to labelled.

Two main approaches of labelled data conversion.

Our approach is a further extension of Approach B. Survey harmonization in our case always means joining data from several SPSS files, which requires consistent coding across several data sources. This means that data cleaning and recoding must take place before conversion to factors, character or numeric vectors. This is particularly important with factor data (and their simple character conversions) and numeric data that occasionally contains labels, for example, to describe the reason why certain data is missing. Our tutorial vignette labelled_spss_survey gives you more information about this.

In the next series of tutorials, we will deal with an array of problems. These are not for the faint of heart – you need to have a solid intermediate level of R to follow.

Tidy, joined survey data

The identifiers in the original files may not be unique, so we have to create new, truly unique identifiers. Weighting may not be straightforward.
Neither the number of observations nor the number of variables (which represent the survey questions and their translation to coded data) is the same. Certain data may be present in only one survey and not in the other. This means that you will likely run loops on lists and not on data.frames, but eventually you must carefully join them.
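
The steps above can be sketched in base R. This is only an illustration under stated assumptions: the two toy waves, their variable names (uniqid, qa1a, w1) and their values are invented, and the retroharmonize package offers its own helpers for real survey files.

```r
# Two toy "waves" with clashing row identifiers and different variables
# (all names and values are invented for illustration)
waves <- list(
  wave_2013 = data.frame(uniqid = 1:3, age = c(34, 51, 27), qa1a = c(1, 4, 2)),
  wave_2015 = data.frame(uniqid = 1:2, age = c(45, 62), w1 = c(0.8, 1.2))
)

# Step 1: prefix each identifier with the survey id to make it truly unique
waves <- Map(function(df, id) {
  df$survey_id <- id
  df$uniqid <- paste(id, df$uniqid, sep = "_")
  df
}, waves, names(waves))

# Step 2: add the variables a wave does not have as NA, so columns line up
all_vars <- unique(unlist(lapply(waves, names)))
aligned <- lapply(waves, function(df) {
  for (v in setdiff(all_vars, names(df))) df[[v]] <- NA
  df[all_vars]
})

# Step 3: only now bind the list into a single data frame
panel <- do.call(rbind, aligned)
rownames(panel) <- NULL
panel
```

Working on the list first keeps each wave intact until the columns are known to line up; the join itself is then a single, easily audited step.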

Class conversion

Similar questions may be imported from a non-native R format, in our case from SPSS files, in an inconsistent manner. SPSS’s variable formats cannot be translated unambiguously to R classes. retroharmonize introduced a new S3 class system that handles this problem, but eventually you will have to choose whether you want to see a numeric or a character coding of each categorical variable.
The harmonized surveys, with harmonized variable names and harmonized value labels, must be brought to consistent R representations (most statistical functions will only work on numeric, factor or character data) and carefully joined into a single data table for analysis.
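
The choice between representations can be illustrated with a small base R sketch. The codes and value labels below are hypothetical, and the code only mimics what a labelled vector carries; it is not how the package's S3 class is implemented.

```r
# A coded answer as it might arrive from an SPSS file: numeric codes plus
# a named vector of value labels (hypothetical codes, invented for the sketch)
codes  <- c(1, 2, 2, 1, 9)
labels <- c("Mentioned" = 1, "Not mentioned" = 2, "DK" = 9)

# Character representation: look up the label of each code
as_chr <- names(labels)[match(codes, labels)]

# Factor representation: levels follow the order of the codes
as_fct <- factor(codes, levels = labels, labels = names(labels))

# Numeric representation: keep the codes, but treat "DK" as missing
as_num <- ifelse(codes == 9, NA_real_, codes)

data.frame(codes, as_chr, as_fct, as_num)
```

The three columns hold the same answers, but only the numeric one has already lost the reason why the last value is missing.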

Harmonization of variables and variable labels

The same variables may come with dissimilar variable names and variable labels. It may be a challenge to match age with age. We need to harmonize the names of variables.
The harmonized variables may have different labeling. One survey may label refused answers as declined and the other as refusal. In a simple choice, climate change may be ‘Climate change’ or ‘Problem: Climate change’. Binary choices may have survey-specific coding conventions. Value labels must be harmonized. There are good tools to do this in a single file, but we have to work with several of them.
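
A minimal sketch of both kinds of harmonization, in base R. The function harmonize_label, the lookup table name_map and the variable names in it are hypothetical and only modelled on the cases mentioned above; real surveys need a much longer, manually reviewed mapping.

```r
# Value labels for the same answers, as two surveys might spell them
raw_labels <- c("Climate change", "Problem: Climate change",
                "declined", "Refusal", "Inap.")

harmonize_label <- function(x) {
  x <- tolower(trimws(x))
  x <- sub("^problem:[[:space:]]*", "", x)         # drop survey-specific prefix
  x[grepl("^(declined|refus)", x)] <- "declined"   # one label for refusals
  x[grepl("^inap", x)] <- "inap"                   # not asked / inapplicable
  x
}

harmonize_label(raw_labels)

# Variable names: map survey-specific names to one harmonized name
name_map <- c(d11 = "age", age_exact = "age",
              qa1a = "serious_world_problem", qc1a = "serious_world_problem")
original_names <- c("d11", "qc1a", "isocntry")

# Names without an entry in the lookup table are kept unchanged
harmonized <- ifelse(original_names %in% names(name_map),
                     name_map[original_names], original_names)
harmonized
```

Keeping the rules in one function and one lookup table means the same harmonization can be applied to every file in a loop, and reviewed in one place.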

Missing value harmonization

There are likely to be various types of missing values. Working with missing values is probably where most human judgment is needed. Why are some answers missing: was the question not asked in some questionnaires? Is there a coding error? Did the respondent refuse the question, or say that she did not have an answer? retroharmonize has a special labelled vector type that retains this information from the raw data, if it is present, but you must make the judgment yourself – in R, eventually you will either create a missing category, or use NA_character_ or NA_real_.
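
The two end states described above can be sketched as follows. The missing codes 998 and 999 and their meanings are invented for the example; in real files they differ by survey program and must be read from the codebook.

```r
# Raw numeric codes with survey-specific missing codes (hypothetical values):
# 998 = respondent declined, 999 = question not asked in this questionnaire
answer <- c(3, 7, 998, 10, 999, 5)
missing_codes <- c(declined = 998, inap = 999)

# Option 1: numeric analysis, every kind of missingness becomes NA_real_
answer_num <- ifelse(answer %in% missing_codes, NA_real_, answer)

# Option 2: categorical analysis, the reason for missingness is kept
reason <- names(missing_codes)[match(answer, missing_codes)]
answer_chr <- ifelse(is.na(reason), as.character(answer), reason)

mean(answer_num, na.rm = TRUE)
table(answer_chr)
```

Option 1 is what most statistical functions expect; Option 2 is useful when the share of refusals is itself of interest.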

That’s a lot to put on your plate.

It is unlikely that you will be able to work with completely unfamiliar survey programs if you do not have a strong intermediate level of R. Our package comes with tutorials for Eurobarometer and Afrobarometer, and our development version already covers Arab Barometer, highlighting some peculiar issues with these survey programs, which we hope will give a head start to less experienced R users.

Open Data Day Interview: Mapping Data with Milos Popovic

Wed, 03 Mar 2021 22:23:00 +0200

Milos Popovic is a researcher, a data scientist, a Marie Curie postdoc and a Top 10 dataviz & R contributor on Twitter according to NodeXL. He has taken part in policy debates about terrorism and military intervention and has appeared on a number of TV channels including N1 (the CNN affiliate in the Western Balkans), Serbian National Television and Al-Jazeera Balkans. His research interests are at the intersection of civil war dynamics and postwar politics in the Balkans. He is going to join the Data & Lyrics team on International Open Data Day to help us put harmonized environmental degradation perception and environmental sensory data on maps. We asked him four questions about his passion, mapping data. Please join us on 6 March 2021 at 9.30 EST / 15.30 CET for an informal digital coffee.

As a researcher, why are you so drawn to maps? Is this connected to your interest in territorial conflicts, or do you have some other inspiration?

That’s a great question that really makes me pause and look back at the past 5 years. My mapping story started out of curiosity: I found interesting data on the post-WWII violence in Serbia and thought how cool it would be to make a map in R. I quickly made an unimpressive choropleth map and noticed some unexpected patterns. Then I realized just how much unused violence and census data sits out there while we have no clue about geographic patterns. So, it began. I started off with map-making but my curiosity took me to the world of georeferencing and geospatial analysis. In the process, I created over 300 maps hosted on my website as well as dozens of shapefiles from scratch.

I used to think that my interest was linked to growing up in a war-torn country. But, as my map-making evolved, I discovered that my passion is to use maps as a way to democratize the data: to take the scores of unused, and often buried datasets, place them on the map and share the dataviz with people.

Can you show us an example of the best use of mapped data, and the best map that you have personally created? What is their distinctive value?

I’m immensely proud of my work that required making the shapefiles from scratch. For instance, my shapefile of over 1500 Kosovo cadastral settlements came into being after I turned dozens of high-resolution raster files into a shapefile fully compatible with OpenStreetMap. After months of hard work, I managed to merge the shapefile with the 2011 Kosovo census and present several laser-focused demographic maps to my audience. The same goes for the settlement shapefile of Republika Srpska [the Serb-speaking entity of Bosnia-Herzegovina – the editor], which I made out of a pdf file and merged with the 2013 census data. Whereas most existing maps take a bird’s eye view, my work offers a more fine-grained view of the local dynamics to stakeholders.

Another similar undertaking was my transformation of the pre-WWII German military map of Yugoslavia into a unique shapefile of a few hundred Yugoslav municipalities. I combined this shapefile with the 1931 census data, 80 years after it was first published (better late than never!). It took me almost a year to complete this tremendous project but I enjoyed every bit of it. I have teamed up with my brother who is a web developer and we even made an interactive map of Yugoslavia based on the 1931 census. [The screenshot of this interactive map is the top image in the post – the editor] We hope this project will serve not only scholars but also history enthusiasts to better understand the history of a country that is no more.

Check out Milos’s beautiful static and interactive maps on https://milosp.info/

What do you think about collaboration based on open data and open-source software that processes such data?

It’s a fantastic opportunity for small teams to bypass traditional gatekeepers such as state institutions or big companies and use open source apps for the benefit of their local communities. For example, access to OpenStreetMap allows small teams to map pressing communal issues such as crime, diseases, or environmental degradation and come up with innovative solutions. In my work, too, OSM has helped me create several fine-grained maps that shed more light on local problems in Serbia such as pollution, car accidents or violence.

We are hoping to bring together environmental, sensory data and public attitude data on environmental issues. How can mapping help? What do you expect from this project?

More than ever, we are compelled to figure out how maladies spread locally. Without mapping the hotspots, our understanding of the consequences of, for example, viral transmission or pollution is shrouded in a lot of uncertainty. We might have no clue how environmental issues shape public attitudes in localities until we use mapping to turn on the light. Mapping would help this project pin down geographic clusters that require immediate attention from private and public stakeholders.

Please join us for a digital coffee, tea or beer on International Open Data Day - we will put never-before-seen data on maps, and discuss how to build successful open collaborations, with small, independent contributions that build large data observatories. Make sure you check out Milos' amazing website, too!

This blogpost was originally posted on our Data & Lyrics blog and in another version on Medium.

Eurobarometer Surveys Used In Our Project

Wed, 03 Mar 2021 00:00:00 +0000

In our tutorial series, we are going to harmonize the following questionnaire items from five Eurobarometer harmonized survey files. The Eurobarometer survey files are harmonized across countries, but they are only partially harmonized in time.

All data must be downloaded from the GESIS Data Archive in Cologne. We are not affiliated with GESIS and you must read and accept their terms to use the data.

Eurobarometer 80.2 (2013)

GESIS Data Archive, Cologne. ZA5877 Data file Version 2.0.0, https://doi.org/10.4232/1.12792

Data file: ZA5877 data file (European Commission 2017).
Questionnaire: Eurobarometer 80.2 Basic Bilingual Questionnaire
Citation: ZA5877 Bibtex

QA1a Which of the following do you consider to be the single most serious problem facing the world as a whole? (single choice)

QA1b Which others do you consider to be serious problems? (multiple choice)

QA2 And how serious a problem do you think climate change is at this moment? Please use a scale from 1 to 10, with '1' meaning it is "not at all a serious problem" (scale 1-10)

QA4 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU (agreement-disagreement 4-scale)

QA4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU could benefit the EU economically (agreement-disagreement 4-scale)

QA5 Have you personally taken any action to fight climate change over the past six months? (binary)

Eurobarometer 83.4 (2015)

European Commission, Brussels; Directorate General Communication, COMM.A.1 ‘Strategy, Corporate Communication Actions and Eurobarometer’ GESIS Data Archive, Cologne. ZA6595 Data file Version 3.0.0, https://doi.org/10.4232/1.13146

Data file: ZA6595 data file (European Commission 2018).
Questionnaire: Eurobarometer 83.4 Basic Bilingual Questionnaire
Citation: ZA6595 Bibtex

Eurobarometer 87.1 (2017)

European Commission, Brussels; Directorate General Communication, COMM.A.1 ‘Strategic Communication’; European Parliament, Directorate-General for Communication, Public Opinion Monitoring Unit GESIS Data Archive, Cologne. ZA6861 Data file Version 1.2.0, https://doi.org/10.4232/1.12922

Data file: ZA6861 data file.
Questionnaire: Eurobarometer 87.1 Basic Bilingual Questionnaire
Citation: ZA6861 Bibtex

QC1a Which of the following do you consider to be the single most serious problem facing the world as a whole? (single choice)

QC1b Which others do you consider to be serious problems? (multiple choice)

QC2 And how serious a problem do you think climate change is at this moment? Please use a scale from 1 to 10, with '1' meaning it is "not at all a serious problem" (scale 1-10)

QC4 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU (agreement-disagreement 4-scale)

QC4 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QC4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QC4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can increase the security of EU energy supplies (agreement-disagreement 4-scale)

QC4 To what extent do you agree or disagree with each of the following statements? - More public financial support should be given to the transition to clean energies even if it means subsidies to fossil fuels should be reduced. (agreement-disagreement 4-scale)

QC5 Have you personally taken any action to fight climate change over the past six months? (binary)

Eurobarometer 90.2 (2018)

European Commission, Brussels; Directorate General Communication, COMM.A.3 ‘Media Monitoring and Eurobarometer’ GESIS Data Archive, Cologne. ZA7488 Data file Version 1.0.0, https://doi.org/10.4232/1.13289

Data file: ZA7488 data file (European Commission 2019a)
Questionnaire: Eurobarometer 90.2 Basic Bilingual Questionnaire
Citation: ZA7488 Bibtex

QB5 To what extent do you agree or disagree with each of the following statements? - Fighting climate change and using energy more efficiently can boost the economy and jobs in the EU (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can increase the security of EU energy supplies (agreement-disagreement 4-scale)

QB5 To what extent do you agree or disagree with each of the following statements? - More public financial support should be given to the transition to clean energies even if it means subsidies to fossil fuels should be reduced. (agreement-disagreement 4-scale)

Eurobarometer 91.3 (2019)

European Commission, Brussels; Directorate General Communication, COMM.A.3 ‘Media Monitoring and Eurobarometer’ GESIS Data Archive, Cologne. ZA7572 Data file Version 1.0.0, https://doi.org/10.4232/1.13372

Data file: ZA7572 data file (European Commission 2019b).
Questionnaire: Eurobarometer 91.3 Basic Bilingual Questionnaire
Citation: ZA7572 Bibtex

QB4 To what extent do you agree or disagree with each of the following statements? - Taking action on climate change will lead to innovation that will make EU companies more competitive (N) (agreement-disagreement 4-scale)

QB4 To what extent do you agree or disagree with each of the following statements? - Promoting EU expertise in new clean technologies to countries outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB4 To what extent do you agree or disagree with each of the following statements? - Reducing fossil fuel imports from outside the EU can benefit the EU economically (agreement-disagreement 4-scale)

QB4 To what extent do you agree or disagree with each of the following statements? - Adapting to the adverse impacts of climate change can have positive outcomes for citizens in the EU (agreement-disagreement 4-scale)

QB5 Have you personally taken any action to fight climate change over the past six months? (binary)

References

European Commission, Brussels. 2017. “Eurobarometer 80.2 (2013).” GESIS Data Archive, Cologne. ZA5877 Data file Version 2.0.0. https://doi.org/10.4232/1.12792.

———. 2018. “Eurobarometer 83.4 (2015).” GESIS Data Archive, Cologne. ZA6595 Data file Version 3.0.0. https://doi.org/10.4232/1.13146.

———. 2019a. “Eurobarometer 90.2 (2018).” GESIS Data Archive, Cologne. ZA7488 Data file Version 1.0.0. https://doi.org/10.4232/1.13289.

———. 2019b. “Eurobarometer 91.3 (2019).” GESIS Data Archive, Cologne. ZA7572 Data file Version 1.0.0. https://doi.org/10.4232/1.13372.

Data Curation

Thu, 21 Jan 2021 00:00:00 +0000

If you cannot find the right data for your policy evaluation, your consulting project, your PhD thesis, your market research, or your scientific research project, it does not mean that the data does not exist, or that it is not available for free. In our experience, up to 95% of available open data is never used, because potential users do not realize it exists or do not know how to access it.

Every day, thousands of new datasets become available via the EU open data regime, freedom of information legislation in the United States and other jurisdictions, or open science and scientific reproducibility requirements — but as these datasets have been packaged or processed for different primary, original uses, they often require open data experts to locate them and adapt them to a usable form for reuse in business, scientific, or policy research.

The creative and cultural industries often do not participate in government statistics programs because these industries typically consist of microenterprises that are exempted from statistical reporting and that file only simplified financial statements and tax returns. This means that finding the appropriate private or public data sources for creative and cultural industry uses requires particularly good data maps.

Data curation means that we are continuously mapping potential data sources and sending requests to download and quality test the most current data sources. Our CEEMID project has produced several thousand indicators, of which a few dozen are available in our Demo Music Observatory. If you have specific data needs for scientific research, policy evaluation, or a business project, we can find and provide the most suitable, most current, and best value data for analysis or for ethical AI applications.

Data Processing

Thu, 21 Jan 2021 00:00:00 +0000

Data analysts spend 80% of their time on data processing, even though computers can perform these tasks much faster, with far fewer errors, and can document the process automatically. Data processing can be shared: an analyst in a company and an analyst in an NGO do not have to reprocess the very same data twice.

See our blogpost How We Add Value to Public Data With Imputation and Forecasting.

See our blogpost about the Data Sisyphus.

We have a better solution. You can always rely on our API to import directly the latest, best data, but if you want to be sure, you can use our regular backups on Zenodo. Zenodo is an open science repository managed by CERN and supported by the European Union. On Zenodo, you can find an authoritative copy of our indicator (and its previous versions) with a digital object identifier, for example, 10.5281/zenodo.5652118. These datasets will be preserved for decades, and nobody can manipulate them. You cannot accidentally overwrite them, and we have no backdoor access to modify them.

Data-as-Service

Thu, 21 Jan 2021 00:00:00 +0000

We want to ensure that individual researchers, artists, and professionals, as well as NGOs and small and large organizations can benefit equally from big data in the age of artificial intelligence.

Big data creates inequality and injustice because it is only the big corporations, big government agencies, and the biggest, best endowed universities that can finance long-lasting, comprehensive data collection programs. Big data, and large, well-processed, tidy, and accurately imputed datasets allow them to unleash the power of machine learning and AI. These large entities are able to create algorithms that decide the commercial success of your product and your artwork, giving them a competitive edge against smaller competitors while helping them evade regulations.

Check out our iotables software that helps use national accounts data from all EU member states to calculate direct, indirect and induced economic impacts, such as employment multipliers or GVA effects of various cultural and creative economy policies.

Check out our regions software that helps create sub-national statistical indicators and keeps track of regional boundary changes.

Check out our retroharmonize software that helps with the harmonization of various European and African standardized surveys.

retroharmonize R package for survey harmonization

Tue, 25 Aug 2020 00:00:00 +0000

Retrospective data harmonization

The aim of retroharmonize is to provide tools for reproducible retrospective (ex-post) harmonization of datasets that contain variables measuring the same concepts but coded in different ways. Ex-post data harmonization enables better use of existing data and creates new research opportunities. For example, harmonizing data from different countries enables cross-national comparisons, while merging data from different time points makes it possible to track changes over time.

Retrospective data harmonization is associated with challenges including conceptual issues with establishing equivalence and comparability, practical complications of having to standardize the naming and coding of variables, technical difficulties with merging data stored in different formats, and the need to document a large number of data transformations. The retroharmonize package assists with the latter three components, freeing up the capacity of researchers to focus on the first.

Specifically, the retroharmonize package proposes a reproducible workflow, including a new class for storing data together with the harmonized and original metadata, as well as functions for importing data from different formats, harmonizing data and metadata, documenting the harmonization process, and converting between data types. See here for an overview of the functionalities.

The new labelled_spss_survey() class is an extension of haven’s labelled_spss class. It not only preserves variable and value labels and the user-defined missing range, but also gives an identifier, for example, the filename or the wave number, to the vector. Additionally, it enables the preservation – as metadata attributes – of the original variable names, labels, and value codes and labels, from the source data, in addition to the harmonized variable names, labels, and value codes and labels. This way, the harmonized data also contain the pre-harmonization record. The stored original metadata can be used for validation and documentation purposes.
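
The idea of carrying both the harmonized and the original metadata with the data can be illustrated with plain R attributes. This is a toy sketch and not the package's implementation: the constructor toy_survey_vector, the attribute names and all codes are assumptions made for the example.

```r
# A toy constructor that stores labels, missing codes and provenance
# as attributes next to the values (NOT the retroharmonize class itself)
toy_survey_vector <- function(x, labels, na_values, label, id, name_orig) {
  structure(x, labels = labels, na_values = na_values, label = label,
            id = id, name_orig = name_orig)
}

v <- toy_survey_vector(
  x         = c(1, 0, 9, 1),
  labels    = c(yes = 1, no = 0, refused = 9),
  na_values = 9,
  label     = "Taken personal action on climate change",
  id        = "survey_wave_1",   # e.g. the file name or wave number
  name_orig = "qa5"              # variable name in the source file
)

# Harmonize the codes and labels, but keep the pre-harmonization record
harmonized <- v
harmonized[as.numeric(v) == 9] <- 99999
attr(harmonized, "labels")      <- c(yes = 1, no = 0, declined = 99999)
attr(harmonized, "na_values")   <- 99999
attr(harmonized, "labels_orig") <- attr(v, "labels")

attributes(harmonized)
```

Because the original labels travel with the vector, a later validation step can still check how each harmonized code was derived.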

The vignette Working With The labelled_spss_survey Class provides more information about the labelled_spss_survey() class.

In Harmonize Value Labels we discuss the characteristics of the labelled_spss_survey() class and demonstrate the problems that using this class solves.

We also provide three extensive case studies illustrating how the retroharmonize package can be used for ex-post harmonization of data from cross-national surveys:

The creators of retroharmonize are not affiliated with Afrobarometer, Arab Barometer, Eurobarometer, or the organizations that design, produce or archive their surveys.

We have started building experimental APIs that run retroharmonize regularly and improve known statistical data sources. See: Digital Music Observatory, Green Deal Data Observatory, Economy Data Observatory.

Citing the data sources

Our package has been tested on the microdata of three harmonized survey programs. Because retroharmonize is not affiliated with any of these data sources, to replicate our tutorials or work with the data, you have to download the data files from these sources, and you have to cite those sources in your work.

Afrobarometer data: cite Afrobarometer.
Arab Barometer data: cite Arab Barometer.
Eurobarometer data: the Eurobarometer raw data and related documentation (questionnaires, codebooks, etc.) are made available by GESIS, ICPSR and through the Social Science Data Archive networks. You should cite your source; in our examples, we rely on the GESIS data files.

Citing the retroharmonize R package

For the main developer and contributors, see the package homepage.

This work can be freely used, modified and distributed under the GPL-3 license:

citation("retroharmonize")
#> 
#> To cite package 'retroharmonize' in publications use:
#> 
#>   Daniel Antal (2021). retroharmonize: Ex Post Survey Data
#>   Harmonization. R package version 0.1.17.
#>   https://retroharmonize.dataobservatory.eu/
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {retroharmonize: Ex Post Survey Data Harmonization},
#>     author = {Daniel Antal},
#>     year = {2021},
#>     doi = {10.5281/zenodo.5006056},
#>     note = {R package version 0.1.17},
#>     url = {https://retroharmonize.dataobservatory.eu/},
#>   }

Contact

For contact information and the list of contributors, see the package homepage.

Code of Conduct

Please note that the retroharmonize project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

regions R package to create sub-national statistical indicators

Wed, 03 Jun 2020 17:00:00 +0000

Installation

You can install the development version from GitHub with:

devtools::install_github("rOpenGov/regions")

or the released version from CRAN:

install.packages("regions")

regions currently takes care of 20,000 sub-divisional boundary changes in Europe since 1999. Comparing the 2013 departments of France with the 2007 voivodeships of Poland and the 2018 megyék of Hungary? This extremely error-prone work is automated; as a result, you can compare 110-260 regions for far better analysis. regions was downloaded by about 600 researchers in the first month after release.

You can review the complete package documentation on regions.dataobservatory.eu. If you find any problems with the code, please raise an issue on GitHub. Pull requests are welcome if you agree with the Contributor Code of Conduct.

If you use regions in your work, please cite the package.

Motivation

Working with sub-national statistics has many benefits. In policymaking or in social sciences, it is a common practice to compare national statistics, which can be hugely misleading. The United States of America, the Federal Republic of Germany, Slovakia and Luxembourg are all countries, but they differ vastly in size and social homogeneity. Comparing Slovakia and Luxembourg to the federal states or even regions within Germany, or the states of Germany and the United States can provide more adequate insights. Statistically, the similarity of the aggregation level and high number of observations can allow more precise control of model parameters and errors.

The advantages of switching from a national level of analysis to a sub-national level come at a high price in data processing, validation and imputation. The regions package aims to help with this process.

Sub-national Statistics Have Many Challenges

Frequent boundary changes: unlike national boundaries, sub-national territorial units and typologies change often, which makes it necessary to validate and recode observations across time. For example, in the European Union, sub-national typologies change about every three years, so you have to make sure that you are comparing the same French region over time, or check whether a comparison over time is possible at all.
Hierarchical aggregation and special imputation: missingness is very frequent in sub-national statistics, because they are published with a serious time lag compared to national ones, and because they are often not back-casted after boundary changes. You cannot use standard imputation algorithms because the observations are not similarly aggregated or averaged. Often the information only appears to be missing: it is present, but under an obsolete typology code.
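The "seemingly missing" case can be illustrated with a minimal base-R sketch. It does not use the regions API: the region codes, the correspondence table and the helper recode_to_current() are hypothetical, and the values are made up.

```r
# Toy observations; the first two rows still use an obsolete typology code.
observations <- data.frame(
  geo    = c("XX01", "XX02", "XXB1"),
  year   = c(2012, 2012, 2019),
  values = c(10.5, 7.2, 8.1),
  stringsAsFactors = FALSE
)

# Hypothetical correspondence table between obsolete and current codes.
correspondence <- data.frame(
  code_old = c("XX01", "XX02"),
  code_new = c("XXA1", "XXB1"),
  stringsAsFactors = FALSE
)

# Translate obsolete codes to the current typology and flag recoded rows.
recode_to_current <- function(data, lookup) {
  position <- match(data$geo, lookup$code_old)
  data$geo_current <- ifelse(is.na(position), data$geo, lookup$code_new[position])
  data$recoded <- !is.na(position)
  data
}

recoded <- recode_to_current(observations, correspondence)
recoded
```

After recoding, the 2012 observation for XX02 and the 2019 observation for XXB1 line up as the same territorial unit, so the 2012 value is no longer treated as missing. A one-to-one lookup like this only covers relabelled regions; splits and mergers need more care.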

Package functionality

Generic vocabulary translation and joining functions for geographically coded data
Keeping track of the boundary changes within the European Union between 1999 and 2021
Vocabulary translation and joining functions for standardized European Union statistics
Vocabulary translation for the ISO-3166-2 based Google data and the European Union
Imputation functions from higher aggregation hierarchy levels to lower ones, for example from NUTS1 to NUTS2 or from ISO-3166-1 to ISO-3166-2 (impute down)
Imputation functions from lower hierarchy levels to higher ones (impute up)
Aggregation function from lower hierarchy levels to higher ones, for example from NUTS3 to NUTS1 or from ISO-3166-2 to ISO-3166-1 (aggregate; under development)
Disaggregation functions from higher hierarchy levels to lower ones, again, for example from NUTS1 to NUTS2 or from ISO-3166-1 to ISO-3166-2 (disaggregate; under development)
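To make the "impute down" idea concrete, here is a minimal base-R sketch under stated assumptions. It is not the package's own function: impute_down_sketch() is a hypothetical helper, the region codes and values are made up, and it relies only on the fact that a NUTS2 code is its parent NUTS1 code plus one character.

```r
# Toy data: a rate-like indicator observed at two hierarchy levels.
nuts1 <- data.frame(
  geo = c("XX1", "XX2"),
  values = c(4.0, 6.5),
  stringsAsFactors = FALSE
)
nuts2 <- data.frame(
  geo = c("XX11", "XX12", "XX21"),
  values = c(3.8, NA, NA),
  stringsAsFactors = FALSE
)

# Fill missing lower-level values with the value of the parent region
# and record how each value was obtained.
impute_down_sketch <- function(lower, upper) {
  parent <- substr(lower$geo, 1, nchar(lower$geo) - 1)
  parent_values <- upper$values[match(parent, upper$geo)]
  missing <- is.na(lower$values)
  lower$method <- ifelse(missing, "imputed from parent", "actual")
  lower$values[missing] <- parent_values[missing]
  lower
}

imputed <- impute_down_sketch(nuts2, nuts1)
imputed
```

Copying the parent value is only sensible for rates and averages; totals such as population would have to be disaggregated instead, which is why the list above treats disaggregation as a separate function.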

Vignettes / Articles

Feedback?

Raise an issue on GitHub or get in touch.

iotables R package for working with symmetric input-output tables

Wed, 03 Jun 2020 00:00:00 +0000

iotables processes all the symmetric input-output tables of the EU member states, and calculates direct, indirect and induced effects, multipliers for GVA, employment, taxation. These are important inputs into policy evaluation, business forecasting, or granting/development indicator design. iotables is used by about 800 experts around the world.
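The multipliers mentioned above rest on the standard Leontief calculation, which can be sketched in base R. This is not the iotables API: the two-sector coefficient matrix is made up for illustration, and the sketch covers only direct and indirect effects (induced effects would additionally require closing the model for households).

```r
# Toy technical coefficients: A[i, j] is the input from sector i
# needed to produce one unit of output in sector j (made-up numbers).
A <- matrix(
  c(0.2, 0.3,
    0.1, 0.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("agriculture", "industry"), c("agriculture", "industry"))
)

# Leontief inverse: total (direct + indirect) output requirements
# per unit of final demand.
leontief_inverse <- solve(diag(nrow(A)) - A)

# Output multipliers: column sums of the Leontief inverse.
output_multipliers <- colSums(leontief_inverse)
output_multipliers
```

Each multiplier says how much total output across all sectors is needed to satisfy one additional unit of final demand for that sector's product.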

Code of Conduct

Please note that the iotables project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Slides

Tue, 05 Feb 2019 00:00:00 +0000

Create slides in Markdown with Wowchemy

Wowchemy | Documentation

Features

Efficiently write slides in Markdown
3-in-1: Create, Present, and Publish your slides
Supports speaker notes
Mobile friendly slides

Controls

Next: Right Arrow or Space
Previous: Left Arrow
Start: Home
Finish: End
Overview: Esc
Speaker notes: S
Fullscreen: F
Zoom: Alt + Click
PDF Export: E

Code Highlighting

Inline code: variable

Code block:

porridge = "blueberry"
if porridge == "blueberry":
    print("Eating...")

Math

In-line math: $x + y = z$

Block math:

$$ f\left( x \right) = \;\frac{{2\left( {x + 4} \right)\left( {x - 4} \right)}}{{\left( {x + 4} \right)\left( {x + 1} \right)}} $$

Fragments

Make content appear incrementally

{{% fragment %}} One {{% /fragment %}}
{{% fragment %}} **Two** {{% /fragment %}}
{{% fragment %}} Three {{% /fragment %}}

Press Space to play!

One Two Three

A fragment can accept two optional parameters:

class: use a custom style (requires definition in custom CSS)
weight: sets the order in which a fragment appears

Speaker Notes

Add speaker notes to your presentation

{{% speaker_note %}}
- Only the speaker can read these notes
- Press `S` key to view
{{% /speaker_note %}}

Press the S key to view the speaker notes!

Themes

black: Black background, white text, blue links (default)
white: White background, black text, blue links
league: Gray background, white text, blue links
beige: Beige background, dark text, brown links
sky: Blue background, thin dark text, blue links

night: Black background, thick white text, orange links
serif: Cappuccino background, gray text, brown links
simple: White background, black text, blue links
solarized: Cream-colored background, dark green text, blue links

Custom Slide

Customize the slide style and background

{{< slide background-image="/media/boards.jpg" >}}
{{< slide background-color="#0000FF" >}}
{{< slide class="my-style" >}}

Custom CSS Example

Let’s make headers navy colored.

Create assets/css/reveal_custom.css with:

.reveal section h1,
.reveal section h2,
.reveal section h3 {
  color: navy;
}

Questions?

Ask

Documentation

Robin Nagy

Robin is a freelancer working on projects using emerging technologies. Main areas of expertise are business development, project management, innovation, governance and corporate culture. Recent interests: exponential technologies and exponential business.