Our project is open source; the code is available in this GitLab repository. The repository contains the code that processes the data, but not the data itself.
Should you encounter any mistake in the data, please contact us immediately; we will do our best to fix it. In particular, the “main” table, enhanced_link, contains the URL of the source of the data, allowing for quick verification.
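As an illustration of this kind of verification, the sketch below lists the largest records together with their source URL, so they can be compared by hand with the original publication. It uses an in-memory SQLite database as a stand-in. Apart from the table name enhanced_link, the column names (company, amount, source_url) and the sample rows are invented for the example and may not match the real schema.

```python
import sqlite3


def largest_links(connection: sqlite3.Connection, limit: int = 10):
    """Return (company, amount, source_url) rows for the largest amounts."""
    query = (
        "SELECT company, amount, source_url "
        "FROM enhanced_link ORDER BY amount DESC LIMIT ?"
    )
    return connection.execute(query, (limit,)).fetchall()


# Stand-in database: hypothetical columns and made-up rows.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE enhanced_link (company TEXT, amount REAL, source_url TEXT)"
)
connection.executemany(
    "INSERT INTO enhanced_link VALUES (?, ?, ?)",
    [
        ("Company A", 1200.0, "https://example.org/a.pdf"),
        ("Company B", 50.0, "https://example.org/b.pdf"),
        ("Company C", 9000.0, "https://example.org/c.pdf"),
    ],
)

# Print the two largest records with the URL to check them against.
for company, amount, url in largest_links(connection, limit=2):
    print(company, amount, url)
```

The same query could be run from any SQL client connected to the database; only the column names would need adjusting.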
The format we created is designed to allow quick verification against the source where the data was first published. We then extracted key metrics grouped by year, company, and country for the top 20 companies in terms of total amount disclosed.
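A grouping of this kind could look like the following minimal sketch. It is not the project's code: the record shape (year, company, country, amount) and the function name are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (year, company, country, amount)
Record = Tuple[int, str, str, float]
GroupKey = Tuple[int, str, str]


def top_company_metrics(records: Iterable[Record], top_n: int = 20) -> Dict[GroupKey, float]:
    """Sum amounts by (year, company, country), keeping only the top_n
    companies ranked by their total amount over all years and countries."""
    records = list(records)

    # Total amount per company, used to rank the companies.
    totals: Dict[str, float] = defaultdict(float)
    for _year, company, _country, amount in records:
        totals[company] += amount
    top = set(sorted(totals, key=totals.get, reverse=True)[:top_n])

    # Grouped sums, restricted to the selected companies.
    grouped: Dict[GroupKey, float] = defaultdict(float)
    for year, company, country, amount in records:
        if company in top:
            grouped[(year, company, country)] += amount
    return dict(grouped)
```

Ranking on the overall total first, then grouping, keeps a company's smaller country or year entries in the result as long as the company itself is in the top group.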
For centralized countries, this is simple. For countries based on PDFs, we list the PDFs in an online spreadsheet. Here you can see the sources we used for centralized countries, as well as links to the lists of PDFs we gathered for PDF countries:

| Country | Source |
| --- | --- |
| France | Base Transparence Santé (public register) |
| Denmark | Danish Medicines Agency (public register) |
| United Kingdom | Disclosure UK, made by the ABPI (trade association) |
| Ireland | Transferofvalue.ie, made by the IPHA (trade association) |
| PDF countries | Documents disclosed by each company: see list |

For centralized countries, we download the data if possible (UK), or scrape the website. For PDF countries, a Python script reads the previous spreadsheet and downloads the PDFs, standardizing names for practicality.

3 – Extract data

For centralized countries, we read the downloaded files (XLS or HTML files) and extract the data via a Python script, then save the data as a CSV file.

For PDF countries, this is much more complex. We tried various approaches; the best results were obtained by first using software to convert the PDF into XLS, then extracting the data from the Excel file. Because the PDF documents vary so much, no converter worked for all documents, so we used several converters and possibly had different XLS files for each PDF file. Converters used: Adobe Acrobat Export PDF, SmallPDF, ABBYY FineReader.

Even once in Excel format, extracting data wasn’t trivial, since each company formatted its publication differently. The main challenges were to correctly identify the columns and the different blocks of lines of each document (headers, HCP individual lines, HCP aggregated lines, HCO individual lines, HCO aggregated lines, RnD line).

The result of this step is a CSV file representing the data in a format very close to the EFPIA format. For PDF countries, this step was sometimes unsuccessful; you can see the parse ratio on the overview dashboard.

We developed a format to represent links of interest in a standardized way (see details of this format here on GitLab). This format can represent both the EFPIA standard and the standards of countries with state regulation, like France or Portugal. In this step we convert the previous CSV files to this format.

Once the data is in a normalized CSV file, it can be inserted into a single database. This database sits on a server with an instance of Metabase, an open-source tool for database exploration (see here for more info). This tool allows running SQL queries against the database as well as creating charts and dashboards.

The goal of this project is to collect and centralize data without altering it. To ensure this, we put various quality checks in place, especially around the extraction of the data from EFPIA PDF publications:

- Checking which line blocks were detected in the file (some line blocks can be empty or not be present, but not all).
- HCP lines cannot have Grants or Event Sponsorship values.
- If an aggregated line is detected, the two other aggregated lines must be detected too (amount, percent, number of recipients).
- Randomly checked 100 links of interest against their source and found no data had been misread.
- Checked the links with the biggest values against their source.
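Two of the quality checks listed in this post are simple rules on the extracted lines: HCP lines cannot have Grants or Event Sponsorship values, and aggregated lines must come as a complete set (amount, percent, number of recipients). They could be sketched as follows. This is a minimal illustration, not the project's code: the dictionary keys (block, kind, grants, event_sponsorship) and the function name are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical row shape: each extracted line is a dict with
#   "block": "HCP" or "HCO"
#   "kind": "individual", or one of the aggregated kinds below
#   "grants", "event_sponsorship": amounts, or None when the cell is empty
AGGREGATED_KINDS = {"amount", "percent", "number_of_recipients"}

Row = Dict[str, Optional[object]]


def check_rows(rows: List[Row]) -> List[str]:
    """Return a list of human-readable problems found in the extracted rows."""
    problems = []

    # Rule 1: HCP lines cannot have Grants or Event Sponsorship values.
    for index, row in enumerate(rows):
        if row.get("block") == "HCP":
            for column in ("grants", "event_sponsorship"):
                if row.get(column) not in (None, ""):
                    problems.append(f"row {index}: HCP line has a {column} value")

    # Rule 2: if one aggregated line is detected in a block,
    # the two other aggregated lines must be detected as well.
    for block in ("HCP", "HCO"):
        found = {
            row["kind"]
            for row in rows
            if row.get("block") == block and row.get("kind") in AGGREGATED_KINDS
        }
        missing = AGGREGATED_KINDS - found
        if found and missing:
            problems.append(f"{block}: missing aggregated lines {sorted(missing)}")

    return problems
```

Returning a list of problems rather than raising on the first one makes it possible to review every issue in a file at once before deciding whether to retry with another converter.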