Data Sources - Public Innovations Explorer

Awards Data

Awards data for the Small Business Innovation Research (SBIR) and Small Business Technology Transfer (STTR) programs was obtained from the Awards Database at SBIR.gov. Between 2008 and 2018, the eleven Federal agencies participating in the SBIR and STTR programs made 65,749 grants. Information regarding the data models for the available awards data, solicitations data, firms data and local assistance programs datasets, including the data dictionaries, can be found here. Awards were geocoded using Geocod.io.

From the SBIR website:

The Small Business Innovation Research (SBIR) and Small Business Technology Transfer (STTR) programs are highly competitive programs that encourage domestic small businesses to engage in Federal Research/Research and Development (R/R&D) with the potential for commercialization. Through a competitive awards-based program, SBIR and STTR enable small businesses to explore their technological potential and provide the incentive to profit from its commercialization. By including qualified small businesses in the nation's R&D arena, high-tech innovation is stimulated, and the United States gains entrepreneurial spirit as it meets its specific research and development needs.

Central to the STTR program is the partnership between small businesses and nonprofit research institutions. The STTR program requires the small business to formally collaborate with a research institution in Phase I and Phase II. STTR's most important role is to bridge the gap between performance of basic science and commercialization of resulting innovations.

Each year, Federal agencies with extramural research and development (R&D) budgets that exceed $100 million are required to allocate 3.2% (since FY2017) of this extramural R&D budget to fund small businesses through the SBIR program. Federal agencies with extramural R&D budgets that exceed $1 billion are required to reserve 0.45% (since FY2016) of this extramural R&D budget for the STTR program. Currently, eleven Federal agencies participate in the SBIR program and five of those agencies also participate in the STTR program.

Labor Data

Data on the workforce composition of congressional districts was obtained from the United States Census Bureau's 2014-2018 American Community Survey 5-year estimates, available here. Industries are based on the North American Industry Classification System (NAICS). National hot-spots and cold-spots for each industry were identified with the Local Moran's I statistic, computed in R (via RStudio) with the sp, spdep, and rgdal packages.
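
For readers unfamiliar with the statistic, here is a minimal sketch of Local Moran's I in plain Python. It is not the spdep code used for the Explorer: the function name, the toy values, and the adjacency list are all made up for illustration, and it assumes row-standardized weights and a variance computed over n rather than n - 1. It also skips the significance testing that a real hot-spot analysis needs.

```python
from statistics import fmean

def local_morans_i(values, neighbors):
    """Return one Local Moran's I score per area.

    values    -- list of observed values, one per area
    neighbors -- dict mapping an area index to a list of neighbor indices
    """
    n = len(values)
    mean = fmean(values)
    z = [v - mean for v in values]      # deviations from the overall mean
    m2 = sum(d * d for d in z) / n      # second moment (assumes values are not all equal)

    scores = []
    for i in range(n):
        nbrs = neighbors.get(i, [])
        # Row-standardized weights: each neighbor counts 1/len(nbrs),
        # so the spatial lag is the mean deviation of the neighbors.
        lag = sum(z[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        scores.append(z[i] / m2 * lag)
    return scores

# Hypothetical toy data: five areas in a row, each adjacent to the next.
shares = [10.0, 12.0, 11.0, 2.0, 1.0]
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = local_morans_i(shares, adjacency)
```

A positive score means an area resembles its neighbors (high next to high suggests a hot-spot, low next to low a cold-spot), while a negative score marks an area that differs from its surroundings. In the toy data, area 2 scores below zero because it sits between a high-valued and a low-valued neighbor.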

Vocabularies

A number of controlled vocabularies and taxonomies were used in the keyword extraction pipeline. They were chosen for their relevance and for how easy their data resources were to work with. Vocabularies were accessed from SPARQL endpoints and/or manually downloaded in various formats, then reformatted as needed using Python (rdflib) and the open-source ontology editor Protégé.
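
As a rough illustration of the reformatting step, the sketch below flattens a small SKOS file into a label-to-concept lookup. It is a stand-in, not the project's code: it uses Python's standard-library XML parser instead of rdflib so that it runs without extra installs, it assumes the vocabulary was exported as SKOS in RDF/XML, and the sample concepts and example.org URIs are invented.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XML = "http://www.w3.org/XML/1998/namespace"

# Hypothetical two-concept vocabulary in SKOS RDF/XML.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/vocab/c1">
    <skos:prefLabel xml:lang="en">Soil erosion</skos:prefLabel>
    <skos:altLabel xml:lang="en">Land erosion</skos:altLabel>
    <skos:prefLabel xml:lang="fr">Érosion du sol</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/vocab/c2">
    <skos:prefLabel xml:lang="en">Renewable energy</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>
"""

def english_labels(rdf_xml):
    """Map each lowercase English label to the URI of its concept."""
    root = ET.fromstring(rdf_xml)
    labels = {}
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        uri = concept.get(f"{{{RDF}}}about")
        # Preferred and alternative labels both point at the same concept.
        for tag in ("prefLabel", "altLabel"):
            for label in concept.findall(f"{{{SKOS}}}{tag}"):
                if label.get(f"{{{XML}}}lang") == "en" and label.text:
                    labels[label.text.strip().lower()] = uri
    return labels

labels = english_labels(SAMPLE)
```

The resulting dictionary is the kind of flat term list a dictionary-based replacement step can consume. Vocabularies published in other serializations (Turtle, JSON-LD) would need a real RDF library such as rdflib.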

Source	Description
European Environment Agency General Multilingual Environmental Thesaurus (GEMET)	Environmental issue classifications from the European Commission
European Institute for Gender Equality (EIGE) Glossary & Thesaurus	Gender equality thesaurus from the European Commission
Food and Agriculture Organization of the United Nations - AGROVOC Thesaurus	Food systems and agricultural classifications from the United Nations
STW Thesaurus of Economics	Standardized subject headings and individual keywords in various areas of economics, geography, society and politics from the Leibniz Information Centre for Economics.
European Science Vocabulary (EuroSciVoc)	Science related classifications from the European Commission
EUROVOC Thesaurus of Activities related to the EU	Governmental, social, political, legal and economic classifications from the European Commission
National Library of Medicine Medical Subject Headings	The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information.

NLP Pipeline

To process the award abstracts, I used the NIH Office of Portfolio Analysis's NLPre processing pipeline. The library includes a number of utilities often used when working with scientific publications, including:

Acronym identification and replacement
Parenthetical phrase identification
Parenthetical phrase extraction
Part-of-speech tokenization
Unicode to ASCII character conversions
Reconnecting hyphenated words
Decapitalization of document and section titles
Citation separation
URL replacement
Replacement of mathematical tokens (e.g. > → 'greater than'; % → 'percent')
Token replacement from dictionaries
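
To make a few of these steps concrete, here is a minimal sketch of three of them (reconnecting hyphenated words, replacing mathematical tokens, and replacing tokens from a dictionary) written with Python's re module. It does not use or reproduce NLPre's API; the function names, the symbol table, and the underscore-joined replacement token are assumptions made for illustration.

```python
import re

# Assumed symbol table; a real pipeline would likely cover more tokens.
MATH_TOKENS = {">": "greater than", "<": "less than", "%": "percent"}

def reconnect_hyphens(text):
    # Join a word split by a hyphen at a line break, e.g. "commer-\ncialization".
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

def replace_math_tokens(text):
    for symbol, words in MATH_TOKENS.items():
        text = text.replace(symbol, f" {words} ")
    # Collapse the extra spaces introduced by the padding above.
    return re.sub(r"[ \t]+", " ", text).strip()

def replace_from_dictionary(text, phrases):
    # phrases maps a phrase to a single replacement token.
    # Longer phrases go first so they win over their own sub-phrases.
    for phrase in sorted(phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        token = phrases[phrase]
        text = re.sub(pattern, lambda m, t=token: t, text, flags=re.IGNORECASE)
    return text

def preprocess(text, phrases):
    text = reconnect_hyphens(text)
    text = replace_math_tokens(text)
    return replace_from_dictionary(text, phrases)

raw = "Efficiency > 20% for the solar cell commer-\ncialization effort"
clean = preprocess(raw, {"solar cell": "solar_cell"})
```

Collapsing a multi-word term into one token lets later steps count it as a single keyword, which is why dictionary replacement runs last, after the text itself has been normalized.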

To expand the utility of this pipeline when using multiple controlled vocabularies as dictionaries, I added a class that updates an award-level vocabulary index after running through multiple dictionaries. You can find the class here. The class is fully functional but far from optimally performant, so improving it remains on the to-do list. You can find more details about that in the accompanying white paper here.
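
The idea of an award-level vocabulary index can be sketched as follows. This is a hypothetical reconstruction, not the class linked above: the class name, method names, award identifier, and term lists are all invented, and the matching is a simple case-insensitive whole-phrase search.

```python
import re

class AwardVocabularyIndex:
    """Hypothetical index: award id -> vocabulary name -> set of matched terms."""

    def __init__(self):
        self._index = {}

    def update(self, award_id, abstract, vocab_name, terms):
        """Record which terms from one vocabulary occur in one abstract."""
        lowered = abstract.lower()
        for term in terms:
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            if re.search(pattern, lowered):
                by_vocab = self._index.setdefault(award_id, {})
                by_vocab.setdefault(vocab_name, set()).add(term.lower())

    def terms_for(self, award_id):
        """Return the matched terms for an award, grouped by vocabulary."""
        by_vocab = self._index.get(award_id, {})
        return {name: sorted(found) for name, found in by_vocab.items()}

# Invented abstract and term lists, for illustration only.
abstract = "A low-cost sensor for monitoring soil erosion and water quality."
vocabularies = {
    "GEMET": ["soil erosion", "water quality"],
    "MeSH": ["Water Quality", "Neoplasms"],
}

index = AwardVocabularyIndex()
for name, terms in vocabularies.items():
    index.update("award-001", abstract, name, terms)

matches = index.terms_for("award-001")
```

Keeping the matches grouped by vocabulary preserves where each keyword came from, so the same phrase found in two vocabularies is recorded under both. Scanning every term against every abstract, as this sketch does, scales poorly with large vocabularies.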

A special thanks is owed to Travis Hoppe, one of NLPre's co-authors, who was incredibly gracious in talking through the pipeline with me during the ideation phase of this project.

Visualization Libraries

The Explorer uses Leaflet.js for the webmap, D3.js for the other charts and tables, and SlickGrid for the table accompanying the parallel coordinates chart. The parallel coordinates chart was created with the latest BigFatDog edition of Syntagmatic's parallel coordinates module, available on GitHub here. Axis removal icons were added to the parallel coordinates chart using Bootstrap's popovers.