This project was developed and submitted by Seth Schimmel in Spring 2021 towards the capstone requirement for completion of the Master of Science in Data Analysis and Visualization at the CUNY Graduate Center. The project was motivated by my professional experience with data mining and analytics development within a private philanthropic foundation, and inspired by computational research methods being applied in social science fields like economic geography, science studies, and bibliometrics.
By making even a subset of government-funded research more visible for a variety of analysts and a wider public, my hope is that the Public Innovations Explorer can foreground and provoke questions regarding the geography of knowledge production and how “innovation” is defined in relation to particular research agendas. Rather than advancing one particular analysis, the Explorer, along with its repository of scripts and processed data, can contribute to a variety of analyses while also highlighting the ways in which data modeling, analytics, and visualization go hand in hand.
Researchers in bibliometrics develop statistical measures to evaluate the impact and reach of publications. By generating co-authorship and citation networks, for instance, analysts and product designers can make content recommendations based on textual and network-level information or trace genealogies of ideas and collaboration.1,2,3 As a sub-field of bibliometrics, scientometrics applies these quantitative techniques to publication outputs in order to describe phenomena such as: the historical or current dynamics of national and international scientific collaboration;4,5,6,7 the emergence of innovations in particular fields like materials science, topics like big data, or even particular technologies like flash memory or solid-state drives;8,9,10 and the ways in which publication and citation activity differs by gender and relates to career outcomes for scientists.11,12,13 Scientometrics has contributed techniques that are useful both for enhancing research monitoring and distribution platforms and for conducting sociological analyses of various dimensions of scientific practice. By honing methods of studying the outputs of scientific research, including research publications, patenting activity, and references to research in news media and policy documents, scientometrics supports interdisciplinary collaborations with fields like policy studies and economic geography.
While techniques for information extraction and network analysis have grown more robust with the increasing availability of data sources and efficient computing services, economists have been using patent records as a proxy for measuring innovative activity for several decades already.14,15 Current scholarship at the intersection of economic geography and science and technology policy studies uses patent data to identify comparative advantages and disadvantages in certain fields of research activity across geographic regions, offering policy makers and analysts an empirical indicator of regional scientific and technical specializations that might affect policy planning and outcomes.16,17,18,19,20,21,22 The Public Innovations Explorer integrates an economic geography perspective in two important ways.
First, the Explorer identifies spatial clusters for different labor sectors so that users can see the underlying economic profiles of congressional districts on the map. Economists propose various social and microeconomic mechanisms that explain the localization of innovative activities. The concept of "knowledge spillovers" or "human capital spillovers," for instance, has been proposed to capture the way that knowledge sharing is an inherently social activity where proximity to others with similar specialized knowledge can lead to a competitive advantage in some places and a disadvantage in others. While some have proposed that "knowledge" is a special kind of tacit product or service that depends on communication for transmission, others have suggested political-economic explanations of localization that seek to identify how interactions of labor market features and localization, as well as agreements between universities and firms, affect innovation activity.23 Inspired by scholarship that remarks on the localization of specific industries while also aiming to disaggregate findings of innovation spillovers or localizations across specific types of research and invention,24,25,26 the Explorer integrates spatial cluster analysis for employment sectors to support sector-specific and region-specific evaluations of Federal funding activity.
Second, the Explorer adds to the small handful of efforts that use techniques from text mining and linked data to enrich funding data rather than patent and publication data. While much has been done to analyze patent data, publications, and citations as proxies for particular scientific outcomes or to develop regional comparisons, much less work has been done to analyze publicly available grants data,27,28,29 or to assess relationships between geographical specialization and funding flows.30 One reason why far less research has been done on funding flows and grants data than on patents or publications is that the classification and metadata schemes for grant records are much less consistent than those applied to patents and publications. By enriching the SBIR/STTR grants data, albeit imperfectly, I hope to highlight how better metadata could enhance the geographical study of funding flows.
In the landscape of research and knowledge production, grant records are an input that reflects momentary opportunities and priorities. Highlighting the what and where of research funding is one way to evaluate which research initiatives and agendas have been prioritized in the past, and to imagine what research initiatives and agendas might be supported in the future. In order to advance these qualitative analyses, it is important to have consistent qualitative metadata regarding the subject matter of the research activities that are funded. Patent records and scientific publications are frequently used data sources within scientometric research. While researchers often apply natural language processing to abstracts or the body text of publications, it is crucial that these data sources also tend to include carefully curated metadata. Patent documents include a variety of industrial classifications, and publications include a host of subject-area, geographic, and historical classifications. The relative consistency of the schemes makes for relatively clean analyses of co-occurrence between classifications. For scholars in fields like economic geography and scientometrics, the recombinations of thematic classifications or of authorial collaborations are interesting proxies for the recombinant social activities that develop and spread knowledge.
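As a rough illustration of the kind of co-occurrence analysis that consistent classifications make possible, the following Python sketch counts how often pairs of classifications appear on the same record. The function name `cooccurrence_counts` and the sample labels are hypothetical placeholders; they are not drawn from the Explorer's repository or from any real classification scheme.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count how often each unordered pair of labels appears on the same record."""
    pair_counts = Counter()
    for labels in records:
        # Deduplicate and sort so that (a, b) and (b, a) are counted as one pair.
        for pair in combinations(sorted(set(labels)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical records, each carrying a list of classification labels.
sample_records = [
    ["batteries", "chemistry"],
    ["batteries", "chemistry", "machine-learning"],
    ["machine-learning"],
]

for pair, count in cooccurrence_counts(sample_records).most_common():
    print(pair, count)
```

Pair counts like these only stay meaningful when the same label is applied the same way across records, which is why the consistency of a classification scheme matters so much for this type of analysis.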
Over 80% of the 65,749 awards in the 2008 to 2018 SBIR/STTR grants dataset used for this project included information regarding the solicitations for proposals on the basis of which a grant was awarded, and over 90% included topic codes. However, in working with the SBIR.gov databases and doing my best to understand the chain of identifiers used for linking, I was only able to connect 52% of awards to their solicitation information and just 32% to their topic information. It did not appear to me that a single, direct unique identifier was available to cleanly connect award records to the solicitation and topic information. Additionally, across the grant records, just 43% came with Research Keywords. Rather than train a classifier to predict solicitation or topic codes whose linkages to qualitative information would be uncertain, I chose to enrich the data by extracting keywords from grant abstracts for all records.
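To make the idea of extracting keywords from abstracts more concrete, here is a minimal sketch of one common approach, TF-IDF scoring, written in plain Python. It is an assumption for illustration only: the extraction methods actually used for the project are not described in this section, and the function names, stopword list, and sample abstracts below are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "on", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def extract_keywords(abstracts, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each abstract."""
    docs = [tokenize(a) for a in abstracts]
    n_docs = len(docs)
    # Document frequency: the number of abstracts that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically so ties are stable.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results


# Hypothetical abstracts standing in for grant records.
sample_abstracts = [
    "Novel battery electrode materials for battery storage",
    "Sensor materials for underwater navigation",
    "Machine learning for sensor data",
]

for keywords in extract_keywords(sample_abstracts):
    print(keywords)
```

Terms that appear in many abstracts receive low scores, so the extracted keywords tend to be the words that distinguish one award from the others. Because every record has an abstract, this kind of enrichment can cover all awards rather than only those that link cleanly to solicitation or topic information.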
While the keywords supply qualitative information to the Explorer, these data can also be used for a number of extended analyses. Researchers in scientometrics or economic geography may be interested in analyses of novel terms across geographies and time, or in evaluating further whether the research topics funded in certain regions tend to vary according to the specialized labor profile of the region. Alternatively, others interested in curation and metadata standards may want to analyze the overlap between the vocabularies and keyword extraction methods that were utilized. Both efforts could help to develop the more commonplace use of grant records in research on knowledge production.
1 Ganguly, S., & Pudi, V. (2017). Paper2vec: Combining Graph and Text Information for Scientific Paper Representation. In J. M. Jose, C. Hauff, I. S. Altıngovde, D. Song, D. Albakour, S. Watt, & J. Tait (Eds.), Advances in Information Retrieval (pp. 383–395). Springer International Publishing. https://doi.org/10.1007/978-3-319-56608-5_30
2 Satish, S., Yao, Z., Drozdov, A., & Veytsman, B. (2020). The impact of preprint servers in the formation of novel ideas. BioRxiv. https://doi.org/10.1101/2020.10.08.330696
3 Vieira, E. S., & Gomes, J. A. N. F. (2009). A comparison of Scopus and Web of Science for a typical university. Scientometrics, 81(2), 587–600. https://doi.org/10.1007/s11192-009-2178-0
4 Gui, Q., Liu, C., & Du, D. (2019). Globalization of science and international scientific collaboration: A network perspective. Geoforum, 105, 1–12. https://doi.org/10.1016/j.geoforum.2019.06.017
5 Li, J., Yin, Y., Fortunato, S., & Wang, D. (2020). Scientific elite revisited: Patterns of productivity, collaboration, authorship and impact. Journal of The Royal Society Interface, 17(165). https://doi.org/10.1098/rsif.2020.0135
6 Newman, M. E. J. (2001). The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2), 404–409. https://doi.org/10.1073/pnas.98.2.404
7 Stek, P. E., & van Geenhuizen, M. S. (2016). The influence of international research interaction on national innovation performance: A bibliometric approach. Technological Forecasting and Social Change, 110, 61–70. https://doi.org/10.1016/j.techfore.2015.09.017
8 Gläser, J., & Laudel, G. (2015). A bibliometric reconstruction of research trails for qualitative investigations of scientific innovations. Historical Social Research, 40(3), 299–330. https://doi.org/10.12759/hsr.40.2015.3.299-330
9 Li, Y., Li, H., Liu, N., & Liu, X. (2018). Important institutions of interinstitutional scientific collaboration networks in materials science. Scientometrics, 117(1), 85–103. https://doi.org/10.1007/s11192-018-2837-0
10 Ranaei, S., Suominen, A., Porter, A., & Carley, S. (2019). Evaluating technological emergence using text analytics: Two case technologies and three approaches. Scientometrics, 122, 215–247. https://doi.org/10.1007/s11192-019-03275-w
11 King, M. M., Bergstrom, C. T., Correll, S. J., Jacquet, J., & West, J. D. (2017). Men Set Their Own Cites High: Gender and Self-citation across Fields and over Time. Socius, 3. https://doi.org/10.1177/2378023117738903
12 Nosek, B. A., Graham, J., Lindner, N. M., Kesebir, S., Hawkins, C. B., Hahn, C., Schmidt, K., Motyl, M., Joy-Gaba, J., Frazier, R., & Tenney, E. R. (2010). Cumulative and Career-Stage Citation Impact of Social-Personality Psychology Programs and Their Members. Personality and Social Psychology Bulletin, 36(10), 1283–1300. https://doi.org/10.1177/0146167210378111
13 Parker, J. N., Allesina, S., & Lortie, C. J. (2013). Characterizing a scientific elite (B): Publication and citation patterns of the most highly cited scientists in environmental science and ecology. Scientometrics, 94(2), 469–480. https://doi.org/10.1007/s11192-012-0859-6
14 Jaffe, A. B., Trajtenberg, M., & Henderson, R. (1993). Geographic Localization of Knowledge Spillovers as Evidenced by Patent Citations. The Quarterly Journal of Economics, 108(3), 577–598. JSTOR. https://doi.org/10.2307/2118401
15 Verbeek, A., Debackere, K., & Luwel, M. (2004). Science cited in patents: A geographic “flow” analysis of bibliographic citation patterns in patents. Scientometrics, 58(2), 241–263. https://doi.org/10.1023/a:1026232526034
16 Apa, R., De Noni, I., Orsi, L., & Sedita, S. R. (2018). Knowledge space oddity: How to increase the intensity and relevance of the technological progress of European regions. Research Policy, 47(9), 1700–1712. https://doi.org/10.1016/j.respol.2018.06.002
17 Balland, P.-A., Boschma, R., Crespo, J., & Rigby, D. L. (2019). Smart specialization policy in the European Union: Relatedness, knowledge complexity and regional diversification. Regional Studies, 53(9), 1252–1268. https://doi.org/10.1080/00343404.2018.1437900
18 Boschma, R., Balland, P.-A., & Kogler, D. F. (2014). Relatedness and technological change in cities: The rise and fall of technological knowledge in US metropolitan areas from 1981 to 2010. Industrial and Corporate Change, 24(1), 223–250. https://doi.org/10.1093/icc/dtu012
19 Castaldi, C., & Los, B. (2017). Geographical patterns in US inventive activity 1977–1998: The “regional inversion” was underestimated. Research Policy, 46(7), 1187–1197. https://doi.org/10.1016/j.respol.2017.04.005
20 Perruchas, F., Consoli, D., & Barbieri, N. (2020). Specialisation, diversification and the ladder of green technology development. Research Policy, 49(3). https://doi.org/10.1016/j.respol.2020.103922
21 Surana, K., Doblinger, C., Anadon, L. D., & Hultman, N. (2020). Effects of technology complexity on the emergence and evolution of wind industry manufacturing locations along global value chains. Nature Energy, 1–11. https://doi.org/10.1038/s41560-020-00685-6
22 Wouden, F. van der, & Rigby, D. L. (2020). Inventor mobility and productivity: A long-run perspective. Industry and Innovation. https://doi.org/10.1080/13662716.2020.1789451
23 Breschi, S. (2001). Knowledge Spillovers and Local Innovation Systems: A Critical Survey. Industrial and Corporate Change, 10(4), 975–1005. https://doi.org/10.1093/icc/10.4.975
24 Anselin, L., Varga, A., & Acs, Z. (2000). Geographical Spillovers and University Research: A Spatial Econometric Perspective. Growth and Change, 31(4), 501–515. https://doi.org/10.1111/0017-4815.00142
25 Audretsch, D. B., & Feldman, M. P. (1996). R&D Spillovers and the Geography of Innovation and Production. The American Economic Review, 86(3), 630–640. JSTOR.
26 Boschma, et al. (2014).
27 Kardes, H., Sevincer, A., Gunes, M. H., & Yuksel, M. (2014). Complex Network Analysis of Research Funding: A Case Study of NSF Grants. In F. Can, T. Özyer, & F. Polat (Eds.), State of the Art Applications of Social Network Analysis (pp. 163–187). Springer International Publishing. https://doi.org/10.1007/978-3-319-05912-9_8
28 Talley, E. M., Newman, D., Mimno, D., Herr, B. W., Wallach, H. M., Burns, G. A. P. C., Leenders, A. G. M., & McCallum, A. (2011). Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6), 443–444. https://doi.org/10.1038/nmeth.1619
29 Zhang, Y., Zhang, G., Chen, H., Porter, A. L., Zhu, D., & Lu, J. (2016). Topic analysis and forecasting for science, technology and innovation: Methodology with a case study focusing on big data research. Technological Forecasting and Social Change, 105, 179–191. https://doi.org/10.1016/j.techfore.2016.01.015
30 Chausse Vázquez de Parga, I. (2018). A geographical analysis of research trends applying text mining to conference data [Escola Tècnica Superior d’Enginyeria Industrial de Barcelona]. https://upcommons.upc.edu/handle/2117/169067