Getting Up to Speed with Cyberinfrastructure

Gail Steinhart

In the past few months, I've had the good fortune to get started in my new position as the Research Data and Environmental Sciences Librarian by attending two events that have taught me a great deal about cyberinfrastructure (CI) and science informatics. The first was the Cyberinfrastructure Summer Institute for Geoscientists (CSIG), held in August at the San Diego Supercomputer Center. The second was the 5th International Conference on Ecological Informatics, held in early December in Santa Barbara.

Fran Berman, of the San Diego Supercomputer Center, describes CI as "the distributed computer, information, and communication technologies combined with the personnel and integrating components that provide a long-term platform to empower the modern scientific research endeavor." Various facets of this topic as they apply to the geosciences were the focus of the CI summer institute. The CSIG was sponsored by the Geosciences Network (GEON), a collaborative effort of more than a dozen institutions to develop and support cyberinfrastructure for research in the geosciences. The stated goal of the CSIGs (this is the third year one has been offered) is to "provide geoscientists an 'IT headstart,' and to expand the community of IT users in earth science research." This was accomplished through a mix of lectures, demonstrations, and hands-on exercises on topics ranging from geographic information systems to knowledge representation and scientific work flows. Instructors came from several institutions, and the thirty-six attendees included scientists, IT professionals, and data managers from academic and research institutions all over the country, as well as from a few state and federal government agencies. Here are a few of my personal highlights from the workshop:

  • Chaitan Baru (San Diego Supercomputer Center) and Randy Keller (University of Oklahoma) set the stage with an introduction to cyberinfrastructure challenges, including a very brief survey of some domain-specific initiatives aimed at addressing needs in high-performance computing and data integration, as well as the needs common to all disciplines.
  • Ashraf Memon (San Diego Supercomputer Center) and Ramon Arrowsmith (Arizona State University) provided an introduction to Web services, which included hands-on exercises in which we created and deployed some very simple Web services, and a demonstration of the GEON LiDAR work flow. The GEON LiDAR work flow allows users to extract high-resolution elevation data for a geographic area of interest and download it in a variety of formats. This is especially useful, as raw LiDAR data sets can be very large and are not usable in their native format by most common desktop applications, such as GIS software.
  • Chris Condit (San Diego Supercomputer Center) gave an overview of various Internet mapping applications and technologies with an emphasis on open source software and standards.
  • Krishna Sinha (Virginia Tech) gave a truly inspired presentation on knowledge representation. Sinha introduced participants to the challenges and benefits of data integration and made an enthusiastic pitch for participation by domain scientists in developing these approaches.
  • Bertram Ludaescher (UC Davis) and Efrat Jaeger (San Diego Supercomputer Center) wrapped up the instruction portion of the institute with sessions on scientific work flows in general (Ludaescher) and Kepler in particular (Jaeger). What’s exciting about scientific work flows is that they are the "glue" that brings data and applications together in a single environment, making it possible to reproduce, track, and document a data-analysis process. It's possible that some day a graduate student will submit a complete scientific work flow along with his or her dissertation. Ludaescher, in his introduction to scientific work flows, got my vote for best sense of humor by describing CI components very generally as "underware, middleware, and upperware," and by defining "shimeology" as the science of making things fit together that don’t. Jaeger’s presentation was an introduction to Kepler, an open source scientific work flow system based on the Ptolemy system developed at UC Berkeley. My impression of Kepler is that it has some way to go before it becomes a tool of choice for scientists, but it is very promising. Here at Cornell, researchers have asked me how they can document models they develop, and clearly work flow systems like Kepler will offer one way to meet that need.
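
To make the "glue" idea a little more concrete, here is a minimal sketch in Python of what a scientific work flow does: it chains processing steps together and keeps a record of what each step received and produced, so the analysis can be rerun and documented. This is only an illustration, not Kepler's interface. The Workflow class, the step functions, and the sample elevation readings are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


def _digest(value: Any) -> str:
    """Short fingerprint of a JSON-serializable value, used in the provenance log."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


@dataclass
class Workflow:
    """A toy linear work flow: named steps run in order, with a provenance log."""
    name: str
    steps: List[Tuple[str, Callable[[Any], Any]]] = field(default_factory=list)
    provenance: List[dict] = field(default_factory=list)

    def add_step(self, label: str, func: Callable[[Any], Any]) -> "Workflow":
        self.steps.append((label, func))
        return self

    def run(self, data: Any) -> Any:
        self.provenance = []  # start a fresh log for each run
        for label, func in self.steps:
            result = func(data)
            self.provenance.append({
                "step": label,
                "input_digest": _digest(data),
                "output_digest": _digest(result),
            })
            data = result
        return data


# Hypothetical analysis steps for a list of elevation readings in feet.
def drop_missing(values):
    return [v for v in values if v is not None]


def feet_to_meters(values):
    return [round(v * 0.3048, 2) for v in values]


def mean(values):
    return sum(values) / len(values)


workflow = (
    Workflow("mean elevation")
    .add_step("drop missing readings", drop_missing)
    .add_step("convert feet to meters", feet_to_meters)
    .add_step("average", mean)
)

readings_ft = [1000.0, None, 2000.0]  # made-up sample data
mean_elevation_m = workflow.run(readings_ft)
first_log = list(workflow.provenance)

# Running the same work flow on the same input reproduces the same log.
workflow.run(readings_ft)
second_log = list(workflow.provenance)
```

The provenance list is the part that matters for documentation: it names each step and fingerprints its input and output, which is the kind of record a researcher could file alongside a model or a dissertation.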

Attendees of the CSIG in San Diego. Gail is fifth from the right in the front row.

My second trip was to the 5th International Conference on Ecological Informatics, which drew a very international crowd of about 125 computer scientists and ecologists. There were two fairly distinct tracks at this conference. One group of attendees was clearly more interested in novel computational approaches to ecological research, while the other was more focused on issues of knowledge representation and data management and distribution. The conference led off with a keynote talk by Matt Jones, of the National Center for Ecological Analysis and Synthesis (NCEAS). Jones brought us up to speed on developments in the ecoinformatics community with respect to synthetic data-analysis projects and grouped the needs in this area into three categories: access to data, integration of heterogeneous data, and analysis and modeling of data. The activities he described come largely out of the efforts of collaborators involved with the Knowledge Network for Biocomplexity (KNB) and its successor, Science Environment for Ecological Knowledge (SEEK). These initiatives include collaborators from NCEAS, the Partnership for Biodiversity Informatics, San Diego Supercomputer Center, the Long Term Ecological Research Network (LTER), and several other institutions. Approaches to the issues of access, integration, and analysis include data distribution efforts such as the KNB, research in semantic mediation to facilitate scientific work flows (specifically, Kepler), the development of an observation ontology for observational ecological data, and work in resolving ambiguity in taxonomic concepts and schemes to support sharing of species data.

Other talks and posters from the conference that I found particularly interesting:

  • Several talks included some discussion of the challenges of tracking species name changes and changes in taxonomic concepts consistently. Jessie Kennedy (Napier University) described the work of the SEEK taxon working group, which focuses on this problem. Jacob Asiedu (University of Massachusetts Boston) described some of his work on semi-automatic schema mapping between taxonomic databases. The importance of this issue also came up in talks by Stephen Hale, on the species database of the U.S. National Coastal Assessment, and Bob Peet, on VegBank, the Ecological Society of America's vegetation plot database (vegetation plots are a fundamental research unit for plant community ecologists), as well as by other speakers, driving home the need for a taxonomic name and concept authority.
  • Deanna Pennington (University of New Mexico) described one of the use cases that Kepler developers use to inform their development efforts. The case is to create an ecological niche model for every mammal species in the western hemisphere. As input, the models take species distribution data (largely from museum collections) and climate data and produce a predicted geographic distribution for each species. With data on approximately 1,000 species available, using a variety of existing models, this becomes a computationally intensive problem and also requires the integration of large amounts of data from diverse sources. The completed work flow will be available for scientists to modify and re-use as needed. Pennington also presented a poster on training to improve the rate at which scientists adopt new cyberinfrastructure technologies such as Kepler.
  • Chris Jones (UC Santa Barbara) described a metadata-driven approach for generating field data entry interfaces for ecologists that holds some promise for recording metadata at the time field data are collected.
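
For readers unfamiliar with ecological niche modeling, the sketch below shows the basic shape of the computation in the use case Pennington described: occurrence records plus climate data go in, and a predicted distribution comes out. It uses a simple climate-envelope rule (a cell is predicted suitable if every climate variable falls within the range seen at known occurrence sites). I chose that rule only because it is easy to read; it is not necessarily one of the models used in the Kepler use case. All function names, variable names, and data values here are made up for illustration.

```python
from typing import Dict, List, Tuple

Climate = Dict[str, float]              # e.g. {"temp_c": 12.0, "precip_mm": 750.0}
Envelope = Dict[str, Tuple[float, float]]
CellId = Tuple[int, int]                # (row, column) of a map grid cell


def fit_envelope(occurrences: List[Climate]) -> Envelope:
    """Record the min and max of each climate variable across occurrence sites."""
    if not occurrences:
        raise ValueError("at least one occurrence record is required")
    variables = occurrences[0].keys()
    return {
        var: (
            min(site[var] for site in occurrences),
            max(site[var] for site in occurrences),
        )
        for var in variables
    }


def is_suitable(envelope: Envelope, cell: Climate) -> bool:
    """A cell is suitable if every variable lies inside the envelope (inclusive)."""
    return all(low <= cell[var] <= high for var, (low, high) in envelope.items())


def predict_distribution(envelope: Envelope, grid: Dict[CellId, Climate]) -> List[CellId]:
    """Return the sorted ids of grid cells predicted to be suitable."""
    return sorted(cell_id for cell_id, climate in grid.items()
                  if is_suitable(envelope, climate))


# Made-up climate values at sites where a species was collected.
occurrences = [
    {"temp_c": 10.0, "precip_mm": 600.0},
    {"temp_c": 14.0, "precip_mm": 900.0},
    {"temp_c": 12.0, "precip_mm": 750.0},
]

# Made-up climate grid covering four map cells.
grid = {
    (0, 0): {"temp_c": 11.0, "precip_mm": 700.0},
    (0, 1): {"temp_c": 16.0, "precip_mm": 800.0},   # too warm
    (1, 0): {"temp_c": 13.0, "precip_mm": 1200.0},  # too wet
    (1, 1): {"temp_c": 14.0, "precip_mm": 900.0},   # on the envelope edge
}

envelope = fit_envelope(occurrences)
predicted_cells = predict_distribution(envelope, grid)
```

Repeating even this simple calculation for roughly 1,000 species, several models, and continental-scale climate grids gives a sense of why the real use case is both computationally demanding and a data-integration problem.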

Overall, I feel fortunate to have been able to attend both of these events. If a common theme emerged from the two, it would be that domain scientists and information professionals must work together to make progress in developing tools and best practices for managing environmental and ecological data. There are a lot of exciting developments under way, and more to come!