Tapping into Big Data

By Karen Nitkin

The secrets of hundreds of millions of galaxies and stars are stored in a humming, whirring, computer-filled room on the first floor of the School of Arts and Sciences’ Bloomberg Center for Physics and Astronomy. And they have lots of company, such as the genetic coding of loblolly pine trees (roughly seven times longer than the human genetic sequence), sensor-collected soil data, and multi-terabyte data sets used to chart air turbulence in three dimensions.

“What we have here is probably hundreds of times the amount of information in the Library of Congress,” says Alex Szalay, director of the Institute for Data Intensive Engineering and Science (IDIES), standing amid 16 racks of neatly stacked processors and disks holding a combined 10 petabytes of storage.

In recent months, IDIES has refashioned itself as a true university-wide initiative that collects enormous data sets from many sources and makes them available to researchers around the world. In addition to the Krieger School, other university collaborators include the Bloomberg School of Public Health, the School of Medicine, the Sheridan Libraries, and the Whiting School of Engineering.

Ambitious research projects are quickly filling up the space in that computer room. That’s why the university is preparing for the next stage: a High Performance Research Computing Facility slated to open in September at Johns Hopkins Bayview Medical Center in East Baltimore. The joint project with the University of Maryland, College Park, funded with $27 million from the state and $3 million from Hopkins, will have a storage capacity of 20 petabytes. It will start with two megawatts of power, with room to scale up to eight.

The Big Data effort at JHU started about 10 years ago, with the Sloan Digital Sky Survey (SDSS), which was pioneering new ways to collect and study enormous amounts of information, says Szalay, the Alumni Centennial Professor in the Department of Physics and Astronomy and an SDSS leader.

Through IDIES, researchers will be able to piggyback on previous efforts to collect and analyze vast data sets, and to combine that information in entirely new ways.

For example, Steve Salzberg, director of the School of Medicine’s Center for Computational Biology, is deciphering the genome of the loblolly pine tree, which has about 22 billion base pairs. (Since this fast-growing tree is a big cash crop across the southeastern United States, the Department of Agriculture is sponsoring the research.)

Yet even as storage capacity increases and programs process information at ever-quickening speeds, the demand for more storage and faster computing seems infinite.

Even astronomy, already unlocking secrets about the origins of the universe, is about to go turbo. When the Large Synoptic Survey Telescope begins operations in Chile around 2020, it will collect the equivalent of the entire SDSS, “what took us five to 10 years to collect,” in three or four nights, says Aniruddha R. Thakar, principal research scientist with the Department of Physics and Astronomy, who is responsible for day-to-day operations of IDIES.

“We won’t have enough technology to handle it all,” he says. “That’s why we have to keep being innovative.”