Big Data

Science is drowning in bits and bytes of information; Krieger School researchers are casting a lifeline.

By Mike Field

Illustrations by Hank Osuna

Silently, steadily, the three-foot-long cylindrical sensor floats in the cold ocean current, 3,000 feet below the surface. Every 10 days it wakes up. A tiny on-board electric motor pumps oil stored inside the cylinder out into an external bladder and—now positively buoyant—the sensor begins to rise. The device breaks the surface top side up, its two-foot antenna extending high enough above the waves to beam a stream of data to an orbiting satellite far overhead: a precise location fix, along with information gathered during the ascent describing water temperature, salinity, and pressure. Once the data are delivered, the pump reverses direction, pulling the oil back within the cylinder, which descends and floats along for another 10 days before waking up again. Right at this moment there are more than 3,500 of these sensors dispersed through the global oceans, delivering by way of satellite more than 100,000 data sets a year.
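The arithmetic behind that last figure checks out, assuming each float reports once per 10-day cycle:

```python
# Back-of-envelope check of the float figures in the passage:
# 3,500 floats, each surfacing to report once every 10 days.
floats = 3500
cycle_days = 10
profiles_per_year = floats * 365 / cycle_days
print(round(profiles_per_year))  # 127750 -- "more than 100,000 data sets a year"
```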

Thousands of miles away, in a leafy suburban neighborhood north of Baltimore, an array of more than 50 devices connects to a couple of football fields of sensors and buried probes. The devices—called motes—are about the size of a dollar bill and a half inch thick. Powered by two AA batteries, they measure soil temperature and moisture, ambient temperature, light, and the level of carbon dioxide held within the soil, taking new readings every 10 minutes, 24 hours a day, seven days a week. The motes are connected by radio signals to a central receiving node that transmits all the data to a remote computer on the Internet over ordinary phone wires.
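A rough count of what such a network generates, taking the figures given here at face value (about 50 motes, five measured quantities, a reading every 10 minutes; the channel count is my reading of the passage, not the researchers' specification):

```python
# Rough annual data volume for the mote network described above
# (assumed: 50 motes, 5 channels each, one reading every 10 minutes).
motes = 50
channels = 5                       # soil temp, moisture, air temp, light, soil CO2
readings_per_day = 24 * 60 // 10   # 144 readings per channel per day
per_year = motes * channels * readings_per_day * 365
print(per_year)  # 13140000 readings in a year from one suburban site
```

Even this conservative tally runs past 13 million readings a year; the 160 million data points reported later in the article suggests denser sampling or more channels in the actual deployment.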

In an era when you can ask your phone to get you to the nearest pizza parlor and tell your car to parallel park itself, whole battalions of remote digital sensors hardly seem like news. Until you take all the millions of data points they are collecting and start finding ways of connecting them. Assembled randomly they are nothing more than the visual equivalent of white noise. But ask the right questions—and use the right kind of computational approaches and equipment being pioneered at Johns Hopkins—and that field of static turns into a picture like a Seurat painting. None of the millions of data points carries meaning by itself, but read together, they do. This is the new approach to research that is beginning to permeate science, across every discipline. On the Homewood campus, Assistant Research Scientist Inga Koszalka and Associate Research Professor Katalin Szlavecz are bringing new and richer understanding to their disciplines primarily through the application and manipulation of data—lots and lots of data.


How Big is Big?

“A billion here, a billion there, and pretty soon you’re talking about real money,” said Illinois Republican Senator Everett Dirksen back in 1962, when the U.S. federal debt limit was raised to the seemingly staggering level of $300 billion. Today, the number is more than $15 trillion, and Dirksen’s wry observation has special resonance: For most people, numbers with that many zeros after them don’t seem real somehow. This is especially true in the realm of big data, where most of us find it hard to conceptualize the difference between a nest of petabytes and a whole flock of terabytes.


Koszalka is investigating the deep water circulation patterns of the Irminger Sea, a part of the Atlantic Ocean that lies to the east of Greenland. She is learning how—far beneath the waves—seawater travels not in a continuous current but in packets of denser water measuring 30 to 40 kilometers across, moving through in two-to-three-day cycles. The final destination of these waters is the North Atlantic, where they contribute to the large-scale ocean circulation driven by differences in seawater density. By creating simulations of thousands of numerical “floats” that sample fields of a numerical ocean model much as the oceanographic instruments in the real ocean do, Koszalka gains insight and makes observations without standing on the deck of a ship or even getting her feet wet.
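The "numerical floats" are, in essence, virtual particles carried along by the model's velocity field, recording what they see as they drift. A minimal sketch of the idea, with an invented circular flow standing in for the model's currents:

```python
import math

def velocity(x, y):
    """Invented circular flow; a real model would interpolate its own currents."""
    return -y, x

def advect(x, y, dt=0.01, steps=1000):
    """Carry one numerical float with forward-Euler steps, sampling as it goes."""
    track = []
    for _ in range(steps):
        u, v = velocity(x, y)
        x, y = x + u * dt, y + v * dt
        track.append((x, y))  # each entry is one "observation" along the path
    return track

track = advect(1.0, 0.0)
# In this flow the particle should circle near radius 1; the small
# outward drift is forward-Euler error, not physics.
print(f"final radius: {math.hypot(*track[-1]):.3f}")
```

Seeding a numerical ocean model with thousands of such particles gives synthetic trajectories that can be compared statistically with what the real drifting instruments report.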

One floor away in the Olin building, Szlavecz is studying how soil respiration—a naturally occurring phenomenon that puts carbon in the atmosphere at more than 10 times the rate that comes from burning fossil fuels each year—is affected by routine variations in temperature and rainfall. Every day she gathers her data from the soil without ever getting on her knees or dirtying her hands.

Although engaged in the study of radically different environments located many thousands of miles apart, Koszalka and Szlavecz are power-computing enormous quantities of digital data to gain insight into how large natural systems work. Their research takes place within the Department of Earth and Planetary Sciences, but the same kind of large-data-set science is exploding in fields ranging from astronomy to genetics, protein folding to turbulence studies, neurobiology to hydrodynamics. Driven by ubiquitous Internet access, inexpensive remote-sensing technologies, ever more powerful computers, and the continuously falling price of data storage, a new realm of science is opening that promises to revolutionize how we understand the physical world. It is the science of Big Data, and Krieger School researchers are at the very forefront of this effort.

The New Calculus

Most people tend to think of science linearly: as an accelerating series of insights and discoveries that builds continuously upon itself, like a graph line moving upward across time, rising ever more steeply as it goes. Apostles of big data science say that’s not it at all; the history of science is better understood as a series of epochs defined by the tools available to study and understand the natural world. First, and for thousands of years, science was empirical and descriptive, carefully recording what could be seen by the naked eye. Advances in optics and the discovery of lenses introduced a whole new set of tools, and with them a revelatory understanding of the scale of the universe, the Earth’s place in the solar system, and the other, microscopic world invisible to the unaided eye. Then came Kepler, who used observed data to derive analytical expressions about the motion of planets. Kepler’s laws announced the era of analytic science, of Newton and Lavoisier and Maxwell, culminating in Einstein’s theory of general relativity.

At the midpoint of the 20th century, scientists at Los Alamos confronted a new challenge: Although the equations governing nuclear explosions were relatively simple to write down, they were immensely difficult and time consuming to solve. This led to the invention and use of first mechanical and then electronic computers, and the dawning of the age of computational science, which advances understanding through simulations made by solving equations in fields ranging from biology to physics to hydrodynamics.

The arrival of big data science in the last two decades constitutes another scientific revolution. It is perhaps best epitomized by the Human Genome Project, which originally conceived a wet lab approach to sequencing the genome that was expected to take 15 years to complete. But then along came the technique of “shotgun sequencing,” in which the strand of DNA is broken into millions of random small pieces that are sequenced and then reassembled by computers churning through huge volumes of data. This radically different approach allowed then President Bill Clinton to announce the completion of the first “rough draft” of the human genome fully two years ahead of schedule. It represented a landmark success for big data science. For many, it was a pointed sign of things to come.
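The reassembly step can be illustrated with a toy greedy assembler: repeatedly merge the two fragments with the longest suffix-prefix overlap until one sequence remains. (Real assemblers use far more sophisticated graph algorithms; the sequence and reads below are invented for illustration.)

```python
def overlap(a, b):
    """Length of the longest suffix of a matching a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments):
    """Merge the best-overlapping pair until one sequence remains."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, index i, index j)
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j:
                    n = overlap(frags[i], frags[j])
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)]
        frags.append(merged)
    return frags[0]

genome = "ATGCGTACGTTAGCAA"
# overlapping "reads", given out of order as a shotgun run would produce them
reads = ["ACGTTAGC", "ATGCGTAC", "GTTAGCAA", "CGTACGTT"]
print(greedy_assemble(reads))  # prints ATGCGTACGTTAGCAA
```

The hard part at genome scale is exactly what made it a big data problem: millions of reads, repeats that confuse the overlaps, and the sheer volume of pairwise comparisons.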

“Just simply having access to large data sets and more information doesn’t lead to improved knowledge. It needs to be digested in ways that are not very obvious. We need new creative ways to understand and analyze data sets. It’s a very important challenge.”

—Thomas Haine
Professor, Department of Earth and Planetary Sciences

“This is going to completely change the way we think about the nature of knowledge,” says Jonathan Bagger, Krieger-Eisenhower Professor of Physics and Astronomy and vice provost for graduate and postdoctoral programs and special projects. “It’s not by accident that the Web browser was invented at CERN [the European Center for Nuclear Research], which is the big data physics project of our generation. People who are asking the big questions in science understand that there has been a revolution in the tools. It’s just like calculus was invented to enable us to do physics; this is the new calculus for the next 500 years.”

A Lifeline for the Drowning

But if calculus was a system derived to support a science, big data may be better understood as a science in its own right—advancing both through computer hardware and through the increasingly sophisticated operations that hardware performs—uniquely suited for advancing understanding of entire systems. Thomas Haine, professor of physical oceanography, talks about the “grand challenge” of modeling the oceans, such as the Irminger Sea modeling being carried out by Koszalka, who is a postdoc in his group. Understanding a vast and complex system like an ocean is not only intrinsically interesting, he says, but will also provide crucial insights into the process of global climate change. In recent years, the numbers and richness of oceanic measurements and observations have increased dramatically, thanks to the global system of ocean sensors and a complementary program of space observation from satellites using radar altimeters (for sea surface height, which can vary by a meter or more), radiometers (to measure sea-surface temperature), and other instruments. “It raises new challenges in some ways because we’re sort of swamped with data volume,” says Haine, voicing a refrain common among scientists trying to learn how to work successfully in the realm of big data. “Just simply having access to large data sets and more information doesn’t lead to improved knowledge. It needs to be digested in ways that are not very obvious. We need new creative ways to understand and analyze data sets. It’s a very important challenge.”

Calculating the Right Path

Matthew Witten A&S ‘95
Director of CyberKnife Radiosurgery and Chief Physicist in Radiation Oncology, Winthrop University Hospital, Mineola, N.Y.

What if there is not just one solution to a problem but instead a range of solutions, and one of them is optimal?


Data today is such an embarrassment of riches—so plentiful, so detailed, so full of potential—that it seems to be turning science on its head. “It used to be that we had too little data. In field work, someone in hip boots would measure this and measure that, and it would all fit in a notebook,” says Jonathan Bagger, whose role in the Provost’s Office helps support and coordinate big-data initiatives. “Now we have cheap and ubiquitous sensors providing an unending stream of data. Rather than having too little data, now we have too much.” Bagger notes that in science, as in the business world, increasingly you hear people fretting about the data glut—so much information streaming through the Internet and residing in memory that things bog down. “It clogs up the pipes,” is how he puts it. Help is on the way in the form of new approaches to handling data and a new computational infrastructure that came about almost incidentally, in an effort to create a map of the stars.

In 1992 Johns Hopkins became one of a number of universities cooperating in assembling a photographic record of the night sky using a dedicated 2.5-m wide-angle optical telescope at the Apache Point Observatory in New Mexico. Named the Sloan Digital Sky Survey in recognition of its lead funder, the Alfred P. Sloan Foundation, the project over the next eight years obtained images covering more than a quarter of the sky and created three-dimensional maps containing more than 930,000 galaxies and more than 120,000 quasars. “Every participating institution had to pick a piece of the system that they would be responsible for, and Hopkins chose building some of the spectrographic instruments as well as managing the database,” recalls Professor Alex Szalay, of the Department of Physics and Astronomy, who served as a lead researcher on the project.

Since Szalay’s research interests involved using statistics to analyze cosmological data, managing the database seemed like a natural fit. “If we can’t properly aggregate the data, then I can’t do my work,” he says. Early on he recognized that it wasn’t enough just to assemble all the digital images in the sky survey; the real challenge was to find a way to make that huge quantity of data accessible, manageable, searchable—in a word, useful. As the project evolved, Szalay enlisted the help of Microsoft’s Jim Gray, widely recognized as one of the world’s foremost database experts. Szalay says the problems they faced in managing the expected 40 terabytes of image data (See “How Big Is Big?”) intrigued Gray, who saw it as the prototype of how science was to be conducted in the coming years.

“This Data-Scope instrument will be the best in the academic world, bar none.”

—Alex Szalay
Professor, Department of Physics and Astronomy

“We promised we would gather all this data with the derived lists and catalogs of galaxies and stars and make it available to the public,” says Szalay. “At that point, public databases were typically only hundreds or thousands of objects when we were talking of hundreds of millions. It was a thousand to a million times more data than anyone had really tackled.” Two decades later, external hard drives holding terabytes of data have become commonplace. But in 1992, storing, moving, and manipulating that much data presented a serious challenge. As it happened though, Intel co-founder Gordon Moore’s famous formulation—that the number of transistors that can be placed inexpensively on an integrated circuit doubles every two years—held true. Computers grew increasingly powerful and memory increasingly cheap and plentiful as the Sloan project unfolded with the beginning of actual data collection in the year 2000. During this time, Szalay, Gray, and their colleagues gradually became convinced they were working on a frontier of new understanding, not just of the universe through the images they were cataloging and storing but of a whole new way of doing science. And they saw the need for new kinds of tools to make the work possible.

This May, Szalay and four other Johns Hopkins co-investigators plan to power up their unique contribution to advancing the science of big data in a large room in Homewood’s Bloomberg Physics and Astronomy building that used to house mission control operations for the FUSE satellite. About 12 racks of off-the-shelf, high-end computer components have been assembled in a novel configuration to create what they believe will be an epoch-changing instrument for data research. With the ability to handle more than five petabytes of information, read at speeds of 500 gigabytes per second, and draw information from approximately 5,000 disk drives operating in parallel, they expect their creation to outpace the legendary Jaguar supercomputer at the Department of Energy’s Oak Ridge National Laboratory in accessing these petabytes, operating faster by a factor of two. “It will search for patterns and relationships by looking at huge quantities of data from afar, like a telescope,” says Szalay. “But it will also work as a microscope of data, to be able to see not only the big picture but also the tiny details.”
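Taken together, the quoted specifications imply some simple derived figures; this sketch only rearranges the numbers given in the passage:

```python
# Derived figures for the Data-Scope, using only the specs quoted above:
# 5 PB capacity, 500 GB/s aggregate read speed, ~5,000 drives in parallel.
capacity_pb = 5
read_gb_s = 500
drives = 5000

per_drive_mb_s = read_gb_s * 1000 / drives          # GB/s -> MB/s, spread over drives
scan_seconds = capacity_pb * 1_000_000 / read_gb_s  # PB -> GB, divided by read rate
print(f"{per_drive_mb_s:.0f} MB/s per drive")        # 100 MB/s per drive
print(f"full scan: {scan_seconds / 3600:.1f} hours") # about 2.8 hours for all 5 PB
```

The design point is the aggregate: each drive is ordinary, but thousands streaming in parallel let the whole archive be read through in hours rather than months.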

This system, supported by a grant from the National Science Foundation, is appropriately called the Data-Scope, and they confidently expect it will open a new era of computational research. The researchers have identified nearly two dozen research groups within Johns Hopkins alone that currently are struggling with data problems totaling three petabytes or more. Without Data-Scope, says Szalay, “they would have to wait years to analyze that amount of data.”

The Internet of Things

Homaira Akbari

President and CEO, SkyBitz
Member, Department of Physics and Astronomy Advisory Council at Johns Hopkins

Former Krieger School Postdoctoral Fellow at the European Center for Nuclear Research (CERN)

About two decades ago, the U.S. Defense Department faced an interesting dilemma:


One of the first projects slated for investigation on Data-Scope is Inga Koszalka’s mathematical modeling of the North Atlantic Ocean currents. “Data-Scope will enable us not only to move all our data to one place but also to generate new data through simulations and to analyze them directly on the same machine,” she says, noting that Alex Szalay’s insight is that the future of big data lies in moving the analysis to the data, a paradigm shift in which computer-derived large-scale data analysis will begin to drive scientific discovery. It is a shift that comes perhaps not a moment too soon.
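Moving the analysis to the data is the same principle as pushing a query into a database rather than hauling raw records back to the client. A minimal sketch with SQLite, using an invented table of float readings:

```python
import sqlite3

# Instead of fetching every row and averaging client-side, ship the
# computation to where the data lives: the database scans locally and
# returns only the one-number answer. (Table and values are invented.)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (float_id INTEGER, temp REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?)",
               [(i % 10, 4.0 + 0.001 * i) for i in range(10000)])

(mean_temp,) = db.execute("SELECT AVG(temp) FROM readings").fetchone()
print(f"mean temperature: {mean_temp:.3f}")
```

At petabyte scale the same choice decides whether an analysis takes hours on the machine holding the data or weeks of shipping bytes across a network.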

Katalin Szlavecz’s project monitoring soil and other environmental factors in the Cub Hill neighborhood north of Baltimore is, in one sense, a relatively modest research effort, investigating soil respiration in one kind of ecosystem and one particular microclimate. “I’m just one person interested in one aspect of this problem, but anyone who does environmental monitoring ends up creating a lot—a lot—of data,” she says. “Last year we recorded 160 million data points. We’re drowning, and it’s a major headache.”

But cracking the secrets of successfully managing and manipulating big data presents enormous opportunities as well; consider for a moment the names of the companies that have sprung up in the last decade or so that do it successfully: Google, YouTube, Facebook. Big data is in fact a cutting-edge research area in which the United States is an undisputed leader, says Szalay. And in the academic arena, he and his colleagues are racing to position Johns Hopkins at the forefront. “This Data-Scope instrument will be the best in the academic world, bar none,” he has said. “There is really nothing like this at any university right now.”

  • Raymond Yole

    I am interested in the comment referring to Katalin Szlavecz’s work on soil respiration. The item states that the process “puts carbon in the atmosphere at more than 10 times the rate that comes from burning fossil fuels each year”. I would like to have reference to the primary source for this statement, as I am unsure whether this refers to the total annual contribution or the unit production rate per area or volume. The web site article was very interesting. Thank you.

    • Ian Mathias

      We took your question to Katalin, and here’s what she said:

“The statement refers to annual global carbon flux (“total annual contribution”, as you wrote). There are many references for this; I am pasting one website, but searching for “global carbon cycle” will bring up many references. Fossil fuel burning releases about 5-6 Gt C per year, and the arrow from the soils to the atmosphere is about 60 Gt per year; this is soil respiration, sometimes labeled as decay or decomposition.
      http://www.global-greenhouse-warming.com/global-carbon-cycle.html

“This might be misleading unless you know that the 60 Gt, and the other 60 Gt C that comes from plant respiration, is globally balanced out by plant uptake of CO2. Therefore, if there is an equilibrium, everything is OK. However, as we see, even a tenth of this amount annually released to the atmosphere that is not taken up causes tremendous problems. The issue is that if there is a major disturbance (land clearing, fires, invasive species, warming temperatures), even a small change in soil respiration can result in an excess amount of C release. This is one reason we want to understand how ecosystems respond to such disturbances.”