For the past week, the flooding in the Upper Midwest has been all over the news, as rivers have reached record levels and thousands of people have been evacuated across several states. A couple of ScienceBloggers have been personally affected, and we hope that they, their families, and their labs continue to be safe and dry.

Floods are a personal fascination for me, as I can trace my interest in hydrology directly to the 1993 Mississippi River floods that affected my hometown in Minnesota. However, flood recurrence intervals are also one of my professional pet peeves. I make sure that students in my classes never walk away with the misconception that a 500-year flood can only happen once every 500 years. If you finish reading this post, you’ll be disabused of the notion as well.

The most important point is that an “X-year flood” is a poorly chosen way of expressing the probability of a flood of a given magnitude happening in a given year. A 500-year flood has a probability of 1/500, or 0.2%, of happening in any given year. It is just like flipping a coin: the probability of getting heads is always 50% on the next flip, even if you happen to get heads three times in a row. In the same way, if a river has a 500-year flood in 2008, there is the same probability of having such a big event in 2009. That’s bad news for those flood victims with a poor understanding of probability. Fortunately, a quick scan of this round of media coverage has revealed very few reporters getting it wrong (and some news outlets even taking time to get it right).
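As a quick sanity check on that arithmetic, here is a minimal sketch in Python. It treats each year as an independent trial with probability 1/500, which is itself an assumption (real annual peaks may be serially correlated, as some commenters note below):

```python
# Each year is treated as an independent trial with p = 1/500
# (an assumption; real annual peaks may be serially correlated).
p = 1 / 500

# Chance of seeing at least one "500-year" flood over a span of n years:
for n in (1, 2, 15, 100, 500):
    at_least_one = 1 - (1 - p) ** n
    print(f"{n:>3} years: {at_least_one:.1%}")

# Even over a full 500 years, the chance of at least one such flood
# is only about 63%, not a certainty.
```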

Flood probabilities are based on historical records of stream discharge. Let’s use the Iowa River at Marengo, Iowa as an example. It reached a record discharge of 46,600 cubic feet per second* (1320 m³/s) on 12 June. That flow was estimated to have a 500-year recurrence interval, based on 51 years of peak flow records. Here’s a graph of the peak flow record for the site:

Typically flood probabilities are based on a time series of the highest instantaneous discharge measured during a given water year (1 October to 30 September). These data are then fitted to a statistical distribution, often the log-Pearson Type III distribution. These distributions then allow the estimation of the probability (or recurrence interval) of a flood of a given magnitude. Taking the peak flow time series from the USGS website and using the distribution above, I also get a ~500-year recurrence interval (0.2% probability) for the flood of 2008. But there’s a big problem here…I’m estimating a 500-year flood based on only 51 years of record. So I’m going beyond my data by a factor of 10!
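To see how far 51 years of record can take you on their own, consider the empirical recurrence intervals hydrologists assign with Weibull plotting positions, T = (n + 1)/rank. This sketch uses hypothetical peak flows (the real series would come from the USGS gage file):

```python
# Hypothetical 51-year annual-peak record (cfs), sorted largest first.
peaks = sorted((6000 + 137 * i for i in range(51)), reverse=True)

n = len(peaks)
for rank, q in enumerate(peaks[:3], start=1):
    T = (n + 1) / rank  # Weibull plotting position
    print(f"rank {rank}: {q} cfs, empirical T = {T:.1f} yr")

# The largest flood ever observed gets T = (51 + 1)/1 = 52 years,
# so a "500-year" label can only come from a fitted distribution,
# i.e., extrapolation well beyond the observations.
```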

When you are extrapolating beyond your data by an order of magnitude, the highest points in the dataset start to have a lot of leverage. Let’s imagine that there’s another big flood on the Iowa River next year and we do the same analysis. Now our dataset has 52 points, with the highest being the flood of 2008. When that point is included in the analysis, a discharge of 46,600 cubic feet per second* (1320 m³/s) has a recurrence interval of <150 years (>0.6%). It’s still a darn big flow, but it doesn’t sound quite so biblical anymore.
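The leverage effect can be sketched numerically. This toy example uses a method-of-moments Gumbel fit and made-up discharges (the actual analysis above used a different distribution and the real gage record), so the specific numbers are illustrative only; the point is the direction of the change once the extreme year joins the record:

```python
import math

def gumbel_T(peaks, q):
    """Recurrence interval (years) of discharge q from a
    method-of-moments Gumbel fit to the annual-peak series `peaks`."""
    n = len(peaks)
    mean = sum(peaks) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in peaks) / (n - 1))
    beta = std * math.sqrt(6) / math.pi       # Gumbel scale parameter
    mu = mean - 0.5772156649 * beta           # Gumbel location parameter
    p_exceed = 1 - math.exp(-math.exp(-(q - mu) / beta))
    return 1 / p_exceed

# Hypothetical 51-year record (cfs): ordinary years plus one big flood.
record = [6000 + 150 * i for i in range(50)] + [30_000]
flood_2008 = 46_600

T_before = gumbel_T(record, flood_2008)                # fit without 2008
T_after = gumbel_T(record + [flood_2008], flood_2008)  # fit including 2008

print(f"without 2008 in the record: T ~ {T_before:,.0f} yr")
print(f"with 2008 in the record:    T ~ {T_after:,.0f} yr")
```

Adding the single largest point inflates both the mean and (especially) the standard deviation of the fit, so the same discharge suddenly looks far less improbable.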

OK, so our predictions of the probability (recurrence interval) of big floods can be really wrong just because of the limited nature of historical data (the situation is better in some other parts of the world). But there are a number of other possible confounding factors. First, erosion or deposition in the channel and surrounding areas over time can change the height a flood reaches relative to its discharge. And flood height is what those people manning the sandbags really care about. Second, changes in the watershed can affect how much and how quickly precipitation makes it to the river. Urbanization and the addition of impervious surfaces is one cause of increasing flood peaks, but in Iowa, a more likely culprit is agricultural drainage. Between the 1780s and the 1980s, more than 95% of Iowa’s wetlands were drained. (Most of this drainage occurred before the 1930s, so it is unlikely to affect the example above.) Conversely, flood control dams (like Coralville Dam on the Iowa River) can suppress flood peaks downstream. Another potential culprit is climate change, though it is nearly impossible to attribute the occurrence or magnitude of any one event to changing climate.

All right, with all of that under your belt, the next time you hear someone say something like “They said 1993 was a 500-year flood. How can we be having another one only 15 years later?” you can patiently explain that recurrence intervals are only shorthand for probabilities. The hydrology professors of the world will thank you.

*Still the standard units for reporting discharge in the U.S.

The probability of a 500-year event occurring exactly once in a year is more like 0.001996 (not quite 1/500) if you assume it is a Poisson process. For two occurrences in 15 years the probability is about 0.000437.

We use the Poisson model for earthquakes; I don’t know if it applies to floods.

How do you estimate the confidence of the fitting when the extrapolation is an order of magnitude greater than the data?
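One rough answer, offered as a sketch rather than established hydrologic practice: bootstrap the annual-peak record, refit each resample, and look at the spread of the resulting recurrence-interval estimates. The data and the simple method-of-moments Gumbel fit below are hypothetical stand-ins, not the post’s actual analysis:

```python
import math
import random

def gumbel_T(peaks, q):
    """Recurrence interval of q from a method-of-moments Gumbel fit."""
    n = len(peaks)
    mean = sum(peaks) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in peaks) / (n - 1))
    beta = std * math.sqrt(6) / math.pi
    mu = mean - 0.5772156649 * beta
    p_exceed = 1 - math.exp(-math.exp(-(q - mu) / beta))
    return 1 / max(p_exceed, 1e-12)  # guard against floating-point underflow

random.seed(0)
# Hypothetical 51-year annual-peak record (cfs).
record = [6000 + 150 * i for i in range(50)] + [30_000]

# Resample the record with replacement and refit 2000 times.
estimates = sorted(
    gumbel_T([random.choice(record) for _ in record], 46_600)
    for _ in range(2000)
)

lo, hi = estimates[50], estimates[-51]  # rough 95% bootstrap interval
print(f"95% of refits give T between {lo:,.0f} and {hi:,.0f} years")
```

The interval is typically enormous when you are an order of magnitude beyond the data, which is exactly the point of the question.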

“When you are extrapolating beyond your data by an order of magnitude…” that pretty much sums it up for me, and is a major reason I choose not to pollute my brain with TeeVee weathermen.

Interesting analysis. I’m curious: if you recalculate the probability based on adding in the 2008 data, is the 1993 flood still a >500 year event?

But the probability of 2 “500-year” floods in 2 consecutive years is surely (approximately, thanks Erik) 0.002 x 0.002 = 0.000004, is it not? Just like the probability of getting 2 heads in a row is 0.5 x 0.5 = 0.25.

Mark P – there is actually some nifty statistical theory called extreme value theory which gives a general result for the distribution of extremes. It’s surprising, but there is a single distribution that works for many cases – I have a student who used it to look at annual records of caught salmon.

One problem with floods is that they depend on rainfall, which may not be totally random, so more rain over the past 20 years would mean more floods than predicted. That could be incorporated into the estimation too.

The problem with computing the probability of a particular flood occurring is that the probability changes depending on its own occurrence. Since the flood record is much shorter than 500 years, if two “500-year” floods occur in consecutive years, the flood probability gets recalculated, and a 500-year flood becomes a flood with a shorter interval.

Divalent: The recurrence interval of the 1993 event also decreases. In the example above, since the 1993 event is smaller than the 2008 event, its recurrence interval is probably now about 100 years. (I’m not looking at my spreadsheet anymore).

Andrew: Yes. But since each flood is an independent event, that sort of probability math leads to public confusion and would also make for really bad policy if used to design levees, etc.

Andrew – No, because you have more information. The probability of X given Y is not the same as the probability of X and Y. Knowing you had a flood the previous year does nothing to the chances of it occurring the next year just as knowing you got tails does nothing to the chances for the next flip.

The calculated probability is changing from 1:500 to 1:150 not because the actual chance of the event is changing but because there is more data available for a more accurate estimation of the probability.

“The most important point is that a “X-year flood” is a poorly-chosen way of expressing the probability of a flood of a given magnitude happening in a given year.”

I wholeheartedly agree. I always cringe when I hear it … but it seems to be fully established in popular vernacular.

Andrew:

For a Poisson event:

P[N=n] = (l*t)^n * exp(-l*t) / n!

n = number of events (2)

t = time period of interest (15 yrs)

l = mean occurrence frequency (1/500 yrs)
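Plugging these definitions into a few lines of Python (the rate and time horizons here are just the values from the discussion):

```python
import math

def poisson_pmf(n, lam, t):
    """P[N = n] events for a Poisson process with rate lam over duration t."""
    lt = lam * t
    return lt ** n * math.exp(-lt) / math.factorial(n)

rate = 1 / 500  # one "500-year" flood per 500 years, on average

print(poisson_pmf(1, rate, 1))   # exactly one flood this year: ~0.001996
print(poisson_pmf(2, rate, 2))   # two floods in two years:     ~8.0e-06
print(poisson_pmf(2, rate, 15))  # two floods in 15 years:      ~0.000437
```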

My unaided (by statistics) eyeball sees a lot of serial dependence in that time series. I’d be real reluctant to trust a straight probability estimate that ignores that serial dependence. I trade derivatives for a hedge fund, and not a few meltdowns have occurred because their risk modelers were using distributions that throw time away.

I’d suspect some autocorrelation in the data, because high rainfall one year would lead to residual groundwater, increasing the risk of flooding the next year. Tamino, at Open Mind, has a nice discussion of the autocorrelation effect, and points out that one early area of application was annual hydrology records.

http://tamino.wordpress.com/2008/06/10/hurst/#more-797
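A minimal way to check for the serial dependence these comments describe is the lag-1 sample autocorrelation of the annual-peak series. A sketch with made-up numbers:

```python
def lag1_autocorr(x):
    """Lag-1 sample autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    return cov / var

# Hypothetical peak flows where wet and dry years cluster...
clustered = [5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 5, 6, 7, 8, 9]
# ...versus similar values where wet years tend to follow dry ones.
alternating = [5, 9, 6, 8, 5, 9, 7, 6, 8, 9, 5, 7, 8, 6, 9]

print(lag1_autocorr(clustered))    # ~0.67: strong year-to-year persistence
print(lag1_autocorr(alternating))  # negative: wet years follow dry ones
```

A strongly positive value would be a warning that recurrence-interval estimates built on an independence assumption are shakier than they look.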

Thanks, Erik for the reminder. It’s been a long time since I took statistics.

My old text has the formula

f(k;λ) = λ^k exp(-λ)/k!

with

k = actual # of occurrences and

λ = expected # of occurrences during the time interval of interest.

Assume we are talking about a so-called “500-year” flood.

Let’s calculate the probability of having 2 such floods in consecutive years.

Then

k = 2

λ = 0.002*2 = 0.004

Plugging into the formula we get f = 0.004^2 * exp(-0.004)/2!

= 1/2 * 0.000016 exp (-0.004)

= 0.000008 * ~1

= 8 * 10^(-6)

What about the probability of 2 “500-year” floods in 15 years (as seems to have happened in the U.S. Midwest)?

Now

k = 2

λ = 0.002*15 = 0.03

So f = 0.0009 * exp (-0.03)/2 = 0.00045 * 0.97 ≈ 0.00044 = 4.4 * 10^(-4).

The question now, as has been pointed out, is whether under these circumstances our ‘estimated’ value of λ is tenable. Can someone show how to calculate the probability that λ (or equivalently, the postulated recurrence period) has been correctly determined?

Regards.

The Greek “lambda” looked OK in preview mode but seems to have been corrupted to an “l-hat =” in the post. The post should still be readable, I trust.

Sorry about that.

[no worries – I have edited the HTML so it now shows up correctly – Chris]

Regards.

The main point being missed is that these recurring events are NOT 500 year floods. The likelihood of severe flooding has increased over time because of floodplain development, higher levees, navigational structures in the river channel, wetland destruction, land use changes reducing infiltration capacity, and possibly changes in rainfall. For example, the modest river stage attained this week at St Louis, 37.3 feet, has now been attained or exceeded 12 times since 1943, but was exceeded only once in the interval between 1861 (when daily records began) and 1942.

River flow is a better choice for calculating flood recurrence statistics, though it is more difficult to measure. Equivalent (within error) river flows at St Louis of 1 Mcfs are known to have occurred in 1844, 1903 and 1993. Three such 1 Mcfs occurrences at St Louis in ca 160 years indicate that these floods are not 500-year, or even 100-year, occurrences. Besides, our historical records are too short for anyone to thoughtfully define what a “500-year” flood would be. Given all these problems, we would be far better off abandoning this language altogether, as it is routinely used to justify floodplain development by understating flood risks and overstating the level of flood protection. It’s pretty obvious where that tired, standard approach got us this year.

Streamflow probability distributions have been studied for a long time, and the hydrology literature is full of discussions of probability descriptions, methods of estimating flow characteristics based on incomplete data, &c.

I particularly like the TV interviews where some flooded-out resident says, “I’ve lived here for 15 years and it’s never flooded here!” despite the interviewee being filmed on an obvious floodplain.

Here’s a press release from the USGS itself on the very same subject: http://www.usgs.gov/newsroom/article.asp?ID=1963.

Also, the comment by hydrologist Bob Criss above is a good one for those not interested in getting too bogged down in statistics.

Bob: This year’s flood covered a relatively limited part of the watershed upstream of St Louis, so what might be a 500-year flood at Cedar Rapids could be a 100-year flood when it hits St Louis, and a 110-year flood when it hits New Orleans. A bigger problem with these small data sets is that there is not enough data to establish the shape of the curve at the tails. If the tail is fatter than expected, the computed recurrence intervals for rare events could be seriously off.

Another question, regarding secular global (or near-global) changes: GW and/or land-use changes, which are fairly similar across the world. It seems we’ve been seeing a lot of high-amplitude events (Britain last summer, a couple of African floods earlier this year, and now this one). If our effective media sample size is 500 (i.e., 500 independent flood basins in the world), we would expect about one 500-year flood to hit the news every year. But if there are only 100 zones, and we see one per year, we either have a fat-tails problem or climate change. Do we have any idea what rate of these disasters would be “normal”?

Wonderful explanation. I haven’t seen it explained that well since I took a geology class with a professor who was fascinated with floods.

A big factor that is often overlooked, although I was happy to see it above in some of the comments and in the article itself, is that people have a huge impact on flooding. With so much development, lots of rainwater is siphoned directly into the river rather than soaking into the ground. Cutting down trees has the same effect.

Living near a river in a flood plain, and working on said river as a guide, I hear the ‘X-year flood’ nonsense all the time. Often, when I explain what it actually means and show people an image of the flood plain, a few of them get nervous, as their houses are usually in that area.

I think there needs to either be a better set of terminology that we can use for media work, or we need to get better news people that can explain science.

Your post is very helpful to me, as I have a project on this topic. Thank you so much for putting up a beautiful piece about 500-year floods. I encourage you to post many more. Bye.