EIA Guidelines for Statistical Graphs

DOE/EIA-0465(98)

Scatter Diagrams

Scatter diagrams are probably the most frequently used graph format in scientific analysis. They are, in essence, two-way, or bivariate, frequency distributions that show the degree and type of relationship or covariance between two data series, provided there are sound theoretical and/or empirical grounds to believe a relationship may exist. This chapter discusses the properties and construction of scatter diagrams, followed by an example (Figure 24). For simplicity, this chapter will discuss linear relationships between variables and only touch upon nonlinear analysis.

In scatterplots, the relationship or correlation between two variables may be positive, negative, or zero. The relationship is positive, or direct, when the two variables change proportionately or nearly proportionately in the same direction: as one increases or decreases, so does the other. The correlation between the two series is negative, or inverse, when the two variables vary in opposite directions.

A positive correlation is indicated by a linear or near-linear pattern of tally marks (e.g., dots, asterisks, or letters) running from the lower-left corner to the upper-right corner of the graph. Conversely, if the tally marks tend to scatter diagonally from the lower-right corner to the upper-left corner, the correlation is negative. The precise degree of relationship between the two variables can be measured by the coefficient of correlation, which is a pure number that ranges in value from positive one (+1) down through zero (0) to negative one (-1).
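As an illustration, the coefficient of correlation can be computed directly from two data series. The sketch below is a minimal Python example that assumes the usual product-moment (Pearson) definition; the function name `pearson_r` and the sample values are hypothetical and are not taken from this report.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the coefficient of correlation between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each series
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical series: the first pair rises together, the second moves in opposite directions
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

A result close to +1 corresponds to tally marks running from lower left to upper right, and a result close to -1 to marks running from lower right to upper left.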

When the relationship is weak or nonexistent, the degree of correlation is low: the data points are widely scattered over the entire chart, with little or no tendency to align themselves diagonally in either direction. The data could also show an arc pattern from one corner (e.g., lower left) to the opposite side (e.g., lower right), called curvilinear, or some other type of curving pattern. [32]

Scatter diagrams also have purposes other than showing the direction (positive or negative) and strength of a relationship. They can be used:

  1. To explore relationships between variables even when there is no linear dependence. For example, the data could reveal a circular pattern. This pattern is worth examining further through nonlinear analysis to see if there is some type of relationship.
  2. To analyze deviant cases or outliers that do not follow the pattern portrayed by most of the data. This is one of the stronger points of scatter diagrams. Analysis of outliers is often more interesting than the general pattern and can reveal important insights about the data. Analysis of outliers is an important aspect of An Assessment of the Quality of Selected EIA Data Series reports released periodically by EIA's Statistical Methods Group.
  3. As a preliminary tool prior to running a bivariate regression analysis, to show if there is a relationship (or association) between the dependent and independent variables and, if there is, the direction and strength of that relationship. If the scatter diagram shows little or no relationship or association, or a strong nonlinear relationship, the author will not need to run a linear regression analysis. A number of EIA products present regression analysis.
  4. To plot together two variables that are not linearly dependent in place of dual Y-axes in a simple line graph (discussed in the chapter on line graphs).
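The first use listed above, a pattern with no linear dependence, can be illustrated with a short Python sketch. It assumes the product-moment definition of the coefficient of correlation, and the helper names `correlation` and `circle_points` are hypothetical. Points placed evenly around a circle are clearly related, yet their linear correlation is essentially zero, which is why the plot itself is worth inspecting.

```python
from math import cos, sin, pi, sqrt


def correlation(xs, ys):
    """Product-moment coefficient of correlation (assumed definition)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def circle_points(n):
    """Return n points spaced evenly around a circle of radius 1."""
    angles = [2 * pi * k / n for k in range(n)]
    return [cos(a) for a in angles], [sin(a) for a in angles]


xs, ys = circle_points(36)
r = correlation(xs, ys)  # very close to zero despite the exact circular relationship
```

A near-zero coefficient here reflects only the absence of a linear relationship; every point still satisfies x squared plus y squared equals one.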

Constructing a Scatter Diagram

Axes
The axes are constructed at right angles to each other. The scale for each axis (variable) runs from the lowest to the highest value or score, so a scatter diagram scale does not need to include zero.

The next step is to select suitable scale divisions or intervals for labeling. If both variables are measured in the same unit, the same intervals are used for each. The spacing of the scale divisions on each axis needs to be large enough to accommodate the symbols.
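One way to choose such a scale is sketched below in Python. This is an illustrative approach rather than a method prescribed by these guidelines: the function `axis_ticks`, its `target_intervals` parameter, and the rule of rounding to steps of 1, 2, 5, or 10 times a power of ten are all assumptions. The scale covers the lowest to the highest value and is not forced to start at zero.

```python
from math import floor, ceil, log10


def axis_ticks(values, target_intervals=5):
    """Return evenly spaced scale divisions that cover min(values)..max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need at least two distinct values to build a scale")
    raw_step = (hi - lo) / target_intervals
    magnitude = 10 ** floor(log10(raw_step))
    # Smallest "round" step (1, 2, 5, or 10 times a power of ten) that is at least raw_step
    step = next(
        (m * magnitude for m in (1, 2, 5, 10) if m * magnitude >= raw_step),
        10 * magnitude,
    )
    # Extend the ends outward to the nearest multiple of the step
    start = floor(lo / step) * step
    end = ceil(hi / step) * step
    count = round((end - start) / step)
    return [start + i * step for i in range(count + 1)]


# Hypothetical scores: the resulting scale begins at 10, not at zero
ticks = axis_ticks([12, 47, 33, 28])
```

When both variables share a unit, passing the combined values of both series to the same function is one simple way to give the two axes identical intervals.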

Plotting Data
Each entry that is recorded by a symbol in the proper coordinates (or cell) always represents two numerical values, one measured on the X-axis and the other on the Y-axis. Enhancements can be used to distinguish data points, or sets of data points, from each other. This can bring, in effect, other variables and, hence, more information and other dimensions to the analysis.

Different symbols or colors, for example, can be used to distinguish data points from different years, decades, countries, States, classes, etc. Research has shown that distinct groups of observations on a set of common variables in the same scatterplot can be compared by using different symbols, letters, or colors to differentiate each group. The findings have indicated that readers can distinguish among different strata and can accurately and quickly estimate the degree of correlation (or association) between the two variables for each group, particularly if letters (e.g., H, Q, and Z) and colors (e.g., green and yellow) that do not look alike are used. [33] (Cleveland and McGill also discuss several smoothing techniques that can be used to emphasize trends.) The research on differentiation was inconclusive on the selection of symbols (e.g., diamonds and squares). [34]
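The plotting step and the use of dissimilar letters can be sketched in Python as a simple text grid. The functions `assign_symbols` and `render_scatter`, the grid dimensions, and the sample data are hypothetical; the sketch only shows how each entry lands in one cell defined by its X and Y values and carries a letter for its group.

```python
def assign_symbols(groups, alphabet="HQZ"):
    """Map each distinct group label to a letter that does not resemble the others."""
    distinct = sorted(set(groups))
    if len(distinct) > len(alphabet):
        raise ValueError("not enough distinct symbols for the groups")
    return dict(zip(distinct, alphabet))


def render_scatter(points, width=20, height=10):
    """Return a text grid with each (x, y, symbol) entry placed in its cell."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y, symbol in points:
        # Scale each value onto a column or row index, lowest value first
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1)) if x_hi > x_lo else 0
        row = round((y - y_lo) / (y_hi - y_lo) * (height - 1)) if y_hi > y_lo else 0
        # Row 0 is printed at the top, so flip the Y index
        grid[height - 1 - row][col] = symbol
    return "\n".join("".join(line) for line in grid)


# Hypothetical observations tagged by decade
data = [("1970s", 1, 2), ("1970s", 2, 3), ("1980s", 6, 7), ("1980s", 8, 9)]
symbols = assign_symbols(label for label, _, _ in data)
chart = render_scatter([(x, y, symbols[label]) for label, x, y in data])
```

If two entries fall in the same cell, this sketch keeps only the last one drawn; a fuller implementation would need a rule for overlapping points.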

Blocks (or circles) can also be constructed of different sizes according to the number of observations in the display they represent, if group size is relevant. [35]

Quattro Pro offers a variation of this technique, called the "bubble graph." In a bubble graph, circle (or bubble) size does not represent the relative number of observations. The bubble represents the measurement of a third variable, in addition to the X- and Y-axis coordinates. For example, users could graph the sales (X-axis), profits (Y-axis), and assets (bubble size) of a half dozen interstate natural gas pipeline companies to show that a company with fewer assets (bubble size) had near or greater sales or profits from natural gas than a company with more assets. (One drawback to "bubble charts" is that circles are a poor visual metaphor. Readers have difficulty visually distinguishing circle sizes representing similar values.)
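The sizing of bubbles can be sketched in Python as follows. The sketch assumes that bubble area, rather than radius, is made proportional to the third variable; this is a common convention but an assumption here, not a description of how Quattro Pro or any other package sizes its bubbles. The function name `bubble_radii` and the asset figures are hypothetical.

```python
from math import sqrt


def bubble_radii(values, max_radius=20.0):
    """Return radii whose circle areas are proportional to the given values."""
    if any(v < 0 for v in values):
        raise ValueError("bubble sizes require non-negative values")
    largest = max(values)
    if largest == 0:
        return [0.0 for _ in values]
    # Area grows with the square of the radius, so take the square root of the share
    return [max_radius * sqrt(v / largest) for v in values]


# Hypothetical assets for two companies, one four times the size of the other
radii = bubble_radii([1, 4], max_radius=10)
```

Under this assumption a fourfold difference in the third variable only doubles the radius, which helps show why similar values can be hard to tell apart by eye.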

Example of a Scatter Diagram

Figure 24 is a scatter diagram that portrays the U.S. distribution of dry holes drilled versus total oil and gas wells completed every year from 1949 to 1992. Each data point represents the data for a particular year, with the number of dry holes completed on the Y-axis and the total number of wells completed on the X-axis. The pattern of data points clearly indicates a positive correlation between the two variables. This means that the ratio of the two is fairly constant during this period, even though the yearly totals vary widely.

The circled data in the upper right-hand corner of Figure 24 represent the 1980 to 1985 period after the "oil shock" of 1979 related to the Iranian hostage crisis. These data are below the best-fit line, which means that fewer dry holes were drilled per well completed than the line would predict. The line was started at the origin (the zero point of the X and Y axes) and drawn through the coordinates of the means of X and Y. The data illustrate that, when drilling is very high, the "hit rate" tends to improve. If these data (particularly total holes completed) were also plotted against crude prices for this period, there might be a strong, positive correlation.

Figure 24. U.S. Distribution of Dry Wells vs. Total Wells Completed, Oil and Gas Exploratory and Development Wells, 1949 to 1992

Sources: 1949-1965, Gulf Publishing Company, World Oil, "Forecast Review" issue; 1966-1969,
American Petroleum Institute (API), Quarterly Drilling Statistics for the United States, annual
summaries and monthly reports; 1970 forward, EIA computations based on well reports submitted
by API.
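The line described for Figure 24, drawn from the origin through the point formed by the means of X and Y, can be sketched in Python as follows. This reproduces only the construction described in the text, which is not necessarily the same as a least-squares fit. The function names and the well counts are hypothetical and are not the data behind Figure 24.

```python
def origin_mean_slope(xs, ys):
    """Slope of the line through the origin and the point (mean of X, mean of Y)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    if mean_x == 0:
        raise ValueError("the mean of X must be nonzero to define the slope")
    return mean_y / mean_x


def indices_below_line(xs, ys, slope):
    """Return the positions of points that fall below the line y = slope * x."""
    return [i for i, (x, y) in enumerate(zip(xs, ys)) if y < slope * x]


# Hypothetical yearly totals: wells completed (X) and dry holes (Y), in thousands
total_wells = [10, 20, 30]
dry_holes = [4, 8, 9]
slope = origin_mean_slope(total_wells, dry_holes)
below = indices_below_line(total_wells, dry_holes, slope)
```

Because the slope equals the average ratio of dry holes to total wells, a point below the line marks a year in which the share of dry holes was lower than average.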

Circling the 1980 to 1985 data in Figure 24 is technically a simple and unsophisticated example of "brushing." Brushing allows analysts to move a rectangle (or some other shape) around the screen with a mouse and to "click on" and label a subset of the data. The analyst can then color code, change the aspect ratio (the physical length of the Y-axis divided by the length of the X-axis), delete, or perform other operations on the highlighted data points. [36] Brushing is particularly useful in scatter diagrams with outliers and/or many data points, where it is difficult, if not impossible, to separate the points with the naked eye. In short, software advances have allowed greater analytical use of scatter diagrams.
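The selection step of brushing can be sketched in Python as a rectangle test on the plotted coordinates. This is a minimal illustration and not the behavior of any particular graphics package; the function names `brush` and `aspect_ratio` and the sample points are hypothetical.

```python
def brush(points, x_min, x_max, y_min, y_max):
    """Return the positions of the (x, y) points that lie inside the rectangle."""
    return [
        i
        for i, (x, y) in enumerate(points)
        if x_min <= x <= x_max and y_min <= y <= y_max
    ]


def aspect_ratio(y_axis_length, x_axis_length):
    """Physical length of the Y-axis divided by the length of the X-axis."""
    return y_axis_length / x_axis_length


# Hypothetical points; the rectangle covers the right-hand part of the chart
points = [(1, 1), (5, 5), (9, 2)]
selected = brush(points, x_min=4, x_max=10, y_min=0, y_max=6)
ratio = aspect_ratio(3, 4)
```

The returned positions can then be used to color code, label, or remove the highlighted points.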
