EIA Guidelines for Statistical Graphs

DOE/EIA-0465(98)



Design Criteria

 

This section discusses basic graphic design features such as titles, frames, X- and Y-axes scales, data point plotting, data line differentiation, and data line labeling as well as shading, horizontal and vertical grid lines, avoidance of data line overlap, information messages, and color. During the discussion, the basic decision rule on whether and how to use a feature is the one outlined in the Introduction; i.e., does the design support the message? If a feature does not provide information on the data, there is no advantage in using it and it may reduce clarity. Good graphs have a minimum of design and a maximum of data.

There are three sample graphs in this chapter. Figure 1 illustrates the zero-base and the use of break marks on the Y-axis. Figure 2 shows why "period data" (i.e., averages) should be plotted in the middle of the X-axis interval instead of at its end. Finally, Figure 3 illustrates the advantage of plotting daily rates instead of aggregates for monthly data.

Title

Every graph needs a clearly worded and concise title that answers three questions:

  • data illustrated (what)
  • geographic area represented (where)
  • and date of data (when).

In graphs, unlike in data tables, the unit of measurement is not in the title because it is displayed on an axis, usually the Y-axis.

There are variations and exceptions to these rules. In statistical maps, because there is no axis, the unit of measurement is in the title. Also, in publications devoted exclusively to a particular geographic locale, the title of the publication that appears on the same page is sufficient to answer the question "where." For a full discussion of correct figure (and table) title construction in Energy Information Administration (EIA) products, see the EIA Publishing Style Guide.

Frame Dimensions

Graphics tend towards the horizontal rather than the vertical. The primary reason is that "horizontally stretched time series are more accessible to the eye." [3] A pleasing shape is called the "golden rectangle" of frame design, in which the ratio of the longer axis to the shorter axis is 1.6 to 1. For example, if a graph's horizontal axis is 8 inches long, its vertical axis is 5 inches long. The graphical perception research has found "a mild preference for proportions near the golden rectangle." Given that the "golden rectangle" is a tendency and a preference, Tufte concludes:

  • "If the nature of the data suggests the shape of the graphic, then follow that suggestion."
  • "Otherwise, move toward horizontal graphics about fifty percent wider than tall." [4]

Authors have the option to choose the frame dimensions that best illustrate the structure of the data. In this report, the "golden rectangle" is illustrated in Figures 1, 4, 5, and 9 -- all long time series graphs.
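The frame arithmetic above can be sketched in a few lines. This is only an illustration: the function name `frame_height` and the 9-inch width in the second call are assumptions made for the example, not part of the guidelines.

```python
GOLDEN_RATIO = 1.6  # longer-to-shorter axis ratio quoted in the text

def frame_height(width_inches, ratio=GOLDEN_RATIO):
    """Return the vertical axis length for a given horizontal axis length."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return width_inches / ratio

# The 8-inch example from the text, then Tufte's "fifty percent wider than tall".
golden_height = frame_height(8.0)       # 5 inches
tufte_height = frame_height(9.0, 1.5)   # 6 inches for a hypothetical 9-inch width
print(round(golden_height, 2), round(tufte_height, 2))
```

Either ratio is a starting point; the structure of the data takes precedence when it suggests a different shape.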

The DOS version of Harvard 3.0 had a ratio of 1.3 to 1. The Windows 3.0 version ratio is 1.7 to 1, close to the "golden rectangle." In Microsoft, the user can change the frame dimensions to suit his or her purpose.

Scale

In a time graph, the horizontal scale (X-axis) always represents time. Demarcations of time intervals must be proportional to the length of the time represented. There are exceptions, such as quarters of a year, where differences in the number of days are so small relative to the total that they are usually ignored. Monthly data need special treatment, which is described in the section on plotting for time series.

The vertical scale (Y-axis) displays the units of measurement, an average or a quantity, of the data. A time plot shows one measurement for each time unit on the horizontal axis. Thus, the vertical axis represents either:

  • A rate that has been determined as the total for one time period divided by the number of smaller time units in the time period, or
  • A quantity measured at one specific instant during the time interval; for example, stocks at the end of the month.

Logically, there is only one vertical scale because the purpose of the scale is to provide the background against which the data shown in the graph are judged. Multiple vertical scales may be used only if the scales are linearly related to one another; e.g., Btu and joules, gallons and liters, short tons and metric tons. Using multiple vertical scales of linearly unrelated variables implies an empirical relationship that may not exist. This subject is covered in detail in the chapters on line graphs and on scatter diagrams.
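A minimal sketch of what "linearly related" means for a second scale: every tick on the second axis is the corresponding tick on the first axis multiplied by one constant. The tick values and the function name `secondary_axis_ticks` are hypothetical; the conversion factor is the standard definition of the short ton (2,000 pounds) in metric tons.

```python
SHORT_TON_TO_METRIC_TON = 0.90718474  # 1 short ton = 2,000 lb = 907.18474 kg

def secondary_axis_ticks(primary_ticks, factor):
    """Convert primary-axis tick values to the linearly related second unit."""
    return [round(tick * factor, 1) for tick in primary_ticks]

# Hypothetical left-axis ticks in thousand short tons.
short_ton_ticks = [0, 250, 500, 750, 1000]
metric_ton_ticks = secondary_axis_ticks(short_ton_ticks, SHORT_TON_TO_METRIC_TON)
print(metric_ton_ticks)
```

Because the two scales differ only by a constant factor, a data line drawn against one scale is automatically correct against the other. No such factor exists for unrelated variables, which is why pairing them on two scales can mislead.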

In recent years, many of EIA's statistical graphs presented data in both U.S. customary and metric units. Figure 19, "A Histogram With U.S. Customary and Metric Scales," is an example of such a graph.

The logarithmic scale can be used when it is important to understand percent change or multiplicative factors, and the intended audience will not be puzzled by such a scale. If the Y-axis in a plot is logarithmic and the X-axis is not, the graph is called a semi-logarithmic graph. Software packages (including Harvard Graphics) generally compute logarithms to the base 10, but logarithms to other bases, such as base 2, can also be used. The logarithmic scale can be used to improve the resolution of highly skewed data.
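The following sketch, using a hypothetical series that doubles each period, illustrates why a logarithmic scale suits percent change: equal multiplicative changes occupy equal distances on the scale, whatever base is chosen. The helper `log_steps` is an illustrative name.

```python
import math

values = [100, 200, 400, 800]  # hypothetical series that doubles each period

def log_steps(series, base=10):
    """Return the distance between successive values on a log scale."""
    positions = [math.log(v, base) for v in series]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]

linear_steps = [b - a for a, b in zip(values, values[1:])]
print(linear_steps)            # steps grow on a linear scale
print(log_steps(values, 10))   # each step is about 0.301, the log10 of 2
print(log_steps(values, 2))    # each step is about 1.0, the log2 of 2
```

Changing the base rescales every step by the same constant, so the shape of the plotted line is unaffected.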

Authors also have the option to insert the Y-axis scale on the right side of the graph frame. For example, in a line graph that presents a long time series, the most recent trends may be of greater interest than those from earlier periods. In this case, the author may decide that the reader would be helped by placing the Y-axis on the right. Figure 4 (Line Graphs Chapter) and Figure 18 (Vertical/Horizontal Bars, Pie/Dot Charts, 3-D Features Chapter) have the Y-axis scale on the right side to illustrate this point.

Scale Intervals and Tick Marks
In designating scale values, 5 or multiples of 5 are frequently used in preference to multiples such as 3, 6, 7, 9, 11, and so on. If the values of the vertical scale run into thousands, millions, or larger numbers, the zeros are commonly dropped and the scale label will read, for example, "Thousand Short Tons."

Scale divisions for both axes are indicated by major tick marks. Major tick marks usually are outside the axes and are labeled with the scale values. In certain scales, the use of minor tick marks is a good idea, particularly on a Y-axis with a wide scale range (e.g., 0 to 5,000 with intervals of 1,000). Minor tick marks, though not labeled, help the user read the data more quickly and accurately. The minor tick mark divisor should be a number that the reader can instantaneously divide into the major tick mark number. For example, if the scale interval is 1,000, the minor tick divisor can be 2, 4, 5, 10, or 20, with minor tick mark intervals of 500, 250, 200, 100, or 50, respectively. If the reader has to think about the division (e.g., 6 into 1,000), then the minor tick marks are counterproductive.
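The divisor arithmetic can be checked with a short sketch. Treating "divides into a whole number" as a stand-in for "can be divided instantaneously" is an assumption made here for illustration; the guideline itself is a judgment about the reader, not a formula. The function names are hypothetical.

```python
from fractions import Fraction

def minor_tick_interval(major_interval, divisor):
    """Return the exact spacing of minor ticks as a Fraction."""
    return Fraction(major_interval, divisor)

def is_whole_number(interval):
    """True when the minor tick spacing has no fractional part."""
    return interval.denominator == 1

# Divisors named in the text for a major interval of 1,000, plus the poor choice 6.
for divisor in (2, 4, 5, 10, 20, 6):
    interval = minor_tick_interval(1000, divisor)
    print(divisor, float(interval), is_whole_number(interval))
```

The five recommended divisors give 500, 250, 200, 100, and 50, while a divisor of 6 gives a repeating decimal that no reader can apply at a glance.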

Scale Labeling
For clarity, in addition to the units designated on the scale, both the horizontal and vertical axes need explanatory labels for the variables represented. The unit of measure is customarily also specified in the explanatory label. The exception is for the horizontal axis. If the horizontal axis scale displays months or years, the horizontal axis label (i.e., "Month" or "Year") is not needed. The months and/or years on the scale are self-explanatory.

Month or year labels, such as Jan., Feb., Mar., or 90, 91, 92, are logically placed in the interval between the tick marks, but numerical values, such as 5, 10, 15, are placed on the tick marks. The rationale is that a period of time is called, for example, "May," while only one point on a continuous scale is the precise amount or volume "X." Also, in long time series where there is not enough space between the tick marks to label every data point, the labels need not be on each interval. Instead, they may be placed in some regular pattern such as every 3 months (quarterly) or every 5 years. Single letters (e.g., "J," "M") to note months on the horizontal scale can be confusing. There are three "Js," two "Ms," and two "As" in a single 12-month period.

The Y-axis label normally is placed parallel to the Y-axis and reads inward toward it, because the label refers to the Y-axis scale, not to the top frame line, which is the default position in some software packages (such as Harvard Graphics). There are different conventions for placing this label to avoid having to read it sideways. Some graph producers put the units in the graph title; others put them on the top frame line. The generally accepted, and most widely used, practice is to place the label parallel to the scale to which it refers. In situations where there are multiple graphs on a page and all the graphs have the same scale, the Y-axis label can be placed in the title subheading.

It is a good idea to avoid abbreviations and acronyms, but if they must be used because of space limitations, they need to conform to the EIA Standard "Codes, Abbreviations, and Acronyms."

The Zero-Base
As a rule, the vertical scale starts at zero. Otherwise, the relative importance of changes in levels is hard to assess and comparison is difficult, or an insignificant change can be made to look like a major change.

The graphs in Figure 1 (below) present the same data, once without a break mark (left figure) and once with a break mark (right figure). The figure on the left shows an invariant data line with a lot of "white space" below it. This is not the best design to communicate the message. The figure on the right communicates the message and eliminates the "white space." Break marks are a visual warning to readers that there is a discontinuity in the Y-axis scale and that the data should be read accordingly.

Figure 1. Examples of the Zero-Base and the Use of Break Marks


Break marks are also useful if the purpose of the graph is to display the fine details of differences between two lines. Then, that portion of the scales needs to be presented.

There are instances where the zero-base is not necessary on the vertical axis. For example, in line graphs showing relative quantities, the natural basis (such as 100 for percentage indices) often is drawn in the middle of the vertical axis. Figures 6 ("Line Graphs" chapter) and 13 ("Measuring from the Baseline" chapter) are examples of this. Also, the zero-base is irrelevant when a logarithmic scale is used. Finally, the Y-axis is neither broken in cumulative line graphs nor, with rare exceptions as described in the chapter on vertical/horizontal bars, pie/dot charts, and three-dimensional features, in bar charts. The reason is that the area depicted is proportional to the quantity displayed.

Plotting Data for Time Series

Midpoint or Endpoint
When data on the X-axis of a line graph are being plotted, the points are located either directly above the tick marks or above the space between the tick marks, depending on the type of data presented. Data points for averages over the time interval, such as "average production per day" or "average price for the month," are "period" data (rates). Period data also include all quantities that are tabulated as the "total" for a month. Period data are plotted in the middle of the time interval. Although the point is plotted in the middle of the interval, the connecting line extends over the entire interval. This means that the line connecting the first and second points plotted needs to be extended to the beginning of the first interval. Similarly, the extension for the last point is to the end of the interval. This refinement is usually not necessary as it does not affect perception, but it is needed if the area between the lines is shaded. Data points measured at one point in time, such as "stocks at end of month," are "point" data. These data are plotted at the specific time of measurement. Some graphics packages plot data (and division or value labels) on the tick marks by default rather than between tick marks, but most of these packages have methods or options to override this default.

Tables 2 and 3 and Figures 2 and 3 illustrate the difference between "period" data and "point" data. Suppose production of a commodity on January 1 was 100 units and for the next 3 months, production was increased by 1 unit per day for each day. Thus, we have:

Table 2. Hypothetical Data With Increase of One Unit per Day


Month      Days in Month   Midmonth Day   Range in Daily Production   Total for Month   Monthly Average Units per Day
January    31              16.0           100-130                     3,565             115.0
February   28              14.5           131-158                     4,046             144.5
March      31              16.0           159-189                     5,394             174.0
April      30              15.5           190-219                     6,135             204.5
Total      120

In Figure 2, below, the points plotted at mid-month (series 1) correctly fall on the daily line. Points plotted at the end of the month (series 2) are 15 units below the line for 31-day months, 14.5 units below the line for a 28-day month; so, the line for average production per day is approximately 15 units too low.
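The entries in Table 2 follow directly from the stated assumptions (100 units on January 1, rising by 1 unit per day). The sketch below rebuilds them and confirms that each monthly average equals the value of the daily line on the mid-month day, which is why mid-month plotting places the points on that line. The function and field names are illustrative, and the mid-month day is taken as (days + 1) / 2 to match the table.

```python
START_LEVEL = 100  # units produced on January 1, as stated in the text
MONTHS = [("January", 31), ("February", 28), ("March", 31), ("April", 30)]

def build_table(start_level, months):
    """Rebuild the rows of Table 2 for production rising 1 unit per day."""
    rows = []
    level = start_level  # production on the first day of the current month
    for name, days in months:
        daily = [level + offset for offset in range(days)]
        total = sum(daily)
        mid_day = (days + 1) / 2
        rows.append({
            "month": name,
            "days": days,
            "mid_day": mid_day,
            "range": (daily[0], daily[-1]),
            "total": total,
            "average": total / days,
            # Height of the straight daily line on the mid-month day.
            "line_at_mid_day": level + (mid_day - 1),
        })
        level = daily[-1] + 1  # first day of the next month
    return rows

for row in build_table(START_LEVEL, MONTHS):
    print(row["month"], row["days"], row["mid_day"], row["range"],
          row["total"], row["average"], row["line_at_mid_day"])
```

In every row the last two printed values agree, so a point plotted at the mid-month day sits exactly on the daily production line.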

Figure 2. Illustration of the Mid-Point Problem - Hypothetical Data


Now suppose that for April, instead of monthly data, we have weekly data. Assuming that April 1st is the start of the week, we have:

Table 3. April Weekly Data Derived from Table 2

Month/Week                  Mid-Weekday   Range in Daily Production   Total for Week   Weekly Average Units per Day
April 1-7                   3.5           190-196                     1,351            193
April 8-14                  10.5          197-203                     1,400            200
April 15-21                 17.5          204-210                     1,449            207
April 22-28                 24.5          211-217                     1,498            214
April 29-30 (last 2 days)   ---           218-219                     437              ---
Total                                                                 6,135

If the weekly averages are plotted mid-week, they will continue to fall on the series 1 line. If they are plotted at the end of the week, they will fall on a line that is 3.5 units below the correct line, series 1. Now, if we were to observe only series 2, the dotted line rises sharply in the first week in April when we change from monthly to weekly data.

When data are plotted at two periodicities, such as weekly and monthly, it is always necessary to use mid-interval plots to avoid the appearance of a false change in level when the periodicity changes. It is always preferable to use mid-interval plots for period data but, in practice, the error introduced by using the default option at the beginning or end of the interval is hard to detect when the time gradients (interval size) are graphically small. Thus, when 12 or more periods are presented, and all periods are the same length, it is not essential to plot at mid-interval.

Shading
Time series plots are essentially rates, and each value plotted denotes the rate that applies during a time period. The use of shading below the line is somewhat misleading as it implies that the area shaded is proportional to the rate during the referenced time period. Examination of an irregular graph shows this is not true. For example, if a low value has higher values on either side, then the lines will ascend from each side of the plotted low point; so, the shaded area will be slightly too large.

There is also a problem in determining where the shading should start and end horizontally. For values plotted at the end of the time periods, the connecting line will start at the end of the first period. Thus, no shading need be used for the first period when the area below the line is shaded. Similarly, for mid-interval plots, only half (n-1) of the first period will be shaded unless the line is extended to the beginning of the interval so that the shaded area is proportional to the quantity depicted. A similar procedure is needed at the end of the series. This is a reasonable solution where shading is needed to accentuate features of the data. Figure 4 (Line Graphs section) is an example of this.

Shading also can be used to highlight the differences between two data series (for example, surplus or deficit), or the range of a variable (the standard error if the data are uncertain). Shading is not necessary for highlighting the frame of a graph or for background color in it. It is occasionally needed for a balanced effect if the rest of the page has a heavy load of printing, such as multiple captions in heavy type.

Unless there is a need for shading, the practice of just using a line to connect the data points is preferable. Yet, if shading is used, it should be used consistently and in the same color for each commodity in time line graphs throughout a publication. (Color and shading are discussed in more detail further into this section.)

Monthly Total Dips in February
The observant reader will notice that the intervals on the horizontal scale in Figure 2 differ, February being narrower than the other months. This refinement was necessary to ensure that the series 1 plotted points fell exactly on the hypothetical straight line representing the daily rate of production. In practice, months are usually plotted as if each is of equal length. This slight inaccuracy is not perceptible and does not affect the perception of the pattern of average production per day.

When total production per month is plotted, the 3-day difference between February and adjacent months creates a loss of 3 days' production, about one-tenth of the monthly total. The resulting 10-percent dip in February is perceptible and sends a false message that there has been a change in the prevailing pattern. These dips cannot be eliminated by changing the width of the interval for February. This false message is the reason for plotting monthly data as a daily rate, i.e., barrels per day produced.
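A small numerical sketch of the February dip, assuming a hypothetical output of exactly 100 units per day in every month. The monthly totals fall in February even though nothing about production has changed, while the daily rate stays flat.

```python
DAYS = {"January": 31, "February": 28, "March": 31}
DAILY_OUTPUT = 100.0  # hypothetical constant production, units per day

# Monthly totals depend on the length of the month.
monthly_totals = {month: DAILY_OUTPUT * days for month, days in DAYS.items()}

# Daily rates divide each total by the number of days it covers.
daily_rates = {month: monthly_totals[month] / DAYS[month] for month in DAYS}

february_dip = 1 - monthly_totals["February"] / monthly_totals["January"]

print(monthly_totals)
print(daily_rates)
print(f"February dip in totals: {february_dip:.1%}")
```

The dip in totals is 3/31 of the January figure, just under 10 percent, and it disappears entirely once the series is expressed per day.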

Figure 3 illustrates this point. The total monthly generation graph on the left shows a dip in February. Yet, when the data are converted to daily rates in the graph on the right, the dip is eliminated. (The right side Y-axis in the right-hand graph expresses the data as a yearly rate.)


Figure 3. U.S. Generation of Electricity, Month Totals (Left) and Daily (and Yearly) Rates (Right), January-June 1993

The producers of line graphs on a time scale should keep in mind that what is being plotted is a rate, such as a quantity per day or per year, where day and year are fixed constant increments of time. Quarterly intervals are not precisely equal, being 90 (91 in leap years), 91, 92, and 92 days, but the relative differences between lengths of quarters are small and the effect on production is usually not perceptible. This is not the case with months.

For many years, users of petroleum data have expressed all quantities produced or consumed as barrels per day. It is not customary to compute a daily rate for other fuels such as coal or electricity. Another solution, if plotting monthly data for these fuels, is to adjust the quantities to a 30-day month. The February total, thus, would be multiplied by 30/28 and the January, March, May, July, August, October, and December totals, by 30/31. The title or footnotes of the graph would state that monthly totals are adjusted to give the rate for a 30-day month.
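The 30-day adjustment is a single multiplication per month. The sketch below applies it to hypothetical totals generated by a constant 100 units per day; the function name is illustrative.

```python
def adjust_to_30_day_month(total, days_in_month):
    """Scale a monthly total to the rate for a 30-day month."""
    return total * 30 / days_in_month

# Hypothetical totals for a constant 100 units per day: (total, days in month).
raw_totals = {
    "January": (3100, 31),
    "February": (2800, 28),
    "March": (3100, 31),
    "April": (3000, 30),
}

adjusted = {
    month: adjust_to_30_day_month(total, days)
    for month, (total, days) in raw_totals.items()
}
print(adjusted)
```

Every adjusted total comes out equal, so the plotted line is level, as it should be when the underlying rate has not changed. The 30-day months need no adjustment because their factor is 30/30.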

Changing to either a daily rate or adjusting to a 30-day month is only necessary when plotting monthly period data. None of these difficulties occur with "point" data, such as stocks. These data measure the level at a specific point in time, usually the last day of the time period covered by other questions in the survey. They are correctly plotted at the point in time where they are measured.

Differentiating Lines

Line Patterns and Line Symbols
If there is more than one line in a graph, each line usually needs to be identified by a unique line pattern, unique line symbol, or color. For example, lines may be represented by a solid line, dashed line, dot-and-dash line, or some other pattern. Line symbols may be squares, circles, diamonds, triangles, etc. The symbols are placed where the data points are plotted, in the middle of the time period for "period data" and at the end of the interval for "point data." If symbols are used, the connecting lines should be solid. Line patterns and line symbols need not be combined. One or the other is sufficient to differentiate the data lines in a data series. See Figure 2 for an example of a data line with small squares and a dotted line with circles.

In cases where data lines are separated by a lot of space and do not come near each other, each line can have a solid (or some other) line pattern and a line label to identify the curve. This will reproduce clearly (when printed in black and white or when photocopied), not lead to confusion, and eliminate the need for a legend.

It is a service to the reader to use the same pattern (symbol or color) throughout a publication. If, for example, coal is always depicted as a solid line and crude oil is shown as a dotted line, the reader does not need to refer repeatedly to legends. Such consistency is desirable, if feasible, and not preempted for other reasons. As illustrated in the next chapter, line patterns are preferable to line symbols. Line symbols use more space than line patterns and are more distracting.

If a graph has too many lines, it may become confusing and lose its visual impact. It is not possible to state a hard-and-fast rule for the optimum number of lines in a graph as it depends on the amount of overlap. When overlap is not a problem, four is usually the optimal number of lines that can be read clearly, although some graphs with more than four lines can still be read clearly. Figure 18 (chapter on Vertical Bars, Horizontal Bars, Pie and Dot Charts, and 3-Dimensional Features) illustrates this. Figure 5 (Line Graphs chapter) also illustrates a convenient method to summarize information from multiple lines, presenting the range of all values with a line of primary interest (e.g., the mean or median) superimposed. The chapter on Measuring from the Baseline discusses other methods to eliminate overlap.

Real and Nominal Dollars
In some instances, where justifiable in economic theory and practice (e.g., refiner acquisition costs for crude oil), it is a good idea to present data (lines) in both real and nominal dollars. In the last several years, EIA publications have increasingly published financial data in both real and nominal dollars.

Grid Lines

Horizontal grid lines are often not necessary. When needed to guide the eye, they should be kept to a minimum, evenly spaced, and comparatively light in contrast to the data lines. If grid lines are not light, they may overpower or "hide" the data lines and, thus, obscure or distort the data presentation. Vertical grid lines, however, are often helpful. The following are situations in which it is recommended that they be used:

  • To separate calendar years in a monthly data series to reinforce the perception of seasonal patterns
  • To separate historical data from projections
  • To separate a change in the data series (e.g., the method of sampling was changed).

Internal Labeling and Messages

Line Labels and Legends
To facilitate interpretation, each line needs a short, simple, self-explanatory label. Labels can be located either directly next to the lines or listed in the legend with the line pattern, line symbol, or color. Direct labeling is much preferred to the use of legends. With legends, readers have to look back and forth between the data lines and the legend, while at the same time decoding the information, instead of just reading and analyzing the data. This annoys readers and decreases their comprehension of the graph. Legends also take up space that could be used to present more data. If a legend is used and the space is available, placing the legend inside the graph reduces the amount of searching the reader must do.

Placement
Placement of line labels depends on the amount of overlap. If the line graph is uncluttered, the line label or message can be placed parallel to or near the line or point which the label or message is describing. If the data lines are close together or overlap, but there is enough space within the graph for line labels (as opposed to a line legend), then arrows need to be used to connect the line label (or message) to the data line. The design of the arrow is simple: a straight, narrow shaft with a small arrowhead. Arrows should not actually touch the curve nor should they cross each other.

Messages
If they do not clutter the graph, short informational messages (e.g., "Iraq Invaded Kuwait" in a graph on crude oil production or prices) are very useful and can be added to the graph. Messages may provide additional context to the data, emphasize a particular point, and/or explain ambiguous or possibly hidden features in the graph. Messages are particularly useful for explaining the data (and even the graph format) when a graph is easily understood by readers who are familiar with the subject matter but less well understood by readers who are not. Authors are advised to keep this distinction in mind when constructing graphs. In some graphs, informational messages are distinguished from the line labels by enclosing them in a box. This, though, is not a recommended practice. Boxes take up space and are essentially ornamental.

Color

Color statistical graphs have become technologically, economically, and aesthetically more feasible to produce and, hence, more commonplace in publications because of advances in hardware (such as increased resolution in laser and other printers) and software (palettes with greater selection of colors). In addition, when graphs are presented on the World Wide Web, the use of colors is a free feature, where the main constraint is to avoid using too many colors, so that an image will load fairly quickly into a browser window. While the choice of when to use colors and what colors to use is basically artwork, there are some general rules to observe.

The purposes of color are to differentiate and to encourage comparison. In this sense, color is another variable. These purposes can be achieved by using colors in consistent thematic patterns throughout each product. For example, to represent a low to high scale in a statistical map or grouped bar chart, use light to dark hues of the same color. This will be easy to remember and give readers "a sense of natural visual sequence." [5]
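One way to produce a light-to-dark sequence of a single color is to hold hue and saturation fixed and step the lightness downward. The sketch below does this with Python's standard `colorsys` module. The particular hue, saturation, and lightness limits are arbitrary assumptions chosen for illustration, not EIA recommendations, and the function name is hypothetical.

```python
import colorsys

def sequential_shades(hue, steps, lightest=0.85, darkest=0.30, saturation=0.6):
    """Return hex colors of one hue, ordered from light (low) to dark (high)."""
    shades = []
    for index in range(steps):
        fraction = index / (steps - 1) if steps > 1 else 0.0
        lightness = lightest + fraction * (darkest - lightest)
        red, green, blue = colorsys.hls_to_rgb(hue, lightness, saturation)
        shades.append("#{:02x}{:02x}{:02x}".format(
            round(red * 255), round(green * 255), round(blue * 255)))
    return shades

# Five classes of a hypothetical blue scale for a statistical map.
palette = sequential_shades(hue=0.6, steps=5)
print(palette)
```

The moderate saturation keeps the shades soft, and the result should still be test-printed and photocopied, since screen colors may not reproduce as expected.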

In Envisioning Information, Tufte also suggests that authors not use pure, bright, or very strong colors "unrelieved over large areas adjacent to each other." This creates "noise" that distorts, blurs, or covers the visual impact of lines, bars, strata, map contours, etc., which represent the data. Pure, bright, and very strong colors should be used sparingly in statistical graphs. Further, using light bright colors against a white background produces a "1 + 1 = 3 effect," a form of noise where the colors are at "visual war" with the information in the data display. [6] Use softer colors for the background and display.

The following are additional guidelines on the use of color advocated by Tufte:

  • Use only colors that enhance the data presentation. Do not use color for graph frames, large unnecessary symbols, or titles, for example.
  • Be careful in the selection of colors. The wrong colors can distort the data and lead to incorrect perceptions. For example, big differences in color should not be used to represent small quantitative changes. The color scale should be proportional to the data scale. [7]
  • Experiment with color combinations. Use colors or hues that are distinguishable without one color overpowering the other(s). Tufte recommends using the "smallest effective difference." The "visual move" from one hue (or color) to the next is as small as possible but is still distinctive and clear. This allows for more differences and, hence, data to be displayed. [8]
  • Test the colors you select throughout the composing, printing, and publishing process. The color you select from the software color palette that you see on your computer screen may not come out as the same color on a color printer, or the color in the paper-printed report, or the color in a Web browser.
  • Ask coworkers if they can distinguish between the colors or hues that are being used in the construction of, for example, a grouped bar chart or statistical map. When in doubt, use soft colors.
  • Visual perception of color varies among readers. Some are unable to distinguish among colors in different parts of the spectrum. For them, too much reliance on color is a disadvantage.
  • Hardware and software should provide sufficient resolution so that the edges of color components in a graph are not fuzzy (called the "jaggies").
  • Assume that readers will photocopy graphs and that reproduction could cause problems. Thus, it is advisable to photocopy graphs you produce that will be published to see if photocopying produces a drop-off in quality.
