The
first step in the visualization process is selecting and reading your data
into SAS/SPECTRAVIEW. The interface
guides you through the process.
When you first invoke SAS/SPECTRAVIEW, [Data] is selected by default, ready for you to
load data. Note that
you can load data at any time during a SAS/SPECTRAVIEW session
by reselecting [Data].
Loading Data
To
display the session's assigned librefs:
-
Select [Load data]. The software
displays the assigned librefs under the label Libname. To assign an additional
libref for a session, submit a LIBNAME statement from the SAS PROGRAM
EDITOR window (if you invoked SAS/SPECTRAVIEW with
a command), then refresh the session's librefs for SAS/SPECTRAVIEW by
reselecting [Data] and [Load data].
-
Select the libref containing the data set that
you want to load. Use the scroll bar if more than 10 librefs are listed. Once you select
the libref, the software displays the data sets associated with the libref.
Selecting a Libref
SAS/SPECTRAVIEW works
as well with small data sets (such as 20 observations) as it does with large
data sets (such as a quarter million observations). The SAS data set that
you select must have at least four variables: one for each of the three
axis variables and one for the response variable. The response variable must be numeric,
and each variable specified for SAS/SPECTRAVIEW must
contain at least two unique values. If you want to use a BY variable, the
data set must have a fifth variable as well. To load a data set that
has only three variables, see Loading a Data Set with Only Three Variables.
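These requirements can be checked before you load a data set. The following Python sketch only illustrates the rules stated above and is not part of SAS/SPECTRAVIEW. The function name check_loadable, the list-of-dictionaries layout, and the use of None for a missing value are assumptions made for the example.

```python
def check_loadable(rows, x, y, z, response, by=None):
    """Return a list of rule violations; an empty list means the rules pass.

    rows is a list of dicts, one per observation (hypothetical layout).
    """
    problems = []
    roles = [x, y, z, response] + ([by] if by is not None else [])
    variables = set().union(*(row.keys() for row in rows))

    # Four variables are required, or five when a BY variable is used.
    if len(variables) < len(roles):
        problems.append(f"data set needs at least {len(roles)} variables")

    # Every variable that is used must contain at least two unique values.
    for name in roles:
        unique = {row.get(name) for row in rows} - {None}
        if len(unique) < 2:
            problems.append(f"{name} has fewer than two unique values")

    # The response variable must be numeric.
    values = [row.get(response) for row in rows]
    if any(v is not None and not isinstance(v, (int, float)) for v in values):
        problems.append(f"{response} must be numeric")
    return problems
```

An empty result means that the data set passes these basic checks. It does not guarantee that the resulting volume grid will be useful.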
Select the input data set from the list of names. Use
the scroll bar if more than 10 data sets are listed. Once you select the input data set,
the software lists the data set's variables in columns from which you can
select SAS/SPECTRAVIEW variables.
Selecting a Data Set
You must specify a different
data set variable for each SAS/SPECTRAVIEW variable.
That is, you must select a different variable from each of the X Variable, Y Variable, Z Variable, and
Response variable
columns. The axis variables can be either numeric or character, but the response
variable must be numeric.
To help you select appropriate variables, you can place
your cursor on a variable name, and the software will display a short description
of it in the text window. For example, for the EPA data set, which contains
the variables HOUR, LEVEL, LNGITUDE, LATITUDE, SULFATE, and OZONE, their descriptions
provide the following information:
-
All the variables are numeric. Specifically,
the description for SULFATE is
Type: Num, Label: Sulfate (ppm).
-
SULFATE and OZONE specify that their values are
in ppm (parts per million). SULFATE and OZONE are good candidates for the
Response variable, since you usually want a variable that is observed or generated
in various quantities. A Response variable is one that contains the values
that are of most interest.
-
Variables LEVEL, LNGITUDE, and LATITUDE are described
as RADM Model layer, RADM Cell X coordinate, and RADM Cell Y coordinate. Their
values are most likely not sampled or generated but represent where SULFATE
and OZONE values are located or from what types of data the response values
were generated. Therefore, LEVEL, LNGITUDE, and LATITUDE are good candidates
as axis variables, since they can be used to generate grid locations to display
the response values.
-
HOUR contains hour values. This type of variable
is useful to generate groups of observations by assigning it as a BY variable
as explained in Grouping Observations with a BY Variable.
Note that any variable that is appropriate as a Response
variable is not a valid choice as an axis variable, and any variable that
is appropriate as an axis variable is not a valid choice for a Response variable.
Attempting to read a data set with inappropriate variables selected could
result in the data set failing to load. Choose axis variables that
build as complete a volume grid as possible from actual data points,
and avoid axis variables that are sparsely valued or that contain
continuous data.
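One way to judge a candidate set of axis variables is to estimate how much of the volume grid actual data points would occupy. The Python sketch below is an illustration only. It assumes that the grid is the cross product of the distinct values found on each axis, which the documentation does not state explicitly, and the function name grid_fill_ratio is hypothetical.

```python
def grid_fill_ratio(points):
    """Fraction of volume-grid locations that hold an actual data point.

    points is an iterable of (x, y, z) tuples. The grid is assumed to be
    the cross product of the distinct values found on each axis.
    """
    distinct = set(points)
    if not distinct:
        return 0.0
    grid_size = 1
    for axis in range(3):
        grid_size *= len({point[axis] for point in distinct})
    return len(distinct) / grid_size
```

A ratio near 1 suggests a nearly complete grid. Continuous or sparsely valued axis variables tend to push the ratio toward 0, because almost every observation introduces a new axis value.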
Specifying SAS/SPECTRAVIEW Variables
Once you select the four required variables,
the software
highlights [Read data], but you still have the option of specifying
BY variable processing, duplicate values handling, data categorizing, automatic
axis scaling, and data subsetting with a WHERE clause, which are discussed
in the following sections.
In
addition to the four required variables, you have the option of specifying
a fifth variable as a BY variable. The values of a BY variable
define groups of observations, such as hour, month, or year. Specifying a
BY variable allows you to animate an image so that you can see how response
values change across the groups, for example over time.
A BY variable can be either character or numeric. BY
data usually includes multiple response values for a single data point.
For example, in the EPA data set, the variable HOUR
contains hour values, which would be useful as a BY variable. If you imagine
that the first four variables would generate a cube of data values, then specifying
a BY variable would generate a sequence of cubes of data values that can be
cycled through to determine how response values change over time (in this
case).
If you select LNGITUDE, LATITUDE, and LEVEL as the axis
variables, SULFATE as the Response variable, then HOUR as the BY variable,
you will create a sequence of volumes of data to be displayed and analyzed.
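The idea of a sequence of volumes can be sketched in Python as follows. This is an illustration of the concept rather than the software's implementation. The function name split_by and the lowercase dictionary keys are assumptions, and groups are kept in the order in which their BY values first appear.

```python
from collections import defaultdict


def split_by(rows, by):
    """Group observations into one volume per distinct BY value."""
    volumes = defaultdict(list)
    for row in rows:
        volumes[row[by]].append(row)
    # A plain dict keeps the groups in first-seen order.
    return dict(volumes)
```

With HOUR as the BY variable, each key of the result corresponds to one hour, and each value holds the observations that make up that hour's volume.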
Specifying a BY Variable
Note:
If you do not
specify a BY variable
but your data contains BY data (like a time variable), you may receive a message
in the text window after loading the data. The message warns that there is
more than one response value for an x,y,z coordinate. When this occurs, the
software handles the response values according to the setting on the Duplicate Values panel.
Duplicate values occur when the data has more than one observation
for the same x,y,z coordinate, which could result in more than one response
value for a data point. Note that if you also categorize the data or if you
have specified a BY variable, the instances of duplicate values may increase.
You determine how the software handles duplicate values
by selecting one of the choices under the label Duplicate Values. The default is [Last], which means that the
last response value encountered for a data point is used as that location's
response value.
Handling Duplicate Values
To specify
how the software handles duplicate values,
select one of the following options:
-
[Count]
-
For each unique x,y,z location, the software
counts the number of observations and uses that count as the response value.
For example, if there are three observations that specify the x,y,z location
1,1,1, the response value is 3, regardless of the actual response values in
the data.
When you load data, each response value for the resulting
data points represents a count of the observations for that location. If there
are no duplicate observations for a particular x,y,z location, the response
value is 1, indicating that only one observation was found for that location.
Similarly, if the data includes no observations for a particular x,y,z location,
the response value would be 0, meaning that the data point is missing. [Count] allows you to find the number of response values that were
used to calculate other values, for example, [Mean] or [Sum]. If you load data with [Mean], you may want to
know how many values were used to calculate the mean value shown at a particular
x,y,z location. You can load again using [Count], then probe
the data to reveal the number used for the mean.
-
[Nmiss]
-
For each unique x,y,z location, the software
counts the number of observations with missing response values. For example,
if an x,y,z location has two observations and both have a valid response value,
the result is a response value of 0, meaning no observations with a missing
response value were found for that location.
With [Nmiss] specified, every data point
has a response value indicating how many missing response values were encountered
for that location. If a valid data point has five observations and only three
had response values, then that data point's response value is 2, meaning two
observations were found missing a response value for that location. [Nmiss] only counts valid data points having no response value. It
does not count filler points generated by the software. If the data does
not contain an observation for an x,y,z location, the software inserts a data
point that has a missing response value. This means that if you load a data
set, display it as a point cloud, and discover there are several missing values
in the volume grid, you can reload the data with [Nmiss] selected
and determine which missing values are caused by missing response values as
opposed to missing axis values.
-
[Minimum]
-
If there are two or more response values
for the same x,y,z location, the software uses the minimum value as the response
value.
-
[Maximum]
-
If there are two or more response values
for the same x,y,z location, the software uses the maximum value as the response
value.
-
[Sum]
-
If there are two or more response values
for the same x,y,z location, the software uses the sum as the response value.
-
[Range]
-
If the data contains at least two response
values for each x,y,z location, the software uses the range as the response
value. The range is calculated by subtracting the minimum response value from
the maximum response value. If there is only one value for a location, the
response value is set to missing.
-
[Last]
-
If the data contains two or more response
values for the same x,y,z location, the software uses the last response value
as the response value. This is the default.
-
[Mean]
-
If the data contains two or more response
values for the same x,y,z location, the software uses the mean as the response
value.
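The options above can be summarized in a short Python sketch. It is an illustration of the described behavior, not the software's code. The function name resolve_duplicates and the use of None for a missing response value are assumptions. The sketch also assumes that every option except [Count] and [Nmiss] ignores missing response values, which the documentation does not state.

```python
from collections import defaultdict


def resolve_duplicates(observations, method="last"):
    """Reduce the observations to one response value per x,y,z location.

    observations is an iterable of ((x, y, z), response) pairs, where a
    response of None stands for a missing value.
    """
    groups = defaultdict(list)
    for location, response in observations:
        groups[location].append(response)

    result = {}
    for location, responses in groups.items():
        valid = [r for r in responses if r is not None]
        if method == "count":
            value = len(responses)               # number of observations
        elif method == "nmiss":
            value = len(responses) - len(valid)  # missing response values
        elif not valid:
            value = None                         # nothing to summarize
        elif method == "minimum":
            value = min(valid)
        elif method == "maximum":
            value = max(valid)
        elif method == "sum":
            value = sum(valid)
        elif method == "range":
            # A single value has no range, so the result is missing.
            value = max(valid) - min(valid) if len(valid) >= 2 else None
        elif method == "last":
            value = valid[-1]
        elif method == "mean":
            value = sum(valid) / len(valid)
        else:
            raise ValueError(f"unknown method: {method}")
        result[location] = value
    return result
```

For three observations at location 1,1,1, the "count" method returns 3 for that location regardless of the actual response values, which matches the [Count] example above.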
Categorizing data is an option that groups numeric
data to create
distinct ranges (called categories) for each axis. You cannot categorize
character variables. The result is a reduced number of data points
in the volume grid. By categorizing all three axes, you can set exactly how
many data points the software will create. Categorizing data is useful
Continuous data (values with few gaps that vary slightly
over a large range, such as weight and height) are good candidates for categorizing.
For example, to analyze a group of people's heart rate based on their age,
activity level, and weight, the weight values, which would be in pounds like
139.5, 143.6, would be considered continuous. That is, it is not likely that
any two people (let alone several) would have the same weight but a different
age and activity level. Categorizing the weight values by creating weight
categories for ranges of weight with one value to represent each category
would make the data clearer and easier to use.
Discrete data (containing natural gaps like patient
IDs and years) would probably not be as useful to categorize. But discrete
data such as hour could be categorized into groups if the degree of precision
can be reduced without losing data integrity.
To categorize data:
-
Select [Categorize]. The software
displays a group of sliders and buttons at the bottom of the interface.
Categorizing Data
-
Under
the label CATEGORIZE AXIS,
specify which axis you want to categorize. By default, all three are turned
on for categorizing. Use the on/off buttons to turn categorizing on or off
for a particular axis. For example, selecting [X on] turns
on categorizing for the X axis, and selecting [Y off] turns
off categorizing for the Y axis.
-
Under NUMBER GROUPS,
use the sliders to specify the number of categories you want for each axis.
You can specify between two and 100 categories for each, with 10 being the
default.
-
Under GROUP AXIS VALUE,
for each categorized axis, specify the axis tick mark value:
Categorizing data makes it
more likely that the software encounters
more than one response value for a given x,y,z coordinate. (Uncategorized
data usually contain only one response value for each x,y,z coordinate.) When
one or more of the axes are categorized, some of the data points become duplicates
within a group, which could result in more than one response value for a single
data point.
For example, suppose values for the X variable are integers
from 1 to 100. If you categorize the X values into groups of 10 values, 1-10
would be a single category. The data points 1,1,1 and 2,1,1 and 3,1,1 and
so forth are viewed by the software as the same data point in the volume grid,
because they would all have the same X, Y, and Z values.
The response values for the 10 data points would appear
to be 10 different response values for the same data point. The response values
for the duplicate locations are handled according to the method specified
for duplicate values handling, with the default being to use the last response
value found as the category's response value.
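The effect of categorizing on duplicates can be sketched in Python. The example assumes equal-width categories across the range of the axis, which is consistent with the 1 to 100 example above but is not confirmed as the software's exact method. The function name categorize is hypothetical.

```python
def categorize(value, low, high, groups=10):
    """Return the zero-based category index of value on an axis.

    low and high are the smallest and largest values on the axis.
    Equal-width categories are an assumption made for this sketch.
    """
    if not 2 <= groups <= 100:
        raise ValueError("groups must be between 2 and 100")
    if high <= low:
        raise ValueError("an axis needs at least two unique values")
    index = int((value - low) * groups / (high - low))
    # The largest value belongs to the last category.
    return min(index, groups - 1)


# X values 1, 2, and 3 fall into the same category, so the data points
# 1,1,1 and 2,1,1 and 3,1,1 collapse into a single grid location.
collapsed = {(categorize(x, 1, 100), 1, 1) for x in (1, 2, 3)}
```

Here collapsed contains a single location, so the three original response values become duplicates that are resolved by the Duplicate Values setting.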
By selecting [Auto scale], you can automatically scale the volume's three axes
to the same length. By default, the length of each axis is determined
by the range of its values. For example, an axis with values from 1 to 100
is roughly ten times as long as an axis with values from 1 to 10.
Note:
Once a data set is loaded, [Auto scale] is deselected. To
load a subsequent data set with automatic scaling, you must select [Auto scale] again.
Optionally, you can specify a
subset of data to be loaded
into SAS/SPECTRAVIEW by specifying
condition(s) that observations must meet. You can subset response values by
specifying criteria for the response variable, and you can subset data points
by specifying criteria for the axis variables.
Subsetting can change the size and shape of the volume
grid. For example, subsetting data can create holes that are replaced with
filler points, or subsetting can remove holes in data.
Prior to selecting [Read data], you can
specify subsetting conditions using a SAS WHERE clause:
-
Select [Where clause].
-
In the
text window, type a SAS WHERE clause without
the keyword WHERE and without an ending semicolon. A condition consists of a variable
name, an operator (such as EQ, NE, LT), and a value, such as
sulfate > .00005060.
Subsetting Data
-
Press
Enter.
For details on specifying conditions, see the appropriate
WHERE clause documentation. Note that before you invoke SAS/SPECTRAVIEW,
you can create a smaller SAS data set containing only the values that you
want to use. For example, you could choose certain ranges of axis values or
specific response values.
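The effect of a subsetting condition can be sketched in Python. This is an illustration of the filtering idea, not SAS syntax or the software's implementation. The function names, the lowercase keys, and the observation values are made up for the example, and None stands for a missing value.

```python
def subset(rows, condition):
    """Keep only the observations for which condition(row) is true."""
    return [row for row in rows if condition(row)]


def sulfate_above_threshold(row):
    # Mirrors the condition: sulfate > .00005060
    # A missing value (None here) does not satisfy the comparison.
    value = row["sulfate"]
    return value is not None and value > 0.00005060


# Made-up observations, for illustration only.
rows = [
    {"lngitude": 1, "latitude": 1, "level": 1, "sulfate": 0.00004},
    {"lngitude": 2, "latitude": 1, "level": 1, "sulfate": 0.00006},
    {"lngitude": 3, "latitude": 1, "level": 1, "sulfate": None},
    {"lngitude": 4, "latitude": 1, "level": 1, "sulfate": 0.00009},
]
kept = subset(rows, sulfate_above_threshold)

# Distinct X values before and after subsetting.
x_before = sorted({row["lngitude"] for row in rows})
x_after = sorted({row["lngitude"] for row in kept})
```

In this sketch the X axis has four distinct values before subsetting and two afterward, which shows how a condition on the response variable can change the size and shape of the volume grid.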
To have the software read the data, select [Read data].
The software loads the input data, applying any
optional
specifications. For example, if a WHERE clause is specified, the software
loads only those observations meeting the criteria, and if categorizing is
specified, the software changes the number of data points accordingly. Once
the data set is loaded, the variable list disappears, and the software is
ready for you to
If you have loading problems,
see Resolving Data Loading Problems.
Copyright © 1999 by SAS Institute Inc., Cary, NC, USA. All rights reserved.