JGOFS Data System Overview
Glenn R. Flierl, MIT
James K.B. Bishop, LDEO
David M. Glover, WHOI
Satish Paranjpe, LDEO
Index
Obtaining the Software
Full Report Available On-line
A complete web-based copy of the report is available on-line.
Further Documentation
We have various PostScript documents on details of the system. These are
somewhat more technical; see JGOFS Reference Documents.
Adapted from original web-based report.
Last modified: September 21, 2004
Introduction
Large oceanographic programs such as JGOFS (The Joint Global Ocean Flux
Study) require data management systems which enable the exchange and
synthesis of extremely diverse and widely spread data sets. We have
developed a distributed, object-based data management system for
multidisciplinary, multi-institutional programs. It provides the
capability for all JGOFS scientists to work with the data without regard
for the storage format or for the actual location where the data
resides. The approach used yields a powerful and extensible system (in
the sense that data manipulation operations are not predefined) for
managing and working with data from large scale, on-going field experiments.
In the ``object-based'' system, user programs obtain data by
communicating with a program (the ``method'') which can interpret the
particular data base. Since the communication protocol is standard and
can be passed over a network, user programs can obtain data from any
data object anywhere in the system. Data base operations and data
transformations are handled by methods which read from one or more data
objects, process that information, and write to the user program.
Purpose:
- Permit scientists to use data without concern for storage
technique, location, or format
- Networked interchange of data sets
- Access to most recent versions of data sets during experiments
- Handle multidimensional data
- Transmit metadata
- Extensible data manipulation routines
- Usable interactively or from programs
Basic Elements
Basic Elements:
- data objects which receive requests and respond with data
- application programs/interfaces to other software
- a server which connects applications to objects
Data Objects
Data Objects package together a program (the translator or
method) and data. User programs never look at the data directly;
rather, they communicate with the data object.
Data Objects communicate with a common protocol:
- All data objects present the same appearance to the outside
(see Protocol for Communication)
- Programs can work with any data in the system
Data Objects handle
Projection (subsetting by variable name)
Selection (subsetting by variable values)
- Can minimize transmission of data
- Individual objects may have other functions
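The two operations above can be illustrated with a minimal Python sketch. It is not the JGOFS implementation: the function names select and project are hypothetical, the rows are plain dictionaries filled with a few values from the sample hydrographic data later in this report, and only the comparison operators that appear in this report's examples are handled.

```python
import operator
import re

# Comparison operators seen in this report's examples (press<100, station<=5,
# station=6). The full set accepted by real methods may differ (assumption).
OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt,
       ">": operator.gt, "=": operator.eq}
COND = re.compile(r"^(\w+)(<=|>=|<|>|=)(.+)$")

def select(rows, expression):
    """Selection: keep rows whose values satisfy every '&'-joined condition."""
    tests = []
    for cond in expression.split("&"):
        m = COND.match(cond.strip())
        if m is None:
            raise ValueError(f"cannot parse condition: {cond!r}")
        name, op, value = m.groups()
        tests.append((name, OPS[op], float(value)))
    return [r for r in rows if all(op(r[n], v) for n, op, v in tests)]

def project(rows, names):
    """Projection: keep only the named variables, in the order requested."""
    return [{n: r[n] for n in names} for r in rows]

rows = [
    {"station": 3, "press": 5.0, "temp": 18.334, "o2": 5.97},
    {"station": 3, "press": 149.0, "temp": 11.906, "o2": 5.02},
    {"station": 4, "press": 5.0, "temp": 17.516, "o2": 5.84},
    {"station": 5, "press": 5.0, "temp": 18.382, "o2": 5.77},
]
subset = project(select(rows, "station<=4&press<100"),
                 ["station", "press", "o2"])
print(subset)
```

Doing both steps inside the data object, before anything is transmitted, is what allows the amount of data sent over the network to be minimized.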
Translators/ methods
The "translators" (or "methods" in object-based terminology) are
programs which give other principal investigators (PIs) a viewport into a
data set. The program both makes the data set visible to the outside world
and shields outside users from needing to know the details of where and
how the data is stored.
These programs are responsible for
- receiving requests for subselections of the data
- gathering the requested information from the data set
- translating the information into the internal form used for
transferring data
- sending the information through the communication line to the
process which made the request.
One translator may serve several different data sets -- the translators
depend on the format chosen by the PI, but generally not on the
information itself, though there can be exceptions.
Data Model - appearance to applications
The JGOFS data model is the critical part of the communications
protocol. It includes:
- Comments (text)
- Variable descriptions
- Name
- Dimensions for vectors/ matrices/ tensors [not implemented!]
- Attributes (e.g., units)
- Hierarchical structure
- Data
- End of data set indicator
The hierarchical structuring is an important way of organizing many
kinds of data. It groups the least rapidly changing variables (e.g.,
header data), then the next-most rapidly changing information, etc. For
example, a hydrographic section might look like
leg
year            /[lowest (0) level]/
month
    station
    lat         /[level 1]/
    lon
    date
        press
        temp
        sal     /[level 2]/
        o2
        sigth
A current meter mooring might have
mooring_id
lat
lon             /[level 0]/
nominal_depth
start_time
end_time
    time
    u           /[level 1]/
    v
    temp
Often one scans the outer (lower-numbered) levels first to pick out the
desired station or mooring and then retrieves the information only for
that subset of the data base.
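As an illustration of the hierarchy, the following Python sketch holds a shortened hydrographic section as nested levels, flattens it to one record per sample, and scans the station level before descending. The nested-dictionary layout, the key names stations and samples, the function names, and the level-0 values (leg, year, month) are all assumptions made for this example; the station and sample values come from the sample data later in this report.

```python
# Nested levels for a shortened hydrographic section: level 0 holds
# cruise-wide values, level 1 one entry per station, level 2 one per sample.
section = {
    "leg": 1, "year": 1993, "month": 2,  # illustrative level-0 values
    "stations": [
        {"station": 3, "lat": 38.28, "lon": -73.53,
         "samples": [{"press": 5.0, "temp": 18.334},
                     {"press": 25.0, "temp": 12.848}]},
        {"station": 4, "lat": 38.19, "lon": -73.52,
         "samples": [{"press": 5.0, "temp": 17.516}]},
    ],
}

def flatten(section):
    """Yield one flat record per innermost (level 2) sample."""
    level0 = {k: v for k, v in section.items() if k != "stations"}
    for st in section["stations"]:
        level1 = {k: v for k, v in st.items() if k != "samples"}
        for sample in st["samples"]:
            # Outer-level values are repeated on every inner record.
            yield {**level0, **level1, **sample}

def samples_for(section, station):
    """Scan level 1 first, then descend only into the matching station."""
    for st in section["stations"]:
        if st["station"] == station:
            return st["samples"]
    return []

flat = list(flatten(section))
print(len(flat), samples_for(section, 4))
```

Storing the slowly changing values once per level, rather than on every record, is what makes the scan-then-retrieve pattern cheap.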
Putting Data on the system
To add a new data object to the system, one needs a translator/method
which can properly interpret the data. The options are:
- Write a new translator to conform to the data. If there is a
large, established database with existing programs for updating
and access, this may be the best procedure. Often this translator
may also glue together a number of different files to form a full
database.
- Transform the data into a form compatible with an existing
translator/method. This may be the easiest thing to do when a
measurement program is just beginning.
Two existing methods, shipped with the system, are the default method,
def, and the method for reading output from the list program, nm.
def
This is intended for data with each station (or mooring, etc.) in a
single file, with header files linking them. Thus a hydrographic data
set might look like:
Header file
# Gulf Stream Cruise Stations 3-5
# p<1000
station lat lon > [variable names for this file's data]
press temp sal o2 sigth [variable names for the next level files]
3 38.28 -73.53 s3
4 38.19 -73.52 s4
5 38.16 -73.26 s5
file s3
# Station 3
# lat=38.28, lon=-73.53
# This data prepared by someone
# Measurement at station 21 decibars contaminated
# 2/18/93
depth temp sal oxy
1.000 21.800 25.380 5.700
3.300 nd nd nd
5.000 21.800 25.580 5.600
10.000 21.400 25.670 5.400
13.000 21.000 25.850 5.000
15.000 20.500 26.020 5.000
21.000 19.900 26.400 5.000
The # sign indicates comments; the > in the header variable name list
indicates that item points to a subfile containing more detailed
information.
nm
This method is for a single file with multiple stations.
# Gulf Stream Cruise Stations 3-5
# p<1000
station = 3 lat = 38.28, lon = -73.53
press, temp, sal, o2, sigth
5.000, 18.334, 33.570, 5.970, 24.096
25.000, 12.848, 34.159, 6.990, 25.773
49.000, 11.070, 34.523, 6.060, 26.394
99.000, 11.093, 35.090, 5.340, 26.831
149.000, 11.906, 35.487, 5.020, 26.990
199.000, 10.819, 35.435, 4.210, 27.152
station = 4, lat = 38.19, lon = -73.52
press, temp, sal, o2, sigth
5.000, 17.516, 33.160, 5.840, 23.981
25.000, 12.315, 33.958, 7.090, 25.721
49.000, 9.612, 34.192, 6.020, 26.387
99.000, 12.095, 35.402, 5.340, 26.887
149.000, 12.407, 35.625, 5.290, 27.000
199.000, 11.287, 35.487, 4.340, 27.108
station = 5, lat=38.16, lon=-73.26
press, temp, sal, o2, sigth
5.000, 18.382, 33.647, 5.770, 24.143
25.000, 12.040, 34.196, 6.660, 25.959
49.000, 11.951, 34.925, 5.510, 26.543
99.000, 11.914, 35.390, 5.100, 26.912
149.000, 12.045, 35.547, 5.070, 27.010
199.000, 11.976, 35.589, 4.940, 27.057
Comment lines begin with #. The lines with an equals sign = contain
assignments for variables at level 0 (comma or space separated). The
assignments need only be made when the variable changes. The first line
without an equals sign contains the names of the level 1 variables
(comma or space separated).
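The rules above are simple enough to sketch as a small reader. This Python example is not the nm method itself: parse_nm is a hypothetical name, all values are assumed to be numeric, and a repeated line of variable names (as in the sample file) is assumed to be a restatement of the level 1 names and is skipped.

```python
import re

# name = value pairs, separated by commas and/or spaces
ASSIGN = re.compile(r"(\w+)\s*=\s*([^,\s]+)")

def parse_nm(text):
    """Return one flat record per data line of an nm-style file."""
    level0 = {}      # current level-0 assignments (station, lat, lon, ...)
    names = None     # level-1 variable names, set by the first name line
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        if "=" in line:
            for name, value in ASSIGN.findall(line):
                level0[name] = float(value)
            continue
        fields = [f for f in re.split(r"[,\s]+", line) if f]
        if names is None:
            names = fields                # first line without '='
        elif fields == names:
            continue                      # repeated name line (assumption)
        else:
            values = [float(f) for f in fields]
            records.append({**level0, **dict(zip(names, values))})
    return records

sample = """\
# Gulf Stream Cruise Stations 3-5
station = 3 lat = 38.28, lon = -73.53
press, temp, sal, o2, sigth
5.000, 18.334, 33.570, 5.970, 24.096
25.000, 12.848, 34.159, 6.990, 25.773
station = 4, lat = 38.19, lon = -73.52
press, temp, sal, o2, sigth
5.000, 17.516, 33.160, 5.840, 23.981
"""
records = parse_nm(sample)
print(len(records), records[0])
```

Because level 0 assignments persist until they change, each record simply carries the most recent station, lat, and lon.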
Communications
There are two parts to the problem of communicating information from the
object on one machine to the application on another:
- The "physical" connection, which involves setting up a data
transfer pathway between the two processes on the different
machines. To do this, the software uses NCSA's HTTPD and a JGOFS
data server program.
- The protocol for the communication, which ensures that the
processes understand the requests and replies.
All exchanges between the user's application program (process 1) and the
method/ translator (process 2 -- perhaps on another machine) are made
via interprocess communications using ``pipes'' or ``sockets'' as
defined in Berkeley UNIX. In the case of a locally defined object, a
pipe is opened between the application and the method processes. For a
remotely defined object, the application opens a socket to the HTTP
daemon on the other machine and starts the server. The server
effectively connects the standard output stream of the method to the
socket in the application. The processes then begin exchanging
information according to the JGOFS protocol.
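The local (pipe) case can be sketched in Python as follows. This is only an illustration of the pattern, not the JGOFS code: FAKE_METHOD is a hypothetical stand-in program that echoes its arguments, read_local_object is an invented name, and the path /data/t0 is a placeholder. It relies on the parameters being passed to the method as command line arguments, with the method writing its reply to standard output.

```python
import subprocess
import sys

# Stand-in "method" (hypothetical): echoes its command line arguments in a
# comment block and then signals the end of the object.
FAKE_METHOD = r"""
import sys
print("&c")
print("args: " + " ".join(sys.argv[1:]))
print("&e")
"""

def read_local_object(params):
    """Start the method as a child process and read its output via a pipe."""
    proc = subprocess.Popen(
        [sys.executable, "-c", FAKE_METHOD, *params],
        stdout=subprocess.PIPE, text=True)
    lines = []
    for line in proc.stdout:
        line = line.rstrip("\n")
        lines.append(line)
        if line.startswith("&e"):   # end-of-object marker
            break
    proc.stdout.close()
    proc.wait()
    return lines

reply = read_local_object(["/data/t0", "press<100"])
print(reply)
```

For a remote object the application would read the same stream from a socket instead of a pipe, which is why the rest of the application does not need to know where the object lives.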
Servers and Dictionaries
Servers
Each JGOFS system which is providing data must have the HTTPD process
running as a background task. When a request for data comes in, HTTPD
starts our server process. This process consults the dictionary, starts
up the method process and passes it the requested subselections and
other parameters.
The method analyzes the request, gets the information from the data
files or database, and writes out the results (in the JGOFS protocol).
These pass through the communication pathway to the application
program. In this sense, the method acts like an input subroutine which
the main program calls to get data from files. However, the data can be
gathered from across the network.
Dictionaries: .objects files
The server works with two dictionaries, the user's (in the current
working directory) and a tree of system dictionaries (set up when the
software is built). These translate between a shorthand notation for the
object and the detailed description either of where the object is [what
machine it's on], or, if it's locally held, what method is used, and
what default arguments are to be passed to the method. Thus the user can
generally deal with brief names.
So users can specify objects in the following forms:
1. method(parameters)
In this case, the software will use the method named as the
translator, passing it the parameters. Methods are stored in the
methods subdirectory of the JGOFS software directory. The parameters
are passed as command line arguments to the process.
2. datafilename or datafilename(parameters)
In this case, the software assumes the default method, def, is being
used.
3. nameindictionary or nameindictionary(parameters)
The name is looked up in a file, .objects, in the present directory
and replaced with the information found therein. The parameters are
merged. For example, if the local .objects file contains
stuff=nm(myfile)
farstuff=//globec.whoi.edu/test
Then a request for stuff(press<100) will translate to
nm(myfile,press<100) and then be reinterpreted by the first rule. A
request for farstuff(press<100) will be translated to
//globec.whoi.edu/test(press<100) and reinterpreted by the fifth rule
below.
4. /path/nameindictionary or /path/nameindictionary(parameters)
The name is looked up in a file, .objects, in the JGOFS system
directory, following the path given. The ``root'' of the objects
tree is the subdirectory objects of the JGOFS software directory.
Replacement occurs as above.
5. //machinename/path/nameindictionary or
//machinename/path/nameindictionary(parameters)
The path, name, and parameters are transferred to the remote machine
which then follows the procedure outlined just above.
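The dictionary lookup and parameter merging in the third form can be sketched in Python. This is a simplified illustration under stated assumptions, not the server's code: the function names are hypothetical, the dictionary is an in-memory mapping rather than a .objects file on disk, and only a single lookup step is performed (the result would then be reinterpreted by the other rules).

```python
import re

# name or name(parameters); the name itself contains no parentheses
REQUEST = re.compile(r"^([^()]+)(?:\((.*)\))?$")

def split_request(request):
    """Split 'name(params)' into ('name', 'params'); params may be empty."""
    m = REQUEST.match(request.strip())
    if m is None:
        raise ValueError(f"cannot parse request: {request!r}")
    head, params = m.groups()
    return head, params or ""

def parse_objects(text):
    """Read name=definition lines, as in a .objects file."""
    entries = {}
    for line in text.splitlines():
        if "=" in line:
            name, definition = line.split("=", 1)
            entries[name.strip()] = definition.strip()
    return entries

def expand(request, dictionary):
    """Replace a dictionary name by its definition and merge the parameters."""
    name, params = split_request(request)
    if name not in dictionary:
        return request                      # not a dictionary name
    target, defaults = split_request(dictionary[name])
    merged = ",".join(p for p in (defaults, params) if p)
    return f"{target}({merged})" if merged else target

objects = parse_objects("stuff=nm(myfile)\nfarstuff=//globec.whoi.edu/test\n")
print(expand("stuff(press<100)", objects))
print(expand("farstuff(press<100)", objects))
```

The default arguments from the dictionary come first and the user's parameters are appended, matching the stuff(press<100) example above.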
Dictionaries have two types of entries:
Local entries
These map the name to a method on this machine and (usually) some
required parameters: e.g.,
bot=jgbl2(/d5/glenn/bloom/bot)
Remote entries
Usually, these just map a name on this machine to a name on the
other machine. Thus if a data object on the remote machine is moved
or replaced, only the dictionary on that machine needs to be
updated. This also shields remote users from needing any details
about the remote filesystem, methods, or data locations. An entry of
this type looks like
bot=//puddle.mit.edu/jgofs/bloom/bot
Dictionaries: .remoteobjects files
In addition, the system supports a set of dictionaries which tell the
outside world what objects are available on this machine. Other
information about the object is also provided, usually with links to
an HTML page giving a textual description of the information in the
object, the variables, etc. Such a file looks like
tco2=//puddle.mit.edu/jgofs/bloom/tco2
- P.Brewer
- Total carbon dioxide
optics=//dataone.whoi.edu/jgofs/bloom/optics
- C.Davis
- Bio Optical Profiler Data
poc=//puddle.mit.edu/jgofs/bloom/poc
- H.Ducklow
- Particulate C, N
stuff=//puddle.mit.edu/test
-
- http://puddle.mit.edu/notready.html This will contain good stuff
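A reader for a file laid out like the example above might look like the following Python sketch. The function name and the returned structure are assumptions: each name=target line starts an entry, and the lines beginning with - are collected as free-text notes without interpreting them further.

```python
def parse_remoteobjects(text):
    """Collect name=target entries, each followed by '-' note lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-"):
            if entries:                       # note belongs to the last entry
                entries[-1]["notes"].append(line[1:].strip())
        elif "=" in line:
            name, target = line.split("=", 1)
            entries.append({"name": name.strip(),
                            "target": target.strip(),
                            "notes": []})
    return entries

listing = """\
tco2=//puddle.mit.edu/jgofs/bloom/tco2
- P.Brewer
- Total carbon dioxide
stuff=//puddle.mit.edu/test
-
- http://puddle.mit.edu/notready.html This will contain good stuff
"""
catalog = parse_remoteobjects(listing)
for entry in catalog:
    print(entry["name"], entry["target"], entry["notes"])
```

An empty note (a bare -) is kept as an empty string so that the position of each note within its entry is preserved.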
Protocol for Communication
Methods provide three different kinds of data stream. You can view all
of these from browsers:
The rest of this document concentrates on the last case.
Example
We illustrate the communication protocol with a simple example. The
command
list "test(station<=5&press<100,station,lat,lon,press,o2)"
lists a data object which looks like
# wunsch stations 3-10
# p<1000
=======================
station, lat, lon
........................
3, 38.28, -73.53
=======================
press, o2
------------------------
5.000, 5.970
25.000, 6.990
49.000, 6.060
99.000, 5.340
=======================
station, lat, lon
........................
4, 38.19, -73.52
=======================
press, o2
------------------------
5.000, 5.840
25.000, 7.090
49.000, 6.020
99.000, 5.340
=======================
station, lat, lon
........................
5, 38.16, -73.26
=======================
press, o2
------------------------
5.000, 5.770
25.000, 6.660
49.000, 5.510
99.000, 5.100
=======================
The dictionary entry is assumed to be
test=def(/usr/users/jgofs/data/t0)
The communications look like:
list -> method (def)
argv = [/usr/users/jgofs/data/t0,station<=5&press<100,station,lat,lon,press,o2]
def -> list
&c***********************
wunsch stations 3-10
p<1000
&v0======================
&v1======================
station lat lon
&v2======================
press o2
&r=======================
&c***********************
wunsch stations 3-5
p<1000
&d0----------------------
&d1----------------------
3 38.28 -73.53
&d2----------------------
5.000 5.970
25.000 6.990
49.000 6.060
99.000 5.340
&d1----------------------
4 38.19 -73.52
&d2----------------------
5.000 5.840
25.000 7.090
49.000 6.020
99.000 5.340
&d1----------------------
5 38.16 -73.26
&d2----------------------
5.000 5.770
25.000 6.660
49.000 5.510
99.000 5.100
&e**** End of object ****
Thus the application begins by sending the parameters to the method and
then reading the blocks of data. The blocks are indicated by commands
with an & in the first position. There are four types of protocol
blocks: comments, variable names, data, and end.
Protocol blocks
Comments
The &c introduces the plain text comments section. Comments consist of
lines of no more than 80 characters.
Variables
This section gives the names, dimensions, and attributes of variables at
each hierarchical level. The outermost level, 0, is defined first and
then we work our way inward. The signal is &vn with n=0...9 the level
indicator. Each variable definition has:
- The name (avoid embedded blanks --- use _ )
- Attribute list appended to the variable name surrounded by []. The
attribute list is a comma-separated set of strings, usually
(except for units) of the form attribute=value. The variables
section is closed by a record marker, &r.
Variable fields are tab-separated.
Data
The data is likewise presented in a hierarchical fashion. The &dn
markers introduce the data from the n'th level. Note that the innermost
level can drop the &dn. The data from the outermost level is sent,
followed by the next level, up to the innermost level. The innermost
level repeats until the next level up changes or the data ends. Data
fields are tab-separated.
End
This indicates the end of the data object. The indicator is &e.
Errors
Errors are indicated by the method returning &x [descriptive string] and
exiting.
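To make the block structure concrete, here is a minimal Python sketch of a reader for the stream described above. It is an illustration under assumptions, not the JGOFS client code: parse_stream and split_fields are hypothetical names, values are kept as strings, attribute lists in [] are not separated from variable names, and fields are split on tabs when present and on whitespace otherwise.

```python
def split_fields(line):
    """Split on tabs (as the protocol specifies), else on whitespace."""
    return line.split("\t") if "\t" in line else line.split()

def parse_stream(lines):
    """Return (comments, names per level, flat rows) from a protocol stream."""
    comments, names, current, rows = [], {}, {}, []
    mode, level = None, None
    for line in lines:
        if line.startswith("&"):
            code = line[1:2]
            if code == "c":                       # comment block
                mode = "comment"
            elif code == "v":                     # variable names, level n
                mode, level = "vars", int(line[2])
                names.setdefault(level, [])
            elif code == "r":                     # closes the variables section
                mode = None
            elif code == "d":                     # data, level n
                mode, level = "data", int(line[2])
            elif code == "e":                     # end of object
                break
            elif code == "x":                     # error reported by the method
                raise RuntimeError(line[2:].strip())
            continue
        if mode == "comment":
            comments.append(line.strip())
        elif mode == "vars":
            names[level].extend(split_fields(line))
        elif mode == "data":
            current[level] = dict(zip(names[level], split_fields(line)))
            if level == max(names):               # innermost level: emit a row
                row = {}
                for lv in sorted(current):
                    row.update(current[lv])
                rows.append(row)
    return comments, names, rows

example = [
    "&c***********************",
    "wunsch stations 3-5",
    "&v0======================",
    "&v1======================",
    "station\tlat\tlon",
    "&v2======================",
    "press\to2",
    "&r=======================",
    "&d0----------------------",
    "&d1----------------------",
    "3\t38.28\t-73.53",
    "&d2----------------------",
    "5.000\t5.970",
    "25.000\t6.990",
    "&d1----------------------",
    "4\t38.19\t-73.52",
    "&d2----------------------",
    "5.000\t5.840",
    "&e**** End of object ****",
]
comments, names, rows = parse_stream(example)
print(rows)
```

Because the reader stays in data mode at the current level until it sees another marker, consecutive innermost lines need no &dn of their own.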
User Applications
There are many different ways of using the JGOFS data system:
- Browsers: World-Wide Web browsers can be used to list and (for
those which support imaging) plot data.
- For local listing and plotting of data, one can use a user
interface to construct calls to the data system, acquire the data,
and do certain operations on it. The user interface is not a core
part of the system, but is built using the programming interface;
thus we can and do have several different interfaces available:
  - Command line
  - Menu shell (X windows and VT100)
  - Browser shell
- Data can be imported into commercial packages, in some cases quite
directly. Thus we have a MATLAB function, loadjg, which can read
data directly into matrices from the data system (including all
the data manipulation operations).
- We provide a simple set of subroutine calls by which C and
Fortran programs can read data.
User Interfaces
Command Line
After setting one's path appropriately to include the JGOFS binaries,
one can type commands which directly invoke programs that communicate
with the data system. Such commands can also be incorporated in a shell
script (perhaps with arguments) for repetitive operations. The basic
commands are
Listing Data Objects
readdct
Presents the names of the data objects with a brief description.
Listing an Object
listvar "object"
Lists the variables in the object, with indenting indicating the
hierarchical structure.
list "object"
Lists the data
In addition, there are options for other kinds of
listings:
list [-n] [-s] [-t] [-f] [-b] [-c] [-z] object [outfile]
Plotting
p "object" [-r] [-l] xvar [-r] [-l] yvar [-sym ] [+sym ] [-siz ]
[-nobrk] [-brk breakvar]
X-Y plot (autoscaled)
view xv0 yv0 xv1 yv1
Sets position of lower left and upper right corners of plotting area
on the page (0 to 100)
window x0 y0 x1 y1
Sets user units for lower left and upper right corner
axis [-l] x/y tic label num format
Draws axis
pl "object" [-l] xvar [-l] yvar [-sym ] [+sym ] [-siz ]
[-nobrk] [-brk breakvar]
Plots lines, symbols, or connected symbols
dash "object" [-l] xvar [-l] yvar dashpattern
Plots with dashed lines
text "text" x/y/-x/-y size xloc yloc [-u]
Write text on graphics screen
ers
Erase page
ch filename var=value [var=value ...]
Change a file with lines giving values to variables; c.f. plot.var
Example
As an example, consider the following shell script twoplot:
#!/bin/csh -f
ers
view 15 20 45 80
window 0 500 20 0
axis x 5 Temperature 10 dd.d
axis y 100 Pressure 500 ddd
pl "/test(station=$1)" temp press
view 65 20 95 80
window 33 500 36 0
axis x 0.5 Salinity 1 dd.d
axis y 100 Pressure 500 ddd
dash "/test(station=$1)" sal press ----....----....
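The command
twoplot 6
produces a two-panel plot for station 6: temperature against pressure
on the left and salinity against pressure, drawn with a dashed line, on
the right.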
Menu Interfaces
The menu interfaces provide a means for organizing the commands, showing
information in several windows so that one can keep more than one result
on the screen, and keeping information such as the object name from
command to command. The X version provides a "point and click" kind of
access, but can in fact also execute typed commands.
Browser Interface
The browser interface can be used from UNIX-X, MS-DOS-Windows, and Mac
machines. In addition, it is possible to install software on UNIX
machines which allows you to use the local server and obtain full
functionality through browsers.
Application Program Interface
From a C program, you can call the following functions to obtain
information from the data system:
maxlev = jdbopen_(&unit,obj,names,&namesize,&num)
Opens a data object. The variable char obj[1024] is a long string
containing the object name, including parameters if desired for
selections, etc. The array of char names[][namesize] contains num
names for the variables to be returned (if num > 0) or space for
|num| names if num < 0. In the latter case, the number of
variables found is also returned in num. The result of the call is
the maximum hierarchical level of the dataset. Negative values are
error returns.
lev = jdblevel_(&unit,&varnum)
Return the level in the hierarchy (0=outermost, 1=next,...) of the
variable number varnum.
lev = jdbread_(&unit,values)
Read the next realization of the data from the object. The
subroutine fills in num values in the array of floats.
lev = jdbreada_(&unit,values,&valuesize)
Same as above, but the values are read into strings.
ok = jdbcomments_(&unit,outcom)
Return the next comment in the string outcom. The returned value is
0 if there are no more comments.
ok = jdbattributes_(&unit,&id,outcom)
Return the next attribute of variable number id in the string
outcom. The returned value is 0 if there are no more attributes.
jdbclose_(&unit)
Close the object.
In Fortran, the calls are
maxlev = jdbopen(unit,obj,names,namesize,num)
lev = jdblevel(unit,varnum)
lev = jdbread(unit,values)
lev = jdbreada(unit,values,valuesize)
ok = jdbcomments(unit,outcom)
call jdbclose(unit)
Manipulating Data
The system has two built-in operations -- functions which each data
object is expected to carry out:
- Projection (subsetting by variable name)
- Selection (subsetting by variable values)
But, although these are the most common operations, they do not in and
of themselves satisfy all the requirements for a data system. One
significant advantage of an object-based data system is that new
operations can be added at any time. One simply builds a "method" which
takes as its input information that supplied by one (or more) objects
rather than data files, transforms the information in some way, and
passes it on to the user application.
We call the combination of the new method and the sub-objects a
"constructed object." You can also think of these as similar to UNIX
filters.
For example, we can add a column to the /test (hydrographic data) which
gives a linearized estimate of density:
rho=28.5-0.2 T +0.7 (S-35)
by using the ``math'' constructed object which takes as parameters an
input object name and formulae for changing/ adding columns. The new
object name is
math(/test,rho=28.5-0.2*temp+0.7*(sal-35))
and this can be used by the lister/plotter/... in exactly the same way
as any other object --- see figure.
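The idea of a constructed object acting as a filter can be sketched in Python. This is not the math method itself: add_column and rho are hypothetical names, records are plain dictionaries, and the formula is supplied as a Python function rather than parsed from the object name. The two input rows use temp and sal values from the sample data earlier in this report.

```python
def add_column(rows, name, formula):
    """Filter: pass each record through unchanged, adding one computed column."""
    for row in rows:
        out = dict(row)          # leave the input record untouched
        out[name] = formula(row)
        yield out

def rho(row):
    """Linearized density estimate from the text: 28.5 - 0.2 T + 0.7 (S - 35)."""
    return 28.5 - 0.2 * row["temp"] + 0.7 * (row["sal"] - 35)

source = [
    {"press": 5.0, "temp": 18.334, "sal": 33.570},
    {"press": 25.0, "temp": 12.848, "sal": 34.159},
]
with_rho = list(add_column(source, "rho", rho))
for record in with_rho:
    print(record)
```

Since the output has the same shape as the input plus one variable, it can be fed to any consumer that accepts an ordinary object, including another filter.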
As another example, there is a plot from two data objects joined
together by common station, cast, and pressure.
join(/tco2,/poc(station,cast,press,poc))
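A join of two objects on shared variables can be sketched as follows. This Python example assumes an inner join (records without a partner are dropped), which the report does not state; the function name and all tco2 and poc values are made up for illustration.

```python
def join(left, right, keys):
    """Combine records from two sources that agree on every key variable."""
    index = {}
    for r in right:
        index.setdefault(tuple(r[k] for k in keys), []).append(r)
    for l in left:
        for r in index.get(tuple(l[k] for k in keys), []):
            yield {**l, **r}

keys = ["station", "cast", "press"]
# Illustrative values only, not real measurements.
tco2 = [
    {"station": 1, "cast": 1, "press": 10.0, "tco2": 2050.0},
    {"station": 1, "cast": 1, "press": 50.0, "tco2": 2100.0},
]
poc = [
    {"station": 1, "cast": 1, "press": 10.0, "poc": 12.5},
    {"station": 2, "cast": 1, "press": 10.0, "poc": 9.0},
]
joined = list(join(tco2, poc, keys))
print(joined)
```

Indexing one input by its key tuple keeps the join to a single pass over each source, which suits data arriving as a stream.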
Becoming a Data Server
To serve data in the JGOFS distributed database system requires
setting up a multi-processing computer system, connected to the
network, with the required software. You can set this software up
yourself using a procedure accessed via your Web browser. For the
most recent U.S. GLOBEC version, which is compatible with the
original version (see note 1), you can ftp the tar file from
ftp://globec.whoi.edu/pub/software/JGOFS/.
Documentation about the JGOFS system is available on-line and is called
the JGOFS Data System Overview.
For more information and help installing the software contact the
Data Management Administrator.
1. For the MIT V1.5 version of the JGOFS software, open the URL
http://lake.mit.edu/~glenn/jg/install-1-5.html
and follow the directions in the form.
Last modified: August 29, 2005
JGOFS Reference Documents
These documents will be returned as PostScript files. If you wish to
print them, you should use the save-to-disk option (under the Options
menu) before clicking on them.
- Paper describing data system: This manuscript describes the
principles of the data system. (Note: it is a large file (1.4 MB)
because of figures and, on some UNIX systems, requires using
lpr -s in order to print it.)
- Default Formats: Describes pre-existing formats. If you put your
data in one of these, you will not need to write a "method."
- Method Writing Guide: How to construct a "method" or translator
for a particular data storage scheme in order to put the data
on-line.
- A beginner's guide to the X windows JGOFS interface: Describes how
to use the data system from the X windows menu interface. Also
discusses how to subselect information from data objects.
- Using the data system from programs: Describes the calls you can
use from C or Fortran programs to open a data object, read the
values, and close it. Other functions access comments, variable
attributes, etc.
- Summary from the data system workshop: This has all the slides
from the data system workshop May 24-25, 1995.