JGOFS Data System Overview

Glenn R. Flierl, MIT
James K.B. Bishop, LDEO
David M. Glover, WHOI
Satish Paranjpe, LDEO

Index

Obtaining the Software

Full Report Available On-line

You can obtain a complete web-based copy of the report by clicking here.

Further Documentation

We have various PostScript documents on details of the system. These are somewhat more technical. See here.


Adapted from original web-based report.
Last modified: September 21, 2004

Introduction

Large oceanographic programs such as JGOFS (The Joint Global Ocean Flux Study) require data management systems which enable the exchange and synthesis of extremely diverse and widely spread data sets. We have developed a distributed, object-based data management system for multidisciplinary, multi-institutional programs. It provides the capability for all JGOFS scientists to work with the data without regard for the storage format or for the actual location where the data resides. The approach used yields a powerful and extensible system (in the sense that data manipulation operations are not predefined) for managing and working with data from large scale, on-going field experiments.

In the ``object-based'' system, user programs obtain data by communicating with a program (the ``method'') which can interpret the particular data base. Since the communication protocol is standard and can be passed over a network, user programs can obtain data from any data object anywhere in the system. Data base operations and data transformations are handled by methods which read from one or more data objects, process that information, and write to the user program.

Basic Elements

Data Objects

Data Objects package together a program (the translator or method) and data. User programs never look at the data directly; rather, they communicate with the data object.

Data Objects communicate with a common protocol.

Data Objects handle:
Projection (subsetting by variable name)
Selection (subsetting by variable values)
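As a rough illustration of what projection and selection mean in practice, the following Python sketch applies a request string to a list of flat records. The request syntax (a selection such as station<=4&press<100 followed by the variable names to keep) follows the example request shown later in this document. The function names, the set of comparison operators, and the numeric-only comparison are assumptions made for this sketch, not the behaviour of any shipped method.

```python
import operator
import re

# Comparison operators assumed for this sketch (the report shows <, <= and =).
OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt,
       ">": operator.gt, "=": operator.eq}
COND = re.compile(r"^(\w+)\s*(<=|>=|<|>|=)\s*(.+)$")


def parse_request(params):
    """Split 'sel1&sel2,var1,var2' into selection tests and projected names."""
    selections, projection = [], []
    for arg in params.split(","):
        if any(op in arg for op in OPS):
            for cond in arg.split("&"):
                match = COND.match(cond.strip())
                if match is None:
                    raise ValueError(f"cannot parse selection: {cond!r}")
                name, op, value = match.groups()
                selections.append((name, OPS[op], float(value)))
        elif arg.strip():
            projection.append(arg.strip())
    return selections, projection


def apply_request(records, params):
    """Yield records passing every selection, reduced to the projected names."""
    selections, projection = parse_request(params)
    for rec in records:
        if all(op(float(rec[name]), value) for name, op, value in selections):
            yield {k: rec[k] for k in projection} if projection else dict(rec)


# Rows taken from the sample hydrographic data in this report.
RECORDS = [
    {"station": 3, "lat": 38.28, "lon": -73.53, "press": 5.0, "o2": 5.97},
    {"station": 3, "lat": 38.28, "lon": -73.53, "press": 149.0, "o2": 5.02},
    {"station": 4, "lat": 38.19, "lon": -73.52, "press": 25.0, "o2": 7.09},
    {"station": 5, "lat": 38.16, "lon": -73.26, "press": 5.0, "o2": 5.77},
]

for row in apply_request(RECORDS, "station<=4&press<100,station,press,o2"):
    print(row)
```

In the real system each data object performs these two operations itself, so only the requested subset travels over the network.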

Translators/ methods

The "translators" (or "methods" in object-based terminology) are programs which give other PIs a viewport into a data set. The program both makes the data set visible to the outside world and shields outside users from needing to know the details of where and how the data is stored.

JGOFS translator

These programs are responsible for

One translator may serve several different data sets -- the translators depend on the format chosen by the PI, but generally not on the information itself, though there can be exceptions.

Data Model - appearance to applications

The JGOFS data model is the critical part of the communications protocol. It includes:

The hierarchical structuring is an important way of organizing many kinds of data. It groups the least rapidly changing variables (e.g., header data), then the next-most rapidly changing information, etc. For example, a hydrographic section might look like

leg
year          /[lowest (0) level]/
month
   station
   lat        /[level 1]/
   lon
   date
      press
      temp
      sal     /[level 2]/
      o2
      sigth

A current meter mooring might have

mooring_id
lat
lon                 /[level 0]/
nominal_depth
start_time
end_time
   time
   u                /[level 1]/
   v
   temp

Often one scans the lower-level information (the lower-numbered, outer levels) first to pick out the desired station or mooring and then retrieves the information only for that subset of the data base.
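The hierarchical model can be pictured as nested records in which each inner row inherits the values of the levels above it. The Python sketch below is one possible in-memory representation, not the system's internal one. The dictionary layout, the function names, and the leg/year/month values are placeholders invented for illustration; the station and profile values come from the sample data in this report.

```python
# Hypothetical nested form of a three-level hydrographic object.
SECTION = {
    "vars": {"leg": 1, "year": 1993, "month": 2},            # level 0 (placeholder values)
    "children": [
        {
            "vars": {"station": 3, "lat": 38.28, "lon": -73.53},   # level 1
            "children": [
                {"vars": {"press": 5.0, "temp": 18.334}, "children": []},   # level 2
                {"vars": {"press": 25.0, "temp": 12.848}, "children": []},
            ],
        },
        {
            "vars": {"station": 4, "lat": 38.19, "lon": -73.52},
            "children": [
                {"vars": {"press": 5.0, "temp": 17.516}, "children": []},
                {"vars": {"press": 25.0, "temp": 12.315}, "children": []},
            ],
        },
    ],
}


def flatten(node, inherited=None):
    """Yield one flat record per innermost row, carrying outer-level values."""
    merged = dict(inherited or {})
    merged.update(node["vars"])
    if not node["children"]:
        yield merged
        return
    for child in node["children"]:
        yield from flatten(child, merged)


def scan_level(node, depth):
    """Yield only the variables stored at one level (e.g. the station headers)."""
    if depth == 0:
        yield node["vars"]
        return
    for child in node["children"]:
        yield from scan_level(child, depth - 1)


# Scan the station headers first, then expand the full records.
print([header["station"] for header in scan_level(SECTION, 1)])
for record in flatten(SECTION):
    print(record)
```

Scanning level 1 touches only the station headers, which is the cheap first pass described above; flattening expands every innermost row.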

Putting Data on the system

To add a new data object to the system, one needs a translator/method which can properly interpret the data. The options are:

Two existing methods, shipped with the system, are the default method, def, and the method for reading output from the list program, nm.

def

This is intended for data with each station (or mooring, etc.) in a single file, with header files linking them. Thus a hydrographic data set might look like the lines below:

Header file
# Gulf Stream Cruise Stations 3-5
# p<1000
station lat lon >    [variable names for this file's data]
 press  temp sal o2 sigth [variable names for the next level files] 
  3     38.28 -73.53  s3
  4     38.19 -73.52  s4
  5     38.16 -73.26  s5
file s3
# Station 3
# lat=38.28, lon=-73.53
# This data prepared by someone
# Measurement at station 21 decibars contaminated
# 2/18/93
depth   temp    sal     oxy
 1.000  21.800  25.380   5.700
 3.300  nd      nd        nd
 5.000  21.800  25.580   5.600
10.000  21.400  25.670   5.400
13.000  21.000  25.850   5.000
15.000  20.500  26.020   5.000
21.000  19.900  26.400   5.000

The # sign indicates comments; the > in the header variable name list indicates that item points to a subfile containing more detailed information.

nm

This method is for a single file with multiple stations.

# Gulf Stream Cruise Stations 3-5
# p<1000

 station = 3    lat = 38.28,    lon = -73.53

  press,   temp,    sal,     o2,  sigth
  5.000, 18.334, 33.570,  5.970, 24.096
 25.000, 12.848, 34.159,  6.990, 25.773
 49.000, 11.070, 34.523,  6.060, 26.394
 99.000, 11.093, 35.090,  5.340, 26.831
149.000, 11.906, 35.487,  5.020, 26.990
199.000, 10.819, 35.435,  4.210, 27.152

station =       4,    lat =   38.19,    lon =  -73.52

  press,   temp,    sal,     o2,  sigth

  5.000, 17.516, 33.160,  5.840, 23.981
 25.000, 12.315, 33.958,  7.090, 25.721
 49.000,  9.612, 34.192,  6.020, 26.387
 99.000, 12.095, 35.402,  5.340, 26.887
149.000, 12.407, 35.625,  5.290, 27.000
199.000, 11.287, 35.487,  4.340, 27.108

station = 5,    lat=38.16,    lon=-73.26

  press,   temp,    sal,     o2,  sigth

  5.000, 18.382, 33.647,  5.770, 24.143
 25.000, 12.040, 34.196,  6.660, 25.959
 49.000, 11.951, 34.925,  5.510, 26.543
 99.000, 11.914, 35.390,  5.100, 26.912
149.000, 12.045, 35.547,  5.070, 27.010
199.000, 11.976, 35.589,  4.940, 27.057

Comment lines begin with #. The lines with an equals sign = contain assignments for variables at level 0 (comma or space separated). The assignments need only be done when the variable changes. The first line without an equals sign contains the names of the level 1 variables (comma or space separated).
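The rules just stated are simple enough to sketch as a small reader. The Python below is an illustrative parser written from that description, not the nm method itself. The function name is hypothetical, values are kept as strings, and skipping a repeated variable-name line is an assumption based on the sample file above, where the name line reappears after each station.

```python
import re

# name = value pairs, separated by commas and/or blanks
ASSIGN = re.compile(r"(\w+)\s*=\s*([^\s,]+)")


def parse_nm(text):
    """Return (comments, level-1 names, flat records) from nm-style text."""
    comments, level0, names, records = [], {}, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            comments.append(line[1:].strip())
        elif "=" in line:
            # Level-0 assignments persist until they are changed.
            level0.update(ASSIGN.findall(line))
        else:
            tokens = [t for t in re.split(r"[,\s]+", line) if t]
            if names is None:
                names = tokens                    # first line without '='
            elif tokens == names:
                continue                          # repeated name line (assumed harmless)
            else:
                records.append({**level0, **dict(zip(names, tokens))})
    return comments, names, records


SAMPLE = """\
# Gulf Stream Cruise Stations 3-5
# p<1000

 station = 3    lat = 38.28,    lon = -73.53

  press,   temp,    sal,     o2,  sigth
  5.000, 18.334, 33.570,  5.970, 24.096
 25.000, 12.848, 34.159,  6.990, 25.773

station =       4,    lat =   38.19,    lon =  -73.52

  press,   temp,    sal,     o2,  sigth

  5.000, 17.516, 33.160,  5.840, 23.981
"""

comments, names, records = parse_nm(SAMPLE)
for rec in records:
    print(rec["station"], rec["press"], rec["temp"])
```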

Communications

There are two parts to the problem of communicating information from the object on one machine to the application on another:

All exchanges between the user's application program (process 1) and the method/ translator (process 2 -- perhaps on another machine) are made via interprocess communications using ``pipes'' or ``sockets'' as defined in Berkeley UNIX. In the case of a locally defined object, a pipe is opened between the application and the method processes. For a remotely defined object, the application opens a socket to the HTTP daemon on the other machine and starts the server. The server effectively connects the standard output stream on the method to the socket in the application. The processes then begin exchanging information according to the JGOFS protocol.
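For the local case, the arrangement amounts to starting the method as a child process and reading its standard output through a pipe. The Python sketch below shows only that plumbing. The stand-in "method" is a hypothetical one-line script that echoes its arguments between a comment marker and an end marker; it is not a real JGOFS method, and the function name is invented for this example.

```python
import subprocess
import sys

# Hypothetical stand-in for a method: writes protocol-like text to stdout.
FAKE_METHOD = (
    'import sys; print("&c"); '
    'print(" args: " + ",".join(sys.argv[1:])); print("&e")'
)


def read_local_object(params):
    """Start the stand-in method with params as argv and read its output via a pipe."""
    proc = subprocess.Popen(
        [sys.executable, "-c", FAKE_METHOD, *params],
        stdout=subprocess.PIPE,
        text=True,
    )
    with proc.stdout:
        lines = [line.rstrip("\n") for line in proc.stdout]
    proc.wait()
    return lines


print(read_local_object(["myfile", "press<100"]))
```

In the remote case the application would read the same stream from a socket instead of a pipe; the reading loop itself need not change.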

Servers and Dictionaries

Servers

Each JGOFS system which is providing data must have the HTTPD process running as a background task. When a request for data comes in, HTTPD starts our *server* process. This process consults the dictionary, starts up the method process and passes it the requested subselections and other parameters.

The method analyzes the request, gets the information from the data files or database, and writes out the results (in the JGOFS protocol). These pass through the communication pathway to the application program. In this sense, the method acts like an input subroutine which the main program calls to get data from files. However, the data can be gathered from across the network.

Dictionaries: .objects files

The server works with two dictionaries, the user's (in the current working directory) and a tree of system dictionaries (set up when the software is built). These translate between a shorthand notation for the object and the detailed description either of where the object is [what machine it's on], or, if it's locally held, what method is used, and what default arguments are to be passed to the method. Thus the user can generally deal with brief names.

So users can specify objects in the following forms:

1. method(parameters)

In this case, the software will use the method named as the translator, passing it the parameters. Methods are stored in the methods subdirectory of the JGOFS software directory. The parameters are passed as command line arguments to the process.

2. datafilename or datafilename(parameters)

In this case, the software assumes the default method, def, is being used.

3. nameindictionary or nameindictionary(parameters)

The name is looked up in a file, .objects, in the present directory and replaced with the information found therein. The parameters are merged. For example, if the local .objects file contains

stuff=nm(myfile)
farstuff=//globec.whoi.edu/test

then a request for stuff(press<100) will translate to nm(myfile,press<100) and then be reinterpreted by the first rule. A request for farstuff(press<100) will be translated to //globec.whoi.edu/test(press<100) and reinterpreted by the fifth rule below.

4. /path/nameindictionary or /path/nameindictionary(parameters)

The name is looked up in a file, .objects, in the JGOFS system directory, following the path given. The ``root'' of the objects tree is the subdirectory objects of the JGOFS software directory. Replacement occurs as above.

5. //machinename/path/nameindictionary or //machinename/path/nameindictionary(parameters)

The path, name, and parameters are transferred to the remote machine which then follows the procedure outlined just above.
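The five forms above can be read as a small rewriting procedure. The Python sketch below follows that reading. It is an approximation under stated assumptions: the function names are hypothetical, the system dictionary is flattened into a single mapping keyed by path, and the order in which a bare name is tested (local dictionary first, then known methods, then a data file for def) is a guess, since the text does not spell out the precedence.

```python
import re

# name, optionally followed by (parameters); the parameters may contain parentheses
SPEC = re.compile(r"^([^()]+)(?:\((.*)\))?$")


def split_spec(spec):
    """Split 'name(params)' into (name, params); params is '' when absent."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"cannot parse object spec: {spec!r}")
    return match.group(1), match.group(2) or ""


def merge(entry, extra):
    """Append the caller's parameters to a dictionary entry's own parameters."""
    name, params = split_spec(entry)
    merged = ",".join(p for p in (params, extra) if p)
    return f"{name}({merged})" if merged else name


def resolve(spec, local, system, methods):
    """Rewrite an object spec until it names a local method or a remote machine."""
    name, params = split_spec(spec)
    if name.startswith("//"):                       # form 5: pass on to the remote machine
        machine, _, path = name[2:].partition("/")
        return ("remote", machine, merge("/" + path, params))
    if name.startswith("/"):                        # form 4: system dictionary
        return resolve(merge(system[name], params), local, system, methods)
    if name in local:                               # form 3: local .objects file
        return resolve(merge(local[name], params), local, system, methods)
    if name in methods:                             # form 1: explicit method
        return ("local", name, params)
    # form 2: treat the name as a data file for the default method
    return ("local", "def", ",".join(p for p in (name, params) if p))


LOCAL = {"stuff": "nm(myfile)", "farstuff": "//globec.whoi.edu/test"}
SYSTEM = {"/test": "def(/usr/users/jgofs/data/t0)"}
METHODS = {"def", "nm"}

print(resolve("stuff(press<100)", LOCAL, SYSTEM, METHODS))
print(resolve("farstuff(press<100)", LOCAL, SYSTEM, METHODS))
```

The two calls reproduce the translations described for stuff and farstuff in the third form.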

Dictionaries have two types of entries:

Local entries
These map the name to a method on this machine and (usually) some required parameters, e.g.,
bot=jgbl2(/d5/glenn/bloom/bot)

Remote entries
Usually, these just map a name on this machine to a name on the other machine. Thus if a data object on the remote machine is moved or replaced, only the dictionary on that machine needs to be updated. This also shields remote users from needing any details about the remote filesystem, methods, or data locations. An entry of this type looks like
bot=//puddle.mit.edu/jgofs/bloom/bot

Dictionaries: .remoteobjects files

In addition, the system supports a set of dictionaries which tell the outside world what objects are available on this machine. Other information about the object is also provided, usually with links to an HTML page giving a textual description of the information in the object, the variables, etc. Such a file looks like

tco2=//puddle.mit.edu/jgofs/bloom/tco2
- P.Brewer
- Total carbon dioxide
optics=//dataone.whoi.edu/jgofs/bloom/optics
- C.Davis
- Bio Optical Profiler Data
poc=//puddle.mit.edu/jgofs/bloom/poc
- H.Ducklow
- Particulate C, N
stuff=//puddle.mit.edu/test
- 
- http://puddle.mit.edu/notready.html This will contain good stuff
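A file in this layout can be read by grouping each name=location line with the dash-prefixed lines that follow it. The Python sketch below does only that grouping. The function name and the generic "notes" field are assumptions of this example; the report does not define what each dash line means, so the sketch does not try to interpret them.

```python
def parse_remoteobjects(text):
    """Group each 'name=target' line with the '-' note lines that follow it."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("-"):
            if entries:                       # a note with no preceding entry is ignored
                entries[-1]["notes"].append(line[1:].strip())
        elif "=" in line:
            name, _, target = line.partition("=")
            entries.append({"name": name, "target": target, "notes": []})
    return entries


SAMPLE_REMOTE = """\
tco2=//puddle.mit.edu/jgofs/bloom/tco2
- P.Brewer
- Total carbon dioxide
stuff=//puddle.mit.edu/test
-
- http://puddle.mit.edu/notready.html This will contain good stuff
"""

for entry in parse_remoteobjects(SAMPLE_REMOTE):
    print(entry["name"], "->", entry["target"], entry["notes"])
```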

Protocol for Communication

Methods provide three different kinds of data stream. You can view all of these from browsers:

The rest of this document concentrates on the last case.

Example

We illustrate the communication protocol with a simple example. Consider a data object which, when listed with the command below, looks like

list "test(station<=5&press<100,station,lat,lon,press,o2)"

#  wunsch stations 3-10 
#  p<1000 
======================= 
station,    lat,    lon
........................ 
      3,  38.28, -73.53
======================= 
  press,     o2
------------------------ 
  5.000,  5.970
 25.000,  6.990
 49.000,  6.060
 99.000,  5.340
======================= 
station,    lat,    lon
........................ 
      4,  38.19, -73.52
======================= 
  press,     o2
------------------------ 
  5.000,  5.840
 25.000,  7.090
 49.000,  6.020
 99.000,  5.340
======================= 
station,    lat,    lon
........................ 
      5,  38.16, -73.26
======================= 
  press,     o2
------------------------ 
  5.000,  5.770
 25.000,  6.660
 49.000,  5.510
 99.000,  5.100
======================= 

The dictionary entry is assumed to be

test=def(/usr/users/jgofs/data/t0)

The communications look like:

list -> method (def)
argv = [/usr/users/jgofs/data/t0,station<=5&press<100,station,lat,lon,press,o2]
def -> list
&c***********************
 wunsch stations 3-10
 p<1000
&v0======================

&v1======================
station lat     lon
&v2======================
press   o2
&r=======================
&c***********************
 wunsch stations 3-5
 p<1000
&d0----------------------

&d1----------------------
3       38.28   -73.53
&d2----------------------
5.000   5.970
25.000  6.990
49.000  6.060
99.000  5.340
&d1----------------------
4       38.19   -73.52
&d2----------------------
5.000   5.840
25.000  7.090
49.000  6.020
99.000  5.340
&d1----------------------
5       38.16   -73.26
&d2----------------------
5.000   5.770
25.000  6.660
49.000  5.510
99.000  5.100
&e**** End of object ****

Thus the application begins by sending the parameters to the method and then reading the blocks of data. The blocks are indicated by commands with an & in the first position. There are four types of protocol blocks: comments, variable names, data, and end.

Protocol blocks

Comments

The &c introduces the plain text comments section. Comments consist of lines of no more than 80 characters.

Variables

This section gives the names, dimensions, and attributes of variables at each hierarchical level. The outermost level, 0, is defined first and then we work our way inward. The signal is &vn with n=0...9 the level indicator. Each variable definition has:

  1. The name (avoid embedded blanks; use _ instead)
  2. Attribute list appended to the variable name surrounded by []. The attribute list is a comma-separated set of strings, usually (except for units) of the form attribute=value.

The variables section is closed by a record marker, &r. Variable fields are tab-separated.

Data

The data is likewise presented in a hierarchical fashion. The &dn markers introduce the data from the n'th level. Note that the innermost level can drop the &dn. The data from the outermost level is sent, followed by the next level, up to the innermost level. The innermost level repeats until the next level up changes or the data ends. Data fields are tab-separated.

End

This indicates the end of the data object. The indicator is &e.

Errors

Errors are indicated by the method returning &x [descriptive string] and exiting.
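The block types described above suggest a simple line-oriented reader. The Python sketch below is one way an application might consume such a stream and rebuild flat records; it is written from the description in this section and is not the system's own client code. The function name is hypothetical, attribute lists in [] are discarded, values are kept as strings, and the case where the innermost level omits its &dn marker is not handled.

```python
def parse_stream(lines):
    """Return (comments, names-by-level, flat records) from a protocol stream."""
    comments, names, current, records = [], {}, {}, []
    mode, level = None, None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("&"):
            tag = line[1:2]
            if tag == "x":                                # error block: report and stop
                raise RuntimeError(line[2:].strip() or "method reported an error")
            if tag == "e":                                # end of object
                break
            mode = tag
            if tag in ("v", "d"):
                level = int(line[2:3])                    # single-digit level, 0...9
                if tag == "v":
                    names.setdefault(level, [])
            continue
        if not line.strip():
            continue                                      # empty level or spacer line
        if mode == "c":
            comments.append(line.strip())
        elif mode == "v":
            # keep the name, drop any [attribute,...] list
            names[level].extend(f.split("[", 1)[0].strip() for f in line.split("\t"))
        elif mode == "d":
            current[level] = dict(zip(names[level], line.split("\t")))
            if level == max(names):                       # innermost row: emit a record
                record = {}
                for lv in sorted(current):
                    record.update(current[lv])
                records.append(record)
    return comments, names, records


STREAM = "\n".join([
    "&c***********************",
    " wunsch stations 3-10",
    " p<1000",
    "&v0======================",
    "",
    "&v1======================",
    "station\tlat\tlon",
    "&v2======================",
    "press\to2",
    "&r=======================",
    "&d0----------------------",
    "",
    "&d1----------------------",
    "3\t38.28\t-73.53",
    "&d2----------------------",
    "5.000\t5.970",
    "25.000\t6.990",
    "&d1----------------------",
    "4\t38.19\t-73.52",
    "&d2----------------------",
    "5.000\t5.840",
    "&e**** End of object ****",
])

comments, names, records = parse_stream(STREAM.splitlines())
for rec in records:
    print(rec)
```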

User Applications

There are many different ways of using the JGOFS data system:

  1. Browsers: World-Wide Web browsers can be used to list and (for those which support imaging) plot data.
  2. For local listing and plotting data, one can use a user interface to construct calls to the data system, acquire the data, and do certain operations on it. The user interface is not a core part of the system, but is built using the programming interface; thus we can and do have several different interfaces available:
  3. Data can be imported into commercial packages, in some cases quite directly. Thus we have a MATLAB function, loadjg, which can read data directly into matrices from the data system (including all the data manipulation operations). See here.
  4. We provide a simple set of subroutine calls by which C and Fortran programs can read data.

User Interfaces

Command Line

After setting one's path appropriately to include the JGOFS binaries, one can type commands which directly invoke programs that communicate with the data system. Such commands can also be incorporated in a shell script (perhaps with arguments) for repetitive operations. The basic commands are

Listing Data Objects

readdct
Presents the names of the data objects with a brief description.

Listing an Object

listvar "object"
Lists the variables in the object, with indenting indicating the hierarchical structure.

list "object"
Lists the data.

In addition, there are options for other kinds of listings:
list [-n] [-s] [-t] [-f] [-b] [-c] [-z] object [outfile]

Plotting

p "object" [-r] [-l] xvar [-r] [-l] yvar [-sym ] [+sym ] [-siz ] [-nobrk] [-brk breakvar]
X-Y plot (autoscaled)

view xv0 yv0 xv1 yv1
Sets position of lower left and upper right corners of plotting area on the page (0 to 100)

window x0 y0 x1 y1
Sets user units for lower left and upper right corner

axis [-l] x/y tic label num format
Draws axis

pl "object" [-l] xvar [-l] yvar [-sym ] [+sym ] [-siz ] [-nobrk] [-brk breakvar]
Plots lines, symbols, or connected symbols

dash "object" [-l] xvar [-l] yvar dashpattern
Plots with dashed lines

text "text" x/y/-x/-y size xloc yloc [-u]
Write text on graphics screen

ers
Erase page

ch filename var=value [var=value ...]
Change a file with lines giving values to variables; cf. plot.var

Example

As an example, consider the following shell script twoplot:

#!/bin/csh -f
ers
# left panel: temperature against pressure
view 15 20 45 80
window 0 500 20 0
axis x 5 Temperature 10 dd.d
axis y 100 Pressure 500 ddd
pl "/test(station=$1)" temp press
# right panel: salinity against pressure, dashed
view 65 20 95 80
window 33 500 36 0
axis x 0.5 Salinity 1 dd.d
axis y 100 Pressure 500 ddd
dash "/test(station=$1)" sal press ----....----....

The command

twoplot 6

produces

Plot pressure vs. temperature

Menu Interfaces

The menu interfaces provide a means for organizing the commands, showing information in several windows so that one can keep more than one result on the screen, and keeping information such as the object name from command to command. The X version provides a "point and click" kind of access, but can, in fact, also execute typed commands.

Browser Interface

The browser interface can be used from UNIX-X, MS-DOS-Windows, and Mac machines. You can try it here. In addition, it is possible to install software on UNIX machines which allows you to use the local server and obtain full functionality through browsers.

Application Program Interface

From a C program, you can call the following functions to obtain information from the data system:

maxlev = jdbopen_(&unit,obj,names,&namesize,&num)
Opens a data object. The variable char obj[1024] is a long string containing the object name, including parameters if desired for selections, etc. The array of char names[][namesize] contains num names for the variables to be returned (if num > 0) or space for |num| names if num < 0. In the latter case, the number of variables found is also returned in num. The result of the call is the maximum hierarchical level of the dataset. Negative values are error returns.

lev = jdblevel_(&unit,&varnum)
Return the level in the hierarchy (0=outermost, 1=next,...) of the variable number varnum.

lev = jdbread_(&unit,values)
Read the next realization of the data from the object. The subroutine fills in num values in the array of floats.

lev = jdbreada_(&unit,values,&valuesize)
Same as above, but the values are read into strings.

ok = jdbcomments_(&unit,outcom)
Return the next comment in the string outcom. The returned value is 0 if there are no more comments.

ok = jdbattributes_(&unit,&id,outcom)
Return the next attribute of variable number id in the string outcom. The returned value is 0 if there are no more attributes.

jdbclose_(&unit)
Close the object.

In Fortran, the calls are

maxlev = jdbopen(unit,obj,names,namesize,num)
lev = jdblevel(unit,varnum)
lev = jdbread(unit,values)
lev = jdbreada(unit,values,valuesize)
ok = jdbcomments(unit,outcom)
call jdbclose(unit)

Manipulating Data

The system has two built-in operations, functions which each data object is expected to carry out:

Projection (subsetting by variable name)
Selection (subsetting by variable values)

But, although these are the most common operations, they do not in and of themselves satisfy all the requirements for a data system. One significant advantage of an object-based data system is that new operations can be added at any time. One simply builds a "method" which takes as its input the information supplied by one (or more) objects rather than data files, transforms the information in some way, and passes it on to the user application.

Constructed objects

We call the combination of the new method and the sub-objects a "constructed object." You can also think of these as similar to UNIX filters. For example, we can add a column to the /test (hydrographic data) which gives a linearized estimate of density:

rho=28.5-0.2 T +0.7 (S-35)

by using the ``math'' constructed object which takes as parameters an input object name and formulae for changing/ adding columns. The new object name is

math(/test,rho=28.5-0.2*temp+0.7*(sal-35))

and this can be used by the lister/plotter/... in exactly the same way as any other object --- see figure.

Plot rho vs sigth
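The effect of the math object in this example can be imitated with a small filter that reads records and passes them on with one extra column. The Python sketch below hard-codes the linearized density formula given above; the function name is hypothetical, and the two input rows are taken from the station 3 sample data earlier in this report.

```python
def add_rho(records):
    """Yield each record with an added linearized density column, rho."""
    for rec in records:
        out = dict(rec)
        out["rho"] = 28.5 - 0.2 * rec["temp"] + 0.7 * (rec["sal"] - 35.0)
        yield out


# Two rows from station 3 of the sample hydrographic data.
ROWS = [
    {"press": 5.0, "temp": 18.334, "sal": 33.570, "sigth": 24.096},
    {"press": 199.0, "temp": 10.819, "sal": 35.435, "sigth": 27.152},
]

for row in add_rho(ROWS):
    print(f'{row["press"]:7.1f}  rho={row["rho"]:.4f}  sigth={row["sigth"]:.3f}')
```

Because the filter consumes and produces the same kind of record stream, its output could feed a lister, a plotter, or another filter, which is the sense in which constructed objects resemble UNIX filters.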

As another example, there is a plot from two data objects joined together by common station, cast, and pressure.

join(/tco2,/poc(station,cast,press,poc))

Plot join
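A join of this kind matches rows from two objects on their shared key variables. The Python sketch below shows the idea on flat records. It assumes inner-join behaviour (rows without a partner are dropped), which the report does not specify, and the function name and the tco2 and poc values are invented purely for illustration.

```python
def join(left, right, keys):
    """Yield merged records for every pair of rows that agree on all keys."""
    index = {}
    for rec in right:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    for rec in left:
        for match in index.get(tuple(rec[k] for k in keys), []):
            yield {**rec, **match}


# Made-up values, used only to show the matching.
TCO2 = [
    {"station": 1, "cast": 1, "press": 10.0, "tco2": 2050.0},
    {"station": 1, "cast": 1, "press": 50.0, "tco2": 2080.0},
    {"station": 2, "cast": 1, "press": 10.0, "tco2": 2045.0},
]
POC = [
    {"station": 1, "cast": 1, "press": 10.0, "poc": 8.1},
    {"station": 2, "cast": 1, "press": 10.0, "poc": 7.4},
]

for row in join(TCO2, POC, ["station", "cast", "press"]):
    print(row)
```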


Becoming a Data Server

Serving data in the JGOFS software distributed database system requires a multi-processing computer system, connected to the network, with the required software installed.

You can set this software up yourself using a procedure accessed via your Web browser. For the most recent U.S. GLOBEC version, which is compatible with the original version [1], you can ftp the tar file from ftp://globec.whoi.edu/pub/software/JGOFS/. For information about this latest US GLOBEC release of the JGOFS software click here.

Documentation about the JGOFS system is available on-line and is called the JGOFS Data System Overview.

For more information and help installing the software contact the Data Management Administrator.

For detailed information about how data are added to the US GLOBEC data management system click here.

[1] For the MIT V1.5 version of the JGOFS software, open the URL http://lake.mit.edu/~glenn/jg/install-1-5.html and follow the directions in the form.


Last modified: August 29, 2005

JGOFS Reference Documents

These documents will be returned as PostScript files. If you wish to print them, you should use the save-to-disk option (under the Options menu) before clicking on them.