doc.tex

\section{Purpose and limitations}
The purpose of this document is to serve as an unambigious single resource for reference by administrators of IS-ENES2 ESGF datanodes, to configure their datanodes and publish data in compliance with regulations discussed and adopted by all datanode managers. This document aggregates information from sources such as the Trieste meeting notes \cite{trieste}, Martin Juckes' `CORDEX: ESGF Search Facet Mappings' document \cite{cordexfacetsdoc} and other discussions which have led to collective consensus. This document only contains information from the perspective of publishing/maintaining data on the ESGF datanode and may not be refered to for any other purpose.

\section{Latest Version}
The latest version of this document will always be available at:\\
\url{https://github.com/snic-nsc/datanode-mgr-doc/raw/master/ro/Datanodemgr-doc.pdf} \\
The entire repository, which includes the \LaTeX{} source file can be cloned from:\\
\url{https://github.com/snic-nsc/datanode-mgr-doc.git}


\section{IS-ENES2 ESGF datanode Search Facet Configuration}
IS-ENES2 ESGF datanodes have some additional search facets pertaining to CORDEX. Here below are the entire list of facets used, on an IS-ENES2 ESGF datanode. This file is available in the `\textbf{configfiles}' directory in the \href{https://github.com/snic-nsc/datanode-mgr-doc.git}{repo}. \\
Filename: \texttt{facets.properties}\\
Standard location: \texttt{/esg/config/facets.properties}
\begin{footnotesize}
\begin{verbatimtabinput}[4]{configfiles/facets.properties}
\end{verbatimtabinput}
\end{footnotesize}
\newpage
\section{ESGF Attribute Services}
\label{attribservicesfile}
File name: \texttt{/esg/config/esgf\_ats\_static.xml}\\
For information about how to setup your datanode to correctly enforce restrictions on CORDEX data usage, refer to Section~\ref{enforcegrouprestrictions}. This file is available in the `\textbf{configfiles}' directory in the \href{https://github.com/snic-nsc/datanode-mgr-doc.git}{repo}.

\begin{tiny}
\begin{verbatimtabinput}[4]{configfiles/esgf_ats_static.xml}
\end{verbatimtabinput}
\end{tiny}
\section{ESGF IDP Whitelist settings}
File name: \texttt{/esg/config/esgf\_idp\_static.xml}. This file is available in the `\textbf{configfiles}' directory in the \href{https://github.com/snic-nsc/datanode-mgr-doc.git}{repo}.
\begin{tiny}
\begin{verbatimtabinput}[4]{configfiles/esgf_idp_static.xml}
\end{verbatimtabinput}
\end{tiny}
\section{ESGF Search Shard configuration settings}
File name: \texttt{/esg/config/esgf\_shards\_static.xml}. This file is available in the `\textbf{configfiles}' directory in the \href{https://github.com/snic-nsc/datanode-mgr-doc.git}{repo}.

\begin{tiny}
\begin{verbatimtabinput}[4]{configfiles/esgf_shards_static.xml}
\end{verbatimtabinput}
\end{tiny}

\section{Publication Version}
It was decided at the Trieste meet that all data published on IS-ENES2 datanodes will clearly specify the version number which is the date of the publication, expressed in the format \textbf{v}\textit{yyyymmdd}. This requires the creation of directory with that name, in the physical directory structure. This directory has to be created after the `Variable name' directory. Examples:\\
\begin{tiny}
\texttt{/datapool1/cordexgeneral/cordex/output/MNA-22/SMHI/ECMWF-ERAINT/evaluation/r0i0p0/SMHI-RCA4/v1/fx/orog/\yellowline{v20131101}}\\
\texttt{/datapool1/cordexgeneral/cordex/output/ARC-44/SMHI/NCC-NorESM1-M/historical/r0i0p0/SMHI-RCA4/v1/fx/sftlf/\yellowline{v20140123}}\\
\end{tiny}
\\To get this version number correctly, the procedure is to append a \texttt{\myopt new-version $<$versionnum$>$} to the \texttt{esgpublish} command.

\section{Directory Structure}
The path to the directory tree containing the data shall have \texttt{Project/Product} followed by the directory tree containing the data. \\
Given below are examples of valid and invalid directory structures.\\
\vspace{0mm}\\
\texttt{/cordex/output/...} \cmark\\
\texttt{/localfs/localpath/cordex/output/...} \cmark \footnote{Some sites use the lower-case `cordex' while some use `CORDEX'; While there is no rule, the lower-case `cordex' may be considered as the prefered option.}\footnote{`output' is the value of the `Product' facet option here. It may take other values that are applicable to the `Product' facet in the future.} \\
\texttt{/corddata/output/...} \xmark{ } //non-standard name corresponding to `Project'.  \\ 
\texttt{/cordex/AFR-44/...} \xmark{ } //there is no directory corresponding to `Product'. Here is a complete \texttt{directory\_format} line, for reference:
\begin{verbatim}
directory_format = %(root)s/cordex/%(product)s/%(domain)s/%(institute)s/\
%(driving_model)s/%(experiment)s/%(ensemble)s/%(rcm_models/%(rcm_version)s/\
%(time_frequency)s/%(variable)s/v%(version)s
\end{verbatim}

\section{Variables to be excluded during publish: CORDEX}
\label{skipvars}

The following declaration inside \texttt{/esg/esgcet/esg.ini} should be used to exclude certain variables from the THREDDS catalogues generated by \texttt{esgpublish}.  Note that this differs from the default value created by previous versions of \texttt{esgsetup}; \yellowline{in particular managers should ensure that the variable \texttt{basin} is NOT excluded.}

\begin{verbatimtab}[4]
thredds_exclude_variables = a,a_bnds,alev1,alevel,alevhalf,alt40,b,b_bnds,bnds,\
bounds_lat,bounds_lon,dbze,depth,depth0m,depth100m,depth_bnds,geo_region,height,\
height10m,height2m,heightv,Lambert_Conformal,lat,lat_bnds,lat_bounds,\
lat_vertices,latitude,latitude_bnds,layer,lev,lev_bnds,location,lon,lon_bnds,\
lon_bounds,lon_vertices,longitude,longitude_bnds,olayer100m,olevel,oline,p0,\
p220,p500,p560,p700,p840,plev,plev3,plev7,plev8,plev_bnds,plevs,pressure1,region,\
rho,rlat,rlat_bnds,rlon,rlon_bnds,rotated_pole,Rotated_Pole,scatratio,sdepth,\
sdepth1,sza5,tau,tau_bnds,time,time1,time2,time_bnds,vegtype,x,y
\end{verbatimtab}

\section{Checking for variables that need to be skipped}
We saw in Section~\ref{skipvars} the compiled list of variables that need to be present in the \\
\texttt{thredds\_exclude\_variables} list, prior to data publication. However, there might well be variables in your data which need to be similarly excluded, but are not part of this list yet. This has caused problems in the past.  Here's a simple script that can be used to inspect the data prior to publication.  It reports variables which \textbf{may} need to be added to the exclude list. It also logs potential problems. The script uses the \texttt{ncdump} utility which is shipped with \textbf{uvcdat/cdat} on ESGF datanodes. This script is available in the `\textbf{scripts}' directory in the \href{https://github.com/snic-nsc/datanode-mgr-doc.git}{repo}.
\begin{tiny} 
\begin{verbatimtabinput}[4]{scripts/exclvarcheck.sh}
\end{verbatimtabinput}
\end{tiny}
\textbf{How to use}
\begin{enumerate}
\item Ensure that you set the correct path to the variable `ncdumplocation'.
\item Choose a location for `scripthome'. Copy the \texttt{exclvarcheck.sh} and \texttt{excl\_cordex}\footnote{available in the `\textbf{scripts}' directory in the \href{https://github.com/snic-nsc/datanode-mgr-doc.git}{repo}} files to that location. 
\item If you have run the script before, remember to clear the files in \texttt{/tmp/ncchecks}
\item To run, simply \textbf{cd} to the directory where your datafiles reside and run:
\begin{verbatim}
find . -name '*.nc' -exec bash <script home>/exclvarcheck.sh {} \;
\end{verbatim}
\item When it completes, inspect the file \texttt{/tmp/ncchecks/exclvars} to see if you might need to add any variables.
\item The file \texttt{/tmp/ncchecks/suspects} lists the files which produced the additions. 
\item The file \texttt{/tmp/ncchecks/notfound} will list cases, if any, where a variable after which the file is named, isn't present in the file itself!
\item If you indeed find variables that need to be added to the list of excluded variables, please let me know 
\end{enumerate}
\section{Value for the RCMModelName facet}
It was decided that the value of the `rcm\_name' facet should NOT contain the institute information, as this information is already captured and presented by the `Institute' facet.  However, the directory corresponding to the `rcm\_model' contains the name of the institute too, along with the model name, as stipulated by the CORDEX archive specifications \footnote{``RCMModelName is an alphanumeric identifier chosen by the modeling group; it should consist of an institute acronym and a model acronym, connected by a dash, e.g., DMI-HIRHAM5 or SMHI-RCA3.''\cite{cordexarchivespecs}}. This results in the requirement for some special handling.
\mypar
The easiest way to handle this is by creating a substitution map for the variable.
\begin{enumerate}
\item Under the options for \texttt{[project:cordex]}, find the configuration line that says \texttt{maps}
\item Add the \texttt{rcm\_name} categorie:\\
\texttt{categories =\\ rcm\_name | string | false | true  | 7}
\item Edit the line to say the following:\\
\texttt{maps = rcm\_name\_map,institute\_map, las\_time\_delta\_map, domain\_map}
\item Create a new map `rcm\_name\_map' and populate it with entries that correspond to  the models that you handle, leaving out the institute part in the last field.
\item Look at the example below for reference:
\begin{verbatimtab}[4]
rcm_name_map = map(project,rcm_model : rcm_name)
    cordex |SMHI-RCA4| RCA4
    cordex |SMHI-RCA4-SN| RCA4-SN
\end{verbatimtab}
\item Use the regex `rcm\_name' in the place of the directory corresponding to the model directory, in the \texttt{dataset\_id} string.
\vspace{-6mm}\\
\begin{verbatimtab}[4]
dataset_id = cordex.%(product)s.%(domain)s.%(institute)s.%(driving_model)s.\
%(experiment)s.%(ensemble)s.%(rcm_name)s.%(rcm_version)s.%(time_frequency)s.\
%(variable)s
\end{verbatimtab}
\end{enumerate}
\subsection{Complete \texttt{rcm\_name\_map}}
The current and comprehensive list of CORDEX models may be obtained from:\\
\url{http://cordex.dmi.dk/joomla/images/CORDEX/RCMModelName.txt}. \\
Given below is a script that can generate a complete \texttt{rcm\_name\_map} table, that could then be pasted into the ini file. This script is available in the `\textbf{scripts}' directory in the \href{https://github.com/snic-nsc/datanode-mgr-doc.git}{repo}. Also produced here is the full output of the script. 
\begin{tiny}
\begin{verbatimtabinput}[4]{scripts/makemodelmap.sh}
\end{verbatimtabinput}
\vspace{-8mm}
\begin{verbatimtab}[4]
rcm_name_map = map(project,rcm_model : rcm_name)
        cordex| AUTH-LHTEE-WRF321B| WRF321B
        cordex| AUTH-Met-WRF331A| WRF331A
        cordex| AWI-HIRHAM5| HIRHAM5
        cordex| BCCR-WRF331C| WRF331C
        cordex| CCCma-CanRCM4| CanRCM4
        cordex| CHMI-ALADIN52| ALADIN52
        cordex| CLMcom-CCLM4-8-17| CCLM4-8-17
        cordex| CNRM-ALADIN52| ALADIN52
        cordex| CNRM-ARPEGE52| ARPEGE52
        cordex| CRP-GL-WRF331A| WRF331A
        cordex| CUNI-RegCM4-2| RegCM4-2
        cordex| DHMZ-RegCM4-2| RegCM4-2
        cordex| DMI-HIRHAM5| HIRHAM5
        cordex| ENEA-RegCM4-3| RegCM4-3
        cordex| HMS-ALADIN52| ALADIN52
        cordex| ICTP-RegCM4-3| RegCM4-3
        cordex| IDL-WRF331D| WRF331D
        cordex| IPSL-INERIS-WRF331F| WRF331F
        cordex| KNMI-RACMO21P| RACMO21P
        cordex| KNMI-RACMO22T| RACMO22T
        cordex| MIUB-WRF331A| WRF331A
        cordex| MOHC-HadGEM3-RA| HadGEM3-RA
        cordex| MOHC-HadRM3P| HadRM3P
        cordex| MPI-CSC-REMO2009| REMO2009
        cordex| NUIM-WRF331F| WRF331F
        cordex| SMHI-RCA4| RCA4
        cordex| SMHI-RCA4-SN| RCA4-SN
        cordex| SMHI-RCAO| RCAO
        cordex| SMHI-RCAO-SN| RCAO-SN
        cordex| UCAN-WRF331G| WRF331G
        cordex| UCAN-WRF350I| WRF350I
        cordex| UCLM-PROMES| PROMES
        cordex| UHOH-WRF331H| WRF331H
        cordex| UQAM-CRCM5| CRCM5
\end{verbatimtab}
\end{tiny}

\section{\texttt{esgcet\_models\_table.txt}}
Apart from the rcm\_name map, another map that lists models and their parent organizations is the \texttt{/esg/config/esgcet\_models\_table.txt}. After making any changes to it, one needs to execute \texttt{esginitialize -c}, to update it, and if that doesn't work, you may need to `downgrade' the database by excuting \texttt{esginitialize \myopt d0} and then executing \texttt{esginitialize -c}.
\begin{tiny}
\begin{verbatim}
  test | test | http://www-pcmdi.llnl.gov | Test
  test | ncar_ccsm3_0 | http://www.ccsm.ucar.edu| NCAR Community Climate System Model, CCSM 3.0
  cordex | RCA4 | SMHI | www.smhi.se
  cordex | RCA4-SN | SMHI | www.smhi.se
  cordex | RCAO | SMHI | www.smhi.se
  cordex | RCAO-SN | SMHI | www.smhi.se
\end{verbatim}
\end{tiny}

\section{Displaying the project name in upper case}
Though the project name is always expressed in lower case in catalogs and metadata, it is displayed in the upper-case in the web frontend. This requires setting a simple substitution string. Simply add the name of the project, first in lower case and then in upper case, separated by a colon. The file into which this string goes in is:
\vspace{-4mm}
\begin{small}
\begin{verbatim}
/usr/local/tomcat/webapps/esg-search/WEB-INF/classes/esg/search/config/projects.properties
cmip5:CMIP5
obs4mips:obs4MIPs
cssef:CSSEF
tamip:TAMIP
lucid:LUCID
test:TEST
pmip3:PMIP3
geomip:GeoMIP
euclipse:EUCLIPSE
cordex:CORDEX
\end{verbatim}
\end{small}

\section{Enforcing group restrictions on CORDEX data}
\label{enforcegrouprestrictions}
CORDEX data published on the ESGF datanodes in the federation are made available only to those who apply for membership to one of the two groups associated with CORDEX data. These groups, apart from restricting who can access these datasets can also serve as a mechanism to specify additional terms of data access.  The \texttt{CORDEX\_RESEARCH} group is for individuals who wish to download and use the data only for non-commercial purposes whereas \texttt{CORDEX\_COMMERCIAL} is for those individuals who may wish to use the data for commercial purposes. CORDEX data which is open for unrestricted use is made available to both groups whereas data which is meant to be only used for non-commercial use is only made accessible to members of the \texttt{CORDEX\_RESEARCH} group. \greenline{Unless otherwise specified by the data-provider, all CORDEX datasets should be accessible by members of both \texttt{CORDEX\_RESEARCH} and \texttt{CORDEX\_COMMERCIAL} groups.} Attribute management for these CORDEX groups is managed on the \texttt{esg-dn1.nsc.liu.se} datanode and for configuring your datanode to use this attribute service, refer to Section~\ref{attribservicesfile}.
\subsection{Ensure presence of license files}
If you are running the latest version of the middleware (1.6.x), you may skip to Section~\ref{segregatedata}. If you are running an older release, check whether the following files are present, on your datanode:
\begin{enumerate}
\item \$tomcatdir/webapps/esg-orp/licenses/CordexResearchLicense.xml
\item \$tomcatdir/webapps/esg-orp/licenses/CordexCommercialLicense.xml
\end{enumerate}
If the above listed files are NOT present:
\begin{enumerate}
\item git clone the repository containing this document, along with the license files from \url{https://github.com/snic-nsc/datanode-mgr-doc.git} 
\item Copy the license files present in the \texttt{cordexlicensefiles} directory over to their respective target locations on the datanode(specified in file `filelocations', also in the same directory). 
\item Ensure that you replace the default `registration-request.jsp' file with the one present in the \texttt{cordexlicensefiles} directory, as this file activates the usage of the CORDEX license files.
\item Restart \texttt{esg-node}
\end{enumerate}
\subsection{Segregating data}
\label{segregatedata}
The ESGF attribute service can be used to restrict access to data by creating different policies for different file paths. This means that data with different levels of access restrictions ought to be in distinct directory heirarchies. This needs some conscious planning by datanode managers, preferably prior to data publication, as it may be inconvenient to move data directories later. Planning is required to setup unambigious and intuitive directory trees which will then have different restriction policies applied on them. For the purpose of reducing publication time confusions and or possibility of errors, it is strongly recommended to set up entirely seperate directory trees, rather than having a mix of the two types under the same tree, that is, under distinct  \texttt{thredds\_dataset\_roots}.
\subsection{Caveat}
Unlike most commercial scenarios where a paying or `commercial' customer gets additional features/privileges, in the CORDEX sense, a commercial user is one who has fewer datasets he/she can possibly access; This is because datasets which are meant for non-commercial access would not be available for these users.  What this means is that \yellowline{naming a top-level directory/dataset\_root as \texttt{Commercial} or similar, would be counter-intuitive as it would be available for all users.}  It is however beneficial to create a directory/dataset\_root called \texttt{Non-Commercial}, as this would clearly indicate that it's only for non-commercial use, that is, it's only available for users belonging to the \texttt{CORDEX\_RESEARCH} group. 
\subsection{Paths and regexes}
The ESGF attribute service sees paths as presented to it by thredds. You can use that to design the regex that you need. \yellowline{Ensure that you don't design a regex which gets triggered by unintended elements in the path, including the hostname of the node itself!} While configuring the attribute service on the DMI datanode, the hostname of the node, \texttt{cordexesg.dmi.dk} was triggering the regex match for the expression \texttt{.*cordex.*} causing every url to match!!
\subsection{Setting up the \texttt{esgf\_policies\_local.xml}}
Let's consider the following configuration lines:
\begin{tiny}
\begin{verbatimtab}[4]
<policy resource=".*fileServer.*cordexnoncommercial.*" attribute_type="CORDEX_Research" attribute_value="user" action="Read"/>
<policy resource=".*fileServer.*cordexgeneral.*" attribute_type="CORDEX_Research" attribute_value="user" action="Read"/>
<policy resource=".*fileServer.*cordexgeneral.*" attribute_type="CORDEX_Commercial" attribute_value="user" action="Read"/>
<policy resource=".*fileServer.*cord.*" attribute_type="wheel" attribute_value="super" action="Write"/>
\end{verbatimtab}
\end{tiny}
These lines indicate that thredds urls containing the element \texttt{cordexnoncommercial} are only accessible to members of \texttt{CORDEX\_RESEARCH} group whereas urls containing \texttt{cordexgeneral} are accessible by all CORDEX data users. We can also see that \texttt{Write} or \texttt{Publish} access is only provided to users of group \texttt{wheel} with attribute \texttt{super}. This would allow the special user account \texttt{rootAdmin} to be used for all publication activities.
\subsection{Corresponding \texttt{thredds\_dataset\_roots} entries}
The \texttt{thredds\_dataset\_roots} entries can be set up in many ways.  Let's consider two cases.
\begin{enumerate}
\item Both non-commercial and general data being under the same dataset\_root:
\begin{verbatim}
thredds_dataset_roots =
	esg_dataroot1| /data
\end{verbatim}
Here, the non-commercial data would be placed under \texttt{/data/cordexnoncommercial} whereas the general data would be under \texttt{/data/cordexgeneral}.
\item Non-commercial and general data being under different dataset\_roots:
\begin{verbatim}
thredds_dataset_roots =
	esg_cordexnoncommercial| /dir1/cordex
	esg_cordexgeneral| /dir2/cordex
\end{verbatim}
\end{enumerate}
\redline{Caution!!} The part of the path specified as the \texttt{thredds\_dataset\_root} would be subsituted by the name associated with the dataset\_root in the thredds filename. This means that if your \texttt{thredds\_dataset\_root} value reads thus: \texttt{esg\_data| /partion1/noncommercial}, the `\texttt{partition1/noncommercial}' part of the path will be substituted by \texttt{esg\_data} in the thredds url and hence would not match the regex you'd planned to capture `noncommercial'.  It is therefore preferred to simply use the name of the \texttt{thredds\_dataset\_root} as the regex match.
\subsection{Data restricted to `Non-Commercial usage only', by site}
\begin{longtable}{|l|l|c|}
\hline
\multicolumn{1}{|c|}{\textbf{Sl}} & \multicolumn{1}{c|}{\textbf{Site}} & \multicolumn{1}{c|}{\textbf{Data}}\endhead
\hline
1. & BADC & None \\
\hline
2. & DKRZ & CLMcom data\\
\hline 
3. & DMI & HMS data\\
\hline
4. & IPSL & None\\
\hline
5. & LIU-NSC & None\\
\hline
6. & UIO & None\\
\hline
7. & UNICAN & All\\
\hline
\end{longtable}
\captionof{table}{Data restricted to `non-commercial usage only', by site}
\section{Enabling Gridftp and OPeNDAP access}
\label{gridftpaccess}
\label{opendapaccess}
In order to enable OPeNDAP access, simply ensure that the following lines are present in your \texttt{esg.ini} file:
\begin{verbatimtab}[4]
thredds_file_services =
    HTTPServer | /thredds/fileServer/ | HTTPServer | fileservice
    GridFTP | gsiftp://<nodename>:2811/ | GRIDFTP | fileservice
    OpenDAP | /thredds/dodsC/ | OpenDAP | fileservice
\end{verbatimtab}
To allow incoming OpenDAP requests, you should also ensure that access is provided to CORDEX group members. It's done by putting these lines in the \texttt{/esg/config/esgf\_policies\_local.xml} file:
\begin{tiny}
\begin{verbatimtab}[4]
<policy resource=".*dodsC.*cordex.*" attribute_type="CORDEX_Research" attribute_value="user" action="Read"/>
<policy resource=".*dodsC.*cordex.*" attribute_type="CORDEX_Commercial" attribute_value="user" action="Read"/>
<policy resource=".*cordex.*aggregation.*" attribute_type="CORDEX_Research" attribute_value="user" action="Read"/>
<policy resource=".*cordex.*aggregation.*" attribute_type="CORDEX_Commercial" attribute_value="user" action="Read"/>
\end{verbatimtab}
\end{tiny}
If you intend to offer gridftp, don't forget to allow inbound access to TCP port 2811
\section{Handy pre-publish tips}
\subsection{Adding checksums}
It's a good practice to publish data along with their checksums, so that silent corruption or download errors don't get away unnoticed. Generating checksums at publish time may drastically slow down a publication, so it's advisable to precompute them and then insert them into the mapfile. There are several ways in which one could generate checksums. I prefer to use \href{http://www.gnu.org/software/parallel}{gnu parallel} to speed things up.
\begin{verbatimtab}
time(find . -name '*.nc' -print| parallel sha256sum) >/unsorted-sha256sums \
2>checksumout;
cat unsorted-sha256sums|sort -k 2,2 >sha256sums;
\end{verbatimtab}
Generate the map file using the standard \texttt{esgscan\_directory} and then use the following python script to insert the checksums. This script is available in the `\textbf{scripts}' directory in the \href{https://github.com/snic-nsc/datanode-mgr-doc.git}{repo}.
\begin{tiny}
\begin{verbatimtabinput}[4]{scripts/addchecksum.py}
\end{verbatimtabinput}
\end{tiny}
Simply call the script like this:
\begin{verbatimtab}
python addchecksum.py <checksumfile> <mapfile> <outputfile>
\end{verbatimtab}
\section{Acknowledgments}
Many people have contributed to this document, pointing out errors and suggesting improvements. Thanks in particular to Hans Ramthun, Katharina Berger and Stephanie Legutke of the DKRZ, Stephen Pascoe of the BADC, and  Jose Carlos Blanco of UNICAN, for their suggestions and help. Together, we strive to make the task of datanode administration a bit less of a hopeless task!