About GStat
---------------------------------------------------------------------
GStat is an application designed to monitor EGEE/LCG compatible
Information Systems. GStat's primary goals are to detect faults,
verify the validity of the published information, and display useful
data from the Information System.
GStat tests the Information System approximately every 30 minutes.
The test does not rely on any submitted job, but rather on queries
to site GIISes/BDIIs. This is done to gather information and
perform so-called sanity checks that point out potential problems
with individual sites. The test covers the following areas:
* Site and service information: Provides information about the site,
services, software and VOs supported at that site.
* Usage information: Provides statistics on job slots, jobs and
storage space.
* Information integrity: Checks whether the Information System is
publishing data that meets specific syntax and value rules.
GStat Internals
---------------------------------------------------------------------
GStat runs on a single server. From this server, GStat executes
queries, processes the results and generates static HTML reports.
Execution of queries and result processing are accomplished by agents
and filters respectively. Currently two servers run GStat, one at
ASGC and one at CNAF. Both servers can be accessed through this
alias: http://gstat.gridops.org/gstat/
GStat agents are responsible for making queries and collecting raw
data for further analysis by filter components. Filters execute test
logic and generate processed data, which is in turn used to create
web-based test reports.
The configuration of GStat heavily depends on the data found in the
GOCDB. GStat queries the GOCDB for the site's GIIS contact string,
node information and other basic site information.
GStat currently stores numeric data in RRD databases with data
reduction. Additional historical results can be found in daily
snapshot archives.
Using GStat
---------------------------------------------------------------------
The GStat interface consists of these components:
* Top menu
* Summary table
* Total statistics
* Global tests
* Table view
* Site report
At the top of the main page, you will find the following links:
Home: Brings you to the main page of the current instance.
Alert: Shows only sites with alert levels of warning and above.
Service: Displays services that are available to VOs.
Regional: Shows only sites from selected regions.
Service metric: Middleware version statistics
Links: Related links to GStat
?: Main help and documentation page
The remaining links to the right point to other GStat instances that
run on this server. A few special instances are:
prod: Production certified EGEE sites
pps: Preproduction EGEE sites
test: Sites that are either non-production or uncertified
The summary table view shows each site name together with the most
severe status among all tests associated with that site. Clicking on
the site name displays the detailed GStat report for the site. Each
site also has a small table cell to the right. This cell indicates,
and links to, the results on the SAM Tests page. Multiple cells
indicate that the site hosts multiple CEs.
Below the summary table, you can find a total statistics table for
the entire instance. The 'Total' link will display graphs associated
with these statistics.
The Global Grid Test displays the results of the service duplicate
test. This test is designed primarily to look for duplicate instances
of global LFC services for a single VO. There should only be one
global LFC for each VO.
At the bottom of the main page is the table view, which shows
individual test results for each site. This table can be reorganized
into different perspectives using the sort-by links at the top of
the table.
Finally, there is a detailed report generated for each site. At the
top of the report, you will find links to the site's homepage, SAM
results, GOCDB and graphs for all test result data associated with
the site. The body of the report consists of individual sections
for each test performed and their detailed results. Each test
section displays the name of the test, the result status, a link
to the alert status history graph and help documentation for the test.
The bottom of the report shows test data results in both tables and
graphs. Long-term graphs can be reached by following the link for
each graph.
All GStat tests take scheduled downtime booked in the GOCDB into
account when setting the alert level of a result. There are two cases:
* If the whole site is in downtime, the site alert level is changed
to maintenance status. In addition, the test alert level of the
'GOC DB Info' section in the site report is marked as maintenance.
All tests associated with the site still run normally, so the real
status and details of the test results are shown even though the
site is in downtime.
* If only some nodes at the site are in downtime, the alert levels
of tests associated with those nodes are marked as maintenance, but
the site alert level is not affected. In particular, the 'Service
Check' section in the site report ignores the nodes in maintenance
and takes the most severe status among the remaining nodes as the
test alert level. Please note that if the site-bdii is in downtime,
the alert level of tests associated with the bdii is marked as
maintenance, but the details of the check results are still shown.
Feedback and comments for GStat can be sent to roc-dev at
lists.grid.sinica.edu.tw and issues can be raised with GGUS tickets
by adding GStat to the ticket title.
The section below describes the filters available for GStat.
BDIINode Performance Filter: Checks BDII node performance
Column Name: bnode
Conditions Alert level
-------------------------------------
No problems OK
Response time > 10 secs INFO
No entries found ERROR
-------------------------------------
This filter runs ldapsearch queries against the top-level BDII
nodes found in the GOCDB. The number of entries found and the
query response time (ms) are recorded.
To query the BDII, the following command and options are used:
ldapsearch -xL -s one -l 15 -h -p 2170
-b 'mds-vo-name=local,o=grid'
This query only searches one level below the search base
provided. The number of entries represents the number of sites
found in the BDII query.
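As an illustration only, the alert table for this filter can be read as
a small decision function. The sketch below is in Python; the function
name is hypothetical, and the choice to let "no entries" take precedence
over a slow response is an assumption, not something GStat documents.

```python
def bdii_alert_level(entry_count, response_ms):
    """Map one BDII query result to an alert level (hypothetical helper)."""
    # No entries returned: the query failed or the BDII is empty.
    # Assumption: this outranks a slow response.
    if entry_count == 0:
        return "ERROR"
    # The response time is recorded in milliseconds, while the
    # threshold in the table is 10 seconds.
    if response_ms > 10 * 1000:
        return "INFO"
    return "OK"
```

For example, a query that returns 250 entries in 12000 ms would be
reported as INFO under these assumptions.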
The current results and graphs for each site's BDIIs can be
found in the site's detailed reports. The following suffixes
are used after the BDII hostnames:
BE BDII Entries
BT BDII response Times
CERNSE Check Filter: Checks if BDII has CERN SE
Column Name: bnode
Conditions Alert level
-------------------------------------
No problems OK
Problems with SE object NOTE
-------------------------------------
This filter checks whether the CERN SE samdpm001.cern.ch, used in
SFT, can be found in each site's BDII. If this SE is missing,
the SFT replication test may fail.
GIIS Performance Filter: Checks GIIS Query performance
Column Name: gperf
Conditions Alert level
-------------------------------------
No problems OK
Response time > 40 secs INFO
No entries found ERROR
Old entries found ERROR
-------------------------------------
The filter shares the same agent as the SanityCheck filter and
uses the same ldapsearch query results.
The number of entries found, the number of old entries (not
modified within 10 minutes) and the query response time (ms) are
recorded. If any old entry is found, the oldest modifyTimestamp
value found in the information system and the time at which GStat
started checking for old entries are both listed, together with
the difference between the two timestamps.
The current results and graphs for each site's GIISes can be
found in the site's detailed reports. The following names
identify the data collected:
giisEntry GIIS Entries
giisOld GIIS Old entries - with modifyTimestamp
older than 10 minutes
giisTime GIIS response Times (ms)
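The old-entry check described above can be sketched as follows. This is
a minimal Python illustration, not GStat's own code: the function names
are hypothetical, and it assumes modifyTimestamp values arrive in the
usual LDAP UTC form without fractional seconds (for example
20100731081500Z).

```python
from datetime import datetime, timedelta, timezone

# Entries not modified within this window count as old.
OLD_AFTER = timedelta(minutes=10)


def parse_modify_timestamp(value):
    """Parse an LDAP modifyTimestamp such as '20100731081500Z' (UTC)."""
    parsed = datetime.strptime(value, "%Y%m%d%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def summarize_old_entries(timestamps, check_started):
    """Return (number of old entries, oldest timestamp, its age)."""
    parsed = [parse_modify_timestamp(t) for t in timestamps]
    old = [t for t in parsed if check_started - t > OLD_AFTER]
    if not old:
        return 0, None, None
    oldest = min(old)
    return len(old), oldest, check_started - oldest
```

The three returned values correspond to giisOld, the oldest
modifyTimestamp and the difference between that timestamp and the start
of the check.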
GIISQuery SanityCheck Filter: Performs syntax and logic checks on GIIS
Column Name: sanity
-------------------------------------
Conditions Alert level
-------------------------------------
no problems OK
blank lines exists NOTE
blank values found WARN
invalid entries WARN
query failed ERROR
-------------------------------------
This filter performs several types of checks on the GIIS output:
1 - Syntax Checks
a) Check for non-zero-length blank lines, i.e. lines that contain
only spaces. These may cause problems.
b) Check for entries that have no values.
c) Check for lines without ":". These should not exist.
d) Check for a missing newline character between two attributes.
This looks like two lines combined together.
e) Check for duplicate GlueCEStateWorstResponseTime
in each CE.
2 - Missing attributes
a) Check if the GlueCEUnique & GlueSEUnique DNs specified in
"dn: GlueCESEBindGroupCEUniqueID=" exist
b) Check if srm_v1/edg-se SEs have consistent access
rules between the GlueSARoot and GlueServiceURI DN entries
c) Check if the following critical DNs and their attributes exist
IN: dn: GlueSiteUniqueID=
IN: dn: GlueServiceUniqueID=
GlueServiceType .+
GlueServiceEndpoint .+
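Syntax checks 1a to 1c above can be sketched in a few lines of Python.
This is an illustration under stated assumptions rather than GStat's
implementation: the function name is hypothetical, and the input is
assumed to be ldapsearch LDIF output in which wrapped (folded) lines
have already been joined.

```python
def syntax_problems(ldif_text):
    """Return (line number, problem) pairs for checks 1a to 1c."""
    problems = []
    for number, line in enumerate(ldif_text.split("\n"), start=1):
        if line == "" or line.startswith("#"):
            continue  # entry separators and LDIF comments are fine
        if line.strip() == "":
            # 1a: the line is not empty but holds only whitespace
            problems.append((number, "blank line containing spaces"))
        elif ":" not in line:
            # 1c: every attribute line needs a ':' separator
            problems.append((number, "line without ':'"))
        elif line.split(":", 1)[1].strip() == "":
            # 1b: attribute name present but nothing after the ':'
            problems.append((number, "attribute with no value"))
    return problems
```

Checks 1d and 1e need knowledge of attribute names and of entry
boundaries, so they are left out of this sketch.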
Related wikis:
* http://goc.grid.sinica.edu.tw/gocwiki/Value_for_%22GlueSEUniqueID%22_not_published
* http://goc.grid.sinica.edu.tw/gocwiki/Value_for_%22GlueCEUniqueID%22_not_published
* http://goc.grid.sinica.edu.tw/gocwiki/Value_for_%22GlueSAStateAvailableSpace%22_is_not_published
* http://goc.grid.sinica.edu.tw/gocwiki/Value_for_%22GlueCESEBindCEAccesspoint%22_is_not_published
* http://goc.grid.sinica.edu.tw/gocwiki/GIIS_unreachable
* http://goc.grid.sinica.edu.tw/gocwiki/Invalid_Installation_Date
* http://goc.grid.sinica.edu.tw/gocwiki/Information_from_GRIS_not_published_to_GIIS
GIISQuery Service Filter: Checks for services in GIIS
Column Name: serv
----------------------------------------------------------------------------------
Service registered in GOCDB Missing Service in BDII Alert level
----------------------------------------------------------------------------------
none missing OK
RB missing ResourceBroker WARN
MyProxy missing MyProxy WARN
CE missing GlueCE ERROR
gLite-CE missing GlueCE ERROR
CREAM-CE missing GlueCE ERROR
ARC-CE missing GlueCE ERROR
Classic-SE missing GlueSE ERROR
Central-LFC missing lcg-file-catalog ERROR
Local-LFC missing lcg-local-file-catalog ERROR
WMS missing org.glite.wms.WMProxy ERROR
LB missing org.glite.lb.Server ERROR
Site-BDII missing bdii_site INFO
Top-BDII missing bdii_top INFO
----------------------------------------------------------------------------------
Service published in BDII Missing Service in GOCDB Alert level
----------------------------------------------------------------------------------
bdii_site missing Site-BDII INFO
bdii_top missing Top-BDII INFO
----------------------------------------------------------------------------------
This filter takes the list of service nodes in the GOCDB and
checks whether the services are published in the information
system as a "GlueServiceUniqueID", a "GlueCEUniqueID" or a
"GlueSEUniqueID" object. This allows a site to notice
if an important service goes down and ceases to publish
its presence in the information system. This filter also
checks whether the GlueService DNs in the information system
are registered in the GOCDB as the corresponding service types.
The monitoring and downtime status of each node is shown in the
"Monitored" and "Downtime" columns. The "GOCDB NodeTypes" column
lists the corresponding service types for the nodes registered
in the GOCDB, and the "BDII ServiceTypes" column contains
the list of GlueServiceType values from the GlueServiceUniqueID
DNs that have the same hostname and correspond to the node
in the GOCDB.
The history of the node status is also collected. If the node
or service is missing, the alert level shown above is raised.
If monitoring of the node is turned off in the GOCDB, the alert
level is set to 0 or "NA".
Note:
This filter depends on the results from the GOCDB Agent plugin.
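The first table for this filter is essentially a lookup from GOCDB
service type to the service expected in the BDII. A minimal Python
sketch of that lookup follows; the function name, the argument names
and the treatment of service types that are not in the table are
assumptions made for illustration.

```python
# GOCDB service type -> (service expected in the BDII, alert if missing),
# copied from the first table of this filter.
EXPECTED_IN_BDII = {
    "RB": ("ResourceBroker", "WARN"),
    "MyProxy": ("MyProxy", "WARN"),
    "CE": ("GlueCE", "ERROR"),
    "gLite-CE": ("GlueCE", "ERROR"),
    "CREAM-CE": ("GlueCE", "ERROR"),
    "ARC-CE": ("GlueCE", "ERROR"),
    "Classic-SE": ("GlueSE", "ERROR"),
    "Central-LFC": ("lcg-file-catalog", "ERROR"),
    "Local-LFC": ("lcg-local-file-catalog", "ERROR"),
    "WMS": ("org.glite.wms.WMProxy", "ERROR"),
    "LB": ("org.glite.lb.Server", "ERROR"),
    "Site-BDII": ("bdii_site", "INFO"),
    "Top-BDII": ("bdii_top", "INFO"),
}


def node_alert(gocdb_type, bdii_services, monitored=True):
    """Alert level for one GOCDB node, given the services its host publishes."""
    if not monitored:
        return "NA"  # monitoring turned off in the GOCDB
    expected = EXPECTED_IN_BDII.get(gocdb_type)
    if expected is None:
        return "OK"  # assumption: types outside the table are not checked
    service, level = expected
    return "OK" if service in bdii_services else level
```

A node registered as WMS whose host only publishes org.glite.lb.Server
would therefore be reported as ERROR.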
GIISQuery Service Verify Filter: Checks GlueServiceUniqueID
Column Name: serEntry
-------------------------------------
Conditions Alert level
-------------------------------------
no problems OK
srm check ERROR
-------------------------------------
This filter verifies syntax of GlueServiceUniqueID entities.
The following checks are currently performed.
1. Check if SRM has one of the following acceptable types and versions
* SRM 1.1.0, 2.2.0
* srm 1.1.0, 2.2.0
* srm_v1 1.1.0
** other types starting with "srm" are not acceptable
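The SRM rule above can be expressed as a lookup of accepted versions
per service type. The Python sketch below is illustrative only: the
function name is hypothetical, and matching the "srm" prefix without
regard to case is an assumption.

```python
# Accepted GlueServiceType -> accepted GlueServiceVersion values,
# taken from the list above.
ACCEPTED_SRM = {
    "SRM": {"1.1.0", "2.2.0"},
    "srm": {"1.1.0", "2.2.0"},
    "srm_v1": {"1.1.0"},
}


def srm_entry_ok(service_type, version):
    """Return True if the entry passes the SRM check, False for an ERROR."""
    # Assumption: the "starting with srm" rule ignores case.
    if not service_type.lower().startswith("srm"):
        return True  # not an SRM service, so this check does not apply
    return version in ACCEPTED_SRM.get(service_type, set())
```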
Related wikis:
* http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_the_version_of_my_SRM
GIISQuery SiteInfo Filter:
Column Name: version
-------------------------------------
Conditions Alert level
-------------------------------------
No problems OK
old dataGridVersion NOTE
sitename mismatch GOCDB NOTE
-------------------------------------
Values: dataGridVersion found
Detailed site report includes the following information:
siteName: Name of site
dataGridVersion: Middleware version installed
UserSupportContact: User support email contact
SysAdminContact: Administrator email contact
GlueSiteLatitude: -90 to 90 degrees
GlueSiteLongitude: -180 to 180 degrees
GlueCEUniqueID: List of CE found
GlueSEUniqueID: List of SE found
GlueServiceURI: List of services and their URI
GlueHostApplicationSoftwareRunTimeEnvironment:
List of software packages installed on this subcluster
The OS Name and Release are checked against the
accepted values registered in this wiki:
http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_the_OS_name
-------------------------------------------------------------
GlueHostOperatingSystemName GlueHostOperatingSystemRelease
AIX 5.2
CentOS 3.5
CentOS 3.6
CentOS 3.7
CentOS 3.8
CentOS 4.2
CentOS 4.5
CentOS 4.6
CentOS 4.7
CentOS 4.8
CentOS 5.0
CentOS 5.1
CentOS 5.2
CentOS 5.3
CentOS 5.4
CentOS 5.5
Debian 3.1
Debian 4.0
FedoraCore 4
Gentoo 2006.0
RedHatEnterpriseAS 3
RedHatEnterpriseAS 4
linux-rocks-3.1 Rocks Linux
linux-rocks-4.1 Rocks Linux
Scientific Linux 3.0.3
Scientific Linux 3.0.4
Scientific Linux 3.0.5
Scientific Linux 3.0.6
Scientific Linux 3.0.7
Scientific Linux 3.0.8
Scientific Linux 3.0.9
ScientificSL 4.2
ScientificSL 4.3
ScientificSL 4.4
ScientificSL 4.5
ScientificSL 4.6
ScientificSL 4.7
ScientificSL 4.8
ScientificSL 5.0
ScientificSL 5.1
ScientificSL 5.2
ScientificSL 5.3
ScientificSL 5.4
ScientificSL 5.5
Scientific Linux CERN 3.0.4
Scientific Linux CERN 3.0.5
Scientific Linux CERN 3.0.6
Scientific Linux CERN 3.0.8
ScientificCERNSLC 4.3
ScientificCERNSLC 4.4
ScientificCERNSLC 4.5
ScientificCERNSLC 4.6
ScientificCERNSLC 4.7
ScientificCERNSLC 4.8
ScientificCERNSLC 5.2
ScientificCERNSLC 5.3
ScientificCERNSLC 5.4
ScientificCERNSLC 5.5
SUSE LINUX 9
SUSE LINUX 10
SUSE LINUX 10.2
Ubuntu 5.10
Ubuntu 6.06
Ubuntu 8.04
Ubuntu 8.10
GlueCEPolicyMaxTotalJobs: should be set to an accurate number
Related wikis:
* http://goc.grid.sinica.edu.tw/gocwiki/Sitename_inconsistency
* http://goc.grid.sinica.edu.tw/gocwiki/Contact_e-mail_address_inconsistency
* http://goc.grid.sinica.edu.tw/gocwiki/How_to_publish_the_OS_name
GIISQuery Usage Filter: Analyzes GIIS for CPU, Job & Storage Usage
Column Name: totalcpu, cpuUsed %, runjob, freejob
seAvail, seUsed %
Values: Number - success
"" - not available
-------------------------------------
Conditions Alert level
-------------------------------------
No problems OK
se percent usage > 80% INFO (1)
se percent usage > 90% WARN (1)
seAvail < 1GB WARN
waitJob > 50*totalCPU WARN
waitJob > 150*totalCPU ERROR
no cpu info found ERROR
no job info found ERROR
-------------------------------------
(1) Alert suppressed if more than 5 TB storage available.
totalCPU: Total number of cpu for the site
freeCPU: Number of free cpus
runJob: Total running jobs on each CE queue
waitJob: Total waiting jobs on each CE queue
seAvail: SE storage space available
seUsed: SE storage space used
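The threshold rows of the table above can be sketched as a Python
function that returns every row that applies. Several details are
assumptions made for illustration: the function and argument names,
storage values expressed in GB, percent usage computed as
used / (used + available), and 5 TB taken as 5 * 1024 GB. The two
"no info found" rows are not covered.

```python
def usage_alerts(se_avail_gb, se_used_gb, wait_jobs, total_cpu):
    """Return the (condition, level) rows of the table that apply."""
    alerts = []
    total_gb = se_avail_gb + se_used_gb
    used_pct = 100.0 * se_used_gb / total_gb if total_gb else 0.0
    # Footnote (1): percent-usage alerts are suppressed when more
    # than 5 TB is still available.
    if se_avail_gb <= 5 * 1024:
        if used_pct > 90:
            alerts.append(("se percent usage > 90%", "WARN"))
        elif used_pct > 80:
            alerts.append(("se percent usage > 80%", "INFO"))
    if se_avail_gb < 1:
        alerts.append(("seAvail < 1GB", "WARN"))
    # Only the stronger of the two waiting-job rows is reported.
    if wait_jobs > 150 * total_cpu:
        alerts.append(("waitJob > 150*totalCPU", "ERROR"))
    elif wait_jobs > 50 * total_cpu:
        alerts.append(("waitJob > 50*totalCPU", "WARN"))
    return alerts
```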
Notes:
-------
CPU
-------
* Physical CPU defined
If the subcluster PhysicalCPU is configured to a non-zero number for
any CE of a given site, this number is used as the totalCPU. If
another CE's PhysicalCPU is set to zero, that CE's stats are
excluded. This is useful if a site has two or more CEs that point to
the same batch system: the site should then set the subcluster's
PhysicalCPU only for CEs with unique batch systems.
* Physical CPU not defined
If GlueSubClusterPhysicalCPU is not defined for a CE, the TotalCPUs
values of the queues on that CE are used. To avoid counting CPUs
more than once when queues on the same CE refer to the same cluster,
only the queue with the maximum "GlueCEInfoTotalCPUs" is added to the
totalCPU and freeCPU values. Queues may publish different
"GlueCEInfoTotalCPUs" values while all referring to the same physical
cluster, so the best estimate of site CPUs is the value from the
largest queue.
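One possible reading of the two CPU rules is sketched below in Python.
The data layout, the names and the use of None for "not defined" are
assumptions made for illustration, and the sketch only covers totalCPU.

```python
def site_total_cpu(ces):
    """Estimate totalCPU for a site.

    `ces` maps a CE name to a dict with:
      "physical_cpu": GlueSubClusterPhysicalCPU, or None if not published
      "queue_total_cpus": GlueCEInfoTotalCPUs for each queue on that CE
    """
    total = 0
    for ce in ces.values():
        physical = ce["physical_cpu"]
        if physical is None:
            # Not defined: queues on one CE usually share a cluster,
            # so only the largest queue is counted.
            total += max(ce["queue_total_cpus"], default=0)
        elif physical > 0:
            total += physical
        # physical == 0: the CE is excluded (shared batch system)
    return total
```

With one CE publishing 200 physical CPUs, a second CE set to zero
because it shares the same batch system, and a third CE without
PhysicalCPU whose queues report 40, 64 and 64 CPUs, the estimate is
200 + 64 = 264.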
-------
Storage
-------
For site storage statistics, GStat prefers to use the summary information
in GlueSE to calculate the total available space and storage usage. If
these summary data are unset or zero, GStat falls back to the
information in GlueSA instead.
If the information in GlueSE is used, GStat uses the GlueSchema version
to determine how to calculate the space:
Glue 1.2:
Storage Available = GlueSESizeFree
Storage Used = GlueSESizeTotal - GlueSESizeFree
Glue 1.3:
Storage Available = (GlueSETotalNearlineSize + GlueSETotalOnlineSize) -
(GlueSEUsedNearlineSize + GlueSEUsedOnlineSize)
Storage Used = GlueSEUsedNearlineSize + GlueSEUsedOnlineSize
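The two sets of GlueSE formulas translate directly into code. In the
Python sketch below, the function name and the representation of a
GlueSE object as a dict of numeric attribute values are assumptions.

```python
def se_storage(se, glue_version):
    """Return (available, used) from GlueSE summary attributes."""
    if glue_version == "1.2":
        available = se["GlueSESizeFree"]
        used = se["GlueSESizeTotal"] - se["GlueSESizeFree"]
    elif glue_version == "1.3":
        used = se["GlueSEUsedNearlineSize"] + se["GlueSEUsedOnlineSize"]
        total = se["GlueSETotalNearlineSize"] + se["GlueSETotalOnlineSize"]
        available = total - used
    else:
        raise ValueError("unsupported Glue schema version: %s" % glue_version)
    return available, used
```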
The storage area (GlueSA) is a logical portion of storage extent assigned
to a VO. Storage areas can overlap the same physical space, thus causing
contention over the free space among different VOs. If the information
in GlueSA is used, checks are in place so that VOs sharing the same
physical partition are not counted twice.
To determine if VOs are on the same partition, we assume that VOs with
identical GlueSAStateAvailableSpace values are sharing partitions. This
causes problems only if 2 partitions have exactly the same available
disk space, which should have a low probability.
Glue 1.2:
Storage Available = add up the distinct values of GlueSAStateAvailableSpace
in the GlueSA.
Storage Used = add up the values of GlueSAStateUsedSpace if
GlueSAStateAvailableSpace values are distinct in the GlueSA.
Glue 1.3:
Storage Available = the same manner as Glue 1.2
Storage Used = add up the values of GlueSAStateUsedSpace in all GlueSA.
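The GlueSA rules can be sketched as a single pass that remembers which
GlueSAStateAvailableSpace values have already been seen. The Python
below is an illustration under assumptions: the function name is
hypothetical, each GlueSA is reduced to an (available, used) pair, and
for Glue 1.2 the used space of the first GlueSA with a given available
value is the one counted.

```python
def sa_storage(storage_areas, glue_version):
    """Return (available, used) from (available, used) pairs, one per GlueSA."""
    seen = set()
    available = 0
    used = 0
    for sa_avail, sa_used in storage_areas:
        first_time = sa_avail not in seen
        if first_time:
            # Identical available space is taken to mean a shared
            # partition, so each distinct value is counted once.
            seen.add(sa_avail)
            available += sa_avail
        # Glue 1.3 sums the used space of every GlueSA; Glue 1.2 only
        # of those with a distinct available space.
        if glue_version == "1.3" or first_time:
            used += sa_used
    return available, used
```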
`empty entries` means that the information could not be obtained
from GIIS.
Related wikis:
* http://goc.grid.sinica.edu.tw/gocwiki/Unreliable_gathering_of_CE_Information
RRDFetch Hist Filter: Displays average of RRD data
Column Name: maxcpu, avgcpu
Values: Number - success
"" - not available
maxcpu: max of daily max CPUs number found in GIIS for
last 30 days
avgcpu: average of daily avg CPUs number found in GIIS
for last 30 days
These values indicate the relative size of the site
and provide a reference for how many CPUs are normally
available.
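As a minimal sketch, the two values can be derived from per-day samples
as shown below. The function name and the input format (one number per
day, oldest first) are assumptions; reading the values out of the RRD
files is not shown.

```python
def cpu_history(daily_max, daily_avg):
    """Return (maxcpu, avgcpu) over the last 30 days, or "" if no data."""
    last_max = daily_max[-30:]
    last_avg = daily_avg[-30:]
    maxcpu = max(last_max) if last_max else ""
    avgcpu = sum(last_avg) / len(last_avg) if last_avg else ""
    return maxcpu, avgcpu
```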
DeployQuery Deployment Info Filter:
Please check this page for more details:
https://lcg-sam.cern.ch:8443/sam/sam.py?funct=StatusTable&sensors=CE&vo=ops
This filter parses the LCG grid deployment Site Functional Test results
and integrates the information into this site.
GridICE Info Filter:
Column Name: gice
-------------------------------------
Conditions Alert level
-------------------------------------
No problems OK
no GridICE service INFO
GridICE not accessible INFO
no host monitored INFO
no batch system INFO
-------------------------------------
Detailed site report includes the following information:
GlueHostUniqueID: Represents the hosts that are monitored
by GridICE. This should be > 0.
GlueBatchSystemType: The type of batch system monitored by
GridICE. If no entries are found then
batch system monitoring is not enabled.
This filter requires that GStat queries a BDII to collect the available
GridICE 'GlueServiceAccessPointURL' values. These available GridICE
agents are then matched to a given site if the agent domain matches
that of the GIIS server. For some sites this can be a problem, and a
better approach may be needed.
If multiple GridICE agents are found, then results for all matching
agents are combined and provided to this filter for analysis.
GocDB Display Filter: Display GOC DB maintenance information
Column Name: none
Values: Shows the maintenance periods for this site
Copyright © ASGC/CERN
All Rights Reserved
Comments to author: roc-dev at lists.grid.sinica.edu.tw
Generated: Sat Jul 31, 2010