SAM FAQs
If you are a SAM shifter, with a shifter's oracle username and password, you may view an editable version of this page.

The Questions:

  1. Where can I find documentation on all available sam and sam_admin commands?
  2. How is a file marked as being bad (corrupt files etc)?
  3. How to find out who are the station administrators for a SAM station ?
  4. What happened when I ran sam dump station --all on d0mino?
  5. How can I define a dataset once, then periodically run a project to catch up as new files appear, i.e. get only the files which are in this definition but which I have not yet analysed?
  6. How can I stop/start/restart only some of the sam servers on a particular node?
  7. How do I upgrade my station?
  8. How do you use the Sam FAQ page?
  9. How do I register a new station?
  10. How do I update the sam_user and sam_admin documentation?
  11. Why is the Clued0 station unable to get files from d0mino, and what should I do about it? (Now obsolete)
  12. How to fix files that were submitted with incorrect metadata
  13. How do I add a new mass storage system location to SAM?
  14. How to resize a station cache disk (Updated on 05.16.07)
  15. One of the nodes on the clued0 station has an INACTIVE disk. What can I do?
  16. Why is my python not installed correctly and how do I fix this?
  17. How do I deal with a "special" sam user, like a group or new name for user with existing sam entry?
  18. What network ports are used by sam?
  19. How can I disallow SAM from using a disk?
  20. Why does the command sam configure abort when I try to modify or add a station group ?
  21. How to fix "Delivery is not possible to node d0csNNN.fnal.gov" error
  22. How can I log in as user 'sam'?
  23. Verifying the Enstore information as told by "Output from "cron" command"
  24. How can I check if the CAB batch system (PBS) is in synch with SAM?
  25. How to re-enable a CAB stager
  26. How to fix "Delivery is not possible to node xxxx-clued0.fnal.gov" error ?
  27. How is the intermediate cache cleaned up, for example MC files coming from the remote processing centers that are forwarded through D0mino?
  28. Project on clued0 is getting all file delivery errors (see SAM disaster #3 at http://www-clued0.fnal.gov/~sam/samTV/disasters).
  29. Why can't I "samadmin cache file" and why did SAM tell me that I had insufficient privilege ?
  30. How do I set up a local naming service? The compute nodes at my site do not have any network connectivity to the outside world (they are on a restricted private network)
  31. D0ora2 (alias d0db-prd.fnal.gov) unaccessible from offsite?
  32. Which log file should be looked at for problems related to "sam store" ?
  33. How to fix a "Failed: No files for dataset" error?
  34. sam store request rejected - dest file already exists?
  35. SAM store problem - duplicate file store request
  36. How to remove files from the SAM cache?
  37. Where are the FNAL SAM station logs?
  38. How do I determine how many files have been stored in each pnfs directory since a specific date?
  39. I've just been sent an email saying "Started the D0 Calib User Servers on failover Node: d0dbsrv6.fnal.gov". What does it mean?
  40. Do I have to have a sam station running on every machine where I want to execute a sam_submit command?
  41. Is it possible to get files delivered in a special order?
  42. The declared and activated disks do not show up as active. What is the problem?
  43. What to do if I see CORBA exception errors?
  44. If the file is not stored properly, hangs indefinitely or times out in D0Framework?
  45. If the encp command fails without contacting enstore?
  46. If FNORB and CORBA error messages give an unexpected/unrecognized exception?
  47. If some SAM jobs fail due to a timeout while waiting for the project master callback?
  48. If a user reports an autoregistration failure with a valid uid?
  49. If the SAM auto mailing list gets mail with subject station_prd fcp died at d0mino: status 1?
  50. What to do if losing or replacing a station disk?
  51. If any problem occurs about missing path in db?
  52. What are the duties of a shifter on a day with scheduled downtime (eg. first Tuesday of the month)?
  53. File delivery is very slow. Check if it is because enstore is very busy
  54. How do I add new "keywords" to SAM for storing data?
  55. Why does "ls -l /pnfs/sam/dzero/db2/datalogger/run175K_176K/datalogger/all/all/all_0000175645_079.raw" tell you that the file has size of 1 byte ?
  56. How do I find out which nodes one may submit "sam store" jobs ?
  57. Hung project?
  58. I need some information in order to regenerate missing metadata. How can I find out when did I consume a given dataset (e.g. dataset get-jpsi-july2004b-84, username stark)?
  59. What to do if files are declared with wrong file size or wrong CRC?
  60. What are the cron jobs in use for maintaining sam, and how can I find them?
  61. How to register a new sam_gridftp installation?
    Or How to add a new gridftp service certificate?
  62. How to add a new individual user certificate?
  63. How to add a storage area in the new SAM router d0rsam01, which replaced d0mino?
  64. How do you know whether a clued0 node is SAM-enabled batch node ?
  65. How does one add/set up a clued0 node/disk to be a SAM-enabled batch node?
  66. What does this error message : "^GWarning, discrepancies with file system on disk ..." mean ?
  67. What does the error message "You have violated the constraint CF_FI_FK" mean? It happened when trying to remove a file from the SAM
  68. Several segments of my job failed. How do I resubmit them?
  69. How do I add a durable location to the database?
  70. How to spot problems with machines
  71. How do I restart samTV?
  72. How to add new FOME keywords/parameters?
  73. How to see if someone is a registered sam user and how to register a new sam user?
  74. How to add a new entry to the facility name table.
  75. New D0 DB Servers - procedures for shifters (e-mail from Adam Lyon, Nov. 2006)
  76. How to use an existing dataset and make new datasets with the files from the same tape?
  77. For a new SAM station installation or upgrade, what is a set of SAM package versions which are compatible and recommended for installation?
  78. How do I fix a corrupted xml database at GridKa / What are the commands to reinitialize the GridKa xml database?
  79. What do I do if samgrid.fnal.gov is not responding to web queries or MC job submission?
  80. What to do if the Plone Issue Tracker is not working ?
  81. Should we add a certificate to gridmapfile or gridmapfile-jimsam ?
  82. How to get CRC or other enstore information about a stored file ?
  83. Adding a new SRM location
  84. Additional method in fixing Dataset Definition Editor
  85. Is there a script to mark files with "content status" bad for a request-id ?
  86. How to remove a datalogger-d0ol? station from monitoring on the SAAG web page ?
  87. Uncache locations in Sam Cache (without restarting the SAM station)


    1) Where can I find documentation on all available sam and sam_admin commands?
All commands are documented, for both sam commands and samadmin commands. (Note that for the latter, both sam_admin XXX and samadmin XXX work.)

Also, note that most of the samadmin commands below can be issued as user sam (with your root principal) on d0ora3, or any sam enabled clued0 machine.

Contributed by:
system

Updated by:
grenier

    2) How is a file marked as being bad (corrupt files etc)?
You need to use the sam_admin command:
sam_admin update file content status to mark the file as being bad.
Such files will no longer be delivered to user projects.

If the request does not include a reason for marking the file bad, ask the user for the reason, so that it gets documented in the issue tracker system. If the reason is a corrupted file, check that it has been downloaded correctly and that it isn't corrupt on tape. You can do this by comparing the CRC checksum from the metadata to the one on disk:

setup encp -q d0en
ecrc FILE
sam get metadata --file=FILE
If they are different you should remove the replica on disk (but the tape is likely good).

[To check if the tape is good, you may also want to check the CRC in enstore and compare it to the metadata:
setup encp -q d0en
enstore file --bfid `enstore pnfs --bfid /pnfs/path/to/file`

This command needs to be executed on a node where the /pnfs path is visible. If the enstore and metadata values differ, you should ask the creator of the file whether the file is OK.]

If the checksums are the same but the file cannot be read (e.g. with dsdump --notree FILE), then the file is bad on tape and should be marked as BAD.
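
The checks above amount to a small decision rule. The Python sketch below only encodes that rule. The function name and the returned strings are illustrative and are not part of SAM, and the CRC values are assumed to have been read by hand from the ecrc and sam get metadata output.

```python
def diagnose_file(metadata_crc, disk_crc, readable):
    """Suggest an action from the CRC comparison described in this FAQ.

    metadata_crc: CRC recorded in the SAM metadata
    disk_crc:     CRC computed for the replica on disk (e.g. with ecrc)
    readable:     True if the file could be read (e.g. with dsdump)
    """
    if disk_crc != metadata_crc:
        # The disk copy differs from what was declared: the replica is
        # suspect, while the tape copy is likely still good.
        return "remove the replica on disk"
    if not readable:
        # Checksums agree but the content cannot be read.
        return "mark the file content status as bad"
    return "no action"


# Illustrative values only (not taken from a real file).
print(diagnose_file(metadata_crc=1234, disk_crc=9999, readable=True))
print(diagnose_file(metadata_crc=1234, disk_crc=1234, readable=False))
```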

The command to use is:

setup sam   (setup sam_admin is deprecated!)
samadmin modify file content status --fileName={file name} --fileContentStatus=bad \
--comment="As requested by Full Name, see issue #XXX" --connect={username/password@database}

You may omit "--connect=" if the $SAM_ORACLE_CONNECT environment variable is set.

An example is:
 setup sam
 samadmin modify file content status \
--fileName=d0reco_p14.05.02_NumEv-250_dzero_mcp14_ouhep0.nhn.ou.edu_12899_04125210024 \
--fileContentStatus=bad --connect=oneil/PASSWD@d0ofprd1 \
--comment="As requested by R. Hauser - see issue #731"
You can check that it worked using the command "sam dump file --file=fileName", or by making sure it does not still appear in a dimensions query which requires content_status=good. For instance:
sam translate constraints --dim="global.requestid 12899 and data_tier reconstructed and availability_status available and content_status good"

[Note that file content status is not the same thing as availability. Files which are corrupt should have their content status set to bad.]

Contributed by:
system

Updated by:
kinyip

    3) How to find out who are the station administrators for a SAM station ?
Go to the SAM Data Browsing page, which has an item called "Station Queries" in the middle. Click that option and choose "Station Administrators Query". Be sure to use the right version (prd or dev) of the SAM Data Browsing page.
Contributed by:
system

Updated by:
yann

    4) What happened when I ran sam dump station --all on d0mino?
You should not run sam dump station --all on d0mino, and you should discourage others from running it: it dumps every file in the SAM cache on d0mino (many thousands of files). This can cause problems for other SAM users by tying up the database.
Contributed by:
system

    5) How can I define a dataset once, then periodically run a project to catch up as new files appear, i.e. get only the files which are in this definition but which I have not yet analysed?
1) Make a SAM dataset definition using your preferred tools (web page, command line interface), e.g. dataset definition name: thesisSample
2) Choose a project "base" name. An incrementing number will be appended to this base to ensure unique project names for each catch-up run, e.g. makeThesis_
3) Make a second SAM dataset definition using the following constraints:
   __set__ thesisSample minus (project_name makeThesis_% and consumed_status consumed and consumer YOUR_USER_NAME)

Call it what you want, e.g. thesisSample_catchup. The wildcard % is important.
4) Run your 1st project
sam submit --defname=thesisSample --project=makeThesis_0 .......etc
5) Run a catch-up project when more files are available:
sam submit --defname=thesisSample_catchup --project=makeThesis_1 ......etc

  and again

sam submit --defname=thesisSample_catchup --project=makeThesis_2 ......etc

6) Repeat step 5 whenever you like, remembering to use the --project flag and to increment the number.
7) Every time you find a bug in your code and want to re-run everything, repeat step 4 with a different base name, e.g. makeThesis_v1_
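
The naming scheme in the steps above can be sketched in Python. This is only a helper for building the strings shown in steps 3 to 5. The function names are hypothetical and nothing here talks to SAM.

```python
def catchup_dimensions(definition, project_base, user):
    """Build the constraint text from step 3 for a given dataset definition.

    The trailing % after the project base is the wildcard that matches
    every numbered catch-up project (makeThesis_0, makeThesis_1, ...).
    """
    return (
        f"__set__ {definition} minus (project_name {project_base}% "
        f"and consumed_status consumed and consumer {user})"
    )


def project_name(project_base, run_number):
    """Append the incrementing run number to the project base name."""
    return f"{project_base}{run_number}"


print(catchup_dimensions("thesisSample", "makeThesis_", "YOUR_USER_NAME"))
for run in range(3):
    print(project_name("makeThesis_", run))
```

Keeping the base name and the counter separate makes it harder to reuse a project name by accident when a catch-up run is repeated.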
Contributed by:
system

Updated by:
yann

    6) How can I stop/start/restart only some of the sam servers on a particular node?
a) Make sure your SAM_BOOTSTRAP_ENV is set to point to the "correct"
     server_list.txt file
b) Edit the server_list.txt file, comment OUT (with a "#") the servers
     you want to STOP

c) ups update sam_bootstrap

d) If you want to restart the servers, edit the file again, and
    UNCOMMENT the previously commented-out servers
e) ups update sam_bootstrap
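
Steps b) and d) are manual edits. For illustration, the same edit can be sketched in Python, assuming that each entry in server_list.txt starts with the server name followed by whitespace-separated fields (as in the line "fcp station_prd v1_5b" quoted elsewhere in this FAQ). The function name is hypothetical, and the sketch does not run ups update sam_bootstrap for you.

```python
def set_servers_enabled(text, servers, enabled):
    """Comment out (enabled=False) or uncomment (enabled=True) server entries.

    text:    contents of a server_list.txt file
    servers: set of server names (assumed to be the first field of a line)
    Note: a prose comment whose first word is a server name would also match.
    """
    result = []
    for line in text.splitlines():
        # Drop any existing leading '#' so the first field can be compared.
        entry = line.strip().lstrip("#").strip()
        fields = entry.split()
        if fields and fields[0] in servers:
            line = entry if enabled else "#" + entry
        result.append(line)
    return "\n".join(result)


# The "stager" line is a made-up sample entry; only the fcp line comes from this FAQ.
sample = "stager station_prd vX\nfcp station_prd v1_5b"
stopped = set_servers_enabled(sample, {"fcp"}, enabled=False)
print(stopped)
print(set_servers_enabled(stopped, {"fcp"}, enabled=True))
```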


To restart the clued0 SAM station (email from Robert Illingworth)
Login as user "sam" to lagan-clued0.fnal.gov, then do the following:
> setup sam -q station_prd
> ups update sam_bootstrap

Contributed by:
system

Updated by:
filthaut

    7) How do I upgrade my station?
You may have been told to upgrade to a specific version; otherwise choose the current version from upd. First run "upd install sam_station vx.x.x". Then edit the file ~sam/private/{Node}_server_list.txt and replace the old version name with the new one in the station, fss and stager entries. If there is a remote stager on d0mino, ask the sam admin to upgrade it. Finally, plan for a few minutes of downtime and run "ups update sam_bootstrap".
Contributed by:
system

    8) How do you use the Sam FAQ page?
For the most part, we hope that it is fairly self-evident. But here are a few things to keep in mind when adding/updating FAQs:
  • To edit and add something to the FAQ page, go to http://d0db-prd.fnal.gov/sam_faq/cgi/faq.py and click on the "editable" link on the top of the page. You need d0ofprd1 username/password.
  • This is, ultimately, displayed on the web, so be careful with your "<" and ">" characters.
  • You don't have to use HTML markup language, but in the case of long answers it would make them easier to read. One of the easiest markups to use, especially when giving code examples, is
    <pre>
       pre-formatted text goes here
    </pre>
    
  • You can always point to other SAM Documentation using hyperlinks.
Good luck, and please contribute lots of useful FAQs!
Contributed by:
system

Updated by:
yann

    9) How do I register a new station?
You need to know the station name, the universe (dev/prd), and the username(s) of the station admin(s). Then run samadmin add station, giving a meaningful description such as "CSE department cluster". If necessary, register the node(s) on which the station will run with samadmin add node. Finally, you may change the monitor level with samadmin modify station --monitor-level ...; the level is usually "normal" for off-site stations.
Example: 
> samadmin add station --admin=sosebee,tomw --desc="CSE dept cluster" \
--name=uta-cse --connect=gris/*****@d0ofprd1 
Station uta-cse has been registered: id = 281
Monitor Level: normal
Life Cycle: active
Admins: tomw, sosebee
Desc: CSE department cluster
> samadmin add node --hw=pc --name=cse000.uta.edu --os=linux \
--connect=gris/*****@d0ofprd1
Node cse000.uta.edu has been registered:
id = 1191, os_type = linux, hardware_name = pc
Contributed by:
system

Updated by:
vesna

    10) How do I update the sam_user and sam_admin documentation?
You need write access to olscvs (send mail to helpdesk@fnal.gov). Then do 'setup olscvs' followed by 'cvs co sam_user' (or 'cvs co sam_admin'). Look in the 'sam_user/src/python' directory for the Python file containing the command you wish to modify. For example, to modify the sam locate documentation you can do 'grep "verb.*locate" *.py'; including "verb" in the pattern ensures that you find the file holding the locate documentation. Edit the file and add or modify the text in "'helpText' : 'help text goes here'," (note the trailing ","). Commit your changes with 'cvs ci -m "explanatory comment here" whatever.py'. See also http://d0db-prd.fnal.gov/sam/sam-shift-guide.html#documentation.
Contributed by:
system

Updated by:
yann

    11) Why is the Clued0 station unable to get files from d0mino, and what should I do about it? (Now obsolete)
As of 30 August 2004 clued0 no longer uses fcp for transfers outside the station. The fcp daemon on d0mino is no longer useful.

For some reason, the fcp daemon on d0mino loses its configuration file, which means it can't start and transfer files. When this occurs, the Clued0 SAM station will stop receiving files from d0mino. To fix the configuration file and restart the daemon, do the following:

1) Log into d0mino as sam
2) If /home/sam/private/conf/fcp_root/cfg/fcp.cfg exists, then fcp is probably 
ok and no further action is needed.
3) If /home/sam/private/conf/fcp_root/cfg/fcp.cfg does not exist, restore it 
from a backup by copying /home/sam/backup/fcp_root/cfg/fcp.cfg 
4) Now you must restart the FCP daemon. Go to /home/sam/private
5) Edit the REMOTE-STAGER_server_list.txt.
6) Comment out the line reading "fcp station_prd v1_5b" (the version may
vary) by putting a "#" at the front of the line. Save the file.
7) Do "setup sam -q remote-stager"
8) Do "ups update sam_bootstrap"  (This stops the fcp daemon)
9) Go back to the fcp line in REMOTE-STAGER_server_list.txt and remove the "#" 
you inserted in step 6. Save the file.
10) Do "ups update sam_bootstrap" (This restarts the fcp daemon)
11) Verify that all is well with "setup fcp; fcps d0mino". You should see the 
heading of a table (and perhaps nothing else). If you see "Cannot connect to 
fcpd" then something is still broken.


Contributed by:
system

Updated by:
illingwo

    12) How to fix files that were submitted with incorrect metadata
File parameters can be updated with the sam update file parameters command. There are also commands to update the file CRC or size. Anything else requires an expert.
Contributed by:
system

Updated by:
illingwo

    13) How do I add a new mass storage system location to SAM?
Use the samadmin add disk location command, for example:
samadmin add disk location --mountPoint=cchpssd0.in2p3.fr:/hpss/in2p3.fr/group/d0 \
--relativePath=grid2/upload --connect=user/password@d0ofprd1
The above is possible only because "cchpssd0.in2p3.fr:/hpss/in2p3.fr/group/d0" already exists. Otherwise, you need to run samadmin add data disk first.
Contributed by:
system

Updated by:
bellavan

    14) How to resize a station cache disk (Updated on 05.16.07)
  1. Uncache all files from the station disk using the samadmin Uncache Station File command.
  2. Restart the station.
  3. Remove the station disk using the sam Remove Station Disk command.
  4. Add the station disk with the correct size using the sam Add Station Disk command.

If uncaching files is not desirable (in order to preserve files on disk), then somebody from the SAM team has to change the disk size in the database manually using the Resize Station Disk command. This also requires a station restart.

Example for SAM team:

<SAM-d0ora2%> samadmin resize station disk --mount=sam2.farm.particle.cz:/cache/1 \
--size=1900000000KB
Enter user/passwd[@db] connection string:
You need to have an ORACLE account user/password for d0ofprd1. The result of this command will be as follows:
Disk size for sam2.farm.particle.cz:/cache/1 changed to 1.77TB

Note: (From IT#1291) Resizing the disk in the DB (experts only) will preserve the files. However, if you are shrinking the disk you should make sure that the current used space is less than the desired capacity before changing it.
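
As a sanity check on the numbers in the example, the reported 1.77TB is consistent with converting 1900000000KB using 1024-based units. The Python sketch below reproduces that conversion and the shrink check from the note. The binary-unit interpretation is an inference from the example output, and the function names are hypothetical.

```python
KB_PER_TB = 1024 ** 3  # assuming 1 TB = 1024 GB = 1024**3 KB


def kb_to_tb(size_kb):
    """Convert a size in KB to TB using 1024-based units."""
    return size_kb / KB_PER_TB


def safe_to_shrink(used_kb, new_size_kb):
    """True if the space currently in use is less than the desired capacity."""
    return used_kb < new_size_kb


print(f"{kb_to_tb(1900000000):.2f}TB")
# Illustrative usage figure only: 1.5 billion KB in use, shrinking to 1.9 billion KB.
print(safe_to_shrink(used_kb=1500000000, new_size_kb=1900000000))
```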

Contributed by:
system

Updated by:
kinyip

    15) One of the nodes on the clued0 station has an INACTIVE disk. What can I do?

Log into that particular node and check whether servers are running ("ps -fu sam"). If there are no sam servers, start them using the following procedure:

  • $ setup sam -q worker_prd
  • $ ups restart sam_bootstrap

After executing the "restart" command, the output of the "ps -fu sam" command should look something like this:

  • $ ps -fu sam
  • UID PID PPID C STIME TTY TIME CMD
    sam 4808 1 0 Apr01 ? 00:00:00 /bin/sh /D0/ups/sam_bootstrap/NU
    sam 4968 4808 0 Apr01 ? 00:00:00 /bin/sh -f /D0/ups/fcp/v1_5b/bin
    sam 4970 4968 0 Apr01 ? 00:00:00 python /D0/ups/fcp/v1_5b/bin/fcp
    sam 4977 1 0 Apr01 ? 00:00:00 /bin/sh /D0/ups/sam_bootstrap/NU
    sam 5245 4977 0 Apr01 ? 00:00:17 stagerng start --station=clued0
    sam 3250 3249 3 08:35 pts/0 00:00:00 -bash
    sam 3304 3250 0 08:35 pts/0 00:00:00 ps -fu sam
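
Checking such a listing for a running stager can be sketched in Python. The sketch assumes the eight-column "ps -f" layout shown above (UID PID PPID C STIME TTY TIME CMD) and that the stager command starts with "stagerng", as in the sample. The function names are hypothetical.

```python
def sam_commands(ps_output):
    """Return the CMD column of every process owned by user sam."""
    commands = []
    for line in ps_output.splitlines():
        # Split into at most 8 fields so that CMD keeps its own spaces.
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[0] == "sam":
            commands.append(fields[7])
    return commands


def stager_running(ps_output):
    """True if one of sam's processes is the stager."""
    return any(cmd.startswith("stagerng") for cmd in sam_commands(ps_output))


listing = """UID PID PPID C STIME TTY TIME CMD
sam 4977 1 0 Apr01 ? 00:00:00 /bin/sh /D0/ups/sam_bootstrap/NU
sam 5245 4977 0 Apr01 ? 00:00:17 stagerng start --station=clued0
sam 3304 3250 0 08:35 pts/0 00:00:00 ps -fu sam"""
print(stager_running(listing))
```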

If servers are running, and the disk is inactive, check whether that particular disk and the directory "boo" under it actually exist and belong to "sam", not "root". If the ownership is not right, ask clued0-admin@fnal.gov to correct it. Then, use the command "sam remove disk" to remove the disk first. If it complains that the disk is "in use", use the command "sam disallow disk" and then "sam remove disk". After that, follow this FAQ to revive the disk and node to be ready for SAM.

You should also remember to execute "sam allow disk" if you ran "sam disallow disk".

If all this still fails, send mail to d0sam-admin@fnal.gov with the problem description.

Contributed by:
system

Updated by:
bellavan

    16) Why is my python not installed correctly and how do I fix this?
I am attempting to follow the instructions listed at

http://d0db.fnal.gov/sam/doc/install/

The first time through, the install added a number of packages and then failed. The output from a subsequent attempt was as follows:

d0az02> setup upd
d0az02> upd install -G -c sam
informational: sam_common v4_6_0_10 already exists on local node, skipping.
informational: fnorb v1_1b_8 already exists on local node, skipping.
informational: tcl v8_3_1 already exists on local node, skipping.
informational: tk v8_3_1 already exists on local node, skipping.
informational: blt v2_4u already exists on local node, skipping.
informational: python v2_1 already exists on local node, skipping.
informational: sam_user v4_2_11 already exists on local node, skipping.
failed to run 'ups declare  sam_config v4_2_20 -z /fnal/ups/db -U ups -m
sam_config.table -f NULL -r sam_config/NULL/v4_2_20 -M ups -H Linux+2.4'
error output is:
ERROR: Found no match for product 'python'
ERROR: Action parsing failed on "setupRequired(python)"
ERROR: Found no match for product 'python'
ERROR: Action parsing failed on "setupRequired(python)"
WARNING: UPS_SHELL not set using SHELL or default, value = (unknown)
upd install failed.

Here is the solution:

Please do "ups list -aK+ python" and see if it is declared as current. If not:
ups undeclare python v2_1 -f NULL
ups declare -c python v2_1 -f Linux+2.4

If python isn't listed at all then do instead
upd install python v2_1 -G -c

Contributed by:
system

Updated by:
yann

    17) How do I deal with a "special" sam user, like a group or new name for user with existing sam entry?
A few options:

  1. Use new grid subject to person mapping, which should be working in production.
    It needs sam_common v5_0_1_3 and kx509 v2.0. Then on a machine with the FNAL kerberos client (preferably the one where sam commands are executed at NPACI) a valid fermi user does:
         $kinit
         $kx509
         $klist -p
    
    This results in a file /tmp/x509up_u$(UID).
    If this file is created on (or simply copied to) any machine that needs to use sam client commands, sam resolves the user as the owner of the Kerberos principal that executed kinit.
    There is currently no need for the time-limited proxy to be valid (non-expired), so there is no need to keep repeating this procedure.
  2. One can use the sam_admin add user command to register the "special" user.
Contributed by:
system

Updated by:
filthaut

    18) What network ports are used by sam?
The ports are obtained for prd, int, and dev environments as follows
<d0ora1> ups inquire sam_config -q prd

SAM_BBFTP_SOCKET=14021
SAM_LOG_SERVER_ADDR=d0db.fnal.gov:40583
SAM_NAMING_SERVICE=d0db.fnal.gov:9010

<d0ora1> ups inquire sam_config -q int

SAM_BBFTP_SOCKET=14023
SAM_LOG_SERVER_ADDR=d0db-int.fnal.gov:45583
SAM_NAMING_SERVICE=d0db-int.fnal.gov:9005

<d0ora3> ups inquire sam_config -q dev
SAM_BBFTP_SOCKET=14022
SAM_LOG_SERVER_ADDR=d0db-dev.fnal.gov:30583
SAM_NAMING_SERVICE=d0db-dev.fnal.gov:9000

None of the other centralized servers (various db servers and optimizer) have fixed ports.

All ports are TCP, except for the log server, which uses UDP.
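
When preparing firewall rules it can help to pull the port numbers out of the listing programmatically. The Python sketch below parses the KEY=value lines shown above. It assumes the relevant lines have exactly that form, and the function names are hypothetical.

```python
def parse_sam_config(output):
    """Collect SAM_* settings from 'ups inquire sam_config' style output."""
    settings = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.startswith("SAM_"):
            settings[key] = value
    return settings


def port_of(value):
    """Extract the port from either 'host:port' or a bare port number."""
    return int(value.rsplit(":", 1)[-1])


prd = """SAM_BBFTP_SOCKET=14021
SAM_LOG_SERVER_ADDR=d0db.fnal.gov:40583
SAM_NAMING_SERVICE=d0db.fnal.gov:9010"""

for name, value in parse_sam_config(prd).items():
    print(name, port_of(value))
```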

In general, if the gateway node is behind a firewall, ports > 1024 should be unrestricted to d0ora2.fnal.gov, d0ora3.fnal.gov, and d0mino.fnal.gov. We are not enforcing optimizers to run on any particular port. There is no default port used by the sam_station, it is whatever orbacus grabs from the system. One could force orbacus to run at a specific port using

 -OAport 35441 
in the server options list, but that is up to the station administrator.

If you use the above option, and allow incoming traffic to port 35441, you should be ok as far as station is concerned.

However, there are other callbacks used by the system (e.g., for storing files). These may not work any more.

Do not limit outgoing traffic, because this would break all calls to the sam db servers.

Additional ports for JIM- Globus & Condor-G

Contributed by:
system

Updated by:
yann

    19) How can I disallow SAM from using a disk?
Sometimes SAM will be convinced a file that it needs is on a particular disk, even though that disk (or the machine) is dead or is no longer running a stager. Projects may time out due to this confusion. You can tell SAM to disallow use of that disk so it will look elsewhere. The syntax is:
sam disallow disk --mount=NODE:MOUNT_POINT
for example:
sam disallow disk --mount=argus-clued0.fnal.gov:/sam
or
sam disallow disk --mount=d0cs010.fnal.gov:/sam/cache --station=cab
(Note: If you want to disallow a disk on the CAB station you need to add the option '--station=cab' or have issued 'setup sam -q cab' before.)

You can easily determine the name of the mount point by doing "sam dump station --disks". Look up the machine and the disk in the list (not all disks are /sam; some are /sam/cache or /samcache).

You can also match the disk number against the station log files to determine which disk is dead or having problems.

If you get an error message while disallowing the disk that says the disk is in use, then the station is confused and needs restarting. Once the station is restarted, you should be able to disallow the disk.

Note that once a disk is disallowed, it can be "reallowed" by doing "sam allow disk" with the same option syntax.

Contributed by:
system

Updated by:
yann

    20) Why does the command sam configure abort when I try to modify or add a station group ?
The DbServer method assumes that the list of administrators is always given with --admin in the sam configure command, so the command aborts when --admin is omitted. This will remain the case until the DbServer rewrite is complete and the bug is fixed.

Example : to increase the max projects of the group dzero on manhep station :

  > sam configure group --station=manhep --group=dzero \
    --max-projects=20 --admin=sabah,sam,walker
Contributed by:
system

Updated by:
lima

    21) How to fix "Delivery is not possible to node d0csNNN.fnal.gov" error
A CAB user sends e-mail saying that his/her job got an error like the following:
%ERLOG-e SAM: STAGER:
        Stager error caught in SAMManager::establishProcess()!
        Delivery is not possible to node d0cs140.fnal.gov!
        Contact sam-users@fnal.gov! 
This error means that a SAM process has landed on a node that SAM has been told to disallow, so SAM neither delivers files to it nor fetches files from it. This happens when the disallowed disks in SAM don't match the disallowed nodes in the batch system. See the answers to "How can I check if the CAB batch system (PBS) is in synch with SAM?" (FAQ 24) and "How to re-enable a CAB stager" (FAQ 25) for what to do about it.
Contributed by:
system

Updated by:
yann

    22) How can I log in as user 'sam'?
You need to have a root kerberos principal USERNAME/root@FNAL.GOV (and the corresponding entry in ~sam/.k5login). On most systems (but not clued0) you need to explicitly use the '-F' option: 'kinit -F USERNAME/root'

After obtaining the ticket you can log in as user sam by e.g. 'ssh sam@d0mino'.

Contributed by:
system

Updated by:
nunne

    23) Verifying the Enstore information as told by "Output from "cron" command"

First of all, this is a cron job set up by the SAM group, not the Enstore group.

This webpage http://d0db.fnal.gov/sam_local/PlotsAndStats/EncpVolumeStatus/ is also maintained by the SAM group. However, if you click on the individual volume on the above webpage, you would go straight into the Enstore group's webpage. There, you can check the status of a volume (tape) based entirely on the Enstore group's information. Look at the line of "system_inhibit". If something is wrong, you would see "noaccess" or "notallowed". Otherwise, the first status word says 'none'.

People outside the fnal.gov domain can check this link, which is regenerated every hour. For an individual volume, the static inventory files, also regenerated every hour, are available here.

You can also do the following on d0mino or clued0:

setup encp

enstore volume --vol=PRK967L1

(remembering that the label, such as "PRK967L1", is in capital letters) to check the status of one tape.


Since mid-Sept. 2003, all the tapes which were known to be "on-shelf" have been ignored in the cronjob.

  • If the email from the cronjob says that a tape is "deleted", it is most likely a network problem.

    Usually, if it's a short term network glitch, it'd be put back to the normal status automatically next time the cronjob is run (ie. you don't need to do anything). Otherwise, you may go to

    $HOME/EncpStats/Production/VolumeStatus in d0ora2 or

    $HOME/EncpStats/Integration/VolumeStatus in d0ora2 or

    $HOME/EncpStats/Development/VolumeStatus in d0ora1

    to find the file "restorePreviousStatus" and use it to mark the tape back to normal. The content of the email would indicate exactly which one of the above 3 directories to go to.

  • To make sure that no NEW tape has been put on the shelf, use the above command "enstore volume --vol=xxxxxx" to check. If the "library" flag shows something like "shelf" and/or the "comment" flag shows "Removed", then this tape has recently been put on the shelf.

    If so, one should append this tape to the file "ignoredVolumes.txt" under the same sort of directory $HOME/EncpStats/xxx/VolumeStatus .

    Contributed by:
    system

    Updated by:
    kinyip

        24) How can I check if the CAB batch system (PBS) is in synch with SAM?

    On a clued0 machine, run the commands
    setup sam_shifter_utils
    checkStationNodes --ping-down --inactive --sam-station=fnal-cabsrv1 \
    --pbs-server=d0cabsrv1
    
    or similar for cabsrv2. checkStationNodes --help explains the command line options.

    This lists all nodes which are disabled in PBS or SAM or both.

    There is a cron job running periodically which reports on any differences and attempts to automatically declare disks bad as required.

    The important PBS states are:

    • offline - manually set by an admin to take the node out of the batch system, for example because it's broken
    • down - automatically set by the pbs server because it has lost touch with the node.

    The --ping-down option pings each down node to see if it responds. A non responsive down node is no problem for SAM, but a pingable down node may be in a confused state and cause problems.

    If the node in question does not appear as inactive or down, the problem was probably due to a network glitch.

    If a node is INACTIVE in sam, but alive in PBS then the stager process may not be running on the node. The cron job will attempt to restart the stager if it finds a node in this state, but this often fails. If the auto restart does fail then log into the node (as yourself, not sam), do 'ps -fu sam' and look for a process named 'stagerng'. If it is not there then restart the stager (see this FAQ). If the stager process is there then you can try restarting it anyway. If restarting the stager does not fix the INACTIVE state, then contact an expert. If you can't log into the node at all then it is probably a hardware problem, and you should report it to helpdesk@fnal.gov.

    If a node is DISALLOWED in sam, but alive in PBS then the disk should be reenabled in sam with 'sam allow disk' before somebody's batch job ends up on the node and dies because there's no stager running there.
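
The rules in this answer can be summarised as a small lookup, sketched here in Python. It is only a memory aid: the function name and the returned strings are hypothetical, and treating every PBS state other than "offline" and "down" as "alive" is an assumption of this sketch.

```python
def advise(sam_state, pbs_state, pingable=False):
    """Suggest a shifter action from the SAM disk state and the PBS node state.

    sam_state: e.g. "INACTIVE", "DISALLOWED", or anything else for a healthy disk
    pbs_state: "offline", "down", or any other value (assumed to mean alive)
    pingable:  result of pinging the node; only relevant for down nodes
    """
    if pbs_state == "down":
        if pingable:
            return "investigate: pingable down node may be confused"
        return "no action: unresponsive down node is no problem for SAM"
    if pbs_state == "offline":
        return "no action: node was taken out of the batch system by an admin"
    if sam_state == "INACTIVE":
        return "restart the stager"
    if sam_state == "DISALLOWED":
        return "run sam allow disk"
    return "no action: probably a network glitch"


print(advise("INACTIVE", "alive"))
print(advise("DISALLOWED", "alive"))
print(advise("INACTIVE", "down", pingable=True))
```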

    (To get the pbs node status directly, issue the following command:

    clued0: "pbsnodes -l -s d0cabsrv1"
    
    )
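
    The PBS/SAM state rules in this answer can be summarised as a small lookup. The Python sketch below is only an illustration: the function name triage_node, the "free" state used for a healthy PBS node, and the wording of the returned advice are assumptions, not part of checkStationNodes or of SAM.

```python
def triage_node(pbs_state, sam_state, pingable=None):
    """Suggest a next step for one CAB node (illustrative helper only).

    pbs_state: "offline" or "down"; anything else is treated as alive.
    sam_state: "INACTIVE", "DISALLOWED", or anything else for a healthy disk.
    pingable:  True/False result of pinging a down node, None if not checked.
    """
    if pbs_state == "down":
        if pingable is None:
            return "ping the node first (see --ping-down)"
        if pingable:
            return "pingable but down in PBS: node may be confused, investigate"
        return "down and not responding: no problem for SAM"
    if pbs_state == "offline":
        # Assumption: an admin took the node out on purpose.
        return "set offline by an admin: leave it to the admins"
    # From here on the node is alive in PBS.
    if sam_state == "INACTIVE":
        return "check for the stager process and restart it"
    if sam_state == "DISALLOWED":
        return "re-enable the disk with 'sam allow disk'"
    return "no action: probably a network glitch"
```

    For example, triage_node("free", "DISALLOWED") points at 'sam allow disk', which matches the advice given above for a DISALLOWED node that is alive in PBS.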
    Contributed by:
    system

    Updated by:
    lima

        25) How to re-enable a CAB stager
    1. If needed then allow the disk... (Disallowed disks show up as DISALLOWED in the station dump for recent station releases). To do this log into d0mino as user sam. Do setup sam -q cab. Then do
      sam allow disk --mount=d0csNNN.fnal.gov:/sam/cache
      where NNN is the number of the CAB node in question. You should get an OK response.
    2. Restart the stager on the node so that it registers with SAM...
      • Log into the CAB node as user sam.
      • Do setup sam -q cab [ For d0srv069 (which is special as it is used for recocert), do "setup sam -q d0srv069". For nodes which are used for storing files (which you can find with "sam dump fss"), do "setup sam -q cabsrv_store". On the cabsrv1 node do setup sam -q cabsrv1_worker instead ]
      • Do ups restart sam_bootstrap
      • Verify that the stager is running by doing ps -fwwu sam. You should see a process that begins with stagering start.
    3. Verify that the node is allowed by doing (on any SAM aware machine) sam dump station --disks --station=cab | grep d0csNNN where NNN is replaced by the CAB node number. If you see INACTIVE then something bad happened. Ask the experts for help. If you didn't see that, then all is well.
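
    The grep in step 3 can also be done on saved station dump output. This Python sketch assumes only what the step above states: that the line for a node contains the node name, and contains the word INACTIVE or DISALLOWED when the disk is in that state. The function name and the sample text are made up for illustration; the real dump layout may differ.

```python
def node_disk_status(dump_text, node):
    """Return 'INACTIVE', 'DISALLOWED', 'ok' or 'not found' for one node."""
    # Same selection as: ... | grep <node>
    lines = [ln for ln in dump_text.splitlines() if node in ln]
    if not lines:
        return "not found"
    for state in ("INACTIVE", "DISALLOWED"):
        if any(state in ln for ln in lines):
            return state
    return "ok"


# Hypothetical dump fragment, not real station output.
sample = (
    "d0cs001.fnal.gov:/sam/cache 100 GB\n"
    "d0cs002.fnal.gov:/sam/cache INACTIVE\n"
)
print(node_disk_status(sample, "d0cs002"))
```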
    Contributed by:
    system

    Updated by:
    kinyip

        26) How to fix "Delivery is not possible to node xxxx-clued0.fnal.gov" error ?
    If you get the error message:
    > Stager error caught in SAMManager::establishProcess()!
    >          Delivery is not possible to node aldan-clued0.fnal.gov!
    
    Then the stager is not functioning correctly on that node. See
    FAQ #16 for instructions on how to check on its status and restart it if needed.
    Contributed by:
    system

    Updated by:
    kinyip

        27) How is the intermediate cache cleaned up, for example MC files coming from the remote processing centers that are forwarded through D0mino?
    sam_admin includes a command that cleans the remote station cache areas and stores to Enstore any files that have not yet been stored there. A daily cron job invokes the command.

    Once that's done - under normal conditions the cache clean up would take place.
    "Clean up" here also means to transfer the files stuck in the intermediate cache area to Enstore.

    To manually do it:
    From d0mino as "sam" ( because the command also does rm in the cache area)

    % setup sam_admin   (or -q prd sam_admin)
    
    To clean all the cache areas :
    
       % samadmin clean station cache --all
    
    To clean a specific cache area  :
    
       % samadmin clean station cache --station=     
       e.g. samadmin clean station  cache --station=hoeve
    or
    
       % samadmin clean station cache --cache=/sam/remote/
       e.g. samadmin clean station  cache --cache=/sam/remote/hoeve
    
    The samadmin code constructs its list by looking at the /sam/remote area on d0mino, so the valid values for --station or --cache are the names in the /sam/remote directory:

    D0Mainz            clued0           hoeve              manhep               tata-d0-mcfarm  uta-hep
    azsam1             csuf_hep2-station imperial-lesc     munich               triviaal        wuppertal
    budzero            d0-umich         in2p3              nijmegen             ucr-analysis
    cab                d0aachen         indianauniversity  ouhep                umdzero
    caps10             d0indiana        karthur            p840.phys.sfu.ca     uta-analysis
    central-router     d0karlsruhe      lancs              prague-test-station  uta-cse
    cinvestav-station  d0nevis-station  luhep              princeton-d0         uta-d0-grid

    Remember that files that were NOT already stored to Enstore will be transferred to Enstore.
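
    Since the valid names are just the directory names under /sam/remote, a wrapper script can check a name before calling samadmin. The following is a minimal Python sketch under that assumption; the helper names valid_station_names and clean_command are hypothetical, and only the --station form shown above is built.

```python
import os


def valid_station_names(remote_root="/sam/remote"):
    """Names accepted by --station: the subdirectories of remote_root."""
    return sorted(
        name for name in os.listdir(remote_root)
        if os.path.isdir(os.path.join(remote_root, name))
    )


def clean_command(station, remote_root="/sam/remote"):
    """Build the clean command for one station, refusing unknown names."""
    if station not in valid_station_names(remote_root):
        raise ValueError("no such cache area: %s" % station)
    return ["samadmin", "clean", "station", "cache", "--station=" + station]
```

    The returned list can be handed to subprocess.run by whoever runs the script as user sam.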
    Contributed by:
    system

    Updated by:
    yann

        28) Project on clued0 is getting all file delivery errors (see SAM disaster #3 at http://www-clued0.fnal.gov/~sam/samTV/disasters).

    You look at samTV and see a project on clued0 getting all file delivery errors. A newer problem that causes this symptom occurs when either flotsam-clued0 or sambar-clued0 fails to mount /pnfs/sam/dzero. That failure will cause all enstore transfers through that machine to fail. If you look at the project on Disaster #3 listed above, you see that all of the files were routed to node flotsam-clued0. This is your clue that something is wrong with flotsam. Looking at the clued0 log file, you'll see:

    10/20/03 11:16:26clued0.SM.clued0 24740: File 
      reco-p14.02.00_UTA-Team_bphysics_mcp14_hepfm007.uta.edu_8670_03277095151
      has been delivered 
    enstore:/pnfs/sam/dzero/db2/monte_carlo/phase14/mcc99/reco/all_4->
    clued0:flotsam-clued0.fnal.gov:/samcache/boo
    10/20/03 11:16:26clued0.SM.clued0 24740: Delivery status: Simple Status:
      Code: delivery error (Category SAM Internal)
      Severity level: ERROR
      Generated on 20 Oct 11:16:26 by eworker
      In the context: executed process samcp 
       enstore:/pnfs/sam/dzero/db2/monte_carlo/phase14/mcc99/reco/all_4(prn146l1.27)
       /reco-p14.02.00_UTA-Team_bphysics_mcp14_hepfm007.uta.edu_8670_03277095151
       clued0:flotsam-clued0.fnal.gov:/samcache/boo, result: EXIT CODE: 256 STDOUT: 
       INFILE=/pnfs/sam/dzero/db2/monte_carlo/phase14/mcc99/reco/all_4
       /reco-p14.02.00_UTA-Team_bphysics_mcp14_hepfm007.uta.edu_8670_03277095151
       OUTFILE=
       FILESIZE=0
       LABEL=
       LOCATION=
       DRIVE=
       DRIVE_SN=
       TRANSFER_TIME=0.00
       SEEK_TIME=0.00
       MOUNT_TIME=0.00
       QWAIT_TIME=0.00
       TIME2NOW=0.00
       STATUS=USERERROR
    
       STDERR: ENOENT: [ ERRNO 2 ] No such file or directory: 
       /pnfs/sam/dzero/db2/monte_carlo/phase14/mcc99/reco/all_4
       /reco-p14.02.00_UTA-Team_bphysics_mcp14_hepfm007.uta.edu_8670_03277095151
       , method name: samcp
       Recommended action: Please contact sam-admin@fnal.gov
    

    The No such file or directory is another clue that something is wrong with the pnfs space on that machine.

    To further investigate, log into flotsam-clued0 (or sambar-clued0 as applicable) and do ls /pnfs/sam/dzero. If the directory is empty, then the mount failed (if it's not empty, see below). Unfortunately, you need a clued0-admin to fix this for you. Please send the following message to clued0-admin@fnal.gov:

    Hi Clued0-admin,
    
    Can an admin please log into flotsam-clued0 (or sambar-clued0) as root and do
    "mount -a". A directory necessary for SAM station operation failed to mount and
    is causing project failures. Once "mount -a" is done, things will work again.
    This is urgent as all SAM projects asking for tapes will fail until this is
    fixed. Thanks! SAM loves clued0!
    

    Be sure to choose flotsam or sambar as appropriate.

    Then send mail to the poor project owner to say that he/she will have to resubmit the job.

    If the /pnfs/sam/dzero directory has subdirectories, then something else is going on. Maybe the tape the person wants is NOACCESS (this should be obvious from the log file). Maybe fcp on d0mino is down (see this FAQ).
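
    The "ls /pnfs/sam/dzero" check can be scripted. This is a small Python sketch of the same test (empty directory means the mount failed, subdirectories present means look elsewhere); the function name pnfs_mount_state and its return strings are illustrative only.

```python
import os


def pnfs_mount_state(path="/pnfs/sam/dzero"):
    """Classify the pnfs mount point the way the text above does."""
    if not os.path.isdir(path):
        return "missing"
    if not os.listdir(path):
        # Empty mount point: the mount failed, an admin must run "mount -a".
        return "empty"
    return "populated"
```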

    Contributed by:
    system

    Updated by:
    lima

        29) Why can't I "samadmin cache file" and why did SAM tell me that I had insufficient privilege ?
    Because one needs a special username/password to do so. A shifter's username/password is not "privileged" enough.

    One possible username/password combination can be found by looking at the file ~sam/private/conf/dbserver/central_analysis_prd_config.py in sam@d0ora2.fnal.gov. Look under "username/password" in the above python script. Use this combination to "samadmin cache file".

    ============= More details : ================

    It takes different oracle permissions to write into different tables, and/or to do different things.

    The "SAMSHIFTER" role is granted to all of the sam_shifters, and is the "role" one has when one uses, eg., kinyip/pwd@d0ofXXX1. It can do many, but not all, of the things that other "roles" can do.

    The usernames/passwords listed in the dbconfig files are allowed to do EVERYTHING (these are the accounts under which the dbserver itself runs). In particular, the oracle command to get the next available cached_file_id cannot be run by (e.g.) kinyip/pwd@d0ofXXX1, but can be run by the dbserver accounts. So when you cache the file with the username/password listed in the dbconfig files, you are using a VERY privileged account, under which the command works.

    Contributed by:
    system

    Updated by:
    yann

        30) How do I set up a local naming service? The compute nodes at my site do not have any network connectivity to the outside world (they are on a restricted private network)
    One has to do something like the following:
    
    1) create a new sam configuration using "ups tailor sam_config" with a
    qualifier local_ns (for example), that has  a single environment
    variable
    
    SAM_NAMING_SERVICE=192.168.0.11:9010
    
    
    2) install the  orbacus
    upd install orbacus v3_3_4r -q GCC-2.95.2
    
    3) insert the following in your server list file:
    
    nameservice local_ns v3_3_4r -ORBtrace_level 5
    
    4) start the servers, and look into its trace file (in directory
    ~sam/private/nameservice____local_ns)
    
    it should have something like
    
    d0ora1> cd ~sam/private/nameservice__d0ora1__ns_dev/
    d0ora1> head -1 trace
    IOR:000000000000002a49444c3a6f6f632e636f6d2f436f734e616d696e672f4f424e616d696
    e67436f6e746578743a312e3000000000000001000000000000002c000100000000001064306f
    7261312e666e616c2e676f7600232876b80000000c4e616d655365727669636500
    d0ora1>
    
    This is the ior string that has to go into SAM_NAMING_SERVICE_IOR
    variable, as well as into FNORB_NAMING_SERVICE on your worker nodes
    
    5) create configuration for the local logger, say local_log, which has
    to have only one variable
    
    SAM_LOG_SERVER_ADDR=:
    
    6) install the logger v4_2_0
    
    upd install sam_logger v4_2_0 -q GCC-2.95.2
    
    
    7) insert the line
    
    logger local_log v4_2_0 --stdout=no --info=/dev/null
    
    in your server list file
    
    8) create new sam configuration for the worker nodes, say worker_prd by
    cloning your prd configuration, and modify the SAM_NAMING_SERVICE_IOR,
    FNORB_NAMING_SERVICE, as well as
    SAM_LOG_SERVER_ADDR with local values.
    
    9) for the prd configuration on the station node, you have to add new
    SAM_NAMING_SERVICE_IOR_1 variable which points to the local naming
    service. In this way, all your C++ servers talk to 2 naming services:
    the production one and the local one.
    
    Contributed by:
    system

    Updated by:
    yann

        31) D0ora2 (alias d0db-prd.fnal.gov) unaccessible from offsite?
    On 16 Dec 2003 a nasty problem affected d0ora2: a network fault made it impossible to access the SAM web pages, SAAG, etc. Because d0ora2 is a critical point, it is important to diagnose problems and alert people quickly. The first and simplest thing to do is to try to connect to d0ora2 and check whether it is in a strange state. If everything looks reasonable, test whether the problem comes from the network (in that case it did!) by running "ping -R" or traceroute (traceroute d0db.fnal.gov, for example) from offsite to see where the network break might be occurring. Try "sam locate foo" to see whether it works. Finally, inform d0db-support@fnal.gov and/or page D0primary (d0-primary@fnal.gov) as soon as possible. The helpdesk (helpdesk@fnal.gov) can be asked (during duty hours) to page the person needed for a hardware intervention.
    Contributed by:
    system

        32) Which log file should be looked at for problems related to "sam store" ?

    For problems related to storing files into sam ( sam store ... ), one should look at the file storage server (fss) trace/log files, rather than the station trace/log files.

    The files are usually under the username sam home directory, in the private directory. The file name begins with the "fss", followed by the machine name, followed by the sam_config qualifier, followed by the station name. For example, if a station called "central-analysis" is running on a machine called "d0mino" using the sam_config qualifier "station_prd", then the fss log file would be:

    ~sam/private/fss__d0mino__station_prd__central-analysis

    Entries related to sam store commands should always contain the keyword "samcp".
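
    The naming rule above is mechanical, so it can be written down as a tiny Python sketch. The helper names fss_log_dir and samcp_lines are hypothetical; the only facts used are the "fss", machine, qualifier, station order and the "samcp" keyword described in this answer.

```python
def fss_log_dir(machine, qualifier, station, home="~sam"):
    """Compose the fss log location from its three name components."""
    return "%s/private/fss__%s__%s__%s" % (home, machine, qualifier, station)


def samcp_lines(log_lines):
    """Keep only the log entries that mention samcp (sam store activity)."""
    return [line for line in log_lines if "samcp" in line]


print(fss_log_dir("d0mino", "station_prd", "central-analysis"))
```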

    Contributed by:
    system

    Updated by:
    bellavan

        33) How to fix a "Failed: No files for dataset" error?
    The problem: a user has defined a dataset (usually Monte Carlo files). The
    Dataset Definition Editor (or sam translate constraints) reports a list of files but when trying to use the dataset it reports:
    Failed: No files for dataset.
    
    The reason: most probably the dataset was defined by looking for individual thumbnail files in sam. But those files are not saved on tape, only the merged files are stored.

    The solution: the user should probably redesign his/her query to use other fields, and specify availability_status available. While at it, you should also specify content_status good to ensure that the files have not been flagged as corrupted.

    Example:

    • a non-working dataset definition:
           global.requestid 10752 and file_name %tmb%
    • the correct way to query those files:
           global.requestid 10752
           and data_tier thumbnail
           and availability_status available
           and content_status good
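
    For shifters who answer this question often, the recommended constraint string can be assembled programmatically. This Python sketch simply joins the four clauses from the working example; the function name tier_definition is an assumption and nothing here talks to SAM.

```python
def tier_definition(request_id, data_tier="thumbnail"):
    """Build a dataset definition that selects available, good files."""
    parts = [
        "global.requestid %d" % request_id,
        "data_tier %s" % data_tier,
        "availability_status available",
        "content_status good",
    ]
    return " and ".join(parts)


print(tier_definition(10752))
```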
    Contributed by:
    system

    Updated by:
    lauri

        34) sam store request rejected - dest file already exists?

    First, for STORE problems you should be looking in the fss trace/log, rather than the station logs (see this FAQ).
    In the fss trace:

    PROBLEM with file store, details are below:
    Complex status containing traceback (causal relationship) of:
     Simple Status:
       Code: request rejected (Category User)
       Severity level: ERROR
       Generated on Wed Feb 19 23:02:24 2003 by fss
       In the context: submStore: received incoming call from client
       Recommended action: Please check your data, contact sam-users@fnal.gov
     Simple Status:
       Code: dest file already exists (Category User)
       Severity level: ERROR
       Generated on Wed Feb 19 23:02:24 2003 by fss
       In the context: checkSourceAndDest: local()
       Recommended action: Please check your data, contact sam-users@fnal.gov
       Additional information: Checking file
    enstore:/pnfs/sam/dzero/db2/datalogger/run160000/d0farm/thumbnail/all/
    recoT_all_0000162578_mrg_095-099.raw_p13.06.01
    End of complex status
    

    There are two possible explanations for this. One is that something went wrong with the transfer to enstore and the file did not actually make it to tape, and the other is that the file was correctly stored on tape but the location was not added to the database. To distinguish between the two cases, log into d0mino and do 'ls -l /pnfs/../recoT_all_0000162578_mrg_095-099.raw_p13.06.01'. If the file has zero size the transfer failed leaving the empty pnfs entry behind. To fix this delete the zero sized file with rm (you must be user sam to do this, and be careful you don't accidentally delete the wrong file!), and tell whoever originally submitted the store to submit it again with the '--resubmit' flag.

    If the file has a non-zero size then it has been correctly stored on tape, and you need to manually complete the store by adding the tape location to sam. You can check that this hasn't already been done by running 'sam locate recoT_all_0000162578_mrg_095-099.raw_p13.06.01'. If the resulting list of locations includes one beginning with /pnfs then the location is already there, and you don't need to do anything. If it is not there, do this (from d0mino):

    setup encp -q d0en
    enstore pnfs --xref=/pnfs/sam/.../recoT_all_0000162578_mrg_095-099.raw_p13.06.01
    
    The result of this command tells you which tape this file is actually stored on (PRK798L1 in this example). To add the location use:
    sam add location --file=recoT_all_0000162578_mrg_095-099.raw_p13.06.01
    --loc='/pnfs/sam/.../all,prk798l1'
    
    (Note that the tape label is in lower case in add location, compared with upper case for the enstore output)
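
    The upper/lower case difference is easy to get wrong by hand. A minimal Python sketch of the conversion follows; tape_location is a hypothetical helper that reproduces only the "directory,label" form of the --loc value shown in the example, nothing more.

```python
def tape_location(pnfs_dir, tape_label):
    """Join a pnfs directory and a tape label, lower-casing the label."""
    return "%s,%s" % (pnfs_dir.rstrip("/"), tape_label.lower())


# Label as printed by enstore (upper case) in, --loc value out.
print(tape_location("/pnfs/sam/.../all", "PRK798L1"))
```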

    The above seems incomplete because it doesn't have the "offset" which is shown when you use "enstore pnfs" command. Use Robert's script below.

    One should also check that the CRC is correct by going to the intermediate staging area, eg., /sam/remote/uta-hep/store, and then do

    d0mino> cd /sam/remote/uta-hep/store
    d0mino> setup sam_admin
    d0mino> samadmin check crc   your_filename
    
    If no complaint is given, it means that the CRC is correct.

    One also may have to correct crc for those files. There is samadmin insert crc utility, which can be used with a simple script that determines crc from enstore:

    d0mino> enstore file --bfid=`enstore pnfs
    --xref=/pnfs/sam/dzero/db2/datalogger/run160000/d0farm/thumbnail/all/
    recoT_all_0000162578_mrg_095-099.raw_p13.06.01 | grep bfid | 
    cut -f2 -d':'` | grep complete_crc | cut -f2 -d':' | sed 's/,//g'
     3993077792L
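
    The grep/cut/sed part of that pipeline can be mirrored in Python when the enstore output has already been captured as text. This is a sketch only: the function name complete_crc is made up, and the sample line is a guess at the shape of the output, based solely on what the pipeline above expects (a line containing complete_crc, a colon, the value, and a trailing comma).

```python
def complete_crc(file_info_text):
    """Mimic: grep complete_crc | cut -f2 -d':' | sed 's/,//g'."""
    for line in file_info_text.splitlines():
        if "complete_crc" in line:
            fields = line.split(":")
            if len(fields) > 1:
                return fields[1].replace(",", "").strip()
    return None


# Hypothetical input line; only the CRC value is taken from the example above.
print(complete_crc(" 'complete_crc': 3993077792L,"))
```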
    
    A possible fix: (1) run sam cancel file store that_filename on the node where the "sam store" command was originally submitted; (2) run sam store --resubmit ... (using the original options).

    Robert has a script: ~illingwo/add_pnfs_location.py in clued0 which is very useful (argument = pnfs path).
    Contributed by:
    system

    Updated by:
    kinyip

        35) SAM store problem - duplicate file store request
    There is already an ongoing file store request for this file. If it seems to be stuck then the old store can be stopped with
    (in v7)
    samadmin cancel file transfer request --transferIdentifier=<filename>
    
    OR (in v5)
    sam cancel file store --station=<station where store was submitted> <filename>
    
    After the store has been cancelled, 'sam store --resubmit ..' should work.
    Contributed by:
    system

    Updated by:
    kinyip

        36) How to remove files from the SAM cache?
    It may happen that a file in the SAM cache is corrupted but the file stored on tape is ok. Until the corrupted file is gone from the cache, nobody will be able to access the good file. You'll have to use the
    sam remove replica command. The remove replica command can be run from any machine, not just the one hosting the file you wish to remove. It requires that you are either a station admin or using the sam account.

    First make sure that the file is corrupted:
    Run "sam locate yourfile" to find the node which holds the file, and log into that node.

    setup sam
    sam calculate file crc \
    --path=/sam/cache59/boo/TMBfix-recoT_all_0000169640_mrg_001-015.raw_p14.05.00_p14.fixtmb.02
    
    It should tell you "CORRUPT file". This file is on d0mino, so you'll have to execute the following command:
    samadmin remove station replica --mountPoint=d0mino.fnal.gov:/sam/cache59 \
     --fileName=TMBfix-recoT_all_0000169640_mrg_001-015.raw_p14.05.00_p14.fixtmb.02
    
    If the file has to be removed from cab (instead of the OLD central-analysis), one has to do instead: on fnal-cabsrv2
    samadmin remove station replica --mountPoint=d0csXXX.fnal.gov:/sam/cache \
      --fileName=XXX --station=fnal-cabsrv2
    
    or on cabsrv1:
    samadmin remove station replica --mountPoint=d0YYXXX.fnal.gov:/sam/cache \
      --fileName=XXX --station=fnal-cabsrv1
    
    where YY can be "cs" or "srv".
    
    
    It should tell you "OK" if it worked. You can check it by doing sam locate.
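
    The three command variants above differ only in the mount point and in whether --station is given. The Python sketch below builds the argument list from those pieces; remove_replica_command is a hypothetical helper, and it uses only the options that appear in the examples above.

```python
def remove_replica_command(node, cache_dir, file_name, station=None):
    """Build the 'samadmin remove station replica' argument list."""
    cmd = [
        "samadmin", "remove", "station", "replica",
        "--mountPoint=%s:%s" % (node, cache_dir),
        "--fileName=%s" % file_name,
    ]
    if station is not None:
        # Needed for the cab stations, as in the examples above.
        cmd.append("--station=%s" % station)
    return cmd
```

    Building a list rather than a single string avoids quoting problems with long file names when the command is later passed to subprocess.run.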
    Contributed by:
    system

    Updated by:
    pengj

        37) Where are the FNAL SAM station logs?

    To see the station logs for cabsrv1/2 do the following:

    • Log into d0cabsam1 under the samread account
    • cd ~sam/logs
    • cd to the appropriate station and look at the most recent dated log file (files go sm_log__mm_dd_yy - sm_log__11_12_13 = log for November 12, 2013)
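
    The log file name for a given day follows directly from the mm_dd_yy pattern. A minimal Python sketch, with station_log_name as an illustrative name:

```python
import datetime


def station_log_name(day):
    """Return the sm_log__mm_dd_yy file name for a date."""
    return day.strftime("sm_log__%m_%d_%y")


print(station_log_name(datetime.date(2013, 11, 12)))  # sm_log__11_12_13
```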

    For clued0, log into lagan-clued0 under the sam account and look in

    ~sam/private/logs/station__lagan-clued0__station_prd2__clued0

    Note that stager logs may also be interesting.

    For clued0 the stager logs all live in lagan-clued0.fnal.gov:/home/sam/private/logs. The specific log directory looks like stager__"machine"-clued0_worker_prd2__clued0 where "machine" is the name of the particular machine.

    Contributed by:
    system

    Updated by:
    filthaut

        38) How do I determine how many files have been stored in each pnfs directory since a specific date?
    From any machine that has access to Oracle use sqlplus and log into d0ofprd1's d0read/reader account. From that account run the following query, changing '2003/12/01' (in both places) to the date of your choosing.
    set linesize 132
    column path format a80
    select d.path ,d.files ,d.icreate_date
    from (
         select dsl.path  path
           ,count(df.file_id)    files
           ,to_char(max(dfl.create_date),'YYYY/MM/DD HH24:MI:SS') icreate_date
         from data_storage_locations  dsl ,data_file_locations dfl ,data_files df
         where dsl.location_type = 'tape'
           and dsl.location_id   = dfl.location_id(+)
           and to_char(dfl.create_date,'YYYY/MM/DD HH24:MI:SS') > '2003/12/01 00:00:00'
           and dfl.file_id       = df.file_id(+)
           group by dsl.path
           order by dsl.path
     ) d
    where d.icreate_date > '2003/12/01 00:00:00'
    and d.files > 0
    

    The query returns every pnfs location in which a file has been stored since the requested date, together with the number of files stored there since that date and the timestamp of the most recent one.
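
    The query compares dates as text, which works because the string puts the largest unit first and zero-pads every field (year/month/day hour:minute:second). When generating the cutoff from a script, the same layout can be produced as in this Python sketch; cutoff_string is a hypothetical helper name.

```python
import datetime


def cutoff_string(moment):
    """Format a datetime the way the query's date literals are written."""
    return moment.strftime("%Y/%m/%d %H:%M:%S")


start = cutoff_string(datetime.datetime(2003, 12, 1))
later = cutoff_string(datetime.datetime(2004, 1, 15, 8, 30))
# Text order matches time order for this layout.
print(start, start < later)
```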
    Contributed by:
    system

    Updated by:
    swhite

        39) I've just been sent an email saying "Started the D0 Calib User Servers on failover Node: d0dbsrv6.fnal.gov". What does it mean?
    The machine on which the calibration db servers were running is unreachable, so the servers have automatically failed over to the backup node. No action is required from the sam shifter, although you might like to check that all the servers did actually start up on the new node by checking the db server status at
    http://d0db.fnal.gov/sam_admin/cgi/nameService?show=dbserver .
    Contributed by:
    system

        40) Do I have to have a sam station running on every machine where I want to execute a sam_submit command?
    (Specific to CDF): Set up one node to run the station master and file storage server (fss) as well as a stager. Each other node needs a stager and you need to declare to sam what cache there is on that node. One needs to have the sam software installed. In my case (Valeria's configuration at CDF), nglas08 runs the station master and file storage server and the other one is used for workstations I add to this. I have nglas07, nglas06, nglas04, nglas05 etc. They all have the same server list file, so I soft link them each to the nglasX_server_list.txt file.
    Contributed by:
    system

    Updated by:
    lauri

        41) Is it possible to get files delivered in a special order?
    There is no delivery order guarantee in SAM. The order is dictated by the database, the optimizer, the cache availability at that moment, and the mass storage availability. For debugging purposes, specify a fileset instead of a dataset.
    Contributed by:
    system

        42) The declared and activated disks do not show up as active. What is the problem?
    This could have different causes. Two are listed:
    • Usually this is because the specified size is larger than actual free space. For large data files stored on large disks, we find that specifying a sam cache size of about 98% of the actual total space reported by df seems to work.
    • Sometimes the stager picks up a different host name from the one you used to configure the disk. To force a specific hostname, add --node-name=your.node.name to the stager line in the server config and see if it helps.
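
    The 98% rule of thumb from the first bullet is simple arithmetic. This Python sketch computes a suggested cache size from the total size of the file system; the function names are illustrative, and the 98 default is just the figure quoted above, not a SAM setting.

```python
import shutil


def suggested_cache_size(total_bytes, percent=98):
    """Return percent% of the total space, in whole bytes."""
    return total_bytes * percent // 100


def suggested_cache_size_for(path, percent=98):
    """Same, reading the total size of the file system holding path."""
    return suggested_cache_size(shutil.disk_usage(path).total, percent)


print(suggested_cache_size(1000))  # 980
```

    Integer arithmetic is used so that the result is never rounded up past the intended limit.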
    Contributed by:
    system

    Updated by:
    lauri

        43) What to do if I see CORBA exception errors?
    In this case, try the following:
    d0mino:~/sam % sam locate file foo
    CORBA Exception, server is probably dead (Minor: 0 Completed: COMPLETED_NO)
    
    Diagnosis: The SAMDbServer is dead or hung.
    Solution: Check for example the production version of the SAM At a Glance page to verify the Db server status. If it is not available, then restart or start it.
    Contributed by:
    system

    Updated by:
    filthaut

        44) If a file is not stored properly, hangs indefinitely, or times out in D0Framework?
    If the error message is:
            SAMManager:sam  Consumer established: CID = 523
            SAMManager:sam  Process established: CPID = 2599
            SAMManager:sam  Getting next input file
            SAMManager:sam  Project master will call back
            %ERLOG-e TIMEOUT:
                     Timed out waiting for project master
                     callback in SAMManager::selectForCB()!
                     Contact sam-users@fnal.gov!
                     sam 22-Nov-1999 11:39:24  ReadEvent:read Beginning of job
            >From stager: Spawned rm -Rf
            /prj_root/762/dma_1/sam/buffer//taul3signal_mcc99_2_qcd_20_11_22_99_11_24
    

    Diagnosis: Enstore is down
    Solution: Check the enstore status page for servers which are exiting, timed out, or locked. Their status should be alive. Send mail to helpdesk to report the problem.
    Contributed by:
    system

    Updated by:
    lauri

        45) If the encp command fails without contacting enstore?
    For example, assuming that you login as sam and the directory /pnfs/sam/NULL/test is owned by "sam"
       d0mino:~ % setup encp
       d0mino:~ % encp test.dat /pnfs/sam/NULL/test
    
       INFILE=
       OUTFILE=
       FILESIZE=0
       LABEL=
       DRIVE=
       TRANSFER_TIME=0.000000
       SEEK_TIME=0.000000
       MOUNT_TIME=0.000000
       QWAIT_TIME=0.000000
       TIME2NOW=0.000000
       STATUS=TIMEDOUT
    
       3422079 lueking E ENCP  INFILE= OUTFILE= FILESIZE=0 LABEL=
          DRIVE= TRANSFER_TIME=0.000000 SEEK_TIME=0.000000
          MOUNT_TIME=0.000000 QWAIT_TIME=0.000000 TIME2NOW=0.000000
          STATUS=TIMEDOUT
       Fatal error: ('TIMEDOUT', None) No response on alive to
          config Exit code:1
       3422079 lueking E ENCP  Fatal error: ('TIMEDOUT', None)
          No response on alive to config
       Exit code:1
    

    Diagnosis: The encp client is unable to contact the enstore configuration server (currently d0ensrv2). Try pinging d0ensrv2 from the node where you are running encp. If it is unavailable, there is a network problem.

    Solution: Get the network fixed. Contact the help desk and notify them that there is a "network problem" between the node you are on and d0ensrv2.
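
    When many such reports need checking, the KEY=VALUE block printed by encp can be read into a dictionary and the STATUS field inspected. This is a Python sketch under the assumption that the report looks like the example above (one upper-case KEY=VALUE per line); parse_encp_report is not an encp or SAM function.

```python
def parse_encp_report(text):
    """Collect the upper-case KEY=VALUE lines of an encp report."""
    report = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        # Skip log lines and messages: keys are single upper-case words.
        if sep and key.isupper() and " " not in key:
            report[key] = value
    return report


sample_report = """
   INFILE=
   FILESIZE=0
   TIME2NOW=0.000000
   STATUS=TIMEDOUT
   Exit code:1
"""
print(parse_encp_report(sample_report)["STATUS"])  # TIMEDOUT
```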

    Contributed by:
    system

    Updated by:
    lima

        46) If FNORB and CORBA error messages give an unexpected/unrecognized exception?
    For this error message:
    <<<<<< SAM Exception caught, status:
    Simple Status:
      Code: unexpected/unrecognized exception
      Severity level: ERROR
      Generated on Tue Nov 19 11:21:54 2002 by client
      In the context: 
      file_client.getDbServer: attempted to call DB server File,
      result: callee threw unrecognized/unexpected exception
      Recommended action: Please contact sam-admin@fnal.gov
      Additional information: 
       Fnorb.orb.CORBA.COMM_FAILURE Minor: 0 Completed: COMPLETED_NO: 
       File "/prj_root/711/online_1/moore/fnorb/Fnorb/orb/IIOPConnection.py", line 100:  
       None
    
    As a very good general rule of thumb: if the exception/error has anything to do with FNORB/CORBA, it is quite likely a temporary difficulty in communication. In many of these cases, repeating the command a few minutes later will work. In other cases the server really is dead and the command is NEVER going to work; in those cases, you'll usually see red balls on the server side.
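
    The "repeat the command a few minutes later" advice can be wrapped in a small retry loop for scripts. The Python sketch below is generic: call_with_retries, the three attempts, and the five minute wait are assumptions chosen for illustration, and action stands for whatever callable runs the sam command.

```python
import time


def call_with_retries(action, attempts=3, wait_seconds=300, sleep=time.sleep):
    """Call action(); on failure wait and try again, up to attempts times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as error:
            last_error = error
            if attempt + 1 < attempts:
                sleep(wait_seconds)
    # Every attempt failed: the server is probably really dead.
    raise last_error
```

    If the last attempt still fails, the original exception is raised again, which is the cue to check the server status rather than keep retrying.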
    Contributed by:
    system

    Updated by:
    lauri

        47) If some SAM jobs fail because they time out waiting for the project master callback?
    For error messages like this:
      %ERLOG-e SAM: PROJECT MASTER:
    
            Callback error caught in SAMManager::selectForCB()!
    
           Error message: Timed out waiting for the project master callback!
    
           File delivery may be slow or unavailable!
    
            sam 11-Dec-2002 11:23:25  ReadEvent:read Beginning of job
    
    
    and in the log file:
    SAMManager:sam  RCP parameter:  ProjectMasterTimeout = 15 [minutes]
    SAMManager:sam  RCP parameter:  FileRequestTimeout = 15 [minutes]
    SAMManager:sam  RCP parameter:  FileStoreLocation = /pnfs/sam/mammoth/demo
    SAMManager:sam  RCP parameter:  FileStoreMode = Asynchronous
    SAMManager:sam  RCP parameter:  StoreRequestTimeout = 15 [minutes]
    SAMManager:sam  RCP parameter:  ApplicationName = demo
    SAMManager:sam  RCP parameter:  ApplicationVersion = 1
    SAMManager:sam  Could not find RCP parameter WorkingGroup.
                   Using default value: __SAM.DFLT__
    SAMManager:sam  Consumer established: CID = 91429
    SAMManager:sam  Process established: CPID = 1154070
    SAMManager:sam  Initialized.
    SAMManager:sam  Getting next input file...
    SAMManager:sam  Established socket for the project master callbacks
    (port: 43275).
    SAMManager:sam  Project master will call back.
    SAMManager:sam  Received message from the project master
    SAMManager:sam  Input file
    delivered: /sam/cache/boo/zero_bias_0000168824_013.raw
    Initializing ReadEvent package
    Input file: d0StreamName, initial wildcard(s) = SAMInput:, current file
    /sam/cache/boo/zero_bias_0000168824_013.raw
    
    Input format:
    Initializing SmtRawUnp2Data package
    Creating a SMTCalibPack object
    Created default SMTPedAlgAllRO object
    ReadEvent: Opening /sam/cache/boo/zero_bias_0000168824_013.raw
    ReadEvent: Closing /sam/cache/boo/zero_bias_0000168824_013.raw
    First event read: Run number: 0, Event Number: 0
    Last event read:  Run number: 0, Event Number: 0
    Events read: 0
    SAMManager:sam  Input file released: zero_bias_0000168824_013.raw
    SAMManager:sam  Getting next input file...
    SAMManager:sam  Project master will call back.
    SAMManager:sam  Process ended: CPID = 1154070
    SAMManager:sam  Destroyed.
    
    Solution:

    One trivial possibility:
    The parameter "Timeout" (e.g. in a SAM rcp file) is probably set too short. The user should set the timeouts to something like 2 hours.

    Another more "interesting possibility":
    The problem is caused by a "bad" SAM file, zero_bias_0000168824_010.raw in this case, which is larger than 2GB and cannot be delivered. SAM keeps retrying the delivery of this file and eventually times out.

    SAM shifter should mark this file as "bad" so that it does not mess up jobs.
    Until this happens, the user may define a new project

    __set__ yourset minus file_name zero_bias_0000168824_010.raw
    
    to avoid that particular file.
    Contributed by:
    system

    Updated by:
    yann

        48) If a user reports an autoregistration failure with a valid uid?

    E.g. the user has 2 uid's (when getting d0 computer accounts, fnal users are issued new visitor ID's).


    Use the samadmin command such as:


    samadmin add person --username=hli --emailAddress="Hengne.Li@lpsc.in2p3.fr" --lastName="Li" --firstName="Hengne"
    Contributed by:
    system

    Updated by:
    kinyip

        49) If the SAM auto mailing list gets mail with subject station_prd fcp died at d0mino: status 1?
    Traceback (most recent call last):
       File "/usr/products/fcp/IRIX/v1_5b/bin/fcpd.py", line 655, in ?
         cfg = ConfigFile(os.environ['FCP_CONFIG'])
       File "/usr/products/fcslib/NULL/v2_1/lib/config.py", line 144, in __init__
         if file:    self.readConfig(file)
       File "/usr/products/fcslib/NULL/v2_1/lib/config.py", line 151, in readConfig
          self.reReadConfig()
       File "/usr/products/fcslib/NULL/v2_1/lib/config.py", line 155, in reReadConfig
         f = open(self.File, 'r')
    
    IOError: [Errno 2] No such file or directory:
    '/home/sam/private/conf/fcp_root/cfg/fcp.cfg'
    
    The fcp.cfg file has disappeared, e.g. because of a crash on d0mino. Do:
    1. cp ~sam/backup/fcp_root/cfg/fcp.cfg ~sam/private/conf/fcp_root/cfg/
    2. restart the "fcp" daemon in REMOTE-STAGER_server_list.txt using sam_bootstrap, similar to what you would do to restart a DB server (for example).
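
    Step 1 can be made safe against accidentally overwriting a good configuration. The following is a minimal Python sketch, assuming the backup and target paths given above; the function name is hypothetical, and step 2 (restarting the daemon with sam_bootstrap) still has to be done by hand.

```python
import os
import shutil

def restore_if_missing(target, backup):
    """Copy backup to target only when target does not exist.

    Returns True if the file was restored, False if target was already there.
    """
    if os.path.exists(target):
        return False
    parent = os.path.dirname(target)
    if parent:
        # Recreate the configuration directory if it vanished as well.
        os.makedirs(parent, exist_ok=True)
    shutil.copy2(backup, target)
    return True

# Example call with the paths from this FAQ entry (run as user sam):
#   home = os.path.expanduser("~sam")
#   restore_if_missing(home + "/private/conf/fcp_root/cfg/fcp.cfg",
#                      home + "/backup/fcp_root/cfg/fcp.cfg")
```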
    Contributed by:
    system

    Updated by:
    lima

        50) What to do if losing or replacing a station disk?
    Example for central analysis on d0mino, by Lee Lueking, Sinisa Veseli, Wyatt Merritt.
    Losing the disk which is assigned to /sam/cache/062 in the central-analysis system on d0mino, and replacing that disk with a fresh unused disk which is labelled /sam/data/007.
    1. Log in as user sam.
    2. Stop the station server with the following steps:
      1. Edit the server list file (~sam/private/CENTRAL-ANALYSIS_server_list.txt) and comment out, with a #, the station server line (starts with "station sm_central_analysis_prd" )
      2. Then do
             setup sam (need to do this so the correct server list is used)
             ups update sam_bootstrap
             
    3. rm /sam/cache/062
    4. ln -s /sam/data/007 /sam/cache/062
    5.      setup sam_admin
           samadmin uncache disk files \
               --directory=/sam/cache/062/boo --node=d0mino.fnal.gov
          
    6. Restart station server with the following steps:
      1. Edit the server list file, as above, and remove the # from the station server line
      2. Then do
              ups update sam_bootstrap
              
      If there is not an unused disk to link to, then skip steps 2 and 3, and the disk will be marked as INACTIVE when the station starts up.
    Contributed by:
    system

    Updated by:
    lauri

        51) If any problem occurs about missing path in db?
    If the error message looks like this
     <<<<<< SAM Exception caught, status:
       Simple Status:
         Code: path not found in db (Category User)
         Severity level: ERROR
         Generated on Wed Feb  5 01:10:58 2003 by DB server
         In the context: FileImpl.mapDestination: local()
         Recommended action: Please check your data, contact sam-users@fnal.gov
         Additional information: Error no match found for path pattern 
       /pnfs/sam/mammoth/copy1/monte_carlo/group-phase1/higgs/thumbnail-bygroup/NULL
    

    Ask
    d0sam-admin@fnal.gov to setup the necessary path (in this case, higgs/thumbnail). (from Kin)

    You may want to check your metadata with the command

    sam get destination --descrip=your_metadata.py
    
    A typical problem is that your metadata contains "generated" instead of "generated-bygroup".
    Contributed by:
    system

    Updated by:
    yann

        52) What are the duties of a shifter on a day with scheduled downtime (eg. first Tuesday of the month)?
    (From Wyatt Merritt - May 2004)
    • In the event of a database downtime, expect no SAM functionality, and reply to users who ask what is going on (there are always some who don't read downtime notices).
    • When the database admins indicate that the database is back up, restart the SAM DbServers
    • When ENSTORE is down, expect that jobs asking for tape files will time out, and warn users who ask about that.
    • When ENSTORE comes back, expect time-outs may continue while the system catches up from the backlog, depending on how many tape jobs users dumped in just before or during the downtime.
    • As long as the Oracle database is up, SAM functionality should be normal (DDE accessible, command line working, project submission normal, etc.). Any problem (except long waits for tape files if ENSTORE is down) should be followed up as usual.
    Contributed by:
    oneil

    Updated by:
    bellavan

        53) File delivery is very slow. Check if it is because enstore is very busy
    It happens that many SAM projects are waiting a long time for file delivery (a lot of red in samTV). You can check the status of the tape robots by doing the following.

    Go to the project page in samTV and then click on one of the "Request" X's (last column) which will jump you to the file on the file request page. Look at the Location string. If it starts with "enstore" then the file comes from tape. If the tape name (in the parentheses at the end of the Location string) ends in "l1" then the tape is LTO.

    Then go to the enstore status page (http://www-d0en.fnal.gov/enstore/status_enstore_system.html). Look at the "samlto.library manager" section and click on Full Queue Elements on the right side of the page. Then look for a "Reading Tape" block for one of your files. The important thing to look at now is the "Job Submitted" time and the "Dequeued" time. The former is when your tape request was submitted to enstore (this is when SAM started the encp). The "Dequeued" time is when your request exited the queue and enstore started acting on it.

    You will probably see that the requests tend to take 1.5 - 2 hours. Hence the long wait times.

    You may notice that other Reads have spent a very short time in the queue (like a few minutes). This isn't a conspiracy against a particular user: if you submit a tape request that sits in the enstore queue for two hours, and then submit another tape request FOR THE SAME TAPE one hour and 59 minutes later, the later request gets joined with the original and goes through the queue really fast. Enstore is smart enough to join requests for the same tape.

    In particular, if a dataset is scattered on many tapes (so there is no merging of tape requests), each file will have to wait for a long time.

    (based on an email by Adam Lyon)

    Additional info:
    There are restrictions to how many files SAM can transfer simultaneously, therefore it may not request all files at the same time. The already requested files, which should match the list of requested files on the enstore web page, can be obtained by going to the station history in samTV and selecting "Pending Transfers" (then click on "Click here for the list").
    Keep in mind that samTV is updated every hour, while the enstore page is updated every few minutes, so there will still be discrepancies.
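
    The two checks described above (is the file on an LTO tape, and how long did the request sit in the queue) can be sketched in Python. This is only an illustration: the exact layout of the Location string and the timestamp format on the enstore page are assumptions, so the format argument may need adjusting, and the function names are not part of SAM or enstore.

```python
import re
from datetime import datetime

def tape_label(location):
    """Return the text inside the final parentheses of a Location string, or None."""
    match = re.search(r"\(([^()]*)\)\s*$", location)
    return match.group(1) if match else None

def is_lto(location):
    """True if the tape name at the end of the Location string ends in 'l1'."""
    label = tape_label(location)
    return label is not None and label.lower().endswith("l1")

def queue_wait(job_submitted, dequeued, fmt="%Y-%m-%d %H:%M:%S"):
    """Time between 'Job Submitted' and 'Dequeued', as a timedelta.

    The default fmt is an assumed timestamp layout, not taken from enstore.
    """
    return datetime.strptime(dequeued, fmt) - datetime.strptime(job_submitted, fmt)
```

    A wait of 1.5 to 2 hours from queue_wait matches the typical delay quoted above.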

    Contributed by:
    yann

    Updated by:
    yann

        54) How do I add new "keywords" to SAM for storing data?
    A user asks for new keywords to be used in the metadata; these need to be declared to SAM. Solution: identify the category, tier, and Param_type from the user's mail or the error message. Use "samadmin add param type" to add them to SAM's list. For example, to add the keyword slha of type long for the 'generated' data tier:
    
    % setup sam
    % samadmin add param type --paramCategory='generated' --paramType='slha' \
    --dataType='long' --connect=username/password@d0ofprd1 
    
    New paramType and corresponding dimension 'generated.slha' added.
    
    Other examples:
    $ samadmin add param type --paramCategory=generated --paramType=rhneutrinomass --dataType=string
    New paramType and corresponding dimension 'generated.rhneutrinomass' added.
    $ samadmin add param type --paramCategory=fome --paramType=ktfac --dataType=float
    New paramType and corresponding dimension 'fome.ktfac' added.
    
    Contributed by:
    lueking

    Updated by:
    malik

        55) Why does "ls -l /pnfs/sam/dzero/db2/datalogger/run175K_176K/datalogger/all/all/all_0000175645_079.raw" tell you that the file has size of 1 byte ?
    According to Wayne Baisley, this is because the nfs protocol only allows for
    32-bit file sizes, so the pnfs developers adopted the convention of returning a 
    value of 1 for sizes that don't fit. 
    
    There are ways to get the real size, like the command:
    
    setup encp
    enstore pnfs --layer=/pnfs/sam/dzero/db2/datalogger/path/to/file 
    
    in nodes where /pnfs exists.
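
    The convention described above can be illustrated with a short Python sketch. Whether the 32-bit limit is signed (2 GiB) or unsigned (4 GiB) is not stated here, so it is left as a parameter; the function names are illustrative only.

```python
def fits_in_32_bits(size, signed=True):
    """True if a file size can be represented in a 32-bit field."""
    limit = 2**31 - 1 if signed else 2**32 - 1
    return 0 <= size <= limit

def size_shown_by_ls(real_size, signed=True):
    """Mimic the pnfs convention: report 1 when the real size does not fit."""
    return real_size if fits_in_32_bits(real_size, signed) else 1
```

    So a size of 1 byte in "ls -l" on /pnfs is a marker for a large file, not a sign of a truncated file.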
    
    Contributed by:
    kinyip

    Updated by:
    lauri

        56) How do I find out from which nodes one may submit "sam store" jobs ?
    Do, for example, "sam dump fss --station=clued0".
    
    One would see :
    File Storage Server Dump:
    
    Stagers are known at nodes: jetsam-clued0.fnal.gov lagan-clued0.fnal.gov 
    
    This says that one can submit "sam store" jobs from jetsam-clued0.fnal.gov and lagan-clued0.fnal.gov for the station "clued0".
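
    If the list of stager nodes is needed in a script, the dump can be parsed. The following is a minimal Python sketch, assuming the output contains a line of exactly the form shown above; the function name is illustrative.

```python
def stager_nodes(dump_text):
    """Return the node names listed after 'Stagers are known at nodes:'."""
    prefix = "Stagers are known at nodes:"
    for line in dump_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):].split()
    return []  # no stager line found in the dump
```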
    Contributed by:
    kinyip

    Updated by:
    filthaut

        57) Hung project?
    (from Igor Terekhov)
    This is how to look deeper into this.
    First, understand how many files these projects are waiting for.
    It's possible that they are all hung on a single file. To see the set of files
    in a project, as well as those undelivered, do "sam dump project --project=XXX".
    Second, do "sam dump station --files | fgrep YYY" where YYY is the name
    of one such missing file. Hopefully, the station still remembers that it
    "owes" the file(s) to the project. If not, there's a bug in the communication between station and project.
    Third, take one such data file and fgrep for it in the entire station's log
    file (or in the master log file on the web). You'll see the history of attempts
    to deliver the file. Send us this history if nothing is apparent.
    In summary, "sam dump project" followed by "sam dump station --files"
    is required to trace a hung project.
    Contributed by:
    daria

    Updated by:
    daria

        58) I need some information in order to regenerate missing metadata. How can I find out when I consumed a given dataset (e.g. dataset get-jpsi-july2004b-84, username stark)?

    (From Robert Illingworth)
    From the analysis projects browser web page (
    http://d0db.fnal.gov/sam_data_browsing/AnalysisProjects.html) you can specify a dataset definition name and get all of the projects that used it. From the dataset mentioned, one gets (in an html table):

    Project name: stark-pick_event-09-00-19-12Aug2004
    Project Id:   297643
    Station:      central-analysis
    Start time:   12-aug-04/09:00:54
    End time:     12-aug-04/09:12:11
    Dataset:      174232  (actually snapshot ID)
    
    Currently, one can also get some more information with:
    % setup sam
    % sam get project summary --project=stark-pick_event-09-00-19-12Aug2004
    
    Please note that it won't be long before we start moving a new sam version into production (version 6 is still in the testing stage). The new version brings a new command which can provide even more information about a project:
    % setup sam -t -q user_prd2  # (-t needed for now, to get test version)
    % sam get project info --project=stark-pick_event-09-00-19-12Aug2004
    
    You can try the command above now using the "setup sam -t -q user_prd2". This extra setup will not be necessary after the new sam version becomes official.
    Contributed by:
    lima

    Updated by:
    lima

        59) What to do if files are declared with wrong file size or wrong CRC?
    Get the current metadata: with the SAM commands of sam v6_0:
    $ sam get metadata (usage)
    
    Get the CRC of a file: The official way to get the Enstore CRC for a local file is "setup encp" or "setup ecrc", then "ecrc <file>". Then, for the purposes of the SAM database, you need to append the letter L to the number you get.
    Change the crc and the file size: with the SAM commands of sam v6_0:
    $ setup sam v6_0 -q user_prd2
    $ sam update file crc --filename=xxxxxxxxx --crcValue=yyyyyyyyyL \
    --crcType="adler 32 crc type" (usage)
    $ sam update file size (usage)
    
    Note the trailing "L" in --crcValue, it does matter. That is it, then everything should be alright.
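
    Since the trailing "L" is easy to forget, the value passed to --crcValue can be prepared with a small helper. This is a Python sketch only, assuming that ecrc prints the CRC as a plain decimal number; the function name is hypothetical.

```python
def to_sam_crc(ecrc_value):
    """Return the CRC in the form SAM expects: the digits followed by 'L'."""
    text = str(ecrc_value).strip()
    if text.endswith("L"):
        text = text[:-1]  # already in SAM form; normalise and re-append below
    if not text.isdigit():
        raise ValueError("expected a decimal CRC value, got %r" % (ecrc_value,))
    return text + "L"
```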
    Contributed by:
    lauri

    Updated by:
    lima

        60) What are the cron jobs in use for maintaining sam, and how can I find them?
    DISCLAIMER: the details given are for the D0 sam configuration.

    There are many cron jobs which run under the sam account; but let's hold off on those and talk about the others first, since there are fewer of them. Also, in the following I am only listing the cron jobs which are in PRODUCTION; in general, there are corresponding jobs against the integration and the development environments.

    In all of the following, to see the specific crontab entries, their frequency, etc., the shifter would log in to the node/account specified, and do

       crontab -l
    
    along with suitable unix tools (grep comes to mind).

    • samshift: the samshift account runs the encpSynch and the dccpSynch jobs (tape/volume coordination with enstore). These are run from the d0db-prd node (currently d0ora2). You need your root principal to log in as user samshift.
    • products: the products account runs the jobs that update the web pages from the CVS repository. These are run from the web server nodes (which, for d0 at the present time, is the same as the dbserver node). (Shifters are generally not able to log into the products account).

    All other cron jobs (subject to my failing memory, of course) are run from the sam account. They are usually documented in the crontab file (to some extent). Some of the more notable cron jobs that run on the dbserver node are:

    1. archiveYesterdayLogs: take the log files generated (master logger, optimizer, dbserver, any stations, etc.) and store them in sam for posterity.
    2. purgeLogs.sh: purges the log files older than some date, to recover disk space
    3. sammis plot: generates plots and statistics pages
    4. sammis glance: generates the SamAtAGlance pages
    5. nameservice_cleanup.bash: unbinds servers from the nameservice when they have been deceased for some time
    Contributed by:
    lauri

        61) How to register a new sam_gridftp installation?
    Or How to add a new gridftp service certificate?

    (This is just a hands-on recipe of how to do this. Check here for more details about GSI, the Grid Security Infrastructure)

    In case the requester doesn't know, "Step 4" in http://d0db.fnal.gov/sam_gridftp/ describes how to request a sam service certificate.

    To add a new gridftp service certificate:

    % setup cdcvs (for clued0 or d0minoxx) 
    % cvs co sam_gsi_config
    % edit grid-security/grid-mapfile (for Dzero)
        OR grid-security/grid-mapfile-cdf (for CDF)
    

    Append a new line to the file, similar to this:

    "/the/certificate/subject/of/the/new/client" sam

    For example:

    "/DC=org/DC=doegrids/OU=Services/CN=sam/sprace.if.usp.br" sam 

    % cvs commit -m "Added Dzero gridftp client from machine.domain.name"
    

    It is very important that the certificate subject string be in quotes. The entry "sam" (no quotes)
    at the end of the line means "account file transfers for UID sam".
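
    Because the quoting matters, the new line can be generated rather than typed. The following is a minimal Python sketch of the line format described above (quoted subject, then the unquoted account); the function names are illustrative, and the edit and commit to sam_gsi_config still happen through cvs as shown.

```python
def mapfile_entry(subject, account="sam"):
    """Format one grid-mapfile line: the subject in double quotes, then the account."""
    subject = subject.strip().strip('"')
    if not subject.startswith("/"):
        raise ValueError("a certificate subject is expected to start with '/'")
    return '"%s" %s' % (subject, account)

def append_entry(mapfile_text, subject, account="sam"):
    """Return the mapfile text with the entry appended, unless it is already present."""
    entry = mapfile_entry(subject, account)
    lines = mapfile_text.splitlines()
    if entry in (line.strip() for line in lines):
        return mapfile_text
    lines.append(entry)
    return "\n".join(lines) + "\n"
```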

    Note 1: Your default login is the appropriate account for the above procedure, but you will need
    read/write access to the sam_gsi_config CVS repository:

    1. read access: send your kerberos principal to helpdesk@fnal.gov saying that you want to get access
      to the cdcvs repository; mention a sponsor (like Wyatt Merritt or Rick St. Denis)
    2. write access: ask another shifter to add your principal to the cvs shifter group. This is done by appending
      your principal to the file "allowed" in the repository check_access.d/groups/sam_shifter (DZero)
      or check_access.d/groups/sam_cdf (CDF).

    Note 2: On ClueD0 a previously executed kinit id/root causes problems with cvs access.
    Either logout/login or use kinit -f when returning to your default account.

    Note 3: if a site is not running any automatic grid-mapfile update script, the site admins will further have to add the certificate in their grid-mapfile by hand. For further details, refer to the SAM-IT #1987, comment #10, contributed by Gabriele Garzoglio and linked here.

    Contributed by:
    lima

    Updated by:
    henrik

        62) How to add a new individual user certificate?
    Updated from IT 3000
    1. The user needs to register the new subject/certificate with Dzero VOMRS/VOMS by following the instructions at http://www-d0.fnal.gov/VO/DZero_VO_Instructions.html
    2. The shifter should add the certificate subject to sam_gsi_config/grid-security/grid-mapfile-jimsam. Please follow the same instructions from previous FAQ item, by replacing grid-mapfile with grid-mapfile-jimsam.
    3. The subject should be distributed to the sites. This is automatic and happens periodically. If the site admins have not automated the process, they will have to add the subject manually.
    Note :
    Step 3 might not be immediate depending on the frequency at which update scripts are run.
    Step 2 should be automated some day. Until it is automated, the sam shifter should do it.
    Contributed by:
    lima

    Updated by:
    grenier

        63) How to add a storage area in the new SAM router d0rsam01, which replaced d0mino?
    You can do this by logging into a clued0 node as sam, and running
        ~/add_d0rsam01_store_dir sprace [or wherever]
    
    Contributed by:
    lima

    Updated by:
    filthaut

        64) How do you know whether a clued0 node is a SAM-enabled batch node ?
    1. Use the command "pbsnodes -a | less" ( use "less" or "more" as it is a long list).
    2. Look for the node that you're interested in. A SAM-enabled batch node has the field 'properties = sam' attached to it.
    Contributed by:
    kinyip

    Updated by:
    bellavan

        65) How does one add/set up a clued0 node/disk to be a SAM-enabled batch node?

    Typically, and let us assume so here, the SAM cache disk for a clued0 node is /sam. This is not always the case; for example, it is /samcache on flotsam-clued0. Here are the steps:

    1. Make sure the directory /sam exists and belongs to user "sam", NOT user "root". If not, submit a request to clued0-admin@fnal.gov to change the ownership of the /sam directory to "sam".
    2. Login to the target clued0 node as "sam". At the home directory, run
      • $ ./add_clued0_disk.py
      to add the disk to the SAM station with the correct size. If it complains the size is too small, give up! If the space is too small, it's not worth it.
    3. Finally, to fix the group space allocation, run
      • $ ./set_clued0_group_size.py
      Note: If you're modifying more than one node then you only need to do this once at the end.
    4. Execute
      • $ setup sam -q worker_prd
      • $ ups restart sam_bootstrap
      to restart the stager. Check to make sure the directory /sam/boo is there. If not, restart the stager again. Sometimes it takes more than one restart for the /sam/boo directory to appear.
      Also check the clued0 log file for any error message in that node after you restart the stager, especially if that disk still stays INACTIVE. Look out for error message described in this FAQ and deal with it accordingly.
    5. Finally, ask clued0-admin@fnal.gov to set this clued0 node to be a SAM-enabled batch node.
    Contributed by:
    kinyip

    Updated by:
    bellavan

        66) What does this error message : "^GWarning, discrepancies with file system on disk ..." mean ?
    If you see error messages in a logbook like:
    03/06/05 12:18:40 clued0.SM.LocalDiskAgent 1898: Writable directory /sam/boo exists
    03/06/05 12:18:40 clued0.SM.LocalDiskAgent 1898: ^GWarning, discrepancies with file system on disk 2435
    03/06/05 12:18:40 clued0.SM.LocalDiskAgent 1898: ^GFailed to activate disk 2435
    
    it indicates that there is a disk size discrepancy in the /sam disk of the node in question. You may execute the command "df /sam" in that node and see that the disk spaces don't add up correctly. This would render that sam disk "INACTIVE". You would trigger this error message by restarting the stager in the node in question, ie., "ups restart sam_bootstrap".

    One possible cause of this is that a user process is stuck with a file open for so long that the project has timed out and been killed and the station has tried to delete the file from the cache. Because of the way Unix file systems work the disk space for a deleted file is not freed until all the processes which have the file open have closed it. If 'ps -ef' shows up a possible culprit for this then you should ask the process owner to kill it.
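
    The Unix behaviour described above can be reproduced in a few lines of Python. This sketch (Unix only) writes a temporary file, deletes it while it is still open, and shows that the data is still held until the descriptor is closed; the function name is illustrative.

```python
import os
import tempfile

def demo_deleted_but_open():
    """Delete a file that is still open and inspect it through the open descriptor."""
    fd, path = tempfile.mkstemp()
    os.write(fd, b"x" * 4096)
    os.unlink(path)                       # the name is gone from the directory
    still_listed = os.path.exists(path)
    info = os.fstat(fd)                   # the open descriptor still sees the data
    os.close(fd)                          # only now can the space be freed
    return still_listed, info.st_nlink, info.st_size
```

    The file no longer appears in the directory and has no links, yet its size is still reported through the open descriptor, which is why "df" and the station's bookkeeping disagree while such a process is alive.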

    Other possible causes are an incorrectly mounted file system, "junk" files on the disk (such as recovered files after a system crash), or a corrupted file system. Most of these require root access to fix, so you should report it to the administrator of that node (such as clued0-admin@fnal.gov for clued0 and the helpdesk/run2-sys@fnal.gov for CAB) to ask them to fix the problem.

    Contributed by:
    kinyip

    Updated by:
    illingwo

        67) What does the error message "You have violated the constraint CF_FI_FK" mean? It happened when trying to remove a file from SAM
    (from Robert) By design, SAM does not allow you to erase files from the database once they have been used in a project. The normal solution is to declare the unwanted file as bad. If a new file should replace it, you should save it with a different name.
    Contributed by:
    lima

        68) Several segments of my job failed. How do I resubmit them?
    For this purpose you can use the "generate recovery project" command:

        setup sam v7_0_2c -q test_prd
        sam generate recovery project --project= --recoDefname=

    The variables mean the following:
    • project = project which failed
    • recoDefname = name of the dataset definition which contains the failed files

    This command works if the recovery project contains fewer than 999 files. The files will be delivered again in whatever order is fastest, not necessarily in the same order as before.
    Contributed by:
    bartsch

        69) How do I add a durable location to the database?
    Durable locations have to be registered before being used. To SAM, a disk consists of a top level directory containing one or more subdirectories. For example, to add the location sprace.if.usp.br:/raid1/jim_cache, do (this is with the v6/v7 client):

    samadmin add data disk --mountPoint=sprace.if.usp.br:/raid1 --size=100G

    The size specified in the first command doesn't matter: it is not used for durable locations, so any value will do.

    The following is obsolete now and so we don't need to do it :
    { samadmin add disk location --mountPoint=sprace.if.usp.br:/raid1 --relativePath=jim_cache
    If multiple subdirectories hang off the same top level directory then the first command only needs to be done once, then one time for each of the sub directories. }
    Contributed by:
    illingwo

    Updated by:
    kinyip

        70) How to spot problems with machines
    If you look at D0 SAM TV right now (http://www-clued0.fnal.gov/~sam/samTV/current/samTV.html) you'll see that fnal-cabsrv2 has many file delivery errors. While I know we don't really train you to diagnose such problems, this should have caused someone to send mail to the SAM on call expert. There are 24 projects that have received file delivery errors (out of 60 running).

    If you click on fnal-cabsrv2, you would see lots of projects with errors, including some projects that started only an hour or two ago. This is your hint that something bad is going on, and it's still going on. If you click on any of the projects, and scroll down to Files, the deliveries with errors appear first. The thing that stands out is that all of the deliveries occurred on jobs running on d0cs087. In fact click on another project and you'll see the same thing. This is a glaring hint that there's something wrong with d0cs087. If you try to ssh to d0cs087 (either as SAM or yourself) you'll see that you can't. So the machine is sick.

    The course of action to take is to either send mail to the sam on call expert, or send mail to the helpdesk (cc: to run2-sys) and report the sick machine. Hopefully a Sysadmin will look at it today. Since it's only one machine, it is not enough of a problem to page someone.
    Contributed by:
    schmittc

        71) How do I restart samTV?
    The instructions are
    here . Currently (21 March 2006), it is also necessary to type
    > setup sam v6_0
    
    before following the rest of the instructions.
    Also note that the server for SAM-TV has changed to d0samman.fnal.gov
    Contributed by:
    wyatt

    Updated by:
    bellavan

        72) How to add new FOME keywords/parameters?
    If the request is something like "Could the following keyword be added to the FOME data tier tables in v7: FOME.DR-LF-LEP (type: float), meaning: min dR cut between any light flavor parton and a lepton applied at parton level", do the following. Note that you *must* setup sam v7 (if you do this on a clued0 machine, you'll get v7 by default).
       setup sam
       samadmin add param type --paramCategory=fome --paramType=dr-lf-lep --dataType=float --connect=username/password@d0ofprd1
    
    Contributed by:
    neeti

    Updated by:
    filthaut

        73) How to see if someone is a registered sam user and how to register a new sam user?
    The SAM shifter can see if someone is a registered sam user by doing:
    sam get registered users | grep 'username'
    The SAM shifter can also add a new user with the sam add user command:
    http://d0dbweb.fnal.gov/sam_admin/samadmin_CLI_AddUser.html
    or
    The user can register himself/herself with SAM at:
    http://d0dbweb.fnal.gov/sam_admin/cgi/autoRegister.py .
    Contributed by:
    lietti

    Updated by:
    pengj

        74) How to add a new entry to the facility name table.
    To add a new name for receiving MC jobs, add a new entry to the "facility name table" using the samadmin command
    samadmin add mc production center with the required option "--facilityName" set to the name requested by the user. Below are two examples.
    $ samadmin add mc production center --facilityName=GridKaV7 --connect=aarond
    New facilityId = 31
    $ samadmin add mc production center --facilityName=ltu.cct.lsu.edu --connect=aarond
    New facilityId = 32
    
    Contributed by:
    aarond

    Updated by:
    aarond

        75) New D0 DB Servers - procedures for shifters (e-mail from Adam Lyon, Nov. 2006)

    If you look at the SAM At a Glance, you'll see some different looking SAM DB servers, like...

    SAMDbServer.user_pool_1:SAMDbServer (Using: dbs_user_pool@d0ofprd1)
                   d0dbsrv9.fnal.gov:38075     v8_0_0   16 Nov 2006
                   10:20:39
    and
    SAMDbServer.user_prd2:SAMDbServer (Dispatching to 4 servers)
                   d0ora2.fnal.gov:55923     v8_0_0   21 Nov 2006 12:35:05
    
    Here's an explanation and what to look for...
    1. For the last year or so, we have been running some DB servers as "dispatchers with backends". The user_prd2 DB server example above is such a "dispatching" DB server. It is not a real DB server, but rather chooses a backend DB server in a round-robin fashion. This way the user thinks they are talking to one DB server, but in fact we can have many on the backend to distribute the load. Also, if one of the backend DB servers dies, the dispatcher merely skips over it. That way we can have redundant DB servers, and all of the servers that are up do work.
    2. We have recently started running some of the backend DB servers on Linux machines (d0dbsrv9 - 12) instead of d0ora2. This was done to alleviate high load problems on d0ora2 and to work around some software incompatibilities with the Oracle client. The user_pool_<n> DB servers, like the one above, are such DB servers. We also have web_pool and station_pool DB servers on Linux. You will note that there is no link to the log files for these DB servers. Unfortunately, we have not yet started publishing their logs to the web - we are still working on this problem.
    3. The dispatching DB servers ALWAYS run on d0ora2 and d0ora2 is supported 24/7. If you see a red ball next to a dispatching DB server, try to restart that server (and only that server). If that fails, phone the expert immediately, day or night.
    4. If a backend DB server on Linux (d0dbsrv9-12) is dead, then please open an issue tracker ticket, BUT THIS IS NOT A CRITICAL PROBLEM. You do not need to phone the expert. The dispatching DB server will merely skip this DB server and the load will be spread among the remaining DB servers. The expert will look into the problem as soon as possible. But again, this is not a critical problem. Note that we do not have the Linux DB server machines open to shifters yet, so you will not be able to log in. But since each machine is not critical, this should not be a problem.
    5. If ALL of the backend linux DB servers are dead (e.g. all of DB servers for machines d0dbsrv9-12 are red) then you should immediately phone the expert. This condition probably means that the entire rack died (we've never seen that before).
    6. If a regular DB server on d0ora2 goes red (like dlsam_prd or grid<n>_prd) then please follow the normal procedures for restart. If the DB server is dlsam or the farm and you can't get it restarted, then phone the expert.
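
    The round-robin behaviour described in point 1 can be illustrated with a short Python sketch. This is not the SAM dispatcher code, only a model of the idea: backends are chosen in turn, and a dead backend is skipped. The class and method names are hypothetical.

```python
class Dispatcher:
    """Toy model of a dispatching DB server with several backends."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.alive = {name: True for name in self.backends}
        self._next = 0  # index of the backend to try first on the next request

    def mark(self, backend, alive):
        """Record that a backend has died (alive=False) or come back (alive=True)."""
        self.alive[backend] = alive

    def pick(self):
        """Return the next live backend in round-robin order."""
        count = len(self.backends)
        for step in range(count):
            candidate = self.backends[(self._next + step) % count]
            if self.alive[candidate]:
                self._next = (self._next + step + 1) % count
                return candidate
        raise RuntimeError("no backend DB server is alive")
```

    With one backend marked dead, pick() keeps cycling over the remaining ones, which is why a single dead Linux backend is not critical, while all of them being dead is.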

    To summarize:

    1. If a dispatching DB server goes red, log into d0ora2 and restart it (and only that DB server). If you have problems, phone the expert IMMEDIATELY day or night.
    2. If a linux backend DB server goes red (on d0dbsrv9-12), then open an issue tracker ticket and send an e-mail to helpdesk but do NOT phone the expert. This is not a critical problem.
    3. If ALL of the linux backend DB servers are dead (on d0dbsrv9-12) (rack failure), then immediately phone the expert.
    4. If a regular DB server on d0ora2 dies, then follow the normal restart procedures. If the DB server is dlsam or farm and you can't get it restarted, then phone the expert.
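
    The summary above amounts to a small decision table, which can be written out as a Python sketch. The category names and returned strings are ours and only restate the four rules; the function is not part of any SAM tool.

```python
def triage(kind, restart_failed=False, all_backends_dead=False, dlsam_or_farm=False):
    """Return the shifter action for a red DB server.

    kind is 'dispatcher', 'linux_backend' (d0dbsrv9-12) or 'regular' (on d0ora2).
    """
    if kind == "dispatcher":
        if restart_failed:
            return "phone the expert immediately"
        return "log into d0ora2 and restart only that DB server"
    if kind == "linux_backend":
        if all_backends_dead:
            return "phone the expert immediately"
        return "open a ticket and e-mail helpdesk; do not phone the expert"
    if kind == "regular":
        if restart_failed and dlsam_or_farm:
            return "phone the expert"
        return "follow the normal restart procedures"
    raise ValueError("unknown DB server kind: %r" % (kind,))
```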
    Contributed by:
    carvalho

    Updated by:
    bellavan

        76) How to use an existing dataset and make new datasets with the files from the same tape?
    (Extracted from
    IT#1409)
    An updated python script which works with the current version of sam is now
    available at http://www-d0.fnal.gov/~illingwo/sam_utils/dsByTape.py
    let me know if you have any problems with using it.  
    Robert
    
    
    Contributed by:
    lima

        77) For a new SAM station installation or upgrade, what is a set of SAM package versions which are compatible and recommended for installation?
    (Extracted from
    IT#1410)

    The release cuts for SAMGrid are now kept at http://www-d0.fnal.gov/computing/grid/releases/.

    Contributed by:
    lima

    Updated by:
    lima

        78) How do I fix a corrupted xml database at GridKa / What are the commands to reinitialize the GridKa xml database?
    (Extracted from
    IT#3022)

    The xmldb at GridKa is located on a partition that may fill up. This can cause xmldb corruption (among other things). In one case, the global jobs table in the DB got to be > 1.1GB (see IT#3030). When the DB gets corrupted, reinitialize it using the following commands:

    1. Delete the jobs collection:
      • $ xmldb_cmd dc -c /db/jobs -n globalJobs1 -url http://<server_name>:7080/Xindice
      where <server_name> is the fully qualified domain name of the database server (for example, as it appears in a web browser when accessing it).
    2. Create a new jobs collection and index it. The collection will be automatically recreated when the first job is run after the collection deletion, but this will not create the index. So it is better to create the collection by hand and index it.
      • $ xmldb_cmd ac -c /db/jobs -n globalJobs1 -url http://<server_name>:7080/Xindice
    3. Fix corrupted indices by removing and re-creating them. If the xmldb_cmd ri command throws exceptions, then the index was not found / did not exist. Ignore the messages and just continue with the creation of new ones.
      • $ xmldb_cmd ri -c /db/jobs/globalJobs1 -n global_job_i \
           -url http://<server_name>:7080/Xindice
      • $ xmldb_cmd ci -c /db/jobs/globalJobs1 -n "global_job_i" \
           -p global_job -url http://<server_name>:7080/Xindice
      • $ xmldb_cmd ri -c /db/jobs/globalJobs1 -n local_job_id_i \
           -url http://<server_name>:7080/Xindice
      • $ xmldb_cmd ci -c /db/jobs/globalJobs1 -n "local_job_id_i" \
           -p "local_job@Id" -url http://<server_name>:7080/Xindice
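    The sequence above can be collected into a small dry-run helper. This is only a sketch: the function name print_xmldb_reinit and the host xmldb.example.org are made up for illustration, and the script prints the commands instead of running them so that they can be reviewed first.

```shell
# Dry-run sketch: print the xmldb re-initialisation commands from steps 1-3.
# The function name and the example host are hypothetical.
print_xmldb_reinit() {
    url="http://$1:7080/Xindice"
    # Step 1: delete the jobs collection. Step 2: re-create it by hand.
    echo "xmldb_cmd dc -c /db/jobs -n globalJobs1 -url $url"
    echo "xmldb_cmd ac -c /db/jobs -n globalJobs1 -url $url"
    # Step 3: remove each index (errors can be ignored), then re-create it.
    echo "xmldb_cmd ri -c /db/jobs/globalJobs1 -n global_job_i -url $url"
    echo "xmldb_cmd ci -c /db/jobs/globalJobs1 -n \"global_job_i\" -p global_job -url $url"
    echo "xmldb_cmd ri -c /db/jobs/globalJobs1 -n local_job_id_i -url $url"
    echo "xmldb_cmd ci -c /db/jobs/globalJobs1 -n \"local_job_id_i\" -p \"local_job@Id\" -url $url"
}

print_xmldb_reinit xmldb.example.org
```

    Once the printed commands look right, they can be pasted into the shell one at a time.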
    Contributed by:
    ffiedler

    Updated by:
    bellavan

        79) What do I do if samgrid.fnal.gov is not responding to web queries or MC job submission?
    (Extracted from
    IT#3036 )
    If samgrid.fnal.gov is not responding to web queries or job submissions, the status of the CONDOR processes needs to be checked. Example error messages of this type are "ERROR: Failed to connect to local queue manager" and "AUTHENTICATE:1002:Failure performing handshake". To check the processes,
    1. Log in to the machine samgrid.fnal.gov as user sam. If you get a "permission denied" message, please request an expert to take care of the rest of this process.
    2. Check if the scheduler has been restarted recently by looking at the process timestamp:
      • $ ps -efwww| grep condor_
      The resulting output should look like this:
      sam       4595     1  0 Sep04 ?        00:04:43 condor_master -pidfile master_pid
      sam       4601     1  0 Sep04 ?        00:02:33 condor_master -pidfile master_pid
      sam      26585  4595  0 Sep06 ?        00:11:57 condor_negotiator -f
      sam      26633  4595  0 Sep06 ?        00:22:46 condor_collector -f
      sam      28197  4601  6 Sep23 ?        01:49:42 condor_schedd -f
      followed by condor_gridmaster processes for any currently running jobs. If the time stamp on the condor_schedd process is very recent, the condor_master may have decided that the condor_schedd was hung and restarted it. This restart may take several minutes, so wait a few minutes and try your submissions again.
    3. A recent process restart can also be seen by looking at the condor log files :
      • $ setup jim_broker_client
      • $ less `condor_config_val master_log`
      • $ less `condor_config_val schedd_log`
    4. If the services are not running, try to restart them with:
      • $ ups stop server_run jim_broker_client
      • $ ups start server_run jim_broker_client
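    The timestamp check in step 2 can be sketched as a small filter that pulls the start time of condor_schedd out of the ps listing. The function name schedd_start_time is hypothetical, and the sketch assumes the column layout of the sample output above (start time in the fifth column, command in the eighth).

```shell
# Sketch: print the start-time column for condor_schedd from "ps -ef" style
# output read on standard input. The function name is hypothetical.
schedd_start_time() {
    awk '$8 ~ /condor_schedd$/ { print $5 }'
}

# Live use would be:  ps -efwww | schedd_start_time
# Here the filter is fed two lines of the sample listing instead.
schedd_start_time <<'EOF'
sam      26633  4595  0 Sep06 ?        00:22:46 condor_collector -f
sam      28197  4601  6 Sep23 ?        01:49:42 condor_schedd -f
EOF
# prints: Sep23
```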
    Contributed by:
    raimund

    Updated by:
    bellavan

        80) What to do if the Plone Issue Tracker is not working ?
    If you find that the Plone Issue Tracker is not working, please first
    check that "https://plone3.fnal.gov/SAMGrid" comes up.
    
    1) If it does not, then please send e-mail to "helpdesk@fnal.gov" and
    tell them that "https://plone3.fnal.gov/SAMGrid" is not responding.
    
    2) If that page does come up, then send e-mail to the helpdesk saying
    that "The Plone-collector Issue Tracker" for
    "https://plone3.fnal.gov/SAMGrid" is not working.
    
    In either case, please send mail to Adam Lyon too.
    
    I suspect that if the Issue tracker is not responding, the whole Plone
    site is down. So you should be following instructions #1 most often.
    
    Note that Plone is NOT supported 24x7, so don't bother to call the
    help desk after hours or on weekends.
    
    
    ( Instructions given by Adam Lyon )
    Contributed by:
    kinyip

    Updated by:
    kinyip

        81) Should we add a certificate to gridmapfile or gridmapfile-jimsam ?


    In the CVS package "sam_gsi_config", under the "grid-security" sub-directory, there are files such as gridmapfile and gridmapfile-jimsam.

    Add a subject to gridmapfile only when it belongs to a service certificate; such a subject contains CN=sam/.

    Add a subject to "gridmapfile-jimsam" when it belongs to a user certificate.
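    The rule can be sketched as a one-line check on the subject string. The function name gridmap_for_subject and both example subjects are invented for illustration; only the CN=sam/ test comes from the text above.

```shell
# Sketch: choose the grid map file for a certificate subject.
# Rule from the text: a subject containing "CN=sam/" is a service certificate.
# The function name and the example subjects are hypothetical.
gridmap_for_subject() {
    case "$1" in
        *CN=sam/*) echo "gridmapfile" ;;
        *)         echo "gridmapfile-jimsam" ;;
    esac
}

gridmap_for_subject "/DC=org/DC=example/OU=Services/CN=sam/host.example.org"
gridmap_for_subject "/DC=org/DC=example/OU=People/CN=Jane Doe"
```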
    Contributed by:
    kinyip

    Updated by:
    kinyip

        82) How to get CRC or other enstore information about a stored file ?
    The "sam get enstore crc" command no longer seems to work, because it calls the obsolete 'fileinfo' command, which does not work on Scientific Linux 4. Since the command is rarely used, it is expected to be removed rather than fixed. (Written on May 13, 2008.)

    Use the "enstore info" command instead, e.g.:

    enstore info --file=/pnfs/sam/dzero/db5/importedSimulated/p17/bphysics/generated-bygroup/generic/0000/pythia_p17.09.01_NumEv-1000000_b+b-stable-b-hadrons_bphysics_mcp17_clued0_572879600762123408133162918

    You may get something like:
    {'bfid': 'D0MS121064655300000',
     'complete_crc': 3605617153L,
     'deleted': 'no',
     'drive': 'd0enmvr55a:/dev/rmt/tps0d0n:1110054152',
     'external_label': 'PSB331L1',
     'gid': 1507,
     'location_cookie': '0000_000000000_0000040',
     'pnfs_name0': '/pnfs/sam/dzero/db5/importedSimulated/p17/bphysics/generated-bygroup/generic/0000/pythia_p17.09.01_NumEv-1000000_b+b-stable-b-hadrons_bphysics_mcp17_clued0_572879600762123408133162918',
     'pnfsid': '000200000000000003D823A0',
     'sanity_cookie': (65536L, 3814843630L),
     'size': 450420736L,
     'uid': 7816,
     'update': '2008-05-12 21:42:33.347994'}
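    If only the CRC is needed, it can be filtered out of that output. The following is a minimal sketch, assuming the output has the layout shown above; the function name enstore_crc is hypothetical. The trailing "L" on the number is how Python 2 prints long integers, so the filter drops it.

```shell
# Sketch: print the complete_crc value from "enstore info" output read on
# standard input. The function name is hypothetical.
enstore_crc() {
    sed -n "s/.*'complete_crc': \([0-9][0-9]*\)L\{0,1\},.*/\1/p"
}

# Live use would be:  enstore info --file="$pnfs_path" | enstore_crc
# Here the filter is fed the first lines of the sample output instead.
enstore_crc <<'EOF'
{'bfid': 'D0MS121064655300000',
 'complete_crc': 3605617153L,
 'deleted': 'no',
EOF
# prints: 3605617153
```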
    
    Contributed by:
    kinyip

    Updated by:
    kinyip

        83) Adding a new SRM location
    To add a new SRM location you need the new SRM URL (in the example below "srm://se01.cmsaf.mit.edu:8443/srm/managerv2?SFN=/pnfs/cmsaf.mit.edu/opportunistic/dzero/services/stagearea"), the size of the new disk, and the station name (in the example below "osg-ouhep"). Then do, e.g.:
    samadmin add node --name=srm://mit-diskonly-osg-ouhep --hw=srm --os=srm
    
    (The node name is pretty much arbitrary, provided it has "srm://" at the start and "diskonly" somewhere.)
    samadmin add disk --size=600G --station=osg-ouhep  --mountPoint='srm://mit-diskonly-osg-ouhep:srm://se01.cmsaf.mit.edu:8443/srm/managerv2?SFN=/pnfs/cmsaf.mit.edu/opportunistic/dzero/services/stagearea'
    
    The mount point is the node name and the URL joined by a colon.
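    The naming rule and the mount point construction can be sketched as two small helpers. The function names valid_srm_node and srm_mount_point are hypothetical; the values are those of the example above, and the two samadmin commands are only printed, not executed.

```shell
# Sketch: check the node-name rule ("srm://" at the start, "diskonly" somewhere)
# and build the --mountPoint value as "<node name>:<SRM URL>".
# Both function names are hypothetical.
valid_srm_node() {
    case "$1" in
        srm://*diskonly*) return 0 ;;
        *)                return 1 ;;
    esac
}

srm_mount_point() {
    printf '%s:%s\n' "$1" "$2"
}

node="srm://mit-diskonly-osg-ouhep"
url="srm://se01.cmsaf.mit.edu:8443/srm/managerv2?SFN=/pnfs/cmsaf.mit.edu/opportunistic/dzero/services/stagearea"

# Dry run: print the two commands for review.
if valid_srm_node "$node"; then
    echo "samadmin add node --name=$node --hw=srm --os=srm"
    echo "samadmin add disk --size=600G --station=osg-ouhep --mountPoint='$(srm_mount_point "$node" "$url")'"
fi
```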
    Contributed by:
    torchian

    Updated by:
    kinyip

        84) Additional method in fixing Dataset Definition Editor
    When "cmd restartapache" and restarting dbs_web_prd/tomcat don't work, try the following steps from Adam Lyon (Oct. 12, 2008):
    o Log into d0ora2 as sam
    o ps -fu sam | grep java - remember the PIDs
    o For each PID, run "pargs <pid>" and look for the one with "-DSAM_DB_SERVER_NAME=SAMDbServer.web_prd" and "-DSAM_WEB_SERVER_HOST=d0db-prd.fnal.gov"
    o Kill the PID that matches the above ("kill -9 <pid>")
    o setup sam -q prd
    o cd ~/private
    o Edit the "prd_server_list.txt" file and comment out the last line (should say "tomcat tomcat_prd v5_0_27 -Xmx1g")
    o ups update sam_bootstrap # This stops tomcat (well, you did that already above - but need to do this too)
    o Re-edit the "prd_server_list.txt" file and uncomment the tomcat line you commented out previously
    o ups update sam_bootstrap # This restarts tomcat
    o Check that the DDE at http://d0db-prd.fnal.gov:19655/sam_dataset_editor/ is up
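    The search for the right java process (the pargs step above) can be sketched as a filter that tests the argument listing of one process for both identifying strings. The function name is_dde_tomcat is hypothetical, and the sample lines fed to it below are made up to exercise the check; they are not real pargs output.

```shell
# Sketch: succeed if the text on standard input (for example the pargs output
# for one PID) contains both arguments that identify the DDE tomcat process.
# The function name is hypothetical.
is_dde_tomcat() {
    args=$(cat)
    case "$args" in
        *-DSAM_DB_SERVER_NAME=SAMDbServer.web_prd*) ;;
        *) return 1 ;;
    esac
    case "$args" in
        *-DSAM_WEB_SERVER_HOST=d0db-prd.fnal.gov*) return 0 ;;
        *) return 1 ;;
    esac
}

# Live use would loop over the java PIDs, for example:
#   pargs "$pid" | is_dde_tomcat && echo "candidate PID: $pid"
# Illustrative input only:
printf '%s\n' 'argv[1]: -DSAM_DB_SERVER_NAME=SAMDbServer.web_prd' \
              'argv[2]: -DSAM_WEB_SERVER_HOST=d0db-prd.fnal.gov' \
    | is_dde_tomcat && echo "match"
```

    Printing the candidate PID instead of killing it keeps the final "kill -9" a deliberate manual step.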
    
    Contributed by:
    kinyip

    Updated by:
    kinyip

        85) Is there a script to mark files with "content status" bad for a request-id ?
    See the instructions in DZSAM-1330.
    Contributed by:
    kinyip

    Updated by:
    kinyip

        86) How to remove a datalogger-d0ol? station from monitoring on the SAAG web page ?
    Some online SAM stations (datalogger stations at the critical monitor level on the SAAG web page) have been physically retired but are still monitored on the SAAG web page. If you send mail to run2-sys-d0online AT fnal.gov about a datalogger that is yellow or red, and the answer is that the datalogger should be removed from monitoring, you can do so by typing the following two lines after running 'kinit -F username/root' (the example is for datalogger-d0olr; substitute the right datalogger):
      $ setup sam
      $ samadmin set station monitor level --station=datalogger-d0olr --monitorLevel=ignore
    
    Contributed by:
    grenier

    Updated by:
    kinyip

        87) Uncache locations in Sam Cache (without restarting the SAM station)

    Inspired by
    DZSAM-1636, from Robert Illingworth, we can use the following command


       samadmin remove station file --station=mystation --file=myfile


    to uncache the locations in SAM cache for all copies of the files on that station, without restarting the sam station (private communication with Joel Snow, not mentioned in the ticket).

    By contrast, the command


       samadmin remove station replica


    uncaches a single copy of the file from a given station disk.

    To uncache files in a durable location, you may simply use


        sam erase file location
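
    For several files, the "samadmin remove station file" command at the top of this answer can be generated in a loop. This is a dry-run sketch only: the function name uncache_commands and the file names are hypothetical, and the commands are printed for review instead of being run.

```shell
# Sketch: print one "samadmin remove station file" command per file name read
# from standard input. The function name and the file names are hypothetical.
uncache_commands() {
    station=$1
    while IFS= read -r file; do
        [ -n "$file" ] || continue    # skip blank lines
        echo "samadmin remove station file --station=$station --file=$file"
    done
}

printf '%s\n' file_a.raw file_b.raw | uncache_commands mystation
```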
    Contributed by:
    kinyip

    Updated by:
    kinyip


    This page generated by the SAM FAQ facility.