.. CAS documentation master file, created by
   sphinx-quickstart on Wed May 6 22:44:40 2009.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Core Analysis System
====================

:Author: Adam Stokes
:Release: |release|
:Date: |today|

Introduction
------------

.. image:: cas_logo.png

Description
^^^^^^^^^^^

CAS provides a user the ability to configure an environment for core
analysis quickly. All the hassles of matching kernel versions and machine
architecture types to core dumps are detected and processed automatically.

Prerequisites
^^^^^^^^^^^^^

CAS needs at least **Python 2.3** to run. For systems that are not running
Fedora 9 or later (this would include RHEL 4/5) the EPEL repository needs to
be installed. Visit `EPEL <http://fedoraproject.org/wiki/EPEL>`_ to enable
this repository.

The amount of storage needed can be determined based on the following
information:

- the number of kernel-debuginfo packages needed
- how many core dumps will be processed

Typically it is recommended to have at least 1TB for cores and another 500GB
for the debuginfo packages.

Since analyzing a core requires a system of the same architecture as the one
the core was generated on, systems of those same architecture types need to
be available for analysis to work properly.

Configuration
^^^^^^^^^^^^^

CAS comes with one main configuration file, located at ``/etc/cas.conf``.
The overall contents of this file are shown below; each directive is then
described in turn::

    [settings]
    casuser=root
    kernels=/mnt/kernels
    rpmFilter=.*kerne.+-debuginfo-[0-9].*\.rpm
    debugs=/cores/debugs
    debugLevel=DEBUG
    workDirectory=/cores/processed
    smtphost=mail.example.com
    database=/var/db/cas/cas.db

    [maintenance]
    purgeLimit=90
    autoPurge=Yes

    [advanced]
    # crash_32=/usr/local/i386/crash
    # buffersize=None

``casuser``: (**Required**) User to run CAS as; it is recommended to run as
someone other than root.

``kernels``: (**Required**) The location where kernel-debuginfo packages are
stored. This can be anything from an NFS mount, a Samba share, or local disk
to any other type of media the CAS server can access.

``rpmFilter``: (**Required**) An emacs-style regular expression which is
passed to a ``find`` command to locate the kernel-debuginfo packages stored
under the ``kernels`` directive.

``debugs``: (**Required**) A temporary directory in which to store the
vmlinux files extracted from the kernel-debuginfo packages for processing.
This can also point to an existing directory such as ``/tmp``.

``debugLevel``: As the name suggests, sets the debug level for CAS output.
Currently the only accepted values are ``DEBUG|INFO``.

``workDirectory``: (**Required**) Defines where all processed cores will be
placed. This mount point needs the most storage assigned to it; depending on
how many cores are processed in a given timeframe, this area will fill up
quickly.

``smtphost``: Set this directive if the output of CAS processing should be
emailed to a certain address. ``Note`` that the mail server must not require
SMTP authentication.

``database``: (**Required**) Defines where the sqlite database will reside.

``purgeLimit``: Defines how many days back physical data is kept on the
system.

``autoPurge``: Yes/No setting controlling whether cas-admin automatically
purges stale data on each run.

``crash_32``: Primarily used on x86_64 systems to process x86 cores. If the
x86 version of crash is installed, this directive can be set to that crash
binary and CAS will automatically process x86 cores on an x86_64 machine.
``Note`` this is only available if the CAS server is an x86_64 machine.

``buffersize``: Extends the read buffer used when analyzing a core for a
timestamp. ``Note`` this is normally only needed for Itanium cores;
otherwise, the default is fine.
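Since ``rpmFilter`` is an emacs-style expression handed to ``find`` (whose
``-regex`` test also defaults to emacs syntax), the filter can be
sanity-checked by hand against the ``kernels`` directory before running CAS.
This is only an illustrative sketch using the example path from above, not
the exact command CAS runs internally; it should print the kernel-debuginfo
rpms CAS will be able to see::

    $ find /mnt/kernels -regex '.*kerne.+-debuginfo-[0-9].*\.rpm'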
Setup & Execution
-----------------

Preparing CAS Server
^^^^^^^^^^^^^^^^^^^^

To install the CAS package simply type::

    $ yum install cas

Once installed, edit ``/etc/cas.conf`` as root using any preferred text
editor. As described above, the required directives need to be altered to
suit the environment in question.

In this example, ``/mnt/kernels`` is an NFS mount which houses the
kernel-debuginfo packages, ``/cores`` is where all processed cores are
stored, and ``/tmp`` is the temporary storage for collecting the necessary
data from the kernel-debuginfos. A mail server is set up within the
environment to email CAS results, and the optional ``smtphost`` directive is
set to reflect that. Finally, the CAS server is an x86_64 machine and the
environment will be processing x86 cores; therefore, the ``crash_32``
directive is uncommented and the path to the x86 crash binary is given.
``Note`` there is information provided within the configuration file for
installing the x86 crash to a different location.

Altering the configuration to reflect the above assumptions would show the
following::

    [settings]
    casuser=cas
    kernels=/mnt/kernels
    rpmFilter=.*kerne.+-debuginfo-[0-9].*\.rpm
    debugs=/tmp
    debugLevel=DEBUG
    workDirectory=/cores
    smtphost=mail.cas-server.com
    database=/var/db/cas/cas.db

    [maintenance]
    purgeLimit=90
    autoPurge=Yes

    [advanced]
    crash_32=/usr/local/i386/crash
    # buffersize=None

Now that the configuration file is altered and ``/mnt/kernels`` is populated
with kernel-debuginfo rpms, the next section describes running CAS.

Running CAS
^^^^^^^^^^^

First, one or two administrative tasks need to be run. The required task is
to build a database for all the data gathered from the kernel-debuginfo
packages::

    $ cas-admin -b

If several systems are deployed for CAS to use, ssh keys must be set up
between the host (CAS) and the clients. ``Note`` that CAS only supports
passwordless entries at this time::

    (cas-server) $ ssh-keygen -t dsa
    (cas-server) $ ssh-copy-id -i ~/.ssh/id_dsa casuser@cas-client-system.com
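Before going further it is worth confirming that key-based login actually
works. A minimal check, reusing the example client name from above: if the
following returns without prompting for a password, the keys are in place::

    (cas-server) $ ssh casuser@cas-client-system.com true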
Once ssh has been set up between systems, the following will build the
server database::

    $ cas-admin -s

Please note that in order for CAS to function properly, the cas user's ssh
host key file must contain only entries for systems that CAS can access.
CAS will error with ``Authentication Failed`` and exit cleanly if it runs
into any system that it cannot communicate with.

At this point CAS is configured, and looking at the output of CAS help there
are a few options to pass::

    Usage: cas [opts] args

    Options:
      -h, --help            show this help message and exit
      -i IDENTIFIER, --identifier=IDENTIFIER
                            Unique ID for core
      -f FILENAME, --file=FILENAME
                            Filename
      -e EMAIL, --email=EMAIL
                            Define email for results (must be valid!)
      -m, --modules         Extract associated kernel modules

CAS prepares its directory hierarchy based on the ``identifier``, so this
option is required. ``filename`` is also required as it tells CAS exactly
which core to process and associate with ``identifier``. To receive email
results from CAS, simply pass the email parameter as well.

As an example, a user wanting to process a corefile named ``vmcore.12345``
would run::

    $ cas -i 1 -f vmcore.12345 -e user@cas-server.com

In the above example the assumption is that ``1`` refers to some form of
ticketing system, so, to keep things organized, the identifier was set to
that number. The directory hierarchy for the current job should look like
``/cores/1``. In addition to the processed core files, this directory also
contains a ``process log`` for each job processed. If multiple jobs for the
same identifier are issued, each is placed within a subdirectory named after
the current timestamp, along with the relevant data associated with it.

The last option worth mentioning is for core analysts who need to work
within a core that requires one of the kernel modules loaded at the time of
the crash. The modules can be extracted by passing the ``modules`` parameter
in the CAS execution statement, as shown in the sketch below. ``Note`` the
``modules`` parameter is not heavily used but can be useful when analyzing
filesystem issues and the like.
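For instance, an analyst who also needs the kernel modules that were loaded
at crash time could append the ``modules`` flag to the earlier invocation (a
sketch reusing the example identifier, corefile, and address from above)::

    $ cas -i 1 -f vmcore.12345 -e user@cas-server.com -m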
From this point on, CAS will download, process, and email the results of its
initial analysis to the specified email address. Further instructions are
then provided in either the email or the ``process log`` on how to access
and analyze the core.

Analyzing
---------

Continuing with the previous example, the results of CAS processing should
be emailed and look something similar to::

    Subject: CAS results for 1
    Date: Tue, 06 May 2009 08:41:20 -0500

    Location: /cores/1/2009.05.06.08.41.20
    Server: x86_64.cas-server.com

    Output data:

    PID: 0      TASK: ffffffff803e9b80  CPU: 0   COMMAND: "swapper"
     #0 [ffffffff8047a0a0] smp_call_function_interrupt at ffffffff8011d191
     #1 [ffffffff8047a0b0] call_function_interrupt at ffffffff80110bf5
    --- <IRQ stack> ---
     #2 [ffffffff80529f08] call_function_interrupt at ffffffff80110bf5
        [exception RIP: default_idle+32]
        RIP: ffffffff8010e7a9  RSP: ffffffff80529fb8  RFLAGS: 00000246
        RAX: 0000000000000000  RBX: 0000000000000000  RCX: 0000000000000018
        RDX: ffffffff8010e789  RSI: ffffffff803e9b80  RDI: 0000010008001780
        RBP: 0000000000000000   R8: ffffffff80528000   R9: 0000000000000080
        R10: 0000000000000100  R11: 0000000000000004  R12: 0000000000000000
        R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
        ORIG_RAX: fffffffffffffffa  CS: 0010  SS: 0018
     #3 [ffffffff80529fb8] cpu_idle at ffffffff8010e81c

    PID: 0      TASK: 100f57cb030       CPU: 1   COMMAND: "swapper"
     #0 [1000107bfa0] smp_call_function_interrupt at ffffffff8011d191
     #1 [1000107bfb0] call_function_interrupt at ffffffff80110bf5
    --- <IRQ stack> ---
     #2 [10001073e98] call_function_interrupt at ffffffff80110bf5
        [exception RIP: default_idle+32]
        RIP: ffffffff8010e7a9  RSP: 0000010001073f48  RFLAGS: 00000246
        RAX: 0000000000000000  RBX: 0000000000000e86  RCX: 0000000000000018
        RDX: ffffffff8010e789  RSI: 00000100f57cb030  RDI: 00000102000a4780
        RBP: 0000000000000001   R8: 0000010001072000   R9: 0000000000000040
        R10: 0000000000000000  R11: 0000000000000008  R12: 0000000000000000
        R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
        ORIG_RAX: fffffffffffffffa  CS: 0010  SS: 0018
     #3 [10001073f48] cpu_idle at ffffffff8010e81c

    PID: 6122   TASK: 101f3658030       CPU: 2   COMMAND: "gfs_quotad"
     #0 [101f21efb20] start_disk_dump at ffffffffa03183ff
     #1 [101f21efb50] try_crashdump at ffffffff8014cc1d
     #2 [101f21efb60] die at ffffffff80111c90
     #3 [101f21efb80] do_invalid_op at ffffffff80112058
     #4 [101f21efc40] error_exit at ffffffff80110e1d
        [exception RIP: do_dlm_lock+366]
        ... snip ...

This email provides a location (``Location: /cores/1/2009.05.06.08.41.20``)
and the server on which further analysis can be continued
(``x86_64.cas-server.com``). Normally, from a support perspective, this
email should contain enough information for a kernel engineer to begin
debugging the problem. Should more be needed, the information provided above
will prove beneficial for anyone wishing to access this data. Logging into
the stated server and changing into the directory given, several files are
presented::

    $ pwd
    /cores/1/2009.05.06.08.41.20
    $ ls
    1.log  crash   crash.in  crash.out  usr  vmcore.12345
    log    memory  modules   sys        traceback

``1.log``: contains any informational messages presented during the
processing of the core. Everything from informational to debug statements is
provided here.

``crash``: an autogenerated script providing an automated way of gathering
initial data from the coredump. ``Note`` if this crash wrapper is to be used
in a more manual fashion, some alterations to the script need to occur. The
crash wrapper in its original form::

    #!/bin/sh
    /usr/bin/crash \
    /cores/1/2009.05.06.08.41.20/vmcore.12345 \
    usr/*/*/*/*/2.6.9*largesmp/vmlinux $*

``Note`` running the crash wrapper manually will result in an interactive
crash instance.

**Alternative to using the crash wrapper**

It is possible to specify the vmlinux and corefile with crash directly on
the command line::

    $ crash /cores/1/2009.05.06.08.41.20/usr/*/*/*/*/2.6.9*largesmp/vmlinux \
      /cores/1/2009.05.06.08.41.20/vmcore.12345

``crash.in``: a list of commands to be read into crash during the automated
analysis::

    bt >> traceback
    bt -a >> traceback
    sys >> sys
    sys -c >> sys
    log >> log
    mod >> modules
    kmem >> memory
    kmem -f >> memory
    exit

This can be extended by adding more snippets into ``/var/lib/cas/snippets``.
Please see that directory for examples.

``crash.out``: the output of the initial crash analysis; this is the same
data that is sent by email if an address is defined.

``usr``: directory structure from the extraction of the vmlinux file from
the associated kernel-debuginfo rpm for use within crash::

    /cores/1/2009.05.06.08.41.20/
        usr/lib/debug/lib/modules/2.6.9-78.18.ELlargesmp/vmlinux

``vmcore.12345``: the corefile, which was either given directly or extracted
from a compressed archive during CAS initialization.

Troubleshooting
---------------

Most of the major problems that arise when using CAS boil down to improper
usage of the compression and archiving tools. When compressing a core which
may need to be sent over the network to a CAS server, one proper way to do
so is::

    $ tar cvjf vmcore.12345.tar.bz2 vmcore.12345

Other valid ways of compressing or archiving are as follows::

    $ tar cvzf vmcore.tar.gz vmcore
    $ gzip vmcore
    $ bzip2 vmcore

``Note``: please do not double compress or CAS will fail.

Another issue, which is not primarily a fault of CAS, is incomplete or
corrupted cores. If either of these occurs, there is a chance that CAS will
not be able to process the data needed to associate a debug kernel or do any
sort of automated analysis. Unfortunately, there is not much that can be
done to resolve these sorts of issues other than verifying that the process
that runs when a system dumps core, and that delivers the dump to the system
specified for retrieval, is solid and produces no errors.

Resources
=========

* `CAS Wiki `_
* `CAS FAQ `_
* `Mailing list `_
* `Upstream releases `_
* Checkout latest from Git: ``git clone git://git.fedorahosted.org/cas.git``