Administer LSF
Abstract
Overview
Cluster management essentials
- Work with your cluster
- Viewing cluster information
- Control daemons
- Controlling mbatchd
- LSF daemon startup control
- Overview
- Configuration to enable
- LSF daemon startup control behavior
- Configuration to modify
- Commands
- Commands to reconfigure your cluster
- Reconfiguring with the lsadmin and badmin commands
- Reconfiguring by restarting the mbatchd daemon
- Viewing configuration errors
- Live reconfiguration
- bconf command authentication
- Enabling live reconfiguration
- Adding a user share to a fairshare queue
- View bconf records
- Merge configuration files
- Adding cluster administrators
- Working with hosts
- Host status
- View host information
- Customize host information output
- Customize host load information output
- Control hosts
- Connect to an execution host or container
- Host names
- Hosts with multiple addresses
- Use IPv6 addresses
- Specify host names with condensed notation
- Job directories and data
- Directory for job output
- Specify a directory for job output
- Temporary job directories
- About flexible job CWD
- About flexible job output directory
- Job notification
- Disable job email
- Size of job email
Monitoring cluster operations and health
- Monitor cluster performance
- Monitor performance metrics in real time
- Diagnose query requests
- Diagnose scheduler buckets
- Monitor scheduler efficiency and overhead
- Monitor job information
- View host-level and queue-level suspending conditions
- View job-level suspending conditions
- View resume thresholds
- View job priority information
- View job dependencies
- View information about backfill jobs
- View information about job start time
- View the run limits for interruptible backfill jobs (bjobs and bhist)
- Display available slots for backfill jobs
- Viewing job array information
- View information about reserved job slots
- View configured job slot share
- View slot allocation of running jobs
- Monitor applications by using external scripts
- Create external scripts
- Configure the application profiles
- Use the application profiles
- View resource information
- View job-level resource requirements
- View queue-level resource requirements
- View shared resources for hosts
- View load on a host
- View job resource usage
- View cluster resources (lsinfo)
- View host resources (lshosts)
- Viewing host load by resource (lshosts -s)
- Customize host resource information output
- View resource reservation information
- View host-level resource information (bhosts)
- View queue-level resource information (bqueues)
- View reserved memory for pending jobs (bjobs)
- View per-resource reservation (bresources)
- View information about resource allocation limits
- View application profile information
- View available application profiles
- View fairshare information
- View queue-level fairshare information
- View cross-queue fairshare information
- View hierarchical share information for a group
- View hierarchical share information for a host partition
- View host partition information
- Viewing information about SLAs and service classes
- Monitoring an SLA
- Viewing configured guaranteed resource pools
- Viewing guarantee policy information
- View user and user group information
- View user information
- View user pending job threshold information
- Customize user information output
- View user group information
- View user share information
- View user group admin information
- View queue information
- Queue states
- View available queues and queue status
- View detailed queue information
- Customize queue information output
- View the state change history of a queue
- View queue administrators
- View exception status for queues (bqueues)
Managing job execution
- Managing job execution
- About job states
- View job information
- View all jobs for all users
- View job IDs
- View jobs for specific users
- View running jobs
- View done jobs
- View pending job information
- View job suspend reasons
- View post-execution states
- View exception status for jobs (bjobs)
- View unfinished job summary information
- View the job submission environment
- Customize job information output
- Force job execution
- Force a pending job to run
- Suspend and resume jobs
- Suspend a job
- Resume a job
- Kill jobs
- Kill a job
- Kill multiple jobs
- Kill jobs by status
- Kill and record jobs as DONE
- Force removal of a job from LSF
- Remove hung jobs from LSF
- Orphan job termination
- Send a signal to a job
- Signals on different platforms
- Send a signal to a job
- Data provenance
- Prerequisites
- Using data provenance tools
- Job file spooling
- File spooling for job input, output, and command files
- Specify job input file
- Change job input file
- Job spooling directory (JOB_SPOOL_DIR)
- Specify a job command file (bsub -Zs)
- Remote file access with non-shared file space
- Copy files from the submission host to execution host
- Specify input file
- Copy output files back to the submission host
- Job submission option files
- Specify a JSON file
- Specify a YAML file
- Specify a JSDL file
- Job data management
- Copy a file to a remote host (bsub -f)
- Use LSF Data Manager for data staging
- Use direct data staging (bsub -stage)
- Configuring direct data staging
- Submitting and running direct data staging jobs
- Job scheduling and dispatch
- Use exclusive scheduling
- Configure an exclusive queue
- Configure a host to run one job at a time
- Submit an exclusive job
- Configure a compute unit exclusive queue
- Submit a compute unit exclusive job
- Job dependency and job priority
- Job dependency scheduling
- Job dependency terminology
- Dependency conditions
- Job priorities
- User-assigned job priority
- Configure job priority
- Specify job priority
- Automatic job priority escalation
- Configure job priority escalation
- Absolute priority scheduling
- Enable absolute priority scheduling
- Modify the system APS value (bmod)
- Configure APS across multiple queues
- Job priority behavior
- Job requeue and job rerun
- About job requeue
- Automatic job requeue
- Configure automatic job requeue
- Job-level automatic requeue
- Configure reverse requeue
- Exclusive job requeue
- Configure exclusive job requeue
- Requeue a job
- Automatic job rerun
- Configure queue-level job rerun
- Submit a rerunnable job
- Submit a job as not rerunnable
- Disable post-execution for rerunnable jobs
- Job start time prediction
- Job affinity scheduling with host attributes
- Configure host attributes
- Manage host attributes
- Submit jobs with attribute affinity
- Control job execution
- Pre-execution and post-execution processing
- About pre- and post-execution processing
- Configuration to enable pre- and post-execution processing
- Pre- and post-execution processing behavior
- Check job history for a pre-execution script failure
- Configuration to modify pre- and post-execution processing
- Set host exclusion based on job-based pre-execution scripts
- Pre- and post-execution processing commands
- Job starters
- About job starters
- Command-level job starters
- Queue-level job starters
- Configure a queue-level job starter
- JOB_STARTER parameter (lsb.queues)
- Control the execution environment with job starters
- Job control actions
- Submit jobs as other users
- External job submission and execution controls
- Job submission and execution controls
- Configuration to enable job submission and execution controls
- Job submission and execution controls behavior
- Configuration to modify job submission and execution controls
- Job submission and execution controls commands
- Command arguments for job submission and execution controls
- Interactive jobs and remote tasks
- Interactive jobs with bsub
- About interactive jobs
- Submit interactive jobs
- Submit an interactive job
- Submit an interactive job by using a pseudo-terminal
- Submit an interactive job and redirect streams to files
- Submit an interactive job, redirect streams to files, and display streams
- Performance tuning for interactive batch jobs
- Interactive batch job messaging
- Configure interactive batch job messaging
- Example messages
- Run X applications with bsub
- Configure SSH X11 forwarding for jobs
- Write job scripts
- Register utmp file entries for interactive batch jobs
- Interactive and remote tasks
- Run remote tasks
- Run a task on the best available host
- Run a task on a host with specific resources
- Resource usage
- Run a task on a specific host
- Run a task by using a pseudo-terminal
- Run the same task on many hosts in sequence
- Run parallel tasks
- Run tasks on hosts specified by a file
- Interactive tasks
- Redirect streams to files
- Load sharing interactive sessions
- Log on to the least loaded host
- Log on to a host with specific resources
Configuring and sharing job resources
- About LSF resources
- Resource categories
- How LSF uses resources
- Representing job resources in LSF
- Batch built-in resources
- Static resources
- How LIM detects cores, threads, and processors
- Define ncpus—processors, cores, or threads
- Define computation of ncpus on dynamic hosts
- Define computation of ncpus on static hosts
- Load indices
- About configured resources
- Add new resources to your cluster
- Configure the lsf.shared resource section
- Configure lsf.cluster.cluster_name Host section
- Configure lsf.cluster.cluster_name ResourceMap section
- Reserve a static shared resource
- External load indices
- About external load indices
- Configuration to enable external load indices
- Define a dynamic external resource
- Map an external resource
- Create an elim executable file
- Overriding built-in load indices
- Setting up an ELIM to support JSDL
- Example of an elim executable file
- External load indices behavior
- Configuration to modify external load indices
- External load indices commands
- External static load indices
- Configuration to enable external static load indices
- Create eslim executable files
- Example of an eslim executable file
- Modify a built-in load index
- Configure host resources
- Adding a host to your cluster
- Add hosts dynamically
- Configuring and running batch jobs on dynamic hosts
- Change a dynamic host to a static host
- Add a dynamic host in a shared file system environment
- Add a dynamic host in a non-shared file system environment
- Add a host to the cluster using bconf
- Removing a host from your cluster
- Remove a host from management candidate list
- Remove dynamic hosts
- Share resources in queues
- Controlling queues
- Closing a queue
- Opening a queue
- Deactivating a queue
- Activating a queue
- Logging a comment on a queue control command
- Configuring dispatch windows
- Configuring run windows
- Adding a queue
- Removing a queue
- Restricting which hosts can use queues
- Restricting job size requested by parallel jobs in a queue
- Adding queue administrators
- Change job order within queues
- Switch jobs from one queue to another
- Switch a single job to a different queue
- Switch all jobs to a different queue
- Use external job switch controls
- Configuration to enable job switch controls
- Configuration to modify job switch controls
- Command arguments for job switch controls
- Application profiles
- Manage application profiles
- Add an application profile
- Submit jobs to application profiles
- How application profiles interact with queue and job parameters
- Application profile settings that override queue settings
- Application profile limits and queue limits
- Define application-specific environment variables
- Task limits
- Absolute run limits
- Pre-execution
- Post-execution
- Rerunnable jobs
- Resource requirements
- Estimated job run time and runtime limits
- Plan-based scheduling and reservations
- Enabling plan-based scheduling
- Plan-based allocations
- Plan-based scheduling run time
- Plan-based scheduling limits and prioritization
- Configuring extendable run limits
- Reserving resources for an allocation plan
- Canceling planned allocations
- Delaying planning for jobs
- Limiting the number of planned jobs
- Adjusting the plan window
- Distributing job resources to users in LSF
- Configure resource consumers
- User groups
- User groups in LSF
- How to define user groups
- Where to configure user groups
- Configure user groups
- Configure user group administrators
- Configure user group administrator rights
- Import external user groups (egroup)
- Existing user groups as LSF user groups
- External Host and User Groups
- About external host and user groups
- Configuration to enable external host and user groups
- External host and user groups behavior
- Between-Host User Account Mapping
- About between-host user account mapping
- Configuration to enable between-host user account mapping
- Between-host user account mapping behavior
- Between-host user account mapping commands
- Cross-Cluster User Account Mapping
- About cross-cluster user account mapping
- Configuration to enable cross-cluster user account mapping
- Cross-cluster user account mapping behavior
- Cross-cluster user account mapping commands
- UNIX/Windows User Account Mapping
- About UNIX/Windows user account mapping
- Configuration to enable UNIX/Windows user account mapping
- UNIX/Windows user account mapping behavior
- Configuration to modify UNIX/Windows user account mapping behavior
- UNIX/Windows user account mapping commands
- Create a user group using bconf
- Job groups
- Job group limits
- Create a job group
- Submit jobs under a job group
- View information about job groups (bjgroup)
- View jobs for a specific job group (bjobs)
- Job groups and time-based SLAs
- View job groups attached to a time-based SLA (bjgroup)
- Control jobs in job groups
- Suspend jobs (bstop)
- Resume suspended jobs (bresume)
- Move jobs to a different job group (bmod)
- Terminate jobs (bkill)
- Delete a job group manually (bgdel)
- Modify a job group limit (bgmod)
- Automatic job group cleanup
- Host groups
- Configure host groups
- Wildcards and special characters to define host names
- Define condensed host groups
- Specifying resource requirements
- About resource requirements
- Queue-level resource requirements
- Job-level resource requirements
- Resource requirement strings
- Selection string
- Order string
- Usage string
- Span string
- Same string
- Compute unit string
- Affinity string
- Specify GPU resource requirements
- Reserving resources
- About resource reservation
- Use resource reservation
- Configure resource reservation at the queue level
- Specify job-level resource reservation
- Configure per-resource reservation
- Memory reservation for pending jobs
- Reserve host memory for pending jobs
- Enable memory reservation for sequential jobs
- Configure lsb.queues
- Use memory reservation for pending jobs
- How memory reservation for pending jobs works
- Time-based slot reservation
- Configure time-based slot reservation
- Assumptions and limitations
- Reservation scenarios
- Examples
- Limiting job resource allocations
- How resource allocation limits work
- How job limits work
- Configuring resource allocation limits
- Enable resource allocation limits
- Configure cluster-wide limits
- Limit conflicts
- How resource allocation limits map to pre-version 7 job slot limits
- Create a limit using bconf
- Update a limit using bconf
- Make sure resources are distributed fairly
- Runtime resource usage limits
- About resource usage limits
- Changing the units for resource usage limits
- Specify resource usage limits
- Default run limits for backfill scheduling
- Specify job-level resource usage limits
- Resource usage limits syntax
- CPU time limit
- Normalized CPU time
- Data segment size limit
- File size limit
- Memory limit
- Memory limit enforcement
- Smart memory limit enforcement
- OS memory limit enforcement
- Process limit
- Runtime limit
- Normalized run time
- LSF multicluster capability runtime limit
- Thread limit
- Stack limit
- Swap limit
- Examples
- CPU time and run time normalization
- Memory and swap limit enforcement based on Linux cgroups
- PAM resource limits
- Configure a PAM file
- Load thresholds
- Automatic job suspension
- Suspending conditions
- Configure suspending conditions at queue level
- About resuming suspended jobs
- Specify resume condition
- Time configuration
- Time windows
- Time expressions
- Automatic time-based configuration
- Dispatch and run windows
- Run windows
- Configure run windows
- View information about run windows
- Dispatch windows
- Configure host dispatch windows
- Configure queue dispatch windows
- Display host dispatch windows
- Display queue dispatch windows
- Deadline constraint scheduling
- Disable deadline constraint scheduling
- Preemptive scheduling
- Resource Preemption
- About resource preemption
- Requirements for resource preemption
- Custom job controls for resource preemption
- Resource preemption steps
- Configure resource preemption
- Memory preemption
- About preemptive scheduling
- Configuration to enable preemptive scheduling
- Preemptive scheduling behavior
- Configuration to modify preemptive scheduling behavior
- Preemptive scheduling commands
- Goal-oriented SLA-driven scheduling
- Using goal-oriented SLA scheduling
- Configuring service classes for SLA scheduling
- Configure service classes using bconf
- Time-based service classes
- Configure time-based service classes
- Time-based SLA examples
- SLA CONTROL_ACTION parameter (lsb.serviceclasses)
- Submit jobs to a service class
- Modify SLA jobs (bmod)
- Global resources
- Global resource collection
- Configuring
- Using
GPU resources
- Enabling GPU features
- Automatic GPU configuration
- Enabling jobs to use GPU resources
- Optimizing GPU resource metric collection
- Nvidia Data Center GPU Manager (DCGM) features
- GPU access enforcement
- Decreasing GPU power consumption when a GPU is not in use
- Nvidia Multi-Instance GPU (MIG) features
- Monitoring GPU resources
- Monitor GPU resources with lsload command
- Monitor GPU resources with lshosts command
- Submitting and monitoring GPU jobs
- Configuring GPU resource requirements
- Submitting jobs that require GPU resources
- Monitoring GPU jobs
- Example GPU job submissions
- GPU features using ELIM
- Manually configure and use GPU resources (legacy ELIM procedure)
- Controlling GPU auto-boost
Configuring containers
- LSF with Docker
- Preparing LSF to run Docker jobs
- Configuring LSF to run Docker jobs
- Configuring LSF to run NVIDIA Docker jobs
- Submitting Docker jobs
- Submitting NVIDIA Docker jobs
- LSF with Shifter
- Configuring LSF to run Shifter jobs
- Submitting Shifter jobs
- LSF with Singularity
- Configuring LSF to run Singularity jobs
- Submitting Singularity jobs to LSF
- LSF with Podman
- Preparing LSF to run Podman jobs
- Configuring LSF to run jobs in Podman containers
- Submitting Podman jobs
- LSF with Enroot
- Configuring LSF to run jobs in Enroot containers
- Submitting Enroot jobs
High throughput workload administration
- Job packs
- Job arrays
- Create a job array
- Handle input and output files
- Prepare input files
- Pass arguments on the command line
- Set a whole array dependency
- Control job arrays
- Requeue jobs in DONE state
- Job array job slot limit
- Set a job array slot limit at submission
- Fairshare scheduling
- Understand fairshare scheduling
- Ways to configure fairshare
- Chargeback fairshare
- Configure chargeback fairshare
- Equal share
- Configure equal share
- Priority user and static priority fairshare
- Configure priority user fairshare
- Configure static priority fairshare
- Host partition fairshare
- Configure host partition fairshare
- GPU run time fairshare
- Configure GPU run time
- User-based fairshare
- Configure hierarchical fairshare
- Configure a share tree
- User share assignments
- Dynamic user priority
- Use time decay and committed run time
- Historical run time decay
- Configure historical run time
- How mbatchd reconfiguration and restart affects historical run time
- Run time decay
- Configure run time decay
- Committed run time weighting factor
- Configure committed run time
- How fairshare affects job dispatch order
- Host partition user-based fairshare
- Configure host partition fairshare scheduling
- Queue-level user-based fairshare
- Configure queue-level fairshare
- Cross-queue user-based fairshare
- Configure cross-queue fairshare
- Control job dispatch order in cross-queue fairshare
- Queue-based fairshare
- Slot allocation per queue
- Configure slot allocation per queue
- Typical slot allocation scenarios
- Users affected by multiple fairshare policies
- Submit a job and specify a user group
- Resizable jobs and fairshare
- Guaranteed resource pools
- About guaranteed resources
- Configuration overview of guaranteed resource pools
- Submitting jobs to use guarantees
- Package guarantees
- Add consumers to a guaranteed resource pool
- Reserving memory and license resources
- Memory reservation for pending jobs
- Reserve host memory for pending jobs
- Enable memory reservation for sequential jobs
- Configure lsb.queues
- Use memory reservation for pending jobs
- How memory reservation for pending jobs works
- Reserving license resources
Parallel workload administration
- Running parallel jobs
- How LSF runs parallel jobs
- Preparing your environment to submit parallel jobs to LSF
- Use a job starter
- Submit a parallel job
- Start parallel tasks with LSF utilities
- Job slot limits for parallel jobs
- Specify a minimum and maximum number of tasks
- Restrict job size requested by parallel jobs
- About specifying a first execution host
- Specify a first execution host
- Rules
- Compute units
- Control job locality using compute units
- Configure compute units
- Use wildcards and special characters to define names in compute units
- Define condensed compute units
- Import external host groups (egroup)
- Use compute units with advance reservation
- Control processor allocation across hosts
- Run parallel processes on homogeneous hosts
- Limit the number of processors allocated
- Limit the number of allocated hosts
- Reserve processors
- Configure processor reservation
- Reserve memory for pending parallel jobs
- Configure memory reservation for pending parallel jobs
- Enable per-task memory reservation
- Backfill scheduling
- Configure a backfill queue
- Enforce run limits
- Use backfill on memory
- Use interruptible backfill
- Configure an interruptible backfill queue
- Submit backfill jobs according to available slots
- How deadline constraint scheduling works for parallel jobs
- Optimized preemption of parallel jobs
- Configure optimized preemption
- Controlling CPU and memory affinity
- Submit affinity jobs
- Submit affinity jobs for IBM POWER8 systems
- Managing jobs with affinity resource requirements
- Affinity preemption
- Affinity binding based on Linux cgroup cpuset subsystem
- Portable hardware locality
- Processor binding for LSF job processes
- Enabling processor binding for LSF job processes
- Processor binding for parallel jobs
- Running parallel jobs with blaunch
- blaunch distributed application framework
- SGI vendor MPI support
- Running jobs with task geometry
- Enforcing resource usage limits for parallel tasks
- Running MPI workload through IBM Parallel Environment Runtime Edition
- Enabling IBM PE Runtime Edition for LSF
- Network-aware scheduling
- Submitting IBM Parallel Environment jobs through LSF
- Managing IBM Parallel Environment jobs through LSF
- Advance reservation
- Types of advance reservations
- Enable advance reservation
- Allow users to create advance reservations
- Use advance reservation
- Adding reservations
- Changing reservations
- Removing reservations
- Viewing reservations
- Submitting and modifying jobs that use advance reservations
- View jobs that are associated with an advance reservation
- Advance reservation behavior and operations
- Fairshare scheduling
- Understand fairshare scheduling
- Parallel fairshare
- Configure parallel fairshare
- User share assignments
- Dynamic user priority
- Use time decay and committed run time
- Historical run time decay
- Configure historical run time
- How mbatchd reconfiguration and restart affects historical run time
- Run time decay
- Configure run time decay
- Committed run time weighting factor
- Configure committed run time
- How fairshare affects job dispatch order
- Host partition user-based fairshare
- Configure host partition fairshare scheduling
- Queue-level user-based fairshare
- Configure queue-level fairshare
- Cross-queue user-based fairshare
- Configure cross-queue fairshare
- Control job dispatch order in cross-queue fairshare
- User-based fairshare
- Configure hierarchical fairshare
- Configure a share tree
- Queue-based fairshare
- Slot allocation per queue
- Configure slot allocation per queue
- Typical slot allocation scenarios
- Users affected by multiple fairshare policies
- Submit a job and specify a user group
- Ways to configure fairshare
- Host partition fairshare
- Configure host partition fairshare
- Chargeback fairshare
- Configure chargeback fairshare
- Equal share
- Configure equal share
- Priority user and static priority fairshare
- Configure priority user fairshare
- Configure static priority fairshare
- GPU run time fairshare
- Configure GPU run time
- Resizable jobs and fairshare
- Job count based fairshare
- Job checkpoint and restart
- About job checkpoint and restart
- Configuration to enable job checkpoint and restart
- Job checkpoint and restart behavior
- Configuration to modify job checkpoint and restart
- Job checkpoint and restart commands
- Job migration for checkpointable and rerunnable jobs
- Job migration behavior
- Configuration to enable job migration
- Configuration to modify job migration
- Job migration commands
- Resizable jobs
- Resizable job behavior
- Configuration to enable resizable jobs
- Resizable job commands
- Resizable job management
- Submit a resizable job
- Check pending resize requests
- Cancel an active pending request
- Specify a resize notification command manually
- Script for resizing
- How resizable jobs work with other LSF features
Security in LSF
- Security considerations
- Communications between daemons and commands
- Transmission of IBM Spectrum LSF commands for remote execution
- Access to jobs belonging to other users
- Accessing remote hosts
- False requests
- Authentication
- Secure your LSF cluster
- Secure communications between daemons and commands
- Encrypt transmission of LSF commands for remote execution and login
- Restrict user access to remote hosts
- Secure your cluster against false requests
- Customize external authentication
- Enable external authentication of LSF daemons
- Secure the cluster from root access for batch interactive jobs in pseudoterminals
- Restrict user access to administration commands and log files
- Job information access control
- Setting job information access control
- Secure the lsf.conf file and prevent users from changing the job user
- Temporarily enable root privileges
- View the cluster security settings
Advanced configuration
- Error and event logging
- System directories and log files
- About LSF log files
- Log levels and descriptions
- Manage error logs
- Set the log files owner
- View the number of file descriptors remaining
- Locate error logs
- System event log
- Duplicate logging of event logs
- Configure duplicate logging
- Set daemon message log to debug level
- Set daemon timing levels
- LSF job termination reason logging
- View logged job exit information (bacct -l)
- View recent job exit information (bjobs -l)
- Termination reasons
- LSF job exit codes
- Event generation
- Event generation
- Enable event generation for custom programs
- Events list
- Arguments passed to the LSF event program
- Customize batch command messages
- How LIM determines host models and types
- Automatically detect operating system types and versions
- Add a custom host type or model
- Automatic detection of hardware reconfiguration
- Set the external static LIM
- Shared file access
- Shared files on Windows
- Use LSF with non-shared file systems
- Shared configuration file content
- Authentication and authorization
- Change authentication method
- Authentication options
- Operating system authorization
- LSF authorization
- Authorization failure
- External authentication
- External authentication with LSF (eauth)
- Configuration to enable external authentication
- External authentication behavior
- Configuration to modify external authentication
- External authentication commands
- Kerberos authentication
- Kerberos authentication with LSF
- Configuration to enable Kerberos authentication
- Configuration to modify Kerberos authentication
- Kerberos authentication commands
- Handle job exceptions
- Email job exception details
- Default eadmin actions
- Handle job initialization failures
- Handle host-level job exceptions
- Handle job exceptions in queues
- Understand successful application exit values
- Specify successful application exit values
- Tune CPU factors
- View normalized ratings
- Tune CPU factors
- Set clean period for DONE jobs
- Enable host-based resources
- Portable hardware locality
- Define GPU resources
- Define Intel Xeon Phi resources
- Global fairshare scheduling
- Global fairshare background
- Remote fairshare load
- Sync mode of global fairshare policy
- Global fairshare setup and configuration
- Global policy daemon
- Global fairshare policy
- Global fairshare dynamic user priority
- Share load synchronization rules
- Configure queue level user-based global fairshare
- Configure cross-queue user-based global fairshare
- Global fairshare scheduling constraints
- Manage LSF on EGO
- About LSF on EGO
- LSF and EGO directory structure
- Configure LSF and EGO
- LSF and EGO corresponding parameters
- Parameters that have changed in LSF 10
- Special resource groups for LSF management hosts
- Manage LSF daemons through EGO
- Bypass EGO login at startup (lsf.sudoers)
- Set the command-line environment
- LSF features on EGO
- Supported LSF features with EGO-enabled SLA scheduling
- LSF features that require modification to work with EGO-enabled SLA scheduling
- Unsupported LSF features with EGO-enabled SLA scheduling
- Logging and troubleshooting
- EGO log files
- Troubleshooting using multiple EGO log files
- Frequently asked questions
- Load sharing X applications
- Start an xterm
- xterm on a PC
- Set up Exceed to log on the least loaded host
- Start an xterm in Exceed
- Examples
- Using LSF with the Etnus TotalView Debugger
- How IBM Spectrum LSF Works with TotalView
- Running jobs for TotalView debugging
- Controlling and monitoring jobs being debugged in TotalView
- Register LSF host names and IP addresses to LSF servers
Performance tuning
- Tune your cluster
- Tune LIM
- Load thresholds
- Compare LIM load thresholds
- LIM reports a host as busy
- Interactive jobs
- Multiprocessor systems
- How LSF works with LSF_MASTER_LIST
- Using a DNS host cache to improve cluster startup performance
- Improve mbatchd response time after mbatchd restart
- Improve mbatchd query performance
- Configuring mbatchd to use multithreading
- Multithread batch queries
- Setting a dedicated query port for mbatchd
- Specify an expiry time for child mbatchd
- Configure mbatchd to push new job information to child mbatchd
- Specify hard CPU affinity
- Logging mbatchd performance metrics
- Logging mbatchd and mbschd profiling information
- Improve performance of mbatchd for job array switching events
- Increase queue responsiveness
- Automatically bind LSF daemons to specific CPU cores
- Use LSF Explorer to improve the performance of the bacct and bhist commands, or to retrieve additional data
- Improve slot utilization by preventing bwait from running in jobs
- Achieve performance and scalability
- Optimize performance in large sites
- Tune UNIX for large clusters
- Increase the file descriptor limit
- Tune LSF for large clusters
- Manage scheduling performance
- Enable fast job dispatch
- Enable continuous scheduling
- Use scheduler threads to evaluate resource requirement matching
- Limit job dependency evaluation
- Limit the number of batch queries
- Improve the speed of host status updates
- Limit your users’ ability to move jobs in a queue
- Manage the number of pending reasons
- Achieve efficient event switching
- Automatic load updates
- Manage I/O performance of the info directory
- Job ID limit
Energy aware scheduling
- Managing host power states
- Configuring host power state management
- Power parameters in lsb.params
- PowerPolicy section in lsb.resources
- Controlling and monitoring host power state management
- Valid host statuses for power saved mode
- Disabling the power operation feature
- Changing lsf.shared / lsf.cluster
- Integration with Advance Reservation
- Integration with provisioning systems
- CPU frequency management
- Configuring CPU frequency management
- Specifying CPU frequency management for jobs
- Job energy usage reporting
- Resource usage in job summary email
- Automatic CPU frequency selection
- Prerequisites
- Configure MySQL database
- Configuring automatic CPU frequency selection
- Installing and configuring benchmarking programs
- Checking compute node performance
- Calculating coefficient data
- Creating an energy policy tag
- Energy policy tag format
- Generate an energy policy tag
- Enable automatic CPU frequency selection
LSF multicluster capability
- Overview of LSF multicluster capability
- Benefits of IBM Spectrum LSF multicluster capability
- Two Platform MultiCluster models
- Set up LSF multicluster capability
- Setup overview
- System requirements
- Installation and configuration procedures
- Install Platform MultiCluster
- Set common ports
- Non-uniform name spaces
- User-level account mapping
- Restricted awareness of remote clusters
- Add or modify RemoteClusters list
- Security of daemon communication
- Authentication between clusters
- Resource usage updates for MultiCluster jobs
- Global limits for job resource allocations
- Configure global limits
- View the global limits
- MultiCluster information cache
- Job forwarding model
- Job forwarding model overview
- Job scheduling under the job forwarding model
- Queue scheduling parameters under job forwarding model
- Advance reservations across clusters
- Special considerations under job forwarding model
- Job migration
- Checkpoint a MultiCluster job
- Absolute priority scheduling
- Strict resource requirement select string syntax
- Compute unit requirement strings
- Multicluster queues
- Enable multicluster queues
- Remote-only queues
- Configure a remote-only queue
- Request a specific cluster
- Remote cluster equivalency
- Remote Resources
- Remote queue workload job-forwarding scheduler
- Enable queue preference
- Configure queue preference
- Enable job slot limit
- Configure pending job slot limit
- Pre-exec retry threshold
- Retry threshold and suspend notification
- Pending MultiCluster job limit
- Update pending reason for MultiCluster jobs
- Configure the pending reason updating interval
- Configure the pending reason update package size
- Remote timeout limit
- Enable job priority in MultiCluster job forward mode
- Specify a job priority (bsub -sp)
- Configure maximum job priority
- Enhance fairshare calculation to include the job forwarding mode
- Resource leasing model
- Lease model overview
- Using the lease model
- Special considerations under resource leasing model
- Resource export
- Create an export policy
- Export workstations
- Distribution policy for automatically selected hosts
- Export special hosts
- Distribution policy for named hosts
- Export other resources
- Export shared resources
- Shared lease
- Enable shared leasing
- Borrow resources
- Parallel jobs and the lease model