Decimal Arithmetic Specification, version 1.08. Copyright (c) IBM Corporation, 2003. All rights reserved. 8 Jan 2003

The Arithmetic Model

This specification is based on a model of decimal arithmetic which is a formalization of the decimal system of numeration (Algorism) as further defined and constrained by the relevant standards (IEEE 854 and ANSI X3.274).

There are three components to the model:

numbers – which represent the values which can be manipulated by, or be the results of, the core operations defined in this specification
operations – the core operations (such as addition, multiplication, etc.) which can be carried out on numbers
context – which represents the user-selectable parameters and rules which govern the results of arithmetic operations (for example, the precision to be used).

This specification defines these components in the abstract. It neither defines the way in which operations are expressed (which might vary depending on the computer language or other interface being used),^[1] nor does it define the concrete representation (specific layout in storage, or in a processor's register, for example) of numbers or context.

The remainder of this section describes the abstract model for each component.

Abstract representation of numbers

Numbers represent the values which can be manipulated by, or be the results of, the core operations defined in this specification.

Numbers may be finite numbers (numbers whose value can be represented exactly) or they may be special values (infinities and other values which are not finite numbers).

Finite numbers

Finite numbers are defined by three integer parameters:

sign – a value which must be either 0 or 1, where 1 indicates that the number is negative or is the negative zero and 0 indicates that the number is zero or positive.
coefficient – an integer which may be zero or positive.
In the abstract, there is no upper limit on the maximum size of the coefficient. In practice, an implementation may need to define a specific upper limit (for example, the length of the maximum coefficient supported by the concrete representation). This limit must be expressed as an integral number of decimal digits.^[2]
exponent – a signed integer which indicates the power of ten by which the coefficient is multiplied.
In the abstract, there is no upper limit on the absolute value of the exponent. In practice there may be some upper limit, E_limit, on the absolute value of the exponent. It is recommended that this limit be expressed as an integral number of decimal digits or be one of the numbers 1, 5, or 25, multiplied by a positive integral power of ten and optionally reduced by one (for example, 49 or 50).
If the coefficient has a maximum length then it is required ^[3] that E_limit be greater than 5 × mlength, where mlength is the maximum length of the coefficient in decimal digits. It is recommended that E_limit be greater than 10 × mlength.
The adjusted exponent is the value of the exponent of a number when that number is expressed as though in scientific notation with one digit (non-zero unless the coefficient is 0) before any decimal point. This is given by the value of the exponent+(clength–1), where clength is the length of the coefficient in decimal digits.
When a limit to the exponent applies, it must result in a balanced range of positive or negative numbers,^[4] taking into account the magnitude of the coefficient. To achieve this balanced range, the minimum and maximum values of the adjusted exponent (E_min and E_max respectively) must have the same magnitude. E_max will always equal E_limit (the largest value of the exponent) and E_min will always equal –E_max.
Therefore, if the length of the coefficient is clength digits, the exponent may take any of the values –E_limit–(clength–1) through E_limit–(clength–1).
For example, if the coefficient had the value 123456789 (9 digits) and the exponent had an E_limit of 999 (3 digits), then the exponent could range from –1007 through +991. This would allow positive values of the number to range from 1.23456789E–999 through 1.23456789E+999.

The numerical value of a finite number is given by: (–1)^sign × coefficient × 10^exponent
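
As a minimal sketch of the definitions above (Python is used purely for illustration, and the helper names are hypothetical rather than part of this specification), the value, adjusted exponent, and permitted exponent range of a finite number can be computed directly from its parameters:

```python
from fractions import Fraction

def numerical_value(sign, coefficient, exponent):
    """Exact value: (-1)^sign * coefficient * 10^exponent."""
    return (-1) ** sign * coefficient * Fraction(10) ** exponent

def adjusted_exponent(coefficient, exponent):
    """Exponent when the number is written with one digit before the point."""
    clength = len(str(coefficient))  # length of the coefficient in digits
    return exponent + (clength - 1)

def exponent_range(clength, e_limit):
    """Smallest and largest exponent allowed for a clength-digit coefficient."""
    return (-e_limit - (clength - 1), e_limit - (clength - 1))
```

With a 9-digit coefficient and an E_limit of 999, exponent_range(9, 999) gives (-1007, 991), and adjusted_exponent(123456789, -1007) gives -999, in agreement with the example above.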

Notes:

Many concrete representations for finite numbers have been used successfully. Typically, the coefficient is represented in some form of binary coded or packed decimal, or is encoded using a base which is a higher power of ten. It may also be expressed as a binary integer. The exponent is typically represented by a twos complement or biased binary integer. Some possible concrete representations are described in detail at: http://www2.hursley.ibm.com/decimal/deccode.html
This abstract definition deliberately allows for multiple representations of values which are numerically equal but are visually distinct (such as 1 and 1.00). However, there is a one-to-one mapping between the abstract representation and the result of the primary conversion to string using to-scientific-string on that abstract representation. In other words, if one number has a different abstract representation to another, then the primary string conversion will also be different.
No such constraint applies to the concrete representation (that is, there may be multiple concrete representations of a single abstract representation).
A number with a coefficient of 0 is permitted to have a non-zero sign. This negative zero is accepted as an operand for all operations (see IEEE 854 §3.1).
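
The distinction between numerical equality and abstract representation can be observed in an existing implementation. The sketch below uses Python's standard decimal module as one example, on the assumption that its str() conversion follows to-scientific-string; it is an illustration, not a definition of the required behavior.

```python
from decimal import Decimal

one = Decimal("1")
one_00 = Decimal("1.00")

# Numerically equal, yet the abstract representations differ.
equal_in_value = (one == one_00)
same_representation = (one.as_tuple() == one_00.as_tuple())

# Distinct representations give distinct primary string conversions.
strings = (str(one), str(one_00))

# Negative zero: a coefficient of 0 with a sign of 1.
neg_zero = Decimal("-0")
neg_zero_sign = neg_zero.as_tuple().sign
```

Here equal_in_value is true while same_representation is false, and neg_zero compares equal to zero although its sign is 1.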

Special values

In addition to the finite numbers, numbers must also be able to represent one of three named special values:

infinity – a value representing a number whose magnitude is infinitely large (∞, see IEEE 854 §6.1)
quiet NaN – a value representing undefined results (‘Not a Number’) which does not cause an Invalid operation condition. IEEE 854 recommends that additional diagnostic information be associated with quiet NaNs (see IEEE 854 §6.2)
signaling NaN – a value representing undefined results (‘Not a Number’) which will cause an Invalid operation condition if used in any operation defined in this specification (see IEEE 854 §6.2).

When a number has one of these special values, its coefficient and exponent are undefined.^[5] The sign, however, is significant (that is, it is possible to have both positive and negative infinity, and the sign of a NaN is always 0).
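
The different behavior of quiet and signaling NaNs might be sketched as follows, again using Python's decimal module as an assumed example implementation. The context has every trap-enabler set to 0, so a defined result is returned instead of an error being reported.

```python
from decimal import Context, Decimal, InvalidOperation

ctx = Context(prec=9, traps=[])  # all trap-enablers are 0

neg_inf = Decimal("-Infinity")
quiet = Decimal("NaN")
signaling = Decimal("sNaN")
kinds = (neg_inf.is_infinite(), quiet.is_qnan(), signaling.is_snan())

# A quiet NaN propagates without an Invalid operation condition.
r_quiet = ctx.add(quiet, Decimal(1))
quiet_flag = bool(ctx.flags[InvalidOperation])

# A signaling NaN causes an Invalid operation condition when used.
r_signaling = ctx.add(signaling, Decimal(1))
signaling_flag = bool(ctx.flags[InvalidOperation])
```

In this sketch both results are quiet NaNs, but only the second operation sets the invalid-operation flag.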

Subnormal numbers and Underflow

Numbers whose adjusted exponents are less than E_min are called subnormal numbers.^[6] These subnormal numbers are accepted as operands for all operations, and may result from any operation. If a result is subnormal, before any rounding, then the Subnormal condition is raised.

For a subnormal result, the minimum value of the exponent becomes –E_limit–(precision–1), called E_tiny, where precision is the working precision, as described below. The result will be rounded, if necessary, to ensure that the exponent is no smaller than E_tiny. If, during this rounding, the result becomes inexact, then the Underflow condition is raised. A subnormal result does not necessarily raise Underflow, therefore, but is always indicated by the Subnormal condition (even if, after rounding, its value is 0).

When a number underflows to zero during a calculation, its exponent will be E_tiny. The maximum value of the exponent is unaffected.

Note that the minimum value of the exponent for subnormal numbers is the same as the minimum value of exponent which can arise during operations which do not result in subnormal numbers, which occurs in the case where clength = precision.
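
The relationship between Subnormal, Underflow, and E_tiny can be illustrated with a small worked example. This sketch assumes Python's decimal module, whose Emin and Emax settings bound the adjusted exponent (so Emin plays the role of –E_limit here); the precision of 5 and E_limit of 9 are arbitrary illustrative choices.

```python
from decimal import Context, Decimal, Subnormal, Underflow

ctx = Context(prec=5, Emin=-9, Emax=9, traps=[])
e_tiny = ctx.Etiny()  # -E_limit - (precision - 1) = -9 - 4 = -13

def run(a, b):
    """Multiply under ctx; return (result, subnormal raised, underflow raised)."""
    ctx.clear_flags()
    result = ctx.multiply(Decimal(a), Decimal(b))
    return result, bool(ctx.flags[Subnormal]), bool(ctx.flags[Underflow])

# 1E-11: adjusted exponent below E_min, but exact, so no Underflow.
exact_subnormal = run("1E-8", "1E-3")

# 12345E-16 must be rounded to exponent E_tiny and becomes inexact.
rounded_subnormal = run("1.2345E-9", "1E-3")

# 1E-18 underflows to zero; the exponent of the zero is E_tiny.
to_zero = run("1E-9", "1E-9")
```

The first case raises Subnormal only; the second and third raise both Subnormal and Underflow, and the third returns a zero whose exponent is –13.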

Notation

In later sections of this document, a specific finite number is described by its abstract representation, using the triad notation: [sign, coefficient, exponent], where each value is an integer. Only the exponent can be negative.

Similarly, duples are used to indicate the special values. These have the form [sign, special-value], where the sign is indicated as before, and the special-value is one of inf, qNaN, or sNaN, representing infinity, quiet NaN, or signaling NaN, respectively.

So, for example, the triad [0,2708,-2] represents the number 27.08, the triad [1,1953,0] represents the integer –1953, the duple [1,inf] represents the number –∞, and the duple [0,qNaN] represents a quiet NaN.
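
For experimentation, triads and duples can be mapped onto the values of an existing implementation. The sketch below assumes Python's decimal module; the helper names from_triad and from_duple and the SPECIALS table are hypothetical and are not defined by this specification.

```python
from decimal import Decimal

# Spellings accepted by Python's Decimal constructor for the special values.
SPECIALS = {"inf": "Infinity", "qNaN": "NaN", "sNaN": "sNaN"}

def from_triad(sign, coefficient, exponent):
    """Build a finite number from [sign, coefficient, exponent]."""
    digits = tuple(int(d) for d in str(coefficient))
    return Decimal((sign, digits, exponent))

def from_duple(sign, special_value):
    """Build a special value from [sign, special-value]."""
    return Decimal(("-" if sign else "") + SPECIALS[special_value])
```

For instance, str(from_triad(0, 2708, -2)) is "27.08" and from_duple(1, "inf") is negative infinity.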

Abstract representation of operations

The core operations which must be provided by an implementation are described in later sections which define Conversions and Arithmetic Operations. Each operation is given an abstract name (for example, ‘add’), and its semantics are strictly defined. However, the implementation of each operation and the manner by which each is effected is not defined by this specification.

For example, in an object-oriented language, the addition operation might be effected by a method called add, whereas in a calculator application it might be effected by clicking on a button icon. In other uses, an infix ‘+’ symbol might be used to indicate addition. And in all cases, the operation might be carried out in software, hardware, or some combination of these.

Similarly, operations which are distinct in the specification need not be mapped one-to-one to distinct operations in the implementation – it is only necessary that all the core operations are available. For example, conversions to a string could be handled by a single method, with variations determined from context or additional arguments.
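
As one illustration of a single abstract operation being exposed in more than one way, Python's decimal module (assumed here purely as an example) offers addition both as a named method on a context object and as an infix operator:

```python
from decimal import Context, Decimal

ctx = Context(prec=9)
a, b = Decimal("1.1"), Decimal("2.2")

via_method = ctx.add(a, b)  # named method, using an explicit context
via_infix = a + b           # infix operator, using the thread's current context
```

Both forms carry out the same abstract add operation; they differ only in how the context is supplied.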

Abstract representation of context

The context represents the user-selectable parameters and rules which govern the results of arithmetic operations (for example, the precision to be used). It is defined by the following parameters:

precision

An integer which must be positive (greater than 0). This sets the maximum number of significant digits that can result from an arithmetic operation.
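
The effect of the precision setting can be seen in a short sketch, assuming Python's decimal module as the implementation. The same division is carried out under two precisions, followed by an addition whose exact result already fits.

```python
from decimal import Context, Decimal

third_9 = Context(prec=9).divide(Decimal(1), Decimal(3))
third_3 = Context(prec=3).divide(Decimal(1), Decimal(3))

# A result that needs no more than precision digits is returned unrounded.
exact = Context(prec=3).add(Decimal("1.5"), Decimal("2.25"))
```

The two quotients have 9 and 3 significant digits respectively, while the sum 3.75 is returned in full.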

In the abstract, there is no upper bound on the precision (although a specific precision must always be provided). In practice there may need to be some upper limit to it (for example, the length of the maximum coefficient supported by a concrete representation). This limit must be expressed as an integral number of decimal digits.

Similarly, there may be a lower bound on the setting of precision, which may be the same as the upper bound (for example, if it is implied by the length of the maximum coefficient supported by a concrete representation). This limit must also be expressed as an integral number of decimal digits.

An implementation must designate a precision to be known as single precision (see IEEE 854 §3.2.1). This must be greater than 5 (see IEEE 854 §3.1) and within the range of implemented precisions. It is recommended that it be at least 9.^[7]

An implementation may also designate a precision to be known as double precision, which must be within the range of implemented precisions (see IEEE 854 §3.2.2). If a double precision is designated, then the following constraints apply:

If the value of single precision is given by P_s, and the value of double precision is given by P_d, then P_d must be greater than or equal to 2 × P_s + 1 (see IEEE 854 §3.2.2).
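The maximum exponent (E_limit) at the designated single precision must be no greater than one eighth of the E_limit at double precision, less 1; that is, the E_limit at double precision must be at least 8 × (E_limit at single precision + 1) (see IEEE 854 §3.2.2).^[8]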

If these constraints cannot be implemented (for example, an implementation may support very large exponents and not be able to have different exponent limits for differing precisions), then a double precision must not be designated.
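
The two constraints on a designated double precision might be checked as in the following sketch. The function name is hypothetical, and the exponent constraint is read as E_limit(single) ≤ E_limit(double) ÷ 8 – 1, which is an interpretation of the wording above.

```python
def double_precision_allowed(p_s, p_d, e_limit_s, e_limit_d):
    """Return True if a double precision may be designated.

    p_s, p_d: single and double precision, in decimal digits.
    e_limit_s, e_limit_d: E_limit at single and at double precision.
    """
    precision_ok = p_d >= 2 * p_s + 1
    # E_limit_s <= E_limit_d / 8 - 1, rearranged to stay in integer arithmetic.
    exponent_ok = 8 * (e_limit_s + 1) <= e_limit_d
    return precision_ok and exponent_ok
```

For example, a single precision of 9 with E_limit 999 and a double precision of 19 with E_limit 9999 satisfies both constraints, whereas a double precision of 18 would not.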

rounding

A named value which indicates the algorithm to be used when rounding is necessary. Rounding is applied when a result coefficient has more significant digits than the value of precision; in this case the result coefficient is shortened to precision digits and may then be incremented by one (which may require a further shortening), depending on the rounding algorithm selected and the remaining digits of the original coefficient. The exponent is adjusted to compensate for any shortening.

The following rounding algorithms are defined and must be supported:^[9]

round-down

(Truncate.) The discarded digits are ignored; the result is unchanged.

round-half-up

If the discarded digits represent greater than or equal to half (0.5) of the value of a one in the next left position then the result should be incremented by 1 (rounded up). Otherwise the discarded digits are ignored.

round-half-even

If the discarded digits represent greater than half (0.5) the value of a one in the next left position then the result should be incremented by 1 (rounded up). If they represent less than half, then the result is not adjusted (that is, the discarded digits are ignored).

Otherwise (they represent exactly half) the result is unaltered if its rightmost digit is even, or incremented by 1 (rounded up) if its rightmost digit is odd (to make an even digit).

round-ceiling

(Round toward +∞.) If all of the discarded digits are zero or if the sign is 1 the result is unchanged. Otherwise, the result should be incremented by 1 (rounded up). If this would cause overflow then the result will be [0,inf].

round-floor

(Round toward –∞.) If all of the discarded digits are zero or if the sign is 0 the result is unchanged. Otherwise, the sign is 1 and the coefficient should be incremented by 1. If this would cause overflow then the result will be [1,inf].
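
The five algorithms can be sketched as a single function operating on a triad. This is an illustrative reading of the rules above, not a normative implementation; the name round_triad is hypothetical, and overflow handling is deliberately omitted.

```python
def round_triad(sign, coefficient, exponent, precision, mode):
    """Round [sign, coefficient, exponent] to at most precision digits."""
    digits = str(coefficient)
    excess = len(digits) - precision
    if excess <= 0:
        return (sign, coefficient, exponent)  # nothing to discard

    kept = int(digits[:precision])
    discarded = digits[precision:]
    half = "5" + "0" * (excess - 1)  # same length as discarded
    nonzero = discarded.strip("0") != ""

    if mode == "round-down":
        up = False
    elif mode == "round-half-up":
        up = discarded >= half
    elif mode == "round-half-even":
        up = discarded > half or (discarded == half and kept % 2 == 1)
    elif mode == "round-ceiling":
        up = nonzero and sign == 0
    elif mode == "round-floor":
        up = nonzero and sign == 1
    else:
        raise ValueError("unknown rounding mode: " + mode)

    if up:
        kept += 1
    exponent += excess  # compensate for the shortening
    if len(str(kept)) > precision:
        kept //= 10     # the removed digit is always a zero
        exponent += 1
    return (sign, kept, exponent)
```

For example, rounding [0,125,-2] to two digits gives [0,13,-1] under round-half-up but [0,12,-1] under round-half-even, and rounding [1,125,-2] gives [1,13,-1] under round-floor.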

When a result is rounded, the coefficient may become longer than the current precision. In this case the least significant digit of the coefficient (it will be a zero) is removed (reducing its length by one), and the exponent is incremented by one. This in turn may give rise to an overflow condition.
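
A short sketch of this carry case, assuming Python's decimal module with an illustrative precision of 2 and exponent limit of 9:

```python
from decimal import Context, Decimal, Overflow, ROUND_HALF_UP

ctx = Context(prec=2, rounding=ROUND_HALF_UP, Emin=-9, Emax=9, traps=[])

# 9.99 -> coefficient 99 + 1 = 100, shortened to 10 with the exponent raised.
carried = ctx.plus(Decimal("9.99"))
carried_triad = carried.as_tuple()

# Near the top of the range, the same carry pushes the result into overflow.
ctx.clear_flags()
overflowed = ctx.plus(Decimal("9.99E+9"))
overflow_flag = bool(ctx.flags[Overflow])
```

The first result is [0,10,0]; the second raises the overflow signal and, with its trap-enabler at 0, yields positive infinity.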

flags and trap-enablers

The exceptional conditions are grouped into signals, which can be controlled individually. The context contains a flag (which is either 0 or 1) and a trap-enabler (which also is either 0 or 1) for each signal.

For each of the signals, the corresponding flag is set to 1 when the signal occurs. It is only reset to 0 by explicit user action.

For each of the signals, the corresponding trap-enabler indicates which action is to be taken when the signal occurs (see IEEE 854 §7). If 0, a defined result is supplied, and execution continues (for example, an overflow is perhaps converted to a positive or negative infinity). If 1, then execution of the operation is ended or paused and control passes to a ‘trap handler’, which will have access to the defined result.
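
The interaction of flags and trap-enablers might look as follows in one implementation. This sketch assumes Python's decimal module, in which a trap is effected by raising an exception; other environments may effect traps quite differently.

```python
from decimal import Context, Decimal, DivisionByZero

ctx = Context(prec=9, traps=[])  # every trap-enabler is 0

# Trap-enabler 0: a defined result is supplied and the flag is set to 1.
result = ctx.divide(Decimal(1), Decimal(0))
flag_after_signal = bool(ctx.flags[DivisionByZero])

# The flag remains set through later operations until explicitly reset.
ctx.add(Decimal(1), Decimal(1))
flag_is_sticky = bool(ctx.flags[DivisionByZero])
ctx.clear_flags()
flag_after_reset = bool(ctx.flags[DivisionByZero])

# Trap-enabler 1: the operation is ended and control passes to a handler.
ctx.traps[DivisionByZero] = True
try:
    ctx.divide(Decimal(1), Decimal(0))
    trapped = False
except DivisionByZero:
    trapped = True
```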

The signals are:

division-by-zero

raised when a non-zero dividend is divided by zero

inexact

raised when a result is not exact (one or more non-zero coefficient digits were discarded during rounding)

invalid-operation

raised when a result would be undefined or impossible

This signal cannot occur, and is therefore optional, in an implementation where the lower bound for precision is equal to the maximum length of the coefficient.

overflow

raised when the exponent of a result is too large to be represented

rounded

raised when a result has been rounded (that is, some zero or non-zero coefficient digits were discarded)

subnormal

raised when a result is subnormal (its adjusted exponent is less than E_min), before any rounding

underflow

raised when a result is both subnormal and inexact.

This specification does not define the means by which flags and traps are reset or altered, respectively, or the means by which traps are effected.^[10]
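
The difference between the rounded and inexact signals can be shown with a small sketch, assuming Python's decimal module and an illustrative precision of 3. The helper name signals_for is hypothetical.

```python
from decimal import Context, Decimal, Inexact, Rounded

ctx = Context(prec=3, traps=[])

def signals_for(text):
    """Apply the context to a value; return (result, rounded, inexact)."""
    ctx.clear_flags()
    result = ctx.plus(Decimal(text))
    return result, bool(ctx.flags[Rounded]), bool(ctx.flags[Inexact])

only_rounded = signals_for("1.2300")         # only zero digits discarded
rounded_and_inexact = signals_for("1.2345")  # non-zero digits discarded
neither = signals_for("1.23")                # already fits the precision
```

Discarding only zeros raises rounded alone, whereas discarding any non-zero digit raises both rounded and inexact.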

Notes:

The setting of precision may be used to reduce a result from double to single precision, using the plus operation. This meets the requirements of IEEE 854 §4.3.
IEEE 854 was designed under the assumption that some small number of known precisions would be available to users. This specification extends this concept to allow (but not require) variable precisions, as specified by ANSI X3.274. This generalization allows improved interoperation between software arbitrary-precision decimal packages and hardware implementations (which are expected to have relatively low maximum precision limits, typically just tens of digits).
precision can be set to positive values lower than nine. Small values, however, should be used with care – the loss of precision and rounding thus requested will affect all computations governed by the context, including comparisons. To conform to IEEE 854, this value should not be set less than 6.
For completeness, it is recommended that implementations also offer two further rounding modes: round-half-down (round to nearest, where a 0.5 case is rounded down) and round-up (round away from zero).
The concrete representation of rounding is often a series of integer constants, or an enumeration, held in an object or control register.
It has been proposed that each exceptional condition should have its own, distinct, signal and trap-enabler. This specification may change to this approach.

Default contexts

This specification defines two default contexts, which define suitable settings for basic arithmetic and for the extended arithmetic defined by IEEE 854. It is recommended that the default contexts be easily selectable by the user.

Basic default context

In the basic default context, the parameters are set as follows:

flags – all set to 0
trap-enablers – inexact, rounded, and subnormal are set to 0; all others are set to 1 (that is, the other conditions are treated as errors)
precision – is set to 9
rounding – is set to round-half-up

Extended default context

In the extended default context, the parameters are set as follows:

flags – all set to 0
trap-enablers – all set to 0 (IEEE 854 §7)
precision – is set to the designated single precision
rounding – is set to round-half-even (IEEE 854 §4.1)

It is recommended that if a double precision is designated then a third extended double default context be provided, with the same settings as the extended default context except that the precision is set to the double precision.
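
As an example of how the default contexts may surface in an implementation, Python's decimal module provides objects named BasicContext and ExtendedContext. The sketch below only reads their settings for the seven signals listed in this specification (that module also defines further signals, which are ignored here), and its choice of 9 for the extended context is that implementation's own designation.

```python
from decimal import (BasicContext, ExtendedContext,
                     ROUND_HALF_EVEN, ROUND_HALF_UP,
                     DivisionByZero, Inexact, InvalidOperation, Overflow,
                     Rounded, Subnormal, Underflow)

SPEC_SIGNALS = [DivisionByZero, Inexact, InvalidOperation, Overflow,
                Rounded, Subnormal, Underflow]

# Trap-enabler settings, keyed by signal name.
basic_traps = {s.__name__: bool(BasicContext.traps[s]) for s in SPEC_SIGNALS}
extended_traps = {s.__name__: bool(ExtendedContext.traps[s]) for s in SPEC_SIGNALS}

basic_settings = (BasicContext.prec, BasicContext.rounding)
extended_settings = (ExtendedContext.prec, ExtendedContext.rounding)
```

In that implementation the basic context traps every listed signal except inexact, rounded, and subnormal, and the extended context traps none.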

Footnotes:

[1]	Indeed, some variations of operations could be selected by using context settings outside the scope of this specification.
[2]	That is, the maximum value of the coefficient will be an integral power of ten, less one – for example, 99999999999999999999.
[3]	See IEEE 854 §3.1.
[4]	This rule, a requirement for both ANSI X3.274 and IEEE 854, constrains the number of values which would overflow or underflow when inverted (divided into 1).
[5]	Typically, in a concrete representation, certain out-of-range values of the exponent are used to indicate the special values, and the coefficient is used to carry additional diagnostic information for quiet NaNs.
[6]	IEEE 854 defines subnormal numbers as numbers whose absolute value is non-zero and is closer to zero than ten to the power of E_min. This definition includes zeros with tiny exponents.
[7]	This is the ‘narrowest basic precision’ described in IEEE 854 §3.2.1. Strictly speaking, single precision should be the narrowest precision supported; however it is assumed that when precision is fully variable the intent of IEEE 854 is that the designation applies to the narrowest default precision – the programmer is permitted to specify a narrower precision explicitly.
[8]	This constraint is very slightly tighter than that defined by IEEE 854, which specifies that E_limit for double be greater than or equal to 8 × E_limit for single, plus 7. There is a preference for human-oriented limits, so it is suggested that the E_limit for single be one tenth of, or one digit shorter than, the E_limit for double.
[9]	The term ‘round to nearest’ is not used because it is ambiguous. round-half-up is the usual round-to-nearest algorithm used in European countries, in international financial dealings, and in the USA for tax calculations. round-half-even is often used for other applications in the USA, where it is usually called ‘round to nearest’ and is sometimes called ‘banker's rounding’.
[10]	IEEE 854 suggests that there be a mechanism allowing traps to return a substitute result to the operation that raised the exception, but this may not be possible in some environments (including some object-oriented environments).
