Pages

Friday, February 19, 2016

Alignment - first thoughts

Memory alignment is an important and unavoidable low-level fact.
The trade-off is space overhead due to padding which is to be minimized.
Managing alignment presents challenges, specially before C++11.
As you know, C++11 is a major advance since C++03.

A few important things to keep in mind are:
 
  • In main memory, every data type must (as an absolute requirement) or should (at least for optimization) be located at a characteristic address multiple, aka the alignment modulus due to hardware design and engineering issues. In fact, it's not uncommon, with exceptions, to have the arithmetic data types' alignments and the void * type alignment dictating all the other dependent data types' alignments.
      
  • Type T is more restrictive (or stringent) then type S if the alignment modulus for type T is greater than (or equal --- a technicality for the reflexive property) to the alignment modulus for type S.
      
  • In general, if type T is more restrictive than type S, then converting T * to S * and back won't cause addressing misalignment or exception in detriment of a program or application.
      
  • By means of an appropriate C or C++ typecast, a void * can be converted into a valid T * whatever be the type of T. Similarly, a const void * can be safely be assigned to a const T *. Hence, void * is more restrictive than T * for any type T (including function pointers void (*)() ). These facts are quite powerful because (low-level) allocators generally return a void * which in turn they obtain from a platform low-level system-call including ::shmat() and malloc(3C) variants such as: mtalloc(3ALLOC), umem_alloc(3ALLOC), and bsdmalloc(3MALLOC), not excluding their possible and respective implementation of memalign(). Either the allocator will return an address suitable to the most restrictive type of the compiler will adjust it according to the typecast.
     
  •  An aggregate data type alignment modulus is frequently a multiple of the alignment modulus of the more stringent of its members, which ultimately is the alignment modulus of the more stringent primitive data type involved. Aside from hardware constrains, the basic fact behind this is the language requirements to well-formed arrays.
 
Naturally, I understand that platform and/or compiler specifics pragmas, such as pragma align and pragma pack, as well as unions including the most restrictive type are very useful tools in assuring a certain alignment constraint, but they consist on a different approach to what I'm talking about here, which (although not space efficient) is more portable in all aspects. My goal is to obtain a properly aligned value of a type-specific pointer (T *) within an arbitrary raw chunk of memory (a raw block of memory pages), which, after all, is typical in the context of memory allocators.
  
Thus, the trick is how to obtain proper alignment and avoid any portability problems.
I'd say the answer is based on the following equivalent rules:
  1. Rely on the standards and be aware of industry-wide facts.
  2. Assume nothing too specific about a platform and its system and compilers.
I always kept in mind these rules, but in practice I never really found a sufficiently clear example of how to adhere to them. Perhaps it is a matter of highly confidential treasures of a few. At the same time, the scenarios were apparently so ugly that I was discouraged to try to tackle anything by myself. In addition, third parties solutions claiming to address the issue were so clumsy that I became uninterested on embracing any of them. I just wanted some really simple approach that could get the job done on most straightforward scenarios.

NOTE
I do not (yet) attempt to (statically or even at run-time) compute the alignment modulus of a given type. For the moment I just rely on a compiler good support to the C++ language. As I said earlier, what I initially attempt is to obtain a properly (although not optimally) aligned pointer to a given type T.

Possibly, the following is good enough for most scenarios:
From the beginning of a certain block of memory not necessarily aligned to the most stringent type and pointed to by a base pointer, the goal is to compute a higher and properly aligned address suitable to the next object of type O. Optionally, there may or may be an initial arbitrary (prefix or header) object of type P beyond (the end of) which we want to find a properly aligned address suitable to an object of type O. Of course the final properly aligned pointer so obtained is also naturally suitable to be the base address of an array of O-objects.

By the way the term object most frequently refers to an aggregate type (such as a struct) which members are optimally padded by the compiler. Note that the base pointer do not necessarily points to an initial general prefix (or header) P-object that is different from an O-object and which a suitable higher address we are looking for. As such, the base pointer may not be suitable to the prefix P-object as well as first address beyond the end of the prefix P-object is not generally suitable (properly aligned) to store the subsequent O-object we are interested in.

NOTE
Before the advent of C++11 alignof() and alignas(), managing alignment (specially of aggregate types) in a portable and general way towards a finer alignment modulus value than the natural one returned by sizeof() was quite difficult as already demonstrated by some of the most prominent C++ gurus, such as Andrei Alexandrescu. One "brute force" (but error-prone) workaround is to manually inform (via a function or template parameter) a specific alignment modulus upon a particular visual inspection of a certain data type, specially the aggregated ones.

The requirements are:
  • The base pointer
  • The type of the prefix P-object 
  • The type of the O-object(s)

The big picture is depicted by the following:
  • The BASE POINTER address is general not necessarily suitable to a P object
  • The END OF PREFIX is (in general) not suitable address to an O object
  • The goal is to compute the NEXT ALIGNED OBJECT address
    (by computing the offset involved on the necessary padding)
 

The key and subtle aspect about the return value of the sizeof operator, is that the compiler carefully computes it in order to provide not just a proper alignment but also a proper array addressing to the argument type. The returned value is a multiple of the optimal alignment modulus. Thus the alignment modulus for an given type is less than or equal to the value returned by the sizeof when applied to that type. Those are constrains from the language and the hardware which the compiler has to manage.

Under Solaris 11.1 x64 and Solaris Studio 12.3 C++03, the following declarations...

struct S1
{
   char c;
   int i;
};

struct S2
{
   char c;
   long l;
};

struct S3
{
   char c;
   long double ld;
};

... produce the following results at run-time:
 
sizeof( char )         =  1
sizeof( int )          =  4
sizeof( long )         =  8
sizeof( long double )  = 16
 
sizeof( void * )       =  8

sizeof( S1 )           =  8
sizeof( S2 )           = 16
sizeof( S3 )           = 32
 
sizeof( S1[2] )        = 16
sizeof( S2[2] )        = 32
sizeof( S3[2] )        = 64
  
NOTE
The alignment modulus of S3 above could be some power of 2 which divides 32 (its size) and matches the alignment of most stringent type (long double) among its members, which is the alignment modulus of its largest (in size) primitive member (ld). Note that on some other platform, a long double could have an alignment modulus inferior to its size. As such, for the sake of simplicity in getting absolute portability I am using the sub-optimal value of 32 (which is full size of S3 and which for sure is a multiple of whatever its optimal alignment modulus could be). It's possible to find the optimal alignment modulus by a series of smart language tricks and machinery backed by a good compiler support, but that's too deep to cover on this introduction. Back to our rather simplistic examples and assumptions, if the prefix object (on the above example denoted by the first data member) is less than 16 bytes (as it's the case of a char type) we are wasting that difference with padding (15 bytes, almost a 50% of the whole aggregate size) as the price for simplicity and an absolute bullet-proof portability (until C++03).

The following templates provide an example of a possible implementation.
They consist on a type function and a template function.

// The base template for the "type function"
template< typename P >
struct pointed_by;

// The specialization for the "type function"
// Gets the base type of a pointer type
template< typename P >
struct pointed_by< P * >
{
   typedef P type;
};

// 
// The alignment template function
// The result is greater or equal to the parameter
//
// Inspired on: 
//
//    The C Programming Language
//    Kernighan & Ritchie
//    8.7 - malloc() implementation (variable nunits)
//
template< typename P >
inline P align( void * p )
{
   const std::size_t alignment =  
      sizeof( typename pointed_by< P >::type );

   return reinterpret_cast< P >
   (
      (
         (
            reinterpret_cast< std::intptr_t >( p )
            +
            ( alignment - 1 )
         )
         / alignment
      )
      * alignment
   );
}

Despite the complexity, by taking advantage of templates the overhead is minimized. The logic on the template function may present a potential portability problem as multiplication and division of pointers aren't defined. But this isn't really an issue on Intel platforms and I risk to say that isn't either on other platforms as long as it's handled as shown above. The intptr_t is an assumed integer type large enough to represent any pointer type.

I considered that the base-10 integer Math strategy I could come up was the most clear one and it does well as to compiler register optimizations. In addition, as much as possible, multiplication and division are optimized by the compiler into straightforward base-2 shift or equivalent operations, thus efficient as well.

Nevertheless all this, I've found on C as a Second Language for Native Speakers of Pascal by Tomaz Müldner & Peter W. Steelean an alternative but slightly more efficient computation if the alignment modulus is a power of 2 that uses a clever base-2 integer Math which completely puzzled me at first. After all, I'm not from that legacy epoch where Assembly programming was so prevalent alongside these base 2 arithmetic quirks. At least in Solaris Studio 12.3 on Intel x64 it saved one leaq Assembly instruction. I'm not sure if this is less portable than my previous alternative, and I don't think so, to me both seem equally portable (or not).
  
The C++ code is:

template< typename P >
inline P align( void * p )
{
   const std::size_t magic_value =
      sizeof( typename pointed_by< P >::type ) - 1;

   // sizeof() must yield a power of 2
   // Consider using assert() to enforce it   

   return reinterpret_cast< P >
   (
      (
         reinterpret_cast< std::intptr_t >( p )
         +
         magic_value 
      )
      & ~magic_value
   );
}

The (-g) 64-bits instructions for the specific power of 2 alignment modulus are:
 
0x0000000000402dc0:  align       :    pushq    %rbp
0x0000000000402dc1:  align+0x0001:    movq     %rsp,%rbp
0x0000000000402dc4: 
align+0x0004:    subq     $0x0000000000000020,%rsp
0x0000000000402dc8: 
align+0x0008:    movq     %rdi,0xfffffffffffffff8(%rbp)
0x0000000000402dcc: 
align+0x000c:    movq     0xfffffffffffffff8(%rbp),%r8
0x0000000000402dd0: 
align+0x0010:    leaq     0x0000000000000007(%r8),%r8
0x0000000000402dd4: 
align+0x0014:    andq     $0xfffffffffffffff8,%r8
0x0000000000402dd8: 
align+0x0018:    movq     %r8,0xffffffffffffffe8(%rbp)
0x0000000000402ddc: 
align+0x001c:    movq     0xffffffffffffffe8(%rbp),%r8
0x0000000000402de0: 
align+0x0020:    movq     %r8,%rax
0x0000000000402de3: 
align+0x0023:    leave
 
Compare it to the more general solution (-g) 64-bits instructions:

0x0000000000402e40:  align       :    pushq    %rbp
0x0000000000402e41: 
align+0x0001:    movq     %rsp,%rbp
0x0000000000402e44: 
align+0x0004:    subq     $0x0000000000000020,%rsp
0x0000000000402e48: 
align+0x0008:    movq     %rdi,0xfffffffffffffff8(%rbp)
0x0000000000402e4c: 
align+0x000c:    movq     0xfffffffffffffff8(%rbp),%r8
0x0000000000402e50: 
align+0x0010:    leaq     0x0000000000000007(%r8),%r8
0x0000000000402e54: 
align+0x0014:    shrq     $0x0000000000000003,%r8
0x0000000000402e58: 
align+0x0018:    leaq     0x0000000000000000(,%r8,8),%r8
0x0000000000402e60: 
align+0x0020:    movq     %r8,0xffffffffffffffe8(%rbp)
0x0000000000402e64: 
align+0x0024:    movq     0xffffffffffffffe8(%rbp),%r8
0x0000000000402e68: 
align+0x0028:    movq     %r8,%rax
0x0000000000402e6b: 
align+0x002b:    leave

But what's under the hood (magic_value) on the bit-wise version?
As I said I was initially puzzled with them and with lack of any help.
In fact, since many years I was faced with similar challenges.
I usually ran away thinking they weren't relevant at last.
This time I accepted the challenge to tackle it.
The nasty details of the solution are here!

But don't get fooled by relying on the return value of sizeof() as the alignment modulus as it may not always be a power of 2! This is true for many user-defined types but may happen as well for some basic types. So, the previous base-2 version of  align<>() won't produce the right result! Nevertheless, the original base-10 version works fine on all cases, though, so prefer it! Or... you may take advantage of templates  in order to somehow select one version or the other according to the return value of sizeof() being a power of 2 or not.

template 

   typename P, 
   std::size_t A = sizeof( typename pointed_by< P >::type ), 
   bool = ( A & ( A - 1 ) ) == 0 // power of 2?
>
struct pointer;


template< typename P, std::size_t A >
struct pointer< P, A, false >
{
   inline static P adjust( void * p )
   {
      return reinterpret_cast< P >
      (

         (
            (
               reinterpret_cast< std::intptr_t >( p )
               +
               ( A - 1 )
            )
            / A
         ) 

         * A
      );
   }
};


template< typename P, std::size_t A >
struct pointer< P, A, true >
{
   inline static P adjust( void * p )
   {
      return reinterpret_cast< P >
      (

         (
            reinterpret_cast< std::intptr_t >( p )
            +
            ( A - 1 )
         )
         & ~( A - 1 )
      );
   }
};


template< typename P >
inline P align( void * p )
{

   return pointer< P >::adjust( p );
}

  
Having said all that, consider yet the following variant which also works in all cases (not just for powers of 2) and use a single integer modulus operation at its core (compared to the others): 

template< typename P >
inline P align( void * p )
{
   const std::size_t alignment =  
      sizeof( typename pointed_by< P >::type );

   const std::intptr_t pointer =
      reinterpret_cast< std::intptr_t >( p ) +
         alignment - 1;
 
   return reinterpret_cast< P >
      ( pointer - pointer % alignment ); 
}

Sample usage of  align<>() is as follows:

void f()
{
   ...
  
   // Get a properly aligned P-object
   // base points anywhere within a memory block
   P * prefix = align< P * >( base );
 
   ...
   prefix->... = ...
   ...
   
   // Get a properly aligned O-object
   // array will point to a O[] following *prefix
   O * array = align< O * >( prefix + 1 );

   ...
   array[ n ] = ...  
   ...

   ... 
}

Under Windows XP and DJGPP C++11, the following declarations...

struct P
{
   char c[ 5 ];
   short i;
};

 
struct O
{
   char c;
   long double ld;
};

 
... produce the following results at run-time:

block                  =  0x124e48
base                   =  0x124e49


sizeof( P )            =  8
alignof( P )           =  2

 
sizeof( O )            = 16
alignof( O )           =  4

sizeof( char )         =  1
alignof( char )        =  1


sizeof( short )        =  2
alignof( short )       =  2

 
sizeof( long double )  = 12
alignof( long double ) =  4


prefix                 =  0x124e50
prefix-end             =  0x124e58

array                  =  0x124e60
array[ 1 ]             =  0x124e70


As a slight variation of the above implementation, the resulting aligned pointer can be based on an explicit (power of 2) alignment parameter instead of being strictly related to the size of resulting pointer type. This way it's possible, for instance, to align the resulting pointer to a virtual memory page boundary.

template< typename P >
inline P align( void * p, const std::size_t alignment )
{
   const std::intptr_t pointer = 
      reinterpret_cast< std::intptr_t >( p ) +
         alignment - 1;
 
   return reinterpret_cast< P >
      ( pointer - pointer % alignment ); 
}
 
NOTE
On the above function template the alignment isn't a value template parameter because as such it's a more flexible solution. For example, a platform such as Solaris can dynamically provide many memory page sizes. If a value template parameter were used it would restrict the possibilities to statically defined constant expression as a value template parameter. The implementation is trivial by taking advantage of what was previously  presented.

The aligned addresses of fundamental or aggregate types (structs or so) are always even and are a multiple of a certain power of 2. I must point out that throughout all this post I perhaps wasn't very explicit about not worrying with odd addresses. According to the assumptions and strategies they are simply irrelevant.