(This is the content of a blog post by Thiago Macieira about passing by value versus passing by reference,
https://www.macieira.org/blog/2012/02/the-value-of-passing-by-value/
)
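Problem statement
Before we dive into the ABI documentation and start compiling code, we need to define the problem we are trying to solve. In general terms, I am trying to find the best way of passing small C++ structures: when is it better to pass by value rather than by constant reference? And under those conditions, does the qreal discussion have any significance?
A small structure like QLatin1String, which contains exactly one pointer as a member, benefits from being passed by value. What other types of structures should we consider?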
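- Structures with more than one pointer
- Structures with 32-bit integers on 64-bit architectures
- Structures with floating-point members (single and double precision)
- Mixed-type and special structures found in Qt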
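I will investigate the x86-64, ARMv7 hard-float, MIPS hard-float (o32) and IA-64 ABIs, because those are the ones I have compilers for. All of them support passing parameters in registers and have at least 4 integer registers used for parameter passing. Except for MIPS, they all also have at least 4 floating-point registers used for parameter passing. See my previous blog on ABI details for more information.
So we will investigate what happens when you pass the following structures by value: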
struct Pointers2
{
    void *p1, *p2;
};
struct Pointers4
{
    void *p1, *p2, *p3, *p4;
};
struct Integers2 // like QSize and QPoint
{
    int i1, i2;
};
struct Integers4 // like QRect
{
    int i1, i2, i3, i4;
};
template <typename F> struct Floats2 // like QSizeF, QPointF, QVector2D
{
    F f1, f2;
};
template <typename F> struct Floats3 // like QVector3D
{
    F f1, f2, f3;
};
template <typename F> struct Floats4 // like QRectF, QVector4D
{
    F f1, f2, f3, f4;
};
template <typename F> struct Matrix4x4 // like QGenericMatrix<4, 4>
{
    F m[4][4];
};
struct QChar
{
    unsigned short ucs;
};
struct QLatin1String
{
    const char *str;
    int len;
};
template <typename F> struct QMatrix
{
    F _m11, _m12, _m21, _m22, _dx, _dy;
};
template <typename F> struct QMatrix4x4 // like QMatrix4x4
{
    F m[4][4];
    int f;
};
We will analyse the assembly generated for the following program:
template <typename T> void externalFunction(T);
template <typename T> void passOne()
{
    externalFunction(T());
}
template <typename T> T externalReturningFunction();
template <typename T> void returnOne()
{
    externalReturningFunction<T>();
}
// Explicit template instantiations
template void passOne<Pointers2>();
template void passOne<Pointers4>();
template void passOne<Integers2>();
template void passOne<Integers4>();
template void passOne<Floats2<float> >();
template void passOne<Floats2<double> >();
template void passOne<Floats3<float> >();
template void passOne<Floats3<double> >();
template void passOne<Floats4<float> >();
template void passOne<Floats4<double> >();
template void passOne<Matrix4x4<float> >();
template void passOne<Matrix4x4<double> >();
template void passOne<QChar>();
template void passOne<QLatin1String>();
template void passOne<QMatrix<float> >();
template void passOne<QMatrix<double> >();
template void passOne<QMatrix4x4<float> >();
template void passOne<QMatrix4x4<double> >();
template void returnOne<Pointers2>();
template void returnOne<Pointers4>();
template void returnOne<Integers2>();
template void returnOne<Integers4>();
template void returnOne<Floats2<float> >();
template void returnOne<Floats2<double> >();
template void returnOne<Floats3<float> >();
template void returnOne<Floats3<double> >();
template void returnOne<Floats4<float> >();
template void returnOne<Floats4<double> >();
template void returnOne<Matrix4x4<float> >();
template void returnOne<Matrix4x4<double> >();
template void returnOne<QChar>();
template void returnOne<QLatin1String>();
template void returnOne<QMatrix<float> >();
template void returnOne<QMatrix<double> >();
template void returnOne<QMatrix4x4<float> >();
template void returnOne<QMatrix4x4<double> >();
In addition, we are interested in what happens to non-structure floating-point parameters: are they promoted or not? So we will also test the following:
void passFloat()
{
    void externalFloat(float, float, float, float);
    externalFloat(1.0f, 2.0f, 3.0f, 4.0f);
}
void passDouble()
{
    void externalDouble(double, double, double, double);
    externalDouble(1.0f, 2.0f, 3.0f, 4.0f);
}
float returnFloat()
{
    return 1.0f;
}
double returnDouble()
{
    return 1.0;
}
Analysis of the output
x86-64
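You may have noticed that I skipped old-style 32-bit x86. That was intentional, since that platform does not support passing parameters in registers anyway. The only conclusions we could draw from it would be:
whether the structures are stored on the stack in the place of the argument, or whether they're stored elsewhere and passed by pointer;
whether single-precision floating-point is promoted to double-precision.
I am also intentionally ignoring it because I want people to start thinking about the new ILP32 ABI for x86-64, enabled by GCC 4.7's -mx32 switch, which follows the same ABI as described below (except that pointers are 32-bit).
So let's look at the assembly results. For parameter passing, we find that: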
Pointers2 is passed in registers;
Pointers4 is passed in memory;
Integers2 is passed in a single register (two 32-bit values per 64-bit register);
Integers4 is passed in two registers only (two 32-bit values per 64-bit register);
Floats2<float> is passed packed into a single SSE register, no promotion to double;
Floats3<float> is passed packed into two SSE registers, no promotion to double;
Floats4<float> is passed packed into two SSE registers, no promotion to double;
Floats2<double> is passed in two SSE registers, one value per register;
Floats3<double> and Floats4<double> are passed in memory;
Matrix4x4 and QMatrix4x4 are passed in memory regardless of the underlying type;
QChar is passed in a register;
QLatin1String is passed in registers.
The floating point parameters are passed one per register, without float promotion to double.
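For the return values, the conclusion is the same as above: if the value is passed in registers, it is returned in registers too; if it is passed in memory, it is returned in memory. With a careful reading of the ABI document, we reach the following conclusions:
Single-precision floating-point types are not promoted to double;
Single-precision floating-point types in a structure are packed into SSE registers if they are still available;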
Structures bigger than 16 bytes are passed in memory, with an exception for __m256, the type corresponding to one AVX 256-bit register.
IA-64
Here are the results for parameter passing:
Both Pointers structures are passed in registers, one pointer per register;
Both Integers structures are passed in registers, packed like x86-64 (two ints per register);
All of the Floats structures are passed in registers, one value per register (unpacked);
Matrix4x4<float> is passed entirely in registers: half of it (the first 8 floats) is in floating-point registers, one value per register (unpacked); the other half is passed in integer registers out4 to out7 as the memory representation (packed);
Matrix4x4<double> is passed partly in registers: half of it (the first 8 doubles) is in floating-point registers, one value per register (unpacked); the other half is passed in memory;
QChar and QLatin1String are passed in registers;
Both QMatrix are passed entirely in registers, one value per register (unpacked);
QMatrix4x4 is passed like Matrix4x4, except that the integer is always in memory (the structure is larger than 8*8 bytes);
Individual floating-point parameters are passed one per register; type promotion happens internally in the register.
For the return values, we have:
The floating-point structures with up to 8 floating-point members are returned in registers;
The integer structures of up to 32 bytes are returned in registers;
All the rest is returned in memory supplied by the caller.
The conclusions are:
Type promotion happens in hardware, as IA-64 does not have specific registers for single or double precision (its FP registers hold only extended-precision data);
Homogeneous structures of floating-point types are passed in registers, up to 8 values; the rest goes to the integer registers if there are some still available or in memory;
All other structures are passed in the integer registers, up to 64 bytes;
Integer registers are allocated for passing any and all types, even if they aren't used (the ABI says they should be used in the case of C without prototypes).
ARM
Here are the results for parameter passing:
Pointers2, Pointers4, Integers2, and Integers4 are passed in registers (note that the Pointers and Integers structures are the same in 32-bit mode);
All of the Float types are passed in registers, one value per register, without promotion of floats to doubles; the values are also stored in memory but I can't tell if this is required or just GCC being dumb;
All types of Matrix4x4, QMatrix and QMatrix4x4 are passed in both memory and registers, with the registers containing the first 16 bytes;
QChar and QLatin1String are passed in registers;
The floating point parameters are passed one per register, without float promotion to double.
For the return values, we have:
All of the Float types are returned in registers and GCC then stores them all to memory even if they are never used afterwards;
QChar is returned in a register;
Everything else is returned in memory.
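Note that returning is one of the places where the 32-bit AAPCS differs from the 64-bit AAPCS: in the 64-bit one, if a type is passed in registers to a function as the first parameter, it is returned in those same registers. The 32-bit AAPCS restricts returning in registers to structures of 4 bytes or less.
My conclusions are: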
Single-precision floating-point types are not promoted to double;
Homogeneous structures (that is, structures containing one single type) of a floating-point type are passed in floating-point registers if the structure has 4 members or fewer;
MIPS
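I tried both a MIPS 32-bit build (using the GCC default o32 ABI) and a MIPS 64-bit build (using -mabi=o64 -mlong64). Unless noted otherwise, the results are the same for both architectures.
For parameter passing, they are: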
Both types of Integers and Pointers structures are passed in registers; on 64-bit, two 32-bit integers are packed into a single 64-bit register like x86-64;
Floats2<float>, Floats3<float>, and Floats4<float> are passed in integer registers, not in the floating-point registers; on 64-bit, two floats are packed into a single 64-bit register;
Floats2<double> is passed in integer registers; on 32-bit, two 32-bit registers are required to store each double;
On 32-bit, the first two doubles of Floats3<double> and Floats4<double> are passed in integer registers, the rest are passed in memory;
On 64-bit, Floats3<double> and Floats4<double> are passed entirely in integer registers;
Matrix4x4, QMatrix, and QMatrix4x4 are passed in integer registers (the portion that fits) and in memory (the rest);
QChar is passed in a register (on MIPS big-endian, it's passed on bits 16-31);
QLatin1String is passed on two registers;
The floating point parameters are passed one per register, without float promotion to double.
For the return values, MIPS is easy: everything is returned in memory, even QChar.
The conclusions are even easier to draw:
No float is promoted to double;
No structure is ever passed in floating-point registers;
No structure is ever returned in registers.
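General conclusion
There are only a few overall conclusions we can draw. One of them is that single-precision floating-point values are not explicitly promoted to double-precision when the function has formal parameters. Automatic promotion probably happens only for floating-point values passed through an ellipsis (...), but our problem statement was about calling functions whose parameters are known. The only slight deviation from the rule is IA-64, but that does not matter, since the hardware (like x87) operates in only one mode.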
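The difference between the two calling styles can be seen in plain C++. The following is a minimal sketch, not part of the original test program: `sumVariadic`, `sumKnown` and `checkPromotion` are hypothetical names made up for illustration. It relies on the language rule that a float passed through an ellipsis undergoes default argument promotion to double, whereas a float passed to a declared float parameter keeps its type.

```cpp
#include <cassert>
#include <cstdarg>

// Hypothetical demo function (not from the article): sums `count`
// floating-point values passed through an ellipsis.
double sumVariadic(int count, ...)
{
    assert(count >= 0);
    std::va_list ap;
    va_start(ap, count);
    double total = 0.0;
    for (int i = 0; i < count; ++i) {
        // Arguments matching "..." undergo default argument promotion,
        // so a float argument arrives as a double and must be read as one.
        total += va_arg(ap, double);
    }
    va_end(ap);
    return total;
}

// With a known parameter list, the arguments stay single-precision.
float sumKnown(float a, float b)
{
    return a + b;
}

// Usage sketch: both calls receive the same float literals.
void checkPromotion()
{
    assert(sumVariadic(2, 1.5f, 2.5f) == 4.0);
    assert(sumKnown(1.5f, 2.5f) == 4.0f);
}
```

Reading the variadic arguments with `va_arg(ap, float)` would be undefined behaviour, which is the language-level counterpart of the promotion the ABIs perform only for the ellipsis case.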
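For structures containing integer members (including pointers), there is nothing further to optimise: they are loaded into registers exactly as they appear in memory. That means the portion of the register corresponding to padding may contain uninitialised or garbage data, or that something really strange may happen, as with MIPS in big-endian mode. It also means that, on all architectures, types smaller than a register do not occupy the entire register, so they may be packed together with other members.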
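To make "loaded exactly as they appear in memory" concrete, here is a small sketch that builds the 64-bit image a register would hold for Integers2 on an ABI that packs two ints per register. The helper names (`registerImage`, `fromRegisterImage`, `checkRoundTrip`) are hypothetical, and the sketch assumes a 32-bit int with no padding between the members; it models the register with a `std::uint64_t` rather than inspecting real registers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct Integers2 { int i1, i2; };   // same layout as in the article

static_assert(sizeof(Integers2) == sizeof(std::uint64_t),
              "this sketch assumes 32-bit int and no padding");

// Hypothetical helper: copy the memory representation of the structure
// into one 64-bit value, the way a packing ABI would load one register.
std::uint64_t registerImage(const Integers2 &value)
{
    std::uint64_t image;
    std::memcpy(&image, &value, sizeof image);
    return image;
}

// Hypothetical helper: the callee side, storing the register back to memory.
Integers2 fromRegisterImage(std::uint64_t image)
{
    Integers2 value;
    std::memcpy(&value, &image, sizeof value);
    return value;
}

// Usage sketch: the round trip preserves both members.
void checkRoundTrip()
{
    const Integers2 in = {1, 2};
    const Integers2 out = fromRegisterImage(registerImage(in));
    assert(out.i1 == 1 && out.i2 == 2);
}
```

Which half of the 64-bit value holds `i1` depends on the byte order of the target, which is one reason the big-endian results look unusual.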
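Another one is quite obvious: structures containing floats are smaller than structures containing doubles, so they use less memory or fewer registers when passed.
To continue drawing conclusions, we need to exclude MIPS, since it passes everything in integer registers and returns everything through memory. If we do that, we can see that all the ABIs provide an optimisation for structures containing only one floating-point type. The ABI documents give them slightly different names, all of which mean a homogeneous floating-point structure. These optimisations mean that the structure is passed in floating-point registers under certain conditions.
The first one to break down is actually x86-64: the upper limit is 16 bytes, which restricts it to two SSE registers. The rationale seems to be passing one double-precision complex value, which takes 16 bytes. That we are able to pass four single-precision values is an unexpected benefit.
The remaining architectures (ARM and IA-64) can pass more values in registers, always at one value per register (no packing). IA-64 has more registers dedicated to parameter passing, so it can pass more than ARM.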
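The limits observed above for homogeneous floating-point structures can be summarised as three small functions. This is only a sketch of the results reported in this post, not an implementation of any ABI's classification algorithm; the function names are hypothetical, and a return value of 0 means "no floating-point registers are used" (the structure goes to memory or integer registers instead).

```cpp
#include <cstddef>

// x86-64: members are packed into 8-byte SSE slots, up to 16 bytes in total.
constexpr std::size_t sseRegistersX86_64(std::size_t members, std::size_t memberSize)
{
    return members * memberSize > 16 ? 0 : (members * memberSize + 7) / 8;
}

// ARM hard-float: one value per register, structures of up to 4 members.
constexpr std::size_t fpRegistersArm(std::size_t members)
{
    return members > 4 ? 0 : members;
}

// IA-64: one value per register for the first 8 values; the rest goes to
// integer registers or memory.
constexpr std::size_t fpRegistersIa64(std::size_t members)
{
    return members > 8 ? 8 : members;
}

// The results listed in the analysis sections, restated as checks.
static_assert(sseRegistersX86_64(2, sizeof(float)) == 1, "Floats2<float>: one SSE register");
static_assert(sseRegistersX86_64(4, sizeof(float)) == 2, "Floats4<float>: two SSE registers");
static_assert(sseRegistersX86_64(2, sizeof(double)) == 2, "Floats2<double>: two SSE registers");
static_assert(sseRegistersX86_64(3, sizeof(double)) == 0, "Floats3<double>: memory");
static_assert(fpRegistersArm(4) == 4, "Floats4: four registers on ARM");
static_assert(fpRegistersIa64(16) == 8, "Matrix4x4: first 8 values on IA-64");
```

Seen side by side, the packing on x86-64 is what lets four floats fit where only two doubles do, while ARM and IA-64 count members rather than bytes.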
Recommendations for code
Structures of up to 16 bytes containing integers and pointers should be passed by value;
Homogeneous structures of up to 16 bytes containing floating-point should be passed by value (2 doubles or 4 floats);
Mixed-type structures should be avoided; if they exist, passing by value is still a good idea;
The above is only valid for structures that are trivially copyable and trivially destructible. All C structures (POD in C++) meet these criteria.
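One way to apply these recommendations in generic code is a small type trait that picks the parameter type. The sketch below is an assumption-laden illustration, not something taken from Qt or from the original post: `PassByValue` and `ParamType` are hypothetical names, and the trait only checks size and triviality, so it does not distinguish homogeneous from mixed-type structures.

```cpp
#include <type_traits>

// Hypothetical trait: true for types that the recommendations above
// suggest passing by value (at most 16 bytes, trivial to copy and destroy).
template <typename T>
struct PassByValue : std::integral_constant<bool,
    sizeof(T) <= 16 &&
    std::is_trivially_copyable<T>::value &&
    std::is_trivially_destructible<T>::value>
{
};

// Hypothetical alias: T for small trivial types, const T & otherwise.
template <typename T>
using ParamType = typename std::conditional<PassByValue<T>::value, T, const T &>::type;

// Example types; WithDestructor is made up to show the triviality check.
struct Integers4 { int i1, i2, i3, i4; };
template <typename F> struct Floats4 { F f1, f2, f3, f4; };
struct WithDestructor { int i; ~WithDestructor() {} };

static_assert(std::is_same<ParamType<Integers4>, Integers4>::value,
              "16 bytes of integers: by value");
static_assert(std::is_same<ParamType<Floats4<double> >, const Floats4<double> &>::value,
              "32 bytes of doubles: by constant reference");
static_assert(std::is_same<ParamType<WithDestructor>, const WithDestructor &>::value,
              "non-trivial destructor: by constant reference");
```

A function template could then declare its parameter as `ParamType<T>`, at the cost of losing template argument deduction for that parameter.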
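Final note
I should note that the recommendations above do not always produce more efficient code. Even though the values can be passed in registers, every compiler I tested (GCC 4.6, Clang 3.0, ICC 12.1) still performed a lot of memory operations in some circumstances. It is quite common for the compiler to write the structure to memory and then load it into registers. When it does that, passing by constant reference would be more efficient, since it would replace the memory loads with arithmetic on the stack pointer.
However, those are simply a matter of further optimisation work by the compiler teams. The three compilers I tested for x86-64 optimise differently and, in almost all cases, at least one of them managed to do without the memory accesses. Interestingly, the behaviour also changes when we replace the padding space with zeroes.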