Old 03-13-2012, 05:18 PM
Michael Mol
 
Default hard drive encryption

On Tue, Mar 13, 2012 at 2:06 PM, Florian Philipp <lists@binarywings.net> wrote:
> Am 13.03.2012 18:45, schrieb Frank Steinmetzger:
>> On Tue, Mar 13, 2012 at 05:11:47PM +0100, Florian Philipp wrote:
>>
>>>> Since I am planning to encrypt only home/ under LVM control, what kind
>>>> of overhead should I expect?
>>>
>>> What do you mean with overhead? CPU utilization? In that case the
>>> overhead is minimal, especially when you run a 64-bit kernel with the
>>> optimized AES kernel module.
>>
>> Speaking of that...
>> I always wondered what the exact difference was between AES and AES i586. I
>> can gather myself that it's about optimisation for a specific architecture.
>> But which one would be best for my i686 Core 2 Duo?
>
> From what I can see in the kernel sources, there is a generic AES
> implementation using nothing but portable C code and then there is
> "aes-i586" assembler code with "aes_glue" C code.


> So I assume the i586
> version is better for you --- unless GCC suddenly got a lot better at
> optimizing code.

Since when, exactly? GCC isn't the best compiler at optimization, but
I fully expect current versions to produce better code for x86-64 than
hand-tuned i586. Wider registers, more registers, crypto acceleration
instructions and SIMD instructions are all very nice to have. I don't
know the specifics of AES, though, or what kind of crypto algorithm it
is, so it's entirely possible that one can't effectively parallelize
it except in some relatively unique circumstances.

--
:wq
 
Old 03-13-2012, 05:58 PM
Florian Philipp
 
Default hard drive encryption

Am 13.03.2012 19:18, schrieb Michael Mol:
> On Tue, Mar 13, 2012 at 2:06 PM, Florian Philipp <lists@binarywings.net> wrote:
>> Am 13.03.2012 18:45, schrieb Frank Steinmetzger:
>>> On Tue, Mar 13, 2012 at 05:11:47PM +0100, Florian Philipp wrote:
>>>
>>>>> Since I am planning to encrypt only home/ under LVM control, what kind
>>>>> of overhead should I expect?
>>>>
>>>> What do you mean with overhead? CPU utilization? In that case the
>>>> overhead is minimal, especially when you run a 64-bit kernel with the
>>>> optimized AES kernel module.
>>>
>>> Speaking of that...
>>> I always wondered what the exact difference was between AES and AES i586. I
>>> can gather myself that it's about optimisation for a specific architecture.
>>> But which one would be best for my i686 Core 2 Duo?
>>
>> From what I can see in the kernel sources, there is a generic AES
>> implementation using nothing but portable C code and then there is
>> "aes-i586" assembler code with "aes_glue" C code.
>
>
>> So I assume the i586
>> version is better for you --- unless GCC suddenly got a lot better at
>> optimizing code.
>
> Since when, exactly? GCC isn't the best compiler at optimization, but
> I fully expect current versions to produce better code for x86-64 than
> hand-tuned i586. Wider registers, more registers, crypto acceleration
> instructions and SIMD instructions are all very nice to have. I don't
> know the specifics of AES, though, or what kind of crypto algorithm it
> is, so it's entirely possible that one can't effectively parallelize
> it except in some relatively unique circumstances.
>

One sec. We are talking about a Core 2 Duo running in 32-bit mode, right?
That's what the i686 reference in the question meant --- or at least,
that's what I assumed.

If we are talking about 32-bit mode, none of what you describe is available.
Those additional registers and instructions are not accessible with i686
instructions. A Core 2 also has no AES instructions.

Of course, GCC could make use of what it knows about the CPU, like
number of parallel pipelines, pipeline depth, cache size, instructions
added in i686 and so on. But even then I doubt it can outperform
hand-tuned assembler, even if it is for a slightly older instruction set.

If instead we are talking about a Core 2 Duo running in x86_64 mode, we
should be talking about the aes-x86_64 module instead of the aes-i586
module, and that makes use of the complete instruction set of the Core 2,
including SSE2.

Regards,
Florian Philipp
 
Old 03-13-2012, 06:07 PM
Stroller
 
Default hard drive encryption

On 13 March 2012, at 18:18, Michael Mol wrote:
> ...
>> So I assume the i586
>> version is better for you --- unless GCC suddenly got a lot better at
>> optimizing code.
>
> Since when, exactly? GCC isn't the best compiler at optimization, but
> I fully expect current versions to produce better code for x86-64 than
> hand-tuned i586. Wider registers, more registers, crypto acceleration
> instructions and SIMD instructions are all very nice to have. I don't
> know the specifics of AES, though, or what kind of crypto algorithm it
> is, so it's entirely possible that one can't effectively parallelize
> it except in some relatively unique circumstances.

Do you have much experience of writing assembler?

I don't, and I'm not an expert on this, but I've read the odd blog article on this subject over the years.

What I've read often has the programmer looking at the assembly gcc produces and examining what it does. The compiler might not care how many registers it uses, and thus a variable might find itself frequently swapped back into RAM; the programmer does not have any control over the compiler, and IIRC some flags reserve a register for debugging (IIRC -fomit-frame-pointer disables this). I think it's possible to use registers more efficiently by swapping them (??) or by using bitwise comparisons and other tricks.

Assembler optimisation is only used on sections of code that are at the core of a loop - code that is called hundreds or thousands (even millions?) of times during the program's execution. It's not for code, such as reading the .config file or initialisation, which is only called once. Because the code in the core of the loop is called so often, you don't have to achieve much of an optimisation for the aggregate gain to be considerable.

The operations in question may only constitute a few lines of C, or a handful of machine operations, so it boils down to an algorithm that a human programmer is capable of getting a grip on and comprehending. Whilst compilers are clearly more efficient for large programs, on this micro scale, humans are more clever and creative than machines.

Encryption / decryption is an example of code that lends itself to this kind of optimisation. In particular AES was designed, I believe, to be amenable to implementation in this way. The reason for that was that it was desirable to have it run on embedded devices and on dedicated chips. So it boils down to a simple bitswap operation (??) - the plaintext is modified by the encryption key, input and output as a fast stream. Each byte goes in, each byte goes out, the same function performed on each one.

Another operation that lends itself to assembler optimisation is video decoding - the video is encoded only once, and then may be played back hundreds or millions of times by different people. The same operations must be repeated a number of times on each frame, and roughly 25 - 60 frames are decoded per second, so at least 90,000 frames per hour. Again, the smallest optimisation is worthwhile.

Stroller.
 
Old 03-13-2012, 06:13 PM
Michael Mol
 
Default hard drive encryption

On Tue, Mar 13, 2012 at 2:58 PM, Florian Philipp <lists@binarywings.net> wrote:
> Am 13.03.2012 19:18, schrieb Michael Mol:
>> On Tue, Mar 13, 2012 at 2:06 PM, Florian Philipp <lists@binarywings.net> wrote:
>>> Am 13.03.2012 18:45, schrieb Frank Steinmetzger:
>>>> On Tue, Mar 13, 2012 at 05:11:47PM +0100, Florian Philipp wrote:
>>>>
>>>>>> Since I am planning to encrypt only home/ under LVM control, what kind
>>>>>> of overhead should I expect?
>>>>>
>>>>> What do you mean with overhead? CPU utilization? In that case the
>>>>> overhead is minimal, especially when you run a 64-bit kernel with the
>>>>> optimized AES kernel module.
>>>>
>>>> Speaking of that...
>>>> I always wondered what the exact difference was between AES and AES i586. I
>>>> can gather myself that it's about optimisation for a specific architecture.
>>>> But which one would be best for my i686 Core 2 Duo?
>>>
>>> From what I can see in the kernel sources, there is a generic AES
>>> implementation using nothing but portable C code and then there is
>>> "aes-i586" assembler code with "aes_glue" C code.
>>
>>
>>> So I assume the i586
>>> version is better for you --- unless GCC suddenly got a lot better at
>>> optimizing code.
>>
>> Since when, exactly? GCC isn't the best compiler at optimization, but
>> I fully expect current versions to produce better code for x86-64 than
>> hand-tuned i586. Wider registers, more registers, crypto acceleration
>> instructions and SIMD instructions are all very nice to have. I don't
>> know the specifics of AES, though, or what kind of crypto algorithm it
>> is, so it's entirely possible that one can't effectively parallelize
>> it except in some relatively unique circumstances.
>>
>
> One sec. We are talking about an Core2 Duo running in 32bit mode, right?
> That's what the i686 reference in the question meant --- or at least,
> that's what I assumed.

I think you're right; I missed that part.

>
> If we talk about 32bit mode, none of what you describe is available.
> Those additional registers and instructions are not accessible with i686
> instructions. A Core 2 also has no AES instructions.
>
> Of course, GCC could make use of what it knows about the CPU, like
> number of parallel pipelines, pipeline depth, cache size, instructions
> added in i686 and so on. But even then I doubt it can outperform
> hand-tuned assembler, even if it is for a slightly older instruction set.

I'm still not sure why. I'll posit that some badly-written C could
place constraints on the compiler's optimizer, but GCC should have
little problem handling well-written C, separating semantics from
syntax and finding good transforms of the original code to get
provably-same results. Unless I'm grossly overestimating the
capabilities of its AST processing and optimization engine.

>
> If instead we are talking about an Core 2 Duo running in x86_64 mode, we
> should be talking about the aes-x86_64 module instead of the aes-i586
> module and that makes use of the complete instruction set of the Core 2,
> including SSE2.

FWIW, SSE2 is available on 32-bit processors; I have code in the field
using SSE2 on Pentium 4s.

--
:wq
 
Old 03-13-2012, 06:18 PM
Florian Philipp
 
Default hard drive encryption

Am 13.03.2012 19:58, schrieb Florian Philipp:
> Am 13.03.2012 19:18, schrieb Michael Mol:
>> On Tue, Mar 13, 2012 at 2:06 PM, Florian Philipp <lists@binarywings.net> wrote:
>>> Am 13.03.2012 18:45, schrieb Frank Steinmetzger:
>>>> On Tue, Mar 13, 2012 at 05:11:47PM +0100, Florian Philipp wrote:
>>>>
>>>>>> Since I am planning to encrypt only home/ under LVM control, what kind
>>>>>> of overhead should I expect?
>>>>>
>>>>> What do you mean with overhead? CPU utilization? In that case the
>>>>> overhead is minimal, especially when you run a 64-bit kernel with the
>>>>> optimized AES kernel module.
>>>>
>>>> Speaking of that...
>>>> I always wondered what the exact difference was between AES and AES i586. I
>>>> can gather myself that it's about optimisation for a specific architecture.
>>>> But which one would be best for my i686 Core 2 Duo?
>>>
>>> From what I can see in the kernel sources, there is a generic AES
>>> implementation using nothing but portable C code and then there is
>>> "aes-i586" assembler code with "aes_glue" C code.
>>
>>
>>> So I assume the i586
>>> version is better for you --- unless GCC suddenly got a lot better at
>>> optimizing code.
>>
>> Since when, exactly? GCC isn't the best compiler at optimization, but
>> I fully expect current versions to produce better code for x86-64 than
>> hand-tuned i586. Wider registers, more registers, crypto acceleration
>> instructions and SIMD instructions are all very nice to have. I don't
>> know the specifics of AES, though, or what kind of crypto algorithm it
>> is, so it's entirely possible that one can't effectively parallelize
>> it except in some relatively unique circumstances.
>>
>
> One sec. We are talking about an Core2 Duo running in 32bit mode, right?
> That's what the i686 reference in the question meant --- or at least,
> that's what I assumed.
>
> If we talk about 32bit mode, none of what you describe is available.
> Those additional registers and instructions are not accessible with i686
> instructions. A Core 2 also has no AES instructions.
>
> Of course, GCC could make use of what it knows about the CPU, like
> number of parallel pipelines, pipeline depth, cache size, instructions
> added in i686 and so on. But even then I doubt it can outperform
> hand-tuned assembler, even if it is for a slightly older instruction set.
>

P.S.: I just looked up the differences in the instruction sets of i586
and i686. The only significant instruction added in i686 was the
conditional move (CMOV). It helps to avoid conditional jumps. However,
in the aes-i586 code there are only two conditional jumps, and they both
just end the loop of encryption/decryption rounds for AES-128 and
AES-256, respectively. My assembler isn't perfect, but I doubt you can
optimize that away with a CMOV.
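
As a rough illustration (a made-up example, not taken from the kernel code): a data-dependent select like the one below is the kind of branch GCC can lower to a CMOV when optimizing for i686 or newer.

    /* Hypothetical example: built with optimization for i686 or newer
     * (e.g. gcc -O2 -march=i686), GCC can usually emit a CMOV for the
     * ternary below instead of a conditional jump, avoiding a possible
     * branch misprediction. */
    int select_max(int a, int b)
    {
            return (a >= b) ? a : b;
    }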

> If instead we are talking about an Core 2 Duo running in x86_64 mode, we
> should be talking about the aes-x86_64 module instead of the aes-i586
> module and that makes use of the complete instruction set of the Core 2,
> including SSE2.
>
> Regards,
> Florian Philipp
 
Old 03-13-2012, 06:30 PM
Florian Philipp
 
Default hard drive encryption

Am 13.03.2012 20:13, schrieb Michael Mol:
> On Tue, Mar 13, 2012 at 2:58 PM, Florian Philipp <lists@binarywings.net> wrote:
>> Am 13.03.2012 19:18, schrieb Michael Mol:
>>> On Tue, Mar 13, 2012 at 2:06 PM, Florian Philipp <lists@binarywings.net> wrote:
>>>> Am 13.03.2012 18:45, schrieb Frank Steinmetzger:
>>>>> On Tue, Mar 13, 2012 at 05:11:47PM +0100, Florian Philipp wrote:
>>>>>
>>>>>>> Since I am planning to encrypt only home/ under LVM control, what kind
>>>>>>> of overhead should I expect?
>>>>>>
>>>>>> What do you mean with overhead? CPU utilization? In that case the
>>>>>> overhead is minimal, especially when you run a 64-bit kernel with the
>>>>>> optimized AES kernel module.
>>>>>
>>>>> Speaking of that...
>>>>> I always wondered what the exact difference was between AES and AES i586. I
>>>>> can gather myself that it's about optimisation for a specific architecture.
>>>>> But which one would be best for my i686 Core 2 Duo?
>>>>
>>>> From what I can see in the kernel sources, there is a generic AES
>>>> implementation using nothing but portable C code and then there is
>>>> "aes-i586" assembler code with "aes_glue" C code.
>>>
>>>
>>>> So I assume the i586
>>>> version is better for you --- unless GCC suddenly got a lot better at
>>>> optimizing code.
>>>
>>> Since when, exactly? GCC isn't the best compiler at optimization, but
>>> I fully expect current versions to produce better code for x86-64 than
>>> hand-tuned i586. Wider registers, more registers, crypto acceleration
>>> instructions and SIMD instructions are all very nice to have. I don't
>>> know the specifics of AES, though, or what kind of crypto algorithm it
>>> is, so it's entirely possible that one can't effectively parallelize
>>> it except in some relatively unique circumstances.
>>>
>>
>> One sec. We are talking about an Core2 Duo running in 32bit mode, right?
>> That's what the i686 reference in the question meant --- or at least,
>> that's what I assumed.
>
> I think you're right; I missed that part.
>
>>
>> If we talk about 32bit mode, none of what you describe is available.
>> Those additional registers and instructions are not accessible with i686
>> instructions. A Core 2 also has no AES instructions.
>>
>> Of course, GCC could make use of what it knows about the CPU, like
>> number of parallel pipelines, pipeline depth, cache size, instructions
>> added in i686 and so on. But even then I doubt it can outperform
>> hand-tuned assembler, even if it is for a slightly older instruction set.
>
> I'm still not sure why. I'll posit that some badly-written C could
> place constraints on the compiler's optimizer, but GCC should have
> little problem handling well-written C, separating semantics from
> syntax and finding good transforms of the original code to get
> proofably-same results. Unless I'm grossly overestimating the
> capabilities of its AST processing and optimization engine.
>

Well, it's not /that/ good. Otherwise the Firefox ebuild wouldn't need a
profiling run to let the compiler predict loop and jump probabilities
and so on.

But, by all means, let's test it! It's not like we cannot.
Unfortunately, I don't have a 32-bit Gentoo machine at hand where I could
test it right now.

>>
>> If instead we are talking about an Core 2 Duo running in x86_64 mode, we
>> should be talking about the aes-x86_64 module instead of the aes-i586
>> module and that makes use of the complete instruction set of the Core 2,
>> including SSE2.
>
> FWIW, SSE2 is available on 32-bit processors; I have code in the field
> using SSE2 on Pentium 4s.
>

Um, yeah. I should have clarified that. I meant that for x86_64
machines, the compiler as well as the assembler programmer can safely
assume that SSE2 is available. For generic i686 assembler code, you cannot.
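
As a small sketch of what that means in practice (illustrative code, not from the thread): on x86_64 the compiler defines __SSE2__ unconditionally, while a generic i686 build has to probe the CPU at runtime, e.g. via CPUID, before taking an SSE2 path.

    #include <cpuid.h>
    #include <stdio.h>

    /* Sketch: on x86_64, SSE2 is part of the baseline, so __SSE2__ is
     * always defined; a generic i686 build cannot assume it and must
     * check CPUID leaf 1, EDX bit 26 (bit_SSE2) at runtime. */
    int main(void)
    {
    #ifdef __SSE2__
            puts("SSE2 assumed available at compile time");
    #else
            unsigned int eax, ebx, ecx, edx;
            if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & bit_SSE2))
                    puts("SSE2 detected at runtime");
            else
                    puts("no SSE2");
    #endif
            return 0;
    }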

Regards,
Florian Philipp
 
Old 03-13-2012, 06:38 PM
Michael Mol
 
Default hard drive encryption

On Tue, Mar 13, 2012 at 3:07 PM, Stroller
<stroller@stellar.eclipse.co.uk> wrote:
>
> On 13 March 2012, at 18:18, Michael Mol wrote:
>> ...
>>> So I assume the i586
>>> version is better for you --- unless GCC suddenly got a lot better at
>>> optimizing code.
>>
>> Since when, exactly? GCC isn't the best compiler at optimization, but
>> I fully expect current versions to produce better code for x86-64 than
>> hand-tuned i586. Wider registers, more registers, crypto acceleration
>> instructions and SIMD instructions are all very nice to have. I don't
>> know the specifics of AES, though, or what kind of crypto algorithm it
>> is, so it's entirely possible that one can't effectively parallelize
>> it except in some relatively unique circumstances.
>
> Do you have much experience of writing assembler?
>
> I don't, and I'm not an expert on this, but I've read the odd blog article on this subject over the years.

Similar level of experience here. I can read it, even debug it from
time to time. A few regular bloggers on the subject are like candy.
And I used to have pagetable.org, Ars's Technopaedia and spec sheets
for early x86 and Motorola processors memorized. For the past couple
years, I've been focusing on reading blogs of language and compiler
authors, academics involved in proofing, testing and improving them,
etc.

>
> What I've read often has the programmer looking at the compiled gcc bytecode and examining what it does. The compiler might not care how many registers it uses, and thus a variable might find itself frequently swapped back into RAM; the programmer does not have any control over the compiler, and IIRC some flags reserve a register for degugging (IIRC -fomit-frame-pointer disables this). I think it's possible to use registers more efficiently by swapping them (??) or by using bitwise comparisons and other tricks.

Sure; it's cheaper to null out a register by XORing it with itself
than setting it to 0.
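
A made-up example of what that looks like from C: with optimization enabled, GCC typically clears the return register with an XOR rather than a move of an immediate zero.

    /* Illustrative only: at -O2, GCC typically compiles this to
     * "xor %eax, %eax; ret" -- the xor encoding is shorter than
     * "mov $0, %eax" and at least as fast on common x86 parts. */
    int always_zero(void)
    {
            return 0;
    }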

>
> Assembler optimisation is only used on sections of code that are at the core of a loop - that are called hundreds or thousands (even millions?) of times during the program's execution. It's not for code, such as reading the .config file or initialisation, which is only called once. Because the code in the core of the loop is called so often, you don't have to achieve much of an optimisation for the aggregate to be much more considerable.

Sure; optimize the hell out of the code where you spend most of your
time. I wasn't aware that gcc passed up on safe optimization
opportunities, though.

>
> The operations in question may only be constitute a few lines of C, or a handful of machine operations, so it boils down to an algorithm that a human programmer is capable of getting a grip on and comprehending. Whilst compilers are clearly more efficient for large programs, on this micro scale, humans are more clever and creative than machines.

I disagree. With defined semantics for the source and target, a
computer's cleverness is limited only by the computational and memory
expense of its search algorithms. Humans get through this by making
various optimizations habitual, but those habits become less useful as
additional paths and instructions are added. As system complexity
increases, humans operate on personally cached techniques derived from
simpler systems. I would expect very, very few people to be intimately
familiar with the majority of optimization possibilities present
on an amdfam10 processor or a core2. Compilers aren't necessarily
familiar with them, either; they're just quicker at discovering them,
given knowledge of the individual instructions and the rules of
language semantics.

>
> Encryption / decryption is an example of code that lends itself to this kind of optimisation. In particular AES was designed, I believe, to be amenable to implementation in this way. The reason for that was that it was desirable to have it run on embedded devices and on dedicated chips. So it boils down to a simple bitswap operation (??) - the plaintext is modified by the encryption key, input and output as a fast stream. Each byte goes in, each byte goes out, the same function performed on each one.

I'd be willing to posit that you're right here, though if there isn't
a per-byte feedback mechanism, SIMD instructions would come into
serious play. But I expect there's a per-byte feedback mechanism, so
parallelization would likely come in the form of processing
simultaneous streams.
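
To make that feedback point concrete, here is a simplified sketch (the cipher call is a made-up stand-in, not a real API): in a chained mode like CBC, each block's input depends on the previous ciphertext block, so a single stream is inherently serial and parallelism has to come from independent streams (e.g. separate disk sectors).

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Stand-in for the real block cipher; only the data flow matters here. */
    static void encrypt_block(uint8_t out[16], const uint8_t in[16])
    {
            memcpy(out, in, 16);
    }

    /* CBC chaining: block i cannot be encrypted before block i-1 is done. */
    static void cbc_encrypt(uint8_t *buf, size_t nblocks, const uint8_t iv[16])
    {
            uint8_t prev[16], tmp[16];
            memcpy(prev, iv, 16);
            for (size_t i = 0; i < nblocks; i++) {
                    for (int j = 0; j < 16; j++)
                            tmp[j] = buf[16 * i + j] ^ prev[j];  /* feedback */
                    encrypt_block(&buf[16 * i], tmp);
                    memcpy(prev, &buf[16 * i], 16);
            }
    }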

>
> Another operation that lends itself to assembler optimisation is video decoding - the video is encoded only once, and then may be played back hundreds or millions of times by different people. The same operations must be repeated a number of times on each frame, then c 25 - 60 frames are decoded per second, so at least 90,000 frames per hour. Again, the smallest optimisation is worthwhile.

Absolutely. My position, though, is that compilers are quicker and
more capable of discovering optimization possibilities than humans
are, when the target architecture changes. Sure, you've got several
dozen video codecs in, say, ffmpeg, and perhaps it all boils down to
less than a dozen very common cases of inner loop code. With
hand-tuned optimization, you'd need to fork your assembly patch for
each new processor feature that comes out, and then work to find the
most efficient way to execute code on that processor.

There are also cases where processor features get changed. I don't
remember the name of the instruction (it had something to do with
stack operations) in x86, but Intel switched it from a 0-cycle
instruction to something more expensive. Any code which assumed that
instruction was a 0-cycle instruction now became less efficient. A
compiler (presuming it has a knowledge of the target processor's
instruction set properties) would have an easier time coping with that
change than a human would.

I'm not saying humans are useless; this is just one of those areas
which is sufficiently complex-yet-deterministic that sufficient
knowledge of the source and target environments would give a computer
the edge over a human in finding the optimal sequence of CPU
instructions.

--
:wq
 
Old 03-13-2012, 06:42 PM
Michael Mol
 
Default hard drive encryption

On Tue, Mar 13, 2012 at 3:30 PM, Florian Philipp <lists@binarywings.net> wrote:
> Am 13.03.2012 20:13, schrieb Michael Mol:
>> On Tue, Mar 13, 2012 at 2:58 PM, Florian Philipp <lists@binarywings.net> wrote:
>>> Am 13.03.2012 19:18, schrieb Michael Mol:
>>>> On Tue, Mar 13, 2012 at 2:06 PM, Florian Philipp <lists@binarywings.net> wrote:
>>>>> Am 13.03.2012 18:45, schrieb Frank Steinmetzger:
>>>>>> On Tue, Mar 13, 2012 at 05:11:47PM +0100, Florian Philipp wrote:
>>>>>>
>>>>>>>> Since I am planning to encrypt only home/ under LVM control, what kind
>>>>>>>> of overhead should I expect?
>>>>>>>
>>>>>>> What do you mean with overhead? CPU utilization? In that case the
>>>>>>> overhead is minimal, especially when you run a 64-bit kernel with the
>>>>>>> optimized AES kernel module.
>>>>>>
>>>>>> Speaking of that...
>>>>>> I always wondered what the exact difference was between AES and AES i586. I
>>>>>> can gather myself that it's about optimisation for a specific architecture.
>>>>>> But which one would be best for my i686 Core 2 Duo?
>>>>>
>>>>> From what I can see in the kernel sources, there is a generic AES
>>>>> implementation using nothing but portable C code and then there is
>>>>> "aes-i586" assembler code with "aes_glue" C code.
>>>>
>>>>
>>>>> So I assume the i586
>>>>> version is better for you --- unless GCC suddenly got a lot better at
>>>>> optimizing code.
>>>>
>>>> Since when, exactly? GCC isn't the best compiler at optimization, but
>>>> I fully expect current versions to produce better code for x86-64 than
>>>> hand-tuned i586. Wider registers, more registers, crypto acceleration
>>>> instructions and SIMD instructions are all very nice to have. I don't
>>>> know the specifics of AES, though, or what kind of crypto algorithm it
>>>> is, so it's entirely possible that one can't effectively parallelize
>>>> it except in some relatively unique circumstances.
>>>>
>>>
>>> One sec. We are talking about an Core2 Duo running in 32bit mode, right?
>>> That's what the i686 reference in the question meant --- or at least,
>>> that's what I assumed.
>>
>> I think you're right; I missed that part.
>>
>>>
>>> If we talk about 32bit mode, none of what you describe is available.
>>> Those additional registers and instructions are not accessible with i686
>>> instructions. A Core 2 also has no AES instructions.
>>>
>>> Of course, GCC could make use of what it knows about the CPU, like
>>> number of parallel pipelines, pipeline depth, cache size, instructions
>>> added in i686 and so on. But even then I doubt it can outperform
>>> hand-tuned assembler, even if it is for a slightly older instruction set.
>>
>> I'm still not sure why. I'll posit that some badly-written C could
>> place constraints on the compiler's optimizer, but GCC should have
>> little problem handling well-written C, separating semantics from
>> syntax and finding good transforms of the original code to get
>> proofably-same results. Unless I'm grossly overestimating the
>> capabilities of its AST processing and optimization engine.
>>
>
> Well, it's not /that/ good. Otherwise the Firefox ebuild wouldn't need a
> profiling run to allow the compiler to predict loop and jump certainties
> and so on.

I was thinking more in the context of simple functions and
mathematical operations. Loop probabilities? Yeah, that's a tough one.
Nobody wants to stall a huge CPU pipeline. I remember when the
NetBurst architecture came out. Intel cranked up the amount of die
space dedicated to branch prediction...

>
> But, by all means, let's test it! It's not like we cannot.
> Unfortunately, I don't have a 32bit Gentoo machine at hand where I could
> test it right now.

Now we're talking.

Unfortunately, I don't have a 32-bit Gentoo environment available,
either. Actually, I've never run Gentoo in a 32-bit environment. >.>

--
:wq
 
Old 03-13-2012, 07:02 PM
Florian Philipp
 
Default hard drive encryption

Am 13.03.2012 20:07, schrieb Stroller:
>
> On 13 March 2012, at 18:18, Michael Mol wrote:
>> ...
>>> So I assume the i586 version is better for you --- unless GCC
>>> suddenly got a lot better at optimizing code.
>>
>> Since when, exactly? GCC isn't the best compiler at optimization,
>> but I fully expect current versions to produce better code for
>> x86-64 than hand-tuned i586. Wider registers, more registers,
>> crypto acceleration instructions and SIMD instructions are all very
>> nice to have. I don't know the specifics of AES, though, or what
>> kind of crypto algorithm it is, so it's entirely possible that one
>> can't effectively parallelize it except in some relatively unique
>> circumstances.
>
> Do you have much experience of writing assembler?
>
> I don't, and I'm not an expert on this, but I've read the odd blog
> article on this subject over the years.
>
> What I've read often has the programmer looking at the compiled gcc
> bytecode and examining what it does. The compiler might not care how
> many registers it uses, and thus a variable might find itself
> frequently swapped back into RAM; the programmer does not have any
> control over the compiler, and IIRC some flags reserve a register for
> degugging (IIRC -fomit-frame-pointer disables this). I think it's
> possible to use registers more efficiently by swapping them (??) or
> by using bitwise comparisons and other tricks.
>

You recall correctly about the frame pointer.

Concerning the register usage: I'm no expert in this field, either, but
I think the main issue is not simply register allocation but branch and
exception prediction and so on. The compiler can either optimize for a
seamless continuation if the jump happens or if it doesn't. A human or a
just-in-time compiler can better handle these cases by predicting the
outcome or -- in the case of a JIT -- by analyzing the outcome of the
first few iterations.

OT: IIRC, register reuse is also the main performance problem of
state-of-the-art JavaScript engines at the moment. Concerning the code
they compile at runtime, they are nearly as good as `gcc -O0`, but they
have the same problem concerning registers (GCC with -O0 produces code
that works exactly as you describe above: storing the result after every
computation and loading it again).
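
A tiny made-up example of that behaviour: compiled with `gcc -O0`, the variable sum below is kept on the stack and reloaded and stored on every iteration; with -O2 it stays in a register for the whole loop.

    /* Illustrative sketch: at -O0, "sum" and "i" live on the stack and
     * are loaded/stored each iteration; at -O2 the accumulation runs
     * entirely out of registers. */
    int sum_array(const int *a, int n)
    {
            int sum = 0;
            for (int i = 0; i < n; i++)
                    sum += a[i];
            return sum;
    }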

> Assembler optimisation is only used on sections of code that are at
> the core of a loop - that are called hundreds or thousands (even
> millions?) of times during the program's execution. It's not for
> code, such as reading the .config file or initialisation, which is
> only called once. Because the code in the core of the loop is called
> so often, you don't have to achieve much of an optimisation for the
> aggregate to be much more considerable.
>
> The operations in question may only be constitute a few lines of C,
> or a handful of machine operations, so it boils down to an algorithm
> that a human programmer is capable of getting a grip on and
> comprehending. Whilst compilers are clearly more efficient for large
> programs, on this micro scale, humans are more clever and creative
> than machines.
>
> Encryption / decryption is an example of code that lends itself to
> this kind of optimisation. In particular AES was designed, I believe,
> to be amenable to implementation in this way. The reason for that was
> that it was desirable to have it run on embedded devices and on
> dedicated chips. So it boils down to a simple bitswap operation (??)
> - the plaintext is modified by the encryption key, input and output
> as a fast stream. Each byte goes in, each byte goes out, the same
> function performed on each one.
>

Well, sort of. First off, you are right, AES was designed with hardware
implementations in mind.

The algorithm boils down to a number of substitution and permutation
networks and XOR operations (I assume that's what you meant by bit
swap). If you look at the portable C code
(/usr/src/linux/crypto/aes_generic.c), you can see that it mostly
consists of lookup tables and XORs.

The thing about "each byte goes in, each byte goes out", however, is a
bit wrong. What you are thinking of is a stream cipher like RC4. AES is
a block cipher. Block ciphers take an input block (in this case 128 bits
long), XOR it with the encryption (sub-)key, and shuffle it around
according to the exact algorithm.
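
As a rough, simplified sketch (not the actual aes_generic.c code): the "XOR with the subkey" part of each round works on the whole 128-bit block, and the substitution/permutation work of a round is folded into precomputed 32-bit lookup tables.

    #include <stdint.h>

    /* Simplified AddRoundKey: the 128-bit block state is XORed
     * word-by-word with the current round subkey.  In table-driven
     * code the rest of a round is roughly
     *   out = T0[b0] ^ T1[b1] ^ T2[b2] ^ T3[b3] ^ subkey_word;
     * where b0..b3 are bytes picked from the current state and
     * T0..T3 are precomputed lookup tables. */
    static void add_round_key(uint32_t state[4], const uint32_t subkey[4])
    {
            int i;
            for (i = 0; i < 4; i++)
                    state[i] ^= subkey[i];
    }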

> Another operation that lends itself to assembler optimisation is
> video decoding - the video is encoded only once, and then may be
> played back hundreds or millions of times by different people. The
> same operations must be repeated a number of times on each frame,
> then c 25 - 60 frames are decoded per second, so at least 90,000
> frames per hour. Again, the smallest optimisation is worthwhile.
>
> Stroller.
>
>
 
Old 03-13-2012, 07:15 PM
Florian Philipp
 
Default hard drive encryption

Am 13.03.2012 20:38, schrieb Michael Mol:
> On Tue, Mar 13, 2012 at 3:07 PM, Stroller
> <stroller@stellar.eclipse.co.uk> wrote:
>>
>> On 13 March 2012, at 18:18, Michael Mol wrote:
>>> ...
>>>> So I assume the i586 version is better for you --- unless GCC
>>>> suddenly got a lot better at optimizing code.
>>>
>>> Since when, exactly? GCC isn't the best compiler at optimization,
>>> but I fully expect current versions to produce better code for
>>> x86-64 than hand-tuned i586. Wider registers, more registers,
>>> crypto acceleration instructions and SIMD instructions are all
>>> very nice to have. I don't know the specifics of AES, though, or
>>> what kind of crypto algorithm it is, so it's entirely possible
>>> that one can't effectively parallelize it except in some
>>> relatively unique circumstances.
>>
>> Do you have much experience of writing assembler?
>>
>> I don't, and I'm not an expert on this, but I've read the odd blog
>> article on this subject over the years.
>
> Similar level of experience here. I can read it, even debug it from
> time to time. A few regular bloggers on the subject are like candy.
> And I used to have pagetable.org, Ars's Technopaedia and specsheets
> for early x86 and motorola processors memorized. For the past couple
> years, I've been focusing on reading blogs of language and compiler
> authors, academics involved in proofing, testing and improving them,
> etc.
>
>>
>> What I've read often has the programmer looking at the compiled gcc
>> bytecode and examining what it does. The compiler might not care
>> how many registers it uses, and thus a variable might find itself
>> frequently swapped back into RAM; the programmer does not have any
>> control over the compiler, and IIRC some flags reserve a register
>> for degugging (IIRC -fomit-frame-pointer disables this). I think
>> it's possible to use registers more efficiently by swapping them
>> (??) or by using bitwise comparisons and other tricks.
>
> Sure; it's cheaper to null out a register by XORing it with itself
> than setting it to 0.
>
>>
>> Assembler optimisation is only used on sections of code that are at
>> the core of a loop - that are called hundreds or thousands (even
>> millions?) of times during the program's execution. It's not for
>> code, such as reading the .config file or initialisation, which is
>> only called once. Because the code in the core of the loop is
>> called so often, you don't have to achieve much of an optimisation
>> for the aggregate to be much more considerable.
>
> Sure; optimize the hell out of the code where you spend most of your
> time. I wasn't aware that gcc passed up on safe optimization
> opportunities, though.
>
>>
>> The operations in question may only be constitute a few lines of C,
>> or a handful of machine operations, so it boils down to an
>> algorithm that a human programmer is capable of getting a grip on
>> and comprehending. Whilst compilers are clearly more efficient for
>> large programs, on this micro scale, humans are more clever and
>> creative than machines.
>
> I disagree. With defined semantics for the source and target, a
> computer's cleverness is limited only by the computational and
> memory expense of its search algorithms. Humans get through this by
> making habit various optimizations, but those habits become less
> useful as additional paths and instructions are added. As system
> complexity increases, humans operate on personally cached techniques
> derived from simpler systems. I would expect very, very few people to
> be intimately familiar with the the majority of optimization
> possibilities present on an amdfam10 processor or a core2. Compiler's
> aren't necessarily familiar with them, either; they're just quicker
> at discovering them, given knowledge of the individual instructions
> and the rules of language semantics.
>
>>
>> Encryption / decryption is an example of code that lends itself to
>> this kind of optimisation. In particular AES was designed, I
>> believe, to be amenable to implementation in this way. The reason
>> for that was that it was desirable to have it run on embedded
>> devices and on dedicated chips. So it boils down to a simple
>> bitswap operation (??) - the plaintext is modified by the
>> encryption key, input and output as a fast stream. Each byte goes
>> in, each byte goes out, the same function performed on each one.
>
> I'd be willing to posit that you're right here, though if there
> isn't a per-byte feedback mechanism, SIMD instructions would come
> into serious play. But I expect there's a per-byte feedback
> mechanism, so parallelization would likely come in the form of
> processing simultaneous streams.
>
>>
>> Another operation that lends itself to assembler optimisation is
>> video decoding - the video is encoded only once, and then may be
>> played back hundreds or millions of times by different people. The
>> same operations must be repeated a number of times on each frame,
>> then c 25 - 60 frames are decoded per second, so at least 90,000
>> frames per hour. Again, the smallest optimisation is worthwhile.
>
> Absolutely. My position, though, is that compilers are quicker and
> more capable of discovering optimization possibilities than humans
> are, when the target architecture changes. Sure, you've got several
> dozen video codecs in, say, ffmpeg, and perhaps it all boils down to
> less than a dozen very common cases of inner loop code. With
> hand-tuned optimization, you'd need to fork your assembly patch for
> each new processor feature that comes out, and then work to find the
> most efficient way to execute code on that processor.
>
> There's also cases where processor features get changed. I don't
> remember the name of the instruction (it had something to do with
> stack operations) in x86, but Intel switched it from a 0-cycle
> instruction to something more expensive. Any code which assumed that
> instruction was a 0-cycle instruction now became less efficient. A
> compiler (presuming it has a knowledge of the target processor's
> instruction set properties) would have an easier time coping with
> that change than a human would.
>
> I'm not saying humans are useless; this is just one of those areas
> which is sufficiently complex-yet-deterministic that sufficient
> knowledge of the source and target environments would give a
> computer the edge over a human in finding the optimal sequence of
> CPU instructions.
>

This thread is becoming ridiculously long. Just as a last side-note:

One of the primary reasons that the IA64 architecture failed was that it
relied on the compiler to optimize the code in order to exploit the
massive instruction-level parallelism the CPU offered. Compilers never
became good enough for the job. Of course, that happened in the
nineties and we have much better compilers now (and x86 is easier to
handle for compilers). But on the other hand: That was Intel's next big
thing and if they couldn't make the compilers work, I have no reason to
believe in their efficiency now.

Regards,
Florian Philipp
 
