Why is this code 6.5x slower with optimizations enabled?


I wanted to benchmark glibc's strlen function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.



Here's my code:



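#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
    /* 1 MiB zero-filled buffer; the first 1000000 bytes are set to 'A' (65). */
    char *s = calloc(1 << 20, 1);
    memset(s, 65, 1000000);
    clock_t start = clock();
    /* Each iteration scans the whole string, then appends one 'A'. */
    for (int i = 0; i < 128; ++i) {
        s[strlen(s)] = 'A';
    }
    clock_t end = clock();
    printf("%lld\n", (long long)(end-start));
    return 0;
}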



On my machine it outputs:



$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415


Somehow, enabling optimizations causes it to execute longer.










  • With gcc-8.2 debug version takes 51334, release 8246. Release compiler options -O3 -march=native -DNDEBUG

    – Maxim Egorushkin
    1 hour ago












  • Please report it to gcc's bugzilla.

    – Marc Glisse
    1 hour ago











  • Using -fno-builtin makes the problem go away. So presumably the issue is that in this particular instance, GCC's builtin strlen is slower than the library's.

    – David Schwartz
    1 hour ago











  • It is generating repnz scasb for strlen at -O1.

    – Marc Glisse
    1 hour ago












  • @MarcGlisse and for -O2 and -O3, it's loading and comparing the chars as integers. Unfortunately, the naive -O0 uses the library function which uses vector-instructions that beat this optimization easily.

    – EOF
    1 hour ago


















c gcc glibc






edited 2 hours ago by Fei Xiang

asked 2 hours ago by TsarN












1 Answer
Testing your code on Godbolt's Compiler Explorer provides this explanation:



  • at -O0 or without optimisations, the generated code calls the C library function strlen.

  • at -O1 the generated code uses a simple inline expansion using a rep scasb instruction.

  • at -O2 and above, the generated code uses a more elaborate inline expansion.
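
As a rough illustration of the difference between the two inline strategies, here is a sketch in C of a byte-at-a-time scan (conceptually what a rep scasb loop does) and a word-at-a-time scan (loading several chars at once and testing them as an integer, as noted in the comments on the question). This is not GCC's actual expansion. The function names strlen_bytewise and strlen_wordwise and the 4-byte word size are assumptions for illustration, and the word-wise version assumes the buffer is padded so that reading the aligned word containing the terminator stays in bounds:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time scan: one comparison per character. */
size_t strlen_bytewise(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}

/* Word-at-a-time scan: tests 4 characters per iteration.
 * Assumption: the buffer is padded so that the aligned 4-byte word
 * holding the terminator can be read completely. */
size_t strlen_wordwise(const char *s)
{
    const char *p = s;

    /* Advance byte by byte until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* 4-byte load without aliasing problems */
        /* This expression is nonzero exactly when some byte of w is zero. */
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0)
            break;
        p += 4;
    }

    /* The terminator is somewhere in the current word: find it. */
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}
```

Both functions return the same value as strlen on a padded buffer. A vectorized library strlen typically examines 16 or more bytes per step, which is why neither approach is likely to keep up with it on long strings.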

Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:



  • the -O1 code is much slower than the C library implementation: 32240 vs 3090

  • the -O2 code is faster than the -O1 code but still substantially slower than the C library code: 8570 vs 3090.

This behavior is specific to gcc and glibc. The same test on OS/X with clang and Apple's Libc does not show a significant difference, which is not surprising, as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.
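
To time the library strlen at any optimisation level, one option besides compiling with -fno-builtin (as suggested in the comments on the question) is to call it through a volatile function pointer, which should keep the compiler from recognising the call and expanding it inline. A minimal sketch, assuming that behavior holds for your compiler; the names strlen_ptr and libc_strlen are hypothetical, and I have not checked the generated assembly for every gcc version:

```c
#include <stddef.h>
#include <string.h>

/* The pointer is volatile, so the compiler must reload it at each call
 * and cannot assume that the target is strlen. */
static size_t (*volatile strlen_ptr)(const char *) = strlen;

/* Hypothetical wrapper: always performs a real call through the pointer. */
size_t libc_strlen(const char *s)
{
    return strlen_ptr(s);
}
```

Replacing strlen(s) with libc_strlen(s) in the benchmark loop would then measure the library implementation, plus the cost of an indirect call, even at -O1 or -O2.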



This could be considered a bug in gcc/glibc, but more extensive benchmarking might show that, for small strings, the overhead of calling strlen matters more than the slower inline code. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.



I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0 to 100 at -O0 and -O2, but still much worse performance at -O1, about 3 times slower.



Here is the updated code:



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void benchmark(int repeat, int minlen, int maxlen) {
    char *s = malloc(maxlen + 1);
    memset(s, 'A', minlen);
    long long bytes = 0, calls = 0;
    clock_t clk = clock();
    for (int n = 0; n < repeat; n++) {
        for (int i = minlen; i < maxlen; ++i) {
            bytes += i + 1;      /* bytes scanned, including the terminator */
            calls += 1;
            s[i] = '\0';         /* terminate the string at length i */
            s[strlen(s)] = 'A';  /* measure strlen, then extend the string */
        }
    }
    clk = clock() - clk;
    free(s);
    double avglen = (minlen + maxlen - 1) / 2.0;
    double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
    printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/call\n",
           avglen, ns / bytes, ns / calls);
}

int main() {
    benchmark(10000000, 0, 1);
    benchmark(1000000, 0, 10);
    benchmark(1000000, 5, 15);
    benchmark(100000, 0, 100);
    benchmark(100000, 50, 150);
    benchmark(10000, 0, 1000);
    benchmark(10000, 500, 1500);
    benchmark(1000, 0, 10000);
    benchmark(1000, 5000, 15000);
    benchmark(100, 1000000 - 50, 1000000 + 50);
    return 0;
}



Here is the output:




chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
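
To put the long-string figures in perspective, the per-byte timings can be converted into slowdown ratios and throughput. The helpers below are only a sketch of that arithmetic; the names slowdown and gb_per_second are hypothetical, and the inputs are the ns/byte values from the last row of each run above:

```c
/* Ratio of an inline expansion's per-byte time to the library's. */
double slowdown(double inline_ns_per_byte, double libc_ns_per_byte)
{
    return inline_ns_per_byte / libc_ns_per_byte;
}

/* Throughput in GB/s: 1 byte per nanosecond equals 1e9 bytes per second. */
double gb_per_second(double ns_per_byte)
{
    return 1.0 / ns_per_byte;
}
```

With these numbers, slowdown(0.648, 0.062) is about 10.5 for -O1 and slowdown(0.172, 0.062) is about 2.8 for -O2, while gb_per_second gives roughly 16.1 GB/s for the library call, 1.5 GB/s for -O1 and 5.8 GB/s for -O2.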





  • Wouldn't it still be better for the inlined version to use the same optimizations as the library strlen, giving the best of both worlds?

    – Daniel H
    1 hour ago






  • It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in <string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.

    – chqrlie
    1 hour ago











  • Does it change if you use -march=native -mtune=native?

    – Deduplicator
    1 min ago











1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









4














Testing your code on Godbolt's Compiler Explorer provides this explanation:



  • at -O0 or without optimisations, the generated code call the C library function strlen

  • at -O1 the generated code uses a simple inline expansion using a rep scasb instruction.

  • at -O2 and above, the generated code uses a more elaborate inline expansion.

Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:



  • the -O1 code is much slower than the C library implementation: 32240 vs 3090

  • the -O2 code is faster than the -O1 but still substantially slower than the C ibrary code: 8570 vs 3090.

This behavior is specific to gcc and the glibc. The same test on OS/X with clang and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.



This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.



I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0 to 100 at -O0 and -O2 but still a much worse performance at -O1, 3 times slower.



Here is the updated code:



#include <stdlib.h>
#include <string.h>
#include <time.h>

void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';


clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);


int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;



Here is the output:




chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call





share|improve this answer

























  • Wouldn't it still be better for the inlined version to use the same optimizations as the library strlen, giving the best of both worlds?

    – Daniel H
    1 hour ago






  • 1





    It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in <string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.

    – chqrlie
    1 hour ago











  • Does it change if you use -march=native -mtune=native?

    – Deduplicator
    1 min ago















4














Testing your code on Godbolt's Compiler Explorer provides this explanation:



  • at -O0 or without optimisations, the generated code call the C library function strlen

  • at -O1 the generated code uses a simple inline expansion using a rep scasb instruction.

  • at -O2 and above, the generated code uses a more elaborate inline expansion.

Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:



  • the -O1 code is much slower than the C library implementation: 32240 vs 3090

  • the -O2 code is faster than the -O1 but still substantially slower than the C ibrary code: 8570 vs 3090.

This behavior is specific to gcc and the glibc. The same test on OS/X with clang and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.



This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.



I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0 to 100 at -O0 and -O2 but still a much worse performance at -O1, 3 times slower.



Here is the updated code:



#include <stdlib.h>
#include <string.h>
#include <time.h>

void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';


clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);


int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;



Here is the output:




chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call
average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call
average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call
average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call
average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call
average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call
average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call
average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call
average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call
average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call
average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call
average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call
average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call
average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call
average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call
average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call
average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call
average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call
average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call
average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call
average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call
average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call
average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call
average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call
average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call





share|improve this answer

























  • Wouldn't it still be better for the inlined version to use the same optimizations as the library strlen, giving the best of both worlds?

    – Daniel H
    1 hour ago






  • 1





    It would, but the hand optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform specific macros in <string.h> and hard coded optimisations in the gcc code generator. Definitely still room for improvement on intel targets.

    – chqrlie
    1 hour ago











  • Does it change if you use -march=native -mtune=native?

    – Deduplicator
    1 min ago













4












4








4







Testing your code on Godbolt's Compiler Explorer provides this explanation:



  • at -O0 or without optimisations, the generated code call the C library function strlen

  • at -O1 the generated code uses a simple inline expansion using a rep scasb instruction.

  • at -O2 and above, the generated code uses a more elaborate inline expansion.

Benchmarking your code repeatedly shows a substantial variation from one run to another, but increasing the number of iterations shows that:



  • the -O1 code is much slower than the C library implementation: 32240 vs 3090

  • the -O2 code is faster than the -O1 but still substantially slower than the C ibrary code: 8570 vs 3090.

This behavior is specific to gcc and the glibc. The same test on OS/X with clang and Apple's Libc does not show a significant difference, which is not a surprise as Godbolt shows that clang generates a call to the C library strlen at all optimisation levels.



This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen has a more important impact than the lack of performance of the inline code for small strings. The strings on which you benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.



I updated the benchmark for smaller strings and it shows similar performance for string lengths varying from 0 to 100 at -O0 and -O2 but still a much worse performance at -O1, 3 times slower.



Here is the updated code:



#include <stdlib.h>
#include <string.h>
#include <time.h>

void benchmark(int repeat, int minlen, int maxlen)
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++)
for (int i = minlen; i < maxlen; ++i)
bytes += i + 1;
calls += 1;
s[i] = '';
s[strlen(s)] = 'A';


clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/calln",
avglen, ns / bytes, ns / calls);


int main()
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;



Here is the output:




chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out
average length       0 -> avg time:  14.000 ns/byte,  14.000 ns/call
average length       4 -> avg time:   2.364 ns/byte,  13.000 ns/call
average length      10 -> avg time:   1.238 ns/byte,  13.000 ns/call
average length      50 -> avg time:   0.317 ns/byte,  16.000 ns/call
average length     100 -> avg time:   0.169 ns/byte,  17.000 ns/call
average length     500 -> avg time:   0.074 ns/byte,  37.000 ns/call
average length    1000 -> avg time:   0.068 ns/byte,  68.000 ns/call
average length    5000 -> avg time:   0.064 ns/byte, 318.000 ns/call
average length   10000 -> avg time:   0.062 ns/byte, 622.000 ns/call
average length 1000000 -> avg time:   0.062 ns/byte, 62000.000 ns/call
chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out
average length       0 -> avg time:  20.000 ns/byte,  20.000 ns/call
average length       4 -> avg time:   3.818 ns/byte,  21.000 ns/call
average length      10 -> avg time:   2.190 ns/byte,  23.000 ns/call
average length      50 -> avg time:   0.990 ns/byte,  50.000 ns/call
average length     100 -> avg time:   0.816 ns/byte,  82.000 ns/call
average length     500 -> avg time:   0.679 ns/byte, 340.000 ns/call
average length    1000 -> avg time:   0.664 ns/byte, 664.000 ns/call
average length    5000 -> avg time:   0.651 ns/byte, 3254.000 ns/call
average length   10000 -> avg time:   0.649 ns/byte, 6491.000 ns/call
average length 1000000 -> avg time:   0.648 ns/byte, 648000.000 ns/call
chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out
average length       0 -> avg time:  10.000 ns/byte,  10.000 ns/call
average length       4 -> avg time:   2.000 ns/byte,  11.000 ns/call
average length      10 -> avg time:   1.048 ns/byte,  11.000 ns/call
average length      50 -> avg time:   0.337 ns/byte,  17.000 ns/call
average length     100 -> avg time:   0.299 ns/byte,  30.000 ns/call
average length     500 -> avg time:   0.202 ns/byte, 101.000 ns/call
average length    1000 -> avg time:   0.188 ns/byte, 188.000 ns/call
average length    5000 -> avg time:   0.174 ns/byte, 868.000 ns/call
average length   10000 -> avg time:   0.172 ns/byte, 1716.000 ns/call
average length 1000000 -> avg time:   0.172 ns/byte, 172000.000 ns/call





































answered 1 hour ago, edited 11 mins ago, by chqrlie












  • Wouldn't it still be better for the inlined version to use the same optimizations as the library strlen, giving the best of both worlds?
    – Daniel H, 1 hour ago

  • It would, but the hand-optimized version in the C library might be larger and more complicated to inline. I have not looked into this recently, but there used to be a mix of complex platform-specific macros in <string.h> and hard-coded optimisations in the gcc code generator. There is definitely still room for improvement on Intel targets.
    – chqrlie, 1 hour ago

  • Does it change if you use -march=native -mtune=native?
    – Deduplicator, 1 min ago



































