Code Optimization

Interesting things about software development and code optimization

JavaScript and jQuery - creating a lot of elements like a big grid


Hello dear friends,


Recently I built a big grid using jQuery and JavaScript and ran into a performance problem.




My task was to create a big calendar grid with at least 20 rows and 180 columns (one column per day for a 6-month period). While developing this grid I noticed that Microsoft Edge and Firefox hung for a few seconds during the creation cycle (Google Chrome seemed much better). My first method created every cell of every row individually.

So how to optimize this? I did some googling and found a suggestion to use the setTimeout function to avoid hanging the browser, but it didn't help much in my case. Ugh! What should I do with it? Just in case, here is the piece of code I used to create my grid:

.....
var row, cell;

for (var r = 0; r < 20; r++)
{
    row = $("<div class='row'></div>");

    for (var i = 0; i < 180; i++)
    {
        cell = $("<div class='cell'></div>");
        ..........
        row.append(cell);
    }
    ....
    grid.append(row);
}


So, looks pretty simple? I spent a few days thinking about it, trying to understand what I could do.

At some point I thought: "what if I create only one row and then clone it as many times as needed?" And yes, I was completely right! :) It reduced the time from 4-5 seconds to less than 1 second:

.....
var row, cell;

// build the first row cell by cell
row = $("<div class='row'></div>");

for (var i = 0; i < 180; i++)
{
    cell = $("<div class='cell'></div>");
    ..........
    row.append(cell);
}
....
grid.append(row);

// clone the finished row for the remaining 19 rows
for (var c = 1; c < 20; c++) {
    grid.append(row.clone(false).off());
}

So now it's clear that cloning a finished row with grid.append(row.clone(false).off()); is much faster than creating new elements one by one.
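The same clone-the-row idea can be sketched without jQuery, using the standard DOM methods cloneNode and createDocumentFragment. This is only a minimal sketch: the function name buildGrid and its doc parameter are assumptions made for illustration, and the timings above were measured with the jQuery version, not with this code.

```javascript
// Build one template row cell by cell, then deep-clone it for the other rows.
// "doc" is the document object, passed in so the function is easy to test.
function buildGrid(doc, rowCount, colCount) {
  const templateRow = doc.createElement("div");
  templateRow.className = "row";

  for (let i = 0; i < colCount; i++) {
    const cell = doc.createElement("div");
    cell.className = "cell";
    templateRow.appendChild(cell);
  }

  // Collect all rows in a fragment so the page is touched only once.
  const fragment = doc.createDocumentFragment();
  fragment.appendChild(templateRow);
  for (let r = 1; r < rowCount; r++) {
    fragment.appendChild(templateRow.cloneNode(true)); // true = clone the cells too
  }
  return fragment;
}

// Browser usage (assumes an element with id "grid" exists on the page):
// document.getElementById("grid").appendChild(buildGrid(document, 20, 180));
```

Note that cloneNode does not copy event listeners added with addEventListener, so handlers have to be attached after cloning or delegated to the grid element.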



Thank you and see you next time :)





Moving the mouse makes video lag - Windows 10

Hi friends,


have you noticed that on Windows 10 with an Intel graphics chip, moving the mouse (over the video or anywhere else) makes the video lag in a web video player?

I have an Intel processor with integrated graphics and I have noticed this problem. To solve it, you need to do two simple steps in the Intel HD Graphics control panel:

- maximize performance for the plugged-in mode

- maximize performance and disable the power setting for the on-battery mode


Moreover, after I did this, the YouTube video player started to work great in full-screen mode (before this it didn't show video in full-screen mode) and the overall performance of the laptop improved.


Enjoy it :)






Multiplication and Division - could be even faster?

Hi friends,


today we are going to work on something that looks and sounds crazy: we are going to speed up multiplication and division.

First, let's look at the code that we will try to speed up:

double p1 = 0, p2 = 0;
int count = 100000000;
Stopwatch sw = new Stopwatch();
Int32 a = 0, b = 0;

sw.Start();
for (int i = 0; i < count; i++)
{
    a = i;
    a *= 2;
    b += a;
}
sw.Stop();

p1 = sw.ElapsedMilliseconds;
Console.WriteLine(sw.ElapsedMilliseconds + " : " + a);

As you can see, the code just measures the time that multiplication takes; on my PC this loop takes about 160 ms.

Now let's add similar but optimized code, which we expect to be faster:

// requires: using System; using System.Diagnostics;
static void Main(string[] args)
{
    double p1 = 0, p2 = 0;
    int count = 100000000;

    Stopwatch sw = new Stopwatch();
    Int32 a = 0, b = 0;

    sw.Start();
    for (int i = 0; i < count; i++)
    {
        a = i;
        //multiply by 2
        a *= 2;
        b += a;
    }
    sw.Stop();

    p1 = sw.ElapsedMilliseconds;
    Console.WriteLine(sw.ElapsedMilliseconds + " : " + a + " : " + b);

    sw.Reset();
    sw.Start();
    b = 0;
    for (int i = 0; i < count; i++)
    {
        a = i;
        //multiply by 2 with a left shift
        a <<= 1;
        b += a;
    }
    sw.Stop();

    p2 = sw.ElapsedMilliseconds;
    Console.WriteLine(sw.ElapsedMilliseconds + " : " + a + " : " + b);

    Console.WriteLine("percent faster: " + 100.0 / p1 * (p1 - p2));
    Console.WriteLine("times faster: " + (p1 / p2));
    Console.ReadLine();
}

If you compile and run it (make sure you compile it in Release mode and run it outside of Visual Studio), you will see almost no difference in speed between these two ways of multiplying. On my PC it shows something like this:


161 : 199999998 : 1774919424
157 : 199999998 : 1774919424
percent faster: 2.48447204968944
times faster: 1.02547770700637


so we can say that the results are almost the same. Not that interesting yet, but do not be sad, every next step will amaze you ;)
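For readers who want to check the last two output lines by hand, here is a small sketch of the same arithmetic. The function names percentFaster and timesFaster are made up for this example; the inputs are the two timings printed above.

```javascript
// p1 = time of the original loop, p2 = time of the optimized loop (milliseconds)
function percentFaster(p1, p2) {
  return 100 * (p1 - p2) / p1; // share of the original time that was saved
}

function timesFaster(p1, p2) {
  return p1 / p2; // how many optimized runs fit into one original run
}

console.log(percentFaster(161, 157).toFixed(2)); // "2.48"
console.log(timesFaster(161, 157).toFixed(2));   // "1.03"
```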


Another type


So, the next step is to test it with another integer type, Int64 (or long). And what do we see? Now the shift version is about 2.5 times faster; here is my result:


846 : 199999998 : 9999999900000000
342 : 199999998 : 9999999900000000
percent faster: 59.5744680851064
times faster: 2.47368421052632


Ha, interesting? So for Int64 a simple bit shift that multiplies an integer value by 2, 4, 8, 16 ... is much faster than the multiplication itself.
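To make the "multiply by 2, 4, 8, 16 ..." part concrete, here is a tiny sketch in JavaScript, whose << operator works on 32-bit signed integers, much like Int32 in the C# code. The helper name mulPow2 is an assumption for illustration, and the snippet shows only the equivalence, not the speed.

```javascript
// Shifting left by n bits multiplies a 32-bit integer by 2^n.
function mulPow2(x, n) {
  return x << n;
}

console.log(mulPow2(5, 1)); // 10  (5 * 2)
console.log(mulPow2(5, 3)); // 40  (5 * 8)

// The result is still a 32-bit signed value, so large inputs wrap around.
console.log(mulPow2(0x40000000, 1)); // -2147483648
```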

Let's continue with division.


Division


OK, now let's go through the same steps, but this time with division. Here is the code:

// requires: using System; using System.Diagnostics;
static void Main(string[] args)
{
    double p1 = 0, p2 = 0;
    int count = 100000000;

    Stopwatch sw = new Stopwatch();
    Int32 a = 0, b = 0;

    sw.Start();
    for (int i = 0; i < count; i++)
    {
        a = i;
        //div by 2
        a /= 2;
        b += a;
    }
    sw.Stop();

    p1 = sw.ElapsedMilliseconds;
    Console.WriteLine(sw.ElapsedMilliseconds + " : " + a + " : " + b);

    sw.Reset();
    sw.Start();
    b = 0;
    for (int i = 0; i < count; i++)
    {
        a = i;
        //div by 2 with a right shift
        a >>= 1;
        b += a;
    }
    sw.Stop();

    p2 = sw.ElapsedMilliseconds;
    Console.WriteLine(sw.ElapsedMilliseconds + " : " + a + " : " + b);

    Console.WriteLine("percent faster: " + 100.0 / p1 * (p1 - p2));
    Console.WriteLine("times faster: " + (p1 / p2));
    Console.ReadLine();
}

Compile and run it and you will see that the speed is also almost the same:


180 : 49999999 : -1728753792
158 : 49999999 : -1728753792
percent faster: 12.2222222222222
times faster: 1.13924050632911


And as you may expect, changing from Int32 to Int64 makes the difference even bigger; in my case the shift version is more than 8 times faster:


2978 : 49999999 : 2499999950000000
344 : 49999999 : 2499999950000000
percent faster: 88.4486232370719
times faster: 8.65697674418605


So, the main thing here is that multiplication and division on non-native types are complex operations that take more processor time than a simple bit shift.

The Int64 type is not a native type in a 32-bit environment, so operating on it takes more time.

But if you compile this code as x64 and run it under a 64-bit operating system, you will see a completely different result. Here is mine:


229 : 49999999 : 2499999950000000
151 : 49999999 : 2499999950000000
percent faster: 34.061135371179
times faster: 1.51655629139073


This is because under a 64-bit operating system both Int32 and Int64 are native types (meaning a CPU register can hold the whole value).

One more thing is the Any CPU compilation: it seems not as good as you may expect.
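One thing to check before replacing / 2 with >> 1 in real code: the benchmarks above only use non-negative values of i, where both give the same result. For negative odd numbers they differ, because integer division truncates toward zero while an arithmetic right shift rounds toward negative infinity. The sketch below shows this in JavaScript, where Math.trunc(a / 2) stands in for C# integer division; the helper names are made up for the example.

```javascript
// Math.trunc(a / 2) behaves like C# integer division: it truncates toward zero.
function divBy2(a) {
  return Math.trunc(a / 2);
}

// >> is an arithmetic shift on a signed 32-bit value: it rounds toward -infinity.
function shiftBy1(a) {
  return a >> 1;
}

for (const a of [7, 6, -6, -7]) {
  console.log(a, divBy2(a), shiftBy1(a));
}
// 7 3 3
// 6 3 3
// -6 -3 -3
// -7 -3 -4
```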


Thank you, and do not hesitate to comment.





SQL COUNT(*) is slow - how to increase the performance

Hi friends,

today I'm going to show you one possible way to speed up COUNT(*) when you need to know the total number of rows of a select query with paging.

Here is my first, slow query:

Select
    COUNT(*) OVER () as TotalCount
    , ID
    , OwnerID
    , Name
    , [Description]
    , keywords
From [dbo].[tblData]
where pState = N'a' and [status] <> 'P'
    and (@FilterBy = 0 OR @FilterBy = TypeID)
Order By DateAdded Desc
OFFSET @p0 ROWS FETCH NEXT @p1 ROWS ONLY;


This query took from 4 to 9 seconds to select 315000 rows out of about 450000, which is really slow. It seems like OVER () overloads the query itself (but this is just my guess).

Also, converting from nchar to char takes a lot of time, so I changed the string literal (notice the N prefix before the string in the query above; it is gone in the queries below).

After some time thinking and playing around with it, I came to the following solution:

-- 1) count the matching rows once, without ordering
select @rowstotal = count(*)
From [dbo].[tblData]
where pState = 'a' and [status] <> 'P'
    and (@FilterBy = 0 OR @FilterBy = TypeID)

-- 2) select the requested page and return the stored count with it
select
    @rowstotal as TotalCount
    , ID
    , OwnerID
    , Name
    , [Description]
    , keywords
From [dbo].[tblData]
where pState = 'a' and [status] <> 'P'
    and (@FilterBy = 0 OR @FilterBy = TypeID)
Order By DateAdded Desc
OFFSET @p0 ROWS FETCH NEXT @p1 ROWS ONLY;


(Pay attention: to calculate the total number of rows you do not need ordering. I think count(*) with ordering works much slower.)


Now this query takes up to 1 second to select the same 315000 rows out of about 450000, which is at least 4 times faster :)


I'm not sure if this is really the best solution, as I'm not a DBA master :)

but in my case it sped up the query enough.
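On the calling side, @p0 is the number of rows to skip and @p1 is the page size, and TotalCount is what lets you compute the number of pages. A minimal sketch of that arithmetic, assuming 1-based page numbers; the function name pagingParams and its field names are hypothetical and not part of the original queries:

```javascript
// Compute the OFFSET / FETCH values for a page and the total number of pages.
function pagingParams(pageNumber, pageSize, totalCount) {
  return {
    offset: (pageNumber - 1) * pageSize,           // value for @p0
    fetch: pageSize,                               // value for @p1
    totalPages: Math.ceil(totalCount / pageSize),  // derived from TotalCount
  };
}

console.log(pagingParams(3, 50, 315000));
// { offset: 100, fetch: 50, totalPages: 6300 }
```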


Thank you.





C#.NET - Fast Memory Copy method with x86 Assembler

Introduction

I'm Oleksandr Karpov and this is my first article here, thanks for reading it.

Here, I'm going to show and explain how to copy data really fast and how to use assembly under C# and .NET. In my case, I use it in an application that creates video from images, video and sound.
Also, if you have an assembly method or function that you need to use under C#, this article will show you how to do it in a quick and simple way.

Background

To understand it all, it would be great for you to know assembly language, memory alignment and some advanced C#, Windows and .NET techniques.
To be able to copy data really fast, the memory addresses need to be 16-byte aligned; otherwise the speed is almost the same as the standard method (in my case, about 1.02 times faster).

The code uses SSE instructions, which are supported by processors starting from the Pentium III (KNI/MMX2) and AMD Athlon (AMD EMMX).

I have tested it on my Pentium Dual-Core E5800 3.2GHz with 4GB RAM in dual-channel mode.
For me, the fast copy method is 1.5 times faster than the standard one with 16-byte aligned memory and
almost the same (1.02 times faster) with non-aligned memory addresses.

To be able to allocate 16 byte aligned memory in C# under Windows, we have three ways to do it:

a) At the moment it seems that the Bitmap object (actually Windows itself underneath) allocates memory at a 16-byte aligned address, so we can use Bitmap for easy and quick aligned memory allocation;

b) As a managed array, by adding 8 more bytes (as the Windows heap is 8-byte aligned) and calculating the 16-byte aligned point within the allocated memory:

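int dataLength = 4096;

// +8 bytes as windows heap is 8 byte aligned
byte[] buffer = new byte[dataLength + 8];

// pin the array so the GC cannot move it while we use its address
// (call handle.Free() when the buffer is no longer needed)
GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
IntPtr addr = Marshal.UnsafeAddrOfPinnedArrayElement(buffer, 0);

// round addr up to the next 16 byte boundary and get the offset of that point inside the buffer
int bufferAlignedOffset = (int)(((long)addr + 15) / 16 * 16 - (long)addr);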

c) By allocating memory with VirtualAlloc API:

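IntPtr addr = VirtualAlloc(IntPtr.Zero,
    new UIntPtr(dataLength + 8),
    AllocationTypes.Commit | AllocationTypes.Reserve,
    MemoryProtections.ExecuteReadWrite);

// VirtualAlloc already returns an address aligned to the allocation granularity
// (64 KB on Windows), so this rounding is only a safeguard
addr = new IntPtr(((long)addr + 15) / 16 * 16);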
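The rounding expression used in b) and c) can be checked in isolation. The sketch below repeats the same integer arithmetic with JavaScript BigInt values; the function names and the sample addresses are made up for illustration.

```javascript
// Round an address up to the next multiple of 16 (BigInt division truncates).
function alignUp16(addr) {
  return (addr + 15n) / 16n * 16n;
}

// Number of bytes to skip from addr to reach the aligned address.
function alignedOffset(addr) {
  return Number(alignUp16(addr) - addr);
}

console.log(alignedOffset(0x1008n)); // 8  (0x1008 -> 0x1010)
console.log(alignedOffset(0x1010n)); // 0  (already aligned)
```

With an 8-byte aligned start address the offset is always 0 or 8, which is why 8 extra bytes are enough in that case.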

Using the Code

This is a complete performance test that shows the performance measurements and how to use it all.

The FastMemCopy class contains everything needed for the fast memory copy logic.

The first thing you need to do is create a default Windows Forms application project and put two buttons and a PictureBox control on the form, as we will test it on images.

Let's declare some fields:

string bitmapPath;

Bitmap bmp, bmp2;

BitmapData bmpd, bmpd2;

byte[] buffer = null;

Now, we will create two methods to handle click events for our buttons.

For standard method:

private void btnStandard_Click(object sender, EventArgs e)

{

using (OpenFileDialog ofd = new OpenFileDialog())

{

if (ofd.ShowDialog() != System.Windows.Forms.DialogResult.OK)

return;

bitmapPath = ofd.FileName;

}


//open a selected image and create an empty image with the same size

OpenImage();


//unlock for read and write images

UnlockBitmap();

//copy data from one image to another by standard method

CopyImage();

//lock images to be able to see them

LockBitmap();

//lets see what we have

pictureBox1.Image = bmp2;

}

and for fast method:

private void btnFast_Click(object sender, EventArgs e)

{

using (OpenFileDialog ofd = new OpenFileDialog())

{

if (ofd.ShowDialog() != System.Windows.Forms.DialogResult.OK)

return;

bitmapPath = ofd.FileName;

}

//open a selected image and create an empty image with the same size

OpenImage();

//unlock for read and write images

UnlockBitmap();

//copy data from one image to another with our fast method

FastCopyImage();

//lock images to be able to see them

LockBitmap();

//lets see what we have

pictureBox1.Image = bmp2;

}

OK, now we have buttons and event handlers, so let's implement the methods that open the images, lock and unlock them, and the standard copy method:

Open an image:

void OpenImage()

{

pictureBox1.Image = null;

buffer = null;

if (bmp != null)

{

bmp.Dispose();

bmp = null;

}

if (bmp2 != null)

{

bmp2.Dispose();

bmp2 = null;

}

GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);

bmp = (Bitmap)Bitmap.FromFile(bitmapPath);

buffer = new byte[bmp.Width * 4 * bmp.Height];

bmp2 = new Bitmap(bmp.Width, bmp.Height, bmp.Width * 4, PixelFormat.Format32bppArgb,

Marshal.UnsafeAddrOfPinnedArrayElement(buffer, 0));

}

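Lock and unlock bitmaps (note that UnlockBitmap calls LockBits to open the pixel data for direct access, and LockBitmap calls UnlockBits to close it again):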

void UnlockBitmap()

{

bmpd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadWrite,

PixelFormat.Format32bppArgb);

bmpd2 = bmp2.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadWrite,

PixelFormat.Format32bppArgb);

}

void LockBitmap()

{

bmp.UnlockBits(bmpd);

bmp2.UnlockBits(bmpd2);

}

and copy data from one image to another and show measured time:

void CopyImage()

{

//start stopwatch

Stopwatch sw = new Stopwatch();

sw.Start();

//copy-paste data 10 times

for (int i = 0; i < 10; i++)

{

System.Runtime.InteropServices.Marshal.Copy(bmpd.Scan0, buffer, 0, buffer.Length);

}

//stop stopwatch

sw.Stop();

//show measured time

MessageBox.Show(sw.ElapsedTicks.ToString());

}

That's it for the standard copy-paste method. Actually, there is nothing too complex, we use well-known System.Runtime.InteropServices.Marshal.Copy method.

And one more "middle-method" for the fast copy logic:

void FastCopyImage()

{

FastMemCopy.FastMemoryCopy(bmpd.Scan0, bmpd2.Scan0, buffer.Length);

}

Now, let's implement the FastMemCopy class. Here is the declaration of the class and some types we will use inside of it:

internal static class FastMemCopy

{

[Flags]

private enum AllocationTypes : uint

{

Commit = 0x1000, Reserve = 0x2000,

Reset = 0x80000, LargePages = 0x20000000,

Physical = 0x400000, TopDown = 0x100000,

WriteWatch = 0x200000

}

[Flags]

private enum MemoryProtections : uint

{

Execute = 0x10, ExecuteRead = 0x20,

ExecuteReadWrite = 0x40, ExecuteWriteCopy = 0x80,

NoAccess = 0x01, ReadOnly = 0x02,

ReadWrite = 0x04, WriteCopy = 0x08,

GuardModifierflag = 0x100, NoCacheModifierflag = 0x200,

WriteCombineModifierflag = 0x400

}

[Flags]

private enum FreeTypes : uint

{

Decommit = 0x4000, Release = 0x8000

}

[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]

private unsafe delegate void FastMemCopyDelegate();

private static class NativeMethods

{

[DllImport("kernel32.dll", SetLastError = true)]

internal static extern IntPtr VirtualAlloc(

IntPtr lpAddress,

UIntPtr dwSize,

AllocationTypes flAllocationType,

MemoryProtections flProtect);

[DllImport("kernel32")]

[return: MarshalAs(UnmanagedType.Bool)]

internal static extern bool VirtualFree(

IntPtr lpAddress,

uint dwSize,

FreeTypes flFreeType);

}

Now let's declare the method itself:

public static unsafe void FastMemoryCopy(IntPtr src, IntPtr dst, int nBytes)

{

if (IntPtr.Size == 4)

{

//we are in 32 bit mode

//allocate memory for our asm method

IntPtr p = NativeMethods.VirtualAlloc(

IntPtr.Zero,

new UIntPtr((uint)x86_FastMemCopy_New.Length),

AllocationTypes.Commit | AllocationTypes.Reserve,

MemoryProtections.ExecuteReadWrite);

try

{

//copy our method bytes to allocated memory

Marshal.Copy(x86_FastMemCopy_New, 0, p, x86_FastMemCopy_New.Length);

//make a delegate to our method

FastMemCopyDelegate _fastmemcopy =

(FastMemCopyDelegate)Marshal.GetDelegateForFunctionPointer(p,

typeof(FastMemCopyDelegate));

//keep p as the base address (VirtualFree needs it) and use a second pointer

//that starts at the end of our method block

IntPtr q = p + x86_FastMemCopy_New.Length;

//store length param

q -= 8;

Marshal.Copy(BitConverter.GetBytes((long)nBytes), 0, q, 4);

//store destination address param

q -= 8;

Marshal.Copy(BitConverter.GetBytes((long)dst), 0, q, 4);

//store source address param

q -= 8;

Marshal.Copy(BitConverter.GetBytes((long)src), 0, q, 4);

//Start stopwatch

Stopwatch sw = new Stopwatch();

sw.Start();

//copy-paste all data 10 times

for (int i = 0; i < 10; i++)

_fastmemcopy();

//stop stopwatch

sw.Stop();

//get message with measured time

System.Windows.Forms.MessageBox.Show(sw.ElapsedTicks.ToString());

}

catch (Exception ex)

{

//if any exception

System.Windows.Forms.MessageBox.Show(ex.Message);

}

finally

{

//free allocated memory

//with FreeTypes.Release (MEM_RELEASE) the size argument must be 0

NativeMethods.VirtualFree(p, 0, FreeTypes.Release);

GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);

}

}

else if (IntPtr.Size == 8)

{

throw new ApplicationException("x64 is not supported yet!");

}

}
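The parameter-passing trick in the method above is that the three values are written as 32-bit little-endian numbers into the zero bytes at the end of the code block, 8 bytes apart, where the assembly code reads them. A minimal sketch of that patching step, using a JavaScript Uint8Array in place of the allocated memory; the function name patchParams and the sample values are assumptions for illustration only.

```javascript
// Write src, dst and nBytes into the last 24 bytes of the code buffer,
// each as a 32-bit little-endian value in its own 8-byte slot.
function patchParams(code, src, dst, nBytes) {
  const view = new DataView(code.buffer, code.byteOffset, code.byteLength);
  const end = code.length;
  view.setUint32(end - 24, src, true);    // source address slot
  view.setUint32(end - 16, dst, true);    // destination address slot
  view.setUint32(end - 8, nBytes, true);  // length slot
  return code;
}

const code = new Uint8Array(32); // stand-in for the method bytes
patchParams(code, 0x11223344, 0x55667788, 4096);
console.log(Array.from(code.slice(8, 12)).map((b) => b.toString(16)));
// [ '44', '33', '22', '11' ]
```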

and the assembly code, represented as an array of bytes with an explanation of each instruction:

private static byte[] x86_FastMemCopy_New = new byte[]

{

0x90, //nop do nothing

0x60, //pushad save all general-purpose registers on the stack

0x95, //xchg ebp, eax eax contains memory address of our method

0x8B, 0xB5, 0x5A, 0x01, 0x00, 0x00, //mov esi,[ebp][00000015A] get source buffer address

0x89, 0xF0, //mov eax,esi

0x83, 0xE0, 0x0F, //and eax,00F will check if it is 16 byte aligned

0x8B, 0xBD, 0x62, 0x01, 0x00, 0x00, //mov edi,[ebp][000000162] get destination address

0x89, 0xFB, //mov ebx,edi

0x83, 0xE3, 0x0F, //and ebx,00F will check if it is 16 byte aligned

0x8B, 0x8D, 0x6A, 0x01, 0x00, 0x00, //mov ecx,[ebp][00000016A] get number of bytes to copy

0xC1, 0xE9, 0x07, //shr ecx,7 divide length by 128

0x85, 0xC9, //test ecx,ecx check if zero

0x0F, 0x84, 0x1C, 0x01, 0x00, 0x00, //jz 000000146 -> copy the rest

0x0F, 0x18, 0x06, //prefetchnta [esi] pre-fetch non-temporal source data for reading

0x85, 0xC0, //test eax,eax check if source address is 16 byte aligned

0x0F, 0x84, 0x8B, 0x00, 0x00, 0x00, //jz 0000000C0 -> go to copy if aligned

0x0F, 0x18, 0x86, 0x80, 0x02, 0x00, 0x00, //prefetchnta [esi][000000280] pre-fetch more source data

0x0F, 0x10, 0x06, //movups xmm0,[esi] copy 16 bytes of source data

0x0F, 0x10, 0x4E, 0x10, //movups xmm1,[esi][010] copy more 16 bytes

0x0F, 0x10, 0x56, 0x20, //movups xmm2,[esi][020] copy more

0x0F, 0x18, 0x86, 0xC0, 0x02, 0x00, 0x00, //prefetchnta [esi][0000002C0] pre-fetch more

0x0F, 0x10, 0x5E, 0x30, //movups xmm3,[esi][030]

0x0F, 0x10, 0x66, 0x40, //movups xmm4,[esi][040]

0x0F, 0x10, 0x6E, 0x50, //movups xmm5,[esi][050]

0x0F, 0x10, 0x76, 0x60, //movups xmm6,[esi][060]

0x0F, 0x10, 0x7E, 0x70, //movups xmm7,[esi][070] we've copied 128 bytes of source data

0x85, 0xDB, //test ebx,ebx check if destination address is 16 byte aligned

0x74, 0x21, //jz 000000087 -> go to paste if aligned

0x0F, 0x11, 0x07, //movups [edi],xmm0 paste first 16 bytes to non-aligned destination address

0x0F, 0x11, 0x4F, 0x10, //movups [edi][010],xmm1 paste more

0x0F, 0x11, 0x57, 0x20, //movups [edi][020],xmm2

0x0F, 0x11, 0x5F, 0x30, //movups [edi][030],xmm3

0x0F, 0x11, 0x67, 0x40, //movups [edi][040],xmm4

0x0F, 0x11, 0x6F, 0x50, //movups [edi][050],xmm5

0x0F, 0x11, 0x77, 0x60, //movups [edi][060],xmm6

0x0F, 0x11, 0x7F, 0x70, //movups [edi][070],xmm7 we've pasted 128 bytes of source data

0xEB, 0x1F, //jmps 0000000A6 -> continue

0x0F, 0x2B, 0x07, //movntps [edi],xmm0 paste first 16 bytes to aligned destination address

0x0F, 0x2B, 0x4F, 0x10, //movntps [edi][010],xmm1 paste more

0x0F, 0x2B, 0x57, 0x20, //movntps [edi][020],xmm2

0x0F, 0x2B, 0x5F, 0x30, //movntps [edi][030],xmm3

0x0F, 0x2B, 0x67, 0x40, //movntps [edi][040],xmm4

0x0F, 0x2B, 0x6F, 0x50, //movntps [edi][050],xmm5

0x0F, 0x2B, 0x77, 0x60, //movntps [edi][060],xmm6

0x0F, 0x2B, 0x7F, 0x70, //movntps [edi][070],xmm7 we've pasted 128 bytes of source data

0x81, 0xC6, 0x80, 0x00, 0x00, 0x00, //add esi,000000080 increment source address by 128

0x81, 0xC7, 0x80, 0x00, 0x00, 0x00, //add edi,000000080 increment destination address by 128

0x83, 0xE9, 0x01, //sub ecx,1 decrement counter

0x0F, 0x85, 0x7A, 0xFF, 0xFF, 0xFF, //jnz 000000035 -> continue if not zero

0xE9, 0x86, 0x00, 0x00, 0x00, //jmp 000000146 -> go to copy the rest of data

0x0F, 0x18, 0x86, 0x80, 0x02, 0x00, 0x00, //prefetchnta [esi][000000280] pre-fetch source data

0x0F, 0x28, 0x06, //movaps xmm0,[esi] copy 128 bytes from aligned source address

0x0F, 0x28, 0x4E, 0x10, //movaps xmm1,[esi][010] copy more

0x0F, 0x28, 0x56, 0x20, //movaps xmm2,[esi][020]

0x0F, 0x18, 0x86, 0xC0, 0x02, 0x00, 0x00, //prefetchnta [esi][0000002C0] pre-fetch more data

0x0F, 0x28, 0x5E, 0x30, //movaps xmm3,[esi][030]

0x0F, 0x28, 0x66, 0x40, //movaps xmm4,[esi][040]

0x0F, 0x28, 0x6E, 0x50, //movaps xmm5,[esi][050]

0x0F, 0x28, 0x76, 0x60, //movaps xmm6,[esi][060]

0x0F, 0x28, 0x7E, 0x70, //movaps xmm7,[esi][070] we've copied 128 bytes of source data

0x85, 0xDB, //test ebx,ebx check if destination address is 16 byte aligned

0x74, 0x21, //jz 000000112 -> go to paste if aligned

0x0F, 0x11, 0x07, //movups [edi],xmm0 paste 16 bytes to non-aligned destination address

0x0F, 0x11, 0x4F, 0x10, //movups [edi][010],xmm1 paste more

0x0F, 0x11, 0x57, 0x20, //movups [edi][020],xmm2

0x0F, 0x11, 0x5F, 0x30, //movups [edi][030],xmm3

0x0F, 0x11, 0x67, 0x40, //movups [edi][040],xmm4

0x0F, 0x11, 0x6F, 0x50, //movups [edi][050],xmm5

0x0F, 0x11, 0x77, 0x60, //movups [edi][060],xmm6

0x0F, 0x11, 0x7F, 0x70, //movups [edi][070],xmm7 we've pasted 128 bytes of data

0xEB, 0x1F, //jmps 000000131 -> continue copy-paste

0x0F, 0x2B, 0x07, //movntps [edi],xmm0 paste 16 bytes to aligned destination address

0x0F, 0x2B, 0x4F, 0x10, //movntps [edi][010],xmm1 paste more

0x0F, 0x2B, 0x57, 0x20, //movntps [edi][020],xmm2

0x0F, 0x2B, 0x5F, 0x30, //movntps [edi][030],xmm3

0x0F, 0x2B, 0x67, 0x40, //movntps [edi][040],xmm4

0x0F, 0x2B, 0x6F, 0x50, //movntps [edi][050],xmm5

0x0F, 0x2B, 0x77, 0x60, //movntps [edi][060],xmm6

0x0F, 0x2B, 0x7F, 0x70, //movntps [edi][070],xmm7 we've pasted 128 bytes of data

0x81, 0xC6, 0x80, 0x00, 0x00, 0x00, //add esi,000000080 increment source address by 128

0x81, 0xC7, 0x80, 0x00, 0x00, 0x00, //add edi,000000080 increment destination address by 128

0x83, 0xE9, 0x01, //sub ecx,1 decrement counter

0x0F, 0x85, 0x7A, 0xFF, 0xFF, 0xFF, //jnz 0000000C0 -> continue copy-paste if non-zero

0x8B, 0x8D, 0x6A, 0x01, 0x00, 0x00, //mov ecx,[ebp][00000016A] get number of bytes to copy

0x83, 0xE1, 0x7F, //and ecx,07F get rest number of bytes

0x85, 0xC9, //test ecx,ecx check if there are bytes

0x74, 0x02, //jz 000000155 -> exit if there are no more bytes

0xF3, 0xA4, //rep movsb copy rest of bytes

0x0F, 0xAE, 0xF8, //sfence performs a serializing operation on all store-to-memory instructions

0x61, //popad restore all general-purpose registers

0xC3, //retn return from our method to C#

0x00, 0x00, 0x00, 0x00, //source buffer address

0x00, 0x00, 0x00, 0x00,

0x00, 0x00, 0x00, 0x00, //destination buffer address

0x00, 0x00, 0x00, 0x00,

0x00, 0x00, 0x00, 0x00, //number of bytes to copy-paste

0x00, 0x00, 0x00, 0x00

};

We call this assembly method via the delegate we created earlier.
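The overall control flow of the byte array above (copy full 128-byte blocks first, then the remaining bytes one by one) can be sketched in a few lines. This JavaScript version only mirrors the loop structure; it does not use SSE, prefetching or alignment checks, and the function name blockCopy is made up for the example.

```javascript
// Copy nBytes from src to dst in 128-byte blocks, then copy the tail.
function blockCopy(src, dst, nBytes) {
  const blocks = nBytes >> 7; // like "shr ecx,7": number of full 128-byte blocks
  let pos = 0;
  for (let i = 0; i < blocks; i++) {
    dst.set(src.subarray(pos, pos + 128), pos); // one 128-byte block
    pos += 128;
  }

  const rest = nBytes & 0x7f; // like "and ecx,07F": bytes left after the blocks
  for (let i = 0; i < rest; i++) {
    dst[pos + i] = src[pos + i]; // like "rep movsb"
  }
  return { blocks, rest };
}

const src = Uint8Array.from({ length: 300 }, (_, i) => i & 0xff);
const dst = new Uint8Array(300);
console.log(blockCopy(src, dst, 300)); // { blocks: 2, rest: 44 }
```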

This method works in 32-bit mode for now; I will implement the 64-bit mode later.
I will add the source code if anyone is interested in it (almost all of the code is in the article).

Pay attention: the assembly code throws an exception if it is run under Visual Studio, and I still don't understand why.

Points of Interest

While implementing and testing this method, I found that the prefetchnta instruction is not described very clearly even in the Intel specification, so I tried to figure it out myself and via Google.
Also, pay attention to the movntps and movaps instructions, as they work only with 16-byte aligned memory addresses.

History

  • Bitmap and 16 byte memory alignment
  • Source code and memory alignment samples were added
  • First version - 06/23/2015
FastMemoryCopy_src.zip (14.4KB)
