Saturday, February 18, 2017

THE FASTEST EVER 3D / 2D FLOOD FILL SCANLINE CODE, IS ALSO THE EASIEST!!! 

Do you want to achieve a 9 million pixels/second floodfill on a single core of i7 7700? using less memory than a generic list? Read on!

Here is a video with a simple graphic illustration:
https://www.youtube.com/watch?v=Xddws-50Igs&feature=youtu.be


If you don't believe me, here is a video filling 4.7 million pixels per second on a 2011 AMD FX: 

I started this quest by researching tutorials, codes, and implementations of floodfill, to understand the methods, logics, tricks, tips. Scanlines, recursive, generic lists...

Most algorithms are very long!!! they use too many instructions, excessively complex neighbor-search logic, scans, scanlines, and processes, and they use stack-disorganized memory or generic lists.

I'll explain the 9 million pixel per second method for a 2D image flood fill. Read attentively:

TRANSFORM THE EQUATION: 

Recursively Found Flood Spaces = 4^(Xpixel*Ypixel) (billions of duplicate finds)

to

Recursively Found Flood Spaces = (Xpixel*Ypixel) 

(deduplicate the recursively found pixels, by overwriting their previous found status, in an XY array address, that you use instead of system memory, to schedule the next recursion. The XY array can execute a maximum ONE recursion per pixel.. the 4^XY instructions become XY instructions)

My code is the same as a recursion, and for 100x100 canvas, it uses 100,00*28 instructions.
A recursion code uses 4^(100,000)*28 instructions, so I am saving many orders of magnitude of instructions being computed on the processor, compared to the original recursion.


  1. Start by writing 12 lines, a four neighbor recursive floodfill code that will cause a stack overflow.
  2. Encapsulate the 12 line recursion in a loop and optimize the resulting program's data flow with some 2D arrays. 
  3. Decide to keep the recursion simplicity and it's tidy compact instruction set, and record it's read and write progress in some orderly 2D arrays, and encapsulate the recursion in a 2D loop, which will work the same as a scan-square rather than a scan-line, forcing the recursions to write lines and squares. (2D arrays are read with 2-3 processor instruction. i.e. "if (FoundNeighborPixel2D[ x , y ] = empty) RecurseOn ( x , y ) ;   
  4. Add 12 lines of code into the recursion, that assign recursion execute commands, to schedule themselves, by write/read their schedule status in an orderly Xpixel*Ypixel 2D array. Encapsulate the recursion in forX; forY; {{ loop, so that the recursion proceeds like a combine harvester in a field, it checks the left square, fills it, checks the left square, fills it, for every loop iteration, and moves up and does the row above, like a very simple scan line.


The PC most naturally progresses through 2D arrays and squares, so that is how we keep the RAM addresses for all stages for the square of the canvas.

Thus you economize a lot of code and a lot of processor executions: encapsulate the 12 line recursion in a loop and optimize the resulting program's data flow with some 2D arrays. 

Using 2x 2D arrays, is like a nuclear submarine where two admiral captains have a key to launching an atomic ballistic missile. Both captains have to supply their key to schedule a missile/recursion. A pixels must be marked as a found neighbor to be processed at all, in the XY loop, and it must be marked checked but not filled, in order to run the recursive function which fills and checks neighbors and marks them in the neighbors array.

So, the 2D arrays are overwriting the found neighbors from the recursion tree when they are found, overwriting into the same 2D array addres for every duplicate found-neighbor. A subsequent recursion for that found neighbor always happens exactly once, marched forwards by the loop iteration... the loop uses 1-2 instructions to check if the current space is uncharted, and 12 lines if is a found neighbor, and then the loop repeats N times until found neighbors = zero, the flood is complete.
( 12 lines means that found neighbours are 1/FILLED AND 2/THEIR NEIGHBORS SEARCHED).

You only have to write duplicates from the tree into two memory arrays: "FoundNeighborPixel" and "FilledNeighborPixel".

Normally, you should have been able to understand the program by now. You can go code your 9 million pixels per second program now! Perhaps more information will confuse the topic? still I have written a different approach:


ORGANIZED RECURSION MEMORY MAP OF A GRID SPACE:


Your recursive loop, proceeding, explodes image data into a tree, 1, 4, 16, 64... branches of neighbor pixels. If we can pack the tree branches back into the image pixel positions, from which they came, i.e. the pixel source of the data, the XY image graph, then we would have a tree with X*Y branches / memory useage. We do that in 12 lines, implementing some 2D arrays to organize the recursion memory.

You place the recursive function into a nested x - y loop, to contain it and give it structure.

for x = 1... the first iteration:

The recursive function finds 4 empty neighbors, it records them for later.

You write those 4 empty neighbors to a 2D array called "FoundNeighborPixel", Do not write them to system memory, else you will explode the neighbors into a tree.

You also prepare a 2D array called "FilledNeighborPixel", Because once you have detected an empty neighbor, you will fill it, and check it's neighbors.

for x = 2... the second iteration:

the recursive function checks pixel number 2, it reads it's status in the "FoundNeighborPixel" Array. 1 instruction returns the value "neighbor marked for checking, check if it is filled and check if it's neighbors are filled/empty/borders" ( If it was not marked for checking, it would have been a border pixel, and we would have break; for that pixel).

If the pixel marked for checking, we will maybe have to fill it. We have an empty 2D array called "FilledNeighborPixel" We check if the pixel is filled in the "FilledNeighborPixel" (which starts off all blank). If no filled, we floodfill that pixel in the "filled" array (and we color that pixel in the original image).
If it IS filled in the fill array, then it has already been fully processed, and it's neighbors will have already have been checked so we break, and save 20 checkneighbor instructions.

The "filled" array will eventually represent the entirety of your floodfill as an XY graph. It is empty at the beginning, so you fill it iteration by iteration, on spaces marked true in  "FoundNeighborPixel" where  "FilledNeighborPixel" is false, and you mark it's 4 neighbors in the "check neighbors" array, otherwise you abort the next 20 instructions of the function and do the next pixel in the loop.

...
...

The loop proceeds like dominoes, the X and Y row gets filled one iteration after then next.

If you click a blank canvas, you can fill a million pixels |(1024x1024) using 20million instructions, that's a bout 0.1 seconds for a Gigaflop processor core like an i7.

When you figure out the above, you will have 9million pixels per second on an i7 processor.

DIFFICULTY OF CODING:
The recursion function code is a 10 line function.
You place it in an XY nested loop, that's 4 lines.
You add 2 break conditions, that's 2 lines.
The 2d-array writing lines slipstreamed into the recursion are another 10 lines of code.
Overall, the entire floodfill code is about 40 lines.

 (perhaps it is not a recursion any more, I am no computing student)

COMPUTATIONAL EXPENSE: 
Individual pixels of the XY loop will use 20 instructions for every floodfill operation, or 2 instructions for every "pixel already found/filled" result.

For a 1024x1024 images, you are using 20 million instructions, about .1 seconds on an i7.

The search loop runs again and again, until it finds zero new Fill operations, then your floodfill space is complete. So it will consume 15 million instructions on the first pass and 2 million on the second pass for example.

To turbo charge the process, you can read the pixels once forwards, once backwards, once forwards, so that the "scanlines" produced by this simple code scan forwards and up for X+ Y+ space on one loop, and fill backwards and down for every second run of the XY nested loop.

Think of your tree branches as your sword, and the image grid as your six pack. may the force be with you.

MEMORY USE:
I processed datasets of 2 billion voxels, using a desktop PC.

2D is the same logic as 3D, it just uses 2D arrays instead of 3D, 2D images instead of minecraft cubes.

for UXGA images, you will need about 10-20 MB of ram.

... THE REST OF THIS PAGE IS WHERE I TRY TO EXPLAIN THE CONCEPT LIKE A CONFUSED TEACHER: THERE IS JS CODE AT THE END. THE REST OF THIS PAGE IS SUPERFLUOUS, IT'S LIKE A CRASH COURSE BY A HIPPY STONED TEACHER:

Flood requires only a search and fill of neighbor cell, if empty, and cells share common neighbors. Instruction duplicates can occur in an exponential tree fashion. That's why people resort to line-scans, rather than using one memory address for every cell, which can deduplicate and overwrite duplicates, and thus lets you use recursion or any search routine you want!!!

For every next search instruction, using small arrays to store the 2-3 steps of your recursion logic/ IF/OR/ comparison your number of instructions by millions of times, resulting in a simple and fast recursive process. The recursion is only 7 lines of code, so why not just write arrays to manage it's comparison results in a duplicate of the original image? Max 3 copies of the image!

My PC was previously crawling on canvases of billion voxels using optimized generic lists and scan lines. 

I found a workaround which employs 2 system arrays vars that both have 6 lines of logic code to read and write, the arrays are addressed using spacial coordinates, i.e. your DIYstack array address is pixel(X, Y) if you use 2D, and it fast keeps track of your recursive progress, for what has been checked, and what to check next.

So you simply recurse through neighbor cells and you memorize your previous checks in an array with as many pixels as there are cells in your canvas/voxel.
The wonder of that, is that you recurse a neighbor pixel search-tree which has trillions of potential duplicate checks/revisited pixels, and any double checks to be performed are written to the same memory coordiante for that pixel X.Y, and then next run of the loop checks those memory coordinates ONCE! and logs it's neighbors for checking again, without duplicates. It finishes as soon as you press the start button.

The recursive floodfill code is 12 lines and another 12 lines of memory array code, to avoid using stacks / generic lists / scanlines.

It's called the SCANBOX ALGORYTHM, because it can iteratively even fill cubes in 3D rather than lines. 

  • 2 million voxels filled per second. AMD single core 3.2Ghz.
  • 2 MB memory/stack used for 1000 x 1000 pixels
  • 2 GB memory for a 1 BILLION voxel space (other people's generic list stack version i tested jammed the PC after 12GB in ram due to excessive branching, Their default run-time stack codes jammed after 10mn voxels)

copyright information is at the end of page, it's a shareware for use in commercial programs.

Instead of the linear memory stack which addresses randomly and automatically, used by default in the computer to record 2D progress. You would use a 2D array to store 2D pixel maps. It's completley logical, but no one did it before!!! it's a recursion IFS type floodfill with a near zero memory/progress impact. weird right?

The virtual RAM block is a boolean array of the same size as the voxels/pixels. 10*10*10 3D voxel canvas is accessed/addressed using z*10*10 + y*10 +x ... 10*10 pixel 2d address is 10*y+x.  It has a spacial memory adressing system, representing your space, any duplicate checks in the check queue are written to the same memory bit.
No duplicate checks can happen in your process, it's perfect organized floodfill memory.

Why is it so optimized compared to other algorythms? The search logic checks neighbor pixels without any pattern or method or scanline, it's rapid! but it can't double check a pixel  for neighbors, and waste time, because pixels marked to check only have one possible array address marked as true/false, in a spatial array.
  • The "neighbor pixels to be checked" overwrite each other if they are detected from different neighbor pixels in your hand built RAM space/stack if on the same space, no recursions are checked twice... it's computationally super efficient for spacial trees. It flood fills by marching in straight lines in every direction.
Using the address: z*10*10 + y*10 +x for every time you read a memory position, i.e. 2% of total computing load of a floodfill iteration.

In 2D you can fill all empty pixels above and to the left of the start pixel in a first iteration through every pixel, and you can fill all the pixels underneath and to the right of every previously filled pixel on the second iteration. It means that if you you are filling 1000x1000 pixels with a picture of an rabbit in the middle, the floodfill will complete in 1.5 or 2 seconds, using only as much memory as 1000x1000 pixels with zero memory redundancy. (max 1mn ram true-false spaces)

In 3D it fills all pixels forward up and left of start point on the first pass, and fills every pixel reverse down and right on the reverse pass, and it can fill space of 1000x1000x1000 pixels with a rabbit in the middle in 2 minutes using 1bn ram array.








Here is code summary of the process:


There are two arrays of true false values. 

One of them is a voxel presence/absence array. A sphere in there will be represented by a sphere of true values in the center.  The array could hold voxels and it could be an image in 2D, marking colors and other qualitative information.

Another equal size array is our magic floodfill memory array, which marks voxels to be checked as true. it's more structured, sequenced and low ram than a stack, using 2d/3d array logic (otpionally a 2d/3d type array), 

You write the corner pixel for example 1.1.1 as your first flood fill point. The recursive function marks that pixel as true in the presence/absence ram block... and it ALSO writes 6 neighbors of pixel 1.1.1 in the neighbor checking array... The pixel in front of it, to the left and up of it, which it finds and fills at end of line and row. 

The condition to do a fill a void voxel and check neighbors via a recursion function is if the check on that position is true and the pixel presence is false, it writes the empty as true and checks the six neighbors.

This creates a scanline which sweeps through the entire space, checking forwards as it goes, and floodfills it instantly, as fast as is virtually possible, using almost no ram, with maximum efficiency, and minimal coding. 

(please paypal me at
thank you!!! ;})

here is the code, it's 40 lines of code, 10 lines of brackets, and most of it is comments. On 64bit single core AMD at 3.4GHz it finds 5 million voxels per second at while cube filling and completes a complex 1 billion voxel fill in well under 10 minutes without optimizing as suggested at end.

THIS IS UNITY3D JS CODE, YOU CAN READ IT IN A JS COLORED CODE EDITING PROGRAM:

mesher.supernormous is the voxel positional array.
var sn2 : boolean[]; is a blank copy of the above

function chk(  ) //floodfill pseudo-recursion algo  
{
var last = 0;//count filled voxels since end of a previous loop used to break
for( var i = 0 ; i <99999 ; i++ )
{ //a very complex fill typically takes 20-25 loops to complete
//if(i%2==0)//recommended!!! use flip-flop condition to alternate between forwards tr++ scan and backwards tr-- of memory block, to fill 3 axes both ways.

for( var tr = 0 ; tr < sn2.Length ; tr++ ) //scan forwards through check ram
{
if(tr%10000000==0)
{print(bcnt+ "   --2runin----     " +
(Time.realtimeSinceStartup-tt)+" -=-=-== " + i+"  "+tr);yield WaitForFixedUpdate(); }//make it yield every 1-10mn checks, which will limit the runtime stack at 1 million function calls for example. the entire loop will end up making 500mn function calls total.
if ( sn2[tr]==true &&  mesher.supernormous[tr]==false ) 
{//the check ram block is called sn2
var z = tr/bsizexy ;//convert linear 3D array to 3d xyz positons
var y = (tr%bsizexy)/bsizex ;
var x = (tr - z*bsizexy - y*bsizex) ; 
boundary2( x,y,z  );
}
}
            
if (last == bcnt){print("The program has run"); break;}
last=bcnt;//if written voxel count doesnt increase in last loop, all points found... if did the tr loop forwards and backwards in my version.

               }
// one note about the above function... It does 500mn voxels in 2-3 minutes, but it has maximum efficiency for complex spaces if you repeat the loop in all 3 axes, in both directions. every time you linearly loop through the check memory, it does scanlines on only 3 axes directions, for example x forward makes scanlines in X, Y, Z, and -x backwards scan throuch checkmemory does -X -Y -Z, so, you can duplicates the tr++ loop 6 times on 3 axes both ways, i believe that you could floodfill a menger sponge from the center outwards in as many sweeps as it has recursions. 

function boundary2(  x : int , y : int , z : int ) //recursive floodfill algo  
{
if ( x < 0 || x >= bsizex || y < 0 || y >= bsizey || z < 0 || z >= bsizez )return;
//{
// print( " x "+ x +" y "+y +" z "+ z +" "++""++""++ )};
if (wst( x , y , z ) == false)
{
wsf(x,y, z,  true) ;

csf(x+1,y,z, true);
csf(x-1,y,z, true);
csf(x,y+1,z, true);
csf(x,y-1,z, true);
csf(x,y,z+1, true);
csf(x,y,z-1, true);
}                         
}

csf = write check memory array as XYZ values
wsf= write voxel presence array as XYZ values
wst= read voxel presence array as XYZ values


function csf( x:int , y:int , z:int,val : boolean) //used for writing
{
var tally : int = z * bsizexy + y * bsizex + x;
if ( x < 0 || x >= bsizex || y < 0 || y >= bsizey || z < 0 || z >= bsizez )return;
else
sn2[ tally ] = val;
}

function wst( x:int , y:int , z:int ) : boolean//true if null
{
var tally : int = z * bsizexy + y * bsizex + x;
if( tally >=0 && tally < cubic )
return mesher.supernormous[ tally ];  //the voxel presence array is called supernormous
else return true;
}

function wsf( x:int , y:int , z:int,val : boolean) //used for writing
{
var tally : int = z * bsizexy + y * bsizex + x;
if( tally >=0 && tally < cubic ){  
mesher.supernormous[ tally ] = val; bcnt+=1;}// count written flood voxels

}

You can use the ultimate scanline trick synergetically with other algorythms if you want. It is the best at managing huge spaces in small RAM requirements and filling them very fast. Afterwards it is slower for the last 1% of spaces which could be hidden in cones in a complex surface. so the most optimal is probably to fill the first 99 percent of pixels using the trick, and then copy the last pixels/voxels to be checked to Vector3 position array in a stack function which can do the last 1 million in less time, this would be faster by 10-20 percent than using only boolean spacial memory process. it works fastest with 6 neighbors, you can use N neighbors, 8, 12, so forth.

Please comment if you find any further optimizations for the boolean memory checking scanbox algorythm and for any feedback. Please feel free to perform some measurements to find if scanbox is faster than all other scanline tricks? 1billion voxels in a complex boundary in 10 minutes is a considerable challenge.

COPYRIGHT INFORMATION: 

This is difficult and proprietary research. It is SHAREWARE;), Definitely give me a payment if you have more than 3 employees.
(All rights are retained by the author of the 2d/3d/Nd data flood fill computation technique/algorythm using a check marker array for spacial flood/floodfill processing with spacial check information stored as array indices).  If i have saved you 4-7 days of research and work and superboosted your program, it would please be fair that you send me a paypal payment of 10-50 usd to my given addy:
 
EDIT- It's a good idea to start the entire project using bitwise encode and decode of bit arrays rahter than boolean, it's fast and it can hold 4 times more data for the same amount of memory than boolean and standard fp arrays. i hit the limit of 2.4bn bools per array in 32bit.

META-Arg-searchwords: void c# .net github stackoverflow python picture image 2D 3D volume PNG medical isosurface scan minecraft image stack fiji 

4 comments:

  1. I have read through this blog 3 times but I am struggling to understand what you're saying due to your (I assume) non-native English writing.

    Please post a complete copy of the source unedited. The fragments in the blog appear to have missing variable declarations and as such are not working.

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. Hiya, thanks, I've addressed your query through a video to explain the code graphically. https://www.youtube.com/watch?v=Xddws-50Igs&feature=youtu.be

    My code has different canvas write code, it's 3D, it has voxel despeckling and slicing all on the same page, it has print lines, graphics update lines to maintain FPS while filling 2gb of voxels, I don't have time to rewrite it this morning, sorry!

    ReplyDelete