Web Workers

Parallel processing of neural networks in the browser

Multi-processing with web workers

Amazingly, this type of parallel processing works right ‘out of the box’ and is very effective at using all cores of a multi-core machine with minimal effort. In combination with typed arrays, which reduce the memory footprint and accelerate calculations, impressive speed-ups can be achieved, as illustrated by the following implementation of a Hopfield network. For simplicity we will run it in synchronous mode, which is strictly speaking not allowed, as it invalidates the original proof of convergence by Hopfield (1982). The code of the main process can be found in the file hopfield4.html, which contains a little HTML, not discussed here, and some JavaScript, which we will discuss next.

Web workers

A web worker is a piece of JavaScript that runs in its own background thread, separate from the main thread that handles the page. The main process and a worker communicate only by posting messages to each other with postMessage(); the data in a message is copied (by so-called structured cloning) rather than shared, and a worker has no access to the DOM. Because each worker runs in its own thread, the browser can schedule different workers on different processor cores, which is exactly what we need for parallel neural network simulations.
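As a minimal sketch of this mechanism (with echo.js as a hypothetical example file, not part of the Hopfield code), the main page and a worker could talk to each other as follows:

// In the main page: create a worker and exchange a message
var echoWorker = new Worker('echo.js');   // echo.js is a hypothetical example file

echoWorker.addEventListener('message', function(e)
{
    console.log('Worker replied: ' + e.data);
}, false);

echoWorker.postMessage('hello');          // the message data is copied, not shared

// In echo.js: echo every received message back to the main page
self.addEventListener('message', function(e)
{
    self.postMessage(e.data);
}, false);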

Code

The data structures used are as follows:

    var n = 10000,
        act = new Float32Array(n),
        net = new Float32Array(n),
        weights = [],
        workers = [];

new Float32Array(n) creates an array of n floats (4-byte precision real numbers). Typed arrays are an ‘HTML5 extension’ to JavaScript; unlike the normal Array, the strongly typed Float32Array only accepts float values and only allocates memory for n of these float values. Anything you put into the act array will be cast to a float (or become NaN if it cannot be cast). There is a whole range of these typed arrays: for signed and unsigned integers of various sizes, double-precision floats, etc. Here we will use only floats, because their four-byte precision suffices for most neural network simulations, and they offer a good speed compromise across browsers.
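As a small illustration (not part of the simulation code), assignment to a Float32Array casts whatever it is given to a 32-bit float:

var t = new Float32Array(4);   // four 32-bit floats, all initialized to 0
t[0] = 1;                      // stored as 1
t[1] = '2.5';                  // the string is cast to 2.5
t[2] = 0.1;                    // stored as the nearest 32-bit float, ~0.10000000149
t[3] = 'abc';                  // cannot be cast to a number: becomes NaN
console.log(t.length);         // 4; the length is fixed at creation time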

The size of the network is determined by n, the number of neurons. Since a Hopfield network is fully connected, except for self-connections, the total number of connections approaches 10,000 times 10,000, or 100,000,000. These hundred million connections take up about 400 MB of memory (100,000,000 weights of 4 bytes each), as there is little overhead per connection weight.

The arrays act and net hold activation and net input values, respectively. A normal JavaScript Array is used to hold all weights; each of its elements is initialized as a Float32Array holding the incoming weights of one neuron:

for (i = 0; i < n; ++i)
{
    weights.push(new Float32Array(n));
}

The ‘ordinary’ Array is suitable for slicing, so that we can assign different slices of input weights to different web workers. The workers array will hold all web workers and is used to control the parallel processing.

Web workers are created as follows:

// Create `count` workers
function createWorkers(count)
{
    var i, worker, size, offset, to;

    for (i = 0; i < count; i++)
    {
        worker = new Worker('nnww4.js');

        worker.addEventListener('message', onMessage, false);
        worker.addEventListener('error', onError, false);

        size = Math.floor(n/count);
        offset = i*size;
        // The last worker takes any remainder, so no neurons are left unassigned
        to = (i === count - 1) ? n : (i+1)*size;

        // Initialize new worker with neuron range and initial weights
        worker.postMessage({cmd:'init', id: i, offset: offset, inweights: weights.slice(offset,to)}); 
        workers.push(worker);
    }        
}

We create count workers, each of which is initialized from the file nnww4.js (nnww4 stands for: neural network web workers, version 4). The main process listens for two types of events that a worker may send: ‘message’ and ‘error’.

Next, we initialize the neural network of each worker by giving it a slice of the weights, which are seen as the inweights to a neuron. That is, inweights[5] would be an array of all weights to neuron 5 (i.e., synaptic connections on the dendrite, if neuron 5 were a real neuron). The slice boundaries are obtained by integer division of the total size n by the number of workers count. Initialization itself is accomplished with the postMessage() method, which here is passed an object that must be given a sensible interpretation by the web worker (i.e., in the file nnww4.js, discussed shortly). For example, the command ‘init’ tells the worker that it must assign all the necessary inweights and activations to start running its part of the neural network. For easy debugging and logging, the worker is also given an id and the offset (the size is implicit in the size of the slice).

In file nnww4.js we have routines for actually running the neural network, such as for learning and calculating activation values. These routines use the following data structure:

var id,
    offset,
    act, 
    inweights,
    net_slice,
    act_slice;

The difference between act and act_slice is that act is the array with incoming activations to the input side of the neurons, whereas act_slice is the array with activations calculated after processing the act input. act_slice is sent back to the main process, which assembles a new activation array from all the workers’ activation slices. This new array is then sent back to each worker as the act array to be processed in the next iteration.

The act array may be quite large and, unfortunately, it will be copied (by structured cloning) for each worker to which it is passed, as there is no data sharing among web workers. This copy operation will usually be dwarfed by the time taken to process a large slice of weights, at least in densely connected, large networks. An enhancement will be considered below, where we pass only changed activations, together with the neuron indices. This may be efficient in sparsely firing networks. Such networks will also have sparse connections, which we will consider as well.
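As a rough sketch of that enhancement (the postChangedActivations() helper and the 'update_act' command are hypothetical and not part of nnww4.js), the main process could compare old and new activations and post only the differences:

// Hypothetical sketch: post only the activations that changed, with their indices
function postChangedActivations(worker, oldAct, newAct)
{
    var i,
        indices = [],
        values = [];

    for (i = 0; i < newAct.length; i++)
    {
        if (newAct[i] !== oldAct[i])
        {
            indices.push(i);
            values.push(newAct[i]);
        }
    }

    // The worker would need an 'update_act' command that patches its local act array
    worker.postMessage({cmd: 'update_act', indices: indices, values: values});
}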

The posted messages to a worker are processed with the following code:

self.addEventListener('message', function(e) 
{
    var data = e.data, i;

    switch (data.cmd)
    {
        case 'check': // simple check that the worker is responding
            self.postMessage({cmd: "msg", contents: "Worker " + id + " reporting for duty!"});
            break;

        case 'calc_net_and_act': 
            act = data.contents;
            calcNet();
            calcAct();
            self.postMessage({cmd: "act_slice", act_slice: act_slice, offset: offset}); 
            break;

        case 'init': 
            inweights = data.inweights;
            id = data.id;
            offset = data.offset;
            act_slice = new Int8Array(inweights.length);
            net_slice = new Float32Array(inweights.length);
            break;

        case 'exit':
            for (i = 0; i < inweights.length; i++)
            {
                inweights[i] = null;
                delete inweights[i];
            }
            inweights = null;
            act_slice = null;
            net_slice = null;
            self.close();
            break;

        case 'calc_weights': 
            act = data.act;
            calcWeight();
            break;

        default:
            self.postMessage("Unknown command sent to worker");
    }
});

The data property is always present and holds the object sent by the main process. We see that the ‘init’ command leads to local copying of the inweights slice, the id, and the offset, and to the allocation of local slices for the neural data:

act_slice = new Int8Array(inweights.length);
net_slice = new Float32Array(inweights.length);

The Hopfield learning rule is triggered by sending the ‘calc_weights’ command, together with a pattern of activations to be learned; these are passed in the data.act property:

        case 'calc_weights': 
            act = data.act;
            calcWeight();
            break;
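On the main side, storing a pattern then amounts to posting this command to all workers, for example (a sketch; the random test pattern is just an illustration, and wwpost() is the broadcast helper defined further below):

// Store one random pattern of +1/-1 activations in the network
var pattern = new Float32Array(n),
    i;

for (i = 0; i < n; i++)
{
    pattern[i] = Math.random() < 0.5 ? -1 : 1;
}

wwpost({cmd: 'calc_weights', act: pattern});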

The function calcWeight() is a simple Hebbian learning rule:

// Assumes that the to-be-stored pattern is present in `act`
function calcWeight()
{
    var i, j,
        wlen = inweights.length,
        alen = act.length,
        a, w;

    if (offset === undefined)
    {
        throw("Must initialize weights before calling `calcWeight`");
    }

    i = 0;
    for (; i < wlen; ++i)
    {
        // Cache as much as possible outside inner loop
        w = inweights[i];
        a = act[i+offset];  // to-neuron must be counted from slice offset `offset`    
        j = 0;
        for (; j < alen; ++j) if (j !== i + offset)  // skip the self-connection of neuron i+offset
        {
            w[j] += a*act[j]; 
        }
    }
}

For the sake of speed, as many values as possible are cached in local variables outside the loops. For example, writing

for (i = 0; i < inweights.length; ++i) { … }

may trigger a property lookup of inweights.length at every iteration, depending on how well the JavaScript engine optimizes the loop.
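Caching the length in a local variable beforehand, as calcWeight() does with wlen and alen, avoids that repeated lookup:

for (i = 0, wlen = inweights.length; i < wlen; ++i) { … }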

There is currently just a little bit of error checking:

if (offset === undefined)
{
    throw("Must initialize weights before calling `calcWeight`");
}

The error raised is caught by the onError() function in the main process, which is registered as an event listener (see above). It does little more than post the error to the web page:

function onError(e) 
{
    document.getElementById('error').textContent = [
      'ERROR: Line ', e.lineno, ' in ', e.filename, ': ', e.message
    ].join('');
}

It is useful to know that line numbers etc. are available to aid in debugging.

Running the network itself is accomplished by posting the ‘calc_net_and_act’ command, which is interpreted as:

case 'calc_net_and_act': 
    act = data.contents;
    calcNet();
    calcAct();
    self.postMessage({cmd: "act_slice", act_slice: act_slice, offset: offset}); 
    break;

As in ‘calc_weights’, the full input act array is extracted (here from the contents property). This is followed by first calculating the net input to each neuron and then the resulting activation value:

function actRule(net)
{
    return net < 0.0 ? -1.0 : 1.0;
}

function calcNet()
{
    if (!inweights)
    {
        throw "The nnww worker has not been initialized!";
    }

    var i, j,
        len = act_slice.length,   // neurons in this worker's slice
        alen = act.length,        // total number of neurons (inputs per neuron)
        w,
        temp;

    i = 0;
    for (; i < len; i++)
    {
        j = 0;
        temp = 0.0;
        w = inweights[i];
        for (; j < alen; j++)     // sum over all n incoming connections, not just this slice
        {
            temp += w[j]*act[j]; 
        }        
        net_slice[i] = temp;
    }
}

function calcAct()
{
    var i = 0,
        len = act_slice.length;

    for (; i < len; ++i)
    {
        act_slice[i] = actRule(net_slice[i]);
    }
}

The code is straightforward, again with variables cached outside the loops. In a number of tests, we found that the overhead of calling a (small) function such as actRule() was not very large, though it did slow down processing a little (see http://jsperf.com/array-multiplication/3).

After having calculated the new activation values in the worker’s local network slice, these are sent back to the main process with

    self.postMessage({cmd: "act_slice", act_slice: act_slice, offset: offset}); 

The main process listens for the ‘act_slice’ message of each worker and uses the offset property to put each activation slice in its proper place. Once all slices have arrived, the resulting new act array can be copied to each worker to start the next iteration. Note that, should any learning have taken place during the activation phase, the weights remain local and need not be distributed to other workers. In a sense, each worker holds a specific ‘brain area’. The code for assembling the slices is as follows (leaving out some other commands):

function onMessage(e) 
{
    var data = e.data, w, offset, act_slice, i, len;

    switch (data.cmd)
    {
        case 'act_slice': 

            offset = data.offset;
            act_slice = data.act_slice;

            i = offset;
            len = offset + act_slice.length;
            for (; i < len; i++)
            {
                act[i] = act_slice[i-offset];
            }

            if (++wwcount === workers.length)
            {
                if (++iteration < max_iteration)
                {
                    wwcount = 0;
                    wwpost({cmd:'calc_net_and_act', contents: act}); 
                }
                else
                {
                    document.getElementById('result').textContent = "Finished in " 
                        + (new Date() - start_time) + " ms";

                    // Cleanup after simulation with a large network
                    wwpost({cmd:'exit'});
                    for (i = 0; i < weights.length; i++)
                    {
                        weights[i] = null;
                        delete weights[i];
                    }                    
                    weights = null;
                    act = null;
                    net = null;
                }
            }    

        break;
    }
}

If there are 4 workers, the code above is called 4 times, each time with a different slice (placed at the position given by the offset property). A count of reporting workers is kept in the wwcount variable. When all workers have reported and all activation slices have been accounted for, the new act array is posted back to all workers with:

wwpost({cmd:'calc_net_and_act', contents: act}); 

where wwpost() is a simple function that posts a message to all workers:

function wwpost(obj)
{
    for (var i = 0; i < workers.length; i++)
    {
        workers[i].postMessage(obj);
    }
}

The global variables iteration and max_iteration determine how long the whole process keeps going. We found that it is very important to do proper clean-up after exiting, because 100,000,000 weights take up a lot of memory. The ‘exit’ command, sent to each worker, releases the arrays (by nulling the references) and then destroys the worker itself with self.close() (see the switch statement above, case 'exit':).
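For completeness, here is a sketch of how such a timed run could be started from the main process (runSimulation() is a hypothetical driver; the globals wwcount, iteration, max_iteration and start_time are those used in onMessage() above, but the actual code in hopfield4.html may differ):

// Hypothetical driver for a timed run of max_iteration synchronous updates
var wwcount = 0,
    iteration = 0,
    max_iteration = 50,
    start_time;

function runSimulation(workerCount)
{
    var i;

    createWorkers(workerCount);

    // Start from a random pattern of +1/-1 activations
    for (i = 0; i < n; i++)
    {
        act[i] = Math.random() < 0.5 ? -1 : 1;
    }

    start_time = new Date();
    wwpost({cmd: 'calc_net_and_act', contents: act});
}

runSimulation(8);   // e.g., one worker per core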

So how does this code run, and how much do the web workers help? Surprisingly well, actually. Most modern browsers support web workers and many will actually assign workers to different cores on multi-processing systems. Most modern laptops and desktop systems have at least two cores and some have many more. My system has 8, and this is where I tested the code above, in the latest versions of Chrome and Firefox (using nightly builds).

Performance

In the table below, we see the time in seconds as a function of the number of web workers (WW). This was tested on a Dell machine from 2009 with an Intel i7 920 CPU at 2.67 GHz (8 logical cores) and 6 GB RAM, running Windows 7 (64 bit). We used the ‘Nightly’ builds with parallel (multi)processing activated. A good browser can place each of the web workers on its own core, which suggests that it is not useful to have more than 8 web workers.

For the simulation we used the constants as shown in the code above: n = 10,000 neurons, each connected to all other neurons, giving 100,000,000 connections, running for 50 iterations.

WW   Chrome 37 (s)   FF 35a Nightly (s)
 1      15.6            3.1
 2       4.3            3.0
 4       1.8            1.3
 6       1.2            0.81
 7       1.1 (1.0)      0.67
 8       1.1            0.61
16       2.1 (*)        0.50
24       1.8 (*)        Freeze

(*) With a lot of variability, up to 3.7 s and more.

This table was originally compiled with act = new Int8Array(n); using a Float32Array makes it a little faster on Chrome (value in parentheses) and a little slower on Firefox. We will keep using the Float32Array, though, because of future SIMD extensions, which favor same-type calculations (see below).

Inspection of the table shows that anywhere from 7 to 16 web workers gives the best result, with 0.5 s as the fastest on Firefox (version 35a Nightly build). Thinking about this: 50 iterations with 100,000,000 connections in half a second means that the fastest combination (16 workers on FF35a) ran at 10,000,000,000 connection updates per second! Not bad at all for a scripting language on a 2.67 GHz machine. We did not compare this simulation with a dedicated one in C++, but the theoretical maximum, assuming one connection update per clock cycle per core, is 8 cores times 2.67 billion, which is 8 × 2.67 = 21.36 billion. So we were running at just below 50% of the maximum speed, which is pretty good, keeping in mind that one particular process will not easily command all of a PC’s resources under Windows 7.

We also observe in the table that Firefox here is much faster than Chrome, running more than twice as fast. It also uses the 8 cores better. A generally useful setting for the number of workers, which runs well on both browsers, is C - 1, where C is the number of cores; presumably, the remaining core serves the main process.
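In browsers that expose navigator.hardwareConcurrency (a relatively recent addition, hence the fallback below), the worker count can be derived automatically:

// Choose C - 1 workers; fall back to a small default if the core count is unknown
var cores = navigator.hardwareConcurrency || 4,
    workerCount = Math.max(1, cores - 1);

createWorkers(workerCount);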

Conclusion

Using web workers, it is nowadays quite easy to run neural networks in parallel in the browser. And it is fast! Of course, native implementations in C or C++ would still give better results, but at the cost of many hours of hard work, yielding code that is harder to maintain and that can only be run on other platforms after recompiling. The code above runs anywhere a sufficiently modern browser is available. Importantly, the actual shape of the algorithms does not need to fit any specific format, such as matrix algebra, though for many neural network paradigms, such as backpropagation, additional work will be necessary to fit the flow of information through the upsweep and downsweep across the various layers to the slicing-and-dicing approach used here. Even so, activation and learning rules can stay the same, even with backpropagation. This is different, however, with another form of parallelism, which we will discuss next: SIMD. There we need to rewrite both the data structures and all computations to fit a particular vector-like format.