At Lemurian, our goal is to make deep learning affordable and accessible to everyone, from individual researchers to industry.
The Lemurian SPU (Spatial Processing Unit) is more than an order of magnitude more memory- and power-efficient than legacy processors, while also delivering faster inference.
Our processor uses a novel digital arithmetic designed specifically to accelerate the matrix-vector multiplications that make up the bulk of modern deep neural networks (illustrated in the short sketch below), while also reducing hardware complexity.
Together, our arithmetic and processor architecture deliver tremendous power and memory savings, making our processor more affordable and well suited to inference at the edge.
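To make the matrix-vector claim concrete, here is a minimal NumPy sketch of a single dense layer; the layer sizes, names, and values are arbitrary assumptions for illustration, and the code is generic example code, not SPU or Lemurian software.

```python
import numpy as np

# Illustrative only: the core of a dense neural-network layer is one
# matrix-vector product, followed by a cheap elementwise nonlinearity.
rng = np.random.default_rng(0)

in_features, out_features = 512, 256
W = rng.standard_normal((out_features, in_features)).astype(np.float32)  # weights
b = rng.standard_normal(out_features).astype(np.float32)                 # bias
x = rng.standard_normal(in_features).astype(np.float32)                  # input activations

y = np.maximum(W @ x + b, 0.0)  # matrix-vector multiply, bias, ReLU

# W @ x alone costs in_features * out_features multiply-accumulates,
# which is why accelerating this one operation dominates inference cost.
print(y.shape)  # (256,)
```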
Lemurian Labs Technology
The SPU is the world's first processor designed from the ground up to accelerate the entire robotics pipeline end to end. It delivers jaw-dropping performance out of the box without sacrificing precision, camera count, camera resolution, frame rate, or workload. When building with the SPU, engineers no longer need to compromise.
Many companies talk about TOPS/W, the theoretical peak number of operations a processor can perform per second per watt, but this doesn't give software engineers much to go on. What matters is how much of that performance can actually be reached from software, and how productive engineers are when using the processor.
We have taken a software-first approach to architecting our processor, optimizing it for insanely high inferences per second per watt without imposing restrictions on the kinds of workloads engineers want to run. This way, engineers don't have to waste precious time on performance engineering and can instead focus on the more important task of improving their models.
Deep learning and reinforcement learning are dominated by matrix math and are therefore very computationally intensive. Most hardware performance gains have come from optimizing the hardware for a narrow set of workloads and from reducing the precision of the number format (INT8, INT4, analog, etc.). But as the industry is now learning, this is the wrong approach.
Safe, full autonomy requires floating-point-like precision to properly cover the distribution of weights in neural networks. However, floating-point formats wider than 16 bits are over-precise and oversized, not to mention costly in silicon. We have found that quantizing to INT8 degrades neural-network stability, leading to misclassifications in the real world.
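To show where that instability comes from, the sketch below runs a generic symmetric per-tensor INT8 quantization round trip in NumPy. It is a textbook illustration of the error mechanism, not a reproduction of our internal experiments; every value and name in it is made up for the example.

```python
import numpy as np

# Generic, illustrative symmetric per-tensor INT8 quantization round trip.
# It shows where error comes from when a wide dynamic range is forced
# into 256 levels.
rng = np.random.default_rng(1)

# Neural-network weights are typically bell-shaped with a long tail of a
# few large-magnitude values.
w = rng.standard_normal(10_000).astype(np.float32)
w[:10] *= 20.0  # a handful of outliers stretches the dynamic range

scale = np.abs(w).max() / 127.0                    # one scale for the whole tensor
w_int8 = np.clip(np.round(w / scale), -128, 127)   # quantize
w_deq = (w_int8 * scale).astype(np.float32)        # dequantize

err = np.abs(w - w_deq)
print(f"scale          : {scale:.4f}")
print(f"mean abs error : {err.mean():.4f}")
print(f"max abs error  : {err.max():.4f}")
# The outliers inflate the scale, so the many small weights near zero get
# only a few quantization levels each, which is exactly the stability
# problem described above.
```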
The number format is a centrally important parameter in processor design: it affects overall performance and hardware complexity, with direct impact on storage requirements, processing performance, and power dissipation. We set out to create a number format that best meets the needs of AI while delivering performance and efficiency gains.
We call it PAL8 (Parallel Adaptive Logs). With this new format, we are freed from the constraints everyone else is forced to work within, and we can design the right processor to enable the era of autonomous things.
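The "logs" in the name points to the broad family PAL8 belongs to: logarithmic representations, in which a value is stored as a sign and a logarithm, so that multiplication in the linear domain becomes addition in the log domain. The sketch below shows only that generic principle in Python; the actual PAL8 encoding, bit width, and adaptive behaviour are not described here, and every function name and number in the example is a stand-in.

```python
import math

# Generic logarithmic-number-system (LNS) illustration. This is NOT the
# PAL8 format; it only demonstrates that storing values as logs turns an
# expensive multiply into a cheap add.

def to_lns(x: float) -> tuple[int, float]:
    """Encode x as (sign, log2|x|). A real hardware format would also
    quantize the log to a fixed number of bits and reserve a code for zero."""
    assert x != 0.0, "zero needs a special encoding in a real format"
    return (1 if x > 0 else -1, math.log2(abs(x)))

def lns_mul(a: tuple[int, float], b: tuple[int, float]) -> tuple[int, float]:
    """Multiplication in the linear domain is addition in the log domain."""
    return (a[0] * b[0], a[1] + b[1])

def from_lns(v: tuple[int, float]) -> float:
    return v[0] * 2.0 ** v[1]

x, y = 0.75, -3.2
prod = from_lns(lns_mul(to_lns(x), to_lns(y)))
print(prod, x * y)  # both roughly -2.4, up to floating-point rounding
```

In a full logarithmic system, addition becomes the hard operation and is typically handled with approximation or lookup tables, which is part of why a production-grade format involves far more than this toy encoding.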