Wednesday, 11 September 2019

Python threading vs multiprocess

Hi all,

I've been given access to a supercomputer! That's fun!

At first I thought I was doing something clever by importing the threading module into my Python script, but I quickly discovered that all was not as it seemed. I'll cut to the nub of it.

I've written some code that rolls Yahtzees (five of a kind with six-sided dice). It's a nice simple test that generates CPU load. The first thing I realised was that my code running on a supercomputer using the threading module was no faster than on my local machine. This was a problem.
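My scripts aren't reproduced here, but the core of the task is easy to sketch. This minimal version re-rolls the whole set of five dice each time; judging by the roll counts in the results below, my real code is a bit cleverer than that, so treat the function name and structure as illustrative:

```python
import random

def roll_until_yahtzee():
    """Roll sets of five six-sided dice until one set comes up all
    matching. Returns (dice_rolls, abandoned_sets)."""
    rolls = abandoned = 0
    while True:
        dice = [random.randint(1, 6) for _ in range(5)]
        rolls += 5
        if len(set(dice)) == 1:   # five of a kind: a yahtzee
            return rolls, abandoned
        abandoned += 1            # not a yahtzee, start a fresh set
```

Each set is five rolls, so the roll count is always five times the number of sets attempted.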

After some reading I discovered that, although threading allows a form of non-serial execution, it was never designed for what I considered to be multi-threaded computation. CPython's Global Interpreter Lock (GIL) means only one thread executes Python bytecode at a time, so threads can overlap I/O nicely but can't spread CPU-bound number crunching across cores.

So I wrote a new version of the code to leverage the multiprocessing module instead, which runs separate processes rather than threads. All of a sudden we're actually doing some computation. Here are the tests I performed:
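Before the results, here is roughly how a multiprocessing version can be structured. This is a scaled-down sketch, not my actual script: the worker function, the Pool fan-out and the counts are all illustrative.

```python
import multiprocessing as mp
import random

def count_yahtzees(target):
    """Worker: roll sets of five six-sided dice until `target` of them
    come up all matching. Returns (yahtzees, abandoned_sets)."""
    yahtzees = abandoned = 0
    while yahtzees < target:
        dice = [random.randint(1, 6) for _ in range(5)]
        if len(set(dice)) == 1:
            yahtzees += 1
        else:
            abandoned += 1
    return yahtzees, abandoned

if __name__ == "__main__":
    workers = 4
    target = 400   # scaled down from the 10,000 used in the tests below
    # Each worker is a separate process with its own interpreter,
    # so there is no GIL contention between them.
    with mp.Pool(workers) as pool:
        results = pool.map(count_yahtzees, [target // workers] * workers)
    yahtzees = sum(y for y, _ in results)
    abandoned = sum(a for _, a in results)
    print("Number of yahtzees:", yahtzees)
    print("Total sets:", yahtzees + abandoned)
```

The `if __name__ == "__main__":` guard matters here; without it, platforms that spawn rather than fork will re-execute the script in every worker.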

4 Threads on 4 Cores:


Multiprocess:


[xxxxx@xxxxx ~]$ cat results.txt
Number of yahtzees: 10000
Number of dice rolls: 3436750
Number of abandoned sets: 1990095
Total sets: 2000095
Running SLURM prolog script on xxxxx.cluster.local
===============================================================================
Job started on Wed Sep 11 10:08:10 BST 2019
Job ID          : 61100
Job name        : processyatzee.sh
WorkDir         : /mainfs/home/xxxx
Command         : /mainfs/home/xxxxxx/processyatzee.sh
Partition       : scavenger
Num hosts       : 1
Num cores       : 4
Num of tasks    : 4
Hosts allocated : xxxxxxx
Job Output Follows ...
===============================================================================
Writing to file
==============================================================================
Running epilogue script on xxxxxxxxx.
Submit time  : 2019-09-11T10:08:06
Start time   : 2019-09-11T10:08:10
End time     : 2019-09-11T10:08:20
Elapsed time : 00:00:10 (Timelimit=00:15:00)
Job Efficiency is: 0.00%

Threading:



[xxxxx@xxxx ~]$ cat results.txt

Number of yahtzees: 10000
Number of dice rolls: 27977719
Number of abandoned sets: 10899334
Total sets: 10909334
Running SLURM prolog script on xxxxx.cluster.local
===============================================================================
Job started on Wed Sep 11 10:07:35 BST 2019
Job ID          : 61098
Job name        : threadyahtzee.sh
WorkDir         : /mainfs/home/xxxxx
Command         : /mainfs/home/xxxxx/threadyahtzee.sh
Partition       : scavenger
Num hosts       : 1
Num cores       : 4
Num of tasks    : 4
Hosts allocated :xxxxx
Job Output Follows ...
===============================================================================
==============================================================================
Running epilogue script on xxxxxx
Submit time  : 2019-09-11T10:07:34
Start time   : 2019-09-11T10:07:35
End time     : 2019-09-11T10:08:48
Elapsed time : 00:01:13 (Timelimit=00:15:00)
Job Efficiency is: 38.01%

Job efficiency is interesting here. It suggests threading is more efficient even though it took seven times longer. Something to look into.


20 Threads on 4 Cores:

Multiprocess:

Number of yahtzees: 10000
Number of dice rolls: 648418
Number of abandoned sets: 1416784
Total sets: 1426784
Running SLURM prolog script on xxxx.cluster.local
===============================================================================
Job started on Wed Sep 11 10:10:45 BST 2019
Job ID          : 61102
Job name        : processyatzee.sh
WorkDir         : /mainfs/home/xxxxxx
Command         : /mainfs/home/xxxxx/processyatzee.sh
Partition       : scavenger
Num hosts       : 1
Num cores       : 4
Num of tasks    : 4
Hosts allocated : xxxxxxxx
Job Output Follows ...
===============================================================================
Writing to file
==============================================================================
Running epilogue script on xxxxxx.
Submit time  : 2019-09-11T10:10:44
Start time   : 2019-09-11T10:10:45
End time     : 2019-09-11T10:10:55
Elapsed time : 00:00:10 (Timelimit=00:15:00)
Job Efficiency is: 0.00%

Threading: 


Number of yahtzees: 10000
Number of dice rolls: 20623770
Number of abandoned sets: 12071981
Total sets: 12081981
Running SLURM prolog script on xxxxx.cluster.local
===============================================================================
Job started on Wed Sep 11 10:11:06 BST 2019
Job ID          : 61103
Job name        : threadyahtzee.sh
WorkDir         : /mainfs/home/xxxxxxxx
Command         : /mainfs/home/xxxxxx/threadyahtzee.sh
Partition       : scavenger
Num hosts       : 1
Num cores       : 4
Num of tasks    : 4
Hosts allocated : xxxxxx
Job Output Follows ...
===============================================================================
Running epilogue script on xxxxx.
Submit time  : 2019-09-11T10:10:50
Start time   : 2019-09-11T10:11:05
End time     : 2019-09-11T10:12:09
Elapsed time : 00:01:04 (Timelimit=00:15:00)
Job Efficiency is: 38.28%

Efficiency is still 0% for multiprocessing, but it is finishing faster. Efficiency for threading is dropping off, which is what you'd expect: I'm asking for something silly on a service which isn't hyper-threaded, with a library that says threading but doesn't actually compute in parallel. And if that makes sense, then nothing else will :-)

20 Threads on 20 Cores:

Multiprocess:

Number of yahtzees: 10000
Number of dice rolls: 487953
Number of abandoned sets: 1865100
Total sets: 1875100
[xxxxx@xxxxx1 ~]$ cat slurm-61106.out
Running SLURM prolog script on xxxxxxx.cluster.local
===============================================================================
Job started on Wed Sep 11 10:13:34 BST 2019
Job ID          : 61106
Job name        : processyatzee.sh
WorkDir         : /mainfs/home/xxxxxx
Command         : /mainfs/home/xxxxxx/processyatzee.sh
Partition       : scavenger
Num hosts       : 1
Num cores       : 20
Num of tasks    : 20
Hosts allocated : xxxxx
Job Output Follows ...
===============================================================================
Writing to file
==============================================================================
Running epilogue script on xxxxxxxx.
Submit time  : 2019-09-11T10:13:33
Start time   : 2019-09-11T10:13:33
End time     : 2019-09-11T10:13:47
Elapsed time : 00:00:14 (Timelimit=00:15:00)
Job Efficiency is: 56.79%

Threading:


Number of yahtzees: 10000
Number of dice rolls: 18743572
Number of abandoned sets: 10097238
Total sets: 10107238
Running SLURM prolog script on xxxxxx.cluster.local
===============================================================================
Job started on Wed Sep 11 10:11:53 BST 2019
Job ID          : 61107
Job name        : threadyahtzee.sh
WorkDir         : /mainfs/home/xxxxx
Command         : /mainfs/home/xxxxxx/threadyahtzee.sh
Partition       : scavenger
Num hosts       : 1
Num cores       : 20
Num of tasks    : 20
Hosts allocated : xxxxxx
Job Output Follows ...
===============================================================================
==============================================================================
Running epilogue script on xxxxx.
Submit time  : 2019-09-11T10:13:44
Start time   : 2019-09-11T10:13:50
End time     : 2019-09-11T10:14:57
Elapsed time : 00:01:07 (Timelimit=00:15:00)
Job Efficiency is: 7.61%

This last one is very interesting. Why would multiprocessing efficiency suddenly jump to 57%? Why would threading fall to 8%? I'm launching a thread per core, just as in the 4-on-4 test.

The conclusion is a simple one though. If you want parallel computation in Python, don't use the threading module. Use the multiprocessing module. It actually does what you wanted in the first place, and it's just as easy to write for.

Thanks for reading

Friday, 30 September 2016

IT's journey from CAPEX to OPEX?

Hi all,

I thought I might share some thoughts about the financial aspect of the new wave of IT.
IT has always been a cost. It doesn't generate revenue, it's expensive, and it's sort of invisible to end users, in the same way people on a train don't care about the rail lines.

It's taken a long time for IT to be accepted by the bean counters as having a real function with a definite cost benefit, and now that we are here we need to give them some exciting news.

We're not going to be a CAPEX for very long!

Yeah!

Why would any business who could outsource their hardware to the cloud not do so?

There could be a problem here though. When we rock up to finance and say we need £1.2m for some tin, at least they can understand that something has been bought. It depreciates, it can be sold, it has value. The issue now is that with the evolution of cloud offerings, the only thing we could end up buying is TIME.

Time does not depreciate. Time is gone the moment it turns up. You can't (easily) sell on your time. Time has no value once it's used. The day is not far away when, for some places, IT will no longer be a CAPEX but a pure OPEX activity.

This means our poor finance teams need a heads-up that IT is about to hugely shake up how things are done, and they may need to adjust their figures accordingly.

Food for thought certainly.

Thanks for reading,

Redefining Software Defined Anything

Hi all,

Last night at the IG16 dinner I did the social thing and asked other people what they thought of the whole SDx thing. Apparently my comment in the Q&A session had hit home with quite a few of the more technical spods and this got us talking about what we would actually describe as "Software Defined".

Fishcake: Quite nice.

First we went over the current definition of what Software Defined is and what we thought was broken with that name. Apparently my opinion was quite well received, and we came to a group decision that SDx in its current framing is actually "Profile Defined". Humans define the Profile. The Profile is pushed to a controller of some kind. The Controller pushes the Profile to the end devices.

Lamb: Really good. Succulent. Great with the ju.

What is an API anyway? Does SDx need to be API driven? We quickly came to the conclusion that the API in all the slides didn't actually have to be an API. It just had to be a method of collecting data or sending new commands. This could be as simple as WMI calls or an SSH session. As long as the controller knows what to do, and it produces an automated response, it could even be as simple as dropping a file into a remote directory.

Creme Brulee: Really? Was nice but these things pop up everywhere. This one had fruit.

What should SDx be? We decided that the real innovation in SDx wasn't actually the controllers or the code or the software. The actual innovation here is vendors providing programmatic ways of configuring devices remotely. The software element in SDx is just replacing the human pressing the buttons. But that's not what SDx should be.

We decided that in order for SDx to be a real thing it needed to do more. It needed to do more than just blindly push human-written profiles to devices. It needed to do more than the devices themselves. It needs to become the central hub for the environment, reacting to how the environment is behaving at the time, and it needs to do this automagically. The profiles then become humans telling the controller the acceptable ranges the environment can be in, and the controller makes the decisions on how that is achieved. This would be a true Software Defined environment.

Cheese: Cheese.

So what next? Well, as our new definition of SDx is more automated and less human-controlled, it brought other technologies into the discussion. If SDx could re-route network traffic or turn off switches to save power, for example, would we need Cisco's very clever routing protocols to find "cheaper" routes? Would we need VMware's storage virtualisation or DRS? Were all of these things workarounds in preparation for SDx, and if SDx is better, how do we wind down these technologies?

Thanks for reading,

Thursday, 29 September 2016

Software Defined...anything?

Hi all,

I'm currently at the IG16 conference in Leeds, the title of which is "Software Defined...So what?". So what indeed?


A small part of the time in the talks today has been spent defining "Software Defined". Reading between the lines, it has almost nothing to do with software defining anything, and everything to do with automation and APIs.

Let me go back a bit and I'll try to explain why you don't have to worry about "Software Defined...something" and why you have actually been doing this stuff for years.

SDx, the TLA being bandied around here with great abandon, is the act of abstracting the user from the API of a given vendor or set of vendors, and giving them a core API that translates ubiquitous commands into vendor-specific commands.

Wow. What a sentence.

Ok. You are a sysadmin. You have some Dell storage units and some NetApp storage units. You also have access to the APIs and some skill in scripting. Now, some monster has asked you to create the same LUNs and configs on both units. So you write a script that does it by speaking to the two companies' APIs, which talk to their kit. Congratulations, you are now doing SDS.
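As a sketch of that idea (everything here is hypothetical: the client classes are stand-ins, not the real Dell or NetApp SDKs), the abstraction layer is just a dispatcher that turns one ubiquitous command into vendor-specific calls:

```python
# Hypothetical vendor clients; in real life these would wrap each
# vendor's actual management API.
class DellClient:
    def create_volume(self, name, size_gb):
        return f"dell:{name}:{size_gb}GB"

class NetappClient:
    def create_lun(self, name, size_gb):
        return f"netapp:{name}:{size_gb}GB"

class StorageController:
    """Translate one generic command into vendor-specific calls."""
    def __init__(self):
        self.dell = DellClient()
        self.netapp = NetappClient()

    def create_lun(self, name, size_gb):
        # One ubiquitous command, two vendor dialects.
        return {
            "dell": self.dell.create_volume(name, size_gb),
            "netapp": self.netapp.create_lun(name, size_gb),
        }
```

The point is only that the caller never sees the vendor APIs; the controller does the translating.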

Happy enough so far? Well it gets more interesting.

Let's say you don't actually care about storage configuration automation. You very rarely set up a new LUN, disk array, whatever. However, you've only got block storage (Fibre Channel/iSCSI) and you want to automate the process of end users getting SMB or NFS shares. What you do is put a server in between, write some scripts that create Windows/NFS shares on demand through a webpage or whatever, and boom: Software Defined Storage at this layer too.

It's exactly the same deal with networking. If you write a script that interacts with some Cisco switches and/or some other kit, that is Software Defined Networking.

I raised a point in the Q&A session that this description is wrong. It isn't "Software Defined X". It's "Profile Defined X". Let's take storage. What the SDS (Software Defined Storage) layer is actually doing is being your Swiss Army storage dude. It knows how to speak to all the storage APIs. You give the SDS a profile for it to follow and a device to poke the profile into, and it'll do the configuration for you.

None of this is new. People have been writing scripts to do this for years. The only difference is that now companies are getting into it and selling purpose-built devices with a profile manager and a Perl script (or similar) to run against the vendors' APIs.

The bonus is that because BIG IT are coming round to the idea, they can monetise it. As they try harder to monetise it, we are about to see some API standards being brought in. This means, ironically, it'll be easier for you to write code for the vendors' APIs and not need their devices in the first place.

Software Defined X, again, isn't new. However, the market is driving a new approach to it. We need to be receptive to it, and actively encourage it, so we can move on to more cool things. But don't forget, since it's as simple as all this, all you need is a computer and an API and you can do Software Defined anything you like.

I'm off to write a Software Defined Coffee API. Makes about as much sense as everything else.

Thanks for reading,

Friday, 15 July 2016

Byte patterns - Human DNA

Hi all,

So you may have read my previous blog on counting bytes in a file. Well, as DNA is just a combination of letters (which stand for the chemicals involved), I figured I could see the ratio of the different chemicals by using my code. Would I find anything interesting?

So here they are:

X-Chromosome

Y-Chromosome


As you can see, not a lot 'interesting'. I was hoping that there would be some definite differences between the two. Saying that, however, maybe the similarities and the ratios are the interesting thing about the result.

Oh, in case you are wondering. I haven't discovered a new chemical involved in DNA. That's actually line breaks!

Thanks for reading!




Friday, 8 July 2016

More Pratting Around With Colour and Data...

Hi all,

So there I was trying to work out why data is so huge, trying to work out a way of handling it better, when I had a thought. What actually is data?

Data, in simplest terms, is a string of bytes. In order for that data to be useful we put those bytes into a logical order to describe things such as text, images and the like inside a file. Normally these files have headers and other such human recognisable features to describe how the computer readable data is laid out inside the file.

I digress. Data is (currently) arranged in files for us humans to look at. Files have structure, all future files will have structure, so I want to predict what those structures will be. To that end I started playing.

Plain Text file


I wrote a script in Python that can count the occurrences of byte values. Its output is a PNG where a colour's tendency towards red shows a byte value's higher frequency relative to the total number of bytes in the file.
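The script itself isn't shown here, but the counting step is simple. A minimal version (leaving out the PNG rendering) might look like this:

```python
from collections import Counter

def byte_frequencies(path):
    """Count occurrences of each byte value (0-255) in a file and
    return each value's frequency as a fraction of the total bytes."""
    with open(path, "rb") as f:
        counts = Counter(f.read())   # iterating bytes yields ints 0-255
    total = sum(counts.values())
    return {b: counts.get(b, 0) / total for b in range(256)}
```

Mapping those fractions onto a colour scale is then just a rendering detail.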

As you can see, byte values that correspond to language-based ASCII characters feature heavily. Now, what happens if we zip it?

Zipped Plain Text


As you can see, the image is a lot flatter. This is because of the way zipping a file works. Essentially it records one or more bytes and then back-references them on the next occurrence. I can now show that the maths for different types of compression is different.
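You can see the flattening effect without any plotting at all. Python's zlib module implements DEFLATE, the same algorithm zip uses, so a quick sketch (the sample text is just an illustration):

```python
import zlib

# Repetitive "plain text" uses only a handful of distinct byte values;
# its compressed form packs information densely, so a much larger share
# of its (far fewer) bytes are distinct values across the 0-255 range.
text = b"the quick brown fox jumps over the lazy dog " * 500
compressed = zlib.compress(text)

text_spread = len(set(text)) / len(text)
compressed_spread = len(set(compressed)) / len(compressed)
print(f"text: {len(text)} bytes, {len(set(text))} distinct values")
print(f"compressed: {len(compressed)} bytes, {len(set(compressed))} distinct values")
```

The compressed output is both far smaller and far more uniform in its byte usage, which is exactly the flatness in the images.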

7z file


See? Proof positive that something else is going on. Ok, ok, but it is interesting, isn't it? How it's flattened out? It's something I've noticed with other file types too. Here are some below:

JPG file.

PDF file

MSI file




The smaller the file, the higher the compression and the more lossy the data, the flatter it is.

Nobbing around with data is something I've been doing for years. I am a great believer that we have already reached "L-Space". Everything that has ever been written, or ever will be written, is already out there. With the birth of huge datastores and the internet, it should have transpired that every combination of bytes has been written somewhere.

Thanks for reading,



Friday, 1 July 2016

Hybrid mind from hybrid cloud

Hi all,

The human mind is one of the greatest achievements in the natural world; its power and utility are still not entirely known, its limits unfathomable.
There are true geniuses, ones who can consciously direct their thoughts and imagine the universe of the very small and the very large, ones who can describe the beauty of nature in pure mathematics. Then there are those of us who are terrible at maths, yet whose minds can subconsciously estimate the speed of objects, the distance required to travel, the angle of the road, and any physical limitations of movement in order to cross a road or catch a ball. The mind can also prioritise very well. Do we need to concentrate on the lion in the bushes, or the stone in our shoe? It is a fantastic machine.

But it is not without its limitations. Yes, sometimes the mathematical ability is hidden from us. We can instinctively cross the road or catch a ball, but times tables are still a mystery. It's affected by mood, emotions, tiredness and, most importantly for what I want to focus on here, distraction and procrastination.

First though, a quick leap to computing. Computers are one of the greatest inventions of the human race. The power and speed of processors has increased at a near exponential rate. The speed of storage has increased by a phenomenal amount too, since the invention of punch tape or physical switches. In the early days, we were limited to doing things one after the other, but then RAM and multi-core and multi-threaded processors came along. We could do more at once. No longer limited to sequential processing or access, things could be done at the same time and delivered faster.

And here is the rub.

In the IT sphere we have been trying, for decades now, to speed up computers. They are blisteringly fast. Any question you want answered can be answered very quickly. Any communication will arrive instantly and pop up on our screen. Email, instant messaging, Facebook, that report we wanted, that spreadsheet that's taking a while. All popping up as soon as they are ready. Each fighting for our attention. Each requiring our poor brains to switch focus again and again and again.

This is wrong.

We created computers so we could tell them what to do for us. Now they are driving. All these pop-ups and alerts are major distractions; each one has to be prioritised and thought about by our minds before we decide whether to click on that report that's just come in. Do you need to open that report, or reply to Kevin who has just IM'd you? Do you need to reply to that Facebook comment? Do you need to answer that email right now? Even if you don't need to talk to Kevin, you still need to open the chat window to prioritise it against the other things that have 'just popped up'.

My suggestion is that for our normal devices, our desktops and laptops and pads and phones and all-in-ones and smartwatches (an ad nauseam list in the 21st century), we stop trying to make them faster. We need to make them "smarter". More like us. More for us. More like the way we think.

In order for this to happen we need to become more like the computers of yesteryear. More serial, more focussed. Like we used to be, when we used to make, build and hunt. We need to rest our conscious mind by receiving things in a serial manner, one job at a time. This will stop distractions. Less stress from switching thoughts will also free our subconscious mind to solve other issues, instead of trying to remember what that last email said. We don't need our computers to be fast, we need them to deliver the answer when we are ready.

At this point, we don't need our technology to be faster, but we do need to become more symbiotic with our technology. Our technology needs to be more sympathetic to our needs, our energy and our minds. Our computers should know when we are ready for another distraction; another report, another message, another picture of a cat even. Instead of the constant bonging, binging, pinging, popups and other distractions that cause our attention to be dragged away from our focus.

Thanks for reading