Wednesday, 21 February 2018

Mapping the ground

You'll have to take my word for it, I'll never find a reference for this story. Ireland was mapped in the 19thC by teams of land-surveyors: chains, theodolites, triangulation and wet socks. Cost of labour has rocketed since the 1840s, so that approach was off the table when the Ordnance Survey of Ireland needed to update their maps 30 years ago - because the pace of change and development was hotting up and the 1911 survey didn't show a rattle of by-passes, bridges and housing estates. This obsolescence of old paper maps benefitted me last year when I purchased the whole SE corner of Ireland at 1:50,000 scale. So the OSI put out to tender a contract to capture the landscape in a precise series of overlapping aerial photographs. The company that won the tender had looked at Irish weather, the size and speed of their Cessna and reckoned that they would budget 18 months to be sure to be sure of getting enough cloud-free days to cover the country. They started in early Summer in the midst of a rare anticyclone, flew all the hours of daylight and finished the task 15 days later: Win!

The gossip is that the Department of Agriculture does it all by satellite now, snapshotting every field every five days. That may be just a rumour to stop farmers burning their hedge-clippings and throwing their plastic silage wrappers into the bonfire. You may be certain sure that the Dept Ag doesn't have the manpower to analyse all these data: they are too busy drinking tea and doing sudoku at their desks.

Photo-technology has moved on mightily since the 1990s. You just think Google Streetview. My students capture the view down their microscopes with their smartphones . . . which is a little annoying. One of the most amazing technologies matches the macro with the micro and uses LiDAR to detect minute variations in height over swathes of countryside - even when the hard surface is masked by vegetation. We know what LiDAR is and how it works but there isn't consensus on what the acronym means; either light detection and ranging OR light imaging, detection and ranging. See The Blob on APGAR for a backronym.
The UK Environment Agency has taken out a contract to map the whole country from the air using LiDAR. They are hoping to discover wonders like the layering of a modern motorway and a 19thC farmstead on top of a Roman fort and its associated access roads [L]. Bloomin' amazin'. The amount of data when you map the entire country at 1 metre resolution is impossible to store, let alone process, without a huge server. But these data are going to be made freely available to industry and Josie Public and you may bet your sweet bippy that the data are going to be put to some quite unforeseen creative use. There are 11 terabytes of data in it, which has been downloaded 500,000 times and stored on another server somewhere else. That's a LOT of bytes. But it still is a microdot compared to the trillion crap photos which are uploaded to The Cloud every year . . . never to be seen again.
In England, even if there are some occluding trees, you can still get access to the interesting sites for field work on the ground. It's a bit different in the Central American jungle where the storied vegetation fills the sky from ground to 50m up and then cascades down again as lianas, vines and thorns.  With LiDAR you can strip away the jungle to reveal a whole network of interconnected Mayan cities where nowadays only people with blow-guns hunt for a bit of bush-meat. And the mapping data doesn't stay on the server; chunks of it can be downloaded onto a laptop to create a virtual reality landscape linked to GPS. The archaeologist can image the next pyramid invisible in the impenetrable jungle and make a direct bee-line for the gods of the Maya. Asombrosamente increíble!

Tuesday, 20 February 2018

One letter code

Margaret Oakley [R after she became Margaret O Dayhoff through marriage] was born in Philadelphia in 1925. You cannot overestimate her importance to the development of the tools for making sense of biological sequences. For Dayhoff, the same claim can be made as for Dennis "C and Unix" Ritchie: without them it would all be different. Grace Hopper, inventor of COBOL, was another woman in the right place, at the right time, with the right mindset and toolkit and she has a pretty high profile. Margaret Dayhoff otoh really doesn't get the same press but her contributions have had more impact; not least because she kicked off the area of bioinformatics and molecular sequence analysis which has supported me for almost all my working life. Developing a whole new field is chaotic - in the sense that it is sensitive to initial conditions.

I've riffed before on Pointless - the TV quiz game where success is when you can give a correct answer which nobody else has picked. If the question is "Name a female scientist who contributed to biomedical science in the late 20thC" then Margaret Dayhoff will be a winning Pointless answer. The answer to "Which pair of scientists made the first contribution to cracking the genetic code?" is not "Crick and Watson" - they 'just' gave us the physical structure of DNA. It is rather Nirenberg and Matthaei who in 1961 determined that UUU codes for Phenylalanine. That was the first codon assignment. The rest tumbled into place over the next 4 years, revealing that 20 amino acids are the basic inventory from which all proteins - all the enzymes, all the receptors, actin & myosin, haemoglobin, oxytocin, insulin - are constructed. The trouble is that the 20 amino acids were known and named years before the genetic code was AThing. The smallest, glycine, is from γλυκός glycos because it tastes sweet. I'm not sure about the connexion with soya Glycine max. Serine was first isolated from sericum the Latin for silk etc.

Dayhoff's first qualification was in mathematics which she subsequently started to apply to physical chemistry including the nature of chemical bonds. From there she moved into the structure of proteins and applied her mathematical and computing toolkit to the storage, retrieval and analysis of protein sequences - of which an increasing number were coming on stream. In 1960, she was appointed associate director of the National Biomedical Research Foundation in Maryland. Back then, protein sequencing was running in parallel and quite a way ahead of DNA/RNA sequencing. The first substantive piece of RNA sequencing saw RW Holley take a whole year 1965 to work out the 80ish bases of Alanine tRNA. That would now be knocked off in a μ-second. aNNyway, Dayhoff saw that the inventory of protein sequences was growing exponentially and, albeit from a small baseline, was going to get massive. Writing down each sequence on paper wasn't going to be the answer. Accordingly, she started to record sequences on punched cards [prev] and quickly grew dissatisfied with the convention that each amino acid was represented by a three-letter abbreviation based on its first three letters in English: Phe, Gly, Ser have been mentioned above. Dayhoff realised that with only 20 AAs in the inventory, each could be uniquely identified with one of the 26 letters in the Latin alphabet.

But whoops, here are those 20 amino acids: alanine - arginine - asparagine - aspartic acid - cysteine - glutamine - glutamic acid - glycine - histidine - isoleucine - leucine - lysine - methionine - phenylalanine - proline - serine - threonine - tryptophan - tyrosine - valine - and the first thing you note is that 20% of them begin with A!  So her first pass was to assign the easy [unique initial] ones:
  • C H I M S V 
  • it was also easy to assign F to phenylalanine at this stage which freed up 
  • P for proline
  • 8/20 done
the next decision was to give priority to the first in alphabetical order:
  • A = alanine; [G = glutamine]; L=leucine; T = threonine
  • that allowed K for lysine as the next unassigned letter in the alphabet.
  • 13/20 done
hmm, she thought, there are two cluttering overlaps because of the acid side-chains aspartate and glutamate and their amides asparagine and glutamine so:
  • let's reverse a bit to give G = glycine then
  • D = aspartate, then E = glutamate to fill in the early hole between C = Cys and F = Phe
  • N = asparagiNe and Q = glutamine [G looks a bit like Q] fills a similar later hole.
  • note that D precedes E because Aspartate precedes Glutamate
  • (18-1)/20 done
The rest are assigned by their second letter
  • R = aRginine; Y=tYrosine
  • and W the biggest letter is given to the largest amino acid tryptophan
  • and that's it!
  • 20/20 for Margaret Dayhoff
That, now universally agreed convention, was determined by the contingency that Dayhoff spoke English at home. If she'd been born in Tampere, and followed the same algorithm then K would be assigned to cysteine [alaniini - arginiini - asparagiini - asparagiinihappo - kysteiini - glutamiini - glutamiinihappo - glysiini - histidiini - isoleusiini - leusiini - lysiini - metioniini - fenyylialaniini - proliini - seriini - treoniini - tryptofaani - tyrosiini - valiini] and all bets would have been off if she'd come from Kiev [аланін - аргінін - аспарагін - аспарагінова кислота - цистеїн - глутамін - глутамінова кислота - гліцин - гістидин - ізолейцин - лейцин - лізин - метіонін - фенілаланін - пролін - серин - треонін - триптофан - тирозин - валін].
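The 20 assignments walked through above boil down to a small lookup table. Here is a minimal Python sketch of it; the names `ONE_LETTER` and `to_one_letter` are mine, invented for illustration, and the keys are the standard three-letter abbreviations.

```python
# The 20 one-letter assignments from the walk-through above, keyed by the
# standard three-letter abbreviation. ONE_LETTER and to_one_letter are
# illustrative names, not part of any library.
ONE_LETTER = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert a sequence of three-letter codes into a one-letter string."""
    # capitalize() lets "PHE", "phe" and "Phe" all find the same key.
    return "".join(ONE_LETTER[r.capitalize()] for r in residues)


print(to_one_letter(["Phe", "Gly", "Ser"]))  # FGS
```

Three characters per residue shrink to one, which is the saving Dayhoff was after on her punched cards.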

Life has gotten more complex since those idyllic simple early days: we've discovered selenocysteine Sec U and pyrrolysine Pyl O. We finally give B to aspar* and Z to glutam* as ambiguity codes because a lot of the chemical protein sequencing protocols render the acids indistinguishable from their amides. Phew! with U and O we have a full set of vowels to play with.

Now the alphabet is almost full [J and X only unassigned] and we can use protein sequences to write names as a kind of geek-code. If you want to out-geek the geeks you can write your name as a peptide using Peptify a toy developed by Nuritas to stop their employees playing solitaire on their lunch-breaks. Nuritas is the spin-off of Nora Khaldi [bloboprev] an entrepreneurial woman in science. Here's PeptoBob me:

Monday, 19 February 2018

Frozen accident

The diagram above, largely due to Willie Taylor, is perhaps the most important chunk of infrastructural information in molecular biology. I've been here before with The Masters of Imm. It shows the common ground as to size, electrical charge, solubility among the 20 amino acids which make up proteins. A 'conservative substitution' is a change in the inventory of amino acids AAs that will cause least structural change to the constituent protein: roughly called if/when two amino acids appear in any one of the gathering-together circles. But some are more conservative than others! Leucine L and isoleucine I are essentially the same; whereas Glutamate E and Lysine K are both 'charged' but one is negative and the other positive, so they are not quite the same. How to quantify this? One way is to count the number of changes in the underlying DNA that will result in a change of amino acid.
green-for-go arrows are AA changes that require only a single change in the DNA, red-for-harder are examples where two changes are needed and some amino acid substitutions require three changes: there is NO commonality in the codons for TRP and ASP or between CYS and MET. It was long ago noted that conservative substitutions tend to be in the same row or column - they involve only a single change to the DNA.  That has implications for the evolution of the genetic code from a simpler arrangement with fewer amino acids which got more complex through a series of mutations which were found to have utility wrt survival and procreation. Perhaps more importantly it provides a bit of stretchability and robustness: few substitutions are going to create huge waves in the structure and integrity of the protein: UCC = Ser to ACC = Thr just adds a small methyl group to the AA side-chain; UAU = Tyr to UUU = Phe still results in a large aromatic hydrophobic amino acid. In other words, the genetic code is not really a 'frozen accident' but is extremely non-random.
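Counting the DNA changes between two amino acids is easy to mechanise. The sketch below builds the standard genetic code table and reports the fewest single-base changes needed to get from any codon for one amino acid to any codon for another. The function names are mine, and "fewest over all codon pairs" is my reading of what the arrows count.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code with codons in the order UUU, UUC, UUA, UUG, UCU ... GGG.
# '*' marks the three stop codons.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codons_for(aa):
    """All codons that encode the one-letter amino acid `aa`."""
    return [codon for codon, a in CODON_TABLE.items() if a == aa]


def min_changes(aa1, aa2):
    """Fewest base differences between any codon for aa1 and any codon for aa2."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in codons_for(aa1)
               for c2 in codons_for(aa2))


for a, b in [("S", "T"), ("Y", "F"), ("W", "D"), ("C", "M")]:
    print(a, b, min_changes(a, b))
# S T 1
# Y F 1
# W D 3
# C M 3
```

Ser/Thr and Tyr/Phe are one step apart, while Trp/Asp and Cys/Met need all three positions changed, matching the examples in the text.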

These issues are teased out at length by Koonin & Novozhilov here [for free]. They have bearing on the problem of measuring similarity which is currently engaging some of my students in their final year research project.


Sunday, 18 February 2018

Sunday Misc 180217

Really miscellaneous today:

Saturday, 17 February 2018

Give us a hand

I love my job: it's not too hard on the knees, I have a great deal of autonomy, the work is within my competence but it's possible to embrace greater challenge if I'm bored. Right at the beginning of my career I had another wonderful job: working in Diergaarde Blijdorp aka Rotterdam Zoo. The work was physical, dirty and often soggy [my position was in Afdeling Vissen - aquarium-land] but I really looked forward to each working day. Every day was different but enough routine so that institutionalised me didn't go off the rails. My work-mates were a motley crew: a taxidermist; an amateur herpetologist with a flat full of live reptiles; the foreman was Afrikaans; one fellow couldn't wake to an alarm-clock but had to be phoned; two guys who'd done National Service in signals and talked in Morse. They'd all left school in their mid-teens because they loved animals, but many of them had a deeper knowledge of biology than BSc me. The only bloke with any sort of formal higher education was Chris who worked in the book&gift shop.  If I ever wanted to talk about things other than work or animals, I'd drift in to visit with Chris for a couple of minutes. If I kept a bucket in my hand it could pass for work.

Chris's eccentricity was that he couldn't walk, his limbs were banjaxed by a neuro-degenerative disease and he was delivered to work by his full-time night carer in a wheel-chair van and collected in the evening. At work, if he needed to get to the jacks, he'd flag down one of his co-workers for the small amount of help needed. Some were more engaged in the helping than others. A couple of years before I appeared on the scene, when it was proposed that Chris might be coming to work, the management asked The Lads if they were willing to facilitate this stranger's transition to gainful employment. The response was 'mixed': some willing but apprehensive; some feeling 'whatever'; some were proud to be given a chance to give back. The only person who was vocally against the whole project was Jan; he got really cross about the imposition and the unspoken peer pressure and denounced the management for ticking social-inclusion boxes. Turned out that Jan had been in a desperate traffic accident in his early 20s, spent weeks in hospital, and months in rehab - it was touch-and-go whether he would ever walk again. Clearly, he had some justifiable baggage about Project Chris.

Things had settled down to same-old-same-old routine by the time I rocked up. At a certain time in the middle of the morning, we'd all down buckets and brooms and schlep off for coffee and buns in the staff canteen over by the elephant house. Chris had a joy-stick operated motorised wheel-chair and someone was likely to hop on the back axle to cadge a lift. Equally likely, if the weather was fine and The Lads consequently frisky, someone would steer Chris into the shrubbery <ho ho> in the same way that we might throw snowballs at each other if there was a dusting of snow. As a late-comer on the scene, I witnessed that the most attentive person for Chris's welfare and inclusion was Jan. He had completely changed his relationship with disability; in a way Chris had healed the sick. When I learned the back-story, I was quite unaccountably buoyed up for the rest of the day.
This all came flooding back to me when I saw a short film about disability made by some local lads for the Donal Walsh #LiveLife National Film Competition. Donal Walsh was a Kerry teenager who died from cancer in 2013. The Film competition is to continue Donal's I'm done for but you-all should live life to the full message . . . and don't top yersel' ye daft buggers.  This may remind you of Stephen Sutton another early departer and The Boy's hi-jinks driving a wheelchair. The filmlet cited above is far better than the competition! Better story board, better acting, better lighting, better continuity. Most importantly, from my experiences in Blijdorp [above], the story has the ring of truth. If Cormac Lalor doesn't win the competition, I'll be calling "Fix!"

Friday, 16 February 2018

Where do we III come from?

My sense of identity really hasn't exercised me in any emotional way. Coming from Horse-riding-Protestant stock from King's County, my sense of self, and expectation of entitlement, is bred in my bones. Being straight, white, male and middle-class helps too. We know exactly when our family established its patrimony in Ireland - 1643 - and the manor house in Wales from which we migrated. The family takes this ancestry schtick with a pinch of salt, a dollop of humility and a wry smile knowing that my great grandfather was the 'natural son' of the owner of the Big House. My PhD thesis hinged on the idea that by looking at present day populations we could infer something about their ancestry and therefore inform people about the pattern of colonial migration in New England and the Canadian Maritimes. A similar analysis can be driven by looking at much richer and more extensive data of European human genomic DNA variation.

I've looked at this sort of analysis before I through 23andMe, and II through linguistic analysis, and also meanderings about PIE. We are now a little bit more confident about where the Brits come from. We used to do this all the time in population genetics and molecular evolution: we inferred ancestral states from present day DNA because the dead are dead and disintegrated beyond yielding sequencable DNA.

Not any more! The technology for making sense of ancient DNA has moved on really fast and far in this century. The person who has delivered the most quality ancient human material into the public domain could well be Lara Cassidy [R], a 20-something PhD candidate working in Dan Bradley's [bloboprev] Archaeological Genetics lab in Trinity College Dublin. Ancient DNA work is really difficult: you need to be a good pair of hands: dexterous, meticulous, painstaking, tidy in your habits and careful of your data.  Any DNA that exists from hundreds or thousands of years ago is going to be degraded, fragmentary and hard to recover. The least bit of contamination: a fallen eyelash or a fingerprint; something left from your last experiment; a cough from the cleaner; will deliver enough contaminating DNA to swamp out any signal from your current sample. Then you must have a completely different set of skills in computational analysis, number crunching and programming. Ancient DNA is like running a time machine: from a fragment of bone [preferably the 'petrous' bone near the ear-hole] we can see certain attributes of the long-ago dead: their sex; their skin, eye and hair colour; their probable height; their susceptibility to disease. It's as if Achilles or Cúchulainn walks again. Cassidy has knocked off numerous ancient DNA genomes! A life-time's work in 4 years.

One of the most fraught questions in Irish departments of archaeology and anthropology is whether we are the direct, genetic, descendants of the builders of Newgrange and the folk of antient legend. Did those people adopt the cultural practices and borrow the tools of more sophisticated neighbours or were those ancient people displaced by the bearers of those tools and artefacts? 100+ years ago, with the British Empire as the invisible background to cultural discourse, the consensus was that superior migrants had brought culture to the benighted West. The next generation was throwing shapes about national identity after a bloody war of independence and 'migration' became a dirty word. The next generation after that adopted a bit of this a bit of that compromise position. Cassidy and her co-workers have now dumped a sackful of data on the fossil-cluttered desks of archaeologists and shown, maybe uncomfortably, that the colonial invasion / physical displacement model is most likely true. Here's the data graphed out [explanatory background yest]:

The further back you go, the thinner the seam of data gets. The earliest DNAccesible human bones in Ireland were discovered by spelunkers in a limestone cave in the NW. They have been carbon dated to the Mesolithic and their owners /users were probably hunter-gatherers. But in terms of Eurogenetics, those bones are on another planet.
A more recent, and much better preserved skull [L reconstructed head of Our Lady of Ballynahatty] was unearthed at Ballynahatty near Belfast. She is Neolithic and from an era that had embraced farming. The largest cultural artifact of that era, 120km due S but still in Ireland is Newgrange: a mighty pile of engineered stones, some decorated, some sorted by colour, protecting a passage tomb whose access passage precisely aligns with the Winter Solstice. It is older than Stonehenge, older than the Pyramids at Giza. Her DNA profile [marked Bh above] bears no genetic resemblance to modern Irish people but slides neatly into place between Spain and Sardinia; she was clearly European but not our sort of European.

1,000 years later, another cultural transition appears in the archaeological record. It is fatuous and just wrong to think of the Neolithic society which created Newgrange [and the Ringstone, of which we are Guardians] as "banging two rocks together". That society was cohesive, sophisticated, religious, hierarchical and driven by the aesthetic. But the metalworkers from The East bringing copper and tin together in durable bronze weapons & knives; gold fancy-goods; and distinctive domestic pottery were a different culture altogether. The team from Trinity have shown clearly that they were different genetically as well. Three Bronze Age skulls from Rathlin Island off the Antrim coast of Northern Ireland have now also had their genomes exposed to the public gaze. They are a bit on the edge of the local modern demographic [marked Ra on the genetic map above] but recognisably of and from these WEA islands. It's all been published in the prestigious PNAS.

You might think that Lara Cassidy is lucky to have gotten such a fabulous project with which to get her start in science [7 peer-reviewed pubs; two as 1st author; 1 in Science 2 in PNAS; not to mention all the press coverage]. It is not always like that: with the best will and skill in the world you can sign up to a project which has no hope of working out because it has been poorly conceived or grossly underfunded or has a terrible supervisor. But that project was lucky to get Cassidy because her telegraphic CV indicates quite extraordinary levels of achievement = determination and dedication. I've suggested before that you make your own luck, through finding a good fit to your talents and working damned hard at your craft. You can almost hear a Professor Bracknell echoing Oscar Wilde with "To sequence one ancient genome, Ms Cassidy, looks like fortune; to sequence several looks like carefulness."
Bob: B'godde Bracknell, I wish I'd said that.
Bracknell: You will, Bob, you will next [last] time you are invited to Commons at TCD

but it's not about me, it's about More women in science.

Thursday, 15 February 2018

Eurogene - the map

I told y'all that you should go to Dublin on Darwinday to hear Dan Bradley talk about the Genetic Origins of the Irish. But I know that some things you can't delegate: you just have to do them yourself. Accordingly, I leapt into the Little Red Yaris at 1705hrs and drove to Dublin to hear the news from the frontiers of biogeography. But the news is always based on the olds and the most beautiful and informative picture of my 2018 [at top: far better copy] was published in 2008! I may well have been entranced by that map when it came out ten years ago, but I've since forgotten all about it. Heck, I've forgotten my car-keys and where I left my glasses as well.

That map is Fig 1 in a paper in Nature: Genes mirror geography within Europe which sampled the sequenced genomes of 3000 Europeans (and four Turks) and tallied up each person's state at 500,000 different variable sites in their DNA sequence. That's a shed-load of data and you can't make much headway by ticking off (3000 x 3000)/2 x 500,000 cases of Sean is different from Jean here but the same there, while Giovanni is different again. Well actually you can, and that's what John Novembre et al. did in 2008. They put the whole dataset into a hopper called Principal Components Analysis and gave it all a good shake and a jiggle. PCA reconciles all the internal inconsistencies, and calculates the position of each person in N-dimensional hyperspace. No, I too only have a hazy notion of what that really means but in practice it calculates how near or far each person is from each other person in the dataset. It will come as no surprise when it turned out that the quartet of Turks looked really similar to each other genetically and rather different from the Europeans . . . and the Irish too: like each other, quite similar to Brits and Scots and less like Poles and Greeks. A lot of that difference will smooth itself out over the next 100 years as our 200,000 resident Poles make babies with their Irish neighbours.
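For anyone else with a hazy notion of PCA, a toy version may help. The sketch below is emphatically not the pipeline Novembre et al. used: it finds only the first principal axis, it swaps in plain power iteration for a proper eigen-decomposition, and the six people and their genotypes (0, 1 or 2 copies of a variant at four sites) are made up for illustration.

```python
import math
import random


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) along the first principal axis of `rows`."""
    n, m = len(rows), len(rows[0])
    # Centre every column (variable site) on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - means[j] for j in range(m)] for r in rows]

    def project(v):
        # Position of each person along direction v.
        return [sum(x * w for x, w in zip(row, v)) for row in centred]

    # Power iteration: repeatedly apply X^T X to a random start vector.
    rng = random.Random(0)
    v = [rng.random() for _ in range(m)]
    for _ in range(iterations):
        scores = project(v)
        v = [sum(centred[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v, project(v)


# Hypothetical genotypes: three people from one population, three from another.
people = {
    "Sean":     [2, 2, 0, 0],
    "Sinead":   [2, 1, 0, 0],
    "Jean":     [2, 2, 0, 1],
    "Giovanni": [0, 0, 2, 2],
    "Giulia":   [0, 1, 2, 2],
    "Gianni":   [0, 0, 2, 1],
}
direction, scores = first_principal_component(list(people.values()))
for name, score in zip(people, scores):
    print(f"{name:9s} {score:+.2f}")
```

The first three names land on one side of zero and the last three on the other: nearness on the axis reflects nearness in genotype. The real study kept the two most explanatory axes and had 500,000 columns rather than four.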

You can do this sort of study because the cost of generating the primary data has collapsed over the last 30 years. The first ever chunk of genomic DNA, yeast Saccharomyces cerevisiae chromosome III, was contracted by the EEC (=EU) 30 years ago at 320,000 ecus (=€) or €1 /base. We carried out the first non-trivial added-value analysis of that data - one of my three big ideas in science. With that stepping stone achieved, planners looked to sequencing The Human Genome: it cost €300,000,000 (10c per base) and took ten years. Now you can sequence A human genome for €1,000; it will take a day; and there is enough server power to do many genomes in parallel. So 3,000 genomes is quite affordable in a big science sort of way.
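The scale of that collapse is worth a back-of-envelope check. The sketch below uses only the figures quoted in the paragraph above; the variable names are mine, and the genome lengths are simply what those prices imply, not independent measurements.

```python
# Figures quoted in the text; names are illustrative.
CHR3_COST_EUR = 320_000            # yeast chromosome III contract
CHR3_COST_PER_BASE_EUR = 1         # "EUR 1 / base"
HGP_COST_EUR = 300_000_000         # The Human Genome
HGP_COST_PER_BASE_CENTS = 10       # "10c per base"
GENOME_COST_NOW_EUR = 1_000        # A human genome today
COHORT_SIZE = 3_000

# Lengths implied by dividing total cost by cost per base.
chr3_bases = CHR3_COST_EUR // CHR3_COST_PER_BASE_EUR
human_bases = HGP_COST_EUR * 100 // HGP_COST_PER_BASE_CENTS

cost_per_base_now = GENOME_COST_NOW_EUR / human_bases
fold_cheaper = CHR3_COST_PER_BASE_EUR / cost_per_base_now
cohort_cost = COHORT_SIZE * GENOME_COST_NOW_EUR

print(f"implied human genome length: {human_bases:,} bases")
print(f"cost per base now: EUR {cost_per_base_now:.2e}")
print(f"fold cheaper than chromosome III: {fold_cheaper:,.0f}")
print(f"3,000 genomes: EUR {cohort_cost:,}")
```

On these numbers a base costs about three million times less than it did for chromosome III, and the whole 3,000-genome cohort comes in at about 1% of the original Human Genome bill.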

What is most striking about the distribution of genomes across the most explanatory axes of the PCA landscape is how closely it maps onto the geography of Europe. The pale blue of Greece and the Balkans is nearest to Turkey over on the right; the grey Italian peninsula runs parallel and a little more distant; and further away again is a purple peninsula of Iberian genomes. At the opposite end of the continent, the Irish intercalate with the Brits; the Scandinavians have both shared and separate identity etc. etc. If you look closely, you can see Paddy-No-Pals off on his own in the sea like a sort of Uber-Irish outlier. Maybe she is not Paddy at all but Caitlín Ní Uallacháin. Also note the five rogue ITs in the sea at bottom left of the diagram; they do indeed have Italian passports but they are actually Sardinians. There is no evidence here that the compatriots of Szilárd, Wigner, von Neumann and Teller come from Mars.

I was doing a similar analysis waaaay back in the 1980s. It took me 2 years of tramping the streets of towns and cities in New England and the Canadian Maritimes, scoring genetic variation in domestic cats Felis catus to gather a sample of 10,000 cats in 35 different populations diagnosed for 7 genetic variants. (35 x 35)/2 x 7 is quite a bit smaller than (3000 x 3000)/2 x 500,000 !! But it was all my own work. One finding was that genetic distance was correlated (highly significant statistically) with geographic distance but that relationship only explained 16% of the variability in the sample. 84% of the variation was noise - some of which could be accounted for by the history of the patterns of French, English and Dutch colonisation in the 1600s. That was what my PhD thesis concluded aNNyway.
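To put numbers on "quite a bit smaller", here is a short sketch. The n x n / 2 in the text is a handy approximation; the code uses the exact count of distinct pairs, n(n-1)/2, via `math.comb`. The function name `comparisons` is mine.

```python
from math import comb


def comparisons(samples, sites):
    """Exact number of pairwise comparisons across all variable sites."""
    return comb(samples, 2) * sites


cats = comparisons(35, 7)
europeans = comparisons(3000, 500_000)

print(cats)       # 4165
print(europeans)  # 2249250000000
print(f"ratio: about {europeans / cats:.1e}")
```

So the European dataset involves over 500 million times as many comparisons as the cat survey, which is why one was a PhD on foot and the other needed a server.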

When you cough up your $100 to get your DNA sequenced, 23andMe will compare your DNA to a database like this one and place your genome on the map. Unless you are truly and incestuously descended from the Pharaohs, your genome will be a mess of fragments from the miscegenation of your ancestors. 23andMe will give you a summary sound-byte like "50% Irish; 25% English; 20% French; a toe from the Maghreb and a Neanderthal fingernail". You may take that assessment with a huge pinch of salt because the data will be inherently noisy.