Thursday, July 13, 2023

DNA Reconstruction: Mathematics, Segments and AI

My buddy is a techie who is helping me with this project, and he is sharpening his AI skills with it. It has not been as easy as I initially thought, but I am confident AI will have an answer one day.


Until then, I have changed my goal to segment isolation and prediction. This means finding triangulated DNA segments, assigning them to ancestors, and extrapolating what their inheritance means for all the other ancestors and descendants.


Definition of a segment.

A segment is a string of DNA codes inherited from an ancestor. A person inherits 23 segments from their mother and 23 from their father, one for each chromosome.

A shared segment is an identical segment inherited by more than one descendant.


Sibling shared segments

With siblings there is a special relationship. We can be fully identical, half identical or non matching.  Fully identical is when we inherit the exact same DNA from both our parents at a specific location on a chromosome.  Half identical is when we inherit the exact same DNA from only one parent  at a specific location on a chromosome. Non matching is when we inherit entirely different DNA from both parents at a specific location on a chromosome.

Saying "at a specific location on a chromosome" can get long-winded, so let's just assume from now on that when I refer to fully, half or non matching, it is implied. If siblings were actually fully identical across the whole genome, they would be identical twins.

When fully identical, the siblings, say Alice and Bryan, have inherited their DNA from the same two grandparents, for example the maternal grandfather and the paternal grandmother. A sibling who is only half identical to them, say Catherine, would have inherited DNA from only one of those grandparents, for example the maternal grandmother and the paternal grandmother. Another sibling who is half identical to Alice and Bryan, say Daniel, could have inherited DNA from the maternal grandfather and the paternal grandfather, making him non matching to Catherine.

Alice = MGF and PGM, Bryan = MGF and PGM, Catherine = MGM and PGM, Daniel = MGF and PGF
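The sibling states can be worked out mechanically from labels like these. Here is a minimal sketch in Python, assuming each sibling is described by a (maternal grandparent, paternal grandparent) pair at one location. The function name `match_state` and the dictionary layout are my own illustration, not part of any DNA tool.

```python
from itertools import combinations

# Grandparent labels at one location: (maternal side, paternal side).
siblings = {
    "Alice": ("MGF", "PGM"),
    "Bryan": ("MGF", "PGM"),
    "Catherine": ("MGM", "PGM"),
    "Daniel": ("MGF", "PGF"),
}

def match_state(a, b):
    """Return 'FI', 'HI' or 'NM' for two (maternal, paternal) label pairs."""
    # Count how many of the two sides carry the same grandparent.
    shared = (a[0] == b[0]) + (a[1] == b[1])
    return {2: "FI", 1: "HI", 0: "NM"}[shared]

# Compare every pair of siblings at this location.
for x, y in combinations(siblings, 2):
    print(x, y, match_state(siblings[x], siblings[y]))
```

Run on the labels above, Alice and Bryan come out FI, Catherine and Daniel come out NM, and every other pair is HI.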

When we compare our DNA to our siblings, we will find multiple random areas of fully, half and non matching DNA throughout the genome. At any point all siblings may be fully identical, or any combination. There will be times when a pair of siblings goes from half identical to fully identical on a single chromosome. In fact, siblings can go from non matching to half to fully to half and back to non matching all across one chromosome. What will never happen is for a pair of siblings to go directly from fully identical to non matching.

The switches between the states (FI, HI, NM) were set at conception, when the genes were inherited from our parents (and through them, our grandparents). Randomly, at various points across our genome, we inherited from either our maternal grandfather or our maternal grandmother, but only one or the other. Just as randomly, we inherited from one or the other of our paternal grandparents. These random shifts will very rarely (read: never) happen at exactly the same point on our maternal and paternal sides. That is what would have to happen for us to go from fully matching to non matching. It is just not going to happen.
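This rule makes a handy sanity check on a list of states read along a chromosome. A small sketch, assuming the states are given in order as the strings 'FI', 'HI' and 'NM'. The helper name `is_plausible` is hypothetical.

```python
# Rank the states so that neighbours differ by one step: NM < HI < FI.
ORDER = {"NM": 0, "HI": 1, "FI": 2}

def is_plausible(states):
    """True if no two adjacent states jump directly between FI and NM."""
    return all(abs(ORDER[a] - ORDER[b]) <= 1
               for a, b in zip(states, states[1:]))

print(is_plausible(["NM", "HI", "FI", "HI", "NM"]))  # True
print(is_plausible(["HI", "FI", "NM"]))              # False
```

A False result does not prove an error, since the rule is "very rarely" rather than a law of nature, but it is a good reason to look again at the segment boundaries.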


The locations on our chromosomes where we shift from one grandparent to the other I will call switches. These switch points are the beginning and ending of the segments we share with our siblings. These points can be labelled and will never move. A switch point affects how you relate to all of your siblings. If Alice were to switch away from being fully identical with Bryan, she would then be fully identical to either Daniel or Catherine. Then, until the next switch point in one of the siblings, the inheritance would be known. If Alice were FI to Daniel, we would know she is no longer inheriting DNA from the paternal grandmother but is instead inheriting from the paternal grandfather.

Alice = MGF and PGF, Bryan = MGF and PGM, Catherine = MGM and PGM, Daniel = MGF and PGF
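The effect of a single switch on Alice can be checked with a few lines of Python. This is only a sketch using the example labels from before the switch. `apply_switch` and `fi_partners` are hypothetical helper names.

```python
# Labels before Alice's switch: (maternal side, paternal side).
before = {
    "Alice": ("MGF", "PGM"),
    "Bryan": ("MGF", "PGM"),
    "Catherine": ("MGM", "PGM"),
    "Daniel": ("MGF", "PGF"),
}

# A switch swaps grandfather and grandmother on one side.
FLIP = {"MGF": "MGM", "MGM": "MGF", "PGF": "PGM", "PGM": "PGF"}

def apply_switch(assignments, name, side):
    """Return a copy of assignments where `name` switches on `side`."""
    maternal, paternal = assignments[name]
    if side == "maternal":
        maternal = FLIP[maternal]
    elif side == "paternal":
        paternal = FLIP[paternal]
    else:
        raise ValueError("side must be 'maternal' or 'paternal'")
    updated = dict(assignments)
    updated[name] = (maternal, paternal)
    return updated

def fi_partners(assignments, name):
    """List the siblings who carry the same two grandparents as `name`."""
    return [other for other, gps in assignments.items()
            if other != name and gps == assignments[name]]

print(fi_partners(before, "Alice"))                                     # ['Bryan']
print(fi_partners(apply_switch(before, "Alice", "paternal"), "Alice"))  # ['Daniel']
print(fi_partners(apply_switch(before, "Alice", "maternal"), "Alice"))  # ['Catherine']
```

Reading it the other way round is the useful part: seeing who Alice becomes FI with after the switch tells us which side switched.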

What else do we know? We know that a child inherits only 50% of each parent's DNA. Hence, at any location, a child has the DNA from either the paternal grandfather or the paternal grandmother, and the DNA from either the maternal grandfather or the maternal grandmother.

Let's take a step back here for now. I have identified the grandparents above, but in reality the only way to do this is to find a third person who can help us label those segments. The segment Alice continues to share with Bryan is from their maternal grandfather. One way to identify this is a relative of the maternal grandfather who matches both Alice and Bryan along this segment. We would also see this same relative matching Daniel. Another way to identify this would be a relative of their MGM who matched Catherine but did not match any of the other siblings. Using pure logic, there are many ways of determining the value of the segments once we have a cousin more distant than a full first cousin.

One final item. When a pair of siblings goes from HI to FI, back to HI and then to NM, it is impossible to know, looking at just those two siblings, whether each switch was on the maternal or the paternal side. Let's say Alice and Bryan begin sharing a segment at position 0 and they are HI. Let's say the shared DNA is from their PGM. Since they are HI, one is inheriting from the MGM (Alice, though we wouldn't know) and the other from the MGF (Bryan). Let's say they become FI at position 5000: one of them (Bryan) has switched, so let's assume they are now both MGM. When they go back to HI at 7000, one of them has switched a grandparent. There is no way to determine, just from looking at these two siblings, which grandparent and which sibling switched. At position 9000 they no longer match.

This means they could share the DNA from their PGM from 0 to 7000 or from 0 to 9000. They could be sharing the DNA from their MGM from 5000 to 7000, or from 5000 to 9000. Though we will easily see the shared segment(s) from 0 to 9000 and the FI segment from 5000 to 7000, it is impossible to know, without more information, whether the sharing is one long segment from one grandparent plus one small segment from another grandparent lying entirely within that range, or two medium segments whose ranges partially overlap.
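The two readings can be written out explicitly. A sketch, assuming (as in the example) that the first HI stretch is labelled PGM and the grandparent that joins at the FI stretch is MGM. The function name and its parameters are hypothetical.

```python
def interpretations(hi_start, fi_start, fi_end, hi_end,
                    first="PGM", second="MGM"):
    """Return both possible segment layouts for an HI-FI-HI run.

    `first` is the grandparent shared from hi_start; `second` is the
    grandparent that becomes shared at fi_start. Both labels are assumed.
    """
    return [
        # Two medium segments that partially overlap.
        {first: (hi_start, fi_end), second: (fi_start, hi_end)},
        # One long segment with a short one nested inside it.
        {first: (hi_start, hi_end), second: (fi_start, fi_end)},
    ]

for layout in interpretations(0, 5000, 7000, 9000):
    print(layout)
```

Both layouts produce exactly the same HI, FI, HI picture between the two siblings, which is why a third person is needed to choose between them.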

Example:

Here you can see my brother and me compared on Chromosome 3. We match on almost the entire chromosome except for the part in the center, in red. The two green areas are where we are FI; the yellow is HI. What is happening here is that GEDmatch is comparing the allele values in my DNA to my brother's DNA. When both alleles match at a certain position a green tick is added, when just one matches it is a yellow tick, and when none match it is a red tick. Where it is green, we have inherited the same DNA from both of our parents. For each parent, this means either that parent's father's DNA or their mother's DNA. Where it is yellow, we inherited the same DNA from one parent but different DNA from the other. Where it is red, we did not inherit any shared DNA; perhaps I inherited all grandmother (maternal and paternal) DNA and he inherited all grandfather DNA. Using DNA matches and logic, we should be able to determine which DNA we inherited from which grandparent. Because of the red areas, we know all four grandparents are represented here at least once. If the only data were my brother and me, the values in the yellow area would mean we have lost the DNA from one grandparent, and where it is green, we have lost the values of two grandparents. I look at FI segments as actually 2 HI segments. Each segment can ultimately be assigned to one of four grandparents.
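The tick colors described above can be imitated with a simple allele count. This is my own simplified sketch of the idea, not GEDmatch's actual code. It assumes each genotype is a two-letter string such as "AG", with the order of the letters not mattering.

```python
from collections import Counter

def tick_color(genotype_a, genotype_b):
    """Color for one position, based on how many alleles two people share."""
    # Multiset intersection: "AG" vs "GA" shares 2, "AA" vs "AG" shares 1.
    shared = sum((Counter(genotype_a) & Counter(genotype_b)).values())
    return {2: "green", 1: "yellow", 0: "red"}[shared]

print(tick_color("AG", "GA"))  # green
print(tick_color("AA", "AG"))  # yellow
print(tick_color("AA", "GG"))  # red
```

A single position says very little on its own, since unrelated people often share alleles by chance. It is the long runs of green or yellow that mark the FI and HI segments.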

Half matching segments:


Full matching segments:

Full First Cousin Shared Segments

Once we get to first cousins, segments are no longer classified as FI or HI. That is because they can only be HI by definition, inherited from only one of our parents. So when we see a segment we share with a maternal first cousin, we know that segment came from our mother. Let's say our mothers were sisters.

Because this segment is HI and shared between siblings (our mothers), the rules for siblings apply, and we know we inherited it from only one of our maternal grandparents.

Example:


Here is data from our comparison to our maternal first cousin Angela. Because only I match her from 20.7-26.4 and 72.1-89.0, we can deduce that the half matching segment shown above between my brother and me in that range was DNA from my father. Likewise, we can deduce that the portion we share, from 93.5 to 112.9, is from my mother. Jumping ahead, this means the allele values over this entire section that we share with my cousin are the same allele values that my mother had. Therefore that section of my mother's DNA is known. Conversely, the values that are left over are my father's values. Even further, where only I match Angela, we can determine that the corresponding unmatched alleles belong to one paternal grandparent, and that Todd's segment that does not match either of us contains alleles from my other paternal grandparent and the other maternal grandparent.
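The same deduction can be sketched as a little interval check. It assumes the region being tested is one where my brother and I are HI, and that Angela is related only through our mother. The segment lists are read off the ranges quoted above, and the function names are hypothetical.

```python
# Segments matching Angela, as (start, end), taken from the text above.
my_matches = [(20.7, 26.4), (72.1, 89.0), (93.5, 112.9)]
brother_matches = [(93.5, 112.9)]

def overlaps(region, segments):
    """True if `region` overlaps any segment in the list."""
    start, end = region
    return any(s < end and start < e for s, e in segments)

def shared_parent(region, mine, brothers):
    """Which parent an HI region between two brothers came from.

    Assumes the brothers are HI across `region` and the cousin is maternal.
    """
    i_match = overlaps(region, mine)
    he_matches = overlaps(region, brothers)
    if i_match and he_matches:
        return "mother"   # we both carry the cousin's maternal segment
    if i_match != he_matches:
        return "father"   # we differ on the maternal side, so we share paternal
    return "unknown"      # the cousin gives no information here

print(shared_parent((20.7, 26.4), my_matches, brother_matches))   # father
print(shared_parent((93.5, 112.9), my_matches, brother_matches))  # mother
```

Where neither of us matches Angela the function returns "unknown", because we could be sharing either parent's DNA there.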

Matches to first cousin: 



Segments shared with Nieces/Nephews

Though only HI, these are a bit more complicated. When we share a segment with a niece or nephew that lies completely within a non-switching HI range shared with their parent, we can be sure that segment can be attributed to a single grandparent (GP). However, in regions where we are FI with a sibling, segments shared with their child cannot be so easily attributed.








