Discussion about this post

lissnup

Thoroughly enjoyable read. It's refreshing to find so much detail presented in such a clear and accessible form.

Once you see these similarities laid out, it's striking that no such comparative analysis has been attempted before now. I hope this stimulates lots of discussion and even further analysis.

Two of the characters you mention have not gained as much attention as the likes of Fauci, Daszak, Shi or even Baric: Lin-Fa Wang and Garry Crameri. Their role in the SARS-CoV-2 drama is worthy of scrutiny imho.

Congrats on your first post, and welcome to Substack. I hope we'll have a chance to read more from you.

henjin

The following code downloads FASTA files with nucleotide and amino acid sequences of SARS-like viruses, aligns the spike protein sequences, and sorts the sequences by their number of mismatches to Tor2 in the region that features the DATSTGNYNYKYRYLR sequence in Tor2:

brew install mafft seqkit brewsci/bio/snp-dists xmlstarlet gawk # gawk is needed for the arrays of arrays used below

curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1j-YFiMYG4DkVKSget2fYW-gaJDy6NCkW' # 335 aligned sequences of SARS-like viruses from GenBank

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta_cds_aa&id='$(seqkit seq -ni sarslike.fa|paste -sd, -)>sarslike.aa

seqkit grep -nrp spike\|surface sarslike.aa|mafft ->spike.aln

snp-dists sarslike.fa>sarslike.dist

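curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=gb&retmode=xml&id='$(seqkit seq -ni sarslike.fa|paste -sd, -)>sarslike.xml # assumed step: the original commands never create sarslike.xml, and GenBank XML is what the GBSeq queries below expect

xml fo -D sarslike.xml|xml sel -t -m //GBSeq -v GBSeq_accession-version -o $'\t' -v GBSeq_definition -o $'\t' -v GBSeq_create-date -o $'\t' -v './/GBQualifier[GBQualifier_name="collection_date"]/GBQualifier_value' -o $'\t' -v '(.//GBAuthor)[1]' -o ... -v '(.//GBAuthor)[last()]' -o $'\t' -v '(.//GBReference_title[text()!="Direct Submission"])[last()]' -o $'\n'>sarslike.tsv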

tab(){ gawk '{if(NF>m)m=NF;for(i=1;i<=NF;i++){a[NR][i]=$i;l=length($i);if(l>b[i])b[i]=l}}END{for(h=1;h<=NR;h++){for(i=1;i<=m;i++)printf(i==m?"%s\n":"%-"(b[i]+n)"s",a[h][i])}}' "${1+FS=$1}" "n=${2-1}";} # `tab \\t` is like `column -ts$'\t'` but it doesn't get thrown off by empty fields; it uses gawk for arrays of arrays and loops over row numbers so that the input order is kept

x=NC_004718.3;seqkit subseq -r490:506 spike.aln|seqkit fx2tab|sed $'s/_prot_[^\t]*//;s/lcl|//'|gawk '{l=length($2);for(i=1;i<=l;i++)a[$1][i]=substr($2,i,1);b[$1]=$2}END{for(i in a){d=0;for(j=1;j<=l;j++)if(a[targ][j]!=a[i][j])d++;print i"\t"b[i]"\t"d}}' targ=$x|awk 'NR==FNR{a[$1]=$2;next}{print$3,$2,a[$1],$1}' {,O}FS=\\t <(seqkit seq -n sarslike.fa|sed $'s/ /\t/;s/, complete genome//') -|sort -n|awk -F\\t 'NR==FNR{a[$1]=$2;next}{print a[$4]"\t"$0}' <(awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i==x)break;next}{print$1 FS$i}' x=$x sarslike.dist) -|sort -n|awk 'NR==FNR{a[$1]=$3 FS$4 FS$5;next}{print$0"\t"a[$NF]}' {,O}FS=\\t sarslike.tsv -|tab \\t

I posted the output of the shell commands here: https://pastebin.com/raw/GDm9PNqD.

Eight bat SARS viruses featured the sequence DATSTGNHNYKYRYLRH, which has only one mismatch from Tor2: BtRs-BetaCoV/YN2018B, Rs9401, Rs7327, YN2016C, YN2016D, YN2016E, YN2016A, and YN2016B. They all have between 1254 and 1283 nucleotide changes from Tor2. WIV1 has about a hundred fewer nucleotide changes from Tor2 (1150), but it has two mismatches (DATQTGNYNYKYRSLRH). The only genome with three mismatches is "Rhinolophus affinis coronavirus isolate LYRa11" (DATSSGNFNYKYRSLRH), where the number of mismatches is pretty low considering that the whole genome has 2672 nucleotide changes from Tor2. The LYRa11 sequence was published in 2014 as part of a paper titled "Identification of Diverse Alphacoronaviruses and Genomic Characterization of a Novel Severe Acute Respiratory Syndrome-Like Coronavirus from Bats in China".
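The mismatch counts for this region can be checked by hand with a small helper. This is only a sketch: `mismatches` is a hypothetical function name that is not part of the commands above, and it assumes both peptides have the same length, as they do when cut from the same alignment columns.

```shell
# Hypothetical helper: count the positions at which two equal-length peptide strings differ.
mismatches(){ awk -v a="$1" -v b="$2" 'BEGIN{if(length(a)!=length(b)){print "length mismatch">"/dev/stderr";exit 1}d=0;for(i=1;i<=length(a);i++)if(substr(a,i,1)!=substr(b,i,1))d++;print d}';}

tor2=DATSTGNYNYKYRYLRH
mismatches $tor2 DATSTGNHNYKYRYLRH # YN2018B and the seven other bat viruses; prints 1
mismatches $tor2 DATQTGNYNYKYRSLRH # WIV1; prints 2
mismatches $tor2 DATSSGNFNYKYRSLRH # LYRa11; prints 3
```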

The Y?Y?Y pattern of three Y residues interspaced by single other residues is also featured in Wuhan-Hu-1: DSKVGGNYNYLYRLFRK. The region is identical in BANAL-52, BANAL-236, and BANAL-103. But in RaTG13 the first four residues are DAKE instead of DSKV. And ZC45 has deletions in the middle of the sequence: "DV---GN--YFYRSHRS".
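A quick way to test a peptide for the Y?Y?Y pattern is a regular expression. This is a sketch only: `has_yxyxy` is a hypothetical helper name, and it removes alignment gap characters before matching, which is an assumption about how gapped sequences should be treated.

```shell
# Hypothetical helper: print "yes" if the peptide contains Y, any residue, Y, any residue, Y.
# Alignment gaps ("-") are stripped first.
has_yxyxy(){ printf '%s\n' "$1"|sed 's/-//g'|grep -Eq 'Y.Y.Y'&&echo yes||echo no;}

has_yxyxy DATSTGNYNYKYRYLRH # Tor2; prints yes
has_yxyxy DSKVGGNYNYLYRLFRK # Wuhan-Hu-1; prints yes
has_yxyxy DV---GN--YFYRSHRS # ZC45; prints no
```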
