G-Quadruplex Recognition by DARPins through Epitope/Paratope Analogy

Abstract We investigated the mechanisms leading to the specific recognition of guanine quadruplexes (G4s) by DARPin peptides, which can guide the design of G4-specific sensors. To this end, we carried out all-atom molecular dynamics simulations to unravel the interactions between specific nucleic acids, including the human-telomeric (h-Telo), Bcl-2, and c-Myc G4s, and different peptides forming a DARPin/G4 complex. By comparing the sequence of a DARPin with that of a peptide known for its high affinity for c-Myc, we show that the recognition cannot be ascribed to sequence similarity but, instead, depends on the complementarity between the three-dimensional arrangement of the molecular fragments involved: the α-helix/loops domain of the DARPin and the G4 backbone. Our results reveal that the DARPin tertiary structure presents a charged hollow region in which the G4 can be hosted; thus, the more complementary the structural shapes, the more stable the interaction.


Introduction
In addition to the well-known double-helical arrangement, the important biological role of non-canonical nucleic acids is nowadays widely recognized. Among the different non-canonical DNA or RNA structures, guanine quadruplexes (G4s) are highly studied and characterized. [1][2][3] From a chemical point of view, G4s are formed in guanine-rich nucleic acids, whose nucleobases develop primarily cooperative Hoogsteen-type hydrogen bonds. Because of the specificity of this interaction, DNA (or RNA) is then organized in a series of stacked quartets, whose macromolecular arrangement is further stabilized by a metal cation occupying the central channel. Three main topologies can be adopted by non-canonical G4 structures depending on the 5'-3' orientation of the strands forming the G4 backbone: parallel, antiparallel and hybrid (Figure 1). In the parallel conformation, all the strands are oriented in the same way; in the antiparallel conformation, the strands are inversely oriented two by two; in the hybrid conformation, only one strand is inversely oriented with respect to the other three. [4] G4s have been identified in cellular nucleic acids and have been associated with the control of key biological functions. As a matter of fact, G4s are involved in gene regulation, [5,6] in neurodegenerative diseases, [7,8] in the induction of DNA damage and in oncogenesis. [9][10][11][12] Furthermore, they have been recognized to play a role in the regulation of cellular cycles and in the regulation of post-translational modifications in proteins. [12][13][14][15] G4s have also been identified in both DNA and RNA viral genomes, including SARS-CoV-2, [16][17][18] where they may exert vital functions in regulating viral infection cycles. [19][20][21][22] All these processes can only take place through a molecular machinery involving proteins that selectively recognize specific DNA or RNA G4s.
Among them we may cite the ATP-dependent DNA/RNA helicase DHX36, [23] G-rich sequence factor 1, [24] or fragile X mental retardation protein (FMRP). [25] Thus, the development of artificial or biomimetic specific G4 binders, capable of recognizing either DNA or RNA, is highly valuable. Furthermore, such ligands may be exploited either in a therapeutic context or for the rapid identification and localization of G4s in cells or cellular compartments. [26][27][28][29] Protein engineering has also led to the development of antibodies presenting specificity and selectivity toward G4s. In this case, the recognition of G4s proceeds through the epitope/paratope mechanism, in which the G4 acts as the epitope of an antibody. [30] However, the design of antibodies is definitely not straightforward, and their use is typically limited to the identification of the subcellular localization of G4s. [30][31][32][33] Smaller peptides specifically recognizing G4s and even discriminating between different G4 types have also been proposed. This is typically the case of DARPins, [34] a class of synthetic proteins derived from the modification of natural ankyrins and mostly known as chaperone agents in crystallography. [35] In addition, DARPins have also been used as cellular markers in biological imaging and for therapeutic purposes. [36,37] Understanding the factors underlying the specific recognition of G4s by DARPins can facilitate the design of sensors able to discriminate the subcellular localization of G4s and their specific sequences. In this contribution, we model the DARPin/G4 interaction and, thus, unravel the specific recognition modes by combining molecular docking and long-timescale all-atom molecular dynamics (MD) simulations. We focus on the 2E4 DARPin, which is specific for the G4 present in the c-Myc oncogene promoter, and on 2G10, which has a slight specificity for different G4s. [34] As for the nucleic acid counterpart, we restrict our study to the human telomeric G4, as well as the G4s in the c-Myc [34] and Bcl-2 promoters.

Results and Discussion
MD simulations shed light on the structural details underlying the specific DARPin/G4 interaction. Indeed, by sampling the conformational space through different initial interaction positions, it is possible to analyze whether the interaction is conserved, whether the binding of the G4 affects the flexibility, whether the nucleic acid rearranges to reach a more stable pose, or whether the proposed DARPin/DNA complex is unstable and separates. In our case, most of the G4/DARPin complexes are persistent and stable all along the MD, and the peptides interact with the G4s through regions composed of large loops and helices, which overlap well with the recognized canonical interaction zones of the DARPins.
The only exceptions are found for 2E4/c-Myc, which in one of the poses leads to a very labile and mobile interaction between the G4 and the protein, as confirmed by clustering yielding two dominant structures, representing 41.71 % and 30.56 % of the trajectory, respectively. On the other hand, 2E4/h-Telo (64.33 % of the trajectory) and 2G10/c-Myc (75.10 %) yield dominant clusters that interact only through the loop ends of the DARPins and one or two nucleotides of the flexible G4 loops. (All the clustered structures can be found in the Supporting Information.)
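The cluster populations quoted above are fractions of trajectory frames assigned to each cluster. As a minimal sketch of that bookkeeping, and not the CPPTRAJ workflow itself, the following Python snippet computes percentages from a list of per-frame cluster labels. The function name `cluster_populations` and the label list are hypothetical and purely illustrative; they are not data from the simulations.

```python
from collections import Counter

def cluster_populations(labels):
    """Return {cluster_id: percentage of frames}, most populated cluster first."""
    counts = Counter(labels)
    total = len(labels)
    return {cid: 100.0 * n / total for cid, n in counts.most_common()}

# Hypothetical per-frame cluster labels (illustrative only)
labels = [0, 0, 0, 1, 1, 0, 2, 0, 1, 0]
pops = cluster_populations(labels)
for cid, pct in pops.items():
    print(f"cluster {cid}: {pct:.2f} % of the frames")
```

A trajectory whose leading cluster holds only about 40 % of the frames, with a second cluster close behind, is what the text describes as a labile interaction.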

Residue-scale analysis of the G4/DARPin complexes
Before exposing the structural details of the G4/DARPin complexes at the atomistic scale, it is interesting to consider the interaction at a residue-level scale, and in particular to classify the different interaction patterns in terms of the number of nucleic acid or protein residues involved. Figure 2 shows all the residues that remain within a cutoff of 3 Å from either the protein or the nucleic acid for at least 50 % of the simulation time. On average, six nucleic acid residues of h-Telo and eight amino acids of 2G10 can be identified. However, 2G10 interacts persistently through only five amino acids with c-Myc and Bcl-2, which in turn only bring a maximum of two or three nucleotides into persistent contact with the protein. Conversely, the 2E4/h-Telo interaction appears to be driven by three nucleotides and six amino acids. For 2E4, the interaction gathers eight amino acids with both c-Myc and Bcl-2, yet a different number of nucleic acid residues is involved, that is, four for Bcl-2 and seven for c-Myc.
This first analysis, at the residue level, already draws a general picture of the specific recognition of G4s by DARPins, confirming that the protein/nucleic acid recognition is favored by a high number of interacting residues. However, it needs to be completed by identifying the exact nature of the interacting residues, their specific contact frequencies, and the specific structural features.
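The persistence criterion used in this subsection (a residue within 3 Å of the partner molecule for at least 50 % of the simulation time) can be written compactly. The sketch below is illustrative only and assumes that a matrix of per-frame minimum distances has already been extracted from the trajectory; the function name `persistent_contacts`, the use of "less than or equal to" at the cutoff, and the toy distances are assumptions, not the analysis script used in this work.

```python
import numpy as np

def persistent_contacts(dist, cutoff=3.0, min_freq=0.5):
    """Contact frequency per residue and indices of persistent residues.

    dist: array of shape (n_frames, n_residues) holding, for each frame, the
          minimum distance (in Å) between a residue and the partner molecule.
    """
    dist = np.asarray(dist, dtype=float)
    in_contact = dist <= cutoff          # boolean contact map per frame
    freq = in_contact.mean(axis=0)       # fraction of frames in contact
    persistent = np.flatnonzero(freq >= min_freq)
    return freq, persistent

# Toy distances for 4 frames and 3 residues (hypothetical values)
toy = [[2.5, 4.0, 3.0],
       [2.8, 3.5, 6.0],
       [3.2, 2.9, 7.0],
       [2.9, 5.0, 2.0]]
freq, persistent = persistent_contacts(toy)
print(freq, persistent)
```

Counting the persistent residues on the protein side and on the nucleic acid side gives the kind of residue tallies discussed above.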

Three different modes of interaction leading to the G4/DARPin complexes
To refine the global analysis presented in the previous subsection, we should identify regions of the G4s that are selectively recognized by DARPins. As a matter of fact, h-Telo does not show any specific interacting region or hotspot with 2E4 or 2G10 (see Figure 3A). This could confirm that the nonspecific recognition of h-Telo is due to the absence of a well-defined target region on this nucleic acid. In contrast, more pronounced specific interaction regions may be recognized in the two other G4s. Indeed, it can be seen in Figure 3(B) that two interaction areas clearly stand out for Bcl-2 interacting with either 2E4 or 2G10, that is, the one including residues dC5 to dG8 and the one involving residues dG20 to dG22. Since the same nucleic acid regions are evidenced for both DARPins, the interaction mode can be classified as structurally similar in each complex. Thus, no specific recognition of Bcl-2 by 2E4 or 2G10 can be inferred, since such specific recognition should involve interaction areas that differ between two different DARPins. This is, indeed, the case for the G4 present in the c-Myc promoter. Figure 3(C) clearly shows regions of very pronounced contact that differ for each of the DARPins. For the interaction with 2G10, the hot spot includes residues dG15 to dT20, although the contact frequencies are still quite spread across the whole G4. On the other hand, 2E4 highlights two very strong and localized contact points. The most important one concerns residues dG6 to dA12, while the second one corresponds to the last two residues of the G4, dA21 and dA22. Thus, the specific recognition of c-Myc by 2E4 could be achieved either through the recognition of its sequence, or through a specific structural motif.
Our analysis indicates three possible scenarios: 1) a rather general interaction that does not involve any specific G4 region or sequence (h-Telo); 2) a nonspecific interaction involving particular G4 regions, which are, however, recognized by all the proteins (Bcl-2); 3) a specific interaction driven by a few nucleotides having very high contact frequencies with a specific DARPin (c-Myc).

Identification of a putative selective DARPin interaction area
Repeating the same analysis while focusing on the protein counterpart, we identify the amino acids mostly involved in the recognition of the non-canonical DNA structure. Similarly to what has already been observed for G4s, selectivity should correlate with a few specific amino acids having high contact frequencies with G4s. Conversely, for non-selective recognition a more scattered distribution of the interaction frequencies should be observed.
The distribution of the interaction contacts of 2G10 with the three G4s (Figure 4A) shows three distinct peaks. The first one corresponds to residues N34 and I35, the second one gathers residues R67, W68, R70, K78 and W79, while the last one comprises residues K100 and K101. While these localized protein areas certainly correspond to a strong interaction with G4, they appear rather non-specific since they are present for all three G4s. However, the interaction with h-Telo is also driven by amino acids whose contact frequency is low or zero for the other G4s. This concerns mainly residues H107, L108, I111, R112, K133, F134, K136, and I141. Nonetheless, caution should be taken to avoid overinterpretation of this result, since 2G10 does not show any specificity for h-Telo. [34] The contact frequency for 2E4 (Figure 4B) shows the emergence of even more defined trends presenting stronger and more localized maxima. In particular, we can mention residues K5, E9, and R12 as well as the regions spanning residues R34, W35, and M46, and residues H67, W68 and R70. However, only a relatively small difference in the interaction patterns between the three G4s can be highlighted. In particular, in the case of c-Myc, residues Y45, R70, L75, S78, R79, and G80 develop persistent contacts, and hence could be regarded as potential hot spots for the selectivity of 2E4 towards this G4.
Although the 2G10/G4 complex involves a larger number of residues developing more persistent contacts than 2E4/G4, this should not necessarily be correlated with a higher affinity towards G4. Indeed, a few residues developing stable and persistent interactions may be regarded as more favorable than an extended weakly interacting region. Furthermore, 2G10 is larger than 2E4; thus, the higher number of contacting residues may also be regarded as an obvious statistical effect.

Alignment between 2E4 and a c-Myc-specific peptide reveals no sequence similarity
Several examples of peptides able to selectively bind G4s are reported in the literature. Usually, they are derived from the DHX36 helicase, whose α-helix provides the binding interface with the nucleic acid. [38,39] Notably, Minard et al. [40] designed a specific peptide, DM102 (PGHLKGRRIGLWYASKQGQKNK), which is able to preferentially recognize the G4 in the c-Myc promoter. Since 2E4 is also specific to c-Myc, it is legitimate to ask whether there is a sequence similarity between DM102 and 2E4. It is also important to note that the artificial peptide DM102 has a hydrocarbon staple (i, i + 7) between residues R8 and S15, which enforces a defined α-helix. Hence, in addition to sequence similarity, one must also look for a stable α-helical secondary structure, which is, indeed, a structural motif frequently present in DARPins. Two algorithms were used: Clustal Omega and M-Coffee (see Figure S37). Here it is important to specify that Clustal Omega is an individual alignment method, while M-Coffee is an algorithm that combines results from several individual methods. [41] Interestingly, the two algorithms show divergent results. Clustal Omega predicts similarity mainly concerning the 2E4 region spanning M46 to V62. The representation of the 3D structure of this region (see Figure S38) reveals a helical arrangement, which could partially support the 2E4 selectivity conditions. However, this region is also common to all DARPins, except for residue 58 (residue 70 following the notation of Scholz et al.), [34] which is embedded in the similarity region, and residues 45 and 46, which border it. Furthermore, the mutated residues at positions 46 and 58 are structurally very distant, suggesting a low quality of the alignment. Conversely, the M-Coffee alignment highlights three subunits. Two of them have no significance, the first being located in the previously invalidated region, and the third pertaining to the N-terminal region common to all DARPins. Instead, the second subunit aligns DM102 with the R70-R79 region of 2E4, as visually represented in Figure 5 by the transparent shaded area. This observation is also coherent with our MD simulations, which indicate an increase of the DARPin/c-Myc contact frequency for the residues belonging to this region. Furthermore, from a structural point of view, the R70-R79 region is organized in an α-helix motif and is located towards the canonical recognition zone of DARPins. Yet, this region is highly conserved among the DARPins designed by Scholz et al. [34] and only the residues bordering the helix, that is, R70, S78, and R79, have been mutated. Indeed, when the M-Coffee alignment is performed between DM102 and 2G10, the same DARPin region is evidenced (see Figure S39).

Chemistry-A European Journal
Research Article doi.org/10.1002/chem.202201824
Thus, the search for a conserved sequence between DM102 and 2E4 does not unambiguously justify the selectivity of the DARPin. Going a step further, this could suggest that the recognition of c-Myc by 2E4 does not necessarily involve sequence similarity between DM102 and 2E4. This is also supported by the fact that the mutations of the wild-type sequence performed by Scholz et al. [34] are mainly concentrated on the peripheral protein loops. Consequently, we decided to focus on structural features, which should add up to the rather modest sequence effects and, ultimately, drive the selectivity.
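Two properties of DM102 mentioned in this subsection can be checked directly from the sequence given in the text: the (i, i + 7) spacing of the staple between R8 and S15, and the predominance of positively charged residues, which is relevant to an electrostatically driven interaction with the nucleic acid backbone. The snippet below is a simple sketch; the formal charge count is a crude assumption (K and R counted as +1, D and E as -1, histidine and termini ignored, the staple chemistry not considered) and is not a result of this work.

```python
DM102 = "PGHLKGRRIGLWYASKQGQKNK"  # sequence as given in the text

def residue_at(seq, position):
    """One-letter code at a 1-based sequence position."""
    return seq[position - 1]

def formal_charge(seq):
    """Crude formal charge: K/R = +1, D/E = -1 (His and termini ignored)."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative

staple = (8, 15)  # stapled positions named in the text
print(len(DM102), residue_at(DM102, 8), residue_at(DM102, 15))
print("staple spacing:", staple[1] - staple[0])
print("formal charge:", formal_charge(DM102))
```

Under these assumptions the 22-residue peptide carries six basic residues and no acidic ones, which is at least consistent with a binding mode in which charge complementarity, rather than sequence identity with the DARPin, matters.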

2E4 recognizes a particular structural motif of c-Myc
From our MD simulations, two main factors should be considered when analyzing the local structural arrangements of the DARPin/G4 contact region. First, the DARPin canonical interaction zone is not consistently interacting throughout the whole MD simulation. Instead, as highlighted in Figure 4(B), other amino acids located either in α-helices or in peripheral loops develop more persistent interactions.
Furthermore, the analysis of the interaction networks shows that a DARPin/G4 complex is mainly stabilized by electrostatic interactions between positively charged amino acids and the negatively charged backbone of the nucleic acid. The paper by Scholz et al. [34] clearly excludes any interaction between DARPins and canonical double-stranded DNA. This fact also confirms that the recognition of the nucleic acid should involve important and specific structural motifs, as confirmed by our study. In addition, π-cation interactions are also present, mainly when the extended conjugated system of a tetrad faces the DARPin. Even if this interaction appears persistent along the MD simulation, it should be confirmed by using quantum chemistry-based modeling, or even hybrid quantum/classical approaches, to avoid any spurious force field artifact and to precisely calculate the interaction energy terms. However, such a study, even if highly interesting, would be out of the scope of the present contribution. Finally, the DARPin associates with the G4 through its canonical interaction zone involving the α-helix, but also via interactions mediated by the peripheral loops. The interaction with the loops is most pronounced, but not unique, in the case of h-Telo, which in the course of the MD simulation departs from the initial docking pose and slides over the DARPin surface until an interaction between its quartet and the peripheral loops is established at around 150 ns (Figure 6). Interestingly, the electrostatic interactions involving the G4 backbone take place mainly through the G4 external loops rather than the tetrad core. However, this conformation appears scarcely stable, and as a matter of fact the G4 oscillates and reverts to a more classic interaction mode involving one of its accessible quartets.
These observations are also consistent with the frequency distribution reported in Figure 3 and explain the specific behavior of h-Telo, which due to its high mobility spans different interaction poses and develops rather non-specific contacts with a high number of 2E4 and 2G10 residues.
Thus, the interaction mode involving a quartet does not lead to a specific recognition mode. Hence, interactions between c-Myc or Bcl-2 and a DARPin that are driven by the G4 quartets (Figure 7) will most probably be trapped in a non-specific recognition and cannot be used to draw conclusions on the specific recognition. On the contrary, specificity may be established when DARPins interact mainly with the nucleic acid backbone. The behavior of Bcl-2, which interacts in a similar non-specific way with 2E4, nicely confirms this statement. Indeed, despite different initial conditions, the G4 again positions itself exposing a quartet to the 2E4 DARPin interaction region. On the contrary, the interaction with 2G10 leads to the exposure of the nucleic acid backbone to the contact region of the DARPin and, hence, to a selective recognition. As a matter of fact, these results are also coherent with the contact frequency analysis showing that the Bcl-2 interaction with 2G10 is mostly driven by highly conserved and persistent amino acids.

2E4 recognizes a peculiar structural motif of c-Myc
c-Myc is the G4 that most consistently promotes an interaction via its backbone (Figure 8). This, in turn, could also point to a greater specificity of its recognition, although c-Myc is able to interact via its backbone with both 2E4 and 2G10. Thus, to further justify the selectivity of 2E4, a structural motif specifically recognized by this DARPin should be identified.
The main factor that could lead to a recognized structural motif is the presence of a backbone folding involving the nucleotides most frequently in contact with the DARPin. This feature can be easily assessed by clustering the MD simulation while checking the maintenance of the interaction patterns in the most populated clusters. By highlighting the nucleotides with the highest contact frequency with 2E4, i.e., G6 to A12, A21, and A22, we see that they are involved in the interaction with the protein for the two most important clusters (Figures 8 and 9A, B). However, the two clusters differ by a rotation of about 180° of the G4 on the protein surface (pose 1-1: 78.47 % of the MDs and pose 6-4: 79.55 % of the MDs), as shown in Figure 10. Nonetheless, a well-conserved structural motif is evidenced, determined by the folding of the G4 backbone into a U-shaped loop with an extruded nucleotide, further completed by a horizontal extension to the right, and overlaid by a dangling segment (Figure 9). Interestingly, all the structural characteristics are well evidenced in the most populated cluster, while in the secondary cluster structures their identification remains more elusive. Indeed, while the linear extension and the extruded nucleotide remain evident, the U-shaped loop and the appendix are less clearly visible.
The amino acids located in the interaction site are also conserved between the two most populated clusters (Figure 10). Residues Y45, R70, and R79 organize around the loop at an average distance of 3 Å, while Y35 is oriented towards the appendix and R34 points towards the linear extension. At a slightly higher distance of around 5 Å, M46 is interacting with the U-shaped loop, W68 is oriented towards the appendix, while R12 and D33 flank the linear extension. In addition, S78 stays close to the extruded nucleotide, probably ensuring a further stabilization.
By superposing the two most populated clusters of the 2E4/c-Myc complexes, a similar positioning of the G4 on the protein site is also observed, which is again consistent with the recognized backbone-based structural motif (Figure 10D). Finally, the increase of the contact frequency observed in the analysis of the MD simulations correlates well with the specific recognition of c-Myc by 2E4. Indeed, the region presenting the highest increase in the contact frequency corresponds to the residues recognizing the U-shaped loop and the extruded nucleotide. This, together with the similarity of the interaction pattern found in the superposition of the two G4 poses, further validates the hypothesis that the selective recognition of c-Myc by 2E4 is driven by the structural motif we have identified. Because of this structure-based recognition, and the often-invoked analogy between DARPins and antibodies, it is tempting to characterize this interaction pattern as an epitope/paratope recognition. Here, the paratope-like element is the 2E4 interaction site, and the epitope-like region is the G4 structural motif identified for c-Myc.

Figure 8. The c-Myc G-quadruplex interacts preferentially through its backbone A) in the dynamics resulting from the most stable docking pose and B) in the dynamics resulting from a less favorable pose, in which the G-quadruplex reorients itself to interact in a pose similar to the most stable pose found by docking. The previously identified high-frequency nucleotides are colored A) in cyan and B) in steel blue, respectively.

Conclusion
Our results highlight two modes of interaction for DARPin/G4 complexes. The first one is a non-specific recognition that is established when the G4 interacts through its guanine tetrad, or through peripheral nucleotides π-stacked with the tetrads. The second binding mode is driven by the specific recognition of the conformation of the G4 backbone and leads to a paratope/epitope-like DARPin/G4 recognition. This specific mode, which we have identified for the 2E4/c-Myc complex, is based on a peculiar folding motif of the G4 backbone and presents a U-shaped loop with a linear extension and an overhanging short appendix. Consequently, a large extension of the U-shaped loop, also including extruded nucleotides, should enhance the selective recognition of G4s. Conversely, the identification of backbone-based recognition motifs could also improve the rational design of DARPins. Indeed, the quest for selective G4 ligands has a tremendous significance, especially for the proposal of specific anticancer or antiviral agents. Our results, and the first identification of a paratope/epitope-like specific structural recognition, may lead to significant developments in the design of potentially therapeutic peptides targeting specific G4 arrangements.

Experimental Section
Structure of G4 and reconstruction of DARPins: The structure of the h-Telo G4 was retrieved from the Protein Data Bank (PDB entry 1KF1), [42] as well as those of c-Myc (1XAV) [43] and Bcl-2 (6ZX7). [44] The sequences of the 2E4 and 2G10 DARPins were obtained from the Supporting Information of Ref. [34] and their structures were reconstructed with the SWISS-MODEL server. [45] 2E4 was reconstructed based on its high similarity with the PDB entry 2CH4, [46] and 2G10 was reconstructed on the basis of its similarity with the 1SVX structure. [47]
Sequence alignment: The DM102 peptide and DARPin sequences were aligned using Clustal Omega on the EBI server [48] and the M-Coffee server, [49] with their default parameters.
Docking and selection of initial structures: The reconstructed DARPins and G4s were loaded onto the HADDOCK server [50] to perform protein/nucleic acid docking while searching the entire protein and the entire G4 structure and using the standard HADDOCK parameters. Three poses were selected from the docking results (Figure 11), always including the most favorable one. The selection was based on the relative position of the G4 with respect to the DARPin. The three poses correspond to an interaction with the G4 backbone, an interaction with the tetrads facing the nucleotides and an interaction developed in a peripheral region of the DARPin. This choice ensured a significant sampling of a complex conformational space, also including rather unfavorable interaction areas, such as the one corresponding to the peripheral binding.
Molecular dynamics simulations: MD simulations have been performed for 2E4 and 2G10 interacting with c-Myc, Bcl-2 and h-Telo. Three poses for each complex have been used as starting conditions. Each system was simulated in two independent replicas of 1 μs each; thus, a total of 36 simulations of DARPin/G4 complexes have been performed. In addition, simulations of the free G4s and DARPins have also been obtained as a control. All simulations have been run using the NAMD software [51,52] with the Amber parm99 force field [53] including the bsc1 corrections [54] for nucleic acids. A truncated octahedral box of TIP3P [55] water was used to solvate the systems, using periodic boundary conditions (PBC). All the calculations were performed in the isothermal and isobaric (NPT) ensemble at a temperature of 300 K and a pressure of 1 atm. A minimal concentration of K+ ions was added to ensure charge neutrality. Hydrogen Mass Repartitioning (HMR) [56] was consistently used, allowing, in combination with the Rattle and Shake algorithms, [57] a timestep of 4 fs to integrate Newton's equations of motion. Finally, the trajectories were analyzed and visualized with VMD, [58] as well as with a dedicated script to retrieve G4 structural parameters, [59] while CPPTRAJ [60] was used for clustering.
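The simulation bookkeeping described above (two DARPins, three G4s, three starting poses, two replicas of 1 μs, 4 fs timestep) can be verified with a few lines of Python. This is only an arithmetic sketch; the pose and replica labels are hypothetical identifiers introduced for illustration, and the control simulations of the free partners are not counted.

```python
from itertools import product

darpins = ["2E4", "2G10"]
g4s = ["c-Myc", "Bcl-2", "h-Telo"]
poses = [1, 2, 3]       # hypothetical labels for the three docking poses
replicas = [1, 2]       # two independent replicas per starting pose

# One entry per DARPin/G4 complex simulation
runs = list(product(darpins, g4s, poses, replicas))
n_runs = len(runs)

length_us = 1           # length of each replica, in microseconds
timestep_fs = 4         # integration timestep enabled by HMR, in femtoseconds

total_us = n_runs * length_us
steps_per_run = length_us * 10**9 // timestep_fs  # 1 μs = 10^9 fs

print(n_runs, total_us, steps_per_run)
```

The enumeration reproduces the 36 complex simulations stated in the text, corresponding to 36 μs of aggregate sampling and 2.5 × 10^8 integration steps per replica at a 4 fs timestep.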