PyLipID: A Python Package for Analysis of Protein–Lipid Interactions from Molecular Dynamics Simulations

Lipids play important modulatory and structural roles for membrane proteins. Molecular dynamics simulations are frequently used to provide insights into the nature of these protein–lipid interactions. Systematic comparative analysis requires tools that provide algorithms for objective assessment of such interactions. We introduce PyLipID, a Python package for the identification and characterization of specific lipid interactions and binding sites on membrane proteins from molecular dynamics simulations. PyLipID uses a community analysis approach for binding site detection, calculating lipid residence times for both the individual protein residues and the detected binding sites. To assist structural analysis, PyLipID produces representative bound lipid poses from simulation data, using a density-based scoring function. To estimate residue contacts robustly, PyLipID uses a dual-cutoff scheme to differentiate between lipid conformational rearrangements while bound from full dissociation events. In addition to the characterization of protein–lipid interactions, PyLipID is applicable to analysis of the interactions of membrane proteins with other ligands. By combining automated analysis, efficient algorithms, and open-source distribution, PyLipID facilitates the systematic analysis of lipid interactions from large simulation data sets of multiple species of membrane proteins.


SI FIGURES SI
. PyLipID package design: func, plot and util modules. These three modules are the inner layer of the package that contain functions for doing the heavy-lifting. The func module includes functions for calculation of lipid interactions, and can be divided into four sub-modules: interaction that contains functions/classes for calculation of durations, occupancy and lipid count; kinetics that contains functions for calculation residence time/koff; clusterer that contains functions to cluster the bound poses; and binding_site that contains functions for calculation of binding site related properties. The plot module provides the assisting functions for plotting, and can be divided into three sub-modules: koff that contains the function for plotting koffs; plot1d that contains functions for plotting properties as a function of residue index or binding site index; and plot2d that contains the function for plotting correlation. The util module provides various utility functions including creating directories, saving coordinates, and defining functions independent of PyLipID workflow.

SI SECTION S1
Choice of cut-offs and its impact on the reported values.
PyLipID defines contacts based on the measure of minimum distances, thus the accuracy of the calculation of lipid interactions, especially the calculation of binding site, will be determined by the choice of the distance cut-offs. By default, PyLipID uses a dual cut-off scheme in which a continuous contact starts when the minimum distance to the target residue is smaller than the lower cut-off and ends when the minimum distance is larger than the upper cut-off. This scheme was created to deal with the 'rattling in the cage' effect in coarse-grained simulations, in which molecules undertake rapid motions at their site without real dissociation (SI Fig S2). The contact distance and cutoff distances may vary among different lipid species. Comparison of the probability densities of cholesterol and PIP2 molecules from the A2aR and GABAAR (PDB id 6HUP) simulations revealed that PIP2 contact starts ~0.44 nm and the first peak of annular lipids locates at 0.49-0.5 nm, whereas cholesterol contact start at ~0.39 nm and the first peak of annular lipids at ~0.5-0.51 nm (SI Fig S3).
Based on how contacts are calculated in PyLipID, it is anticipated that the lower cut-off will affect the detection of lipid contacts and lipid binding sites, whereas the upper cut-off will affect the durations of these contacts. To illustrate the impact of cut-offs, we plotted the number of binding sites, the average durations of contacts, the average binding site surface area and the number of contacting residues for a sequence of dual cut-offs on a couple of GPCRs, including A2aR, A1R, µOR, ß2AR, GCGR and GLP1R, and a ligand-gate ion channel GABAAR (SI Fig S4). The lower cut-off ranged from 0.3 nm to 0.55 nm and the upper cut-off from 0.6 nm to 0.85 nm. For cholesterol interactions, GABAAR and GPCRs showed different changes of e.g. the number of binding sites, the averaged interaction durations, the number of contacting residues, in response to different cut-off values, due to the different protein structures and chemical properties of the protein surfaces of these two receptor families. For example, cholesterol has shorter contact durations with GABAAR regardless of the cutoff values, and more calculated binding sites for the lower cut-offs of 0.45 nm and 0.475nm, compared to the tested GPCRs. The diverse profiles of the two protein families could help to find some consistent rules concerning the choice of cut-off values.
Let us first look at the lower cut-off. Cholesterol contacts started to be detected at the lower cut-off of 0.4 nm in both protein families, whereas cholesterol binding sites were detected at 0.45 nm in the GPCRs and at the 0.425 nm in GABAAR. The starting distance for binding site detection is governed by the structure and geometry of the transmembrane region. The shorter lower cut-off for detection of binding sites in the latter system is due to that the GABAAR has straight transmembrane helices and the cholesterol binding sites on this protein largely consist of two adjacent helices and most on the surface of the protein. In comparison, GPCRs have more complex structures in the transmembrane region with crooked and tilted helices and the binding sites often span cross multiple helices and can reach deeper into the helix clefts. The binding sites detected with the minimal lower cut-off, i.e. 0.45 nm in the GPCRs and 0.425 nm in the GABAAR, were very fragmented, consisted of three to four residues, as indicated by their small surface areas. This is because that this small cut-off can only detect the synergistic binding activities of residues that showed the closet contacts with the lipid molecules. Whereas the use of this small lower cut-off may reveal some critical residues that showed strongest correlation of binding lipids, it can miss many important binding information. As illustrated in the comparison of cholesterol binding sites in β2AR using two dual-cutoffs, i.e. 0.45 nm -0.8 nm and 0.475 nm -0.8 nm (SI Fig S5), some of the binding sites that comprised of helices further apart from each other (i.e. the green binding site between TM6 and TM7 in SI Fig S5) cannot be detected by the lower cut-off of 0.45 nm. In addition, the larger lower cut-off of 0.475 nm also gave a fuller list of residues for each binding site which provides a better description of the binding sites.
In both protein systems, the number of binding sites started to decrease at the lower cut-off that is 0.5 nm larger than the minimal lower cut-off for binding site calculation, i.e. 0.5 nm in GPCRs and 0.475 nm in the GABAAR. This suggests that too large lower cut-offs can merge binding sites and lower the resolution of reported results. Therefore, a general rule for determining the value for the lower cut-off is to take the value just before the number of binding sites starts to decrease, i.e. 0.475 nm for the GPCRs and 0.45 nm for the GABAAR.
The upper cut-off mostly affected the interaction durations, as shown in the increase of averaged interaction durations with the increase of upper cut-off. Due to the dynamical behaviour of protein conformations, the interaction network can fluctuate with different interaction durations, i.e. different upper cut-offs. Thus, we saw fluctuations of the number of detected binding sites and the surface areas when the lower cut-off was fixed but the upper cut-off increased from 0.6 nm to 0.85 nm. Based on our observations, the upper cutoffs ranging from 0.7 nm to 0.85 nm gave quite consistent results for binding site calculation, whereas too small or too large an upper cutoff can generate deviations (SI Fig S6).
The effective cut-off values can vary for different lipid species. For PIP2 interactions on the same set of simulations systems, the lipid contacts started to appear with the lower cut-off of 0.425 nm and the lipid binding sites at 0.45 nm (SI Fig S7). This agrees with the distribution of minimum distances which started at a larger distance than that of cholesterol. The number of binding sites and the binding site surface areas also showed larger fluctuations with different upper cut-off, largely due to the high dynamics of lipid tails. For PIP2 calculations, we suggest that the lower cut-off in the range of 0.5 nm to 0.6 nm normally worked well. We also checked the cardiolipin interactions with the E coli protein formate dehydrogenase-N (Fdn-N) (SI Fig S8). The dependence on cut-offs again different from the other two lipid species.

Choice of cut-offs and its impact on the representative bound poses.
The representative bound poses, i.e. bound pose with the largest samplings from simulations, can always be picked up regardless of the choice of cut-off values if these values can effectively detect binding sites. This is because PyLipID uses a density-based scoring function to value bound poses, so that frequently sampled poses always rank high regardless of the binding site definition. The cutoff values, however, affects how many representative bound poses being generated via affecting the number of binding sites being detected. For example, in the case of using 0.45 nm -0.8 nm cut-offs, PyLipID detected two cholesterol binding sites around ICL2 of A2aR, one is between TM3 and TM5 and the other between TM3 and TM4, but the former has longer interaction residence time and more samplings. The representative bound poses for these two binding sites were generated by PyLipID and shown in SI Fig S9. When we used a larger lower cut-off, i.e. 0.55 nm -0.8 nm, these two binding sites were merged into one and the representative bound pose was the one between TM3 and TM5. The possible bound poses in a binding site can be checked by clustering the bound poses, and the clustered results can reveal the less frequently sampled bound poses. In this case, when we checked the all the clustered bound poses, the one between TM3 and TM4 was revealed. Thus, too big a lower cut-off may hide some of the interesting poses, whereas small lower cut-offs may generate unwanted noises and also lose some of the geometry features of binding sites.

System setup
For the 10 GPCR simulations, the clean and processed receptor coordinates were downloaded from MemProtMD database 4 (http://memprotmd.bioch.ox.ac.uk). The protein coordinates were converted into MARTINI coarse-grained models using martinize.py 5 , and then embedded in an asymmetrical membrane using insane.py 6 with a lipid composition shown in SI Table 1. Systems were solubilized with Martini waters and ions to a neutral charge. Systems were minimized using the steepest descent method, then equilibrated for 100 ns. The equilibration step used a semi-isotropic Berendsen barostat 7 at 1 bar, and a velocity-rescaling thermostat 8 at 310 K. Production simulations were then run using the Parrinello-Rahman barostat 9 at 1 bar using 20 fs. These simulations were run using Gromacs 2020 10 .
For the PC2 simulations, the structure was obtained from the PDB (PDB id 6T9N) and missing loops modelled using MODELLER 9.20 11 . The structure was converted to coarse-grained resolution using the MARTINI 2.2 forcefield and martinize.py. The ElNeDyn elastic network with a cut-off of 0.9 nm and a force constant of 1000 kJ mol -1 nm -2 was applied. The protein was embedded into the bilayer and solvated with MARTINI water using insane.py. NaCl was added to an approximate concentration of 0.15 M. The system was minimized via the steepest descent method and equilibrated in 2x 100 ns steps. 3x 100 μs CG simulations were performed using a 20 fs timestep. The temperature was maintained at 310K using the V-rescale thermostat and the pressure maintained at 1 bar using the Parrinello-Rahman barostat. Simulations were run using Gromacs 2019.
For simulation details of the formaet dehydrogenase-N system, see 2 , and details of the McpB system, see 3 .

Data Analysis
To assist structural analysis, the bound poses from coarse-grained simulations were converted to atomistic structures using CG2AT2 12 . Images were made with VMD 13 and PyMol 14 . Plots made with Matplotlib 15 .